This is a complete do-over of a chain of blog posts from my main blog (which has an unfortunate URL, for our purposes here) in which I had pretty much failed at introducing a complicated subject. That subject: the distributed computing future, and my MOS/DCA project, which I am now calling Project MAD.
To be fair, the process of writing those blog posts helped me pull those difficult thoughts out of the deep recesses of my brain. I originally started Project MOS/DCA in 2005, and for all of that time, it has lived only in my head, with only some utterly farcical notes in an old 3-ring binder to prove that I ever had such an idea in the first place. Getting it to the point where I could speak intelligently on the topic with a professional seemed at times an impossible task, because I had to run laps around my own brain just to corral the thoughts into order. For this reason, my previous blog posts were… scattered. There was indeed a narrative through-line to the ramblings, but you'd have to be a bit off in the head to look at them for the first time and say, “I see where this is going.”
I'll try to be a bit less off-putting in this presentation.
The Distributed Computing Present and Future
I look around me today, and I see numerous problems in computing that share a common ancestry: we want to take several discrete computing systems and weave them seamlessly into a single solution to our problems.
This is, if I may summarize, not a problem computers were originally designed to handle. In fact, many early computers weren't designed to communicate with other computers in any way, shape, or form. When they did communicate, it was even slower and more restricted than the rest of the computer, which is to say: it was very slow and capable of very little. And yet, the model for computers in general actually hasn't evolved that much. Each computer consists of a central processor with a main memory bank, at least one “cold storage” for keeping data (especially to boot the computer, but also for user files), and then a number of special-purpose peripherals, each adding to the system's capabilities in a fairly narrow way.
Computers have gotten much better, but they haven't gotten much different. It is true that the proliferation of CPU cores, and the common presence of graphics and AI accelerator chips, does change them in some respects. But perhaps the most distinguishing thing about computers nowadays, and especially phones, is that they assume the existence of, and access to, the Internet. The Internet, and computer networking in general, are technical marvels, but in many ways it's clear that we don't know how to use them most effectively--just as we aren't necessarily making good use of additional CPU cores. Oh, we are using all of those things, along with many other interconnection technologies: cell networks for one, but also bus technologies new and old, such as ZigBee, I²C, serial, Thread, and Bluetooth. But we frequently fumble with integrating them seamlessly into new technical products. The “Internet of Things” is a good example here: these attempts at just sticking computers into things and depending on the Internet to glue the pieces together have done no small amount of damage to people and to tech companies' reputations.
Why? The answer is that we do not have a really good solution to the general case. Which general case? Specifically, I am talking about having multiple, separate computers whose capabilities should all be combined to form a gestalt system. Gestalt, if you don't know, is a German word that we have come to understand as “creating a whole that is more (valuable, useful, etc.) than the sum of its parts,” and that describes the goal and promise of distributed computing very well. As long as we have several full computers connected together and at our disposal, the combination of them should be greater than the same number of isolated computers.
In specific applications, this is already true, but those specific applications are not the general case. You can design networks of computer systems, and the applications that run on them, so that you make very good use of the specific system you designed. And yet, if I wish to (say) transcribe audio from the microphone on one computer, using under-utilized capacity on my home server instead of tying up resources on the computer with the microphone, I would have to either find someone who's already made exactly that solution, or… I'd have to make it myself, most likely hacking together a disorderly and unruly collection of parts that were never meant for this specific task. Or, third option: I give up, my server remains idle, and my PC remains overworked.
When I suggest that we have a "Distributed Computing Future", what I am suggesting is that computers can become part of a gestalt system automatically, and you can improve the system as a whole by adding pieces to it. So if I were doing, say, video editing work, and just threw another computer onto the network that was good at video editing (the computing part of it, at least), I wouldn't need to work on that computer specifically when editing video, nor would I need to design some hacky way to connect my normal PC to that video editing powerhouse. The system as a whole could become better at editing video, because the capability exists within the gestalt.
In a way, this has always been what Project MAD was about, even in 2005 when I understood almost nothing. Back then, when we had only single-core processors, I was thinking about just adding more processors to computers. I was thinking about scaling a gestalt system by adding more systems to it. And although back then I was too young and uneducated to understand what I had, I was chasing the tail of a general solution to problems that, still today, only have good solutions in specific cases.
What is Project MAD?
At the beginning, I split the problem I was wrestling with into two pieces. On the one hand, I wanted to divide computing hardware into chunks, and have very fast and very low-latency communications between them. I had no idea how to do this, being a CS undergraduate, but I still filled pages of my notebooks with thoughts about accelerator chips, side channels, and direct memory access busses. To this day, I still use the same name for this part of the project: the Distributed Computing Architecture, which is the D in Project MAD.
On the other hand, I wanted to create an operating system that would turn a collection of hardware into a gestalt system. I had no idea how to do this, being a CS undergraduate, but I still filled pages of my notebooks with thoughts about kernels, remote procedure calls, event chains, and distributed filesystems. To this day, I still use the same name for this part of the project: The Modular Operating System, which is the M in Project MAD.
Those two pieces together were interesting, but they were not a solution; they were a problem that needed a solution. I recognized very early that there were so many tricky puzzles involved in the design, some of which I had no ready solution to, that it was not even worth bringing up to most people. So, I stopped worrying about this side-project of mine and focused on finishing my degree, and then I had a major depressive episode that took me time to get out of. But even in that depression, I would sometimes pull out my notes on Project MOS/DCA and consider the problems I'd given myself, searching for suitably elegant solutions. Actually implementing the project was never a realistic goal; chasing after the design was simply one of those little eccentric things I did.
The last piece of the puzzle only came to me recently, and in fact, it came because I decided to try writing out those previous blog posts explaining Project MOS/DCA. I had some existing notes about how the operating system would need to spread program chunks around the gestalt system. I wanted to use shell-style syntax to direct the flow of data between computers within a gestalt system, and as such, I considered the idea of just sending “wrapping scripts” along with a remote procedure call. Those scripts could do as little as nothing, outside of calling the intended remote function (envision, for example, a smart lock latching or unlatching a door), but could also be pressed into service for things like verification and validation of arguments and return values, handling of errors, and so on. In fact, if my prior blog posts weren't now private, you could look back and see me talking about exactly that, in extraordinarily confused fashion. But because my main blog domain is my (now 20+ year old) net alias, SuperSayu.com, it feels inappropriate to use it when trying to actually reach any kind of audience. Plus, again... those blog posts were terrible.
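(To make those wrapping scripts a little more concrete, here is a minimal sketch of the idea as I imagined it; the lock, the RPC stub, and every name here are hypothetical stand-ins, and as you'll see in a moment, this is not the design I settled on.)

```python
# A hedged sketch of the old "wrapping script" idea: a script shipped
# alongside a remote procedure call that validates arguments, makes the
# call, and checks the result. Every name here is a hypothetical stand-in.

def remote_call(function_name, *args):
    """Stand-in for a real RPC mechanism; pretend this crosses the network."""
    fake_remote_functions = {"smart_lock.set_latched": lambda latched: latched}
    return fake_remote_functions[function_name](*args)

def wrapped_latch_request(latched):
    # Validate arguments before the request ever leaves this node.
    if not isinstance(latched, bool):
        raise TypeError("latched must be True or False")
    result = remote_call("smart_lock.set_latched", latched)
    # Verify the return value and handle errors on behalf of the caller.
    if result != latched:
        raise RuntimeError("smart lock did not reach the requested state")
    return result

print(wrapped_latch_request(True))   # -> True, on our pretend lock
```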
The point is, in the process of trying to explain everything I had in my head, my thinking evolved, or perhaps broke through. “Wrapping scripts around remote procedure calls” is a terrible overall design, if only because of the security nightmare involved. But the general concept of executing code on remote nodes instead of simply making remote calls… that was the right idea, if you looked at it just right. As I examined it from the many angles the project presented, I came to realize that there was a simple problem at the core of many of our troubles with distributed systems:
Programs aren't built like that. Just like our computers still look a lot like computers from half a century ago, so do programs: they are designed to be in one place, tucked entirely within a single segment of a single memory bank, and that's it. That's what “a program” is, fundamentally. If your needs are distributed across multiple computers, set up multiple programs and connect them. Or, I suppose, you could try just using remote procedure calls, but the concept of RPCs as we have it today is… not sufficient to build complicated systems on top of. It is more humble, more primitive.
My test case for judging these kinds of solutions has always been a distributed PC, in which the storage disk, the input stack, the network, the main processor, and the graphics processor are each on different machines, with a network I call the “backbone” (distinct from internet/LAN access) connecting them. There are certain assumptions I simply allow--that you can communicate between hardware freely, even passing large chunks of memory back and forth in reasonable time. At the same time, I look cautiously askance at solutions that assume too much. For instance, I try to minimize transfers over the network as a matter of policy; overuse and congestion may be inevitable, but (for example) negotiating between the CPU and GPU using only remote procedure calls seems like a sloppy, slow, and data-heavy endeavor, and thus an inadequate solution.
But the test case isn't simply whether or not it can work, it's whether or not it's feasible to program for. Imagine you have a simple application that takes input from a USB device, does some advanced processing in the CPU, and outputs visuals through the GPU. One of the things that turned me off from thinking too much about this project at the beginning, after my initial rush of obsession faded, was the sheer scope of what you have to learn and understand in order to write even such a trivial program. You needed to know what peripherals were attached to the system, and where, and how to interact with them. You needed to handle, in some fashion, the disconnect between your application and those peripherals. Even an ideally designed backbone, which simply passes messages and data back and forth, opens up a whole world that applications programmers must in some sense understand if they want to work on top of it; they must, in some sense, write middleware that takes the network into account. Compared to a standard application on a monolithic computer, which resides in a single chunk of memory and uses standard library functions, there's no contest as to which is friendlier to write, debug, and administer.
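To give a flavor of that burden, here is a hedged sketch of the sort of plumbing a programmer might have to hand-write on top of such a backbone; every module name and message format below is invented for illustration.

```python
# Hand-written middleware on a hypothetical message-passing backbone: the
# programmer must discover peripherals, route messages, and plan for
# disconnects themselves. All names and formats here are invented.

import json

def send(node, message):
    """Stand-in for a backbone send; a real one could fail mid-transfer."""
    print(f"-> {node}: {json.dumps(message)}")

def run_trivial_app(topology):
    # Step 1: discover where the peripherals actually live.
    usb_node = next(n for n, caps in topology.items() if "usb-input" in caps)
    gpu_node = next(n for n, caps in topology.items() if "gpu" in caps)
    # Step 2: subscribe, process, and forward by hand, knowing any node
    # could vanish between any two steps.
    try:
        send(usb_node, {"op": "subscribe", "device": "gamepad"})
        send(gpu_node, {"op": "draw", "frame": "placeholder"})
    except ConnectionError:
        pass  # ...and now write your own recovery story.

run_trivial_app({"node-a": ["usb-input"], "node-b": ["gpu"], "node-c": ["disk"]})
```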
The goal is getting to that point with distributed computing: making it so that applications programmers don't need to write the middleware, where necessary features are standard features, and where (ideally) programming language constructs can make the distinction between remote and local procedure calls paper-thin. But all of that requires a concept around which to organize, a concept that took me a great many years to stumble upon. It is in that context that I'd like to introduce the A in Project MAD.
The Agentic Distributed Application Model
The Agentic Distributed Application model is an answer to this idea of making use of remote resources by distributing a single application across multiple computers. Of course this is all theory, because I never advanced beyond a Bachelor's in Computer Science, but the general goal is to create an application deployment pattern that empowers us to face the challenges of distributed computing by splitting any given application up into pieces, each responding to remote requests across the gestalt system's backbone network.
That probably sounds unimpressive, and the general concept of putting a small application on a network device is far from new. The ADA model specifically is about controlling the assumptions of a given application fragment, known as an Agent. Each Agent can only assume certain things about what local resources it has access to; for example, it is not a given that an Agent has any local disk space or internet access, unless that is the specific resource the Agent was designed to interact with. While the application as a whole may need these resources, each Agent is meant to rely on other Agents whenever it requires a resource not guaranteed to be local to itself. Thus, each Agent exists to provide an API to the rest of the application, so that the other Agents' needs can be fulfilled.
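As a sketch of what that might look like in practice, imagine each Agent shipping with a manifest of its assumptions; the field names here are my own invention, not a specification.

```python
# A hedged sketch of an Agent declaring, up front, which resources it
# assumes are local and which it will request from other Agents. The
# manifest format is entirely hypothetical.

from dataclasses import dataclass, field

@dataclass
class AgentManifest:
    name: str
    # Resources this Agent assumes are physically local to its host.
    assumes_local: list = field(default_factory=list)
    # Needs it will satisfy by calling other Agents over the backbone.
    relies_on_agents: list = field(default_factory=list)

transcriber = AgentManifest(
    name="audio-transcriber",
    assumes_local=["cpu"],                    # compute only: no disk, no net
    relies_on_agents=["microphone-capture",   # audio comes from another Agent
                      "file-store"],          # results go to another Agent
)
```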
At the center of the model is the ADA server, which is a necessary concession to security, helps make the user model work, and helps smooth away the network boundary, allowing applications to treat remote procedure calls much more like they treat local function calls. Each computer in the gestalt system must have an ADA server in order to deploy Agents or have Agents deployed to it. The servers communicate with each other, catalogue the resources of the gestalt system, translate requests for named resources into properly directed remote procedure calls, and confirm and handle incoming ADA requests. Each Agent will post a private (Application-internal) API schema containing available data and available functions, and may post a similar public (system-facing) API schema, or multiple lists tailored to specific applications, users, or user-application-sessions. It is these API schemas that are used to verify and direct requests over the backbone.
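As a hedged guess at what such a posted schema might contain (the layout and field names below are illustrative, not a specification):

```python
# A guess at an Agent's posted API schema: available data and functions,
# with a trimmed public copy. Nothing here is a defined format.

INTERNAL_API_SCHEMA = {
    "agent": "file-store",
    "data": {"capacity_bytes": "int"},          # available data
    "functions": {                              # available functions
        "read":  {"args": ["path"],         "returns": "bytes"},
        "write": {"args": ["path", "data"], "returns": "bool"},
    },
}

# A public, system-facing schema could be a trimmed-down copy, or several
# copies tailored to specific applications, users, or sessions.
PUBLIC_API_SCHEMA = {key: value for key, value in INTERNAL_API_SCHEMA.items()
                     if key != "data"}
```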
By using a variant of the Uniform Resource Identifier (URI) scheme, any application Agent can make an unambiguous request to the ADA server corresponding to any exposed API, most notably its own Application-internal API. As long as you control your assumptions at compile time, utilizing this internal API can look, to a programmer, no different from having your program divided into namespaces, and compilers can automatically generate API requests wherever the programmer requests resources across a namespace boundary. This means that most internal API requests, nominally passed as strings, can be verified at compile time.
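A minimal sketch of how a server might verify such a request; the “ada://” scheme, and the idea that a compiler emits these strings for cross-namespace calls, are both assumptions of mine.

```python
# Verifying a URI-directed request against a posted schema (inlined here).
# The "ada://" scheme and the URI layout are illustrative assumptions.

SCHEMA = {"agent": "file-store",
          "functions": {"read": {"args": ["path"], "returns": "bytes"}}}

def verify_request(uri, schema=SCHEMA):
    """Check that a request like ada://myapp/file-store/read is exposed."""
    scheme, _, rest = uri.partition("://")
    app, agent, function = rest.split("/")
    if scheme != "ada" or agent != schema["agent"]:
        raise LookupError(f"no such agent for {uri}")
    if function not in schema["functions"]:
        raise LookupError(f"{function} is not an exposed API point")
    return schema["functions"][function]

# What a compiler might generate for a source-level call such as
# filestore.read(path) made across a namespace boundary:
print(verify_request("ada://myapp/file-store/read"))
```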
Ultimately, an application programmer or power user with the application's permission can get a full listing of all the internal resources of the application, organized hierarchically as though it were a filesystem, and all exposed resources therein are valid targets for ADA server requests. Because of this filesystem-like structure, we can make the system more flexible as well. For example, third-party libraries and embedded applications, or even external applications, can be assigned places in the hierarchy similar to Agents, even if the library or application itself is divided into multiple Agents, so long as it provides a consumable API. Likewise, there can be API translation or API forwarding entries that allow you to target an internal API point and trust that the translator will target the correct external API. This is useful, for example, for adding version compatibility shims in case a library or external application breaks API compatibility, or in case a security flaw must be worked around in a library that cannot be updated.
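For illustration, such a listing might look something like the following; the application, its Agents, and the shim entry are all hypothetical.

```
/apps/photo-editor/
├── agents/
│   ├── ui             (exposed: redraw, input events)
│   ├── compute        (filter kernels; internal API only)
│   └── storage        (project files; internal API only)
├── lib/
│   └── imagecodec@2   (embedded third-party library)
└── shims/
    └── imagecodec@1   (translation entry forwarding old calls to @2)
```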
Because, as I'll describe below, all ADA applications must operate under the auspices of a user, and all users are required to track their authorized applications, and all users currently active in the system can be enumerated, you can create a global listing of everything currently exported in the system, organized as a hierarchical filesystem. Assuming that, for instance, the same mechanism can be used to list application files and user files, this index becomes a true filesystem of the distributed system, with unambiguous unique identifiers for every resource and capability of the system.
That statement might seem slightly more significant in a moment.
The User Model of a Decentralized Operating System
Ultimately, the ADA Model is intended to be part of a larger system: specifically, the Modular Operating System. One of the significant problems I ran into when originally working out what the MOS project even was is that a distributed operating system is necessarily a decentralized operating system. Each physical machine requires its own local administrator, if only to oversee the boot process, application management, and error handling, but there is no guarantee that a gestalt system has exactly one computer that can be understood as the core of the system. There may be multiple CPUs, multiple disks, multiple users… or there may be none of the above. What happens if you configure two systems to each have a centralized CPU and boot disk, and then combine the two together as peers? Who should be considered the master of the system, and how do you handle the discrepancy?
My fundamental understanding of such a system, today, is that if each module is its own local administrator, there is no specific need for a central authority. There can be one, and in secure environments such as enterprises and governments, it makes sense to boot the entire gestalt system into a mode where the system layout may not be modified from its known state, and only a specific list of users and/or applications is permitted. Similar to a verified boot scheme, the system will not start if any unauthorized change is detected. However, even that mechanism must fundamentally be a decentralized one; each module must verify that it is who and what it claims to be, and that its neighbors are who and what they claim to be, with even a single module raising an objection being sufficient to halt the system. Unauthorized changes made while the system is operating would be likewise suspect. And it must still be possible to boot the gestalt system into a less-secure state in order to make changes, which means that the boot state of the system must be verifiable after the fact, and changes to the system must be detectable and verifiable.
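A minimal sketch of that decentralized check, under loose assumptions: every module compares its observed neighbors to a known layout, and any single objection halts the system.

```python
# Decentralized layout verification: each module checks its own neighbors
# against the known state; no central authority is consulted. The layout
# and module names are invented for illustration.

KNOWN_LAYOUT = {
    "cpu-node":  {"disk-node", "gpu-node"},
    "disk-node": {"cpu-node"},
    "gpu-node":  {"cpu-node"},
}

def module_verifies(module, observed_neighbors):
    """Each module independently compares what it sees to the known state."""
    return observed_neighbors == KNOWN_LAYOUT.get(module)

observed = {
    "cpu-node":  {"disk-node", "gpu-node", "mystery-node"},  # tampering!
    "disk-node": {"cpu-node"},
    "gpu-node":  {"cpu-node"},
}

# The boot proceeds only if every module is satisfied; one objection stops it.
if all(module_verifies(m, n) for m, n in observed.items()):
    print("verified layout: system may boot")
else:
    print("objection raised: refusing to start")   # this branch runs
```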
But that leaves us with the question of what the core of a decentralized system should actually be, and the answer is the user, or more specifically, a user-login-session. Anything that is not a hardware-specific service (such as base operating system processes and peripheral drivers) should be tied to a user session, and in the absence of any user session (or a secured boot mode), all hardware in the system should boot into and remain at a “factory state”, that is, entirely unconfigured, even if it has disks for long-term storage and is therefore capable of holding state. Except as necessary, all system configuration should be done under the auspices of a user, even if the user session is a service (i.e. background, non-interactive) session.
If that sounds odd to the programmers and system administrators in the audience, it's probably because there are a lot of system service processes involved in running a computer, many of which need (or would make good use of) configuration. Most of those will end up being hardware-local processes, not ADA applications, but there are overlaps, for example in the GUI, and that may be a tricky contradiction to resolve in your mind. A hardware module may provide generic GUI services, but things like themes and session management are user-specific. Absent a user login, or even absent any user accounts capable of logging in, the generic GUI services would still run as hardware-local processes, but there would not be any kind of stateful session.
It is important to understand that any given user session is likely to be centralized on a specific hardware module containing the user's files, and in fact user-specific hardware modules are one of the more interesting parts of the overall distributed computer design, specifically because they may take complete control of their own local resources and refuse to offer them to the system in general. In other words, only those applications approved by the user will be able to run Agents on the same hardware module as the user's files, and no one--not even a network administrator or other system authority--can override the user's sovereignty over their own hardware. Because the only way for the system to interact with the user module is through its exported API, no other process or application in the gestalt system can change this configuration and open up the user module, meaning that user modules are first-class citizens, immune to any other administrator's access. Absent a valid login session, the only thing the system can do is request a user login; once a login session has begun, the session itself determines what is made available and to whom.
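A minimal sketch of that sovereignty, with all names hypothetical: the user module consults its own approval list on every deployment request, and there is no administrative bypass because no other API exists.

```python
# A user module that only deploys Agents for user-approved applications.
# The class and its methods are illustrative assumptions, not a design.

class UserModule:
    def __init__(self, owner, approved_apps):
        self._owner = owner
        self._approved = set(approved_apps)   # changeable only by the owner

    def deploy_agent(self, app, requester):
        # The exported API is the only way in; nothing can skip this check.
        if app not in self._approved:
            return f"refused: {self._owner} has not approved {app}"
        return f"agent of {app} deployed for {requester}"

module = UserModule("sayu", approved_apps={"photo-editor"})
print(module.deploy_agent("photo-editor", "login-session"))   # allowed
print(module.deploy_agent("telemetry-suite", "net-admin"))    # refused
```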
Because the user module is separate, the login process can simply refuse to start on systems it deems unacceptable: systems that have an unusual topology or suspect modules, systems that are not currently booted into a verified-layout mode, or systems that are in a verified layout, but not one the user is familiar with. If it seems likely that the system is insecure, it would be better not to start anything, to minimize the possibility of compromised or insecure software being exploited. With physical access to the hardware, the files will still potentially be accessible; but if a hardware module elsewhere in the system has been co-opted, or some new, suspect module has been installed, then detecting it and refusing to function in its presence is still the most secure, and arguably the appropriate, course of action. If the suspect module or modules can be routed around, the user may simply choose to ignore the unacceptable modules, making them inaccessible to user applications, but this is somewhat less secure than the maximum-paranoia solution.
The idea of user-specific hardware also heavily implies that the user session is portable. If the entire user application stack is stored on a piece of dedicated hardware, one which merely makes use of available hardware to run the applications, then any hardware will do, so long as the user is satisfied that the computer is secure. Assuming the login process detects the system hardware and adjusts accordingly, this also allows the same login stack to run different environments under different circumstances: for example, a smartphone could run a handheld environment on its built-in display, a separate desktop mode for larger screens, and a third display mode for e.g. televisions, projectors, or in-car displays, all from the same general collection of files. There have been attempts to accomplish exactly these things in specific cases, but the use cases here all arise from the problem being solved in the general case... at least, in theory.
Also notable here is the general acceptance of the idea of a multi-user system. Indeed, theoretically, you could have multiple monitors and multiple input devices on a single machine, with multiple user sessions each making use of a specific input-output combination, and the fact that the system is decentralized means that the hardware and software overlap between the user sessions may be minimal. This is a far less secure state than a single-user machine, but it makes a certain sense when multiple users each need only minimal resources for, e.g., browsing the internet. And because most of the computers making up such a machine are involatile, unable to be affected by users, this might be a cost-effective kiosk strategy.
The Decentralized System's Base
The acceptance of multiple users in a decentralized system, however, raises a specific question: how do you handle administrative tasks like assigning displays and input devices? As simple as this sounds, it might not be difficult for a hacker to create virtual screens and input devices on a machine and assign them, invisibly, to a backdoor login, or to add software input devices to a logged-in user, perhaps replacing your existing keyboard and mouse. This is another example of a problem that can have easy, specific solutions in the short term, but that benefits from a solution to the general case. Specifically, some system functions benefit from being objective, that is, not owned by any logged-in user, and ideally tamper-evident, allowing users to withdraw from the system if faults are detected.
One example is an unbiased accounting of the system hardware topology, that is, how all the hardware modules in a system are linked together. While earlier I summarized a Distributed Computing system as operating on top of a “backbone network”, networks can be either wide, with several hubs each connecting to many units, or tall, with long chains of single links. An unbiased accounting of the system topology would allow a power user to see, for example, that a fake keyboard and mouse exist on an Internet-access module, or that a module which has only one physical connection reports itself as having child nodes containing input devices, network adapters, or secure disks.
It also helps to know whether there are currently, or have been since your last login, other users on a system that you believe to be secure. All of these would help you feel assured that it is safe to type your password on a keyboard, or otherwise confirm your login to a supposedly secure computer.
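As a sketch of the kind of audit I mean, assuming a hypothetical topology report and one invented rule:

```python
# Auditing an unbiased topology report: flag devices reported in places
# they have no business being. The modules and the rule are illustrative.

TOPOLOGY = {
    "internet-module": {"links": 1, "reports": ["keyboard", "mouse"]},
    "input-module":    {"links": 1, "reports": ["keyboard", "mouse"]},
    "hub-module":      {"links": 4, "reports": []},
}

# One example rule: input devices should not appear on a module whose
# purpose is Internet access.
for module, info in TOPOLOGY.items():
    if "internet" in module and any(
            device in ("keyboard", "mouse") for device in info["reports"]):
        print(f"warning: {module} reports input devices: {info['reports']}")
```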
Long ago, I decided to distinguish these objective services from the “root user” concept familiar to UNIX systems, not least because a decentralized system cannot have a single “root”. As such, I have generally referred to these as “base” services; they are a ground floor that exists beneath logged-in users, but they do not spring from a single root source. User-facing hardware services such as the GUI and user input devices should have some access to the system base, so that (for example) an interactive user can hit a dedicated button to request base access and then reassign input devices to specific displays, with the system's display outputs being overridden rather than displaying user output. This is distinct from the user's own session accepting such changes: a user session might log out if it believes an unauthorized input device has been added to the session, or the addition may require confirmation using a known-good input device.
Some of the services I would generally consider “base” are foundational to the ADA and to Project MAD in general, such as the ability to deploy application agents. A full and objective accounting of all distributed applications, all application agents, and all users, is unquestionably a base service; not only should users not be asked to provide these services, users should not be able to tamper with their functioning in any way.
The list of running ADA applications, for instance, may be useful for detecting when an application, user login, or ADA server has crashed. In such a case, perhaps because of a hardware failure or other serious fault, there may be no explicit signal alerting other ADA servers to kill the Agents of applications that are no longer running. All parts of a gestalt system should therefore refuse to continue hosting Agents with invalid parent applications or users as soon as they are detected as such, just as each ADA server will refuse to allow a new Agent to be deployed if no ADA server in the network will vouch for the existence of the application and/or user it claims to represent. For the safety and security of the gestalt system, the relationship between users, applications, and Agents must be preserved, which requires objective, base reporting.
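A minimal sketch of that rule, with all structures invented for illustration: an ADA server keeps hosting an Agent only while some server in the system will vouch for its parent application and user.

```python
# Pruning orphaned Agents: no explicit kill signal is required, because
# any Agent whose application and user cannot be vouched for is dropped.

def any_server_vouches(servers, app, user):
    """True if at least one ADA server confirms the app and user exist."""
    return any(app in s["apps"] and user in s["users"] for s in servers)

servers = [
    {"apps": {"photo-editor"}, "users": {"sayu"}},
    {"apps": set(),            "users": set()},      # a crashed peer
]

hosted_agents = [
    {"name": "compute",  "app": "photo-editor", "user": "sayu"},
    {"name": "orphaned", "app": "dead-app",     "user": "sayu"},
]

hosted_agents = [agent for agent in hosted_agents
                 if any_server_vouches(servers, agent["app"], agent["user"])]
print([agent["name"] for agent in hosted_agents])    # ['compute']
```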
Notably, there is no “base” user, and in particular no global base user login. Base services are run as local hardware services; there should be no single login that allows a user, even an administrator, to tamper with all hardware attached to the system at once. In a production environment, no hardware attached to the system should have a local hardware login exposed at all. Ideally, all hardware updates and changes should be independently managed by local processes, and at maximum security, firmware updates may only be performed out-of-band, with direct hardware access. Likewise, there should be objective measures used to sign computer firmware, or some other tamper-evident seal used to detect when the apparent and true state of a piece of hardware are not in alignment, which may indicate malfeasance.
All Part of the Plan
This is by no means an exhaustive accounting of the details of Project MAD. There is a lot to say, for instance, about hardware drivers and APIs, and how they factor into the operating system. I likewise have many desires and theories, however unreasonable or ungrounded, about the ideal hardware of a MAD computer.
It is, however, a vastly better summary of the overall design of a MAD computer than anything I have managed to put together before. My previous blog posts were, quite frankly, a disaster, and while I've summarized the project to others before, it's generally been difficult to condense the entire idea into a digestible, if perhaps not quite bite-sized, chunk.
Perhaps most importantly, it is my first public introduction of the ADA Model, which I think is key to making good use of distributed systems in general, and I believe I have done an acceptable job of summarizing it. The ADA, or more accurately, the problem that the ADA is a solution to… that is crucially important. Finding a general solution to the problem of running applications on a distributed system is going to be a cornerstone of how we use computers in the future.
If we try to forge a future without something like the ADA model, we will by definition end up with systems that attempt to control a distributed system without unifying that distributed system. From the perspective of an ADA model application, the entire distributed system is one system. That isn't a promise, it is the premise. The ADA model is not some finished product, nor even a fleshed-out specification, and in fact I am not capable of finishing it on my own. It is a problem and a unique point of view, and I believe that it is only by embracing the problem that we will be able to make best use of a world that is rapidly filling with independent, and frequently underutilized, computers.
Likewise with the rest of Project MAD. I do not have any kind of fleshed-out set of standards that would explain what a Distributed Computing Architecture computer, or a Modular Operating System, would look like. I have problems, and I believe those problems have solutions. In the end, it would be acceptable, and possibly wise, for other people's solutions to those problems to supplant my own.
At present, I feel like I'm the only one who's actually thought this entire set of problems through--and to be fair, it's such a specific, large, and complicated set of problems that I doubt most people, even in the higher echelons of Computer Science academia, would choose to do so. And it's possible that it takes a person like me, who is neuro-divergent and socially isolated, to actually keep chasing that set of problems long enough to find a solution.
At the same time, as with many technologies and discoveries, I look back on the achievement and feel like it's all too simple, as though surely someone else has, or will, come up with these answers to these problems, or perhaps better solutions. But realistically--who would? For-profit corporations would prefer much simpler solutions to reinventing all of computing. Academics would want a solution that could be summarized in a single publication, or at least one that can be broken down into discrete units of publishable results. Existing standards bodies and open-source cooperatives would most likely disdain anyone who told them to abandon their existing work, so such a solution wouldn't bubble up from within them.