Thursday, July 31, 2025

Remote Procedure Calls under MOS/ADA

I have cast some aspersions against the existing format of remote procedure calls on this blog, and it would be remiss of me not to admit that I have never actually used remote procedure call mechanisms myself.  I know that RPCs do not meet my requirements, because my requirements are insanely high and weirdly specific, but I can admit I don't really know a whole lot about the RPC mechanisms in existing operating systems.

Instead of trying to talk about existing RPC mechanisms, let's talk about calling and handling remote procedures under Project MAD.

MAD Procedures

I've already said that Agentic Distributed Applications expose a list of procedures available to be called remotely, both procedures intended only for internal use and those more publicly exported, and I've said that applications are supposed to deploy an Agent onto remote nodes in order to make use of resources.  I've heavily implied that resources are, therefore, not intended to be consumed except by Agents, and I'm happy to mostly leave that there, with an asterisk saying that some very basic resources, such as sensors, might be accessible remotely.

If that sounds unrelated, it isn't quite.  The directory of available API calls that the ADA server builds must have a built-in distinction between calls that can be made remotely and calls that are consumable only by local Agents.  After all, it makes no sense to deploy Agents to consume your own application's internal API, which is already provided by your own Agents; the internal API is just one of possibly many examples of a remote API.  When you need to decide where something should fall, there are a number of questions to ask, some more obvious than others.  For example, a purely remote API call only makes sense if either:

  • The call is stateless; it does not store data or set internal state that will affect future calls; generally, if multiple calls are made by different applications, there is no way for them to interfere with one another except in reasonable, predictable ways.
  • Or, the procedure manages its internal state in such a way that no combination of remote calls can have unintended consequences, for example, internally managing its memory such that different callers have different memory dedicated to them.

There are interesting side topics here, such as the need for a distributed configuration mechanism, but that's not what I'd like to focus on today.

The typical example I have in my mind when thinking about remote calls, and why Agents are necessary, is the GUI.  If you want to create a window to display information to the user, you are asking the graphics subsystem to set aside a bunch of memory for your application to manage.  Ultimately, this memory needs to be co-managed by the Application and the graphics subsystem; it can't be purely under the Application's control, because the back-end needs to read that data to actually display it, and may need to take control over the memory if something happens to the Application.  But, it can't be purely under the graphics subsystem's control, because the Application needs to freely make a lot of fast modifications to the memory, possibly not using the subsystem's helper functions at all to do so.

Calls like this should never be made without an Agent, because the subsystem needs something to take responsibility for that reserved memory.  If you aren't counting on Agents, then you need some other mechanism to notify the graphics subsystem when an Application quits or crashes, or the hardware that the Application was running on gets disconnected.  Even if you had very fast mechanisms for accessing the video memory remotely, and a dedicated network bus that lets you pipe in uncompressed video data straight into the buffers without causing network congestion, the need to manage the memory itself, and assign responsibility for it, remains.

Contrast that, however, with something like an atomic sensor read, setting a hardware light to on or off, or even creating a simple popup notification in the GUI.  Even when these actions have side effects, they aren't the kind of side effects that explicitly need management.  You could draw a distinction, say, if you wanted to be the only Application in control of that hardware light, and to lock out all other users, or if you wanted a modal dialog box in your GUI that the user must interact with, and which the application needs feedback from; those would require management and therefore an Agent.  But if you simply wish to flip a light on or off, or set its color value or intensity, or if you merely wish to put a bit of text in front of the user, that does not.  It may require authorization; you may need confirmation that the application is allowed to do these things.  But you don't necessarily need an ongoing presence.
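
To make that distinction a little more concrete, here is a rough sketch in Python of how an exported-call directory might flag which calls need an Agent and which can be made as bare remote calls.  None of these names come from any real ADA specification; they are placeholders of my own invention.

    # Hypothetical sketch only; the field names and entries below are illustrative.
    from dataclasses import dataclass

    @dataclass
    class ExportedCall:
        name: str
        stateless: bool       # safe for multiple callers to hit without interference
        requires_agent: bool  # someone must take responsibility for managed state

    EXPORTS = [
        ExportedCall("light.set_color",    stateless=True,  requires_agent=False),
        ExportedCall("gui.notify",         stateless=True,  requires_agent=False),
        ExportedCall("gui.create_window",  stateless=False, requires_agent=True),
        ExportedCall("light.lock_control", stateless=False, requires_agent=True),
    ]

    def can_call_bare(call_name: str) -> bool:
        """True if the named call may be made remotely without deploying an Agent."""
        entry = next(c for c in EXPORTS if c.name == call_name)
        return not entry.requires_agent

Authorization would still apply on top of something like this; the flag only says whether an ongoing presence is needed.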

But even when talking about actual remote procedure calls, my expectations of a remote procedure call mechanism are higher than average.  Some of the reasons why are better explained later; for instance, it is my hope that the MOS system directory is also a directory of types, and that type data is used in detailing APIs, and can be used for verification of the incoming parameters.  That one-sentence summary hides a number of other thoughts and details, but it's better not to get into them right this moment.

Decentralized Procedure Calls

A more important thing to say about MOS/ADA remote procedure calls is that the distributed and decentralized nature of the system requires some explicit handling.  Let's say that you have an application that uses a machine learning chip to monitor an external camera, and when the ML chip detects that a bird is in front of the camera, it takes a picture and puts it on your computer monitor.  For our purposes, assume that all of these are on separate hardware, and therefore remote to each other.  There are numerous data streams involved in this process; most notably, from the camera to the ML chip and from the camera to the display.  However, the trigger that puts the picture on-screen doesn't come from the camera, it comes from the ML chip.

There are a couple of valid ways for the ML chip to display an image from the camera on the monitor: the ML chip could store the image frame and pass it to the display if it meets the requirements, or the ML chip could ask the display Agent to go and fetch an image frame, or the ML chip could ask the camera Agent to send a frame to the display.  But I will argue that the “correct” form of this operation sends two requests: one to the camera, asking it to send a frame of data to the display, and one to the display, asking it to be ready to display the frame of data it is about to receive.  This solution keeps the ML chip from needing to store frames, but it also minimizes the overall wait, by having the two operations begin in parallel.  The display process and the image send process both start at roughly the same time, and by the time the image data makes it to the display, the display is ready to handle it.
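
As a very loose sketch, with invented helper names standing in for whatever the real mechanism becomes, the ML chip's Agent would fire both requests off at roughly the same time:

    # Toy sketch: the two requests are issued in parallel; send_request() is a
    # placeholder for a real remote call, not an actual ADA function.
    import threading

    def send_request(target: str, procedure: str, **params):
        print(f"request to {target}: {procedure} {params}")

    def on_bird_detected():
        # Tell the display to get ready for a frame coming from the camera...
        prep = threading.Thread(target=send_request,
                                args=("display", "prepare_to_display"),
                                kwargs={"expect_frame_from": "camera"})
        # ...and tell the camera to capture a frame and send it to the display.
        grab = threading.Thread(target=send_request,
                                args=("camera", "capture"),
                                kwargs={"send_result_to": "display"})
        prep.start(); grab.start()
        prep.join(); grab.join()

    on_bird_detected()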

But perhaps the most interesting reason to argue for this interpretation of inter-Agent communication, is that you can imagine writing the request in a single line of highly readable code:

  • video.display( camera.capture() )

This line makes intuitive sense to a programmer; you are sending the camera data to the video display.  But what is perhaps most important is what this single line of code says about how remote procedure calls under the MOS/ADA specification should work.

Specifically, in all remote procedure calls, you eventually need to send raw data as parameters to a remote function.  It is our common experience in programming that you are only able to access data that is directly under your control, but any distributed application is going to need to refer to data that exists somewhere else, when making a declarative statement of intent.  In this case, from the ML hardware node, we are calling a function on one remote hardware node, passing as parameter data from another hardware node.  All of this data is under the application's purview; there is no problem with the scope of the data involved, no reason why this request should be impossible.  But with existing libraries and programming languages, it may be a very awkward three-way handshake to try to coordinate.

Rather than saying “This is the ideal form of this data stream,” I am saying that if we work hard to enable this kind of decentralized data command as a basic programming language feature, it will empower programmers to more easily perform operations in parallel.  And programming in parallel is somewhat of a problem; computers have run multiple programs at once for, what, 35, 40 years?  More?  And yet many programmers still think, fundamentally, in single-threaded terms.  The biggest exception I can find to this rule is actually webpage programming, because the very nature of web programming depends on remote data requests.

Rather than resembling web programming, and not coincidentally, the compound statement I wrote above reads more like a shell script, where two applications are both started and the output of one is piped into the other.  You can easily see how the syntax of this compound remote procedure call could be extended to run dozens of operations in parallel - by simply passing more parameters to the destination function, each sourced from a different Agent.  If the chain of logic got more complicated - for example, if the camera image were passed through a filter on another module, perhaps to remove all image data except the bird (with another ML algorithm, perhaps on a different ML hardware node, or a different Agent on the same hardware node) - you could see how the statement would remain a perfectly valid, single-line statement:

  • video.display( ml_filter.isolate_bird( camera.capture() ) )

And if this seems a bit redundant for a single image frame, consider that if we were setting up a video stream instead, this same chain of logic becomes a workflow in which each hardware node does its own job and nothing more.  Ideally, once this workflow is set up, it can operate at the fastest possible speed, making full use of the parallel architecture to get things done without overwhelming any single piece of hardware.

Looking Under the Hood

In the meantime, we are left with a question that people may find uncomfortable: supposing that this chain of logic is acceptable, how exactly do we phrase this remote procedure call in low-level code?  Even just using the earlier example, without the filter, it can seem complicated.  The video.display function needs to know it will be waiting for a data block parameter, one not included in the RPC call that starts the function.  And the camera needs to know that the return value of the incoming RPC request will not be sent back to the Agent that made the request, but sent to another Agent specifically to be used as part of a function call that it did not initiate.

The answer I have currently is that the MOS/ADA resource directory, which I said is used to expose API calls, can also expose chunks of memory for reading and writing--and specific to this use of that mechanism, there needs to be a syntax for reserving memory for temporary variables.  There's a lot to unpack there, I admit; I was under the impression that I had already suggested in my first post that the ADA indexed memory, but I look back and don't see it (I did refer to ‘application resources’, but I wasn't explicit), so it's worth taking a moment to justify that.

Recall that the ADA's exposed API is handled primarily by the ADA server, and that all requests come with a sending application, agent, and user; as such, when I suggest that raw memory can be read or written, we are not talking about leaving memory access unprotected.  There are many processes in a distributed system that will be both data-heavy and latency-sensitive, such as parsing video, and as such, it's generally best to have some mechanism to simply transfer memory, because if there isn't a built-in mechanism, applications programmers will simply write wrappers to do the same thing, leading to code that does nothing beyond circumventing the limitations of the system.  And while those explicit wrappers are good in some circumstances, especially when they add verification and validation or similar checks, that's not a good justification for having no direct memory access mechanism.

Equally, having the ADA server provide direct access to application memory is good for debuggers, administrators, and power users.  Debugging a distributed application is a nasty business; you don't have all of your memory in the same scope all at once, meaning that if the system doesn't have some mechanism to expose any part of the application's memory on demand, you will be left with only the uncomfortable dance familiar to all programmers, where you insert code randomly to check, log, or output values.  While programmers will inevitably do that anyway, it behooves us to have a proper solution to the problem.

Likewise, a power user may take a relatively standard application and want to get specific data out of it.  An easy example that comes to mind is video games; while there are many data values the game developer would not want to give you access to, it would not be hard to have, for instance, your player health (and maximum health) exposed for reading (but obviously not writing), and a power user could create a third-party application to display your health bar on a separate display, or as a video overlay that can be moved around the screen.  As trivial as that may sound, players may prefer their data in a different format (bars, dials, or raw numbers) than the game designer intended.  Likewise, access to application internals can be good for accessibility, turning what would normally be video into sound, or sound into text, or text into braille, without necessarily receiving the developer's explicit permission or counting on them providing software hooks.

Coming back to our remote procedure calls, however, it makes a certain sense to be able to pass ADA URI paths as parameters to functions, or as the destination for return values.  When we are talking about setting up a data workflow between two remote targets, it makes sense to have a syntax specific to creating temporary names, unambiguously describing a location that will await a single very specific bit of incoming data, possibly only from a specific source.  In our example case, the target display sets aside memory with that ID and waits for all parameters to be received; the camera source sends its procedure return value to that ID on the target Agent.  And once the data is received, the display function goes about its merry business.
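
Here is a minimal, single-process sketch of that workflow as I currently picture it; the slot identifier, function names, and calling conventions are all placeholders of my own, not a defined syntax:

    # Toy simulation of the three parties: requester (ML chip), sender (camera),
    # and receiver (display).  Everything here is illustrative, nothing more.
    pending_slots = {}

    def display_reserve(slot_id):
        """Receiver: set aside a named opening that will await one incoming value."""
        pending_slots[slot_id] = None
        print(f"display: waiting on {slot_id}")

    def display_receive(slot_id, data):
        """Receiver: the sender's return value arrives; the display call can now run."""
        if slot_id not in pending_slots:
            raise KeyError(f"no such named opening: {slot_id}")
        del pending_slots[slot_id]
        print(f"display: showing {data!r}")

    def camera_capture(send_result_to):
        """Sender: produce the frame and route the return value onward, not back."""
        frame = "<frame bytes>"
        display_receive(send_result_to, frame)  # stands in for a network hop

    # Requester: issue both requests, naming a temporary slot both sides agree on.
    slot = "tmp/frame-001"
    display_reserve(slot)
    camera_capture(send_result_to=slot)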

Of course, in reality, things are more complicated than that.

If you create a mechanism for setting aside memory, that can be exploited maliciously for a denial of service attack.  Presuming for now that the sender of a request can at least be verified, this denial of service can be mitigated by detecting unreasonable requests and having the requester, not the data sender, be blacklisted.  (Unless it can be guaranteed, it should not be assumed that the two are part of the same application, but I would assume that when a malicious actor is detected, the entire application gets blacklisted, for obvious reasons.  For some multi-user systems, it may be the entire user that gets blacklisted, and an Administrator will be notified.)  But what if the request gets through the sender's entire workflow before the memory gets set aside at the receiver, due to network congestion or some other slowdown factor?  Does the sender's data packet get lost in transit?  Does the sender get treated like a malicious actor if it requests to write into a data block that doesn't exist?

The answer to that, at least for now, is that these requests are handled by the ADA server on all three sides, meaning we have the opportunity and obligation to handle these exceptions as a matter of policy.  For example, the data sender may not attempt to send data out unless the receiver acknowledges that the named data opening exists, and the data may not even be generated by the sender until the request is acknowledged at the receiver (because again, generating data with a malicious request can create a denial of service exploit).  And part of the point of having Agents involved on all three sides is that if there is an error due to network congestion or something similar, that error should percolate up through the chain of logic, distributed across the system, until it is handled by all relevant Agents.
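
A sketch of just the acknowledge-before-generate rule might look like this, again with invented names standing in for whatever the ADA servers actually do:

    # Illustrative only: the sender's side refuses to generate or ship data until
    # the receiver confirms the named opening exists.
    class ToyReceiver:
        def __init__(self):
            self.slots = set()
        def reserve(self, slot_id):
            self.slots.add(slot_id)
        def acknowledge(self, slot_id) -> bool:
            return slot_id in self.slots
        def write(self, slot_id, data):
            print(f"receiver: {data!r} arrived in {slot_id}")

    def send_with_ack(receiver, slot_id, produce_data):
        if not receiver.acknowledge(slot_id):
            # This failure belongs to the requester, not the sender; see below.
            raise RuntimeError(f"receiver never confirmed {slot_id}")
        receiver.write(slot_id, produce_data())  # expensive work happens after the ack

    r = ToyReceiver()
    r.reserve("tmp/frame-002")
    send_with_ack(r, "tmp/frame-002", lambda: "<frame bytes>")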

Suppose, for example, that the requester's request packet to the receiver disappears, and therefore, when the sender attempts to send data to the receiver, there is no such named temporary variable to write to.  This will cause an error at the sender, but the sender is only an intermediary; the error may be logged there, but the error condition needs to be reported to the requester, who actually set up the chain of events.  And because the sender's operation fails, that will generate its own error, and the error stack is sent back to the requester.  It is there that the error gets its full context: the receiver was not ready for the sender's data packet, despite the requester having definitively sent out both requests.  If this happens repeatedly, there may be a larger issue to investigate, one that may require the intervention of a programmer or administrator.

This Isn't Everything

There are other interesting topics in this kind of distributed, remote API environment.  For example, callbacks and event delegates are a common mechanism in event-driven software models such as the GUI, and while these callbacks should be handled entirely by Agents (again, because it involves managing relationships, just as with memory as described above), it is worth having a discussion about event subscription in a distributed system.  And I do mean a discussion; I'm not sure I could put together another long-winded rant on the topic, but I'm sure there are pieces to the question that are more complicated than I know.  We are, after all, talking about users and applications and Agents, with possibly all of these being different between the event source and the subscribers.  There are matters of security, efficiency, and best practices that I may not be aware of.

Likewise, as I teased before, there is a question of distributed configuration, and although I'll go into that more in another post, I will say that distributed configuration is meant to be handled with an explicit mechanism in the MOS/ADA schema.  That configuration needs to account for:

  • local hardware module policy,
  • system administrator policy,
  • user policy,
  • application default configuration,
  • stored configuration changes in the user files,
  • temporary values specific to the user login session and/or parent application session (aka, the “environment” as understood in existing operating systems),
  • and values specific to the user application session (which can be understood as application-local variables, except that they represent and are best described as configuration).

Having an explicit mechanism to gather and resolve contradictions in this wide field of configuration sources only makes sense, and the general goal is to be able to simply query the configuration and receive an answer, no matter where it comes from.
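
As a sketch of what "just query the configuration and get an answer" might feel like - with a precedence order I made up purely for illustration, since resolving genuine contradictions is the hard part - imagine the layers stacked from most specific to most general:

    # Hypothetical layering, most specific first; the real MOS/ADA mechanism would
    # need explicit rules for contradictions, not just first-match-wins.
    CONFIG_LAYERS = [
        ("application-session", {"theme": "high-contrast"}),
        ("login-session",       {"locale": "en_US"}),
        ("user-files",          {"theme": "dark", "font_size": 14}),
        ("application-default", {"theme": "light", "font_size": 12, "locale": "C"}),
        ("administrator",       {}),
        ("hardware-module",     {"max_resolution": "1920x1080"}),
    ]

    def query(key):
        """Return the winning value and which layer it came from."""
        for layer, values in CONFIG_LAYERS:
            if key in values:
                return values[key], layer
        raise KeyError(key)

    print(query("theme"))      # ('high-contrast', 'application-session')
    print(query("font_size"))  # (14, 'user-files')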

There are also probably other questions I wouldn't have any immediate knowledge of or answer for.  Perhaps complications that arise where hardware APIs interface with the MOS/ADA APIs.  Perhaps complications where code libraries and embedded applications interface with the parent application, or where those embedded functions and parent application functions share (or arguably compete for) resources on remote nodes.  There are doubtless complications when it comes to truly confirming that applications are operating under the auspices of a user, or that an Agent is truly what it claims to be, running the code it is claiming to run.

The general Modular OS Agentic Distributed Applications model has a lot of nuances and complexity that I have no right to try to decide.  I can only, and have only, sketched out the broad strokes of this system.  I think I have answers, but only testing and implementation will determine how right, or wrong, I am.  All I can really say for sure, is that if you are trying to build an entire operating system on top of remote procedure calls, you have a lot of work to do in order to ensure that the system is powerful, stable, and easy for programmers to understand and manage.  And… forgive me if I may sully the name of remote procedure calls as they currently exist, but I really think it's going to take a lot more.

Most likely, the full list of what it does take won't begin to take shape until people who know these systems a lot better than me have their say.

Sunday, July 27, 2025

What's in a Name?

I suppose I can expect that most people who find their way to this blog post will expect me to give a rationalization for the project name, "MAD," because it seems to be exactly the kind of unsexy name that any reasonable person wouldn't attach to their life's work.  Or, if I may sum up, I'd have to be mad to call my project MAD, especially if it isn't mad, because consumers and therefore industry partners would get mad about being associated with a mad project, even in principle.  As mad as those people would have to be to take things at face value and trust that the name of a project means exactly what it seems to, that certainly is exactly the sort of mad world we live in.

Anyway, no, that's not what this blog post is about.  Don't worry, it's a relatively short one.

In writing my previous blog post introducing the ADA model and Project MAD in general, I said the following two things:

To this day, I still use the same name for [the hardware subproject]: the Distributed Computing Architecture, which is the D in Project MAD.

To this day, I still use the same name for [the operating systems subproject]: The Modular Operating System, which is the M in Project MAD.

It has been on my mind since then, that these two names deserve to be swapped.

In truth, "Modular" is exactly the right descriptor for the kind of distributed hardware that differentiates the hardware design of the DCA from normal computers;  it's all about taking a small payload, attaching it directly to a network bus, and then simply slotting that small, independent module into its place in a larger system.  So pervasive is my belief that these hardware chunks deserve the term that I am constantly trying not to refer to hardware nodes in a distributed system, in general, as modules, even in cases where I cannot assume I am talking about DCA-specific hardware.

According to Wikipedia:

Modularity is the degree to which a system's components may be separated and recombined, often with the benefit of flexibility and variety in use. The concept of modularity is used primarily to reduce complexity by breaking a system into varying degrees of interdependence and independence across and "hide the complexity of each part behind an abstraction and interface".

In truth, this definition of modularity fits better with the hardware design than it does with the operating system.  While yes, the Modular OS concept enables modular computing, it's less correct to say that the operating system, per se, is modular.  It's not wrong, exactly; we're still talking about reducing the complexity of the operating system by breaking it into pieces.  Ideally, the operating system of each hardware component will be aggressively minimal, as it will only encompass the processor and software ecosystem, the backbone, and a specific payload.  And... some parts of that are arguably less in the domain of the operating system, and more about bundled software.

But the implication of a modular operating system is that the operating system is built on top of modules, not that the modular operating system is used to create modules, which themselves will be part of a larger gestalt system.  Technically, it's both; the effective operating system of a given hardware component will be expanded by all other modules attached to the gestalt system, and therefore, each component is arguably made of all the other operating system nodes.  But that's really not the best possible word for the phenomenon.

Indeed, "Distributed operating system" is a more correct label.  It speaks of an operating system whose whole is divided into parts, and each of those parts is supported by a different piece of hardware.  Being distributed comes with the kind of benefits the word tends to imply, such as not having a single point of failure; the operating system will continue to function even if a piece of it fails, because the effective core of the operating system is held up by all of the operating system "fragments" together.  As long as a single chunk of the operating system remains, you could say that there is an operating system on a whole distributed system, even if every other component has crashed, been destroyed, or been removed.  And like the Ship of Theseus, if you have the system running while you slowly replace every single hardware component (assuming that it is hotpluggable, which I feel comfortable presuming), at no time will the system as a whole cease to be.

In contrast, "Distributed computing architecture" is... honestly not the best name for the hardware.  It describes the overall goal, but not the means, and so it is a poor description overall.  What is it about the architecture, specifically, that is distributed?  It is intended to be capable of creating a distributed system, but there is nothing, and will never be anything, in the actual definition of the DCA that prevents a single hardware component from being a completely independent computer, containing all of the hardware necessary to run a full operating system and user application stack.  A full computer that also had a DCA expansion port would be a computer that could be expanded and extended, but where that full computer has not been expanded or extended, there is nothing "distributed" about its architecture.

No, the right word for that hardware is a module, even when the module itself contains a full and independent computer.  A module can contain a full computer, or it can contain a mere extension; what's important is that the module can be separated from a system and recombined.  As I pointed out in the previous post, there is already a specific use case that proves this point: a theoretical cell phone that can be plugged into a DCA computer to make use of its processors, input, and output.  The phone itself remains a separate and independent computer, but it can also be understood as a module which can be integrated into a larger system.

So what's in a name?  Why do I still use the old and arguably inapt terms?

Well, being honest, at the current time the largest reason why is my own pride.  Not pride, as in "I am proud of my naming conventions," but pride as in, this is mine and I don't want to change it.  That's a little disingenuous; a better explanation is that I want to preserve the origin and legacy of the project, simply because I want the project's origin and legacy to somehow redeem me of my failures since then.  The times when I was depressed and struggling, when people could point to all my failures and say that I was unworthy, even at those times I could point to the ongoing legacy of the project and suggest that I was still, technically, working on it.  That my periods of isolation, loneliness, and despair were somehow necessary, rather than simply being failures.

There are other reasons, though.  One good one is that if you simply swap the first word in each name, "Distributed Operating System" becomes problematic: its acronym is DOS, which (in case anyone in the audience is not aware) is already an acronym in use, specifically in the operating systems space.  That use of DOS is arguably a legacy, as it refers mostly to pre-Windows command interpreters, but at the current time, an unfortunate number of people in the computing sphere would hear the term "DOS" and think of something else, and specifically something vastly inferior to the scope and scale of the project I'm talking about.  DOS was, in its own era, ubiquitous and inescapable, something that academics, business professionals, and home users all came to know very well indeed.  The acronym MOS, while not 100% unique, does not suffer from this problem.  Likewise but less damning, MCA has a few more existing uses as an acronym than DCA, though not any that strike me as being that kind of immediate problem.

Plus, I feel that MOS/DCA rolls off the tongue better than DOS/MCA.  Not that I'm using that acronym pair alone, anymore.  As I am moving towards "Project MAD" as a shorthand for the overarching project, it definitely does not matter whether you switch around the M and the D.  But it's hard to just dismiss twenty years of the acronyms rolling around in my head.  It's not just something that was in my head once or twice, but something I've come back to, something I've invested time and sanity into.

Most likely, if this ever becomes a real project, I'll cheerfully swap the acronyms and never look back.  Where things currently are, with me just trying to explain to the universe that I have something here worth talking about, I'd rather not shake up all of my existing notes for no reason.  I'm not even sure that I'm going to shift most of my private notes from MOS/DCA to MAD, even though I agree that acronym better suits the project.  In my head, the original two projects are still just... what the project as a whole is.

Perhaps that will change as Project MAD develops more.

Saturday, July 26, 2025

Project MAD, in Theory

 This is a complete do-over of a chain of blog posts from my main blog (which has an unfortunate url, for our purposes here) in which I had pretty much failed at introducing a complicated subject.  That subject: The distributed computing future, and my MOS/DCA project, which I am now calling Project MAD.

To be fair, the process of writing those blog posts helped me pull those difficult thoughts out of the deep recesses of my brain.  I originally started with project MOS/DCA in 2005, and for all of that time, it has lived only in my head, with only some utterly farcical notes in an old 3-ring binder to prove that I ever had such an idea in the first place.  Getting it to the point where I could speak intelligently on the topic with a professional seemed at times an impossible task, because I had to run laps around my own brain just to corral the thoughts into order.  For this reason, my previous blog posts were… scattered.  There was indeed a narrative through-line to the ramblings, but you'd have to be a bit off in the head to really look at them for the first time and say, “I see where this is going.”

I'll try to be a bit less off-putting in this presentation.

The Distributed Computing Present and Future

I look around me today, and I see numerous problems in computing that share a common ancestry: we want to take several discrete computing systems and weave them seamlessly into a single solution to our problems.

This is, if I may summarize, not a problem computers were originally designed to handle.  In fact, many early computers weren't designed to communicate with other computers in any way, shape, or form.  When they did communicate, it was even slower and more restricted than the rest of the computer, which is to say: it was very slow and capable of very little.  And yet, the model for computers in general actually hasn't evolved that much.  Each computer consists of a central processor with a main memory bank, at least one “cold storage” for keeping data (especially to boot the computer, but also for user files), and then a number of special-purpose peripherals, each adding to the system's capabilities in a fairly narrow way.

Computers have gotten much better, but they haven't gotten much different.  It is true that the proliferation of CPU cores, and the common presence of graphics and AI accelerator chips, does change them in some respects.  But perhaps the most distinguishing thing about computers nowadays, and especially phones, is that they assume the existence of, and access to, the Internet.  The internet and computer networking in general are technical marvels, but in many ways it's clear that we don't know how to use them most effectively--just as we aren't necessarily making good use of additional CPU cores in general.  Oh, we are using all those things, along with many other interconnection technologies - cell networks for one, but also bus technologies new and old such as ZigBee, I²C, serial, Thread, and Bluetooth, among others.  But we frequently fumble with integrating them seamlessly into new technical products.  The “Internet of Things” is a good example here: these attempts at just sticking computers into things and depending on the Internet to glue the pieces together have done no small amount of damage to people and to tech companies' reputations.

Why?  The answer is that we have not got a really good solution to the general case.  Which specific general case?  Well, specifically, I am talking about having multiple, separate computers whose capabilities should all be combined to form a gestalt system.  Gestalt, if you don't know, is a German word that we have come to understand as “Creating a whole that is more (valuable, useful, etc) than the sum of its parts,” and that describes the goal and promise of distributed computing very well.  As long as we have several full computers connected together and at our disposal, the combination of them should be greater than the same number of isolated computers.

In specific applications, this is already true, but those specific applications are not the general case.  You can design networks of computer systems, and the applications that run on them, so that you are making very good use of the specific system you designed.  And yet, if I wish to (say) transcribe audio from the microphone on one computer, using under-utilized capacity on my home server instead of tying up resources on the computer with the microphone, I would have to either find someone who's already made exactly that solution, or… I'd have to make it myself, most likely hacking together a disorderly and unruly collection of parts that were never meant for this specific task.  Or, third option: I give up, my server remains idle, and my PC remains overworked.

When I suggest that we have a "Distributed Computing Future", what I am suggesting is that computers can become part of a gestalt system automatically, and you can improve the system as a whole by adding pieces to it.  So if I were doing, say, video editing work, and just threw another computer onto the network that was good at video editing (the computing part of it, at least), I wouldn't need to work on that computer specifically when editing video, nor would I need to design some hacky way to connect my normal PC to that video editing powerhouse.  The system as a whole could become better at editing video, because the capability exists within the gestalt.

In a way, this has always been what Project MAD was about, even in 2005 when I understood almost nothing.  Back then, when we had only single-core processors, I was thinking about just adding more processors to computers.  I was thinking about scaling a gestalt system by adding more systems to it.  And although back then I was too young and uneducated to understand what I had, I was chasing the tail of a general solution to problems that, still today, only have good solutions in specific cases.

What is Project MAD?

At the beginning, I split the problem I was wrestling with into two pieces.  On the one hand, I wanted to divide computing hardware into chunks, and have very fast and very low-latency communications between them.  I had no idea how to do this, being a CS undergraduate, but I still filled pages of my notebooks with thoughts about accelerator chips, side channels, and direct memory access busses.  To this day, I still use the same name for this part of the project: the Distributed Computing Architecture, which is the D in Project MAD.

On the other hand, I wanted to create an operating system that would turn a collection of hardware into a gestalt system.  I had no idea how to do this, being a CS undergraduate, but I still filled pages of my notebooks with thoughts about kernels, remote procedure calls, event chains, and distributed filesystems.  To this day, I still use the same name for this part of the project: The Modular Operating System, which is the M in Project MAD.

Those two pieces together were interesting, but they were not a solution; they were a problem that needed a solution.  I recognized very early that there were so many tricky puzzles involved in the design, some of which I had no ready solution to, that it was not even worth bringing up to most people.  So, I stopped worrying about this side-project of mine and focused on finishing my degree, and then I had a major depressive episode that took me time to get out of.  But even in that depression, I would sometimes pull out my notes on Project MOS/DCA and consider the problems I'd given myself, searching for suitably elegant solutions.  Actually implementing the project was never a realistic goal; chasing after the design was simply one of those little eccentric things I did.

The last piece of the puzzle only came to me recently, and in fact, it came because I decided to try to write the articles on my previous blog explaining Project MOS/DCA.  I had some existing notes about how the operating system would need to spread program chunks around the gestalt system.  I wanted to use shell-style syntax to direct the flow of data between computers within a gestalt system, and as such, I considered the idea of just sending “wrapping scripts” along with a remote procedure call.  Those scripts could do as little as nothing beyond calling the intended remote function (envision, for example, a smart lock latching or unlatching a door), but could also be pressed into service for things like verification and validation of arguments and return values, handling of errors, and so on.  In fact, if my prior blog posts weren't now private, you could look back and see me talking about exactly that, in extraordinarily confused fashion.  But because my main blog domain is my (now 20+ year old) net alias, SuperSayu.com, it feels inappropriate to use that when trying to actually reach any kind of audience.  Plus, again... those blog posts were terrible.

The point is, in the process of trying to explain everything I had in my head, my thinking evolved, or perhaps broke through.  “Wrapping scripts around remote procedure calls” is a terrible overall design, if only because of the security nightmare involved.  But the general concept of executing code on remote nodes instead of simply making remote calls… that was the right idea, if you looked at it just right.  As I examined it from the many angles the project presented, I came to realize that there was a simple problem at the core of many of our troubles with distributed systems:

Programs aren't built like that.  Just like our computers still look a lot like computers from half a century ago, so do programs: they are designed to be in one place, tucked entirely within a single segment of a single memory bank, and that's it.  That's what “a program” is, fundamentally.  If your needs are distributed across multiple computers, set up multiple programs and connect them.  Or, I suppose, you could try just using remote procedure calls, but the concept of RPCs as we have it today is… not sufficient to build complicated systems on top of.  It is more humble, more primitive.

My test case for judging these kinds of solutions has always been a distributed PC, in which the storage disk, the input stack, the network, the main processor, and the graphics processor are each on different machines, with a network I call the “backbone” (distinct from internet/LAN access) connecting them.  There are certain assumptions I simply allow--that you can communicate between hardware freely, even passing large chunks of memory back and forth in reasonable time.  At the same time, I look cautiously askance at solutions that assume too much.  For instance, I try to minimize transfers over the network as a matter of policy; overuse and congestion may be inevitable, but (for example) negotiating between the CPU and GPU using only remote procedure calls seems like a sloppy, slow, and data-heavy endeavor, and thus an inadequate solution.

But the test case isn't simply whether or not it can work, it's whether or not it's feasible to program for. Imagine you have a simple application that takes input from a USB device, does some advanced processing in the CPU, and outputs visuals through the GPU.  One of the things that turned me off from thinking too much about this project at the beginning, after my initial rush of obsession faded, was the sheer scope of what you have to learn and understand in order to write even such a trivial program.  You needed to know what peripherals were attached to the system, and where, and how to interact with them.  You needed to handle, in some fashion, the disconnect between your application and those peripherals.  Even an ideally designed backbone, which simply passes messages and data back and forth, opens up a whole world that applications programmers must in some sense understand if they want to work on top of it; they must, in some sense, write middleware that takes the network into account.  Compared to a standard application on a monolithic computer, which resides in a single chunk of memory and uses standard library functions, there's no contest as to which is more friendly to write, debug, and administer for.

The goal is getting to that point with distributed computing: making it so that applications programmers don't need to write the middleware, where necessary features are standard features, where (ideally) programming language constructs can make the distinction between remote and local procedure calls paper-thin, whenever possible.  But all of that requires a concept around which to organize, a concept that took me a great many years to stumble upon.  It is in that context that I'd like to introduce the A in Project MAD.

The Agentic Distributed Application Model

The Agentic Distributed Application model is an answer to this idea of making use of remote resources by distributing a single application across multiple computers.  Of course this is all theory, because I never advanced beyond a Bachelor's in Computer Science, but the general goal is to create an application deployment pattern that empowers us to face the challenges of distributed computing, by splitting any given application up into pieces, each responding to remote requests across the gestalt system's backbone network.

That probably sounds unimpressive, and the general concept of putting a small application on a network device is far from new.  The ADA model specifically is about controlling the assumptions of a given application fragment, known as an Agent.  Each Agent can only assume certain things about what local resources it has access to; for example, it is not a given that Agents have any local disk space or internet access, unless that is the specific resource that Agent was designed to interact with.  While the application as a whole may need these resources, each Agent is meant to rely on other Agents whenever it requires a resource not guaranteed to be local to itself.  Thus, each Agent exists to provide an API to the rest of the application, so that each other Agent's needs can be fulfilled.

At the center of the model is the ADA server, which is a necessary concession to security, helps make the user model work, and helps smooth away the network boundary, allowing applications to treat remote procedure calls much more like they treat local function calls.  Each computer in the gestalt system must have an ADA server in order to deploy Agents or have Agents deployed to it.  The servers communicate with each other, catalogue the resources of the gestalt system, translate requests for named resources into properly directed remote procedure calls, and confirm and handle incoming ADA requests.  Each Agent will post a private (Application-internal) API schema containing available data and available functions, and may post a similar public (system-facing) API schema, or multiple lists tailored to specific applications, users, or user-application-sessions.  It is these API schemas that are used to verify and direct requests over the backbone.
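
For illustration, a posted API schema might be little more than a structured listing; the shape below is my own invention, not part of any specification:

    # Invented shape for the private and public API schemas an Agent might post.
    AGENT_SCHEMA = {
        "agent": "birdwatcher/camera",
        "internal": {   # Application-internal API, visible only to sibling Agents
            "capture":  {"params": {},                "returns": "ImageFrame"},
            "set_gain": {"params": {"gain": "float"}, "returns": "None"},
        },
        "public": {     # system-facing API; could be tailored per user or application
            "capture":  {"params": {},                "returns": "ImageFrame"},
        },
    }

    def post_schema(directory: dict, schema: dict):
        """Stand-in for handing the schema to the local ADA server's directory."""
        directory[schema["agent"]] = schema

    system_directory = {}
    post_schema(system_directory, AGENT_SCHEMA)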

By using a variant of the Universal Resource Identifier (URI) schema, any application Agent can make an unambiguous request to the ADA server corresponding to any exposed API, most notably its own Application-internal API.  As long as you control your assumptions at compile time, to a programmer, utilizing this internal API can look no different from having your program divided into namespaces, and compilers can automatically generate API requests wherever the programmer requests resources across a namespace boundary.  This means that most internal API requests, nominally passed as strings, can be verified at compile time.
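
Here is a sketch of how a compiler or runtime might lower a namespace-style call into a directed request; the "ada://" scheme and the proxy object are inventions of mine for illustration only:

    # Illustrative only: the programmer writes what looks like a namespaced call,
    # and the proxy turns it into a URI-addressed request handed to the ADA server.
    class AgentProxy:
        def __init__(self, base_uri, transport):
            self._base_uri = base_uri
            self._transport = transport
        def __getattr__(self, procedure):
            def remote_call(**params):
                return self._transport(f"{self._base_uri}/{procedure}", params)
            return remote_call

    def fake_transport(uri, params):
        print(f"ADA request: {uri} {params}")
        return "<result>"

    camera = AgentProxy("ada://birdwatcher/camera", fake_transport)
    camera.capture()          # reads like a local call; becomes a directed request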

Ultimately, an application programmer or power user with the application's permission can get a full listing of all the internal resources of the application, organized hierarchically as though it were a filesystem, and all exposed resources therein are valid targets for ADA server requests.  Because of this filesystem-like structure, we can make the system more flexible as well.  For example, third-party libraries and embedded applications, or even external applications, can be assigned places in the hierarchy similar to Agents, even if the library or application itself is divided into multiple Agents, so long as it provides a consumable API.  Likewise, there can be API translation or API forwarding entries that allow you to target an internal API point and trust that the translator will target the correct external API.  This is useful, for example, to add version compatibility shims in case a library or external application breaks API compatibility, or in case a security flaw must be worked around for libraries that cannot be updated.
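
An API translation entry, in the same spirit, could be as small as a function that rewrites a request on its way out; every name below is hypothetical:

    # Sketch of a forwarding/translation entry used as a compatibility shim.
    def forward_get_frame(params: dict):
        """The old internal target 'camera/get_frame' now points at a renamed
        external 'camera/capture', translating a renamed parameter along the way."""
        translated = dict(params)
        translated["gain"] = translated.pop("sensitivity", 1.0)
        return ("ada://vendor.cameralib/camera/capture", translated)

    FORWARDING_TABLE = {
        "ada://birdwatcher/camera/get_frame": forward_get_frame,
    }

    uri, params = FORWARDING_TABLE["ada://birdwatcher/camera/get_frame"]({"sensitivity": 2.0})
    print(uri, params)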

Because, as I'll describe below, all ADA applications must operate under the auspices of a user, and all users are required to track their authorized applications, and all users currently active in the system can be enumerated, you can create a global listing of everything currently exported in the system, organized as a hierarchical filesystem.  Assuming that, for instance, the same mechanism can be used to list application files and user files, this index becomes a true filesystem of the distributed system, with unambiguous unique identifiers for every resource and capability of the system.

That statement might seem slightly more significant in a moment.

The User Model of a Decentralized Operating System

Ultimately the ADA Model is intended to be part of a larger system, and specifically, the Modular Operating System.  One of the significant problems I ran into when originally working out what the MOS project even was, is that a distributed operating system is necessarily a decentralized operating system.  Each physical machine requires its own local administrator, if only to oversee the boot process, application management, and error handling, but there are no guarantees that a gestalt system has exactly one computer that can be understood as the core of the system.  There may be multiple CPUs, multiple disks, multiple users… or there may be none of the above.  What happens if you configure two systems to each have a centralized CPU and boot disk, and then combine the two together as peers?  Who should be considered the master of the system, and how do you handle the discrepancy?

My fundamental understanding of such a system, today, is that assuming each module is its own local administrator, there is no specific need for a central authority.  There can be one, and specifically in secure environments such as enterprises and governments, it makes sense to boot the entire gestalt system into a mode where modifying the system layout from its known state is not allowed, and only a specific list of users and/or applications is allowed.  Similar to a verified boot schema, the system will not start if any unauthorized change is detected.  However, even that mechanism must fundamentally be a decentralized one; each module must verify that it is who and what it claims to be, and that its neighbors are who and what they claim to be, with even a single module's objection being sufficient to raise a system-wide exception.  Unauthorized changes made while the system is operating would be likewise suspect.  And it must still be possible to boot the gestalt system itself into a less-secure state in order to make changes, which means that the boot state of the system must be verifiable after the fact, and changes to the system must be detectable and verifiable.

But that leaves us with the question of what the core of a decentralized system should actually be, and the answer is the user, or specifically a user-login-session.  Anything that is not a hardware-specific service (such as base operating system processes and peripheral drivers) should be tied to a user session, and in the absence of any user session (or a secured boot mode), all hardware in the system should boot into and remain in a “factory state”, that is, entirely unconfigured, even if it has disks for long-term storage and is therefore capable of holding state.  Except as necessary, all system configuration should be done under the auspices of a user, even if the user session is a service (ie background, non-interactive) session.

If that sounds odd to the programmers and system administrators in the audience, it's probably because there are a lot of system service processes involved in running a computer, many of which need (or would make good use of) configuration.  Most of those will end up being hardware-local processes, not ADA applications, but there are overlaps, for example in the GUI, and that may be a tricky contradiction to resolve in your mind.  A hardware module may provide generic GUI services, but things like themes and session management are user-specific.  Absent a user login, or even absent any user accounts capable of logging in, the generic GUI services would still run as hardware-local processes, but there would not be any kind of stateful session.

It is important to understand that any given user session is likely to be centralized on a specific hardware module containing the user's files, and in fact user-specific hardware modules are one of the more interesting parts of the overall distributed computer design, most specifically because they may take complete control of their own local resources and refuse to offer them to the system in general.  In other words, only those applications approved by the user will be able to run Agents on the same hardware machine as the user's files, and no one--not even a network administrator or other system authority--can override the user's sovereignty over their own hardware.  Because the only way for the system to interact with the user module is by its exported API, no other process or Application in the gestalt system can change this configuration and open up the user module, meaning that user modules are first-class citizens, immune to any other administrator's access.  Absent a valid login session, the only thing the system can do is request a user login; once a login session has begun, the session itself determines what is made available and to whom.

Because the user module is separate, the login process can simply refuse to start on systems it deems unacceptable, such as systems that have an unusual system topology or suspect modules, systems that are not currently booted into a verified-layout mode, or systems that are in a verified layout but not one the user is familiar with.  If it seems likely that the system is insecure, it would be better not to start anything, to minimize the possibility of compromised or insecure software being exploited.  With physical access to the hardware, the files will still potentially be accessible; but if a hardware module elsewhere in the system has been coopted, or some new, suspect module has been installed, detecting and refusing to function in the presence of such a module is still the most secure and arguably the most appropriate choice of action.  If the suspect module or modules can be routed around, the user may simply choose to ignore modules that are unacceptable, making them inaccessible to user applications, but this is somewhat less secure than the maximum-paranoia solution.

The idea of user-specific hardware also heavily implies that the user session is portable.  If the entire user application stack is stored on a piece of dedicated hardware, one which merely makes use of available hardware to run the applications, then any hardware will do, so long as the user is satisfied that the computer is secure.  Assuming the login process detects the system hardware and adjusts accordingly, this can also allow the same login stack to run different environments under different circumstances - for example, allowing a smartphone to run a handheld environment on its built-in display, a separate desktop mode for larger systems, and a third display mode for e.g. televisions, projectors, or in-car displays, all from the same general collection of files.  There have been attempts to accomplish exactly these things in specific cases, but the use cases here all arise from the problem being solved in the general case... at least, in theory.

Also notable here is the general acceptance of the idea of a multi-user system.  Indeed, theoretically, you could have multiple monitors and multiple input devices on a single machine, with multiple user sessions each making use of a specific input-output combination, and the fact that the system is decentralized means that the hardware and software overlap between the user sessions may be minimal.  This is a far less secure state than a single-user machine, but it makes a certain sense when multiple users each only need minimal resources for, eg, browsing the internet.  And because most of the computers making up these machines are involatile, unable to be affected by users, this might be a cost-effective kiosk strategy.

The Decentralized System's Base

The acceptance of multiple users in a decentralized system, however, raises a specific question: how do you handle administrative tasks like assigning displays and input devices?  As simple as this sounds, it might not be difficult for a hacker to create virtual screens and input devices on a machine and assign them, invisibly, to a backdoor login, or add software input devices to a logged-in user, perhaps replacing your existing keyboard and mouse.  This is another example of a problem that can have easy, specific solutions in the short term, but benefits from a solution to the general case of the problem.  Specifically, some system functions benefit from being objective, that is, not owned by any logged-in user, and ideally being tamper-evident, allowing users to withdraw from the system if faults are detected.

One example is an unbiased accounting of the system hardware topology, that is, how all the hardware modules in a system are linked together.  While earlier I summarized a Distributed Computing system as operating on top of a “backbone network”, networks can be either wide, with several hubs that each connect to many units, or they may be tall, with long chains of single links.  An unbiased accounting of the system topology would allow a power user to see, for example, that a fake keyboard and mouse exist on an Internet-access module, or that a module which has only one physical connection reports itself as having child nodes containing input devices, network adapters, or secure disks.

It also helps to know whether there are currently, or have been since your last login, other users on a system that you believe to be secure.  All of these would help you feel assured that it is safe to type your password on a keyboard, or otherwise confirm your login to a supposedly secure computer.

Long ago, I decided to distinguish these objective services from the “root user” concept familiar to UNIX systems, not least because a decentralized system cannot have a single “root”.  As such, I have generally referred to these as “base” services; they are a ground floor that exists beneath logged-in users, but they do not spring from a single root source.  User-facing hardware services such as the GUI and user input devices should have some access to the system base, so that (for example) an interactive user can hit a dedicated button to request base access and then reassign input devices to specific displays, with the display outputs on the system being overridden rather than showing user output.  This is distinct from the user's own session accepting such changes; a user session might log out if it believes an unauthorized input device has been added to the session, or the addition may require confirmation using a known-good input device, etc.

Some of the services I would generally consider “base” are foundational to the ADA and to Project MAD in general, such as the ability to deploy application agents.  A full and objective accounting of all distributed applications, all application agents, and all users, is unquestionably a base service; not only should users not be asked to provide these services, users should not be able to tamper with their functioning in any way.

The list of running ADA applications, for instance, may be useful to detect when an application, user login, or ADA server has crashed; in such a case, there may not be an explicit signal sent out alerting other ADA servers to kill the Agents of any applications that are no longer running, perhaps because of a hardware failure or other serious fault.  In such cases, all parts of a gestalt system should refuse to continue hosting Agents with invalid parent applications or users as soon as they are detected as such, just as each ADA server will refuse to allow new agents to be deployed if no ADA server in the network will vouch for the existence of the application and/or user that it claims to represent.  For the safety and security of the gestalt system, the relationship between users, applications, and agents must be preserved, which requires objective, base reporting.
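
As a sketch of that base-level rule, an ADA server's housekeeping pass might look something like this, with every name invented for the example:

    # Illustrative only: drop hosted Agents whose parent application or user can no
    # longer be vouched for by any ADA server in the gestalt system.
    def reap_orphans(hosted_agents, vouched_apps, vouched_users):
        kept = []
        for agent in hosted_agents:
            if agent["app"] in vouched_apps and agent["user"] in vouched_users:
                kept.append(agent)
            else:
                print(f"dropping {agent['id']}: no one vouches for its parents")
        return kept

    agents = [
        {"id": "a1", "app": "birdwatcher", "user": "alice"},
        {"id": "a2", "app": "ghost-app",   "user": "alice"},  # parent app has crashed
    ]
    print(reap_orphans(agents, vouched_apps={"birdwatcher"}, vouched_users={"alice"}))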

Notably, there is no “base” user, and in particular no global base user login.  Base services are run as local hardware services; there should be no single login that allows a user, even an administrator, to tamper with all hardware attached to the system at once.  In a production environment, no hardware attached to the system should have a local hardware login exposed at all.  Ideally, all hardware updates and changes should be independently managed by local processes, and at maximum security, firmware updates may only be performed out-of-band, with direct hardware access.  Likewise, there should be objective measures used to sign computer firmware, or some other tamper-evident seal used to detect when the apparent and true state of a piece of hardware are not in alignment, which may indicate malfeasance.

All Part of the Plan

This is by no means an exhaustive accounting of the details of Project MAD.  There is a lot to say, for instance, about hardware drivers and APIs, and how they factor into the operating system.  I likewise have many desires and theories, however unreasonable or ungrounded, about the ideal hardware of a MAD computer.

It is, however, a vastly better summary of the overall design of a MAD computer than anything I have managed to put together before.  My previous blog posts were, quite frankly, a disaster, and while I've summarized the project to others before, it's generally been difficult to condense the entire idea into a digestible, if perhaps not quite bite-sized, chunk.

Perhaps most importantly, it is my first public introduction of the ADA Model, which I think is key to making good use of distributed systems in general, and I believe I have done an acceptable job of summarizing that.  Because the ADA, or more accurately, the problem that the ADA is a solution to… that is crucially important.  Finding a general solution to the problem of running applications on a distributed system, is going to be a cornerstone of how we use computers in the future.

If we try to forge a future without something like the ADA model, we will by definition end up with systems that attempt to control a distributed system without unifying that distributed system.  From the perspective of an ADA model application, the entire distributed system is one system.  That isn't a promise, it is the premise.  The ADA model is not some finished product, nor even a fleshed-out specification, and in fact I am not capable of finishing it on my own.  It is a problem and a unique point of view, and I believe that it is only by embracing the problem that we will be able to make best use of a world that is rapidly filling with independent, and frequently underutilized, computers.

Likewise with the rest of Project MAD.  I do not have any kind of fleshed-out set of standards that would explain what a Distributed Computing Architecture computer, or a Modular Operating System, would look like.  I have problems, and I believe those problems have solutions.  In the end, it would be acceptable, and possibly wise, for other people's solutions to those problems to supplant my own.

At present, I feel like I'm the only one who's actually thought this entire set of problems through--and to be fair, it's such a specific, large, and complicated set of problems that I doubt most people, even in the higher echelons of Computer Science academia, would choose to do so.  And it's possible that it takes a person like me, who is neuro-divergent and socially isolated, to actually keep chasing that set of problems long enough to find a solution.

At the same time, as with many technologies and discoveries, I look back on the achievement and feel like it's all too simple, as though surely someone else has, or will, come up with these answers to these problems, or perhaps better solutions.  But realistically--who would?  For-profit corporations would prefer much simpler solutions rather than reinventing all of computing.  Academics would want a solution that could be summarized in a single publication, or at least, a solution that can be broken down into discrete units of publishable results.  Existing standards bodies and open-source cooperatives would most likely disdain anyone who told them to abandon their existing work, so such a solution wouldn't bubble up from within them.