It may be surprising, since I am here to talk about distributed computing, but I hate process forking - making copies of a running application, each running the same code but each dedicated to a different task. I'm sure I'm not alone in this, because it's not as though people are running around calling fork() for every little thing; multithreaded and multiprocess apps are not exactly common unless you're doing something that really requires such a technology.
Frankly, it's a headache to try to understand. In theory, it should be simple: there's a copy of your program, but you know which copy is which, so you just do different things with them. If, before forking, you open up a pipe, both copies of the application will have access to the same file descriptors for that pipe, and they can communicate through it. But all of those things being technically possible isn't a plan. When that pattern is laid out in a book, it leaves you wondering: what am I supposed to do with it?
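To make that concrete, here is roughly what the pattern looks like in C. The pipe is opened before the fork, so both copies of the program hold file descriptors for it and can pass data through it:

```c
/* Minimal fork-plus-pipe example: both copies inherit the same pipe
 * file descriptors, so the parent can send a message to the child. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fds[2];                     /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) == -1) return 1;

    pid_t pid = fork();
    if (pid < 0) return 1;
    if (pid == 0) {                 /* child: read whatever the parent sends */
        char buf[64];
        close(fds[1]);
        ssize_t n = read(fds[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("child got: %s\n", buf); }
        return 0;
    }
    close(fds[0]);                  /* parent: send a message, then wait */
    const char *msg = "hello from the parent";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    wait(NULL);
    return 0;
}
```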
I have talked in several places about programmers being obliged to reinvent the wheel, and process forking is a good example; forking wouldn't be such a terrible thing if it were paired with a functional remote procedure call mechanism. When two copies of the same program are connected with a pipe (or two), you can use it to pass data, but that data transfer is mostly pointless without context. Adding context - labeling the data - makes it essentially a very simple, poorly implemented remote procedure call, where the "function" being "called" is simply an index specifying what to do with the data, out of a set of options fixed at compile time. Process forking would still be a bit confusing even if it came with a built-in RPC mechanism, but at least the programmer would not have to wrestle with all the complexity of implementing their own RPC stack from scratch while also trying to build program logic around it.
Although I wasn't thinking of it when I designed the system, the ADA distributed applications model is basically exactly this, and to help you understand it, let's walk through the analogy a little more closely.
As I see it, forking has two fundamental problems. First, the fork is undifferentiated. It is easy to overlook the fact that being able to attach variable names to data is a feature of programming languages in the first place; when I say that forks are undifferentiated, what I mean is that this deliberate feature and mechanism of programming languages no longer functions. In each half of the program, a large number of symbols exist but point to the wrong data, and it is entirely up to the programmer to recognize this and keep track of which symbols are now incorrect.
This is in fact the reason why I have specified that under the ADA, programs should be written and compiled such that each module that is to be deployed separately shall have its own variables confined to a single namespace. Namespaces in this context may be nothing more than collections of variables given a group name, but for our purposes, they draw an explicit distinction between what is, and is not, accessible within a specific context. You could, of course, do the same thing with a general forked application - but that still leaves us with the second fundamental problem: the need to create a new remote procedure call mechanism.
Let's take the "remote" part out of the equation for now, and only talk about a forked application communicating with another copy of itself. A procedure call mechanism in this context is a pipe that you periodically check (or block and wait on), looking for a data stream in a fixed format. That data stream will tell you what function to call, or what variable to write to or read from, and it may also include parameters for the function (or an index into an array, etc.); you then return any out-values through the pipe to the other side with a similar mechanism. Likewise, the calling side of the pipe will send a packet in the expected format down the pipe, and then block and wait for a return value (or merely an acknowledgement that the function is being run, or that it has completed successfully, depending on the context).
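A hand-rolled version of that mechanism might look something like the sketch below. The message format - a fixed-size header carrying an opcode and one argument - is invented for illustration, and a real implementation would have to loop on short reads and handle errors:

```c
/* Sketch of a "procedure call" over a pipe: a fixed set of commands,
 * a fixed-format request, and a blocking wait for the response. */
#include <stdint.h>
#include <unistd.h>

enum { OP_DOUBLE = 1, OP_NEGATE = 2 };          /* the fixed set of "commands" */

struct request  { uint32_t opcode; int32_t arg; };
struct response { int32_t  value; };

/* Callee side: block on the pipe, dispatch on the opcode, reply. */
void serve_one(int in_fd, int out_fd) {
    struct request req;
    if (read(in_fd, &req, sizeof req) != sizeof req) return;
    struct response resp = {0};
    switch (req.opcode) {
    case OP_DOUBLE: resp.value = req.arg * 2; break;
    case OP_NEGATE: resp.value = -req.arg;    break;
    }
    write(out_fd, &resp, sizeof resp);
}

/* Caller side: send a request in the expected format, block for the reply. */
int32_t call_other_half(int out_fd, int in_fd, uint32_t opcode, int32_t arg) {
    struct request req = { opcode, arg };
    struct response resp = {0};
    write(out_fd, &req, sizeof req);
    read(in_fd, &resp, sizeof resp);
    return resp.value;
}
```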
You as the programmer will most likely want to select some highly specific subset of functions or variables that you want to use with this mechanism - when building it yourself, you will most likely only accept some very few "commands" that you listen for and respond to from the other half of your application. Because both halves are the same application, though, you could also use a lookup table that lets you unambiguously access any variable or function present on your half of the application - there are, after all, a fixed number of those functions and variables, and you know for a fact that all of them have unique identifiers, because you use those unique identifiers while programming. And once you have selected a function or variable to access, any parameters sent along with the request can be type-checked, because you know the method signature or variable type.
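C has no reflection, so a hand-built version of that lookup table has to be written out entry by entry - exactly the kind of tedium a compiler could take over, since it already knows every identifier and signature. The names and signature data below are purely illustrative:

```c
/* A hand-written stand-in for a compiler-generated symbol table:
 * each entry maps a unique identifier to its function and enough
 * signature information to reject malformed requests. */
#include <string.h>

typedef int (*unary_fn)(int);

static int double_it(int x) { return x * 2; }
static int negate(int x)    { return -x; }

struct entry {
    const char *name;   /* the unique identifier used while programming */
    unary_fn    fn;     /* where it actually lives in this process */
    int         n_args; /* signature data for checking the request */
};

static const struct entry table[] = {
    { "double_it", double_it, 1 },
    { "negate",    negate,    1 },
};

/* Resolve an identifier, check the parameter count, then call it. */
int invoke(const char *name, int n_args, int arg, int *out) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(table[i].name, name) == 0) {
            if (n_args != table[i].n_args) return -1;  /* signature mismatch */
            *out = table[i].fn(arg);
            return 0;
        }
    }
    return -1;                                         /* unknown identifier */
}
```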
Doing all of that work by hand would be an awful experience, and most if not all of that hand-done work would be wasted, as you are unlikely to call most of the functions or access most of the variables. If, however, it were a function of the programming language, compiler, and linker, it becomes eminently possible - because that is, essentially, the compiler's whole job. Translating an identifier to a function, checking the number and type of the parameters: all of that is necessary to create the program in the first place.
Suppose, then, that you had a compiler that created an indexing function. These kinds of functions exist when you use reflection tools - the indexing function would take a variable or function name and some arbitrary parameters, call the function with those parameters, and then return the variable or the return value when it's done. Let's suppose at first that exactly one indexing function is created for your whole program. You as the programmer might simply pass anything that comes down the fork pipe to the indexing function (unless you just called a function, in which case you would wait for the return value and parse that) - in fact, ideally, there would be a built-in function that just reads commands from the pipe in exactly this way, and a paired function that writes to the pipe in exactly the way the reader expects.
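Pairing that indexing function with the pipe gives you the built-in reader and writer pair described above. In this sketch, pipe_serve_one and pipe_call are hypothetical names, invoke() is the table lookup from the previous sketch, and the wire format (a fixed-size name field plus one argument) is again invented:

```c
/* Hypothetical built-ins: the reader feeds everything that comes down
 * the pipe to the indexing function; the writer formats requests in
 * exactly the way the reader expects. */
#include <string.h>
#include <unistd.h>

struct wire_request { char name[32]; int arg; };

int invoke(const char *name, int n_args, int arg, int *out); /* table sketch above */

/* Reader half: parse one request, index into the program, reply. */
void pipe_serve_one(int in_fd, int out_fd) {
    struct wire_request req;
    int result = 0;
    if (read(in_fd, &req, sizeof req) != sizeof req) return;
    req.name[sizeof req.name - 1] = '\0';
    invoke(req.name, 1, req.arg, &result);
    write(out_fd, &result, sizeof result);
}

/* Writer half: send a request by identifier, block for the result. */
int pipe_call(int out_fd, int in_fd, const char *name, int arg) {
    struct wire_request req = { {0}, arg };
    int result = 0;
    strncpy(req.name, name, sizeof req.name - 1);
    write(out_fd, &req, sizeof req);
    read(in_fd, &result, sizeof result);
    return result;
}
```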
This is still relatively unsafe; there is a lot of potential for race conditions and so on, and there is nothing stopping you from requesting data from the fork that is actually stored somewhere else. But assuming you had some need to fork an application and run it, this reflection-based indexing function would free the programmer from having to figure out how to pass messages back and forth, free them from having to write any middleware, and simply allow the programmer to use the capabilities of the other process, its data and functions.
But let's go back to the namespace thing. You see, we are still left with the problem I started out on - that a forked process has duplicate identifiers, half of which are simply wrong, because the data they name is actually stored in another process. But you know which process it is stored in, and you have a well-established mechanism for requesting data or calling functions in that process. If you knew for certain at compile time that some subset of those identifiers would be in the forked process, the compiler might translate any requests for those identifiers to instead use the pipe-send function.
But wait, there's a problem: both forks of the application are running the same code, so at compile time, it's impossible to distinguish whether the request is going to be requesting data from the original or forked process. This is where namespaces come in - you may not know whether a given namespace is in one process or the other, but you should know for certain that the contents of a specific namespace will all be in the same place. It's only when you exit the namespace that there is any chance you will be trying to reach data that exists in another process. That's simple, then: in the forking function, allow the programmer to specify which namespaces are being detached, resulting in two collections of namespaces, one associated with the original process, and one with the fork. This will be a runtime check, in case there is any reason to change which namespaces get separated in different circumstances - because it is a runtime check, a checking function will be called every time any bit of code requests data or functions outside its own namespace. (There are ways to reduce the overhead, but ignore that for now.)
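As a rough sketch of that runtime check, assuming the invoke and pipe_call helpers from the earlier sketches: the namespace ids, mark_detached, and resolve_call below are all hypothetical names, and in the real system the compiler would insert the resolve_call indirection automatically rather than the programmer writing it by hand:

```c
/* At fork time, each side records which namespaces it kept; every
 * cross-namespace access then goes through a resolver that either
 * touches local memory or goes over the pipe. */
#include <stdbool.h>

enum { NS_UI, NS_STORAGE, NS_COMPUTE, NS_COUNT };

static bool ns_is_local[NS_COUNT];  /* filled in by the forking function */

/* Called on both sides immediately after the fork. */
void mark_detached(int ns, bool this_side_keeps_it) {
    ns_is_local[ns] = this_side_keeps_it;
}

int invoke(const char *name, int n_args, int arg, int *out);     /* local table */
int pipe_call(int out_fd, int in_fd, const char *name, int arg); /* other process */

/* The runtime check: local namespaces use the ordinary lookup,
 * detached ones go over the pipe to wherever they actually live. */
int resolve_call(int ns, const char *name, int arg, int out_fd, int in_fd) {
    int result = 0;
    if (ns_is_local[ns]) {
        invoke(name, 1, arg, &result);
        return result;
    }
    return pipe_call(out_fd, in_fd, name, arg);
}
```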
Well, if you can do all this once, why not do it multiple times? Say you have a program with five total namespaces and you want each in its own process. There's no reason why that should be difficult, is there? You would simply need the runtime check - the one that knows which process contains a given piece of data - to also select the appropriate pipe to use. There is, admittedly, one complication - making sure that all forked processes receive updated routing tables - but since you've built a robust framework for sharing data already, that hardly seems like much of an imposition.
It's worth pointing out, though, that if this system works correctly, it should also work correctly if you never fork the application at all. In that case, there is no confusion about which process owns which variable - all of them are stored in the same process. So, your program will truly not care whether it is currently running in a single process, two processes, or five processes; when you as the programmer insist on calling a function or fetching data, the function will be called or the data fetched, no matter which process currently stores that particular piece of data.
Now, forking a process is necessarily something that happens in the same place where your application currently is, which sharply limits the usefulness of everything I've just described. Not to say that it's useless - I think anyone who is interested in forking processes would appreciate having all of this functionality - but what I've just described is not actually the functionality that I want from the system. A distributed application, as I've described it, is one where this forking process happens between two computers - where you can run part of the program logic on an entirely separate machine. And conceptually, once you've done everything I've said above, that's not actually a difficult proposition... to an extent, at least. You need some permissions, and you need to make certain that dependencies and the like are synchronized, but the technicalities of starting a process somewhere else, and binding a pipe that connects the two processes over the network, are not particularly difficult.
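One reason that last step is conceptually cheap: on a POSIX system, read() and write() behave the same on a connected socket as they do on a pipe, so the reader and writer sketched earlier work unchanged once the "pipe" is a network connection. The host and port here are placeholders, and all the real work of starting the remote half is left out:

```c
/* Open a TCP connection whose file descriptor can stand in anywhere
 * a pipe descriptor was used before (e.g. passed to pipe_call). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int connect_remote_half(const char *host, int port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }
    return fd;  /* one descriptor serves as both the read and write end */
}
```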
I've said this in other places, but to reiterate here: the value of doing this fork over the network is allowing your program to place specific bits of code where they are needed, without the program itself needing to care that those bits are not on the same machine. Generally, these "specific bits" fall into one of three categories: input, output, or computation. The benefit I'm talking about is not accelerating your program with parallelization (though you can do that too, if you have a whole lot of computing to do). No, the benefit I'm talking about is a kind of program that we don't currently possess - one that replaces the idea of remote desktops and SSH tunneling with running an application in one place, and handling the input and output in other places.
This category of applications is something that should be simple. If you want to edit files, you generally want the file editor to be where the files are, because it will involve a lot of reading from and writing to the disk; compared to that, GUI updates and keyboard and mouse inputs are much less sensitive. I, for instance, have an NFS volume attached to this computer that holds many of my files, but if I want to unzip files on that machine, I tend to open a remote shell and perform the action local to where the files are, rather than having my desktop wait on network traffic in both directions for every file operation. Likewise, some tasks that depend on accelerators, such as video editing, are best done when the data and the accelerator are in one place, but that doesn't necessarily require that the user be in that same place. As long as input and output latency is within acceptable parameters, you can do a lot remotely - and it's only easier if the entire output stack is run on the machine closest to the output itself, for the same reason. A lot of system GUI updates, especially, are highly redundant; you can summarize the changes with a few bytes, while a single new image frame may be several kilobytes, and sending every single frame of an easily-reproduced GUI animation may be millions of times more expensive than reproducing it locally.
Instead of asking what is required to live in a world where this is possible, it is better to ask why we don't already live in it. The legacy of computing simply assumes that everything will be in one place, and comparatively little of our development has been focused on splitting a program up into pieces distributed across a larger system. We are encountering this need more and more, and so more and more people are getting exposed to it, but until the default way we write programs allows this kind of cross-computer access natively, accessing a program running somewhere else will continue to be a tedious affair.
And to be fair, this "conceptual walkthrough" is more difficult than it sounds, in that it requires tweaks to compilers and programming languages, on top of providing a mechanism to deploy those forked programs on another machine. Those things happen, and there are experts in those fields, but if anyone is wondering why I haven't written my own code... rewriting programming languages is simply a task that's beyond me! Tasks that are much simpler for the compiler and linker would be an enormous headache to write by hand, but that doesn't mean it's easy to change how compilers work.
But for the people who actually know how to do all these things... I imagine that if they just knew what they were trying to accomplish, they could change the world. There is a lot that's possible, but which simply won't happen without the right support and tooling.