Understanding rubyx through historic decisions (2018-07-25)

Of course the architecture gives a good overview of the system as it is. But it does not explain how we got there. And sometimes knowing the journey makes it easier to understand where one is. So i shall try to highlight the four or five main decisions that shaped the system.

MacBook + Ruby == Raspberry Pi

When i bought my first 30 Euro Pi i noticed that ruby is unusable on it. Looking at how slow ruby actually is, it occurred to me that ruby just about turns the Pi into my first 286 laptop (running at 6MHz), which is the same as turning my MacBook Pro into a Pi.

Of course, while working on web-apps, which can be parallelized so easily, and with a company paying for both developer and hardware, the standard ruby argument holds. But since i wanted to use my Pi for demanding projects, something had to be done.

Judy, the importance of cpu cache

Judy is a really really fast digital tree, a kind of hash. I actually built a memory database with it that was also really really fast. When connecting it to Rails i ran into the above problem: the niceties of ActiveRecord (ruby) brought the performance of my extension (c) down by a factor of 40.

But anyway, the point is that Judy's speed is based on a radical optimisation for cache lines (and key compression). This means all its data structures are exactly one cpu cache line big. As i learned, cpus do not access memory in word-sized chunks, but always a cache line at a time. This basically led to rubyx's memory model, which is fixed-size objects that are multiples of a cache line.
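
To make that concrete, here is a minimal sketch of the idea, assuming a 64-byte cache line and 4-byte words (as on a 32-bit Pi); the constants and method are illustrative, not rubyx's actual code:

    # Pad any object up to a whole number of cache lines.
    CACHE_LINE_BYTES = 64
    WORD_BYTES       = 4
    WORDS_PER_LINE   = CACHE_LINE_BYTES / WORD_BYTES   # 16 words

    # number of words an object occupies once padded to full cache lines
    def padded_words(payload_words)
      lines = (payload_words.to_f / WORDS_PER_LINE).ceil
      lines * WORDS_PER_LINE
    end

    padded_words(3)    # => 16, one cache line even for a tiny object
    padded_words(17)   # => 32, two cache lines

The padding looks wasteful, but combined with cache-line alignment it means one object never shares a line with another.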

Microkernel

As a young engineer, i thought, like my peers, that Linux (then 0.93) was the greatest thing. Only much later did i learn that it is really just a copy, and the reason it got popular was not technical, but licensing (the same reason it is in Android, i believe). The reason it stayed popular is inertia; in other words, writing device drivers is hard.

Synthesis, L4 and Minix are good proof that the microkernel is the superior architecture. L4, for example, can run another OS as an application with about 4% performance degradation, and Minix can recover from a device driver failure.

This, plus the fact that we have bundler, brought me to the approach: if you can leave it out, do. Much of the functionality that is in ruby (MRI) will never be in RubyX, but rather be supplied by gems.

System interrupts

In the beginning i was of course contemplating how much of the existing c-based systems i would use. Like LLVM, which is of course a great tool, though made for c-ish languages. Or libc, which again is really for c apps to access the kernel.

The sheer size of the functionality one inherits almost swayed me. Even though i had long since determined that one of ruby's biggest flaws, its std-lib, came from modelling and using libc.

Then i learned assembler, looked at libc implementations, and learned what i believe made the decision: kernel calls are not really calls at all. They are software interrupts, which basically means you fill some registers, flick the switch, and at the next instruction you can collect the result from a specified register. This may look like a call, and of course by using libc it is presented as a call, but it is not. It is a very simple set of assembler instructions.
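
As a minimal sketch of what that amounts to on 32-bit Arm Linux (EABI), where the call number goes in r7, the arguments in r0 upwards, svc 0 flicks the switch and the result comes back in r0, here is the sequence a compiler could emit. The Instruction struct is purely illustrative, not rubyx's own class:

    Instruction = Struct.new(:mnemonic, :args)

    # build the instruction sequence for one kernel call
    def kernel_call(number, args)
      seq = []
      args.each_with_index do |arg, i|
        seq << Instruction.new(:mov, ["r#{i}", arg])   # arguments in r0 upwards
      end
      seq << Instruction.new(:mov, ["r7", number])     # the call number
      seq << Instruction.new(:svc, [0])                # the software interrupt
      seq                                              # afterwards the result sits in r0
    end

    # exit(0) is call number 1 on arm-linux
    kernel_call(1, [0]).each { |ins| puts "#{ins.mnemonic} #{ins.args.join(', ')}" }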

For me this meant there is very very little benefit in using c, either in its libc form, or as assembler/linker (i had found a ruby gem to do that easily), or, maybe most importantly, as the c calling convention. All of these things are great for c programs, but they are just not made for dynamic languages, and that would have brought a whole slew of problems.

Return address is a parameter

In C calling (and probably other languages too), the return address is determined implicitly by the call instruction, which pushes the pc to the stack. But Arm has a different way: an instruction called Branch with Link, that actually stores the pc in a separate register called the link register.

And this made me realise that, really, the return address is always a parameter to a function. Like other parameters it can use a register. It is the C way to hide this implicit parameter, much in the same way it is the oo way to hide the self parameter.

By this time i was already coding some rudimentary calling convention, and it did not take long to verify this in code. It is in fact quite easy to determine the return address at compile time and pass it explicitly. (Easy, that is, if one does not use a c linker.)
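
A minimal sketch of the idea: the caller knows at compile time the label of the instruction after the call, so it can pass it into the callee's message just like any other argument. Label and Message are illustrative stand-ins here, not rubyx's actual classes:

    Label   = Struct.new(:name)
    Message = Struct.new(:receiver, :arguments, :return_address)

    # at compile time the return address is simply the label placed after the call
    def compile_call(receiver, arguments, call_position)
      after_call = Label.new("continue_#{call_position}")
      message    = Message.new(receiver, arguments, after_call)
      # emit: set up the message, jump to the method, then place the label
      [message, after_call]
    end

    msg, label = compile_call("some_object", [1, 2], 42)
    msg.return_address.name  # => "continue_42"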

OO calling convention

Another thing that deterred me from C is the way it uses the stack. It is so completely not oo, and cryptic. It is, in other words, very difficult to unwind, and it makes closures almost impossible to implement.

Since the assembly work had progressed easily, i made performance tests with an oo calling convention, and determined that the price would be about 50%. Since the current gap is more than an order of magnitude, this seemed ok, given that it would make the compilation process so much easier.

The resulting calling convention uses normal Message objects that form a linked list, rather than a stack. Since they are completely standard objects, manipulation both at run and compile time is totally integrated.
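
A minimal sketch of what such a linked list of messages might look like, in plain illustrative Ruby rather than rubyx's actual Message class, with unwinding reduced to following the caller links:

    class Message
      attr_reader :caller, :receiver, :name, :arguments

      def initialize(caller, receiver, name, arguments)
        @caller, @receiver, @name, @arguments = caller, receiver, name, arguments
      end

      # unwinding is just walking the caller links, no stack layout to decode
      def backtrace
        frame, trace = self, []
        while frame
          trace << "#{frame.receiver.class}##{frame.name}"
          frame = frame.caller
        end
        trace
      end
    end

    main = Message.new(nil, Object.new, :main, [])
    call = Message.new(main, "hello", :length, [])
    call.backtrace  # => ["String#length", "Object#main"]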

Function calling has been working for years, but recently i cracked dynamic method dispatch too, which was not that hard really. Currently the work is progressing to blocks, and the clear structure does help a lot. And while exceptions (or bindings) are not started, i think they will come with relative ease (compared to the c way), since the structures are very simple.

Decisions that affect the future

Metasm

I gave Metasm several long looks. After all, it has assemblers and disassemblers for at least 10 cpus, and support for several binary formats, including elf. The reason not to use it was not that it is big (including much we don't need), but rather that it is unmaintained and unresponsive.

It would be great to split all that code into several gems: a core, and one each per cpu, binary format, and assembler/disassembler. Only the core would need to be integrated into rubyx, and one could just use the platform-specific gems. But the decision was that I am not the one to do this work.

Lock free Concurrency

Concurrency will have to be part of the core, even if it is just to get a gc working. The work that Massalin did already showed how effective lock-free concurrency is, but Dr. Cliff Click took it into the modern (java) world by publishing a lock-free hash that he later ran on some crazy machine with 800 cpus.

I am not sure whether it will be better to port the java code or try a diy version. And of course, to even get started on this, rubyx will need the compare-and-swap primitives that underlie the lock-free approach. But all in due time.
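
To show the kind of primitive that means, here is a minimal sketch of the usual compare-and-swap retry loop, with the concurrent-ruby gem's AtomicReference standing in for the primitive rubyx would have to provide itself:

    require 'concurrent'

    counter = Concurrent::AtomicReference.new(0)

    # retry until the swap succeeds, i.e. nobody changed the value in between
    def lock_free_increment(ref)
      loop do
        old = ref.get
        return old + 1 if ref.compare_and_set(old, old + 1)
      end
    end

    threads = 4.times.map { Thread.new { 1000.times { lock_free_increment(counter) } } }
    threads.each(&:join)
    counter.get  # => 4000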

The actual concurrency i am envisioning is two os-threads per core, one for kernel interaction and one for normal operation. Kernel calls would never be executed on the second, but always queued onto the dedicated kernel threads. The non-kernel threads would be used to run fibers. If we insert some little check into the calling, switching could happen very often, and because of the linked-list approach it would be very very fast. And because of the offloading of kernel calls it would never stall (completely). This way one can achieve the sort of millions of fibers erlang is known for.
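
As a rough sketch of that split, with plain Ruby Thread, Queue and Fiber standing in for what rubyx would eventually do itself: the kernel work is queued to a dedicated thread, and the fiber yields instead of blocking.

    kernel_queue = Queue.new

    # only this thread ever blocks inside the kernel
    kernel_thread = Thread.new do
      while (job = kernel_queue.pop)
        call, results = job
        results << call.call
      end
    end

    # a fiber queues its kernel call and yields instead of stalling
    worker = Fiber.new do
      results = Queue.new
      kernel_queue << [-> { File.read(__FILE__).size }, results]
      Fiber.yield while results.empty?   # a real scheduler would resume us when ready
      results.pop
    end

    size = nil
    size = worker.resume while size.nil?
    kernel_queue.close
    puts size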

Housekeeping and garbage collection

Often, in systems that are designed to be collected, the base object has some field to support this. This was deliberately left out. RubyX only has objects, so the field would have to be an Object, which is too much overhead. Or there would have to be dedicated instructions to deal with a raw data word, which is too much overhead in another way.

Gc will be a completely external gem, so experimenting will be easy and encouraged. Gc implementers will just have to use their own structures to keep track of the state that they need. Judy-style digital trees can do this while actually using less memory than a field would, but handcrafted bitfields will also be good.

The actual marking phase should be relatively easy, as the world is known completely. There are no grey stack areas where one has to guess, as all objects are typed and the type determines which slots are objects. Not even registers are a grey area, as we switch cooperatively; only the Message register is ever valid.
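
A minimal sketch of such a mark phase, with the type reduced to a list of reference slots and the mark state kept outside the objects in a plain Set (where a real gc gem might use bitfields or a Judy-style tree); the classes are illustrative, not rubyx's:

    require 'set'

    # type here is just the list of slot indexes that hold references
    TypedObject = Struct.new(:type, :slots)

    def mark(root, marked = Set.new)
      stack = [root]
      until stack.empty?
        obj = stack.pop
        next if obj.nil? || marked.include?(obj)
        marked << obj
        obj.type.each { |index| stack << obj.slots[index] }  # follow reference slots only
      end
      marked
    end

    leaf   = TypedObject.new([], [42, "raw data"])
    parent = TypedObject.new([0], [leaf, 7])   # only slot 0 is a reference
    mark(parent).size  # => 2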

In fact, all this makes even moving objects relatively easy. There is of course the effort of going through the world to find all backlinks, but if that is done during a mark, it comes at relatively low cost.

All in all a very interesting topic, and surely someone will come up with some great idea. And of course there will have to be a most rudimentary version from the start, just enough to work and give someone motivation to improve it.