Object storage ( 2014-08-19 )

While trying to figure out what i am coding i had to attack this storage format before i wanted to. The immediate need is for code dumps, that are concise but readable. I started with yaml but that just takes too many lines, so it’s too difficult to see what is going on.

I just finished it, it’s a sort of condensed yaml i call sof (salama object file), but i want to take the moment to reflect why i did this, what the bigger picture is, where sof may go.

Program lifecycle

Let’s take a step back to mother smalltalk: there was the image. The image was/is the state of all the objects in the system. Even threads, everything. Absolute object thinking taken to the ultimate. A great idea off course, but doomed to ultimately fail because no man is an island (so no vm is either).

Development

Software development is a team sport, a social activity at it’s core. This is not always realised, when the focus is too much on the outcome, but when you look at it, everything is done in teams.

The other thing not really taken into account in the standard developemnt model is that it is a process in time that really only gets jucy with a first customer released version. Then you get into branches for bugs and features, versions with major and minor and before long you’r in a jungle of code.

Code centered

But all that effort is concentrated on code. Ok nowadays schema evlolution is part of the game, so the existance of data is acknowledged, but only as an external thing. Nowhere near that smalltalk model.

But off course a truely object oriented program is not just code. It’s data too. Maybe currently “just” configuration and enums/constants and locales, but that is exactly my point.

The lack of defined data/object storage is holding us back, making all our programs fruit-flies. I mean it lives a short time and dies. A program has no way of “learning”, of accumulating data/knowledge to use in a next invocation.

Optimisation example

Let’s take optimisation as an example. So a developer runs tests (rubyprof/valgrind or something) with some output and makes program changes accordingly. But there are two obvious problems. Firstly the data is collected in development not production. Secondly, and more importantly, a person is needed.

Of course a program could quite easily monitor itself, possibly over a long time, possibly only when not at epak load. And surely some optimisations could be automated, a bit like the O1 .. On compiler switches, more and more effort could be exerted on critical regions. Possibly all the way to super-optimisation.

But even if we did this, and a program would improve/jit itself, the fruits of this work are only usable during that run of that program. Future invocations, just like future versions of that program do not benefit. And thus start again, just like in Groundhog day.

Storage

So to make that optimisation example work, we would need a storage: Theoretically we could make the program change it’s own executable/object files, in ruby even it’s source. Theoretically, as we have no representation of the code to work on.

In salama we do have an internal representation, both at the code level (ast) and the compiled code (CompiledMethod, Intructions and friends).

Storage Format

Going back to the Image we can ask why was it doomed to fail: because of the binary, proprietary implementation. Not because of the idea as such.

Binary data needs either a rigourous specification and/or software to work on it. Work, what work? We need to merge the data between installations, maintain versions and branches. That sounds a lot like version control, because it basically is. Off course this “could” have been solved by the smalltalk people, but wasn’t. I think it’s fair to say that git was the first system to solve that problem.

And git off course works with diff, and so for a 3-way merge to be successful we need a text format. Which is why i started with yaml, and which is why also sof is text-based.

The other benefit is off course human readability.

So now we have an object file * format in text, and we have git. What we do with it is up to us. (* well, i only finished the writer. reading/parsing is “left as an excercise for the reader”:-)

Sof as object file format

Ok, i’ll sketch it a little: Salama would use sof as it’s object file format, and only sof would ever be stored in git. For developers to work, tools would create source and when that is edited compile it to sof.

A program would be a repository of sof and resource files. Some convention for load order would be helpful and some “area” where programs may collect data or changes to the program. Some may off course alter the sof’s directly.

How, when and how automatically changes are merged (via git) is up to developer policy . But it is easily imaginable that data in program designated areas get merged back into the “mainstream” automatically.