3. Kernel Code Generator

For, behold, I create new heavens and a new earth.
-- The Bible, Isaiah

3.1 Fundamentals

Kernel code synthesis is the name given to the idea of creating executable machine code at runtime as a means of improving operating system performance. This idea distinguishes Synthesis from all other operating systems research efforts, and is what helps make Synthesis efficient.

Runtime code generation is the process of creating executable machine code during program execution for use later during the same execution [16]. This is in contrast to the usual way, where all the code that a program runs has been created at compile time, before program execution starts. In the case of an operating system kernel like Synthesis, the "program" is the operating system kernel, and the term "program execution" refers to the kernel's execution, which lasts from the time the system is started to the time it is shut down.

There are performance benefits in doing runtime code generation because there is more information available at runtime. Special code can be created based on the particular data to be processed, rather than relying on general-purpose code that is slower. Runtime code generation can extend the benefits of detailed compile-time analysis by allowing certain data-dependent optimizations to be postponed to runtime, where they can be done more effectively because there is more information about the data. We want to make the best possible use of the information available at compile-time, and use runtime code generation to optimize data-dependent execution.

The goal of runtime code generation can be stated simply:

Never evaluate something more than once.

For example, suppose that the expression, A * A + A * B + B * B is to be evaluated for many different A while holding B = 1. It is more efficient to evaluate the reduced expression obtained by replacing B with 1: A * A + A + 1.

The problem is one of knowing how soon we can know what value a variable has, and how that information can be used to improve the program's code. In the previous example, if it can be deduced at compile time that B = 1, then a good compiler can perform precisely the reduction shown. But usually we can not know ahead of time what value a variable will have. B might be the result of a long calculation whose value is hard if not impossible to predict until the program is actually run. But when it is run, and we know B, runtime code generation allows us to use the newly-acquired information to reduce the expression.

Specifically, we create specialized code once the value of B becomes known, using an idea called partial evaluation [15]. Partial evaluation is the building of simpler, easierto-evaluate expressions from complex ones by substituting variables that have a known, constant value with that constant. When two or more of these constants are combined in an arithmetic or logical operation, or when one of the constants is an identity for the operation, the operation can be eliminated. In the previous example, we no longer have to compute B * B, since we know it is 1, and we do not need to compute A * B, since we know it is A.

There are strong parallels between runtime code generation and compiler code generation, and many of the ideas and terminology carry over from one to the other. Indeed, anything that a compiler does to create executable code can also be performed at runtime. But because compilation is an off-line process, there is usually less concern about the cost of code generation and therefore one has a wider palette of techniques to choose from. A compiler can afford to use powerful, time-consuming analysis methods and perform sophisticated optimizations - a luxury not always available at runtime.

Three optimizations are of special interest to us, not only because they are easy to do, but because they are also effective in improving code quality. They are: constant folding, constant propagation, procedure inlining. x = 5; y = 4 * x; becomes x = 5; y = 4 * 5; through constant propagation; 4 * 5 then becomes 20 through constant folding. Procedure inlining substitutes the body of a procedure, with its local variables appropriately renamed to avoid conflicts, in place of its call.

There are three costs associated with runtime code generation: creation cost, paid each time a piece of code is created; execution cost, paid each time the code is used; and management costs, to keep track of where the code is and how it is being used. The hope is to use the information available at runtime to create better code than would otherwise be possible. In order to win, the savings of using the runtime-created code must exceed the cost of creating and managing that code. This means that for many applications, a fast code generator that creates good code will be superior to a slow code generator that creates excellent code. (The management problem is analogous to keeping track of ordinary, heap-allocated data structures, and the costs are similar, so they will not be considered further.)

Synthesis focuses on techniques for implementing very fast runtime code generation. The goal is to broaden its applicability and extend its benefits, making it cheap enough so that even expressions and procedures that are not re-used often still benefit from having their code custom-created at runtime. To this end, the places where runtime code generation is used are limited to those where it is clear at compile time what the possible reductions will be. The following paragraphs describe the idea, while the next section describes the specific techniques.

A fast runtime code generator can be built by making full use of the information available at compile time. In our example, we know at compile time that B will be held constant, but we do not know what the constant will be. But we can predict at compile-time what form the reduced expression will have: A * A + C1 * A + C2. A * A + C1 * A + C2 into newly allocated memory and computes and fills the constants: C1 = B and C2 = B * B.

Optimizations to the runtime-created code can also be pre-computed. In this example, interesting optimizations occur when B is 0, 1, or a power of two. Separate templates for each of these cases allow the most efficient code possible to be generated. The point is that there is plenty of information available at compile time to allow not just simple substitution of variables by constants, but also interesting and useful optimizations to happen at runtime with minimal analysis.

The general idea is: treat runtime code generation as if it were just another "function" to be optimized, and apply the idea of partial evaluation recursively. That is, just as in the previous example we partially-evaluate the expression A * A + A * B + B * B with respect to the variable held constant, we can partially-evaluate the optimizations with respect to the parameters that the functions will be specialized under, with the result being specialized code-generator functions.

Looking at a more complex example, suppose that the compiler knows, either through static control-flow analysis, or simply by the programmer telling it through some directives, that the function f(p1, ...) = 4 * p1 + ... will be specialized at runtime for constant p1. The compiler can deduce that the expression 4 * p1 will reduce to a constant, but it does not know what particular value that constant will have. It can capture this knowledge in a custom code generator for f that computes the value 4 * p1 when p1 becomes known and stores it in the correct spot in the machine code of the specialized function f, bypassing the need for analysis at runtime. In another example, consider the function g, g(p1, ...) = if(p1 != 10) S1; else S2;,

The idea applies recursively. For example, once we have a code generator for a particular kind of expression or statement, that same generator can be used each time that kind of expression occurs, even if it is in a different part of the program. Doing this limits the proliferation of code generators and keeps the program size small. The resulting runtime code generator has a hierarchical structure, with generators for the large functions calling sub-generators to create the individual statements, which in turn call yet lower-level generators, and so on, until at the bottom we have very simple generators that, for example, move a constant into a machine register in the most efficient way possible.

3.2 Methods of Runtime Code Generation

The three methods Synthesis uses to create machine code are: factoring invariants, collapsing layers, executable data structures.

3.2.1 Factoring Invariants

The factoring invariants method is equivalent to partial evaluation where it is known at compile time the variables over which a function will be partially evaluated. It is based on the observation that a functional restriction is usually easier to calculate than the original function. Consider a general function:

F_big(p1, p2, ... , pn)

If we know that parameter p1 will be held constant over a set of invocations, we can factor it out to obtain an equivalent composite function:

[ F^create(p1) ] (p2, ... , pn) ≡ F_big(p1, p2, ... , pn)

F ^create is a second-order function. Given the parameter p1, F ^create returns another function, F_small, which is the restriction of F _big that has absorbed the constant argument p1:

F_small(p2, ... , pn) ⊂ F_big(p1, p2, ... , pn)

If F ^create is independent of global data, then for a given p1, F ^create will always compute the same F _small regardless of global state. This allows F^create(p1) to be evaluated once and the resulting F _small used thereafter. If F _small is executed m times, generating and using it pays off when

Cost(F^create) + m * Cost(F_small) < m * Cost(F_big)

As the "factoring invariants" name suggests, this method resembles the constant propagation and constant folding optimizations done by compilers. The analogy is strong, but the difference is also significant. Constant folding eliminates static code and calculations. In addition, Factoring Invariants can also simplify dynamic data structure traversals that depend on the constant parameter p1.

For example, we can apply this idea to improve the performance of the read system function. When reading a particular file, constant parameters include the device that the file resides on, the address of the kernel buffers, and the process performing the read. We can use file open as F^create; the F _small it generates becomes our read function. F ^create consists of many small procedure templates, each of which knows how to generate code for a basic operation such as "read disk block", "process TTY input", or "enqueue data." The parameters passed to F ^create determine which of these code-generating procedures are called and in what order. The final F _small is created by filling these templates with addresses of the process table, device registers, and the like.

3.2.2 Collapsing Layers

The collapsing layers method is equivalent to procedure inlining where it is known at compile time which procedures might be inlined. It is based on the observation that in a layered design, separation between layers is a part of specification, not implementation. In other words, procedure calls and context switches between functional layers can be bypassed at execution time. Let us consider an example from the layered OSI model:

F_big(p1, p2, ... , pn) ≡ F_applica(p1, F_present(p2, F_session( ... F_datalnk(pn) ... )))

F _applica is a function at the Application layer that calls successive lower layers to send a message. Through in-line code substitution of F _present in F_applica, we can obtain an equivalent flat function by eliminating the procedure call from the Application to the Presentation layer:

F_flatapplica(p1, p2, F_session( ... )) ≡ F_applica(p1, F_present(p2, F_session( ... )))

The process to eliminate the procedure call can be embedded into two second-order functions. F^createpresent returns code equivalent to F _present and suitable for in-line insertion. F^createapplica incorporates that code to generate F flatapplica.

F ^create _applica(p1, ^create _present(p2, _flatapplica(p1,

This technique is analogous to in-line code substitution for procedure calls in compiler code generation. In addition to the elimination of procedure calls, the resulting code typically exhibit opportunities for further optimization, such as Factoring Invariants and elimination of data copying.

By induction, F ^create _present can eliminate the procedure call to the Session layer, and down through all layers. When we execute F ^create _flatapplica to establish a virtual circuit, the F _flatapplica code used thereafter to send and receive messages may consist of only sequential code. The performance gain analysis is similar to the one for factoring invariants.

3.2.3 Executable Data Structures

The executable data structures method reduces the traversal time of data structures that are frequently traversed in a preferred way. It works by storing node-specific traversal code along with the data in each node, making the data structure self-traversing.

Consider an active job queue managed by a simple round-robin scheduler. Each element in the queue contains two short sequences of code: stopjob and startjob. stopjob saves the registers and branches into the next job's startjob routine (in the next element in queue). The startjob restores the new job's registers, installs the address of its own stopjob in the timer interrupt vector table, and resumes processing.

An interrupt causing a context switch will execute the current program's stopjob, startjob.

3.2.4 Performance Gains

Runtime code generation and partial evaluation can be thought of as a way of caching frequently visited states. It is interesting to contrast this type of caching with the caching that existing systems do using ordinary data structures. Generally, systems use data structures to capture state and remember expensive-to-compute values. For example, when a file is opened, a data structure is built to describe the file, including its location on disk and a pointer to the procedure to be used to read it. The read procedure interprets state stored in the data structure to determine what work is to be done and how to do it.

In contrast, code synthesis encodes state directly into generated procedures. The resulting performance gains extend beyond just saving the cost of interpreting a data structure. To see this, let us examine the performance gains obtained from hard-wiring a constant directly into the code compared to fetching it from a data structure. Hardwiring embeds the constant in the instruction stream, so there is an immediate savings that comes from eliminating one or two levels of indirection and obviating the need to pass the structure pointer. These can be attributed to "saving the cost of interpretation." But hardwiring also opens up the possibility of further optimizations, such as constant folding, while fetching from a data structure admits no such optimizations. Constant folding becomes possible because once it is known that a parameter will be, say, 2, all pure functions of that parameter will likewise be constant and can be evaluated once and the constant result used thereafter. A similar flavor of optimization arises with IF-statements. In the code fragment "if(C) S1; else S2;", where the conditional, C, depends only on constant parameters, the generated code will contain either S1 or S2, never both, and no test. It is with this cascade of optimization possibilities that code synthesis obtains its most significant performance gains. The following section illustrates some of the places in the kernel where runtime code generation is used to advantage.

3.3 Uses of Code Synthesis in the Kernel

3.3.1 Buffers and Queues

Buffers and queues can be implemented more efficiently with runtime code generation than without.

char buf[100], *bufp = &buf[0], *endp = &buf[100];
Put(c)
{
    *bufp++ = c;
    if(bufp == endp)
	flush();
}

Put:		// (character is passed register d0)
    move.l (bufp),a0	// (1) Load buffer pointer into register a0
    move.b d0,(a0)+	// (2) Store the character and increment the a0 register
    move.l a0,(bufp) 	// (3) Update the buffer pointer
    cmp.l (endp),a0	// (4) Test for end-of-buffer
    beq flush		// ... if end, jump to flush routine
    rts 		// ... otherwise return

Figure 3.1: Hand-crafted assembler implementation of a buffer

Figure 3.1 shows a good, hand-written 68030 assembler implementation of a buffer.

The C language code illustrates the intended function, while the 68030 assembler code shows the work involved. The work consists of: (1) loading the buffer pointer into a machine register; (2) storing the character in memory while incrementing the pointer register; (3) updating the buffer pointer in memory; and (4) testing for the end-of-buffer condition. This fragment executes in 28 machine cycles not counting the procedure call overhead.

Put:		// (character is passed register d0)
    move.l (P),a0	// Load buffer pointer into register a0
    move.b d0,(a0,D)	// Store the character
    addq.w #1,(P+2)	// Update the buffer pointer and test if reached end
    beq flush		// ... if end, jump to flush routine
    rts			// ... otherwise return

Figure 3.2: Better buffer implementation using code synthesis

Table 3.1: CPU Cycles for Buffer-Put
68030 CPU, 25MHz, 1-wait-state main memory
	Cold cache	Warm cache
Code-synthesis (CPU cycles)	29	20
Hand-crafted assembly (CPU cycles)	37	28
Speedup	1.4	1.4

Figure 3.2 shows the code-synthesis implementation of a buffer, which is 40% faster. Table 3.1 gives the actual measurements. The improvement comes from the elimination of the cmp instruction, for a savings of 8 cycles. The code relies on the implicit test for zero that occurs at the end of every arithmetic operation. Specifically, we arrange that the lower 16 bits of the pointer variable be zero when the end of buffer is reached, so that incrementing the pointer also implicitly tests for end-of-buffer.

This is done for a general pointer as follows. The original bufp pointer is represented as the sum of two quantities: a pointer-like variable, P, D. P + D, move.b d0,(a0,D)" D, P + D points to the end of the buffer, P is 0 modulo 2¹⁶, that is, the least significant 16 bits of P are zero. The "addq.w #1,(P+2)" instruction then increments the lower 16 bits of the buffer pointer and also implicitly tests for end-of-buffer, which is indicated by a 0 result. For buffer sizes greater than 2 ¹⁶ bytes, the flush routine can propagate the carry-out to the upper bits, flushing the buffer when the true end is reached.

Table 3.2: Comparison of C-Language "stdio" Libraries
10 ⁷ Executions of:	Execution time, seconds	Size: Bytes/Invocation
Unix "putchar" macro	21.4 user; 0.1 system	132
Synthesis "putchar" macro	13.0 user; 0.1 system	30
Synthesis "putchar" function	19.0 user; 0.1 system	8

This performance gain can only be had using runtime code generation, because D must be a constant, embedded in the buffer's machine code, to take advantage of the fast memory-reference instruction. Were D a variable, the loss of fetching its value and indexing would offset the gain from eliminating the compare instruction. The 40% savings is significant because buffers and queues are used often. Another advantage is improved locality of reference: code synthesis puts both code and data in the same page of memory, increasing the likelihood of cache hits in the memory management unit's address translation cache.

Outside the kernel, the Synthesis implementation of the C-language I/O library, "stdio," uses code-synthesized buffers at the user level. In a simple experiment, I replaced the Unix stdio library with the Synthesis version. I compiled and ran a simple test program that invokes the putchar macro ten million times, using first the native Unix stdio library supplied with the Sony NEWS workstation, and then the Synthesis version. Table 3.2 shows the Synthesis macro version is 1.6 times faster, and over 4 times smaller, than the Unix version.

The drastic reduction in code size comes about because code synthesis can take advantage of the extra knowledge available at runtime to eliminate execution paths that cannot be taken. The putchar operation, as defined in the C library, actually supports three kinds of buffering: block-buffered, line-buffered and unbuffered. Even though only one of these can be in effect at any one time, the C putchar macro must include code to handle all of them, since it cannot know ahead of time which one will be used. In contrast, code synthesis creates only the code handling the kind of buffering actually desired for the particular file being written to. Since putchar, being a macro, is expanded in-line every time it appears in the source code, the savings accumulate rapidly.

Table 3.2 also shows that the Synthesis "putchar" function is slightly faster than the Unix macro - a dramatic result, that even incurring a procedure call overhead, code synthesis still shows a speed advantage over conventional code in-lined with a macro.

3.3.2 Context Switches

One reason that context switches are expensive in traditional systems like Unix is that they always save and restore the entire CPU context, even though that may not be necessary. For example, a process that did not use floating point since it was switched in does not need to have its floating-point registers saved when it is switched out. Another reason is that saving context is often implemented as a two-step procedure: the CPU registers are first placed in a holding area, freeing them so they can be used to perform calculations and traverse data structures to find out where the context was to have been put, and finally copying it there from the holding area.

A Synthesis context switch takes less time because only the part of the context being used is preserved, not all of it, and because the critical path traversing the ready queue is minimized with an executable data structure.

The first step is to know how much context to preserve. Context switches can happen synchronously or asynchronously with thread execution. Asynchronous context switches are the result of external events forcing preemption of the processor, for example, at the end of a CPU quantum. Since they can happen at any time, it is hard to know in advance how much context is being used, so we preserve all of it. Synchronous context switches, on the other hand, happen as a result of the thread requesting them, for example, when relinquishing the CPU to wait for an I/O operation to finish. Since they occur in specific, well-defined points in the thread's execution, we can know exactly how much context will be needed and therefore can arrange to preserve only that much. For example, suppose a read procedure needs to block and wait for I/O to finish. Since it has already saved some registers on the stack as part of the normal procedure-call mechanism, there is no need to preserve them again as they will only be overwritten upon return.

proc:
	:
	:
	{Save necessary context}
	bsr swtch
res:
	{Restore necessary context}
	:
	:

swtch:
	move.l	(Current),a0	 // (1) Get address of current thread's TTE
	move.l	sp,(a0)		 // (2) Save its stack pointer
	bsr	find_next_thread // (3) Find another thread to run
	move.l	a0,(Current)	 // (4) Make that one current
	move.l	(a0),sp		 // (5) Load its stack pointer
	rts			 // (6) Go run it!

Figure 3.3: Context Switch

Figure 3.3 illustrates the general idea. When a kernel thread executes code that decides that it should block, it saves whatever context it wishes to preserve on the active stack. It then calls the scheduler, swtch; doing so places the thread's program counter on the stack. At this point, the top of stack contains the address where the thread is to resume execution when it unblocks, with the machine registers and the rest of the context below that. In other words, the thread's context has been reduced to a single register: its stack pointer. The scheduler stores the stack pointer into the thread's control block, known as the thread table entry (TTE), which holds the thread state when it is not executing. It then selects another thread to run, shown as a call to the find next thread procedure in the figure, but actually implemented as an executable data structure as discussed later. The variable Current is updated to reflect the new thread and its stack pointer is loaded into the CPU. A return-from-subroutine (rts) instruction starts the thread running. It continues where it had left off (at label res), where it pops the previously-saved state off the stack and proceeds with its work.

Figure 3.4 shows two TTEs. Each TTE contains code fragments that help with context switching: sw_in and sw_in_mmu, sw_out, sw_in or sw_in_mmu procedure. To switch out a thread, the processor executes the thread's sw_out procedure.

Figure 3.4: Thread Context

Notice how the ready-to-run threads (waiting for CPU) are chained in an executable circular queue. A jmp instruction at the end of the sw_out procedure of the preceding thread points to the sw_in procedure of the following thread. Assume thread-0 is currently running. When its time quantum expires, the timer interrupt is vectored to thread-0's sw_out. sw_in or sw_in_mmu. sw_in_mmu when a change of address space is required; otherwise control flows to sw_in. ¹

¹ Previous papers incorrectly cite a floating-point context switch time of 15 µs [25] [18]. This error is believed to have been caused by a bug in the Synthesis assembler, which incorrectly filled the operand field of the floating-point move-multiple-registers instruction causing it to preserve just one register, instead of all eight. Since very few Synthesis applications use floating point, this bug remained undetected for a long time.

Table 3.3 summarizes the time taken by the various types of context switches in Synthesis, saving and restoring all the integer registers. These times include the hardware interrupt service overhead -- they show the elapsed time from the execution of the last instruction in the suspended thread to the first instruction in the next thread. Previously published papers report somewhat lower figures [25] [18]. This is because they did not include the interrupt-service overhead, and because of some extra overhead incurred in handling the 68882 floating point unit on the Sony NEWS workstation that does not occur on the Quamachine, as discussed later. For comparison, a call to a null procedure in the C language takes 1.4 microseconds, and the Sony Unix context switch takes 170 microseconds.

Table 3.3: Cost of Thread Scheduling and Context Switch
68030 CPU, 25MHz, 1-wait-state main memory, cold cache
Type of context switch	Time (µs)
Integer registers only	10.5
Floating-point	52
Integer, change address space	16
Floating-point, change address space	56
Null procedure call (C language)	1.4
Sony NEWS, Unix	170
NeXT Machine, Mach	510

In addition to reducing ready-queue traversal time, specialized context-switch code enables further optimizations, to move only needed data. The previous paragraph already touched on one of the optimizations: bypassing the MMU address space switch when it is not needed. The other optimizations occur in the handling of floating point registers, described now, and in the handling of interrupts, described in the next section.

Switching the floating point context is expensive because of the large amount of state that must be saved. The registers are 96 bits wide; moving all eight registers requires 24 transfers of 32 bits each. The 68882 coprocessor compounds this cost, because each word transferred requires two bus cycles: one to fetch it from the coprocessor, and one to write it to memory. The result is that it takes about 50 microseconds just to save and restore the hundred-plus bytes of information comprising the floating point coprocessor state. This is more than five times the cost of doing an entire context switch without the floating point.

Since preserving floating point context is so expensive, we use runtime tests to see if floating point had been used to avoid saving state that is not needed. Threads start out assuming floating point will not be used, and their context-switch code is created without it. When context-switching out, the context-save code checks whether the floating point unit had been used. It does this using the fsave instruction of the Motorola 68882 floating point coprocessor, which saves only the internal microcode state of the floating point processor [20]. The saved state can be tested to see if it is not null. If so, the user-visible floating-point state is saved, and the context-switch code re-created to include the floating-point context in subsequent context switches. Since the majority of threads in Synthesis do not use floating point, the savings are significant.

Unfortunately, after a thread executes its first floating point instruction, floating point context will have to be preserved from that point on, even if no further floating-point instructions are issued. The context must be restored upon switch-in because a floating point instruction might be executed. The context must be saved upon switch-out even if no floating point instructions had been executed since switch-in because the 68882 cannot detect a lack of instruction execution. It can only tell us if its state is completely null. This is bad because sometimes a thread may use floating-point at first, for example, to initialize a table, and then not again. But with the 68882, we can only optimize the case when floating point is never used.

The Quamachine has hardware to alleviate the problem. Its floating-point unit - also a 68882 - can be enabled and disabled by software command, allowing a lazyevaluation of floating-point context switches. Switching in a thread for execution loads its integer state and disables the floating-point unit. When a thread executes its first floating point instruction since the switch, it takes an illegal instruction trap. The kernel then loads the necessary state, first saving any prior state that may have been left there, reenables the floating-point unit, and the thread resumes with the interrupted instruction. The trap is taken only on the first floating-point instruction following a switch, and adds only 3 µs to the overhead of restoring the state. This is more than compensated for by the other savings: integer context-switch becomes 1.5 µs faster because there is no need for an fsave instruction to test for possible floating-point use; and even floating-point threads benefit when they block without a floating point instruction being issued since they were switched in, saving the cost of restoring and then saving that context. Indeed, if only a single thread is using floating point, the floating point context is never switched, remaining in the coprocessor.

3.3.3 Interrupt Handling

A special case of context switching occurs in interrupt handling. Many systems, such as Unix, Unix kernel reveals that not only are all integer registers saved on each interrupt, but the active portion of the floating-point context as well. This is one of the reasons that interrupt handling is expensive on a traditional system, and the reason why the designers of those systems try hard to avoid frequent interrupts. As shown earlier, preserving the floating-point state can be very expensive. Doing so is superfluous unless the interrupt handler uses floating point; most do not.

Synthesis interrupt handling is faster because it saves and restores only the part of the context that will be used by the service routine, not all of it. Code synthesis allows partial context to be saved efficiently. Since different interrupt procedures use different amounts of context, we can not, in general, know how much context to preserve until the interrupt is linked to its service procedure. Furthermore, it may be desirable to change service procedures, for example, when changing or installing new I/O drivers in the running kernel. Without code synthesis, we would have to save the union of all contexts used by all procedures that could be called from the interrupt, slowing down all because of the needs of a few.

Examples taken from the Synthesis Sound-IO device driver illustrate the ideas and provide performance numbers. The Sound-IO device is a general-purpose, high-quality audio input and output device with stereo, 16-bit analog-to-digital and digital-to-analog converters, and a direct-digital input channel from a CD player. This device interrupts the processor once for every sound sample - 44100 times per second - a very high number by conventional measures. It is normally inconceivable to attach such high-rate interrupt sources to the main processor. Sony Unix, ² Efficient interrupt handing is mandatory, and the rest of this section shows how Synthesis can service high interrupt rates efficiently.

²The Sony workstation has two processors, one of which is dedicated to I/O, including servicing I/O interrupts using a somewhat lighter-weight mechanism. They solve the overhead problem with specialized processors -- a trend that appears to be increasingly common. But this solution compounds latency, and does not negate my point, which is that existing operating systems have high overhead that discourage frequent invocations.

Several benefits of runtime code generation combine to improve the efficiency of interrupt handing in Synthesis: the use of the high-speed buffering code described in Section 3.3.1, the ability to create interrupt routines that save and restore only the part of the context being used, and the use of layer-collapsing to merge separate functions together.

intr:	move.l	a0,-(sp)	// Save register a0
	move.l	(P),a0		// Get buffer pointer into reg. a0
	move.l (cd_port),(a0,D)	// Store CD data into address P+D
	addq.w #4,(P+2)		// Increment low 16 bits of P.
	beq	cd_done		// ... flush buffer if full
	move.l (sp)+,a0		// Restore register a0
	rte			// Return from interrupt

Figure 3.5: Synthesized Code for Sound Interrupt Processing - CD Active

Figure 3.5 shows the actual Synthesis code created to handle the Sound-IO interrupts when only the CD-player is active. It begins by saving a single register, a0, since that is the only one used. This is followed by the code for the specific sound I/O channels, in this case, the CD-player. The code is similar to the fast buffer described in 3.3.1, synthesized to move data from the CD port directly into the user's buffer. If the other input sources (such as the A-to-D input) also become active, the interrupt routine is re-written, placing their buffer code immediately following the CD-player's. The code ends by restoring the a0 register and returning from interrupt.

s.intr:
	move.l a0,-(sp)		// Save register a0
	tst.b (cd_active)	// Is the CD device active?
	beq cd_no		// ... no, jump
	move.l (cd_buf),a0	// Get CD buffer pointer into reg. a0
	move.l (cd_port),(a0)+	// Store CD data; increment pointer
	move.l a0,(cd_buf)	// Update CD buffer pointer
	subq.l #1,(cd_cnt)	// Decrement buffer count
	beq cd_flush		// ... jump if buffer full
cd_no:
	tst.b (ad_active)	// Is the AD device active?
	beq ad_no		// ... no, jump
	:
	: [handle AD device, similar to CD code]
	:
ad_no:
	tst.b (da_active)	// Is the DA device active?
	beq da_no		// ... no, jump
	:
	: [handle DA device, similar to CD code]
	:
da_no:
	move.l (sp)+,a0		// Restore register a0
	rte			// Return from interrupt

Figure 3.6: Sound Interrupt Processing, Hand-Assembler

Figure 3.6 shows the best I have been able to achieve using hand-written assembly language, without the use of code synthesis. Like the Synthesis version, this uses only a single register, so we save and restore only that one. ³ But without code synthesis, we must include code for all the Sound-IO sources -- CD, AD, and DA -- testing and branching around the parts for the currently inactive channels. In addition, we can no longer use the fast buffer implementation of section 3.3.1 because that requires code synthesis.

³ Most existing systems neglect even this simple optimization. They save and restore all the registers, all the time.

Figure 3.7 shows another version, this one written in C, and invoked by a short assembly-language dispatch routine. It preserves only those registers clobbered by C procedure calls, and is representative of a carefully-written interrupt routine in C.

s_intr:
	movem.l ,-(sp)
	bsr _sound_intr
	movem.l (sp)+,
 	rte

sound_intr()
{
    if(cd_active) {
	*cd_buf++ = *cd_port;
	if(--cd_cnt < 0)
	    cd_flush();
    }
    if(ad_active) {
	...
    }
    if(da_active) {
	...
    }
}

Figure 3.7: Sound Interrupt Processing, C Code

The performance differences are summarized in Table 3.4. Measurements are divided into three groups. The first group, consisting of just a single row, shows the time taken by the hardware to process an interrupt and immediately return from it, without doing anything else. The second group shows the time taken by the various implementations of the interrupt handler when just the CD-player input channel is active. The third group is like the second, but with two active sources: the CD-player and AD channels.

Table 3.4: Processing Time for Sound-IO Interrupts
68030 CPU, 25MHz, 1-wait-state main memory, cold cache
	Time in µS	Speedup
Null Interrupt	2.0	--
CD-in, code-synth	3.7	--
CD-in, assembler	6.0	2.4
CD-in, C	9.7	4.5
CD-in, C & Unix	17.1	8.9
CD+DA, code-synth	5.1	--
CD+DA, assembler	7.7	1.8
CD+DA, C	11.3	3.0
CD+DA, C & Unix	18.9	5.5

Within each group of measurements, there are four rows. The first three rows show the time taken by the code synthesis, hand-assembler, and C implementations of the interrupt handler, in that order. The code fragments measured are those of figures 3.5, 3.6, and 3.7; the C code was compiled on the Sony NEWS workstation using "cc -O". The last row shows the time taken by the C version of the handler, but dispatched the way that Sony Unix does, preserving all the machines registers prior to the call. The left column tells the elapsed execution time, in microseconds. The right column gives the ratio of times between the code synthesis implementation and the others. The null-interrupt time is subtracted before computing the ratio to give a better picture of what the implementation-specific performance increases are.

As can be seen from the table, the performance gains of using code synthesis are impressive. With only one channel active, we get more than twice the performance of handwritten assembly language, almost five times more efficient than well-written C, and very nearly an order of magnitude better than traditional Unix interrupt service. Furthermore, the non-code-synthesis versions of the driver supports only the two-channel, 16-bit linearencoding sound format. Extending support, as Synthesis does, to other sound formats, such as µ-Law, either requires more tests in the sound interrupt handler or an extra level of format conversion code between the application and the sound driver. Either option adds overhead that is not present in the code synthesis version, and would increase the time shown.

With two channels active, the gain is still significant though somewhat less than that for one channel. The reason is that the overhead-reducing optimizations of code synthesis -- collapsing layers and preserving only context that is used -- become less important as the amount of work increases. But other optimizations of code synthesis, such as the fast buffer, continue to be effective and scale with the work load. In the limit, as the number of active channels becomes large, the C and assembly versions perform equally well, and the code synthesis version is about 40% faster.

3.3.4 System Calls

Another use of code synthesis is to minimize the overhead of invoking system calls. In Synthesis the term "system call" is somewhat of a misnomer because the Synthesis system interface is based on procedure calls. A Synthesis system call is really a procedure call that happens to cross the protection boundary between user and kernel. This is important because, as we will see in Chapter 4, each Synthesis service has a set of procedures associated with it that delivers that service. Since the set of services provided is extensible, we need a more general way of invoking them. Combining procedure calls with runtime code generation lets us do this efficiently.

// --- User-level stub procedure ---
proc:
	moveq	#N,d2		// Load procedure index
	trap	#15		// Trap to kernel
	rts			// Return

// --- Dispatch to kernel procedure ---
trap15:
	cmp.w	#MAX,d2		// Check that procedure index is in range
	bhs	bad_call	// ... jump if not
	move.l	(tab$,pc,d2*4),a2 // Get the procedure address
	jsr	(a2)		// Call it
	rte			// Return to user-level

.align 4		// Table of kernel procedure addresses...
tab$:
	dc.l fn0, fn1, fn2, fn3, ..., fnN

Figure 3.8: User-to-Kernel Procedure Call

Figure 3.8 shows how. The generated code consists of two parts: a user part, shown at the top of the figure, and a kernel part, shown at the bottom. The user part loads the procedure index number into the d2 register and executes the trap instruction, switching the processor into kernel mode where it executes the kernel part of the code, beginning at label trap15.

Runtime code generation gives the following advantages: each thread has its own table of vectors for exceptions and interrupts, including trap 15.

Furthermore, by thinking of kernel invocation not as a system call - which conjures up thoughts of heavyweight processing and large overheads - but as a procedure call, many other optimizations become easier to see. For example, ordinary procedures preserve only those registers which they use; kernel procedures can do likewise. Procedure calling conventions also do not require that all the registers be preserved across a call. Often, a number of registers are allowed to be "trashed" by the call, so that simple procedures can execute without preserving anything at all. Kernel procedures can follow this same convention. The fact that kernel procedures are called from user level does not make them special; one merely has to properly address the issues of protection, which is discussed further in Section 3.4.2.

Besides dispatch, we also need to address the problem of how to move data between user space and kernel as efficiently as possible. There are two kinds of moves required: passing procedure arguments and return values, and passing large buffers of data. For passing arguments, the user-level stub procedures are generated to pass as many arguments as possible in the CPU registers, bypassing the expense of accessing the user stack from kernel mode. Return values are likewise passed in registers, and moved elsewhere as needed by the user-level stub procedure. This is similar in idea to using CPU registers for passing short messages in the V system [9].

Passing large data buffers is made efficient using virtual memory tricks. The main idea is: when the kernel is invoked, it has the user address space mapped in. Synthesis reserves part of each address space for the kernel. This part is normally inaccessible from user programs. But when the processor executes the trap and switches into kernel mode, the kernel part of the address space becomes accessible in addition to the user part, and the kernel procedure can easily move data back and forth using the ordinary machine instructions. Prior to beginning such a move, the kernel procedure checks that no pointer refers to locations outside the user's address space - an easy check due to the way the virtual addresses are chosen: a single limit-comparison (two instructions) suffices.

Further optimizations are also possible. Since the user-level stub is a real procedure, it can be in-line substituted into its caller. This can be done lazily -- the stub is written so that each time a call happens, it fetches the return address from the stack and modifies that point in the caller. Since the stubs are small, space expansion is minimal. Besides being effective, this mechanism requires minimal support from the language system to identify potential in-lineable procedure calls.

Another optimization bypasses the kernel procedure dispatcher. There are 16 possible traps on the 68030. Three of these are already used, leaving 13 free for other purposes, such as to directly call heavily-used kernel procedures. If a particular kernel procedure is expected to be used often, an application can invoke the cache procedure call, and Synthesis will allocate an unused trap, set it to call the kernel procedure directly, and re-create the user-level stub to issue this trap. Since this trap directly calls the kernel procedure, there is no longer any need for a limit check or a dispatch table. Pre-assigned traps can also be used to import execution environments. Indeed, the Synthesis equivalent of the Unix concept of "stdin" and "stdout" is implemented with cached kernel procedure calls. Specifically, trap 1 writes to stdout, and trap 2 reads from stdin.

Combining both optimizations results in a kernel procedure call that costs just a little more than a trap instruction. The various costs are summarized in Table 3.5. The middle block of measurements show the cost of various Synthesis null kernel procedure calls: the general-dispatched, non-inlined case; the general-dispatched, with the user-level stub inlined into the application's code; cached-kernel-trap, non-inlined; and cached-kerneltrap, inlined. For comparison, the cost of a null trap and a null procedure call in the C language is shown on the top two lines, and the cost of the trivial getpid system call in Unix and Mach is shown on the bottom two lines.

3.4 Other Issues 3.4.1 Kernel Size

Kernel size inflation is an important concern in Synthesis due to the potential redundancy in the many Fsmall and F flat programs generated by the same F^create. This could be particularly bad if layer collapsing were used too enthusiastically. To limit memory use, F ^create can generate either in-line code or subroutine calls to shared code. The decision of when to expand in-line is made by the programmer writing F^create. Full, memory-hungry in-line expansion is usually reserved for specific uses where its benefits are greatest: the performance-critical, frequently-executed paths of a function, where the performance gains justify increased memory use. Less frequently executed parts of a function are stored in a common area, shared by all instances through subroutine calls.

Table 3.5: Cost of Null System Call
68030 CPU, 25MHz, 1-wait-state main memory
	µS, cold cache	µS, warm cache
C procedure call	1.2	1.0
Null trap	1.9	1.6
Kernel call, general dispatch	4.2	3.5
Kernel call, general, in-lined	3.5	2.9
Kernel call, cached-trap	3.5	2.7
Kernel call, cached and in-lined	2.7	2.1
Unix, getpid	40	--
Mach, getpid	88	--

In-line expansion does not always cost memory. If a function is small enough, expanding it in-line can take the same or less space than calling it. Examples of functions that are small enough include character-string comparisons and buffer-copy. For functions with many runtime-invariant parameters, the size expansion of inlining is offset by a size decrease that comes from not having to pass as many parameters.

In practice, the actual memory needs are modest. Table 3.6 shows the total memory used by a full kernel -- including I/O buffers, virtual memory, network support, and a window system with two memory-resident fonts.

Table 3.6: Kernel Memory Requirements
System Activity	Memory Use, as code + data (Kbytes)
Boot image for full kernel	140
One thread running	Boot + 0.4 + 8
File system and disk buffers	Boot + 6 + 400
100 threads, 300 open files	Boot + 80 + 1400

3.4.2 Protecting Synthesized Code

The classic solutions used by other systems to protect their kernels from unauthorized tampering by user-level applications also work in the presence of synthesized code. Like many other systems, Synthesis needs at least two hardware-supported protection domains: a privileged mode that allows access to all the machine's resources, and a restricted mode that lets ordinary calculations happen but restricts access to resources. The privileged mode is called supervisor mode, and the restricted mode, user mode.

Kernel data and code - both synthesized and not - are protected using memory management to make the kernel part of each address space inaccessible to user-level programs. Synthesized routines run in supervisor mode, so they can perform privileged operations such as accessing protected buffer pages.

User-level programs enter supervisor mode using the trap instruction. This instruction provides a controlled - and the only - way for user-level programs to enter supervisor mode. The synthesized routine implementing the desired system service is accessed through a jump table in the protected area of the address space. The user program specifies an index into this table, ensuring the synthesized routines are always entered at the proper entry points. This protection mechanism is similar to Hydra's use of C-lists to prevent the forgery of capabilities [34].

Once in kernel mode, the synthesized code handling the requested service can begin to do its job. Further protection is unnecessary because, by design, the kernel code generator only creates code that touches data the application is allowed to touch. For example, were a file inaccessible, its read procedure would never have been generated. Just before returning control to the caller, the synthesized code reverts to the previous (user-level) mode.

There is still the question of invalidating the code when the operation it performs is no longer valid -- for example, invalidating the read procedure after a files has been closed. Currently, this is done by setting the corresponding function pointers in the KPT to an invalid address, preventing further calls to the function. The function's reference counter is then decremented, and its memory freed when the count reaches zero.

3.4.3 Non-coherent Instruction Cache

A common assumption in the design of processors is that a program's instructions will not change as the program runs. For that reason, most processor's instruction caches are not coherent - writes to memory are not reflected in the cache. Runtime code generation violates this assumption, requiring that the instruction cache be flushed whenever code changes happen. Too much cache flushing reduces performance, both because programs execute slower when the needed instructions are not in cache and because flushing itself may be an expensive operation.

The performance of self-modifying code, like that found in executable data structures, suffers the most from an incoherent instruction cache. This is because the ratio of code modification to use tends to be high. Ideally, we would like to flush with cache-line granularity to avoid losing good entries. Some caches provide only an all-or-nothing flush. But even line-at-a-time granularity has its disadvantages: it needs machine registers to hold the parameters, registers that may not be available during interrupt service without incurring the cost of saving and restoring them. Unfortunately for Synthesis, most cases of self-modifying code actually occur inside interrupt service routines where small amounts of data (e.g., one character for a TTY line) must be processed with minimal overhead. Fortunately, in all important cases the cost has been reduced to zero through careful layout of the code in memory using knowledge of the 68030 cache architecture to cause the subsequent instruction fetch to replace the cache line that needs flushing. But that trick is neither general nor portable.

For the vast majority of code synthesis applications, an incoherent cache is not a big problem. The cost of flushing even a large cache contributes relatively little compared to the cost of allocating memory and creating the code. If code generation happens infrequently relative to the code's use, as is usually the case, the performance hit is small.

Besides the performance hit from a cold cache, cache flushing itself may be slow. On the 68030 processor, for example, the instruction to flush the cache is privileged. Although this causes no special problems for the Synthesis kernel, it does force user-level programs that modify code to make a system call to flush the cache. I do not see any protectionrelated reason why that instruction must be privileged; perhaps making it so simplified processor design.

3.5 Summary

This chapter showed:

that code synthesis allows important operating system functions such as buffering, context switching, interrupt handing, and system call dispatch to be implemented 1.4 to 2.4 times more efficiently than is possible using the best assemblylanguage implementation without code synthesis and 1.5 to 5 times better than well-written C code;
that code synthesis is also effective at the user-level, achieving an 80% improvement for basic operations such as putchar; and
that the anticipated size penalty does not, in fact, happen.

Before leaving this section, I want to call a moment's more attention to the interrupt handlers of Section 3.3.3. At first glance - and even on the second and third - the C-language code it looks to be as minimal as it can get. There does not seem to be any fat to cut. Table 3.4 has shown otherwise. The point is that sometimes, sources of overhead are hidden, not so easy to spot and optimize. They are a result of assumptions made and the programming language used, whether it be in the form of a common calling convention for procedures, or in conventions followed to simplify linking routines to interrupts. This section has shown that code synthesis is an important technique that enables general procedure interfacing while preserving -- and often bettering - the efficiency of custom-crafted code.

The next chapter now shows how Synthesis is structured and how synergy between kernel code synthesis and good software engineering leads to a system that is general and easily expandable, but at the same time efficient.