Chumby PhotoFrame 2

I’ve hacked up the Chumby PhotoFrame source from by Wansti.

Fixed a few bugs and created a USB image to enable http://localhost/image/ browsing on the Chumby.

I’m working on an CGI to allow uploading of images directly to the Chumby.

Please give me feedback.


This image contains lighttpd and ruby.
Images can be uploaded directly using http://ip.of.your.chumby/image/upload.cgi.

Chumby – A Hacker’s Microcosm

I got a Chumby as a holiday gift; it’s really cool and really hackable.

The good:

  • Hardware and software is open-source.
  • Runs Linux.
  • Very easy to hack: SSH, BusyBox, lighttpd, Ruby, etc.
  • Two USB ports.
  • Audio out (internal speakers sound amazingly good for their size).
  • Built-in microphone (haven’t figured out how to use it).
  • Can play FLAC files from USB drive.

The bad:

  • It’s very difficult to get it to show pictures off a USB drive. The only way apparently is to install a local HTTP server and use a widget:
  • Flash widgets/apps only — Flash is not open-source.
  • Switching between widgets and channels is fiddly.
  • One button is not enough considering widgets take up the whole screen modally.
  • It needs more RAM. 64MB is just barely enough.

Below is the “Virtual Chumby” showing my “Clocks” channel:

Overall it’s a very cool device. Right now it’s in the bedroom but it’s too useful to sit there as an alarm clock.

At some point I’d like to try to interface a small webcam to it.

Luxeed LED Keyboard Driver for Linux

Hardware Review

This is a really cool keyboard, except that the ESC and function keys do not light up and are strangely located and difficult to press. I also don’t understand why the spacebar does not light up; seems like it’s the most important key. The construction is not very good and the feel is a bit mushy. However, this keyboard is fun for hacking and could be really great for user training.

Device Driver and Source Code

After much troubleshooting, I have created a simple C driver library for the Luxeed deTA 100/200 keyboard using libusb. This driver allows the LEDs under each key to be set from C from most Unix-like operating systems. This is my first attempt at reverse-engineering the protocol of a USB device.

Source is located here:


There is a user-space single-threaded, non-blocking I/O daemon using libevent that accepts multiple TCP connections on port 12345 to execute pixel update commands. I’ve been able to get time between frames down to 0.025 seconds.


Commands sent from clients via TCP to the server:

set <RED> <GREEN> <BLUE> <KEY>...
wait <SECONDS>
get <KEY>

Keys can be specified using names (e.g.: “TAB”, “LSHIFT”, “A”), key IDs (e.g. “#20” same as “Y”) or positions relative to the top-left corner (e.g.: “2@1” same as “W”). The “get” command returns the current color of a key. The key colors are atomically uploaded to the keyboard with “update” command.

Building and Running the Server

 > cd luxeed/src
 > make install-prereqs
 > make
 > sudo ./luxeed --server & # start the server

Basic Commands

Make the number keys light up blue:

 > netcat localhost 12345
set 00 00 ff 1 2 3 4 5 6 7 8 9 0 

Simple client that flashes keys:

 > sudo ./luxeed --server & # start the server
 > cat flash_keys 

c="ff ff ff"
c2="00 00 00"
while [ $# -gt 0 ]
  case "$1"
      t="$2"; shift 2
      c="$2"; shift 2
      c2="$2"; shift 2

while true
  netcat localhost 12345 <<END
set $c $keys
wait $t
set $c2 $keys
wait $t
 > ./flash_keys ENTER # flash the ENTER key

Since the server can service multiple clients, we can flash multiple keys at different rates and colors using multiple clients:

  > ./flash_keys -t 0.2 -c 'ff 00 00' TAB &
  > ./flash_keys ENTER &

Maybe it’s time for a demo video.

Things left:

  • Pausing updates to the keyboard for a short time period after any keypress would allow animations but not let them interfere with the typist. The keyboard I/O channel is too slow to handle typing and animations (~10 frame/sec) at the same time.
  • A bulk update command that maps a PNM image representing the keyboard would be helpful.
  • udev rules to prevent the need for running the server as root.
  • Debian packaging.

There are some I/O throttling issues; sometimes the keyboard disconnects (or requires a power cycle) if data is sent to it too quickly. The only message I could decipher is a bulk keyboard update message of 5 65-byte blocks with an embedded checksum. Maybe there is a smaller differential update payload that will only update specific keys which would be more useful for faster animation. There may be a response message that needs to be read.

If you have this keyboard, give it a try!

Ruby: Performance of Symbol Construction

Measurements of Symbol Constructor Expressions.


ruby 1.8.6 (2008-08-08 patchlevel 286) [i686-linux]:

> /cnu/bin/ruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.740000   0.000000   0.740000 (  0.742914)
'foo_bar'                  1.670000   0.000000   1.670000 (  1.661374)
"foo_bar"                  1.620000   0.000000   1.620000 (  1.625221)
:foo_bar                   0.890000   0.000000   0.890000 (  0.886903)
:'foo_bar'                 0.880000   0.000000   0.880000 (  0.878555)
:"foo_bar"                 0.860000   0.000000   0.860000 (  0.867110)
"foo_bar".to_sym           2.830000   0.000000   2.830000 (  2.830536)
str.to_sym                 2.050000   0.000000   2.050000 (  2.052756)
:"{str}"                   0.880000   0.000000   0.880000 (  0.881367)
:"foo_#{'bar'}"            0.860000   0.000000   0.860000 (  0.854944)
"foo_#{'bar'}".to_sym      2.880000   0.000000   2.880000 (  2.942499)
:"foo_#{bar}"              4.280000   0.000000   4.280000 (  4.290880)
"foo_#{bar}".to_sym        4.930000   0.000000   4.930000 (  4.929801)

ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]:

> /usr/bin/ruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.780000   0.000000   0.780000 (  0.782060)
'foo_bar'                  3.480000   0.770000   4.250000 (  4.263535)
"foo_bar"                  3.400000   0.820000   4.220000 (  4.224401)
:foo_bar                   2.280000   0.950000   3.230000 (  3.223590)
:'foo_bar'                 2.370000   0.860000   3.230000 (  3.249133)
:"foo_bar"                 2.310000   0.940000   3.250000 (  3.283796)
"foo_bar".to_sym           5.010000   0.980000   5.990000 (  6.029690)
str.to_sym                 3.740000   0.930000   4.670000 (  4.706827)
:"{str}"                   2.440000   0.810000   3.250000 (  3.254253)
:"foo_#{'bar'}"            2.470000   0.780000   3.250000 (  3.245739)
"foo_#{'bar'}".to_sym      4.800000   0.970000   5.770000 (  5.767003)
:"foo_#{bar}"              7.490000   0.990000   8.480000 (  8.506735)
"foo_#{bar}".to_sym        8.640000   0.890000   9.530000 (  9.532481)

ruby 1.9.0 (2008-11-25 revision 20347) [i686-linux]:

> ~/local/ruby/trunk/bin/ruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.690000   0.000000   0.690000 (  0.695087)
'foo_bar'                  1.760000   0.000000   1.760000 (  1.772417)
"foo_bar"                  1.780000   0.000000   1.780000 (  1.772031)
:foo_bar                   0.720000   0.000000   0.720000 (  0.764095)
:'foo_bar'                 0.710000   0.000000   0.710000 (  0.710952)
:"foo_bar"                 0.720000   0.000000   0.720000 (  0.717041)
"foo_bar".to_sym           3.370000   0.000000   3.370000 (  3.371836)
str.to_sym                 2.340000   0.010000   2.350000 (  2.347026)
:"{str}"                   0.720000   0.000000   0.720000 (  0.723032)
:"foo_#{'bar'}"            0.710000   0.000000   0.710000 (  0.712813)
"foo_#{'bar'}".to_sym      3.370000   0.020000   3.390000 (  3.399618)
:"foo_#{bar}"              5.030000   0.000000   5.030000 (  5.037542)
"foo_#{bar}".to_sym        5.060000   0.010000   5.070000 (  5.112535)

rubinius 0.10.0 (ruby 1.8.6 compatible) (c458077eb) (12/31/2009) [i686-pc-linux-gnu]:

> ~/local/ruby/rubininus/bin/rbx symbol_benchmark.rb
                               user     system      total        real
Null Test                  2.369585   0.000000   2.369585 (  2.369602)
'foo_bar'                  3.616918   0.000000   3.616918 (  3.616949)
"foo_bar"                  3.672719   0.000000   3.672719 (  3.672752)
:foo_bar                   2.407764   0.000000   2.407764 (  2.407796)
:'foo_bar'                 2.462789   0.000000   2.462789 (  2.462824)
:"foo_bar"                 2.401784   0.000000   2.401784 (  2.401812)
"foo_bar".to_sym           7.127442   0.000000   7.127442 (  7.127472)
str.to_sym                15.128944   0.000000  15.128944 ( 15.128974)
:"{str}"                   2.411760   0.000000   2.411760 (  2.411789)
:"foo_#{'bar'}"            2.437595   0.000000   2.437595 (  2.437629)
"foo_#{'bar'}".to_sym      7.120185   0.000000   7.120185 (  7.120209)
:"foo_#{bar}"             20.144837   0.000000  20.144837 ( 20.144864)
"foo_#{bar}".to_sym       20.222530   0.000000  20.222530 ( 20.222891)

jruby 1.1.5 (ruby 1.8.6 patchlevel 114) (2008-11-03 rev 7996) [i386-java]:

> ~/local/ruby/jruby-1.1.5/bin/jruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.936000   0.000000   0.936000 (  0.936172)
'foo_bar'                  1.886000   0.000000   1.886000 (  1.886079)
"foo_bar"                  1.880000   0.000000   1.880000 (  1.880344)
:foo_bar                   1.379000   0.000000   1.379000 (  1.379957)
:'foo_bar'                 5.842000   0.000000   5.842000 (  5.842242)
:"foo_bar"                 5.834000   0.000000   5.834000 (  5.833263)
"foo_bar".to_sym           2.597000   0.000000   2.597000 (  2.597294)
str.to_sym                 2.306000   0.000000   2.306000 (  2.306082)
:"{str}"                   5.655000   0.000000   5.655000 (  5.655672)
:"foo_#{'bar'}"            5.785000   0.000000   5.785000 (  5.785285)
"foo_#{'bar'}".to_sym      2.585000   0.000000   2.585000 (  2.584877)
:"foo_#{bar}"              7.532000   0.000000   7.532000 (  7.531846)
"foo_#{bar}".to_sym        8.361000   0.000000   8.361000 (  8.360630)


require 'benchmark'

def bm_expr(x, expr, title = nil)
  e = <<"END"
  n = 10_000_000
  str = 'foo_bar'
  bar = 'bar'
  #begin; GC.enable; GC.start; GC.disable; rescue nil; end
  begin; GC.start; rescue nil; end{(title || expr).inspect}) {
    n.times { 
  eval e

end { | x |
  bm_expr x, "", 'Null Test' 
  bm_expr x, "'foo_bar'" 
  bm_expr x, '"foo_bar"' 
  bm_expr x, ':foo_bar' 
  bm_expr x, ":'foo_bar'" 
  bm_expr x, ':"foo_bar"' 
  bm_expr x, '"foo_bar".to_sym' 
  bm_expr x, 'str.to_sym' 
  bm_expr x, ':"{str}"' 
  bm_expr x, ':"foo_#{\'bar\'}"'
  bm_expr x, '"foo_#{\'bar\'}".to_sym'
  bm_expr x, ':"foo_#{bar}"'
  bm_expr x, '"foo_#{bar}".to_sym'

First time I ran the benchmarks without forcing and disabling GC :'foo_bar' appeared faster than :foo_bar. Reminder to self: always disable GC around benchmarks, unless you are benchmarking GC.

The Unreasonable Effectiveness of Computing

In 1960, the physicist Eugene Wigner published an article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences, arguing that the way in which the mathematical structure of a physical theory often points the way to further advances in that theory and even to empirical predictions, is not a coincidence but must reflect some larger and deeper truth about both mathematics and physics.

Is computing the deeper “truth” about mathematics and physics and reality in general? Is the true nature of reality computational? — The universe as cellular automation operating on simple rules at very small discrete scales giving rise to emergent properties at larger scales.

Front cover of Wired issue 16.07 – “The End of Science” contends that “The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data.”

John Horgan’s 1996 book – The End of Science: Facing the Limit of Knowledge in the Twilight of the Scientific Age covers perception of the slowing down of science: “Has all the knowledge worth pursuing become known?”

Scientific American: The Self-Organizing Quantum Universe describes a discrete, self-organizing, fractal, computational model of space-time: “Causal Dynamical Triangulations.” At smaller scales the model is discrete, at larger scales classical and quantum properties of space-time emerge.

Is a computational theory of reality the future of science? Are we at a threshold of science where theory is irrelevant or at least unverifiable without computing? Does the increasing use of computational modeling of the physical world point towards the inevitability of a “computational theory of everything”?

Is computing the new “Queen of the Sciences”?

Parameterized Word Tagging in Latently-Typed Languages


I have been interested in optimal low-bit tag schemes on machine words in latently-typed languages. There are different schools of thought on how to handle the trade-offs between Fixnums (integers smaller than word size) and allocated objects.

Ian Piumarta’s libid object library is a very simple prototype-based object system with unresolved method delegation. It is used in the Pepsi/Coke COLA system (available at as the basis for all object representations and meta-behaviors. libid assumes that a reference to an object’s method table, or “vtable” resides at the word before the object’s address.

It was chosen as a test-bed for testing tagging scheme’s because the Pepsi/Coke system is built using itself. Libid was changed to be parametric with respect to word tags. This was done by abstracting references to NULL oop pointers and object slot accesses with macros to handle the tagging and untagging of Fixnum and allocated objects pointers. At a later point the idc compiler might be also changed to be tag parametric.

Scheme: TAG(Fixnum) == 1

This is the default for libid. The default tagging scheme was parameterized where Fixnums are tagged with a 1 in the least-significant bit and where 0 is a special case for a Nil object.

  0...000000: Nil
  x...xxxxx0: allocated objects
  x...xxxxx1: Fixnums


  • 0 is assumed to be the Nil object
    • NULL pointer checks required when determining its vtable.
    • Object pointers are the same boxed and unboxed
    • LSB is 0
    • vtable is located at object pointer word – 4.
    • Fixnums have 1 in lowest-bit.
    • Fixnum vtable is stored in global variable.
    • Fixnum + and - operations require down-shifts.
    • Fixnum * and / operations require down-shifts.
  • Tag needs to be added after up-shifts.

Scheme: TAG(Fixnum) == 0

This scheme should perform better than TAG(Fixnum) == 1, because Fixnum arithmetic operations are less expensive and since most object slot accesses are non-zero pointer offsets in both schemes.

  x...xxxxx0: Fixnums
  x...xxxxx1: allocated objects & Nil


  • NULL pointers are avoided by never storing them
    • uses a statically allocated Nil object with a valid Nil vtable.
    • vtable(nil) is handled by the generic vtable(object).
    • Object pointers are offset by 1
    • vtable at object pointer word – 5.
    • Tag removal is optimized away into all slot offsets.
    • Fixnums have 0 in lowest-bit.
    • + and - operations require no shifts.
  • * and / operations require shifting of only one argument.


Executes 10,000,000 message sends and +(Fixnum, Fixnum) operations with no Fixnum overflow tests using GCC 4.2.3 on ix86 with the following compile-time options:

-Wall -Wreturn-type -Werror -D_GNU_SOURCE=1 \
-Wno-unused-value -w \
-march=prescott -fomit-frame-pointer \
-falign-functions=16 -funroll-loops

GCC optimizations were also varied.

This is a naive test of raw message and Fixnum addition speed. More substantial tests will be presented at a later date.

Test results

Intel(R) Pentium(R) M processor 1.20GHz

                      n=10                              EXE
                      AVG(s) NIL=0 INT_TAG OBJ_TAG GCC SIZE
                      ====== ===== ======= ======= === ====
./benchmark_1_1_0_3   7.687  Y     1       0       -O3 42774
./benchmark_0_0_1_1   8.002  N     0       1       -O1 34780
./benchmark_0_0_1_2   8.111  N     0       1       -O2 35100
./benchmark_0_0_1_3   8.113  N     0       1       -O3 38500
./benchmark_1_1_0_1   8.593  Y     1       0       -O1 35445
./benchmark_1_1_0_2   9.05   Y     1       0       -02 35365
Intel(R) Core(TM)2 Duo CPU     T7100  @ 1.80GHz

                      n=10                              EXE
                      AVG(s) NIL=0 INT_TAG OBJ_TAG GCC SIZE
                      ====== ===== ======= ======= === ====
./benchmark_1_1_0_3   1.794  Y     1       0       -O3 43912
./benchmark_0_0_1_3   1.83   N     0       1       -O3 38870
./benchmark_0_0_1_2   1.874  N     0       1       -O2 36222
./benchmark_0_0_1_1   1.901  N     0       1       -O1 35966
./benchmark_1_1_0_2   1.906  Y     1       0       -O2 36503
./benchmark_1_1_0_1   2.007  Y     1       0       -O1 36711


  • TAG(Fixnum)==0 performs better with -O1 optimization.
  • TAG(Fixnum)==1 performs better with -O3 optimization on older processors.
  • Modern processors have deeper pipelines.


The results are surprising. I’m still investigating. I suspect:

  • Processors with deeper pipelines can execute more instructions concurrently.
  • The loss of NULL==Nil==0 causes Nil checks to be more expensive.
  • Memory constrained devices may perform better with TAG(Fixnum)==0,-O1.
  • Non-aggressive code generators (idc and many JIT compilers) may perform better with TAG(Fixnum)==0,-O1
  • -O3 optimization is moving code in very mysterious ways.
  • Costs due to pointer offsets with TAG(object)==1 because libid has already incremented object pointers ahead of the usual place for a vtable or “type” slot at offset 0.

The first slot to be accessed is at offset 0 from the object pointer. First slots are generally hot. Having to add an additional negative offset may have an additional cost that would be present if a “type” slot at offset 0 but is not present when TAG(object)==0 and object pointers are ahead of the vtable slot: offset of 0 costs nothing, any non-zero offset has a cost.


struct Cons_allocated {
  oop vtable; /* offset -4 */
  struct Cons {
    oop car;  /* offset 0 */
    oop cdr;  /* offset 4 */
  } cons;
} cons_allocated;
oop cons_ref = &cons_allocated.cons;


struct Cons {
  oop vtable; /* offset 0 */
  oop car;    /* offset 4 */
  oop cdr;    /* offset 8 */
oop cons_ref = &cons;

I will post a link to code shortly.


TM : The implementation of a real-time, single-threaded, type-segmented, conservative garbage collector


TM is a real-time, non-relocating, conservative, allocator using type- and color-segregated lists, based on Henry Baker’s “Treadmill” allocator. The source code for TM is located at

TM interleaves marking, scanning and sweeping during each mutator’s call to tm_alloc().

TM attempts to limit the amount of work in the collector to avoid stopping the world for long periods of time. It does not require a separate thread for the collector as all collection activities are interleaved with allocation.

Full documentation is located at:

Data Structures

An allocation unit is a tm_node. Each tm_node is allocated from a block of system memory (tm_block) sized and aligned to a multiple of the operating system’s page size. Each tm_block is dedicated to a particular type (tm_type). Each tm_type represents a tm_node allocation size, thus each tm_block is uniformly parceled into tm_nodes of the same size. In general, each power-of-two size has a tm_type. However, tm_types can be created for any size.

Node Coloring

Each tm_node has a “color” tag describing its current allocation state:

  1. tm_WHITE: free, unused.
  2. tm_ECRU: allocated, potentially unused.
  3. tm_GREY: allocated, marked used, not scanned for pointers.
  4. tm_BLACK: allocated, marked used, scanned for pointers.

Each tm_type has it’s own tm_WHITE, tm_ECRU, tm_GREY, and tm_BLACK doubly-linked lists. Each tm_node is a member of one, and only one, of its type’s color lists. To keep the tm_node header small, the node’s color is encoded in the lower two bits of the tm_node._prev pointer.

When a tm_node’s color changes, it is moved to a different colored list in its tm_type for processing in a different allocation phase.

Node types can be explicitly created for common sizes that are not a power of two, to reduce memory fragmentation.

Allocation Phases

The allocator interleaves the following phases with allocation:

  1. tm_UNMARK: Unmarking scanned blocks for subsequent rooting. (tm_BLACK -> tm_ECRU)
  2. tm_ROOT: Root scanning and marking nodes for scanning. (tm_ECRU -> tm_GREY)
  3. tm_SCAN: Scanning marked nodes for interior pointers. (tm_GREY -> tm_BLACK)
  4. tm_SWEEP: Freeing unmarked nodes. (tm_ECRU -> tm_WHITE)

During each call to tm_alloc(), each phase will do some collection work before returning a newly allocated tm_node:

  1. tm_UNMARK: Allocate tm_nodes from the tm_type’s tm_WHITE list until it is empty. A fixed number of tm_BLACK nodes are returned to the tm_ECRU list. Memory pressure may cause switching to the tm_ROOT phase.
  2. tm_ROOT: A fixed number of roots are scanned for pointers into allocated space. tm_ECRU nodes with valid references are moved to tm_GREY. After all roots are scanned, the next phase is tm_SCAN.
  3. tm_SCAN: A fixed number of node bytes of tm_GREY nodes are scanned for interior pointers and moved to tm_BLACK list. Once all tm_GREY nodes are scanned, the next phase is tm_SWEEP.
  4. tm_SWEEP: If there are no more nodes to be scanned, sweep a fixed number of tm_ECRU nodes to the tm_WHITE list. Once all tm_ECRU nodes are sweeped, the collector returns to the tm_UNMARK phase.

As nodes are allocated, marked, scanned, sweeped and unmarked they are moved between the colored lists as follows:

Node States

In general, newly allocated tm_nodes are taken from the tm_type’s tm_WHITE list and placed on its tm_ECRU list. If the type’s tm_WHITE list is empty, an allocation tm_block is requested from the operating system and is scheduled for parceling new tm_WHITE nodes for the type. If the allocation tm_block becomes empty, the process is repeated: another allocation block is requested and parceled.

A limited number of new free nodes are parceled from the type’s current allocation block as needed and are initialized as new tm_WHITE nodes.

New tm_blocks may be requested from the operating system during all phases, if the type’s tm_WHITE list or allocation block is empty. The reasoning is the operating system should be able to allocate a new allocation block faster than a collection that would need to completely “stop the world”.

All phases, except the tm_SWEEP phase, allocate nodes by moving them from the tm_WHITE list to the tm_ECRU list. The tm_SWEEP phase allocated nodes from tm_WHITE to tm_GREY, because tm_ECRU nodes are considered to be unmarked garbage during the tm_SWEEP phase and tm_BLACK nodes are considered “in-use’ but are not scanned for interior pointers during the tm_SWEEP phase. Using tm_GREY during tm_SWEEP is also related to tm_BLACK node mutation, which gives rise to requiring a write barrier on tm_BLACK nodes.

The tm_SWEEP phase will always attempt to scan any tm_GREY nodes before continuing to sweep any tm_ECRU nodes.

The tm_BLACK color might seem a better choice for tm_SWEEP allocations, but this would force young nodes, which are more likely to be garbage, to be kept until the next tm_SWEEP phase. Coloring tm_SWEEP allocations tm_BLACK would also prevent any new interior pointers stored in nodes that may reference tm_ECRU nodes from being scanned before tm_SWEEP is complete.

To prevent thrashing the operating system with tm_block allocation and free requests, a limited number of unused blocks are kept on a global free list.

Aligned Blocks

Aligned blocks allow determining if generic word is a pointer to heap-allocated tm_node to be computed with simple pointer arithmetic and a bit vector check.

To determine a pointer’s allocation:

  1. Checking a bit vector indexed by the pointer’s page number. The bit is set when tm_nodes are parceled and cleared when the entire tm_block is unused and returned to the free block list or operating system.
  2. Mask off the insignificant page bits to construct an aligned tm_block address.
  3. Determine the size of tm_nodes in the tm_block from the block’s tm_type.
  4. Determine if the pointer resides in the data portion of the tm_node by considering the tm_node and tm_block linked-list headers to be “holes” in the address space.

Type Segregation

Nodes are segregated by type. Each type has a size. By default, type sizes are rounded up to the next power of two. A specific allocation type can be requested with tm_adesc_for_size(). The allocation descriptor can then be used by tm_alloc_desc(). The opaque element can be used to store additional mutator data.

Each node type has its own colored lists, allocated block lists and accounting. Segregating node types allows allocation requests to be done without scanning tm_WHITE nodes for best fit. However, since types and the blocks are segregated, and nodes of a larger size are not scavenged for smaller sise, this could least to poor actual memory utilization in mutators with small numbers of allocations for many sizes, since a single node allocation for a given size will cause at least one block to requested from the operating system.

The color lists are logically combined from all type for iteration using nested type and node iterators.

Write Barriers

During the tm_SCAN and tm_SWEEP phase, any tm_BLACK node that is mutated must be rescanned due to the possible introduction of new references from the tm_BLACK node to tm_ECRU nodes. This is achieved by calling tm_write_barrier(&R) in the mutator after modifying R’s contents.

There are two versions of the write barrier:

  1. tm_write_barrier(R)
  2. tm_write_barrier_pure(R)

tm_write_barrier_pure(R) can be called when R is guaranteed to be a pointer to the head of a node allocated by tm_alloc(). tm_write_barrier_pure(R) cannot be used if R might be an interior pointer or a pointer to a stack or root-allocated object. tm_write_barrier(R) should be used if the address of R is not known to be a pointer to the heap, the stack or global roots.
If R is a pointing into global roots, tm_write_barrier(R) will cause global root rescanning, if the collector is in the tm_SCAN phase.

Stack writes are not barriered, because stack scanning occurs atomically at the end of tm_ROOT.

Unfriendly Mutators

When entering code where the write barrier protocol is not followed, tm_disable_write_barrier() can be called to put the collector into a “stop-world” collection mode until tm_enable_write_barrier() is called.

A virtual memory write-barrier based on mprotect() might be easier to manage than requiring the mutator to call the write barrier. Recoloring and marking root set pages can be done in hardware assuming the overhead of mprotect() and the SIGSEGV signal handler is low when changing phases and colors.


  • TM is not currently thread-safe.
  • TM does not currently support allocations larger than a tm_block. This will be fixed by using another page-indexed bit vector. A “block-header-in-page” bit vector marks the page of each tm_block header. This bit vector will be scanned backwards to locate the first page that contains the allocation’s block header.
  • TM does not currently support requests for page-aligned allocations. This could be achieved by using a hash table to map page-aligned allocations to its tm_block.


  1. The Treadmill: Real-Time Garbage Collection Without Motion Sickness, Henry G. Baker,

Ruby Internals: Why RUBY_FIXNUM_FLAG should be 0x00

Type tags in MRI Ruby VALUE

Internally, values in MRI Ruby are 32-bit (at least for 32-bit processors). Some of the least-significant bits are used to store type information. See the VALUE definition in include/ruby/ruby.h. Using type tag bits avoids allocating additional memory for commonly-used immutable values, like integers.

Ruby uses a single-bit tag of 0x01 as the Fixnum type tag. The remaining bits, are used to store the Fixnum’s signed value. This is an immediate value; it doesn’t require storage for the Fixnum value to be allocated, unlike the heap space that would be required for a String. Other types will use different tag bits.

Since Ruby uses 31 bits to store the Fixnum’s value, all the other types use 0 for least-significant bit; Ruby uses dynamic length tags:

 3         2         1
10987654321098765432109876543210 bit index 
sxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1 Fixnum
00000000000000000000000000000000 false
00000000000000000000000000000010 true
00000000000000000000000000000100 nil
00000000000000000000000000000110 undef
xxxxxxxxxxxxxxxxxxxxxxxx00001110 Symbol
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx0x0 All other types

A non-zero Fixnum type tag requires that the tag must be masked or shifted away on Fixnum operands before numeric operations. In a running program, Fixnum is probably the most common numeric type and quite possibly the most common type of all; it makes sense to make the most common operations on Fixnum: +, -, *, / as fast and as simple as possible. It’s also likely that addition is the most common operation applied to Fixnum.

Imagine the following bit of Ruby code:

x = 1
y = 3
puts x + y

Internally, INT2FIX(X) is used to create a Ruby value from a C int X:

#define INT2FIX(X) (((X) << 1) | RUBY_FIXNUM_FLAG)
#define FIX2INT(X) ((X) >> 1)


INT2FIX(1) => 3
INT2FIX(3) => 7

FIX2INT(X) shifts one bit down which removes the tag and returns the original integer.

To compute x + y, Ruby internally, in numeric.c: fix_plus(), does the following:

x + y                             =>
INT2FIX(FIX2INT(x) + FIX2INT(y))  =>
((3 >> 1) + (7 >> 1)) << 1 | 1    =>

FIX2INT(9)                        => 
(9 >> 1)                          =>

If a type-tag of 0x00 was used for Fixnums, there is no tag to remove and addition or subtraction on Fixnums requires no tag manipulations. Assume:

#define INT2FIX(X) ((X) << 1)
#define FIX2INT(X) ((X) >> 1)

The Ruby expression x + y would simply be x + y in C code, assuming no underflow or overflow into Bignum.

Multiplication with zero Fixnum tags is very simple: only one of the operands needs to be shifted down:

#define FIXMUL(X, Y)((X) >> 1) * (Y))

Fixnum division: the result of the division is shifted up:

#define FIXDIV(X, Y) (((X) / (Y)) << 1)

Two-bit type tags on 32-bit architectures

Oaklisp and LL ( use 2-bit tags for all values. LL uses the following tag scheme:

 3         2         1
10987654321098765432109876543210 bit index 
sxxxxxxxxxxxxxxxxxxxxxxxxxxxxx00 <fixnum>
pppppppppppppppppppppppppppppp01 <locative>
seeeeeeeemmmmmmmmmmmmmmmmmmmmm10 <flonum>
pppppppppppppppppppppppppppppp11 All other types

Floating-point (<flonum>) values are C floats which sacrifice the 2 least-significant mantissa bits for the type tag. Locatives are safe pointers to other values. Oaklisp stores floating-point values as allocated objects and uses a tag for other common immediate values: characters, etc.

The rationale for choosing a fixed-size lower 2-bit type tag, opposed to a dynamic-length type tag, as in Ruby, or high-bit tags, like some older Lisp implementations, is as follows:

C compilers and dynamic memory allocators will align allocations to word boundaries for performance reasons, so there cannot not be a pointer to an object that would require some of the lower bits of a pointer, except for sub-word access, e.g.: char *. 32-bit words are 4 bytes long; the lower 2 bits of any object pointer will always be zero, and are free to be used for type tagging.

If references to allocated objects are encoded using a 0x03 type tag, tag removal could be:

  #define RBASIC(X) ((struct RBasic*)((X) - 3))

Assuming that most of the time the interpreter is referencing structure members of the object, and does not need the actual address of the object:

  struct RBasic {
      VALUE flags; /* struct offset: + 0 */
      VALUE klass; /* struct offset: + 4 */

  RBASIC(X)->klass =>
  ((struct RBasic*)((X) - 3))->klass

C compilers convert the pointer->member expression into an offset from an address. For 32-bit VALUEs:

  RBASIC(X)->flags => *(VALUE*)((X) - 3 + 0)
  RBASIC(X)->klass => *(VALUE*)((X) - 3 + 4)

Using subtraction as the tag removal operation, instead of (X & ~3),
allows the C compiler to constant fold the tag removal and the structure offset:

  RBASIC(X)->flags => *(VALUE*)((X) - 3)
  RBASIC(X)->klass => *(VALUE*)((X) + 1)

Therefore, there is no additional tag removal cost to reference structure members with non-zero offsets. One could reorder the members depending on which is “hotter”.

Research shows that tag manipulation is a heavy cost, esp. for numerics; any tagging scheme should be as simple and consistent as possible.

For example, determining the class of a VALUE could be inlined:

  #define CLASS_OF(v) ((v) & 3) ? RBASIC(v)->klass : rb_cFixnum)

Two-bit tags naturally align with word boundaries on 32-bit processors. Thus, zero tags for integers on 32-bit processors allows pointer arithmetic on VALUE*, as in the Array#[] method, to require no tag manipulation or multiplication to offset into the allocated Array’s elements.

Thanks to Gary Wright for inspiring me to write about this.