Monthly Archives: June 2008

Parameterized Word Tagging in Latently-Typed Languages


Abstract

I have been interested in optimal low-bit tag schemes on machine words in latently-typed languages. There are different schools of thought on how to handle the trade-offs between Fixnums (integers smaller than word size) and allocated objects.

Ian Piumarta’s libid object library is a very simple prototype-based object system with unresolved method delegation. It is used in the Pepsi/Coke COLA system (available at http://vpri.org/fonc_wiki/index.php/Sources) as the basis for all object representations and meta-behaviors. libid assumes that a reference to an object’s method table, or “vtable” resides at the word before the object’s address.

It was chosen as a test-bed for testing tagging scheme’s because the Pepsi/Coke system is built using itself. Libid was changed to be parametric with respect to word tags. This was done by abstracting references to NULL oop pointers and object slot accesses with macros to handle the tagging and untagging of Fixnum and allocated objects pointers. At a later point the idc compiler might be also changed to be tag parametric.

Scheme: TAG(Fixnum) == 1

This is the default for libid. The default tagging scheme was parameterized where Fixnums are tagged with a 1 in the least-significant bit and where 0 is a special case for a Nil object.

  0...000000: Nil
  x...xxxxx0: allocated objects
  x...xxxxx1: Fixnums

Details:

  • 0 is assumed to be the Nil object
    • NULL pointer checks required when determining its vtable.
    • Object pointers are the same boxed and unboxed
    • LSB is 0
    • vtable is located at object pointer word – 4.
    • Fixnums have 1 in lowest-bit.
    • Fixnum vtable is stored in global variable.
    • Fixnum + and - operations require down-shifts.
    • Fixnum * and / operations require down-shifts.
  • Tag needs to be added after up-shifts.

Scheme: TAG(Fixnum) == 0

This scheme should perform better than TAG(Fixnum) == 1, because Fixnum arithmetic operations are less expensive and since most object slot accesses are non-zero pointer offsets in both schemes.

  x...xxxxx0: Fixnums
  x...xxxxx1: allocated objects & Nil

Details:

  • NULL pointers are avoided by never storing them
    • uses a statically allocated Nil object with a valid Nil vtable.
    • vtable(nil) is handled by the generic vtable(object).
    • Object pointers are offset by 1
    • vtable at object pointer word – 5.
    • Tag removal is optimized away into all slot offsets.
    • Fixnums have 0 in lowest-bit.
    • + and - operations require no shifts.
  • * and / operations require shifting of only one argument.

Test

Executes 10,000,000 message sends and +(Fixnum, Fixnum) operations with no Fixnum overflow tests using GCC 4.2.3 on ix86 with the following compile-time options:

-Wall -Wreturn-type -Werror -D_GNU_SOURCE=1 \
-Wno-unused-value -w \
-march=prescott -fomit-frame-pointer \
-falign-functions=16 -funroll-loops

GCC optimizations were also varied.

This is a naive test of raw message and Fixnum addition speed. More substantial tests will be presented at a later date.

Test results

Intel(R) Pentium(R) M processor 1.20GHz

                      n=10                              EXE
                      AVG(s) NIL=0 INT_TAG OBJ_TAG GCC SIZE
                      ====== ===== ======= ======= === ====
./benchmark_1_1_0_3   7.687  Y     1       0       -O3 42774
./benchmark_0_0_1_1   8.002  N     0       1       -O1 34780
./benchmark_0_0_1_2   8.111  N     0       1       -O2 35100
./benchmark_0_0_1_3   8.113  N     0       1       -O3 38500
./benchmark_1_1_0_1   8.593  Y     1       0       -O1 35445
./benchmark_1_1_0_2   9.05   Y     1       0       -02 35365
Intel(R) Core(TM)2 Duo CPU     T7100  @ 1.80GHz

                      n=10                              EXE
                      AVG(s) NIL=0 INT_TAG OBJ_TAG GCC SIZE
                      ====== ===== ======= ======= === ====
./benchmark_1_1_0_3   1.794  Y     1       0       -O3 43912
./benchmark_0_0_1_3   1.83   N     0       1       -O3 38870
./benchmark_0_0_1_2   1.874  N     0       1       -O2 36222
./benchmark_0_0_1_1   1.901  N     0       1       -O1 35966
./benchmark_1_1_0_2   1.906  Y     1       0       -O2 36503
./benchmark_1_1_0_1   2.007  Y     1       0       -O1 36711

Summary

  • TAG(Fixnum)==0 performs better with -O1 optimization.
  • TAG(Fixnum)==1 performs better with -O3 optimization on older processors.
  • Modern processors have deeper pipelines.

Analysis

The results are surprising. I’m still investigating. I suspect:

  • Processors with deeper pipelines can execute more instructions concurrently.
  • The loss of NULL==Nil==0 causes Nil checks to be more expensive.
  • Memory constrained devices may perform better with TAG(Fixnum)==0,-O1.
  • Non-aggressive code generators (idc and many JIT compilers) may perform better with TAG(Fixnum)==0,-O1
  • -O3 optimization is moving code in very mysterious ways.
  • Costs due to pointer offsets with TAG(object)==1 because libid has already incremented object pointers ahead of the usual place for a vtable or “type” slot at offset 0.

The first slot to be accessed is at offset 0 from the object pointer. First slots are generally hot. Having to add an additional negative offset may have an additional cost that would be present if a “type” slot at offset 0 but is not present when TAG(object)==0 and object pointers are ahead of the vtable slot: offset of 0 costs nothing, any non-zero offset has a cost.

E.g.:


struct Cons_allocated {
  oop vtable; /* offset -4 */
  struct Cons {
    oop car;  /* offset 0 */
    oop cdr;  /* offset 4 */
  } cons;
} cons_allocated;
oop cons_ref = &cons_allocated.cons;

Verses:


struct Cons {
  oop vtable; /* offset 0 */
  oop car;    /* offset 4 */
  oop cdr;    /* offset 8 */
};
oop cons_ref = &cons;

I will post a link to code shortly.

References