Category Archives: Performance

Ruby: Thread stack leak patch accepted into REE.

This patch reduces the stack buffer memory footprint of dead Threads as early as possible, rather than waiting until the Thread can be GCed.

This is applicable only to the zero-copy context switch patch.

http://code.google.com/p/rubyenterpriseedition/issues/detail?id=57

http://blog.phusion.nl/2011/02/12/ruby-enterprise-edition-1-8-7-2011-01-released/

ChicagoRuby Ruby Code Tweaks slides, code and video

The slides from my ChicagoRuby 2010/05/04 presentation :

http://kurtstephens.com/pub/ruby/ruby_code_tweaks/

All the raw data used to generate the graph should be referenced in the slides.

The code used to generate the slides is here:

http://github.com/kstephens/ruby_code_tweaks

I’m looking to increase the set of code “Problems” to cover other tiny code idioms and platform issues, for example: regular expressions, numerics, etc. If you have ideas, take a look at the code and contact me.

Justin Love gave a fantastic presentation on lambda and closure.

Thanks to everyone who came — hope it was helpful.

Video from the talk:

Ruby Code Performance Tweaks by Kurt Stephens from ChicagoRuby on Vimeo.

Ruby: Fixnum#gcd accepted into MRI

Ruby rational.rb clean-up and the Fixnum#gcd primitive was refactored into a new MRI extension. Fixnum#gcd is now defined during require ‘rational’.

This dramatically improves performance of Ruby Date class.

http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_8/ChangeLog?…

http://redmine.ruby-lang.org/issues/show/2561

See http://devdriven.com/2007/03/ruby-date-rational-fixnumgcd-hack-increased-app-performance-by-15/ .

Ruby 1.8: Improved Rational performance by 15%

This should also speed up DateTime. This will not help 1.9 performance.

The attached file is based on MRI 1.8.6 rational.rb.

 > ruby rational_performance.rb 
                                              user     system      total        real
test_it                                  32.930000   3.030000  35.960000 ( 35.971832)
test_it                                  33.840000   2.910000  36.750000 ( 36.758585)
test_it ks_rational                      29.110000   2.460000  31.570000 ( 31.572762)

Overview:

  • case x; when Foo; ...; end is faster than if Foo.kind_of?(x).
  • Avoid recuring on ephemeral objects.
  • Avoid local variables, in-line expressions.
  • Avoid return in tail position.
  • String interpolation: "#{x}/#{y}" is faster than x.to_s + "/" + y.to_s.
  • Implement #-, #zero?, #nonzero? natively.
  • #abs returns self if > 0.

MRI 1.8.7 patch to follow shortly.

Abstraction .vs. Optimization

Abstractions can lead to greater flexibility and correctness at the expense of speed or size — there are good reasons that most programs are not crafted in machine code.

The key to improving an abstraction’s performance is to compile it. Dynamic environments like Common Lisp are superior to languages and tools like: Java, C++, and C#, in this regard, because one can create abstractions that compile to efficient code.

If the abstraction is cumbersome it might be a poor abstraction or its platform is poor for abstraction. Environments that give seamless access to a language’s compiler from inside the language itself are superior platforms for abstraction.

Ruby: Caching #to_s for immutables (and a possible future for constant-folding)

Reference: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26869

I have a proof-of-concept patch to MRI that caches #to_s values for immutable values. It is implemented using a few fixed-size hash tables. http://github.com/kstephens/ruby/commits/to_s_maybe_frozen/

It reduces the number of #to_s result objects by 1890 during the MRI test suite for NilClass#to_s, TrueClass#to_s, FalseClass#to_s, Symbol#to_s, and Float#to_s.

It requires a minor semantic change to Ruby core. This minor change could cascade into a huge performance improvement for all Ruby implementations — as will be illustrated later:

#to_s may return frozen Strings.

This appears to not be a problem since any callers of #to_s are likely to anticipate that the receiver may already be a String and are not going to mutate it — #to_s is a coercion. The current MRI test suite passes if some #to_s results are frozen.

For code that may expect #to_s to return a mutable, an Object#dup_if_frozen method might be helpful. This method will return self.dup if the receiver is #frozen? and is not an immediate or an immutable. (Aside: a fast #dup_unless_frozen method might be helpful for general memoization of computations!)

This caching technique could be extended into other immutables (e.g.: the Numerics) and objects whose #to_s representations never change (e.g.: Class, Module?) and for #inspect under similar constraints.

In the patch, Fixnum#to_s is not cached because Fixnums are often incremented during long loops; any cache for it is quickly churned. However, this could be enabled if it proves useful in practice.

If this new semantic for #to_s is reasonable, I recommend explicitly storing frozen strings for true.to_s, false.to_s, nil.to_s and storing Symbol#to_s with each Symbol, likewise for #inspect.

In practice, most Ruby String literals become garbage immediately. If Symbol#to_s was guaranteed to be always be cached, this would enable the use of:

puts :"some string"

instead of

puts "some string"

as an in-line memoized frozen String that creates no garbage when calling puts which will call #to_s on its argument, but never mutate the result. A parser or compiler could recognize Symbol#to_s as an operation with no side-effect and elide it, providing a true String constant. This idiom would eliminate the pointless String garbage created by the evaluation of every String literal.

This is far more expressive and concise than:

SOME_STRING = "some string".freeze
...
puts SOME_STRING

The alternative to :"some string" might be to memoize all String literals as frozen. This is a superior syntax and semantic — old code would need to change on a massive scale, but any issues would be easy to diagnose:

str = ''       # Make a mutable empty string.
str << "foo"   # "foo" is garbage
str << "bar"   # "bar" is garbage

would become:

str = ''.dup   # Make a mutable empty string.
str << "foo"   # "foo" is not garbage
str << "bar"   # "bar" is not garbage

The latter is backwards-compatible with the current String literal semantics.

Ruby: Performance of Symbol Construction

Measurements of Symbol Constructor Expressions.

n=10_000_000

ruby 1.8.6 (2008-08-08 patchlevel 286) [i686-linux]:

> /cnu/bin/ruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.740000   0.000000   0.740000 (  0.742914)
'foo_bar'                  1.670000   0.000000   1.670000 (  1.661374)
"foo_bar"                  1.620000   0.000000   1.620000 (  1.625221)
:foo_bar                   0.890000   0.000000   0.890000 (  0.886903)
:'foo_bar'                 0.880000   0.000000   0.880000 (  0.878555)
:"foo_bar"                 0.860000   0.000000   0.860000 (  0.867110)
"foo_bar".to_sym           2.830000   0.000000   2.830000 (  2.830536)
str.to_sym                 2.050000   0.000000   2.050000 (  2.052756)
:"{str}"                   0.880000   0.000000   0.880000 (  0.881367)
:"foo_#{'bar'}"            0.860000   0.000000   0.860000 (  0.854944)
"foo_#{'bar'}".to_sym      2.880000   0.000000   2.880000 (  2.942499)
:"foo_#{bar}"              4.280000   0.000000   4.280000 (  4.290880)
"foo_#{bar}".to_sym        4.930000   0.000000   4.930000 (  4.929801)

ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]:

> /usr/bin/ruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.780000   0.000000   0.780000 (  0.782060)
'foo_bar'                  3.480000   0.770000   4.250000 (  4.263535)
"foo_bar"                  3.400000   0.820000   4.220000 (  4.224401)
:foo_bar                   2.280000   0.950000   3.230000 (  3.223590)
:'foo_bar'                 2.370000   0.860000   3.230000 (  3.249133)
:"foo_bar"                 2.310000   0.940000   3.250000 (  3.283796)
"foo_bar".to_sym           5.010000   0.980000   5.990000 (  6.029690)
str.to_sym                 3.740000   0.930000   4.670000 (  4.706827)
:"{str}"                   2.440000   0.810000   3.250000 (  3.254253)
:"foo_#{'bar'}"            2.470000   0.780000   3.250000 (  3.245739)
"foo_#{'bar'}".to_sym      4.800000   0.970000   5.770000 (  5.767003)
:"foo_#{bar}"              7.490000   0.990000   8.480000 (  8.506735)
"foo_#{bar}".to_sym        8.640000   0.890000   9.530000 (  9.532481)

ruby 1.9.0 (2008-11-25 revision 20347) [i686-linux]:

> ~/local/ruby/trunk/bin/ruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.690000   0.000000   0.690000 (  0.695087)
'foo_bar'                  1.760000   0.000000   1.760000 (  1.772417)
"foo_bar"                  1.780000   0.000000   1.780000 (  1.772031)
:foo_bar                   0.720000   0.000000   0.720000 (  0.764095)
:'foo_bar'                 0.710000   0.000000   0.710000 (  0.710952)
:"foo_bar"                 0.720000   0.000000   0.720000 (  0.717041)
"foo_bar".to_sym           3.370000   0.000000   3.370000 (  3.371836)
str.to_sym                 2.340000   0.010000   2.350000 (  2.347026)
:"{str}"                   0.720000   0.000000   0.720000 (  0.723032)
:"foo_#{'bar'}"            0.710000   0.000000   0.710000 (  0.712813)
"foo_#{'bar'}".to_sym      3.370000   0.020000   3.390000 (  3.399618)
:"foo_#{bar}"              5.030000   0.000000   5.030000 (  5.037542)
"foo_#{bar}".to_sym        5.060000   0.010000   5.070000 (  5.112535)

rubinius 0.10.0 (ruby 1.8.6 compatible) (c458077eb) (12/31/2009) [i686-pc-linux-gnu]:

> ~/local/ruby/rubininus/bin/rbx symbol_benchmark.rb
                               user     system      total        real
Null Test                  2.369585   0.000000   2.369585 (  2.369602)
'foo_bar'                  3.616918   0.000000   3.616918 (  3.616949)
"foo_bar"                  3.672719   0.000000   3.672719 (  3.672752)
:foo_bar                   2.407764   0.000000   2.407764 (  2.407796)
:'foo_bar'                 2.462789   0.000000   2.462789 (  2.462824)
:"foo_bar"                 2.401784   0.000000   2.401784 (  2.401812)
"foo_bar".to_sym           7.127442   0.000000   7.127442 (  7.127472)
str.to_sym                15.128944   0.000000  15.128944 ( 15.128974)
:"{str}"                   2.411760   0.000000   2.411760 (  2.411789)
:"foo_#{'bar'}"            2.437595   0.000000   2.437595 (  2.437629)
"foo_#{'bar'}".to_sym      7.120185   0.000000   7.120185 (  7.120209)
:"foo_#{bar}"             20.144837   0.000000  20.144837 ( 20.144864)
"foo_#{bar}".to_sym       20.222530   0.000000  20.222530 ( 20.222891)

jruby 1.1.5 (ruby 1.8.6 patchlevel 114) (2008-11-03 rev 7996) [i386-java]:

> ~/local/ruby/jruby-1.1.5/bin/jruby symbol_benchmark.rb
                               user     system      total        real
Null Test                  0.936000   0.000000   0.936000 (  0.936172)
'foo_bar'                  1.886000   0.000000   1.886000 (  1.886079)
"foo_bar"                  1.880000   0.000000   1.880000 (  1.880344)
:foo_bar                   1.379000   0.000000   1.379000 (  1.379957)
:'foo_bar'                 5.842000   0.000000   5.842000 (  5.842242)
:"foo_bar"                 5.834000   0.000000   5.834000 (  5.833263)
"foo_bar".to_sym           2.597000   0.000000   2.597000 (  2.597294)
str.to_sym                 2.306000   0.000000   2.306000 (  2.306082)
:"{str}"                   5.655000   0.000000   5.655000 (  5.655672)
:"foo_#{'bar'}"            5.785000   0.000000   5.785000 (  5.785285)
"foo_#{'bar'}".to_sym      2.585000   0.000000   2.585000 (  2.584877)
:"foo_#{bar}"              7.532000   0.000000   7.532000 (  7.531846)
"foo_#{bar}".to_sym        8.361000   0.000000   8.361000 (  8.360630)

symbol_benchmark.rb:

require 'benchmark'

def bm_expr(x, expr, title = nil)
  e = <<"END"
  n = 10_000_000
  str = 'foo_bar'
  bar = 'bar'
  #begin; GC.enable; GC.start; GC.disable; rescue nil; end
  begin; GC.start; rescue nil; end
  x.report(#{(title || expr).inspect}) {
    n.times { 
      #{expr}
    }
  }
END
  eval e

end

Benchmark.bm(25) { | x |
  bm_expr x, "", 'Null Test' 
  bm_expr x, "'foo_bar'" 
  bm_expr x, '"foo_bar"' 
  bm_expr x, ':foo_bar' 
  bm_expr x, ":'foo_bar'" 
  bm_expr x, ':"foo_bar"' 
  bm_expr x, '"foo_bar".to_sym' 
  bm_expr x, 'str.to_sym' 
  bm_expr x, ':"{str}"' 
  bm_expr x, ':"foo_#{\'bar\'}"'
  bm_expr x, '"foo_#{\'bar\'}".to_sym'
  bm_expr x, ':"foo_#{bar}"'
  bm_expr x, '"foo_#{bar}".to_sym'
}

First time I ran the benchmarks without forcing and disabling GC :'foo_bar' appeared faster than :foo_bar. Reminder to self: always disable GC around benchmarks, unless you are benchmarking GC.

Ruby Internals: Why RUBY_FIXNUM_FLAG should be 0x00

Type tags in MRI Ruby VALUE

Internally, values in MRI Ruby are 32-bit (at least for 32-bit processors). Some of the least-significant bits are used to store type information. See the VALUE definition in include/ruby/ruby.h. Using type tag bits avoids allocating additional memory for commonly-used immutable values, like integers.

Ruby uses a single-bit tag of 0x01 as the Fixnum type tag. The remaining bits, are used to store the Fixnum’s signed value. This is an immediate value; it doesn’t require storage for the Fixnum value to be allocated, unlike the heap space that would be required for a String. Other types will use different tag bits.

Since Ruby uses 31 bits to store the Fixnum’s value, all the other types use 0 for least-significant bit; Ruby uses dynamic length tags:

 3         2         1
10987654321098765432109876543210 bit index 
--------------------------------
sxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1 Fixnum
00000000000000000000000000000000 false
00000000000000000000000000000010 true
00000000000000000000000000000100 nil
00000000000000000000000000000110 undef
xxxxxxxxxxxxxxxxxxxxxxxx00001110 Symbol
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx0x0 All other types

A non-zero Fixnum type tag requires that the tag must be masked or shifted away on Fixnum operands before numeric operations. In a running program, Fixnum is probably the most common numeric type and quite possibly the most common type of all; it makes sense to make the most common operations on Fixnum: +, -, *, / as fast and as simple as possible. It’s also likely that addition is the most common operation applied to Fixnum.

Imagine the following bit of Ruby code:

x = 1
y = 3
puts x + y

Internally, INT2FIX(X) is used to create a Ruby value from a C int X:

#define INT2FIX(X) (((X) << 1) | RUBY_FIXNUM_FLAG)
#define FIX2INT(X) ((X) >> 1)

Thus:

INT2FIX(1) => 3
INT2FIX(3) => 7

FIX2INT(X) shifts one bit down which removes the tag and returns the original integer.

To compute x + y, Ruby internally, in numeric.c: fix_plus(), does the following:

x + y                             =>
INT2FIX(FIX2INT(x) + FIX2INT(y))  =>
((3 >> 1) + (7 >> 1)) << 1 | 1    =>
9

FIX2INT(9)                        => 
(9 >> 1)                          =>
4

If a type-tag of 0x00 was used for Fixnums, there is no tag to remove and addition or subtraction on Fixnums requires no tag manipulations. Assume:

#define INT2FIX(X) ((X) << 1)
#define FIX2INT(X) ((X) >> 1)

The Ruby expression x + y would simply be x + y in C code, assuming no underflow or overflow into Bignum.

Multiplication with zero Fixnum tags is very simple: only one of the operands needs to be shifted down:

#define FIXMUL(X, Y)((X) >> 1) * (Y))

Fixnum division: the result of the division is shifted up:

#define FIXDIV(X, Y) (((X) / (Y)) << 1)

Two-bit type tags on 32-bit architectures

Oaklisp and LL (http://kurtstephens.com/pub/ll/) use 2-bit tags for all values. LL uses the following tag scheme:

 3         2         1
10987654321098765432109876543210 bit index 
--------------------------------
sxxxxxxxxxxxxxxxxxxxxxxxxxxxxx00 <fixnum>
pppppppppppppppppppppppppppppp01 <locative>
seeeeeeeemmmmmmmmmmmmmmmmmmmmm10 <flonum>
pppppppppppppppppppppppppppppp11 All other types

Floating-point (<flonum>) values are C floats which sacrifice the 2 least-significant mantissa bits for the type tag. Locatives are safe pointers to other values. Oaklisp stores floating-point values as allocated objects and uses a tag for other common immediate values: characters, etc.

The rationale for choosing a fixed-size lower 2-bit type tag, opposed to a dynamic-length type tag, as in Ruby, or high-bit tags, like some older Lisp implementations, is as follows:

C compilers and dynamic memory allocators will align allocations to word boundaries for performance reasons, so there cannot not be a pointer to an object that would require some of the lower bits of a pointer, except for sub-word access, e.g.: char *. 32-bit words are 4 bytes long; the lower 2 bits of any object pointer will always be zero, and are free to be used for type tagging.

If references to allocated objects are encoded using a 0x03 type tag, tag removal could be:

  #define RBASIC(X) ((struct RBasic*)((X) - 3))

Assuming that most of the time the interpreter is referencing structure members of the object, and does not need the actual address of the object:

  struct RBasic {
      VALUE flags; /* struct offset: + 0 */
      VALUE klass; /* struct offset: + 4 */
  };

  RBASIC(X)->klass =>
  ((struct RBasic*)((X) - 3))->klass

C compilers convert the pointer->member expression into an offset from an address. For 32-bit VALUEs:

  RBASIC(X)->flags => *(VALUE*)((X) - 3 + 0)
  RBASIC(X)->klass => *(VALUE*)((X) - 3 + 4)

Using subtraction as the tag removal operation, instead of (X & ~3),
allows the C compiler to constant fold the tag removal and the structure offset:

  RBASIC(X)->flags => *(VALUE*)((X) - 3)
  RBASIC(X)->klass => *(VALUE*)((X) + 1)

Therefore, there is no additional tag removal cost to reference structure members with non-zero offsets. One could reorder the members depending on which is “hotter”.

Research shows that tag manipulation is a heavy cost, esp. for numerics; any tagging scheme should be as simple and consistent as possible.

For example, determining the class of a VALUE could be inlined:

  #define CLASS_OF(v) ((v) & 3) ? RBASIC(v)->klass : rb_cFixnum)

Two-bit tags naturally align with word boundaries on 32-bit processors. Thus, zero tags for integers on 32-bit processors allows pointer arithmetic on VALUE*, as in the Array#[] method, to require no tag manipulation or multiplication to offset into the allocated Array’s elements.

Thanks to Gary Wright for inspiring me to write about this.

Ruby: Date / Rational / Fixnum#gcd hack increased app performance by 15%

UPDATE: Fixnum#gcd was accepted int MRI 1.8: See http://devdriven.com/2010/02/ruby-fixnumgcd-accepted-into-mri/ .

Ruby Date uses Rational heavily, which calls Integer#gcd for every new Rational. The Integer#gcd method is generic to handle Bignums, but performs terribly for Fixnum#gcd(Fixnum), which is probably the most often case.

This RubyInline hack saved 15% execution time in a large Rails application:

require 'inline'

class Fixnum
  inline do | builder |
    builder.c_raw '
    static
    VALUE 
    gcd(int argc, VALUE *argv, VALUE self) {
      if ( argc != 1 ) {
        rb_raise(rb_eArgError, "wrong number of arguments (%d for %d)",
                 argc, 1);
      }
      /* Handle Fixnum#gcd(Fixnum) case directly. */
      if ( FIXNUM_P(argv[0]) ) {
        /* fprintf(stderr, "Using Fixnum#gcd(Fixnum)\n"); */
        long a = FIX2LONG(self);
        long b = FIX2LONG(argv[0]);
        long min = a < 0 ? - a : a;
        long max = b < 0 ? - b : b;
        while ( min > 0 ) {
          int tmp = min;
          min = max % min;
          max = tmp;
        }
        return LONG2FIX(max);
      } else {
        /* fprintf(stderr, "Using super#gcd\n"); */
        return rb_call_super(1, argv);
      }
    }
    '
  end
end

Update:

Sorry for the late reply. If the code above does not work via cut-and-paste, download it from here.

This will be released soon as a gem dynamic library called speedfreaks,
with other performance-enhancing snippets.

Thanks for the feedback!