Ruby: Caching #to_s for immutables (and a possible future for constant-folding)


I have a proof-of-concept patch to MRI that caches #to_s values for immutable values. It is implemented using a few fixed-size hash tables.

It reduces the number of #to_s result objects by 1890 during the MRI test suite for NilClass#to_s, TrueClass#to_s, FalseClass#to_s, Symbol#to_s, and Float#to_s.

It requires a minor semantic change to Ruby core. This minor change could cascade into a huge performance improvement for all Ruby implementations — as will be illustrated later:

#to_s may return frozen Strings.

This appears to not be a problem since any callers of #to_s are likely to anticipate that the receiver may already be a String and are not going to mutate it — #to_s is a coercion. The current MRI test suite passes if some #to_s results are frozen.

For code that may expect #to_s to return a mutable, an Object#dup_if_frozen method might be helpful. This method will return self.dup if the receiver is #frozen? and is not an immediate or an immutable. (Aside: a fast #dup_unless_frozen method might be helpful for general memoization of computations!)

This caching technique could be extended into other immutables (e.g.: the Numerics) and objects whose #to_s representations never change (e.g.: Class, Module?) and for #inspect under similar constraints.

In the patch, Fixnum#to_s is not cached because Fixnums are often incremented during long loops; any cache for it is quickly churned. However, this could be enabled if it proves useful in practice.

If this new semantic for #to_s is reasonable, I recommend explicitly storing frozen strings for true.to_s, false.to_s, nil.to_s and storing Symbol#to_s with each Symbol, likewise for #inspect.

In practice, most Ruby String literals become garbage immediately. If Symbol#to_s was guaranteed to be always be cached, this would enable the use of:

puts :"some string"

instead of

puts "some string"

as an in-line memoized frozen String that creates no garbage when calling puts which will call #to_s on its argument, but never mutate the result. A parser or compiler could recognize Symbol#to_s as an operation with no side-effect and elide it, providing a true String constant. This idiom would eliminate the pointless String garbage created by the evaluation of every String literal.

This is far more expressive and concise than:

SOME_STRING = "some string".freeze

The alternative to :"some string" might be to memoize all String literals as frozen. This is a superior syntax and semantic — old code would need to change on a massive scale, but any issues would be easy to diagnose:

str = ''       # Make a mutable empty string.
str << "foo"   # "foo" is garbage
str << "bar"   # "bar" is garbage

would become:

str = ''.dup   # Make a mutable empty string.
str << "foo"   # "foo" is not garbage
str << "bar"   # "bar" is not garbage

The latter is backwards-compatible with the current String literal semantics.