Martian Chronicles
Evil Martians’ team blog

How Ruby 2.2 can cause an out-of-memory server crash

Bad news: Ruby (MRI) can cause an out-of-memory server crash. The issue first appeared in version 2.2.0-preview and has remained ever since.

Good news: the mysterious bug has finally been dealt with. It was extremely difficult and took me a lot of time and effort. The bug hunt itself deserves a separate post, and I promise I’ll describe it in full detail.

Now I would like to explain what this bug and my patch are about.

To start with, this would have been impossible without the support of my friends: Ravil Bayramgalin, a ruby-debug expert who showed me the ropes and was with me from the very beginning, and Vladimir Menshakov, an outstanding C specialist who helped me with the C side of debugging a Ruby application.

The Ruby (MRI) interpreter is written in C, so many methods of the standard Ruby library are implemented in C as well. MRI uses the YARV virtual machine to interpret the code. Hence, a Ruby program has two stacks: a Ruby one and a C one. In the main thread of your program, the C stack is limited to the standard process stack size used in *nix systems (ulimit -s), which is 8 MB by default, while the Ruby stack is limited to the value defined in vm_core.h, which is 1 MB by default. You can change the latter by setting an environment variable: export RUBY_THREAD_VM_STACK_SIZE=10000000 is the command I used to set the Ruby stack limit to 10 MB for all programs.
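To see which limits a given interpreter was actually started with, MRI exposes its defaults in the RubyVM::DEFAULT_PARAMS hash. The snippet below is a minimal sketch: the helper name stack_limits is made up for illustration, and it assumes that :thread_machine_stack_size describes the C stack of newly created threads rather than the main thread, whose C stack comes from the process limit.

```ruby
# Reads the stack sizes the running interpreter was started with.
# RubyVM is specific to MRI, so other implementations get an empty hash.
# `stack_limits` is a hypothetical helper name, not part of Ruby.
def stack_limits
  return {} unless defined?(RubyVM::DEFAULT_PARAMS)

  params = RubyVM::DEFAULT_PARAMS
  {
    vm_stack_bytes: params[:thread_vm_stack_size],          # Ruby (YARV) stack
    machine_stack_bytes: params[:thread_machine_stack_size] # C stack (assumed: new threads)
  }
end

stack_limits.each { |name, bytes| puts "#{name}: #{bytes}" }
```

Running it once normally and once with RUBY_THREAD_VM_STACK_SIZE exported should show the first value change.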

It is very easy to write a recursion or create a self-referencing object in Ruby. For instance, it could be done like this:

# example 1

def recursion
  recursion
end

recursion # => SystemStackError

# example 2

class Foo
  def to_s
    puts self
  end
end

Foo.new.to_s # => SystemStackError
In both cases, calling the method results in endless recursion, and sooner or later Ruby will raise a stack overflow error (SystemStackError, "stack level too deep"). The question is: which stack will overflow? In case 1, obviously, it will be the Ruby stack, since it is a Ruby method that is called recursively. Case 2 is far more complicated.
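Case 1 is easy to observe directly, because the error is an ordinary exception that can be rescued. The sketch below uses a hypothetical DepthProbe class (not from the original code) to count how many nested calls fit before the Ruby stack runs out. The number depends on the interpreter version and on RUBY_THREAD_VM_STACK_SIZE, so no particular value is claimed.

```ruby
# Hypothetical helper: counts nested Ruby calls until the stack overflows.
class DepthProbe
  attr_reader :depth

  def initialize
    @depth = 0
  end

  # Pure Ruby recursion, the same shape as example 1.
  def dive
    @depth += 1
    dive
  end
end

probe = DepthProbe.new
begin
  probe.dive
rescue SystemStackError => e
  # The process survives; only the recursion is aborted.
  puts "#{e.class}: #{e.message} after #{probe.depth} calls"
end
```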

The standard Kernel#puts method calls $stdout.puts, which is the IO#puts method. It is written in C and looks as follows:

VALUE
rb_io_puts(int argc, const VALUE *argv, VALUE out)
{
    int i;
    VALUE line;
    // ...
    for (i=0; i<argc; i++) {
        // converts a non-String argument by calling its Ruby to_s method
        line = rb_obj_as_string(argv[i]);
        // ...
    }
    // ...
}

Of interest is the part that gains control in our case, namely rb_obj_as_string(argv[i]). This function is defined in string.c. Its main job is to call the Ruby to_s method when it is given a non-String object. Thus, we have a recursion that involves not only Ruby methods and objects but also C ones. So, if the C stack is by default 8 times larger than the Ruby one, then it is evidently the Ruby stack that overflows, isn’t it?

It is, and it is not.

If this code is run in the main thread, that is indeed the case, since, as I have already mentioned, the main Ruby thread has those limits. But if this code is run in a thread other than the main one, things are different. If we look at vm_core.h, vm.c and thread_pthread.c, we will see that the C stack of a new thread is limited to 1 MB, the same limit as the Ruby stack has. As a result, if code sample 2 (the puts/to_s recursion) is run inside Thread.new {}, it is the C stack that will be reported as overflown. You can check whether it is the native stack that has overflown by rebuilding Ruby with a couple of printf calls added to the code that handles stack overflow errors. The Ruby stack gets checked for overflow at vm_insnhelper.c:36, the C stack in signal.c (you’ll have to look for the check_stack_overflow function, since its exact location changes from version to version).
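From Ruby code, both situations look identical, which is why printf in the C source is needed to tell them apart. The sketch below runs the puts/to_s recursion once in the main thread and once in a new thread. It assumes an interpreter that reports the overflow as SystemStackError in both places, and the helper name recursion_error is made up for this illustration.

```ruby
class Foo
  def to_s
    puts self # IO#puts is C code that calls back into to_s
  end
end

# Hypothetical helper: returns the class of the error raised by the
# recursion, or nil if (unexpectedly) nothing was raised.
def recursion_error
  Foo.new.to_s
  nil
rescue SystemStackError => e
  e.class
end

in_main   = recursion_error
in_thread = Thread.new { recursion_error }.value

# Both lines are expected to name the same error class; which of the
# two stacks ran out first is not visible at this level.
puts "main thread: #{in_main.inspect}"
puts "new thread:  #{in_thread.inspect}"
```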

Curiouser and curiouser, isn’t it? What on earth does signal.c have to do with this? It is involved because, depending on whether we run on Mac or Linux, the program receives either the BUS or the SEGV (segmentation fault) signal when the native stack overflows. Ruby handles a segfault inside a thread, which does not seem like a big deal. What matters, though, is not what it does but how it does it.

The signal handling code shows that GC gets disabled first. Then the handler checks whether the fault was a stack overflow; if it was, an ordinary Ruby stack overflow error is raised. That is by no means an error that requires shutting the process down. And this is where the trouble begins: remember that GC has already been disabled, and nothing will enable it again.

Previously, instead of disabling the whole GC, the handler only disabled GC stress mode by assigning 1 (true) to ruby_disable_gc_stress. The commit 0c391a55d3ed4637e17462d9b9b8aa21e64e2340 changed this: it removed the ruby_disable_gc_stress variable and introduced a new internal variable, ruby_disable_gc. While that variable is true, the check that decides whether the heap may be garbage-collected always returns false. Since the signal handler only ever assigns 1 (true) to this variable and nothing else resets it, handling the signal leaves GC disabled for good.

What is the problem? The process did not crash and keeps working the way it should. Of course, a thread raised an error, but that is never a surprise, and it can simply be rescued. The catch is that from then on the process never frees the memory of dead objects, so it keeps growing until it has used all the memory available on the server.
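One way to notice this state from inside a running process is to watch GC.count, which stops increasing when no collections are started. The following is a rough diagnostic sketch, not part of the original investigation: the helper name is hypothetical, and the expectation that GC.count stays flat on an affected 2.2 build after the overflow is an assumption based on the description above.

```ruby
# Hypothetical helper: allocates short-lived objects and reports whether
# the collector ran at least once while doing so.
def gc_ran_during_allocation?(objects = 1_000_000)
  before = GC.count
  objects.times { Object.new } # garbage that a working GC should reclaim
  GC.count > before
end

puts "GC still collecting: #{gc_ran_during_allocation?}"
```

Calling GC.disable before the check makes it report false on any version, which mimics the symptom. That is a separate, public switch, though, not the internal ruby_disable_gc flag discussed here.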

In the real application, this error was caused by two gems, rollbar and oauth2, both operating correctly. However, it could have been any other gem. Rollbar was in asynchronous mode, which meant that to report an error a new thread was created, where the error would get dumped to JSON. And OAuth2 would keep a self-reference in the error. The way Rails and JSON work (all instance variables being dumped to JSON) causes an endless recursion that includes the C-based JSON methods, and those overflow the C stack of the new thread. As a result, a single error leads to GC being disabled. Consequently, a server crash is unavoidable.
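The shape of that failure can be reproduced without either gem. In the sketch below, every class and method name is hypothetical: FakeOAuthError keeps a response that points back at the error, and naive_dump stands in for a serializer that walks all instance variables. It is a pure Ruby simplification, so it shows the endless recursion in a background thread, not the C-level details of the real JSON code.

```ruby
# Hypothetical stand-ins for an HTTP response and an OAuth-style error.
class FakeResponse
  attr_accessor :error
end

class FakeOAuthError < StandardError
  attr_reader :response

  def initialize(response)
    super("request failed")
    @response = response
    response.error = self # the error is now reachable from itself
  end
end

# Naive serializer: dumps every instance variable, with no cycle check.
def naive_dump(obj)
  case obj
  when String, Numeric, nil, true, false then obj
  else
    obj.instance_variables.each_with_object({}) do |ivar, out|
      out[ivar.to_s] = naive_dump(obj.instance_variable_get(ivar))
    end
  end
end

error = FakeOAuthError.new(FakeResponse.new)

# Asynchronous reporting: the dump happens in a new thread.
outcome = Thread.new do
  begin
    naive_dump(error)
    :finished
  rescue SystemStackError
    :stack_overflow
  end
end.value

puts "reporter thread outcome: #{outcome}"
```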

My patch only changes a single line. The received signal is already cleared as soon as the interpreter determines that a stack overflow has occurred. I suggest re-enabling GC at that same point.
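The patch itself is a change to MRI's C source and is not reproduced in this sketch. What follows is only a toy Ruby model of the control flow described in this post, with invented names (ToyInterpreter, handle_fault), meant to show why a single assignment is enough.

```ruby
# Toy model only: none of these names exist in MRI.
class ToyInterpreter
  attr_reader :gc_disabled

  def initialize(patched:)
    @patched = patched
    @gc_disabled = false
  end

  # Stands in for the SEGV/BUS signal handler.
  def handle_fault(stack_overflow:)
    @gc_disabled = true # the handler switches GC off first
    if stack_overflow
      # The one-line idea: turn GC back on before raising.
      @gc_disabled = false if @patched
      raise SystemStackError, "stack level too deep"
    end
    raise "fatal crash" # a real segfault would abort the process here
  end
end

# Simulates a thread that overflows its stack and rescues the error.
def gc_disabled_after_overflow?(patched:)
  vm = ToyInterpreter.new(patched: patched)
  begin
    vm.handle_fault(stack_overflow: true)
  rescue SystemStackError
    # the error is rescued and the process carries on
  end
  vm.gc_disabled
end

puts "unpatched: GC disabled = #{gc_disabled_after_overflow?(patched: false)}"
puts "patched:   GC disabled = #{gc_disabled_after_overflow?(patched: true)}"
```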

The patch has already been accepted into MRI and is included in the 2.3 preview release. I’m not sure whether it will make it into the latest 2.2 minor version, though: I haven’t seen the backport yet.