Behind the Scenes: Building a Dynamic Instrumentation Agent for Ruby

TL;DR

Building a Ruby Dynamic Instrumentation Agent is no easy task. We’ve been working hard at Sqreen to make our protection transparent and frictionless. The Sqreen agent is based on dynamic instrumentation in order to detect and block security issues inside the application without requiring code modification.

Instrumentation modifies a program's behavior by inserting additional logic into that program. In the rest of this post, this added logic is called the “business logic”.

Dynamic instrumentation is instrumentation performed at runtime (as opposed to instrumentation performed on the source code or on the compiled bytecode). Runtime instrumentation is widely used in application performance monitoring solutions, such as New Relic.

There are many advantages to dynamic instrumentation:

  • Extremely fast to set up, since there is no source code modification
  • At runtime, the program is fully loaded and can be fully observed, including all third party libraries
  • Functions can be hooked whatever the code looks like: changing the source code, or running different versions of it (dev, staging, prod), is transparent to the instrumentation.

An agent is a software component (e.g. a library) that retrieves data from inside a program and usually communicates with the outside world to report statistics or specific data.

The agent described here is aimed at instrumenting production systems, and thus involves sophisticated methods to maximize stability with a minimum impact on performance.

The agent is divided into several parts:

  1. The instrumentation engine that is used to handle the low-level machinery of enriching a method with new code;
  2. The callbacks manager that handles adding and removing callbacks on methods;
  3. The data recording mechanism, the part that receives data from the callbacks and sends it to our backend.

These parts will be described below.

Instrumenting Ruby code

Instrumenting a Ruby program is safe and easy because of the high level of reflection built into Ruby itself. On top of that, Ruby is extremely permissive in the way it allows renaming methods and adding methods to classes or singletons.

The basic instrumentation paradigm is:

  1. Create a new method that calls the original method;
  2. Rename the original method;
  3. Rename the new method, so it matches the original method’s name.

Two different ways are used to instrument class methods and instance methods due to their different nature.

Instrumenting methods

Instrumenting instance methods

The code to instrument an instance method is:

The key parts of this code are:

  1. The use of define_callback_method to generate the wrapper. This agent-internal helper runs whatever business logic needs to be included, then safely calls the original method under its new name;
  2. alias_method: http://ruby-doc.org/core-2.4.0/Module.html#method-i-alias_method gives a new name to an existing method;
  3. send: https://ruby-doc.org/core-2.4.0/Object.html#method-i-send calls any method by its name (formally, its symbol).
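
A minimal sketch of how these three pieces can fit together. The module name, the renamed-method suffixes and the signature of define_callback_method are assumptions made for illustration; the agent's real helper is not shown in this post. The sketch forwards positional arguments and blocks only.

```ruby
# Illustrative only: InstanceInstrumentation and the "_instrumented" /
# "_not_instrumented" suffixes are hypothetical names.
module InstanceInstrumentation
  # Generates the wrapper: run the business logic, then call the
  # original method under its new name.
  def self.define_callback_method(klass, wrapper_name, saved_name, &business_logic)
    klass.send(:define_method, wrapper_name) do |*args, &block|
      business_logic.call(self, args)
      send(saved_name, *args, &block)
    end
  end

  def self.instrument(klass, method_name, &business_logic)
    saved_name   = :"#{method_name}_not_instrumented"
    wrapper_name = :"#{method_name}_instrumented"
    klass.class_eval do
      # 1. Create a new method that calls the original method.
      InstanceInstrumentation.define_callback_method(self, wrapper_name, saved_name, &business_logic)
      # 2. Rename the original method.
      alias_method saved_name, method_name
      # 3. Give the new method the original method's name.
      alias_method method_name, wrapper_name
    end
  end
end

class Greeter
  def greet(name)
    "Hello, #{name}"
  end
end

calls = []
InstanceInstrumentation.instrument(Greeter, :greet) { |_receiver, args| calls << args }
Greeter.new.greet("Ruby") # the callback runs first, then the original method
```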

Instrumenting class methods

The code is very similar to the instance method case:

The difference is that since the destination is not an instance, the class definition needs to be changed.
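
One way to read this, sketched under the assumption that the renaming is done on the class's singleton class (where Ruby stores class methods). The module and method names are hypothetical.

```ruby
# Illustrative only: ClassInstrumentation is a hypothetical helper.
module ClassInstrumentation
  def self.instrument(klass, method_name, &business_logic)
    saved_name   = :"#{method_name}_not_instrumented"
    wrapper_name = :"#{method_name}_instrumented"
    # Class methods are defined on the singleton class, so the aliasing
    # happens there instead of on the class itself.
    klass.singleton_class.class_eval do
      define_method(wrapper_name) do |*args, &block|
        business_logic.call(self, args)   # self is the class here
        send(saved_name, *args, &block)
      end
      alias_method saved_name, method_name
      alias_method method_name, wrapper_name
    end
  end
end

class Settings
  def self.read(path)
    "loaded #{path}"
  end
end

seen = []
ClassInstrumentation.instrument(Settings, :read) { |receiver, args| seen << [receiver, args] }
Settings.read("app.yml")
```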

Recovering from callback errors

This instrumentation scheme adds very little code besides the actual business logic. From a reliability point of view, the most sensitive part may be the business logic itself.

The agent code uses a very defensive scheme to contain any bug in this code. Each time a callback is called, it is wrapped in exception handling code. If an exception occurs, the custom code for this callback is discarded and the program proceeds directly to the original code. The exception can be sent to a specific endpoint for further analysis.
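
A minimal sketch of this defensive wrapping. CallbackGuard is a hypothetical name, and errors are collected in an in-memory array as a stand-in for the reporting endpoint.

```ruby
# Illustrative only: CallbackGuard and ERRORS are hypothetical.
module CallbackGuard
  ERRORS = [] # stand-in for "send the exception to an endpoint"

  # Runs one callback; a failure is recorded and swallowed so the
  # instrumented program is never affected by it.
  def self.run(callback, *args)
    callback.call(*args)
  rescue StandardError => e
    ERRORS << e
    nil
  end
end

class Calculator
  def double(x)
    x * 2
  end
end

buggy_callback = ->(_receiver, _args) { raise ArgumentError, "bug in the callback" }

Calculator.class_eval do
  alias_method :double_not_instrumented, :double
  define_method(:double) do |*args, &block|
    CallbackGuard.run(buggy_callback, self, args)
    double_not_instrumented(*args, &block) # the original code always runs
  end
end

Calculator.new.double(21) # still returns the original result
```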

De-instrumenting – back to the origins

De-instrumenting a method is straightforward. Since the original method’s name was changed, the only necessary action is to set its name back to the original one.
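
A sketch of the round trip, assuming the original method was saved under a hypothetical "_not_instrumented" suffix when it was instrumented.

```ruby
# Illustrative only: module name and suffix are hypothetical.
module Deinstrumentation
  SUFFIX = "_not_instrumented"

  def self.instrument(klass, name)
    saved = :"#{name}#{SUFFIX}"
    klass.class_eval do
      alias_method saved, name
      define_method(name) { |*args, &blk| "[hooked] " + send(saved, *args, &blk) }
    end
  end

  def self.deinstrument(klass, name)
    saved = :"#{name}#{SUFFIX}"
    klass.class_eval do
      alias_method name, saved # the original body gets its name back
      remove_method saved      # the temporary name is no longer needed
    end
  end
end

class Mailer
  def deliver(to)
    "sent to #{to}"
  end
end

Deinstrumentation.instrument(Mailer, :deliver)
hooked = Mailer.new.deliver("bob")
Deinstrumentation.deinstrument(Mailer, :deliver)
restored = Mailer.new.deliver("bob")
```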

Allowing arbitrary callbacks

Given the instrumentation primitives, an interface is needed to make it safe and easy – yet performant – to add callbacks.

First, several callbacks can be set on the same method. They are stored in a method-specific list. Access to this list is critical: adding and removing callbacks should not interfere with the execution.

Special care should be given to the way any callback is executed. Since the callbacks are arbitrary, no assumption can be made about them. There is a chance that a callback, during its execution, makes use of the instrumented method (e.g., a log method is instrumented, but the callback needs to log something itself). Without proper safeguards, this would enter an infinite instrumentation loop.
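
One possible safeguard, sketched here as an assumption rather than the agent's actual mechanism, is a per-thread flag that skips callbacks while a callback is already running. The key name and the AppLogger class are hypothetical.

```ruby
# Illustrative only: ReentrancyGuard and its key are hypothetical.
module ReentrancyGuard
  KEY = :instrumentation_callback_running

  # Yields unless a callback is already running on the current thread
  # (Thread.current[] storage is local to the running fiber).
  def self.run_callbacks
    return if Thread.current[KEY]
    Thread.current[KEY] = true
    begin
      yield
    ensure
      Thread.current[KEY] = false
    end
  end
end

class AppLogger
  LINES = []

  def log(message)
    LINES << message
    message
  end
end

AppLogger.class_eval do
  alias_method :log_not_instrumented, :log
  define_method(:log) do |*args, &blk|
    ReentrancyGuard.run_callbacks do
      # The callback itself uses the instrumented method.
      log("callback saw: #{args.first}")
    end
    log_not_instrumented(*args, &blk)
  end
end

AppLogger.new.log("hello")
```

The nested log call goes through the wrapper again, but the flag is set, so no callback runs a second time and the call falls through to the original method.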

Finally, callbacks can be set in three different positions:

| | Pre | Post | Failing |
|---|---|---|---|
| Position | Before the instrumented function | After the instrumented function | If the instrumented function fails |
| Arguments (in addition to the class and original exceptions) | | Return value | Raised exception |

High-level overview of the callback places:
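
The three positions can be sketched in code as follows. The wrap helper and its keyword names are assumptions for illustration, and the callback signatures are one plausible choice, not the agent's documented interface.

```ruby
# Illustrative only: CallbackPositions.wrap is a hypothetical helper.
module CallbackPositions
  def self.wrap(klass, name, pre: nil, post: nil, failing: nil)
    saved = :"#{name}_not_instrumented"
    klass.class_eval do
      alias_method saved, name
      define_method(name) do |*args, &blk|
        pre.call(self, args) if pre                   # before the call
        begin
          result = send(saved, *args, &blk)
        rescue StandardError => e
          failing.call(self, args, e) if failing      # the call raised
          raise                                       # keep original behavior
        end
        post.call(self, args, result) if post         # after the call
        result
      end
    end
  end
end

class Divider
  def divide(a, b)
    a / b
  end
end

events = []
CallbackPositions.wrap(
  Divider, :divide,
  pre:     ->(_obj, args) { events << [:pre, args] },
  post:    ->(_obj, _args, result) { events << [:post, result] },
  failing: ->(_obj, _args, error) { events << [:failing, error.class] }
)

Divider.new.divide(6, 3)
begin
  Divider.new.divide(1, 0)
rescue ZeroDivisionError
  # the original exception still reaches the caller
end
```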

Performance

The callback machinery is all pure Ruby standard library. It doesn’t involve:

  • I/O
  • Locking (only checks; locks are taken only when callbacks are added or removed)

These characteristics make this code layer very efficient.

Reinstrumenting

The first time a callback needs to be set in a given method, the agent replaces it with the generic instrumentation method. Then the callback is added to a callback list related to the original method.

Later, if a new callback needs to be set on the same method, the agent detects that the method is already instrumented and only adds the callback to the callback list for this method.
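
A sketch of such a callback list, under the assumption that the list is replaced as a whole (never mutated in place) so that readers need no lock, while writers serialize through a mutex. CallbackRegistry and its methods are hypothetical names.

```ruby
# Illustrative only: CallbackRegistry is a hypothetical helper.
module CallbackRegistry
  LOCK = Mutex.new
  CALLBACKS = {} # [class, method name] => frozen array of callables

  def self.add(klass, name, &callback)
    key = [klass, name]
    LOCK.synchronize do
      # Instrument only the first time; afterwards just extend the list.
      instrument(klass, name, key) unless CALLBACKS.key?(key)
      CALLBACKS[key] = ((CALLBACKS[key] || []) + [callback]).freeze
    end
    callback
  end

  def self.remove(klass, name, callback)
    key = [klass, name]
    LOCK.synchronize do
      CALLBACKS[key] = ((CALLBACKS[key] || []) - [callback]).freeze
    end
  end

  def self.instrument(klass, name, key)
    saved = :"#{name}_not_instrumented"
    klass.class_eval do
      alias_method saved, name
      define_method(name) do |*args, &blk|
        # No lock on the hot path: the list is only read here.
        (CallbackRegistry::CALLBACKS[key] || []).each { |cb| cb.call(self, args) }
        send(saved, *args, &blk)
      end
    end
  end
  private_class_method :instrument
end

class Account
  def balance
    100
  end
end

log = []
first = CallbackRegistry.add(Account, :balance) { |_obj, _args| log << :first }
CallbackRegistry.add(Account, :balance) { |_obj, _args| log << :second }
Account.new.balance # runs both callbacks, then the original method
```

If the second add instrumented the method again, the saved alias would point at the wrapper and the call would recurse forever, which is why the registry checks for an existing entry first.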

Data transmission

The information computed by the agent (e.g. statistics) needs to be sent to the outside world in a performant and robust way. Making a network call from inside a callback would slow down the original code too much, so communication is performed asynchronously: each time the agent has information to transmit, it pushes that information onto a local queue.

The only overhead added to the original code is thus a Queue#push call.

The next step is to transmit the data in the queue to a remote server. To keep this as lightweight as possible, a dedicated thread is used, started when the agent initializes. All this thread does is wait for the queue to be populated with Queue#pop, a call that blocks as long as the queue is empty, so the thread consumes no CPU while it waits. As soon as an item arrives in the queue and the thread gets to run, the item is sent to the remote servers.

This implementation is straightforward since it also relies on standard Ruby objects: threads and queues. They have been built to work together, in a performant way.
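
A minimal sketch of this producer/consumer arrangement using Ruby's built-in Queue and Thread. The Reporter class is hypothetical, and the transport block stands in for the real network call to the backend.

```ruby
# Illustrative only: Reporter and its transport block are hypothetical.
class Reporter
  STOP = Object.new # sentinel used to shut the thread down cleanly

  def initialize(&transport)
    @queue = Queue.new
    @transport = transport
    @thread = Thread.new { run } # started once, at initialization
  end

  # Called from callbacks: the only cost is a queue push.
  def record(event)
    @queue.push(event)
  end

  def stop
    @queue.push(STOP)
    @thread.join
  end

  private

  def run
    loop do
      event = @queue.pop       # blocks while the queue is empty
      break if event.equal?(STOP)
      @transport.call(event)   # I/O happens on this thread only
    end
  end
end

sent = []
reporter = Reporter.new { |event| sent << event }
reporter.record({ type: "request", duration_ms: 12 })
reporter.record({ type: "request", duration_ms: 30 })
reporter.stop # waits until both events have been handed to the transport
```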


Shaping the data

There are many reasons to post-process the data gathered by the agent. A common one is data privacy: for example, logged SQL queries can be stripped of any strings or integers that might contain business data. Another example, in exception logging, is to transmit only certain kinds of variables.

The data could also be aggregated so the agent computes average response time rather than sending all of the response times.
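
Both kinds of post-processing can be sketched briefly. The DataShaping module is hypothetical, and the regular expressions are a deliberately naive illustration rather than a real SQL parser.

```ruby
# Illustrative only: DataShaping is a hypothetical helper.
module DataShaping
  # Replaces quoted strings and standalone integers with placeholders.
  def self.scrub_sql(query)
    query
      .gsub(/'(?:[^']|'')*'/, "?") # single-quoted string literals
      .gsub(/\b\d+\b/, "?")        # standalone integer literals
  end

  # Aggregates samples so only one number has to be transmitted.
  def self.average(values)
    return nil if values.empty?
    values.sum.to_f / values.size
  end
end

scrubbed = DataShaping.scrub_sql("SELECT * FROM users WHERE email = 'a@b.c' AND age > 30")
mean = DataShaping.average([120, 80, 100])
```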

Performance

The data thread does two things:

  1. I/O, since it sends data to the remote servers;
  2. Waiting for the queue to be populated.

While waiting, which is its main behavior, it lets the Ruby VM run computation on other threads. Hence this thread has a very limited impact on the performance of the original code.

This thread is used for concurrency, not parallelism, since it relies on I/O to trigger the thread context switch. This means that the Ruby Global Interpreter Lock (GIL) is not an issue in this use case.

Robustness

There are two cases where this data recording scheme can become an issue.

Data flooding

If a client is in a situation where a lot of data needs to be sent, the thread will spend too much time reaching out to the servers.

To work around this issue, events should not each be sent as soon as they arrive. The first time an event needs to be reported to the servers, it is sent immediately. If a second event occurs within a small timeframe, it is stored and sent some time later. If more events occur in the meantime, they are all sent together in the same batch.
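
A sketch of this batching policy. EventBatcher, its window parameter and the injectable clock are assumptions for illustration; the sketch is meant to be driven from a single thread (the data thread) and only flushes held events when the next event arrives.

```ruby
# Illustrative only: EventBatcher is a hypothetical helper.
class EventBatcher
  def initialize(window_seconds,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 &sender)
    @window = window_seconds
    @clock = clock
    @sender = sender
    @pending = []
    @last_sent_at = nil
  end

  # Sends right away if nothing was sent recently; otherwise holds the
  # event so it can leave in a later batch.
  def report(event)
    @pending << event
    flush if @last_sent_at.nil? || @clock.call - @last_sent_at >= @window
  end

  def flush
    return if @pending.empty?
    batch, @pending = @pending, []
    @last_sent_at = @clock.call
    @sender.call(batch)
  end
end

now = 0.0 # fake clock so the example is deterministic
batches = []
batcher = EventBatcher.new(5, clock: -> { now }) { |batch| batches << batch }

batcher.report(:a) # first event: sent immediately
now = 1.0
batcher.report(:b) # within the window: held
now = 2.0
batcher.report(:c) # still within the window: held
now = 6.0
batcher.report(:d) # window elapsed: b, c and d leave together
```

A production version would presumably also flush on a timer, so that held events are not delayed until the next one arrives.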

Queue filling up too fast

But what happens when a queue can’t be reduced fast enough? This can occur for many reasons, e.g. if the network is down (or if the remote servers are down), or if the Ruby VM does not switch to the thread for any reason. In this case, the queue is never emptied and grows quickly, raising the memory usage.

Here a capped queue is used. It stores only a fixed number of events, which puts a hard limit on memory growth. Older events are discarded as new events come in.
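
Ruby's built-in SizedQueue blocks producers when it is full instead of discarding, so a drop-oldest queue needs a small custom class. The CappedQueue below is a hypothetical sketch of that behavior, not the agent's implementation.

```ruby
# Illustrative only: CappedQueue is a hypothetical class.
class CappedQueue
  def initialize(capacity)
    @capacity = capacity
    @items = []
    @lock = Mutex.new
    @not_empty = ConditionVariable.new
  end

  # Never blocks the producer: when full, the oldest event is dropped.
  def push(item)
    @lock.synchronize do
      @items.shift if @items.size >= @capacity
      @items.push(item)
      @not_empty.signal
    end
    self
  end

  # Blocks the consumer while the queue is empty.
  def pop
    @lock.synchronize do
      @not_empty.wait(@lock) while @items.empty?
      @items.shift
    end
  end

  def size
    @lock.synchronize { @items.size }
  end
end

queue = CappedQueue.new(3)
(1..5).each { |i| queue.push(i) } # events 1 and 2 are discarded
```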

Going further

We’ve described in this post the high-level concepts of an industrial-grade instrumentation agent. Be careful with implementation details though, as this is what makes the difference!

Instrumentation agents can be used to handle many different tasks, such as performance monitoring, error monitoring or security.

At Sqreen, our Ruby agent leverages instrumentation techniques in order to protect Rails and Sinatra applications at runtime against security events. Sqreen helps developers get full visibility and protection against security threats. Cyber-attacks are blocked at runtime without traffic redirection or code modification. Suspicious and fraudulent activities from/targeting user accounts are identified to detect attackers early.

Feel free to ask questions if you want to know more about Ruby instrumentation or how we do it at Sqreen!

About the Author

Jean-Baptiste Aviat spent half a decade hunting vulnerabilities at Apple, helping developers solve them, and developing security software. He is now CTO at Sqreen.