Embracing metrics as new tests

November 13, 2017

Topics

Do you often care about what happens to your app after it is deployed? As a diligent developer, you have written good tests, so you assume your code behaves correctly in production. “Correct”, however, does not necessarily mean “efficient”. Read on to see how easy it is to add monitoring tools like Prometheus to your developer workflow, and how to report to Prometheus from your Ruby application.

Life after “git push”

With the emergence of the test-driven philosophy in the last decade, developers no longer question themselves whether they should write tests. That is especially true for the Ruby community. According to a quick survey recently conducted by Vladimir Dementyev, 94% of Ruby developers cover their code with tests. RSpec, Capybara, Cucumber and alike now have their stable place in our tool belts. Continuous integration systems like Travis CI are a must for any decently scaled collaboration project. We know how to make sure that a repository never hosts terrible code.

However, while everything is under control in our cozy development environment, production is a whole different story. It is messy. In a typical web application, your code is tested to its limits. It interacts with other people’s code. It is plummeted by thousands of requests from real users.

It is next to impossible to predict all bottlenecks while testing on localhost; you need to see how your code performs in action. You can rely on your operations team, but no wonders are going to happen unless your code knows how to report its metrics.

Does the following sound familiar?

Inside a project's chat room:

OPERATIONS: Hey, devs! bell_n_whistles_collector daemon eats away bnw01 server's memory exponentially during the last hour. What can we do?
DEVS: How many objects are in bells and whistles queues? Do we still generate last hour's reports?
OPERATIONS: Wish we could tell...

The next several hours are spent on digging through logs—just to find out that the project-wide database had been overloaded with another process, which caused bells_n_whistles_collector to grow enormous queues, which drastically increased RAM usage.

What if such situations could be avoided? What if our bells_n_whistles_collector had generated a beautiful, consistent stream of metrics that would make finding a problem a breeze?

If you see something, say something

Let’s transport ourselves to an ideal world and imagine that our code already is a good citizen that correctly reports its surroundings and notes any suspicious activity. Then ~~the police~~ the operations team can react in time to restore order. First, let’s define what our bells_n_whistles_collector does:

It receives a bell over HTTP;
It increments the corresponding counter in the key/value storage (bells queue);
It processes a bell and puts it into a database (whistles queue);
Every hour, it generates reports based on user templates.

Nothing fancy. Let’s see what kinds of metrics we can get from this:

Response time to the client;
Response time from the database;
Response time from the key/value storage;
Time it took to generate an hourly report;
Number of requests processed (total and per queue);
Number of failed requests (total and per queue);

Having all that data would allow us to graphically represent our key parameters and visually grep any problem instantly.

Enough theory. Meet Prometheus

Let’s set abstractions aside for a moment and talk about instrumentation with Prometheus, as this particular tool is well tested in a number of our projects.

We are going to work with metrics. There can be quite a number of them, so in order not to confuse ourselves, let’s agree that every metric should:

… have a clear, explicit name. For instance:
- http_requests—bad. Nothing in the name tells us about this metric’s type;
- http_server_requests—better, we can infer request’s direction;
- http_server_bells_requests—infer request direction and queue’s name;
- http_server_bells_requests_counter—best! Request direction, queue, suffix. Suffix shows the type of a metric.

The usefulness of this approach will become evident once we need more metrics. Metrics named like metric1 or my_awesome_metric prevent us from grasping a big picture.

… strictly follow a declared type. There are two main types:
- counter—numeric value that can only be incremented;
- gauge—numeric value that can change both ways;

These two types can be compounded into more complex metrics of types histogram or summary. These allow us to see a distribution of our values. Avoid metrics that change their type mid-stream. If yesterday it showed a total number of requests and today it shows a number of requests per second—it is not of good use.

Do not be afraid to overdo metrics. It is easier to skip a few metrics while scraping Prometheus output (a stream of your metrics will be exposed over the endpoint), then to add new ones when the code had already been locked for production.

This is how Prometheus metrics look like in their raw form. As you can see, it is plain text served over HTTP API:

# HELP http_server_bells_requests_total total incoming POST requests to /bells
# TYPE http_server_bells_requests_total counter
http_server_bells_requests_total{instance="bnw01"} 7864
# HELP db_response_time time to complete request to database
# TYPE db_response_time gauge
db_response_time{instance="bnw01", query="insert", db_server="db01"} 0.299516448e+09
# HELP db_whistles_request_total total requests to database in whistles queue
# TYPE db_whistles_request_total counter
db_whistles_request_total{instance="bnw01", kind = "success", db_server="db01"} 7578
db_whistles_request_total{instance="bnw01", kind = "failure", db_server="db01"} 245

HOW-TO: Prometheus and Ruby

The easiest way to start with Prometheus is to use a library for your favorite language from the list. If you are out of luck and your language is not on the list, you may help the community out by writing your own library per these guidelines, or, by using the very same guidelines, build a metrics interface for a single project.

Let’s assume our bells_n_whistles_collector is a Sinatra application. We will install the Prometheus Ruby Client that relies on middleware to operate and initialize it config.ru:

# config.ru

require 'rack'
require 'prometheus/client/rack/collector'
require 'prometheus/client/rack/exporter'

use Rack::Deflater, if: ->(_env, _status, _headers, body) { body.any? && body[0].length > 512 }
use Prometheus::Client::Rack::Collector
use Prometheus::Client::Rack::Exporter

run ->(_env) { [200, { 'Content-Type' => 'text/html' }, ['OK']] }

Now, let’s register a counter in our app.

# app/whistle.rb

require 'prometheus/client'
prometheus = Prometheus::Client.registry
db_whistles_blows_total =
  Prometheus::Client::Counter.new(
    :db_whistles_blows_total,
    'A counter of whistles blown'
  )
prometheus.register(db_whistles_blows_total)

Then we can report a metric like that:

# app/whistle.rb

begin
  # Makes a DB request. We want to know the number and success/failure rate of these requests
  @whistle.blow!
  db_whistles_blows_total.increment(kind: 'success', db_server: 'name_of_db_server_from_config')
rescue StandardError
  db_whistles_blows_total.increment(kind: 'failure', db_server: 'name_of_db_server_from_config')
end

That is all we need to obtain a /metrics endpoint that will serve a list of values over a GET request. Now your operations engineers can scrape all the necessary data in Prometheus format and quickly analyze it using many tools, including Grafana—an open platform for analytics and monitoring that allows you to see graphs for all of your metrics on one neat dashboard:

Grafana dashboard — Grafana’s dashboard interface

As a developer, you give life to code. As painful and rewarding as it is, it would also be immature to assume that the fruit of your work can be abandoned soon after deployment, rediscovered only to fix an occasional bug.

The production environment is where code lives and prospers and using metrics will allow you, and a part of your team that deals with maintaining applications, to always keep a close eye on how your code behaves.

Yes, learning an entirely new tool like Prometheus requires some work, but wasn’t it the same for tests, when you had to study all those new frameworks? Few people wanted to do it right away—it took time and learning from the mistakes of others to fully embrace the TDD approach.

Thus, why not set aside few hours to expand your tool belt once again and obtain a valuable skill that will make your apps what they need to be—truly maintainable.

Meet Yabeda: A Ruby instrumentation framework