Martian Chronicles
Evil Martians’ team blog

Partition and conquer

Here is a tale of database partitioning inspired by real events. We’ll learn how to break down a gigantic table that is slowing down your application in production—Ruby on Rails and PostgreSQL, in our case. Expect a step-by-step guide.

When bigger is no longer better

Databases tend to grow. At a certain point, their size becomes a liability, and we are not even considering extreme cases when the primary key hits the limit (although, this does happen). This article is written from experience: one of our clients, Gett, had a database table that grew over time to menacing proportions.

With several million rows, some queries took hours to complete. That posed a technical challenge that we solved with database partitioning.

Dividing one big table into a number of smaller ones is a standard technique, but it has to be performed with caution, especially when live data is at stake. We will show how to avoid some common pitfalls and make sure nothing is lost in transition. The best way to learn is to try something hands-on, so we will first create the problem by generating a huge table filled with fake data. Then we will solve it together, armed with some PostgreSQL magic.

A real fake table

First, we need data, lots of data. Our initial table will contain orders, with columns mimicking usual business logic.

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  country VARCHAR(2) NOT NULL,                        -- Country code
  type VARCHAR NOT NULL DEFAULT 'delivery',           -- Order type
  scheduled_at TIMESTAMP WITHOUT TIME ZONE NOT NULL,  -- Time the order is scheduled for
  cost NUMERIC(10,2) NOT NULL DEFAULT 0,              -- Order cost
  data JSONB NOT NULL DEFAULT '{}'                    -- Additional data
);

Note: From here on, we will be using pure SQL for queries and PL/pgSQL for functions, as our refactoring primarily concerns a PostgreSQL database. We assume that after we are done, our data will be handled by a Rails application through ActiveRecord.

Next, let’s decide which queries we will have to perform most often. We come up with these:

  • Get one order by its id;
  • Get all orders from a certain country for a certain period, accurate to a minute;
  • Change an order’s cost or its associated data.

With these queries in mind, we can immediately see that it makes sense to index our orders on country and scheduled_at, as that will speed up the most common ones.

CREATE INDEX index_orders_on_country_and_scheduled_at ON orders (country, scheduled_at);
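
To make this concrete, here is roughly what those three queries look like in SQL (the id, country, and time values are purely illustrative):

```sql
-- 1. Get one order by its id:
SELECT * FROM orders WHERE id = 42;

-- 2. Get all orders from a country for a period, accurate to a minute:
SELECT * FROM orders
WHERE country = 'RU'
  AND scheduled_at >= '2018-03-01 00:00'
  AND scheduled_at <  '2018-03-01 12:30';

-- 3. Change an order's cost or its data:
UPDATE orders SET cost = 150.00 WHERE id = 42;
```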

Now we can fill our table using generate_series and some random values. Note that we are inserting 30 million rows, so this may take a while.

INSERT INTO orders (country, type, scheduled_at, cost)
SELECT
  ('{RU,RU,RU,RU,US,GB,GB,IL}'::text[])[trunc(random() * 8)::int + 1],
  ('{delivery,taxi}'::text[])[trunc(random() * 2)::int + 1],
  CURRENT_DATE - (interval '1 day' * (n / 100000)) + (interval '1 second' * (random() * 86400)),
  round((100 + random() * 200)::numeric, 2)
FROM
  generate_series(1, 30 * 1000000) s(n);

Time to break up

We have to set our goals straight. We want to break our orders up, so that:

  • Each resulting table holds all orders for a specific country and a specific month;
  • Our app logic can stay mostly the same.

The easiest way to achieve this is to create child tables, a corresponding trigger, and a trigger function that will distribute records over tables.

But if we want to use ActiveRecord to query our database, there is a catch. In pure SQL, to avoid inserting the same record twice (once into the master table, once into a child table), our trigger procedure needs to return NULL. However, returning NULL does not play nice with ActiveRecord, which expects an INSERT statement with a RETURNING clause to return the primary key of the new record. There are a few ways to solve this problem:


  1. Cater to ActiveRecord and return the new record instead of NULL. We can place each new record both in the master table and in a child table, then delete it from the master table right away. Yes, that is three operations instead of one. Many Rails developers would go this way, only to later give up on the whole idea of partitioning, as performance will inevitably suffer.
  2. Tinker with the ActiveRecord PostgreSQL adapter. Since Rails 4.0.2 it is not that hard: you can set insert_returning to false in the configuration file. It works, but it changes the behavior of all tables in your app, and you get an extra query for each INSERT operation (to fetch the current value of the primary key sequence).
  3. Use views! This allows us to keep all the refactoring at the database level. As a bonus, ActiveRecord works with views naturally, as if they were “normal” tables, so changes to the existing app’s logic are minimal.

Let’s take the last route. First, we need to clone our table. Why do we need a clone? It allows us to:

  • Keep the integrity of existing data and ensure that references from other models stay intact;
  • Keep the existing application live while the partitioning is being deployed (after migrations run, but before the restart);
  • Fall back to the initial table if anything goes wrong in the process.

This is how we do it:

CREATE TABLE orders_partitioned (LIKE orders INCLUDING ALL);

The primary key in the cloned table will use the same sequence as the initial one, orders_id_seq. That helps us avoid collisions when we move data from the old table to the new one.
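
If you want to double-check that the clone picked up the shared sequence, inspect the column default (the exact formatting of the expression may vary between PostgreSQL versions):

```sql
-- Both tables should report a default of nextval('orders_id_seq'::regclass):
SELECT table_name, column_default
FROM information_schema.columns
WHERE table_name IN ('orders', 'orders_partitioned')
  AND column_name = 'id';
```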

What a view!

Now we need to create a view over the new table. The table is still empty, but once we deploy our changes, all new records will go through the view and distribute themselves over the respective child tables on the fly.

CREATE OR REPLACE VIEW orders_partitioned_view AS SELECT * FROM orders_partitioned;

Something is still missing. How about some default values? We certainly need one for the primary key, otherwise INSERT through ActiveRecord will not work (note the orders_id_seq sequence again).

ALTER VIEW orders_partitioned_view
ALTER COLUMN id
SET DEFAULT nextval('orders_id_seq'::regclass);

In theory, other columns could be picked up by ActiveRecord without any defaults, but our application code may still rely on them.
Besides, our trigger function will not work without default values: those fields would stay NULL, and the insert would fail with a not-null violation.

ALTER VIEW orders_partitioned_view
ALTER COLUMN cost
SET DEFAULT 0;

ALTER VIEW orders_partitioned_view
ALTER COLUMN type
SET DEFAULT 'delivery';

ALTER VIEW orders_partitioned_view
ALTER COLUMN data
SET DEFAULT '{}';

Now we need a trigger procedure that can do the following:

  • Determine which child table the record goes to;
  • Create that table if it does not exist yet. Otherwise, we would have to create every table we can think of by hand. Sure, we could automate that in the application, but we want to avoid writing extra code, right?

Time to do some procedural programming in our database. For that, we have PL/pgSQL:

CREATE OR REPLACE FUNCTION orders_partitioned_view_insert_trigger_procedure() RETURNS TRIGGER AS $BODY$
DECLARE
  partition TEXT;
  partition_country TEXT;
  partition_date TIMESTAMP;
BEGIN

  /* Build a name for a new table */

  partition_date    := date_trunc('month', NEW.scheduled_at);
  partition_country := lower(NEW.country);
  partition         := TG_TABLE_NAME || '_' || partition_country || '_' || to_char(partition_date, 'YYYY_MM');

  /*
  Create a child table, if necessary. Announce it to all interested parties.
  */

  IF NOT EXISTS(SELECT relname FROM pg_class WHERE relname = partition) THEN

    RAISE NOTICE 'A new orders partition will be created: %', partition;

    /*
    Here is what happens below:
    * we inherit a table from the master table;
    * we create CHECK constraints for the resulting table,
      so that a record not meeting our requirements
      cannot be inserted into it;
    * we create a necessary index;
    * triple quotes are a feature, not a bug!
    */

    EXECUTE 'CREATE TABLE IF NOT EXISTS ' || partition || ' (CHECK (
      country = ''' || NEW.country || ''' AND
      date_trunc(''minute'', scheduled_at) >= ''' || partition_date || ''' AND
      date_trunc(''minute'', scheduled_at) <  ''' || partition_date + interval '1 month' || '''))
      INHERITS (orders_partitioned);';

    EXECUTE 'CREATE INDEX IF NOT EXISTS ' || partition || '_scheduled_at_idx ON ' || partition || ' (scheduled_at);';

  END IF;

  /* And, finally, insert. NEW is cast to the orders_partitioned row type,
     so the columns line up with the child table. */

  EXECUTE 'INSERT INTO ' || partition || ' SELECT (orders_partitioned ' || quote_literal(NEW) || ').*';

  /*
  Attention: we return the new record, not NULL.
  That is what lets us play nicely with ActiveRecord!
  */

  RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
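
A quick aside before wiring things up: the procedure derives each child table’s name from the record itself, prefixed with TG_TABLE_NAME (which will be the view’s name, since the trigger fires INSTEAD OF INSERT on the view). If your Ruby code ever needs to address a child table directly, say, to archive or drop old partitions, the scheme is easy to mirror. Here, partition_name is a hypothetical helper, not part of the actual setup:

```ruby
# Hypothetical helper mirroring the trigger's naming scheme:
# <view name>_<lowercase country>_<YYYY_MM of the scheduled month>
def partition_name(country, scheduled_at, view_name: 'orders_partitioned_view')
  "#{view_name}_#{country.downcase}_#{scheduled_at.strftime('%Y_%m')}"
end

partition_name('RU', Time.new(2018, 3, 15))
# => "orders_partitioned_view_ru_2018_03"
```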

Now we have:

  • A view that serves as our “table”;
  • A trigger function that distributes new records.

Time to bring them together:

CREATE TRIGGER orders_partitioned_view_insert_trigger
INSTEAD OF INSERT ON orders_partitioned_view
FOR EACH ROW EXECUTE PROCEDURE orders_partitioned_view_insert_trigger_procedure();
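
With the trigger in place, a quick smoke test from psql shows the whole machinery at work (the values are illustrative):

```sql
-- This should create the child table on the fly (watch for the
-- RAISE NOTICE message) and return the new id via RETURNING,
-- which is exactly what ActiveRecord expects:
INSERT INTO orders_partitioned_view (country, scheduled_at)
VALUES ('RU', '2018-03-15 12:00:00')
RETURNING id;
```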

Now, all we need to do Ruby-wise is to teach our Order model to use orders_partitioned_view instead of the original table. Rails handles this naturally: we only need to assign the view’s name to table_name in the model. This feature is often used to work with legacy schemas.
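
In code, assuming a Rails 5-style ApplicationRecord base class (on Rails 4 and earlier it would be ActiveRecord::Base), the change is a one-liner:

```ruby
class Order < ApplicationRecord
  # Point the model at the view; ActiveRecord treats it like a regular table.
  self.table_name = 'orders_partitioned_view'
end
```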

But wait, what’s with the data?

Yes, we forgot our existing data. It is still sitting in the initial table, and we need to move it carefully to the new one. The problem is that we have a lot of data (several million rows, remember?), and there is more than one way to deal with it. The good news is, we rarely need to transfer all of it: usually, filling the child tables for the most recent months is enough. Here is how we do it for the current month:

INSERT INTO orders_partitioned_view
SELECT * FROM orders
WHERE scheduled_at >= date_trunc('month', now());
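
If you do need older months too, it is safer to move them in chunks, month by month, rather than in one huge transaction, so that each batch stays small and can be retried. A sketch for the previous month:

```sql
-- Move the previous month in its own transaction:
INSERT INTO orders_partitioned_view
SELECT * FROM orders
WHERE scheduled_at >= date_trunc('month', now() - interval '1 month')
  AND scheduled_at <  date_trunc('month', now());
```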

How to deploy this with no regrets

So, we have our manual. Using it in production is an entirely different story—we will most certainly run into some difficulty we could not foresee.

That is precisely what happened when we used this technique in production. The service that needed partitioning was marked as “high-availability”, meaning we could not afford more than 15 minutes of downtime.

All live requests to the service went through a message queue, so we could hold them for a bit and execute them once we were done. But we could only wait for so long.

Downtime was inevitable, as the business logic of our app did not rely entirely on ActiveRecord, but also used raw database queries (if you can avoid that, you will save yourself a lot of trouble). So just switching the table name in a Rails model was not an option. Since a maintenance window was unavoidable anyway, we could also skip cloning the table and rename it in place instead, combining a few steps:

ALTER TABLE orders RENAME TO orders_partitioned;
CREATE OR REPLACE VIEW orders AS SELECT * FROM orders_partitioned;

/* ...and so on */

As we are talking about live business data that is constantly operated upon, the release of our changes should be marked as “risky”. Here is an ideal deployment checklist:

  • The operations engineer is aware of the upcoming release and is on duty.
  • The database administrator is also aware and present.
  • Everything that can be monitored is monitored: disk I/O, RAM and CPU usage, network requests, database queries, app requests, background tasks, logs.
  • We have a “release plan” vetted by another developer or a team lead.
  • The developer who wrote the migration code is reachable.
  • The team lead is aware and present.
  • All steps can be reproduced manually.
  • We have a rollback plan.

This may look like overkill, but you can never be too careful; take your time to develop a proper deployment protocol.

So, should I partition or not?

Examine the size of your table and, most importantly, estimate its growth over time. Think about which queries are most common. If all you do is SELECT by a primary or a foreign key, you probably don’t need partitioning. If you group your data in any way (by time periods, for instance, as in our example), and you run those grouping-related queries regularly, partitioning can make your life much easier.

Why shouldn’t I just…

Sometimes you need partitioning that is more complex and granular than ready-made solutions offer. But if you are just breaking things up one column at a time, you are welcome to try them:

  • …use PostgreSQL 10 (and pg_party, just to have some fun)?

Declarative partitioning indeed makes life easier, so if you can use it, do not hesitate. Unfortunately, you are rarely free to use the latest version of anything; if you can afford the upgrade, make sure you have a deployment plan (see above).
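
For comparison, here is a rough sketch of how a similar layout could be declared natively in PostgreSQL 10 (table names and bounds are illustrative; pg_party provides Ruby-side helpers on top of this):

```sql
CREATE TABLE orders (
  id BIGSERIAL,
  country VARCHAR(2) NOT NULL,
  scheduled_at TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  cost NUMERIC(10,2) NOT NULL DEFAULT 0
) PARTITION BY LIST (country);

-- One partition per country, sub-partitioned by month:
CREATE TABLE orders_ru PARTITION OF orders
  FOR VALUES IN ('RU')
  PARTITION BY RANGE (scheduled_at);

CREATE TABLE orders_ru_2018_03 PARTITION OF orders_ru
  FOR VALUES FROM ('2018-03-01') TO ('2018-04-01');
```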


Ready to try? Remember that writing migration code is only half of the job—the other half is making sure your team is prepared to execute it properly.

Further reading: