PostgreSQL and Rails, sitting in a tree


The Closure Table pattern has something that other tree storage methods don’t—a good design. Let’s see why it rocks.

Some time ago I was involved in a project aiming to systemize a pandemonium of databases. The toponym (place name) databases were especially hard. The source data was a deep XML tree with regions and cities at the top and streets and houses at the bottom.

| id         | parent_id  | name       |
| ---------- |------------|------------|
| 1          | 2          | Kremlin    |
| 2          |            | Moscow     |
| 3          | 2          | Red Square |
| …   (1M+ unsorted records)         … |

I had to put this tree into PostgreSQL to allow queries. A typical query was, for example, to fetch all the streets of the cities within a specific region or to fetch the whole path from a particular street at the bottom up to the root node, etc.

Since there were quite a few queries to run, they had to be as fast as possible, so I needed a hierarchical storage pattern to speed them up.

So I did the following:

  • Imported the table;
  • Chose the suitable hierarchical pattern;
  • Calculated the extra reference data used by the chosen pattern.

At each of the three stages, speed was what mattered. I wanted to do everything in next to no time: few things are worse than an unresponsive server, and downtime was unacceptable.

Importing the initial data

In PostgreSQL, a table is best populated with the COPY FROM statement: it works significantly faster than multiple INSERT statements. Make sure to prepare the input data accordingly: COPY FROM accepts either CSV or a binary file/string as input, and binary is faster.

The pg_data_encoder gem can be used to convert data to the binary format PostgreSQL expects. It is designed to work with postgres-copy, but I prefer Sequel because it is very lightweight compared to ActiveRecord and lets you work with tables directly, bypassing the application model layer.

The import script might look like this:

### import.rb

# Will connect to the database configured in Rails' database.yml
db = Sequel.connect(ActiveRecord::Base.connection_config)

encoder = PgDataEncoder::EncodeForCopy.new

# Suppose the data have already been read.
# In my case, I had a set of DBF's (doh! at 2015!).
[
  [1, 2, "Kremlin"],
  [2, nil, "Moscow"],
  [3, 2, "Red Square"]
  # …
].each { |line| encoder.add(line) }

db.copy_into(
  "tax_authorities",
  columns: %i[id parent_id name],
  format: :binary,
  data: encoder.get_io
)

# Run it with something like:
# > bundle exec rails runner import.rb

If your data set does not fit in memory, you have two options:

  • Import incrementally, part by part (read-copy loop).
  • Pass use_tempfile: true as an option to PgDataEncoder::EncodeForCopy.new to enable file buffering.

Incremental import might be faster.
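A read-copy loop, for example, might look something like this (a rough sketch: the rows source, the chunk size, and the column list are assumptions based on the table above):

# rows is any enumerable that yields [id, parent_id, name] arrays,
# e.g. a lazy reader over the source files.
rows.each_slice(50_000) do |chunk|
  encoder = PgDataEncoder::EncodeForCopy.new
  chunk.each { |line| encoder.add(line) }

  # Each chunk is encoded in memory and flushed with its own COPY FROM.
  db.copy_into(
    "tax_authorities",
    columns: %i[id parent_id name],
    format: :binary,
    data: encoder.get_io
  )
end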

Choosing a tree pattern

There are only a few ready-to-use ActiveRecord hierarchy plugins, and each of them implements one of the classic tree patterns: Nested Set, Materialized Path, or Closure Table.

Let's try to choose one.

Nested Set is the worst option: every insert or move forces a renumbering of the left/right boundaries across a large part of the tree, so writes are expensive and easy to get wrong. Even though I am going to read the tree way more often than update it, such problems still scare me. Anything but this.

Materialized Path is much better, but reference columns are required: two additional columns, ancestry and ancestry_depth, have to be added and calculated to make everything work. The ancestry gem provides the #build_ancestry_from_parent_ids! method, which calculates the initial values.
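For reference, the Materialized Path setup with the ancestry gem might look roughly like this (a sketch; the model name and the cache_depth option for the ancestry_depth column are my assumptions):

# Assumes the ancestry and ancestry_depth columns have already been
# added in a migration.
class TaxAuthority < ActiveRecord::Base
  has_ancestry cache_depth: true
end

# Calculate the initial values from the existing parent_id column:
TaxAuthority.build_ancestry_from_parent_ids!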

It took about 44 minutes to calculate the values for 1 188 483 records and save them into the two columns. The production server won't be as write-friendly as my local machine, and the source database needs regular updates and will keep growing in size; the more it grows, the slower this gets. Much better, but still unacceptable.

Closure Table is something in-between:

  • Slower than Materialized Path but competitive.
  • The easiest algorithm which best fits in with the relational model.
  • Can use foreign keys to maintain referential integrity.
  • No need to modify existing table schema.

In this case, an extra table has to be created. This table will contain the links between each parent and all of its descendants. Call the #rebuild! method to fill this table with the initial values.
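Having every ancestor–descendant pair materialized is exactly what makes the typical queries from the beginning of the post cheap. With Sequel they might look roughly like this (a sketch, assuming closure_tree's default naming of a tax_authority_hierarchies table with ancestor_id, descendant_id, and generations columns):

db = Sequel.connect(ActiveRecord::Base.connection_config)

# Everything below Moscow (id = 2): a single join, no recursion.
db[:tax_authorities]
  .join(:tax_authority_hierarchies, descendant_id: :id)
  .where(ancestor_id: 2)
  .exclude(generations: 0)
  .select_all(:tax_authorities)
  .all

# The whole path from the Kremlin (id = 1) up to the root node.
db[:tax_authorities]
  .join(:tax_authority_hierarchies, ancestor_id: :id)
  .where(descendant_id: 1)
  .order(:generations)
  .select_all(:tax_authorities)
  .all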

The default #rebuild! method is slow

By default, it enumerates every record in the source table and creates the corresponding records in the hierarchies table with an INSERT INTO … SELECT statement, one statement per source row. With over a million rows, that many inserts are never fast: rebuilding the hierarchy from scratch takes hours. Unacceptable.

Speeding up the closure_tree #rebuild!

The reference table can be populated much more quickly if its contents are calculated in memory and then stored with a single COPY FROM statement.

The data for your reference table might look like this:

  | ancestor_id | descendant_id | generations |
  |-------------|---------------|-------------|
  | 1           | 1             | 0           |
  | 2           | 2             | 0           | # Moscow itself
  | 3           | 3             | 0           |
  | 2           | 1             | 1           | # Moscow/Kremlin
  | 2           | 3             | 1           | # Moscow/R. Sq.
  | …           | …             | …           | (long long array)

In the end, we need to do the following (a simplified sketch in code follows the list):

  • Fetch the source table id and parent_id columns via a single SELECT.
  • Find all the descendants of each fetched id.
  • Make an in-memory array with the resulting values as shown above.
  • Save it to the reference table using the COPY FROM statement.
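Put together, a simplified version of these steps might look like this (a sketch of the idea, not the gem's actual code; the hierarchy table name is assumed to follow closure_tree's convention):

db = Sequel.connect(ActiveRecord::Base.connection_config)

# Fetch id and parent_id with a single SELECT.
parents = db[:tax_authorities].select_hash(:id, :parent_id)

# For every node, walk up to the root and collect
# (ancestor_id, descendant_id, generations) rows in memory.
encoder = PgDataEncoder::EncodeForCopy.new

parents.each_key do |id|
  ancestor   = id
  generation = 0

  while ancestor
    encoder.add([ancestor, id, generation])
    ancestor    = parents[ancestor]
    generation += 1
  end
end

# Store everything with a single COPY FROM.
db.copy_into(
  "tax_authority_hierarchies",
  columns: %i[ancestor_id descendant_id generations],
  format: :binary,
  data: encoder.get_io
)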

I have encapsulated these steps into a reusable library.

My pg_closure_tree_rebuild gem will do the job!

You can use it from the command line:

> bundle exec rake closure_tree:rebuild DATABASE_URL=postgres://localhost/tax_authorities TABLE=tax_authorities

or from your code:

# seeds.rb
PgClosureTreeRebuild::Runner.new(
  Sequel.connect(ActiveRecord::Base.connection_config), 'tax_authorities'
).run

Result

It only took about 3.5 minutes to deal with 1 188 483 records, a job that otherwise would have taken hours (Benchmark output: user, system, total, and real time in seconds):

138.100000   4.060000 142.160000 (203.340859)

That is 12.5 times faster than the Materialized Path rebuild. Awesome.

The result was indeed worth the time and effort spent. It was exactly the solution I wanted. Of course, the fact that it paid off in my case does not mean you should stick with it too: if your task differs from mine, another pattern or a combination of patterns might do a better job. Whatever option you prefer, though, remember to avoid Nested Set. Below is a summary of what I've learned; I hope you will find it useful.

  • Use COPY FROM to populate tables whenever possible.
  • Use the Closure Table pattern (and the closure_tree gem), as it is the simplest tree pattern, especially if you mostly read the tree and rarely update it.
  • Use the pg_closure_tree_rebuild gem instead of the standard #rebuild! method of closure_tree as it is faster.
