Flying free: a tale of partitioning with canaries in PostgreSQL

We had spent more time than we'd like to admit on a PostgreSQL database partitioning project; it remained untested, and, for logistical reasons, we were unable to do a feature freeze. Then, a breakthrough—and a bird was the word: canary. Canary testing, to be specific. We had such a good experience that we decided to share what happened for anyone in a similar conundrum!

If you’re working with a mature project, or even if you’re working with an early-stage startup, before you know it, you might see database performance degradation. Thus, you’ll need to redesign your database, and if you want to do it gracefully (without messing up anybody’s data) this post is for you!

There are a few things to remember upfront: first, database partitioning is a technique used to divide a large database into partitions. Each partition can be stored on a different physical storage device or server, potentially improving performance and manageability.

Canary testing involves exposing a small subset of users (the “canaries”) to new changes before rolling out to the entire user base. With this approach, we can better monitor for any issues or unexpected behavior and gather feedback in a controlled environment.

In this post, we’ll discuss doing something we haven’t really seen much in the wild (canary-based partitioning).

This is a tribute to anyone who has experienced the struggle of trying to stay awake during the monotonous, boring process of manually fixing an endless stream of queries. Oh, and it's also a tribute to real canaries everywhere, past, present, and future!

The project we were working on

Some background about the project: at the time of partitioning, the app served over 800,000 users, and the table we were about to partition held close to 60 million records, with a table size of about 140 gigabytes.

The high user growth rate and the development of complex features led to repeated database bloating, as well as a surge in the number of database queries.

The platform team clearly understood that partitioning was crucial. However, just as preparations began, user numbers and data volumes soared again, and with that, the prospect of making any changes became increasingly frightening.

Assessing the situation in detail

In this case, beyond just partitioning a gigantic database, which Evil Martians were already equipped to do using PostgreSQL, it was also critical to fix a huge volume of bad SQL queries—otherwise the database would be unstable and prone to frequent crashes due to memory errors.

Fixing SQL queries is typically the most resource-intensive, monotonous, and, of course, boring part of partitioning. These queries can come from a variety of sources: they can be written explicitly (as raw SQL) or built with ActiveRecord/Arel, and they can be generated both by internal project services and by third-party libraries.

As with many similar cases, the time factor made things even more urgent: the queries needed to be corrected immediately, since the database could not handle the current arrangement. Further complicating matters, amidst hundreds of files, it was a mystery where to place the blame in the first place.

The trouble finding the trouble

But what exactly was preventing us from finding the queries that needed to be "partitioned" quickly and efficiently—without hurting production? Essentially, it came down to the three factors we'll discuss below.

Factor #1: A lack of query statistics, and no tools to collect them. Tests are a good start, but they'll probably give the wrong impression, since they exercise all the code paths equally (and that's the optimistic case). Meanwhile, it's critical we fix any queries causing bottlenecks ASAP (and we'll need to fix all of them eventually).

More pessimistically, when your tests don’t have 100% coverage, you can lose many hours looking for suspicious mentions of the table in question.

Factor #2: You cannot (or don't want to) enact a feature freeze. If you're actively developing your project, this adds challenges to the partitioning process and makes it difficult to estimate (and if we don't even have statistics, well… you get the idea). Active development is, of course, also exactly what makes a feature freeze impossible in the first place.

Factor #3: Project complexity. The more complexity, the stronger the need to collaborate with the primary developers, especially regarding legacy code. If you’re unable to work and collaborate effectively (or if you don’t have enough time) this will lead to even greater pressure.

Faced with such a scenario, many of us would likely prefer to just start cursing like a sailor… err, cursing like a miner? Better, though, to find a way to collect the statistics we need and to test ready-made fixes on real users, all while ensuring minimal negative impact on the project.

So, we needed canaries because these beautiful birds would be able to tell us all we needed to learn about any “bad” queries. And if something were to go wrong (as is often the case), some canaries will, regrettably, perish. However, this loss would ultimately be for the sake of a greater cause.

Who will be the canaries?

With this particular customer project, we partitioned the platform’s assets (pictures, documents, and so on) using “organizations” as canaries, since assets belong to organizations.

However, for regular projects, things are usually less complex and canaries can simply be, for instance, users. Therefore, in this post, we’ll use user_id to make this point clearer.

Okay, but how to assign canaries? At the beginning of the project, we hand-picked organizations (users) that were alive and active, but which were not particularly large or significant in an overall sense.

Remember: the starting group of canaries shouldn’t be too large. If there are problematic queries, having a large number of canaries would cause serious performance problems.

Also note: you should not take risks. The primary or most active users would highlight what you need to know quickly, sure… but at the cost of your time and reputation. Instead, select users who are active, but not too active, and those who are more willing to wait while you optimize.

From there, gradually increase the number of canaries until, eventually, you reach a point where everyone is a canary, yet they are all happy, thriving canaries.

Setting up a linter

We’ll start with a tool that can determine whether a query requires any improvements. In other words, we need something that can:

  1. Receive a query
  2. Analyze all the statements in that query
  3. Determine if there are statements involving a partitioned table
  4. Check the statements found in Step 3 to see if they contain a suitable condition on the partition key
  5. If no suitable condition exists, report the query, preferably indicating the file and line where this query was executed—and the method

To create a tool for this, we used ActiveSupport notifications and the pg_query library. The basic framework of our linter might look like this:

class PartitionKeyReporter
  attr_reader :table_name, :partition_key

  def initialize(table_name, partition_key)
    @table_name = table_name
    @partition_key = partition_key
  end

  def sampling_percentage
    @sampling_percentage ||= 10
  end

  def check_query?
    # Check everything possible in non-production environments
    return true unless Rails.env.production?

    # Check only a given percentage of queries in production environments
    rand(100) < sampling_percentage
  end

  def call(*_args, payload)
    return unless check_query?

    return if payload[:sql].blank?

    # The 'Resource' here is the name of the model; Rails really can be very helpful sometimes.
    if payload[:name] == 'Resource Create'
      report(payload[:sql]) unless payload[:sql].include?(partition_key)
    else
      parsed = PgQuery.parse(payload[:sql])
      return unless parsed.tables.include?(table_name)

      # The core part of the query check goes here
    end
  end
end

We can subscribe it to the events we need this way:

ActiveSupport::Notifications.subscribe 'sql.active_record', PartitionKeyReporter.new('resources', 'user_id')

In this code, resources is the table in which we want to analyze queries, and it corresponds to the Resource model in Rails. Meanwhile, user_id is our partition key and it corresponds to the user ID.

The full version of our linter was the result of a deal with the Devil: it's rather hellish because the library we used didn't have the ready-made functionality we needed, so we took one of its methods as inspiration and first built a simple version for more or less simple queries.

Later, it turned out that the project had huge queries with multi-level joins, so we had to rework the linter several times. For example, one of the versions caught not only everything we needed but also more than we expected (meaning false positives).

In the end, we had about 100 lines of code: not that elegant, but it did its job well enough. And, of course, we had no time to cover it with tests.

The linter can be used in tests, on staging, and in production to analyze a small portion of queries, since parsing and exploring them is memory- and time-consuming.

However, if you select a small group of users, you can analyze more queries from that particular group. Moreover, we can implement partitioning so that only this small group enjoys all partitioning benefits, while other users remain completely unaware of what we are doing with the database. (Full code snippet here)
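
As a rough illustration (this is a sketch, not our exact code, and it assumes the canary processes are flagged with the same CANARY environment variable introduced later in this post), the sampling logic could be loosened for those processes:

class PartitionKeyReporter
  # ...

  def check_query?
    # Check everything possible in non-production environments
    return true unless Rails.env.production?

    # Canary processes only serve a small, hand-picked group of users,
    # so we can afford to inspect a much larger share of their queries
    # (50% here is an arbitrary example value)
    return rand(100) < 50 if ENV['CANARY']

    # For everyone else, keep the sampling light
    rand(100) < sampling_percentage
  end
end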

This first group of users will be our canaries.

Setting up the database

If you're curious, one of our previous articles, A slice of life: table partitioning in PostgreSQL databases, explains the difficulties that await on the partitioning path and what to pay attention to. In this post, we'll just share the information necessary to create the canaries.

  1. Create a partitioned table: resources_partitioned.
  2. Add triggers for operations to resources; these will copy changes to resources_partitioned.
  3. Copy data from resources to resources_partitioned. In an ideal world, we would take advantage of downtime (read-only downtime in most cases), but if this is not possible, we can copy the data in chunks.
  4. Add operation triggers to resources_partitioned; these will copy changes back to resources once the canary processes start writing to the partitioned table. At this point, all changes to resources_partitioned are still applied only by the triggers on resources. (A rough sketch of the first two steps follows this list.)
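
The exact DDL depends heavily on your schema and PostgreSQL version, so treat the following as an illustrative sketch only: the column list, the hash partitioning strategy, and the trigger body are all assumptions, not the real setup.

class CreateResourcesPartitioned < ActiveRecord::Migration[7.0]
  def up
    # Step 1: the partitioned table, hash-partitioned by the partition key
    execute <<~SQL
      CREATE TABLE resources_partitioned (
        id bigint NOT NULL,
        user_id bigint NOT NULL,
        -- ...the rest of the columns from resources...
        PRIMARY KEY (id, user_id)
      ) PARTITION BY HASH (user_id);

      CREATE TABLE resources_partitioned_0 PARTITION OF resources_partitioned
        FOR VALUES WITH (MODULUS 8, REMAINDER 0);
      -- ...and so on for the remaining partitions
    SQL

    # Step 2: a trigger on the original table that mirrors every write
    execute <<~SQL
      CREATE FUNCTION copy_resources_to_partitioned() RETURNS trigger AS $$
      BEGIN
        IF (TG_OP = 'DELETE') THEN
          DELETE FROM resources_partitioned
            WHERE id = OLD.id AND user_id = OLD.user_id;
          RETURN OLD;
        END IF;

        -- A real trigger would list every column here
        INSERT INTO resources_partitioned (id, user_id)
        VALUES (NEW.id, NEW.user_id)
        ON CONFLICT (id, user_id) DO UPDATE SET user_id = EXCLUDED.user_id;
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER resources_copy_changes
        AFTER INSERT OR UPDATE OR DELETE ON resources
        FOR EACH ROW EXECUTE FUNCTION copy_resources_to_partitioned();
    SQL
  end
end

Copying the existing data (step 3) and the reverse triggers on resources_partitioned (step 4) follow the same pattern.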

When to add indexes

It's clear that it's better to add indexes after filling the resources_partitioned table with data. However, adding indexes at that point, while the triggers are running and activity is high, can negatively impact performance. The load on the table and the size of the partitions will matter here (although each will be significantly smaller than the monolithic table), so keep the following three things in mind.

First, don’t make partitions too big.

Second, before adding indexes to production, estimate the cost of adding an index on a volume of data comparable to production volume, for instance, on staging.

Third, if adding indexes and merging them really takes too much time, taking into account the workload, you can:

  • Disable triggers
  • Add indexes
  • Enable triggers and sync the data that changed in the meantime (or finish syncing it). You'll need to determine which records were changed or deleted: the Rails way handles the first part via the updated_at timestamp (a rough sketch follows this list). Detecting deletions may require some preliminary preparation before disabling the triggers, or a traversal of the table, but it's also feasible.
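
For instance, the catch-up for updated records might look roughly like this; ResourcePartitioned is a hypothetical model backed by the new table, and TRIGGERS_DISABLED_AT is however you recorded the moment the triggers were turned off:

# The moment the triggers were disabled (recorded beforehand)
triggers_disabled_at = Time.zone.parse(ENV.fetch('TRIGGERS_DISABLED_AT'))

Resource.where('updated_at >= ?', triggers_disabled_at).find_each do |resource|
  # Re-copy every row touched while the triggers were off;
  # upsert writes straight to the table, skipping callbacks and validations
  ResourcePartitioned.upsert(resource.attributes, unique_by: %i[id user_id])
end

# Deleted rows need separate bookkeeping (for example, an audit table populated
# before the triggers were disabled), since they no longer exist to compare against.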

When to add foreign keys

If we're being responsible developers who care about data integrity, we shouldn't forget about foreign keys. Our task here is much easier because we can make these keys NOT VALID, but there are a few points worth paying attention to:

  • The primary key in the partitioned table will be a composite key, and accordingly, the foreign keys will become composite. Add the appropriate fields to the tables you need in advance.
  • Not all constraints can be created as NOT VALID (see the PostgreSQL documentation: https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-DESC-ADD-TABLE-CONSTRAINT). Luckily, we can create NOT VALID foreign keys in each partition, validate them, and then create a valid foreign key on the parent table (so the existing constraints in the partitions will be used); a sketch of this trick follows.
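
A hedged sketch of that trick; the partition names, constraint names, and the referenced users table are illustrative assumptions, and steps 1 and 2 would be repeated for every partition:

class AddUserForeignKeyToResourcesPartitioned < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def up
    # 1. Add the constraint to each partition as NOT VALID (cheap, no full scan)...
    execute <<~SQL
      ALTER TABLE resources_partitioned_0
        ADD CONSTRAINT resources_partitioned_0_user_id_fk
        FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
    SQL

    # 2. ...then validate it separately, which only takes a light lock...
    execute <<~SQL
      ALTER TABLE resources_partitioned_0
        VALIDATE CONSTRAINT resources_partitioned_0_user_id_fk;
    SQL

    # (repeat steps 1 and 2 for the remaining partitions)

    # 3. Finally, add the constraint to the parent table: PostgreSQL reuses the
    #    already-validated constraints on the partitions instead of rescanning the data
    execute <<~SQL
      ALTER TABLE resources_partitioned
        ADD CONSTRAINT resources_partitioned_user_id_fk
        FOREIGN KEY (user_id) REFERENCES users (id);
    SQL
  end
end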

Setting up the infrastructure

Now that we have two fully synchronized tables and canaries ready to die for us, all we need are some “coal mines”. In other words, we’re going to set up new application processes to handle the workload of our canary users.

At the code level this is a trivial solution (for ActiveRecord, at least):

class Resource < ApplicationRecord
  if ENV['CANARY']
    self.table_name = :resources_partitioned
    self.primary_key = :id # Unless you're using the new Rails composite primary keys feature!
  else
    self.table_name = :resources
  end
end
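
For reference, if you're on Rails 7.1 or later (see the bonus section at the end of this post), the canary branch could declare a composite primary key instead; this is a hedged sketch assuming the partitioned table's primary key is (id, user_id):

class Resource < ApplicationRecord
  if ENV['CANARY']
    self.table_name = :resources_partitioned
    # Rails 7.1+ composite primary keys: mirror the partitioned table's PK
    self.primary_key = [:id, :user_id]
  else
    self.table_name = :resources
  end
end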

At the infrastructure level, a bit more effort is required. This is because we need to duplicate every process that generates queries to our database, enable the CANARY environment variable for the duplicates, and also figure out how to properly route the workload (like network requests and background jobs) from our canary users to these processes.

The exact steps to do the above will vary depending on your current infrastructure, but there are still some common pitfalls you can avoid.

Creating the canary app

The general rule of thumb when creating the canary application processes is to make sure that they behave exactly like the original ones, apart from the fact that they use the new, partitioned table. This will help us avoid sneaky bugs that only happen for the canary users.

Usually, this means we just need to do these two things:

  • Keep the environment variables list in sync between our main and canary apps.
  • Configure the CI to build and deploy the canary app similarly to the main one.

However, take extra care not to accidentally duplicate side effects that we only want in our main app, like running database migrations, triggering third-party APIs during CI builds, cron jobs, and so on.

And remember to check that you have enough database connections to support the new processes, or that you're using a connection pooler like PgBouncer.

Routing network traffic

The most straightforward way to split traffic between the main and canary apps is to configure a proxy in front of them; this proxy will handle the routing based on a request header that contains your partitioning key.

Your clients are likely already sending a header like that with every request, either explicitly or implicitly (e.g., a user ID inside a JWT token), so all that's left to do is implement the routing logic.

The only limit here is your imagination, of course. Nonetheless, it's best to start simple: pick a few "early birds", hardcode their IDs straight into the proxy's code (or use an environment variable), and route all requests with these IDs to the canary app.

Once the canary app feels stable enough, we can start converting users into canaries more or less indiscriminately by loosening the routing logic (e.g., by hardcoding whole ranges of IDs instead of individual IDs), as in the sketch below.
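
Here's roughly the shape of that routing decision, written in Ruby for illustration; in our case, the equivalent logic lived in an edge function sitting in front of both apps, and the header name, hostnames, and ID values below are made up:

# Hand-picked "early birds" plus, later on, whole ID ranges
CANARY_USER_IDS = ENV.fetch('CANARY_USER_IDS', '').split(',').map(&:to_i)
CANARY_ID_RANGES = [(10_000..19_999)]

def canary?(user_id)
  CANARY_USER_IDS.include?(user_id) ||
    CANARY_ID_RANGES.any? { |range| range.cover?(user_id) }
end

def upstream_for(headers)
  # The partition key arrives with every request (as a header or inside a JWT)
  user_id = headers['x-user-id'].to_i
  canary?(user_id) ? 'https://canary.internal.example.com' : 'https://main.internal.example.com'
end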

The main thing to keep in mind here is that the routing has to be consistent: if you start routing requests from the same user to both the main and the canary apps, you risk running into deadlocks.

As for the proxy itself, you can pick anything that works for you. However, if you're struggling with this, check whether your cloud provider offers edge computing capabilities—in our experience, deploying an edge function that acts as a reverse proxy is the most flexible and hassle-free way of achieving what you want.

Routing the background jobs

When it comes to background jobs, things get a bit more tricky. As opposed to network requests, background jobs aren’t necessarily executed in the context of a single user. And, even if they are, it’s not always easy to tell that just by looking at their arguments list. This means implementing routing the right way might turn out to be quite time-consuming.

That said, missing a partitioning key in an SQL query performed inside a background job imposes much lower risks than making the same mistake in code that’s being executed during a network request. This is because background jobs are not time-bound to the extent that network requests are, and they usually have a sturdy retry mechanism in place.

So, while ideally we want the entire workload of a single user to be processed by our canary application processes (to avoid potential deadlocks), it's not the end of the world if we decide to route background jobs between our main and canary workers using simpler logic.

For example, we could try routing the jobs by their type and gradually migrating them to use the partitioned table one by one. This is a bit more risky than implementing a more sophisticated routing, but it requires less effort, and (we think) by carefully choosing the order in which the job types are migrated, we can reasonably minimize the risks.

In any case, no matter which way you go, you’d still need to set up a “reverse-proxy” for your background jobs processor (Sidekiq, SolidQueue, etc.) that would dynamically route the jobs between the existing queues and the queues processed by the canary background workers.

In the case of Sidekiq, this would be a custom client middleware that routes the jobs between the original queues and their canary counterparts (e.g. default and default_partitioned) based on the job type and the arguments list.
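
A hedged sketch of such a middleware; the job class list is a made-up example, and for ActiveJob-wrapped jobs you'd inspect job['wrapped'] instead of job['class']:

class CanaryQueueRouter
  # Job types we've already verified against the partitioned table
  CANARY_JOB_CLASSES = %w[Resources::CleanupJob Resources::ThumbnailJob].freeze

  def call(_worker_class, job, _queue, _redis_pool)
    if CANARY_JOB_CLASSES.include?(job['class'].to_s)
      # Send the job to the queue served by the canary workers,
      # e.g. default -> default_partitioned
      job['queue'] = "#{job['queue']}_partitioned"
    end

    yield
  end
end

# config/initializers/sidekiq.rb
Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add CanaryQueueRouter
  end
end
# If jobs are also enqueued from inside other jobs, add the middleware to the
# server's client middleware chain as well (Sidekiq.configure_server).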

Make sure to check if you have other middleware that is also assigning the queues dynamically—if you do, put your middleware after them in the chain.

Once we’re confident that all SQL query issues are fixed, we can enable partitioning by default for our main workers → remove the routing logic → wait until the *_partitioned queues are empty → and, finally, drop the canary workers.

Pros and cons of this canary approach

Let’s look at the pros and cons of what we’ve done, starting with the upsides.

Pro #1: You get live statistics on the SQL queries that are not fixed yet—and you get them in the process of a real project’s operation, with real query data (taking obfuscation into account, of course). These stats will save you a lot of time and allow you to understand which code requires priority fixes, and which is used so rarely that it can wait.

Pro #2: Gradual code preparation for partitioning. By analyzing the collected statistics and granting canary status to larger and larger groups of users, you'll see (thanks to the resulting surge in load) how well you're optimizing, and you'll be able to continue with minimal negative impact on users. The ideal conclusion sees all users become canaries.

Pro #3: You get flexibility because you can test partitioning (or your finished portions) on any group of users at any time.

Pro #4: The ability to roll back partitioning for all users, or just some of them. Since the regular and partitioned tables are synchronized, you won't lose data if problems arise, and you buy the time needed to refine the remaining queries, or even to abandon partitioning altogether.

Now, the downsides.

Con #1: An increasing amount of data. If your database is limited in size while the partitioned table takes up a lot of disk space, you may need to increase the database size limits.

Con #2: Increasing load on the database. Since all operations are duplicated, you'll get at least twice as many reads and writes on the target (main and partitioned) tables. This means that, before you start partitioning, you might want to carry out some general database optimization (it's useful in any case, but it can take some time).

Con #3: Deadlocks. Since we're duplicating operations, the same resource gets updated in both the partitioned and the regular table simultaneously. Say, for example, this happens via two workers: one already "partitioned" (i.e., it works with the partitioned table) and one that isn't yet. This can lead to deadlocks, so think about the specific order in which you'll change your code to minimize the chances of a deadlock.

Recommendations and final thoughts

Since canaries are a way to prepare for a partitioning release and a method that makes a rollback possible, it’s rational to use them in situations where you’d expect stumbles. For instance, these cases include: when the number of queries that need to be fixed is large; when the time spent on working with them is difficult to estimate; and when the code is complex and confusing and you don’t know it from top to bottom.

The canary approach eliminates a number of problems: for example, you don't need to continuously pull all the latest changes into your branch and check all new queries—you're already partitioning, it's just not for everyone yet. It also mitigates some risks: you always have the original, fully working table with the same data as the partitioned one, so you can always reduce the number of canaries or stop the entire partitioning process.

Here are some final considerations:

First, feel free to add tests to ensure that queries are partitioned correctly. This will prevent many problems in the future when your edits fall into the hands of other developers.

RSpec::Matchers.define :use_partitioning do |table_name, partition_key|
  supports_block_expectations

  match do |actual|
    success = true

    callback = lambda do |event|
      query = event.payload[:sql]

      next unless query.match?(/\s+"?#{table_name}"?\s+/)
      next if query.match?(/"?#{table_name}"?\."?#{partition_key}"?\s+(=|BETWEEN|IN)/)

      success = false
    end

    expect(callback).to receive(:call).at_least(:once).and_call_original
    ActiveSupport::Notifications.subscribed(callback, 'sql.active_record', &actual)

    success
  end
end
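
For example, the matcher supports block expectations, so it can wrap any piece of code that hits the database (the spec below is hypothetical, and user is assumed to come from a let):

it 'always scopes resource queries by the partition key' do
  expect {
    Resource.where(user_id: user.id).order(created_at: :desc).limit(10).to_a
  }.to use_partitioning('resources', 'user_id')
end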

Second, don't forget monitoring! For instance, the size of the database matters, especially if you have restrictions on it while your tables are growing fast. Also check the indexes: you don't need the ones that aren't in use and were only created "because everybody does it." There's no need to waste resources on them.

Finally, communicate! Your colleagues should not only know that you have canaries, they must also be aware that you have two tables in your database with the same data. This means that adding a field to one of them requires adding the same field to the other, and that the new field must be taken into account in the triggers. Indexes must also be duplicated, and those who add them must know the best way to do this in the case of a partitioned table.

Bonus section!

At the time of our initial implementation, Rails 7.1.0, which allows composite primary and foreign keys to be specified and used, was still in beta. We tested it for a long time and, in general, we think it's really handy, so feel free to use it, but keep potential issues in mind.

Phew! Now that you know all about partitioning with canaries, do you feel like you’ve been let out of a cage? Soar free and let us know!

Hire us to solve performance and scalability problems

We partitioned a database with 800K users and 60M records with zero downtime for Playbook.com. This relieved the database of a major performance bottleneck, allowing it to serve millions more users! Got any performance or scalability tasks? We're ready to jump in and solve them!

Book a call