Martian Chronicles
Evil Martians’ team blog

The silence of the Ruby exceptions: a Rails/PostgreSQL database transaction thriller

Here comes a tale on why you should never silence errors inside database transactions. Learn how to use transactions properly, and what to do when using them is not an option. Spoiler: use PostgreSQL advisory locks!

I was working on a project where users could import a bunch of heavy entities (let’s call them products) from an external service into our application. Each product brought in even more associated data from external APIs. It is not uncommon for a user to import hundreds of products, and each product has dependencies that need to be fetched too, so you can imagine that the whole process takes a while (30 to 60 seconds for each product). As a user may get tired of waiting and hit the “Cancel” button at any moment, the application should still be usable with only those records that made it.

Here’s how our “interruptable import” was implemented: at the start of an import a temporary record in a separate database table was created for each enqueued product. Then for each of those records, a background job was spawned that pulled in all external information, persisted it in the right places (by creating associations, if necessary), and finally deleted a temporary record. If by the time the background job started, the record was not found (when a user cancels the import, all temporary records are deleted)—the job just did nothing and exited silently.
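The flow described above can be sketched in plain Ruby. This is an illustration, not the project’s actual code: the `TASKS` hash stands in for the temporary-records table, and `import_product` for the background job.

```ruby
# A sketch of the job flow; all names here are illustrative.
TASKS = {} # task id => product payload

def import_product(task_id)
  task = TASKS[task_id]   # look up the temporary record
  return if task.nil?     # import was canceled: do nothing, exit silently

  # ...pull in external data, persist the product and its associations...

  TASKS.delete(task_id)   # finally, delete the temporary record
end
```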

Interrupted import or not—the absence of temporary records meant we were done.

Our design seemed simple and reliable, but it did not always work exactly as planned. A common bug description stated: “After canceling an import, a user is presented with a list of imported records. However, the next time the page is refreshed, the list has more records than shown initially”.

The reason for that was clear: as background jobs took up to a minute to finish, even a canceled import had an “afterglow.”

Nothing was wrong with the design, but it led to a confusing user experience, so we needed to address it in one of two ways: either somehow identify and cancel jobs already in progress, or wait for the last imports to finish before confirming that the whole process was in fact “canceled.” I chose the latter.

Transaction locks to the rescue!

For anyone dealing with relational databases often, the answer is clear: use transactions! According to the Rails docs, they are “protective blocks where SQL statements are only permanent if they can all succeed as one atomic action.” The documentation also states that “you should use transaction blocks whenever you have a number of statements that must be executed together or not at all.”

It is good to keep in mind that in most RDBMSes, records being updated inside a transaction are locked and cannot be modified by other processes until the transaction finishes. The same is true for records fetched with SELECT … FOR UPDATE.

Exactly our case! I used a single database transaction to wrap complex import tasks for individual products and to lock task records from being changed or deleted:

ActiveRecord::Base.transaction do
  # SELECT … FOR UPDATE means “I want to change it later, keep it for me”
  task = Import::Task.lock.find_by(id: id)
  return unless task # it was already deleted, so we're free

  # do some heavy lifting
  task.destroy
end

Now whenever the whole import is canceled, this code should wait for all individual tasks to finish:

user.import_tasks.delete_all # waits until all running imports are finished

Simple and elegant! I ran everything locally and in the staging environment, found no problems, and deployed to production.

Not so fast…

Satisfied with my work, I was surprised to wake up to a myriad of log errors and complaints from colleagues. Many products weren’t being imported at all. In some cases, only a single product would appear after an import was canceled.

Error messages did not make much sense: PG::InFailedSqlTransaction from code that executed innocent SELECTs. What’s going on?

After an entire day of debugging, I identified three root causes:

  1. Concurrent inserts of conflicting records into a database.
  2. Automatic transaction rollback in PostgreSQL.
  3. Silenced Ruby exceptions.

First problem: Concurrent inserts of conflicting records

As each product’s import job takes up to a minute, and there are multiple jobs per user, we run them in parallel to save time. Dependent records that are created when those jobs run can overlap, meaning different products can have the same dependency, first created, then reused.

There are checks for these cases on the application level, but now that we use transactions they are rendered useless: if transaction A has created a dependent record, but has not committed yet, transaction B won’t be aware of it and will try to create a duplicate.
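The race can be modeled in plain Ruby (a sketch with assumed names): both “transactions” run the validation against the same snapshot before either insert is visible, so the application-level check passes twice.

```ruby
store = [] # stands in for the deps table, which has no UNIQUE constraint yet
value = { user_id: 1, chars: '{"same": "value"}' }

# Both transactions run the application-level uniqueness check first;
# neither can see the other's uncommitted insert:
a_sees_duplicate = store.include?(value) # => false
b_sees_duplicate = store.include?(value) # => false

# ...so both proceed with the insert, and we end up with a duplicate:
store << value unless a_sees_duplicate
store << value unless b_sees_duplicate
store.count(value) # => 2
```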

Second problem: Transaction rollback on a DB-level error

We can prevent the creation of duplicate records at the database level, with the following DDL:

ALTER TABLE deps ADD UNIQUE (user_id, chars);

Now if transaction A has inserted a new record and transaction B, running in parallel, tries to insert the same record, we will get an error:

BEGIN;
INSERT INTO deps (user_id, chars) VALUES (1, '{"same": "value"}');
-- This blocks until the first transaction finishes, then throws an error
-- once the first transaction has committed and it becomes clear
-- that we have a conflict:
ERROR:  duplicate key value violates unique constraint "deps_user_id_chars_key"
DETAIL:  Key (user_id, chars)=(1, {"same": "value"}) already exists.

One point that is commonly overlooked: when transaction B encounters an error, it rolls back entirely, and all the work leading up to the error is wasted. The invalid transaction, however, stays open until all the code inside it runs its course, and every attempt to use the database in the meantime results in another error:

SELECT * FROM products;
ERROR:  current transaction is aborted, commands ignored until end of transaction block

It goes without saying that nothing will be persisted at all, whether it was executed before or after the error:

COMMIT;  -- If we even try to save what we've done
ROLLBACK -- RDBMS will reject our attempt

Third problem: Silenced exceptions

At that point, it became clear that adding such a common and robust concept as database transaction to the application logic effectively broke it. I had no choice but to dig into other people’s code. This is when I started seeing snippets such as this one:

def process_stuff(_data)
  # Magic, lots of magic
rescue StandardError
  nil # Happy debugging
end

What the author is trying to say here is: “We tried something that failed, but we don’t care, so we keep doing stuff.” While the reasons behind this approach may be justified (some things can’t be validated at the application level), it makes transaction-based logic unimplementable. ActiveRecord relies on exceptions bubbling up to the transaction block, where it catches them, rolls back the transaction, and re-raises; a silenced exception never gets there, so no rollback is triggered.
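For contrast, here is a sketch of the two habits side by side; `process_stuff_badly` mirrors the snippet above (division stands in for the “magic”), while the narrow variant rescues only the one class it knows how to handle:

```ruby
# Blanket rescue: any failure silently becomes nil and surfaces later,
# far away from the real bug
def process_stuff_badly(data)
  10 / data
rescue StandardError
  nil # happy debugging
end

# Narrow rescue: catch only what we actually know how to handle,
# leave a trace, and let everything else bubble up
def process_stuff(data)
  10 / data
rescue ZeroDivisionError => e
  warn "division failed: #{e.message}"
  nil
end

result = process_stuff_badly(0) # ZeroDivisionError is swallowed, result is nil
# ...much later, in unrelated code:
# result + 1  # => NoMethodError on nil, nowhere near the original failure
```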

A perfect storm

Here’s how those three factors came together to create a perfect storm bug:

  • A transaction tries to insert a conflicting row into the database and gets a “duplicate key” error from Postgres. This error, however, never triggers a transaction rollback, because it is silenced in Ruby.
  • The transaction is aborted but stays open. When the application tries to access the DB for whatever reason (even with an innocent SELECT to list products), it triggers a “current transaction is aborted” error. But that exception may be silenced too…
  • You get the idea. While things keep breaking inside the application, no one will know about it until the error bubbles up to the first piece of code that does not use a rescue block. As it might be quite far away from the code that failed originally, debugging becomes a nightmare.

PostgreSQL’s built-in alternative to transaction-level locks

Hunting for rescues in the application code and rewriting the whole import logic was not an option, so I needed a quick win and Postgres gave me one! It has a built-in locking tool that can be used as an alternative for locking records in transactions. Meet session-level advisory locks. This is how I used them:

First, I removed all the transaction wrappers that I had added in the first place. Doing API interactions or any other non-database side effects from within a transaction is a bad idea anyway: even if database changes are rolled back successfully, the side effects are already applied, which may leave your application in an undesired state. Take a look at the isolator gem, which helps you make sure your side effects are well isolated.

Then I associated every import task with a shared lock based on some unique key (created from a user ID and a hash of operation name, for instance):

SELECT pg_advisory_lock_shared(42, user.id);

Shared locks may be obtained simultaneously by as many sessions as you want.
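The two-argument form of pg_advisory_lock_shared takes two signed 32-bit integers, so an operation name has to be reduced to one. Here is a sketch of one way to derive such a key (using Zlib.crc32 is my assumption; any stable 32-bit hash would do):

```ruby
require "zlib"

# Reduce an operation name to a stable signed 32-bit integer that fits
# into one of the two int4 slots of pg_advisory_lock_shared(int, int)
def advisory_lock_key(operation)
  [Zlib.crc32(operation)].pack("l").unpack1("l")
end

key = advisory_lock_key("product_import")
# Then, in each import task (SQL sketch):
#   SELECT pg_advisory_lock_shared(#{key}, #{user.id});
```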

The “cancel import” operation deletes all pending tasks from the database and tries to take an exclusive lock on the same key; this has to wait until all shared locks are released:

SELECT pg_advisory_lock(42, user.id);

And that is it!

Now canceling an import will automatically wait for all running import tasks to finish.
Moreover, we can use a little hack to set a timeout for our exclusive lock, because we don’t want to wait for too long, blocking a web server thread, if some task hangs for whatever reason:

transaction do
  execute("SET LOCAL lock_timeout = '30s'")
  execute("SELECT pg_advisory_lock(42, user.id)")
rescue ActiveRecord::LockWaitTimeout
  nil # we are tired of waiting and give up (transaction is already rolled back at this point)
end

Rescuing the exception outside of the transaction block is safe because ActiveRecord has already rolled the transaction back.

What to do with concurrent inserts of the same record?

Unfortunately, I am not aware of a solution that allows truly concurrent inserts of conflicting records. Here are all the approaches I know of; every one of them blocks the concurrent insert until the first transaction commits:

  • Use INSERT … ON CONFLICT UPDATE (available since PostgreSQL 9.5): it waits until the first transaction commits and then returns the record inserted by it.
  • Lock some common record in the database before running validations for a new record: this waits until the record inserted by the first transaction becomes visible, so validations work as expected.
  • Take some common advisory lock before running validations for a new record.

If you’re not afraid of dealing with Postgres errors, you can also do this:

def import_all_the_things
  # start your transaction here; at some point, do:
  Dep.create(user_id: user_id, chars: chars)
rescue ActiveRecord::RecordNotUnique
  retry
end

And make sure this code is not wrapped inside a transaction block.

Why will it lock? UNIQUE and EXCLUDE constraints prevent potential conflicts from being written at the same time. E.g., if you have a unique integer column and one transaction inserts 5, that prevents other transactions from inserting 5 concurrently, but does not interfere with a transaction inserting 6. Since the least strict transaction isolation level supported by PostgreSQL is READ COMMITTED, the second transaction is not allowed to see uncommitted results of the first, so an INSERT of a conflicting value must wait until the first transaction either commits (then the insert fails) or rolls back (then it succeeds). Read more in this article from the author of EXCLUDE constraints.

Preventing future havoc

If you know that some code cannot be used inside a transaction for the reasons described above, make sure it can never be wrapped in one. Here’s how you can detect an open transaction and fail fast:

# Here's how you define it
module NoTransactionAllowed
  class InTransactionError < RuntimeError; end

  def call(*)
    return super unless in_transaction?

    raise InTransactionError,
          "#{self.class.name} doesn't work reliably within a DB transaction"
  end

  def in_transaction?
    connection = ApplicationRecord.connection
    # service transactions (tests and database_cleaner) are not joinable
    connection.transaction_open? && connection.current_transaction.joinable?
  end
end

# And here's how you use it
class DepsImport < BaseService
  prepend NoTransactionAllowed

  def call
    do_import
  rescue ActiveRecord::RecordNotUnique
    retry
  end
end

Now, whenever anyone tries to wrap a risky service in a transaction, they will get an exception (unless they’ve silenced it, of course).

Wrapping up

One lesson to take from this tale: be careful with exceptions. Only catch those you really know how to deal with, and nothing more. Never silence an exception unless you are absolutely sure it is safe to ignore. The sooner an error is noticed, the easier it is to debug.

And don’t overuse database transactions!

Don’t forget to check out our gems isolator and after_commit_everywhere.

Further reading

Exceptional Ruby by Avdi Grimm. This book will teach you how to work with existing Ruby exceptions, and how to design your own.

Using Atomic Transactions to Power an Idempotent API by @Brandur. His blog is full of awesome articles about software reliability, Ruby, and PostgreSQL.
