One row, many threads: How to avoid database duplicates in Rails applications

August 18, 2021

Topics

Dmitry Tsepelev
Sr. Backend Engineer

Making sure each database row is unique, and tables contain no duplicates seems like a straightforward task in Rails. Active Record has a built-in validator that you can put on a model and call it a day. Until your application encounters high load and concurrent writes—then you will have to deal with nasty bugs that are difficult to reproduce locally or catch in tests. Read on for the comparison and benchmarks of advanced approaches that will make your application more robust and your data—consistent.

In this article, we will use a single-file PostgreSQL-backed Rails application with two tables, simulate different race conditions with Ruby threads, and see how we can enforce uniqueness for records either on the application level or by using database tools. We will talk about locks, constraints, unique indices, “upserts”, and the drawbacks of the Active Record’s built-in uniqueness validator. By the end of this article, you will be well-equipped to choose the best tool for the job. Stay tuned!

Preparing our lab for experiments 🧪

The guinea pig for today’s experiments is an imaginary time tracker application where users can log the amount of time they spend on different tasks. Every time a user makes an entry, we create a record in the time_tracks table that references the user_id and the task_id.

The trick is to avoid creating more than one entry per user and a task per day.

In other words, if “reading” is a task with ID 1 and the user wants to log another hour of reading for the same date—we should just update the hours property of the record in time_tracks and not create a duplicate. Sounds easy enough!

Here’s the schema:

ActiveRecord::Schema.define do
  enable_extension "plpgsql"

  create_table "time_tracks", force: :cascade do |t|
    t.bigint "user_id", null: false
    t.bigint "task_id", null: false
    t.integer "hours", null: false
    t.date "date", null: false
    t.datetime "created_at", precision: 6, null: false
    t.datetime "updated_at", precision: 6, null: false
  end

  create_table "users", force: :cascade do |t|
    t.datetime "created_at", precision: 6, null: false
    t.datetime "updated_at", precision: 6, null: false
  end

  add_foreign_key "time_tracks", "users"
end

And here’s the naive controller implementation that leverages #find_or_initialize_by:

class TimeTracksController < ApplicationController
  def track
    identifiers = time_track_params.slice(:date, :task_id)
    entry = current_user.time_tracks
                        .find_or_initialize_by(identifiers)
    entry.update!(hours: time_track_params[:hours])

    render status: :ok
  end

  private

  def time_track_params
    params.require(:time_track).permit(:task_id, :date, :hours)
  end
end

Or, you can use #find_or_create_by instead, it is basically the same thing:

class TimeTracksController < ApplicationController
  def track
    identifiers = time_track_params.slice(:date, :task_id)
    entry =
      current_user.time_tracks
                  .create_with(hours: time_track_params[:hours])
                  .find_or_create_by!(identifiers)
    entry.update!(hours: time_track_params[:hours])

    render status: :ok
  end

  private

  def time_track_params
    params.require(:time_track).permit(:task_id, :date, :hours)
  end
end

As you’ll see down the line, both approaches have similar downsides, so we will focus on #find_or_create_by!.

To save you the trouble of generating a Rails application so you can follow along, we’ve packed all our examples in single-file runnable scripts that will bootstrap a schema each time you run them and simulate different conditions. You can find them all in this gist.

Dangers of #find_or_create_by!

According to its name, #find_or_create_by! tries to find the matching table row and creates a new one only if #find returns nothing. Run this script to check:

user = User.create

100.times do
  date = Date.new(2020, 1, rand(1..10))
  user.time_tracks.create_with(hours: 10).find_or_create_by!(date: date, task_id: 1)
end

TimeTrack.from(
  TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => 0

It works perfectly locally, but once you deploy your application to production we can have more than one thread trying to perform this operation, or even different application instances each running many threads. Would it behave differently? Let’s check:

threads = []
wait_for_it = true

user = User.create

20.times do
  date = Date.new(2020, 1, rand(1..4))

  threads << Thread.new(date) do
    # A loop to make threads busy until we `join` them
    true while wait_for_it

    user.time_tracks.create_with(hours: 10).find_or_create_by!(date: date, task_id: 1)
  end
end

wait_for_it = false

threads.each(&:join)

TimeTrack.from(
  TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => not 0!

Well, that’s a bummer.

The issue is that find_or_create_by! is not atomic—find and create are separate operations. So there is a huge chance that two threads would try to insert the same data in parallel, and #find will return nothing for both, making them both perform #create. You’ve got your duplicate!

I don’t want to ruin the mood, but these issues are especially hard to catch in specs as they can only be reproduced using threads. What was the last time you tested your code in a multi-threaded environment?

Taming the race condition

As we now see, the problem is coming from two parallel threads causing the race condition while reading the data and writing it after. Per ACID principles, it must possible make the operation atomic (i.e., all operations happen “together”). One of the ways of doing it in Rails is to use the pessimistic lock:

# Works both with MySQL and PostgreSQL
user.with_lock do
  user.time_tracks
      .create_with(hours: 10)
      .find_or_create_by!(date: date, task_id: 1)
end

In this case, .with_lock won’t let another thread get inside the block. It performs SELECT ... FOR UPDATE on the specified record, and all other transactions trying to take a lock or update the row would wait for the current transaction to be committed.

Can we call it a day? Are there other ways to put inconsistent data into the table? Unfortunately, yes: someone can add code that writes data without the lock or even put the data directly to the table! You will have to exercise incredible discipline on the team level to make sure all your non-thread-safe writes are locked.

Is there something that offers more peace of mind?

Constraints to the rescue!

The easiest way to enforce data consistency is to add constraints right to the table. This is a native feature of PostgreSQL, and it’s a good idea to use it in your application.

To create a new uniqueness constraint with an Active Record migration, we need to add the unique index. Active Record will figure a way to properly add the constraint depending on your database adapter. Now, it will raise RecordNotUnique each time we try to write a duplicate—no matter if we do it from Ruby or bypass ORM completely and drop to pure SQL.

class AddUniqueIndexToTimeTracks < ActiveRecord::Migration[6.0]
  # You need this if you are adding an index _concurrently_
  disable_ddl_transaction!

  def change
    add_index :time_tracks, %i[user_id task_id date], unique: true, algorithm: :concurrently
  end
end

🚨 Downtime alert! If your table is huge—make sure to add the index concurrently. Otherwise, when the table is locked by index creation, all threads trying to write to this table will have to wait for the lock to be released. If there are enough table writes to make all workers busy—it might cause downtime.

Cool, but what if we already have a huge application with inconsistent data? You will have to fix the code and the data before adding the index, so let’s talk about that.

Finding places that need locks

While the solution is pretty trivial (add proper locks!), you might wonder where we should take these locks, especially when you have a huge codebase. If feel lucky and can afford a couple of errors in production, here is the plan.

Find and fix all the code responsible for duplicates that you can find (to make sure we don’t get too many crashes).
Write a rake task that can clear or merge duplicates.
Run the task in production.
Add unique index and wait for error reports.
When you get RecordNotUnique—either quickly fix it or remove the index and then fix it. Repeat steps 3–5 until you stop getting errors.

Let’s also specify what we want from the rake task: it should find all anomalies and either merge or remove them. In our case, it’s trivial: we need to group “time tracks” by user_id, task_id, and date and remove all rows in each group except for the last one.

Spin-off: Conditional unique index

This is not directly related to the topic, but you might face a situation when adding a unique index to a table where some values can be nil. Imagine that users can track time not connected to any tasks (“free time”) but still create only one record per day. Guess what happens if we add the index the way we did it before? It will consider all records with NULL inside task_id are unique because NULL != NULL in PostgreSQL!

In this rare case, you will need to replace NULL with something that can be checked for uniqueness, for insance with zero:

add_index :time_tracks,
          'user_id, COALESCE(task_id, 0), date',
          unique: true,
          algorithm: :concurrently

The downside of this approach is that you can hardly use this index for a lookup: PSQL uses an index only when query conditions match the ones specified in the index exactly. If you want to use indexes both for constraints and lookups—use functional indexes:

add_index :time_tracks,
          'user_id, task_id, date',
          name: 'unique_user_task_date_when_task_exists',
          where: 'task_id is not null',
          unique: true,
          algorithm: :concurrently

add_index :time_tracks,
          'user_id date',
          name: 'unique_user_date_when_task_not_exists',
          where: 'task_id is null',
          unique: true,
          algorithm: :concurrently

However, if you need indices on many columns that rely on each other, you might need to specify many functional indexes, which makes this approach a bit brittle.

Now, back to the topic.

create_or_find_by vs. find_or_create_by

When you have the unique index in place, you can start using a modern replacement to #find_or_create_by!—meet #create_or_find_by!. As per the name, it does the same thing, but in reverse: it tries to insert the record, and if it encounters RecordNotUnique—loads the record (because in this case, we are sure the record exists). Note that it runs a nested transaction inside.

In other words:

def create_or_find_by!(params)
  transaction(requires_new: true) { create!(params) }
rescue RecordNotUnique
  find_by(params)
end

This is how our code looks like with this approach:

threads = []
wait_for_it = true

user = User.create

20.times do
  date = Date.new(2020, 1, rand(1..4))

  threads << Thread.new(date) do
    true while wait_for_it

    user.time_tracks.create_with(hours: 10).create_or_find_by!(date: date, task_id: 1)
  end
end

wait_for_it = false

threads.each(&:join)

TimeTrack.from(
  TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => 0

#create_or_find_by! might be slower than the combination of find_or_create_by and with_lock in cases when there are more reads than writes.

Alternatively, we can implement just create! and rescue ActiveRecord::RecordNotUnique ourselves, as we don’t need to lock anything (we already have a uniqueness constraint)—that way we can avoid an unneccesary nested transaction. Check out the full gist.

lookup_params = { date: date, task_id: 1 }

begin
  user.time_tracks.create!(lookup_params.merge(hours: 10))
rescue ActiveRecord::RecordNotUnique
  user.time_tracks.find_by(lookup_params).update(hours: 10)
end

I bet you’re curious which of the described ways to pick, so let’s benchmark!

First, let’s check what happens when we never insert duplicates:

              create! + rescue: 803.9 i/s
            create_or_find_by!: 694.5 i/s - same-ish
with_lock + find_or_create_by!: 359.0 i/s - 2.24x slower

As you could expect, create! is the fastest one (cause we never rescue).

Second, let’s choose the approach to use when we always insert duplicates:

with_lock + find_or_create_by!: 530.3 i/s
            create_or_find_by!: 525.5 i/s - same-ish
              create! + rescue: 491.9 i/s - same-ish

Despite the fact that we spend time taking a lock, find shows the best result (but the difference with create_or_find_by! is not that significant). It turns out that the preferred way depends on whether you have the unique index or not (in this case, you need a pessimistic lock) and how often ActiveRecord::RecordNotUnique actually occurs.

Using UPSERTs

If the database is so smart, can we delegate it checking if the record exists and updating it or creating it if it’s missing? It will let the application avoid handling any exceptions (it’s hidden, but they are still caught under the hood). Of course, we do! There is a special mechanism called UPSERT (or, more precisely, INSERT ... ON CONFLICT ...). Rails also supports it via the #upsert method.

Take a look:

threads = []
wait_for_it = true

user = User.create

20.times do
  date = Date.new(2020, 1, rand(1..4))

  threads << Thread.new(date) do
    true while wait_for_it

    user.time_tracks.upsert(
      {
        hours: 10,
        date: date,
        task_id: 1,
        created_at: Time.current,
        updated_at: Time.current
      },
      unique_by: %i[user_id task_id date]
    )
  end
end

wait_for_it = false

threads.each(&:join)

TimeTrack.from(
  TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => 0

Beware that it does not create any ActiveRecord objects, omits all validations and callbacks and tries to insert the data directly into the table. This approach is faster, but makes us provide created_at and updated_at explicitly.

Under the hood it crafts an INSERT statement that asks PostgreSQL to do something in case of conflict: in our scenario, we want to override hours.

INSERT INTO time_tracks (hours, date, task_id, created_at, updated_at, user_id)
VALUES (10, '2020-01-03', 1, '2021-08-05 14:15:39.346889', '2021-08-05 14:15:39.346891', 1)
ON CONFLICT (user_id, task_id, date) DO
  UPDATE SET
    hours = excluded.hours,
    created_at = excluded.created_at,
    updated_at = excluded.updated_at
RETURNING id

Built-in uniqueness validation

Rails would not be Rails if there were no special validator that can ensure that the column value is unique inside the given scope or the whole table:

class TimeTrack < ActiveRecord::Base
  validates :task_id, uniqueness: { scope: %i[date user_id] }
end

When the validator is called, it performs the database request to make sure there is no such row in the table already. When validation is passed, ActiveRecord saves the model. Doesn’t it sound similar to the #find_or_create_by? Oh, it does, and it has the same issue with race conditions!

threads = []
wait_for_it = true

user = User.create

40.times do
  date = Date.new(2020, 1, rand(1..4))

  threads << Thread.new(date) do
    true while wait_for_it

    begin
      user.time_tracks.create_with(hours: 10).find_or_create_by!(date: date, task_id: 1)
    rescue ActiveRecord::RecordInvalid
      puts 'RecordInvalid rescued!'
    end
  end
end

wait_for_it = false

threads.each(&:join)

TimeTrack.from(
  TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => not 0!

The only way to make it work fine is to wrap each create, update or save with .with_lock, and we cannot do it on the model level, so the whole approach sounds a bit error-prone.

There is one more issue with the built-in validator: it does not play well with the #create_or_find_by and #create_or_find_by!. Before trying to insert the data to the table, it performs all validations, and of course the uniquiness validation fails when record with same attributes exist.

As a result, instead of getting either new or existing record, we get an invalid one:

# inserting a first record
user.time_tracks
    .create_with(hours: 10)
    .create_or_find_by!(date: Date.new(2020, 1, 1), task_id: 1)
# okay, it's in the table

# trying to add a second one
user.time_tracks
    .create_with(hours: 10)
    .create_or_find_by!(date: Date.new(2020, 1, 1), task_id: 1)
# => Validation failed: Task has already been taken (ActiveRecord::RecordInvalid)

# trying non-bang version
user.time_tracks
    .create_with(hours: 10)
    .create_or_find_by(date: Date.new(2020, 1, 1), task_id: 1)
# => #<TimeTrack id: nil, user_id: 1, task_id: 1, hours: 10, date: "2020-01-01", created_at: nil, updated_at: nil>

Turns out, that in both cases validator prevented ActiveRecord from either creating record or finding one! 😕

What about human-readable errors?

Knowing all downsides discovered above, you might wonder when built-in uniqueness validation might be helpful. There is such a scenario: a situation when we need to let the user know the exact reason why the data he inserts is invalid. For instance, it might happen when we don’t want them to accidentally overwrite the data they already have.

Like all other validators, the uniqueness validator can add a message to errors (with I18n and all bells and whistles). If we do not need #create_or_find_by! for a specific model—we can use the built-in validator (just don’t forget about locks!).

Alternatively, there is a nice gem database_validations, which understands database-level validations and knows how to integrate them with model errors:

class User < ActiveRecord::Base
  validates :email, db_uniqueness: true
end

Wrapping up

Let’s sum up what we learned today:

find_or_initialize_by, find_or_create_by and find_or_create_by! are not thread-safe, so we need pessimistic locks.
A unique index in a database is the best way to keep data consistent.
When unique index is set up, we can use atomic create_or_find_by, create_or_find_by!, or find_or_create_by! with rescue ActiveRecord::RecordNotUnique.
Built-in uniqueness validator is not thread-safe either (if you really need it—use pessimistic locks), and it doesn’t work with create_or_find_by and create_or_find_by!.
You might need to use a validator to tell the user that the data he inserts is invalid (instead of silently overriding it); in this case, the validator can help, but consider using the one from database_validations gem.

Thank you for reading! Feel free to reach out to us if you want to talk about database issues or if you need to enlist Evil Martians to work on your Rails application.