One row, many threads: How to avoid database duplicates in Rails applications
Making sure each database row is unique and tables contain no duplicates seems like a straightforward task in Rails. Active Record has a built-in validator that you can put on a model and call it a day—until your application encounters high load and concurrent writes. Then you will have to deal with nasty bugs that are difficult to reproduce locally or catch in tests. Read on for a comparison and benchmarks of advanced approaches that will make your application more robust and your data consistent.
In this article, we will use a single-file PostgreSQL-backed Rails application with two tables, simulate different race conditions with Ruby threads, and see how we can enforce uniqueness for records either on the application level or by using database tools. We will talk about locks, constraints, unique indices, “upserts”, and the drawbacks of Active Record’s built-in uniqueness validator. By the end of this article, you will be well-equipped to choose the best tool for the job. Stay tuned!
Preparing our lab for experiments 🧪
The guinea pig for today’s experiments is an imaginary time tracker application where users can log the amount of time they spend on different tasks. Every time a user makes an entry, we create a record in the time_tracks table that references the user_id and the task_id.
The trick is to avoid creating more than one entry per user, task, and day.
In other words, if “reading” is a task with ID 1 and the user wants to log another hour of reading for the same date—we should just update the hours property of the record in time_tracks and not create a duplicate. Sounds easy enough!
Here’s the schema:
ActiveRecord::Schema.define do
enable_extension "plpgsql"
create_table "time_tracks", force: :cascade do |t|
t.bigint "user_id", null: false
t.bigint "task_id", null: false
t.integer "hours", null: false
t.date "date", null: false
t.datetime "created_at", precision: 6, null: false
t.datetime "updated_at", precision: 6, null: false
end
create_table "users", force: :cascade do |t|
t.datetime "created_at", precision: 6, null: false
t.datetime "updated_at", precision: 6, null: false
end
add_foreign_key "time_tracks", "users"
end
And here’s the naive controller implementation that leverages #find_or_initialize_by:
class TimeTracksController < ApplicationController
def track
identifiers = time_track_params.slice(:date, :task_id)
entry = current_user.time_tracks
.find_or_initialize_by(identifiers)
entry.update!(hours: time_track_params[:hours])
render status: :ok
end
private
def time_track_params
params.require(:time_track).permit(:task_id, :date, :hours)
end
end
Or, you can use #find_or_create_by! instead—it is basically the same thing:
class TimeTracksController < ApplicationController
def track
identifiers = time_track_params.slice(:date, :task_id)
entry =
current_user.time_tracks
.create_with(hours: time_track_params[:hours])
.find_or_create_by!(identifiers)
entry.update!(hours: time_track_params[:hours])
render status: :ok
end
private
def time_track_params
params.require(:time_track).permit(:task_id, :date, :hours)
end
end
As you’ll see down the line, both approaches have similar downsides, so we will focus on #find_or_create_by!.
To save you the trouble of generating a Rails application so you can follow along, we’ve packed all our examples in single-file runnable scripts that will bootstrap a schema each time you run them and simulate different conditions. You can find them all in this gist.
Dangers of #find_or_create_by!
As its name suggests, #find_or_create_by! tries to find the matching table row and creates a new one only if the find step returns nothing. Run this script to check:
user = User.create
100.times do
date = Date.new(2020, 1, rand(1..10))
user.time_tracks.create_with(hours: 10).find_or_create_by!(date: date, task_id: 1)
end
TimeTrack.from(
TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => 0
It works perfectly locally, but once you deploy your application to production, more than one thread can try to perform this operation—or even different application instances, each running many threads. Would it behave differently? Let’s check:
threads = []
wait_for_it = true
user = User.create
20.times do
date = Date.new(2020, 1, rand(1..4))
threads << Thread.new(date) do
# A loop to make threads busy until we `join` them
true while wait_for_it
user.time_tracks.create_with(hours: 10).find_or_create_by!(date: date, task_id: 1)
end
end
wait_for_it = false
threads.each(&:join)
TimeTrack.from(
TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => not 0!
Well, that’s a bummer.
The issue is that find_or_create_by! is not atomic—find and create are separate operations. So there is a good chance that two threads will try to insert the same data in parallel: find returns nothing for both, making both of them perform create. You’ve got your duplicate!
I don’t want to ruin the mood, but these issues are especially hard to catch in specs, as they can only be reproduced using threads. When was the last time you tested your code in a multi-threaded environment?
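By the way, you don’t even need a database to reproduce the pattern: any unsynchronized “check, then act” sequence is racy. Here is a pure-Ruby sketch (the Hash and the Mutex are our stand-ins for a table and a row lock, not anything from Rails) where holding the lock makes the “find or create” atomic:

```ruby
# A Hash plays the "table", a Mutex plays the row lock.
# Without the synchronize, the check-then-act below would be
# exactly as racy as a bare find + create.
store = {}
lock = Mutex.new

threads = 20.times.map do |i|
  Thread.new do
    key = i % 4
    # Atomic "find or create": the check and the write happen under one lock.
    lock.synchronize { store[key] ||= "entry for #{key}" }
  end
end
threads.each(&:join)

store.size # => 4, exactly one entry per key
```

Run it as many times as you like: with the mutex in place, each of the four keys gets created exactly once, no matter how the threads interleave.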
Taming the race condition
As we now see, the problem comes from two parallel threads hitting a race condition between reading the data and writing it afterwards. Per the ACID principles, we must make the operation atomic (i.e., all its steps happen “together”). One of the ways of doing it in Rails is to use a pessimistic lock:
# Works both with MySQL and PostgreSQL
user.with_lock do
user.time_tracks
.create_with(hours: 10)
.find_or_create_by!(date: date, task_id: 1)
end
In this case, .with_lock won’t let another thread get inside the block. It performs SELECT ... FOR UPDATE on the specified record, and all other transactions trying to take a lock or update the row will wait for the current transaction to be committed.
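For reference, the statements Rails issues inside the block look roughly like this (simplified by hand; the exact SQL varies between Rails versions and adapters):

```sql
BEGIN;
-- with_lock: lock the user row until COMMIT
SELECT * FROM users WHERE id = 1 LIMIT 1 FOR UPDATE;
-- find_or_create_by!: the check and the write now run under the lock
SELECT * FROM time_tracks
  WHERE user_id = 1 AND date = '2020-01-03' AND task_id = 1 LIMIT 1;
-- executed only if the SELECT above found nothing
INSERT INTO time_tracks (user_id, task_id, date, hours, created_at, updated_at)
  VALUES (1, 1, '2020-01-03', 10, NOW(), NOW());
COMMIT;
```

Any concurrent transaction issuing the same SELECT ... FOR UPDATE on that user row blocks until the COMMIT, which is exactly what serializes the check-then-insert sequence.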
Can we call it a day? Are there other ways to put inconsistent data into the table? Unfortunately, yes: someone can add code that writes data without the lock or even write to the table directly! You will have to exercise incredible discipline on the team level to make sure all your non-thread-safe writes are locked.
Is there something that offers more peace of mind?
Constraints to the rescue!
The easiest way to enforce data consistency is to add constraints right to the table. This is a native feature of PostgreSQL, and it’s a good idea to use it in your application.
To create a new uniqueness constraint with an Active Record migration, we need to add a unique index. Active Record will figure out how to properly add the constraint depending on your database adapter. From now on, the database will raise ActiveRecord::RecordNotUnique each time we try to write a duplicate—no matter if we do it from Ruby or bypass the ORM completely and drop to pure SQL.
class AddUniqueIndexToTimeTracks < ActiveRecord::Migration[6.0]
# You need this if you are adding an index _concurrently_
disable_ddl_transaction!
def change
add_index :time_tracks, %i[user_id task_id date], unique: true, algorithm: :concurrently
end
end
🚨 Downtime alert! If your table is huge—make sure to add the index concurrently. Otherwise, while the table is locked by index creation, all threads trying to write to it will have to wait for the lock to be released. If there are enough table writes to make all workers busy—it might cause downtime.
Cool, but what if we already have a huge application with inconsistent data? You will have to fix the code and the data before adding the index, so let’s talk about that.
Finding places that need locks
While the solution is pretty trivial (add proper locks!), you might wonder where exactly to take these locks, especially in a huge codebase. If you feel lucky and can afford a couple of errors in production, here is the plan.
- Find and fix all the code responsible for duplicates that you can find (to make sure we don’t get too many crashes).
- Write a rake task that can clear or merge duplicates.
- Run the task in production.
- Add unique index and wait for error reports.
- When you get RecordNotUnique—either quickly fix it, or remove the index and then fix it. Repeat steps 3–5 until you stop getting errors.
Let’s also specify what we want from the rake task: it should find all anomalies and either merge or remove them. In our case, it’s trivial: we need to group “time tracks” by user_id, task_id, and date and remove all rows in each group except for the last one.
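The grouping logic itself fits in a few lines of plain Ruby. Here is a minimal sketch operating on bare [id, user_id, task_id, date] tuples (a stand-in for a TimeTrack.pluck call; the real rake task would feed the returned IDs to delete_all):

```ruby
# Given rows as [id, user_id, task_id, date], return the IDs of all
# duplicates: every row in a (user_id, task_id, date) group except
# the one with the highest id (i.e., the most recent one).
def duplicate_ids(rows)
  rows
    .group_by { |_id, user_id, task_id, date| [user_id, task_id, date] }
    .values
    .flat_map { |group| group.map(&:first).sort[0..-2] }
end

rows = [
  [1, 1, 1, "2020-01-01"], # duplicate — will be removed
  [2, 1, 1, "2020-01-01"], # kept: the latest row in its group
  [3, 1, 2, "2020-01-01"]  # unique — untouched
]
duplicate_ids(rows) # => [1]
```

Note that sort[0..-2] returns an empty array for single-element groups, so rows that are already unique never make it into the deletion list.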
Spin-off: Conditional unique index
This is not directly related to the topic, but you might face a situation when adding a unique index to a table where some values can be NULL. Imagine that users can track time not connected to any tasks (“free time”) but still create only one record per day. Guess what happens if we add the index the way we did it before? It will consider all records with NULL inside task_id unique, because in PostgreSQL NULL = NULL evaluates to NULL rather than true, so a unique index treats every NULL as a distinct value!
In this rare case, you will need to replace NULL with something that can be checked for uniqueness, for instance, with zero:
add_index :time_tracks,
'user_id, COALESCE(task_id, 0), date',
unique: true,
algorithm: :concurrently
The downside of this approach is that you can hardly use this index for lookups: PostgreSQL uses an expression index only when the query conditions match the expression specified in the index exactly. If you want to use indexes both for constraints and for lookups—use partial indexes:
add_index :time_tracks,
'user_id, task_id, date',
name: 'unique_user_task_date_when_task_exists',
where: 'task_id is not null',
unique: true,
algorithm: :concurrently
add_index :time_tracks,
'user_id, date',
name: 'unique_user_date_when_task_not_exists',
where: 'task_id is null',
unique: true,
algorithm: :concurrently
However, if you need uniqueness over many columns that depend on each other, you might have to specify many partial indexes, which makes this approach a bit brittle.
Now, back to the topic.
create_or_find_by vs. find_or_create_by
When you have the unique index in place, you can start using a modern replacement for #find_or_create_by!—meet #create_or_find_by!. As per the name, it does the same thing, but in reverse: it tries to insert the record, and if it encounters RecordNotUnique—it loads the record (because in this case, we are sure the record exists). Note that it runs a nested transaction inside.
In other words:
def create_or_find_by!(params)
  transaction(requires_new: true) { create!(params) }
rescue ActiveRecord::RecordNotUnique
  find_by!(params)
end
This is how our code looks with this approach:
threads = []
wait_for_it = true
user = User.create
20.times do
date = Date.new(2020, 1, rand(1..4))
threads << Thread.new(date) do
true while wait_for_it
user.time_tracks.create_with(hours: 10).create_or_find_by!(date: date, task_id: 1)
end
end
wait_for_it = false
threads.each(&:join)
TimeTrack.from(
TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => 0
#create_or_find_by! might be slower than the combination of find_or_create_by and with_lock in cases when there are more reads than writes.
Alternatively, we can just call create! and rescue ActiveRecord::RecordNotUnique ourselves, as we don’t need to lock anything (we already have a uniqueness constraint)—that way, we avoid an unnecessary nested transaction. Check out the full gist.
lookup_params = { date: date, task_id: 1 }
begin
user.time_tracks.create!(lookup_params.merge(hours: 10))
rescue ActiveRecord::RecordNotUnique
user.time_tracks.find_by(lookup_params).update(hours: 10)
end
I bet you’re curious which of the described ways to pick, so let’s benchmark!
First, let’s check what happens when we never insert duplicates:
create! + rescue: 803.9 i/s
create_or_find_by!: 694.5 i/s - same-ish
with_lock + find_or_create_by!: 359.0 i/s - 2.24x slower
As you could expect, create! is the fastest one (since we never hit the rescue branch).
Second, let’s choose the approach to use when we always insert duplicates:
with_lock + find_or_create_by!: 530.3 i/s
create_or_find_by!: 525.5 i/s - same-ish
create! + rescue: 491.9 i/s - same-ish
Despite the fact that we spend time taking a lock, with_lock + find_or_create_by! shows the best result (though the difference with create_or_find_by! is not significant). It turns out that the preferred way depends on whether you have the unique index (without one, you need a pessimistic lock) and on how often ActiveRecord::RecordNotUnique actually occurs.
Using UPSERTs
If the database is so smart, can we delegate it checking if the record exists and updating it or creating it if it’s missing? It will let the application avoid handling any exceptions (it’s hidden, but they are still caught under the hood). Of course, we do! There is a special mechanism called UPSERT (or, more precisely, INSERT ... ON CONFLICT ...
). Rails also supports it via the #upsert method.
Take a look:
threads = []
wait_for_it = true
user = User.create
20.times do
date = Date.new(2020, 1, rand(1..4))
threads << Thread.new(date) do
true while wait_for_it
user.time_tracks.upsert(
{
hours: 10,
date: date,
task_id: 1,
created_at: Time.current,
updated_at: Time.current
},
unique_by: %i[user_id task_id date]
)
end
end
wait_for_it = false
threads.each(&:join)
TimeTrack.from(
TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => 0
Beware that it does not create any Active Record objects, skips all validations and callbacks, and tries to insert the data directly into the table. This approach is faster but makes us provide created_at and updated_at explicitly.
Under the hood, it crafts an INSERT statement that asks PostgreSQL to do something in case of conflict: in our scenario, we want to override hours.
INSERT INTO time_tracks (hours, date, task_id, created_at, updated_at, user_id)
VALUES (10, '2020-01-03', 1, '2021-08-05 14:15:39.346889', '2021-08-05 14:15:39.346891', 1)
ON CONFLICT (user_id, task_id, date) DO
UPDATE SET
hours = excluded.hours,
created_at = excluded.created_at,
updated_at = excluded.updated_at
RETURNING id
Built-in uniqueness validation
Rails would not be Rails if there were no special validator that can ensure that the column value is unique inside the given scope or the whole table:
class TimeTrack < ActiveRecord::Base
validates :task_id, uniqueness: { scope: %i[date user_id] }
end
When the validator is called, it performs a database request to make sure there is no such row in the table already. Once the validation passes, Active Record saves the model. Doesn’t it sound similar to #find_or_create_by? Oh, it does, and it has the same issue with race conditions!
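For the record, the existence check the validator issues looks roughly like this (simplified; the exact quoting and clause order may differ):

```sql
SELECT 1 AS one FROM "time_tracks"
  WHERE "time_tracks"."task_id" = 1
    AND "time_tracks"."date" = '2020-01-01'
    AND "time_tracks"."user_id" = 1
  LIMIT 1;
```

Two threads can both run this check, both see no rows, and both proceed to INSERT—the very same window as with find_or_create_by: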
threads = []
wait_for_it = true
user = User.create
40.times do
date = Date.new(2020, 1, rand(1..4))
threads << Thread.new(date) do
true while wait_for_it
begin
user.time_tracks.create_with(hours: 10).find_or_create_by!(date: date, task_id: 1)
rescue ActiveRecord::RecordInvalid
puts 'RecordInvalid rescued!'
end
end
end
wait_for_it = false
threads.each(&:join)
TimeTrack.from(
TimeTrack.group(:date, :user_id).having('COUNT(*) > 1').select(:date, :user_id)
).select('COUNT(subquery.date) AS not_unique').to_a.first.not_unique # => not 0!
The only way to make it work fine is to wrap each create, update, or save with .with_lock, and we cannot do it on the model level, so the whole approach sounds a bit error-prone.
There is one more issue with the built-in validator: it does not play well with #create_or_find_by and #create_or_find_by!. Before trying to insert the data into the table, Rails performs all validations, and, of course, the uniqueness validation fails when a record with the same attributes already exists.
As a result, instead of getting either a new or an existing record, we get an invalid one:
# inserting a first record
user.time_tracks
.create_with(hours: 10)
.create_or_find_by!(date: Date.new(2020, 1, 1), task_id: 1)
# okay, it's in the table
# trying to add a second one
user.time_tracks
.create_with(hours: 10)
.create_or_find_by!(date: Date.new(2020, 1, 1), task_id: 1)
# => Validation failed: Task has already been taken (ActiveRecord::RecordInvalid)
# trying non-bang version
user.time_tracks
.create_with(hours: 10)
.create_or_find_by(date: Date.new(2020, 1, 1), task_id: 1)
# => #<TimeTrack id: nil, user_id: 1, task_id: 1, hours: 10, date: "2020-01-01", created_at: nil, updated_at: nil>
It turns out that in both cases the validator prevented Active Record from either creating a record or finding one! 😕
What about human-readable errors?
Knowing all the downsides discovered above, you might wonder when the built-in uniqueness validation might still be helpful. There is such a scenario: when we need to let users know the exact reason why the data they insert is invalid. For instance, it might happen when we don’t want them to accidentally overwrite the data they already have.
Like all other validators, the uniqueness validator can add a message to errors (with I18n and all the bells and whistles). If we do not need #create_or_find_by! for a specific model—we can use the built-in validator (just don’t forget about locks!).
Alternatively, there is a nice gem, database_validations, which understands database-level validations and knows how to integrate them with model errors:
class User < ActiveRecord::Base
validates :email, db_uniqueness: true
end
Wrapping up
Let’s sum up what we learned today:
- find_or_initialize_by, find_or_create_by, and find_or_create_by! are not thread-safe, so we need pessimistic locks.
- A unique index in a database is the best way to keep data consistent.
- When a unique index is set up, we can use atomic create_or_find_by, create_or_find_by!, or find_or_create_by! with rescue ActiveRecord::RecordNotUnique.
- The built-in uniqueness validator is not thread-safe either (if you really need it—use pessimistic locks), and it doesn’t work with create_or_find_by and create_or_find_by!.
- You might need a validator to tell the user that the data they insert is invalid (instead of silently overriding it); in this case, the validator can help, but consider using the one from the database_validations gem.
Thank you for reading! Feel free to reach out to us if you want to talk about database issues or if you need to enlist Evil Martians to work on your Rails application.