Soft deletion with PostgreSQL: but with logic on the database!

Cover for Soft deletion with PostgreSQL: but with logic on the database!

Any developer eventually faces a challenge: making significant changes to large code bases. But here, I’ll walk through an unorthodox solution which allows all the application code’s complexity to remain untouched thanks to the might and power of the underlying database! It’s soft deletion, but we’ll put the logic in the database itself. Keep reading for the ‘how’ and ‘why’!

While this post is primarily about PostgreSQL, I’ll need to touch the application code a couple of times. Since Ruby on Rails is the stack I have the most experience with, I’ll be using it in my examples. Although Rubyists might have extra appreciation for this article, but let’s emphasize up front that the final solution will be completely in SQL.

Up first, let’s talk about deletion itself, and we’ll understand the difference between “hard” and “soft” deletion.

Schedule call

Irina Nazarova CEO at Evil Martians

Schedule call

Hard and soft deletion: a brief intro and comparison

A typical application’s logic requires data to be deleted from time to time. In SQL, the straightforward approach to make this happen is to simply drop the record from the database using a DELETE statement.

The problem with such “hard” deletion is that any of the affected data is irrecoverably lost and becomes inaccessible for later analysis.

For this reason, developers frequently employ another option which is widely known as soft deletion. The idea here is simple: instead of dropping a record from a database, you can specify some attribute (like deleted: true or deleted_at: Time.now) to mark that it has been removed. And now, because no actual deletion is being performed, the entire history of your data can remain intact.

Of course, to make this viable, it’s necessary to scope all requests so that only not deleted rows are visible to the application itself.

In Ruby, such code might look something like this:

# performing a "soft" delete
Order.find(1).update!(deleted: true)

# counting the visible records
Order.where.not(deleted: true).count

Both “hard” and “soft” approaches are already widespread.

So, the question now: how to gracefully transition from hard to soft deletion in a real application?

When making the switch to soft deletion in a live application, there are many factors to consider. First up, there are a number of ways a record can be deleted on the application side that need to be accounted for.

Additionally, dependencies can exist in a database, meaning that dependent records should be deleted in cascade:

class User
  has_many :orders, dependent: :destroy # or :delete
end

And don’t forget about foreign keys, making such cascades implicit:

# somewhere in a migration
add_foreign_key :orders, :users, on_delete: :cascade

How can we make sure all possible deletions are found and being carried out—“softly”—as intended?

Wouldn’t life be much easier if there was some magical way to implement soft deletion outside of the application? Wouldn’t it be nice to do this on, say, the database layer, thus eliminating any need for the app be touched at all?

Well, we can do this through the power of modern databases. I believe any advanced database like MySQL or Oracle possess reach features to do the job, but for this demonstration, I’m going to stick with PostgreSQL (one of the primary selections we utilize in our projects). And indeed, with PostgreSQL we have several ways to implement this solution in a clean, simple, and manageable manner.

But wait, really?

Well, yes, really. Full disclosure—this is no doubt a bit of a humorous misadventure, but nonetheless, I still very much envision this approach as a reality—the database simply does some things better. Think about data integrity. If the data-related logic was on the application side and you decided to migrate one microservice to soft deletion, you wouldn’t need to be cautious not to mess anything up.

If you’ve got many applications for many microservices, something could get easily become broken if all the data-related rules and constraints are made at the application level. It’s possible to miss some instance where one application performs a check and another doesn’t, and thus, you end up with corrupted data.

And, if you run into issue with database logic, the PostgreSQL community is the number one in open source, so you can be confident you’ll have a solid group of developers offering support on the sidelines.

Admittedly, this would be difficult to scale if you didn’t plan out from the outset of development, and, if you keep adding logic to the database, it could eventually become just as fat as the application itself, and that wouldn’t really be too great. But no matter, this is still certainly theoretically possible, so let’s press onward.

Preparing for Deletion

As I mentioned above, switching from hard to soft deletion consists of 2 simple steps:

  1. Changing all deletions to updates
  2. Scoping all queries to hide “deleted” records

Let’s start with deletion itself. To begin, we’ll provide some context for our example. Let’s open the psql console and set up an environment:

CREATE TABLE users (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  name text NOT NULL
);

CREATE TABLE orders (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  user_id integer NOT NULL,
  number text NOT NULL,
  FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE CASCADE
);

Next, we’ll populate it with some data:

INSERT INTO users (name) VALUES ('Andrew'), ('Vladimir'), ('Sara');
INSERT INTO orders (user_id, number) VALUES (1, 'A1'), (1, 'A2'), (2, 'V1'), (2, 'V2'), (3, 'S3');

For soft deletion to work, we’ll need additional columns: either deleted::boolean or deleted_at::timestamp. In real databases, I always prefer the second approach, but let’s keep things simple for now:

ALTER TABLE users ADD COLUMN deleted boolean NOT NULL DEFAULT false;
ALTER TABLE orders ADD COLUMN deleted boolean NOT NULL DEFAULT false;

Now that we’ve prepared the schema, let’s put it to use.

The deletion black magic begins

Since time immemorial, PostgreSQL has supported operations with rules (creation, altering, and dropping). These are database-specific extensions of the SQL syntax that are outside of the standard. It’s a simple concept: a rule is essentially a macro applied by the query executor to modify your query.

Prior to execution, a query goes though several steps. First, the parser takes the initial request and converts it into a query tree (after checking the correctness of the syntax). Then, the analyzer provides a semantic analysis of the tree. At this stage it doesn’t care about the semantics of the query (whether all the requested objects are actually present in the database, not to mention their accessibility).

After the query tree proves to be correct, it’s time for the rewriter to scan the tree looking for the rules applicable to every single node. Every time some rule is found, the rewriter uses it to update the tree. In fact, anyone with some database experience has come across this mechanism with views: in PostgreSQL, views are nothing more than a rule which specifies how to substitute the SELECT FROM <view> node to the corresponding sub-query.

After all the rules are applied, and the semantics of the resulting query tree has been checked, the query planner/optimizer comes into play. Neither the planner nor the executor know anything about the original request; they just deal with the rewritten tree. An interesting application of this is that in some cases you can extend the SQL syntax (as Alexey Kondratov explained in his amazing talk on FOSDEM-21).

So, how can we apply this particular approach to our problem? Actually, this is easily done: all we need to do is create a rule which changes the deletions to updates. Let’s do just that:

CREATE RULE "_soft_deletion" AS ON DELETE TO "orders" DO INSTEAD (
  UPDATE orders SET deleted = true WHERE id = old.id AND NOT deleted
);

Now we can try deleting some items from orders:

DELETE FROM orders WHERE id = 4;
SELECT FROM orders ORDER BY id;
 id | user_id | number | deleted
----+---------+--------+---------
  1 |       1 | A1     |
  2 |       1 | A2     |
  3 |       2 | V1     |
  4 |       2 | V2     | t
  5 |       3 | S3     |
(5 rows)

There we go! While the operation itself was just a classic DELETE statement, PostgreSQL silently applied the rule and changed the request to an update.

Cascade deletes

What about cascade deletion? How would this be processed working with our users? Let’s give it a shot:

DELETE FROM users WHERE id = 1;

Upon running this, we end up with an error:

ERROR:  XX000: referential integrity query on "users" from constraint "orders_user_id_fkey" on "orders" gave unexpected result
HINT:  This is most likely due to a rule having rewritten the query.

And it is quite a reasonable error, actually. We tried to delete users but the corresponding orders are not accounted for with this statement, so our foreign key constraint is working as intended to protect referential integrity.

To fix this problem, let’s change the orders table as well:

CREATE RULE "_soft_deletion" AS ON DELETE TO "users" DO INSTEAD (
  UPDATE users SET deleted = true WHERE id = old.id AND NOT deleted
);

SELECT * FROM users ORDER BY id;
 id |   name   | deleted
----+----------+---------
  1 | Andrew   | t
  2 | Vladimir |
  3 | Sara     |

SELECT * FROM orders ORDER BY id;
 id | user_id | number | deleted
----+---------+--------+---------
  1 |       1 | A1     |
  2 |       1 | A2     |
  3 |       2 | V1     |
  4 |       2 | V2     | t
  5 |       3 | S3     |

Another problem: while the soft deletion of the user record was successful, the orders are still alive. This is because our foreign key constraint can’t provide a cascade update in a dependent table.

We can fix the issue with another rule:

-- restore the "deleted" user for the next attempt
UPDATE users SET deleted = false WHERE id = 1;

CREATE RULE "_delete_orders" AS ON UPDATE TO users
  WHERE NOT old.deleted AND new.deleted
  DO ALSO UPDATE orders SET deleted = true WHERE user_id = old.id;

DELETE FROM users WHERE id = 1;

SELECT * FROM orders ORDER BY id;
 id | user_id | number | deleted
----+---------+--------+---------
  1 |       1 | A1     | t
  2 |       1 | A2     | t
  3 |       2 | V1     |
  4 |       2 | V2     | t
  5 |       3 | S3     |
(5 rows)

You might’ve spotted a difference here: instead of DO INSTEAD, we’ve used the DO ALSO rule. In addition to updating users, this rule also updates orders; both operations are executed by PostgreSQL while processing the same request.

Now, dependent orders are deleted along with the corresponding user. From the application’s point of view, literally nothing has been changed. It can continue sending delete requests to the database, and it’s on the database itself to handle the processing (albeit, a little differently now).

While these changes can might complicated, notice that, unlike a typical application, SQL is a declarative language. This means we only need to define any rule once and it will then be applied to any request. This contrasts with application code, where we’d need to separately fix every request.

Adding a hard delete hatch

Now, before continuing on to the second part of the article (scoping selects), I’d like to point out another problem. Since our database works “softly” now, someone on the team (for example, your db administrator) might be quite surprised if they were unable to actually delete a record from the database. Sometimes, even when soft deletion is being used, we still need the option to remove rows.

This can be made possible by adding a WHERE clause to our rule. In this clause, we’ll ask PostgreSQL to check some setting like rules.soft_deletion before performing the substitution.

First of all, we should add the setting itself (superuser access is necessary to do this):

ALTER SETTINGS SET rules.soft_deletion TO on;

And then, let’s modify all our rules to check this setting. We’ll do this with the OR REPLACE clause:

CREATE OR REPLACE RULE "_soft_deletion"
  AS ON DELETE TO "orders"
  WHERE current_setting('rules.soft_deletion') = 'on'
  DO INSTEAD UPDATE orders SET deleted = true WHERE id = old.id;

And with that, we can now modify this setting for the current connection and perform hard deletion as necessary:

SET rules.soft_deletion TO off;
DELETE FROM orders WHERE id = 5;
SELECT * FROM orders ORDER BY id;
 id | user_id | number | deleted
----+---------+--------+---------
  1 |       1 | A1     | t
  2 |       1 | A2     | t
  3 |       2 | V1     |
  4 |       2 | V2     | t

Scoping

After teaching our database this new “safe” deletion trick, it’s now time to cover up the evidence! To do that, we’ll teach it to automatically hide any records which have been marked as deleted.

We can’t do this in the same manner by catching the SELECTs, as support for SELECT rules in PostgreSQL is limited in some ways. For example, you can’t define such a rule for a table that has indexes.

One way to get around this restriction is by substituting the table with a VIEW and adding rules for inserting, updating, and deleting records:

ALTER TABLE orders RENAME TO _orders;

CREATE VIEW orders AS SELECT * FROM _orders WHERE NOT deleted;

SELECT * FROM orders ORDER BY id;
 id | user_id | number | deleted
----+---------+--------+---------
  3 |       2 | V1     |

The amazing thing with PostgreSQL views is that you can INSERT, UPDATE and DELETE rows directly from a VIEW. Under the hood, the PostgreSQL query executor will apply a view-specific rule, effectively “delegating” all those operations to the underlying table. This is only possible because our view has been built on the top of the sole table.

Let’s try it now:

INSERT INTO orders (user_id, number) VALUES (2, 'V3');

-- try the results in the view...
SELECT FROM orders;
 id | user_id | number | deleted
----+---------+--------+---------
  3 |       2 | V1     |
  6 |       3 | V3     |

-- ..and the underlying table
SELECT FROM _orders;
 id | user_id | number | deleted
----+---------+--------+---------
  1 |       1 | A1     | t
  2 |       1 | A2     | t
  3 |       2 | V1     |
  4 |       2 | V2     | t
  6 |       3 | V3     |

Nice.

Next, let’s try deletion:

DELETE FROM orders WHERE id = 6;
DELETE 0

SELECT FROM orders;
 id | user_id | number | deleted
----+---------+--------+---------
  3 |       2 | V1     |

SELECT FROM _orders;
 id | user_id | number | deleted
----+---------+--------+---------
  1 |       1 | A1     | t
  2 |       1 | A2     | t
  3 |       2 | V1     |
  4 |       2 | V2     | t
  6 |       3 | V3     |

Success! Our database now works exactly as we need. All deletions are safe, and scoping is properly being provided by a view which behaves exactly like the underlying table.

However, we can still make one more improvement. Previously, in order to implement hard deletion from a table, we had to add a condition. But, when using a view layer, we no longer need this. Instead of defining rules on the tables themselves, we can do the same on a view, keeping our table renamed, but unchanged in all the rest.

This entire scenario is very similar to what we did before:

  1. Add the deleted columns to allow for the soft deletion of users and orders
  2. Rename tables: orders → _orders and users → _users
  3. Create the corresponding views: orders and users on top of the tables
  4. Define 2 rules (_soft_deletion and _delete_orders) for the users view, and also the _soft_deletion rule for orders
-- let's suggest we are at the initial stage with users and orders

ALTER TABLE users RENAME TO _users;
CREATE VIEW users AS SELECT * FROM _users WHERE deleted IS NULL OR NOT deleted;

ALTER TABLE orders RENAME TO _orders;
CREATE VIEW orders AS SELECT * FROM _orders WHERE deleted IS NULL OR NOT deleted;

CREATE RULE _soft_deletion AS ON DELETE TO orders DO INSTEAD (
  UPDATE _orders SET deleted = true WHERE id = old.id
);

CREATE RULE _soft_deletion AS ON DELETE TO users DO INSTEAD (
  UPDATE _users SET deleted = true WHERE id = old.id
);

-- here we deal with updates in _users table, not with a view
CREATE RULE _delete_orders AS ON UPDATE TO _users
  WHERE old.deleted IS NULL OR NOT old.deleted AND new.deleted
  DO ALSO UPDATE _orders SET deleted = true WHERE user_id = old.id;

Here you can find the final SQL snippet for reproducing the feature.

With this, migration to soft deletion is completed. Although at the beginning of this article I used a couple of Ruby on Rails snippets, the resulting solution uses plain SQL (the PostgreSQL version), so it actually doesn’t matter which stack the application uses; this would work with Python/Django, or PHP/Symphony, or whatever else!

Migrations

The only thing this approach leaves you to take care of yourself is wrapping the DDL code into migration. I’d like to linger here for a moment longer. In Ruby on Rails, you have several options on how to define migration rules. The main question is this: while you can run SQL by calling the execute(sql) command, how should the change be reflected in the database schema?

The default schema, db/schema.rb, is built by Rails as Ruby code. It’s database-version-agnostic and safe, but it’s rather restricted in terms of features. In terms of its migrations, Rails doesn’t even support more popular features like custom types and views, not to mention the weirder beasts like rules. This was a prudent decision for a framework which must support integration with many databases. But this also means we’ll need other options for building the schema.

The first option: switching from the default db/schema.rb to db/structure.sql. The latter is not provided by Rails, but instead by PostgreSQL itself in pure SQL. All objects created in the database will be reflected in the schema.

Such power cannot be realized in absence of a corresponding dark side.

The first downside is that the format of the schema is version specific. Every time the local version of a database differs from the server version, it will provide a slightly different schema, and fixing a schema with every single migration is not the best use of your time. Another problem concerns safety: if you make some changes in your local database though means other than migration (for example, if you’re experimenting with some functions, or you’ve created a temporary table), these changes will go into the schema (PostgreSQL knows nothing about your intentions, it just reports its current state). This requires more care from a developer than they might be comfortable with.

I tried to solve this problem with my latest project pg_trunk. It supports a number of PostgreSQL-specific objects aside from tables and indexes, for instance: foreign keys (with some extensions compared to Rails), functions, triggers and procedures (similar to the gem f(x)), views and materialized views (take a look at scenic, types, statistics, sequences, and finally, rules.

The last snippet above can be re-written in the migration as following:

# let's suggest we are at the initial stage with users and orders
class SwitchToSoftDeletion < ActiveRecord::Migration[6.1]
  # add columns to mark deleted records

  add_column 'users',  :deleted, :boolean, null: false, default: false, comment: 'Recognize users as deleted'
  add_column 'orders', :deleted, :boolean, null: false, default: false, comment: 'Recognize orders as deleted'

  # place views on top of raw tables

  rename_table 'users', to: '_users'
  create_view 'users' do |v|
    v.sql_definition 'SELECT * FROM _users WHERE NOT deleted'
    v.comment 'A view hiding softly deleted rows'
  end

  rename_table 'orders', to: '_orders'
  create_view 'orders' do |v|
    v.sql_definition 'SELECT * FROM _orders WHERE NOT deleted'
    v.comment 'A view hiding softly deleted rows'
  end

  # add the corresponding rules

  create_rule 'orders' do |r|
    r.name '_soft_deletion'
    r.event :delete
    r.kind :instead
    r.command 'UPDATE _orders SET deleted = true WHERE id = old.id'
    r.comment 'Make soft instead of hard deletion'
  end

  create_rule 'users' do |r|
    r.name '_soft_deletion'
    r.event :delete
    r.kind :instead
    r.command 'UPDATE _users SET deleted = true WHERE id = old.id'
    r.comment 'Make soft instead of hard deletion'
  end

  create_rule 'users' do |r|
    r.name '_delete_orders'
    r.event :update
    r.kind :also
    r.where 'NOT old.deleted AND new.deleted'
    r.command 'UPDATE _orders SET deleted = true WHERE user_id = old.id'
    r.comment 'Mark dependent orders as softly deleted'
  end
end

You can find all the necessary operations to implement the aforementioned SQL Rails migrations in this documentation. Your feedback and feature requests are welcome, and highly appreciated!

Calling it a day

In his amazing new book ”PostgreSQL from the Inside” (only available in Russian for now, but I hope an English translation is on the way), Egor Rogov writes:

“Support for rules was announced as one of goals behind the creation of Postgres. As a result, rules have existed ever since this database was developed as a university project, and were also reconsidered at several later points. This is a powerful mechanism, though it is difficult to deal with in terms of both debugging and understanding. It has even been suggested to remove rules from PostgreSQL, but this movement did not find community support. Still, in most cases, it’s more convenient and safer to use triggers instead of rules.”

Despite my agreement with this notion, I still couldn’t help but show off the brighter parts of this powerful feature.

But however shiny rules might be, the much more important takeaway from this silly post is intended to be a bit different. As developers, we don’t have to stay purely within the boundaries of an application when facing architectural challenges (like moving from hard to soft deletion). Instead, we can consider other options outside of complexity itself, especially at the data storage layer. Modern databases are impressive and mighty tools that let you solve many tasks in a more clean, efficient and safe manner.

Putting soft delete logic on the database might be an unorthodox approach, but if you’ve got a problem or project that needs help, there’s nothing unusual about turning to Evil Martians for assistance! We are ready to help detect, analyze, and solve! Get in touch!

Schedule call

Irina Nazarova CEO at Evil Martians

We partitioned a database with 800K users and 60M records with zero downtime for Playbook.com. This relieved the DB of a major performance bottleneck, to serve many more millions of users! Got any performance and scalability tasks? We are ready to jump in and solve them!