Martian Chronicles
Evil Martians’ team blog
Back-end

Logidze: for all those tired of versioning data

You want a VCS for your ActiveRecord models but cannot choose between PaperTrail and Audited? Let me offer you a better alternative — Logidze!

Why not PaperTrail? Come on, a library so well-designed and mature can’t solve the versioning problem? Seems impossible, but it really can’t.

The versioning problem can be split into the following sub-problems:

  • how to get a previous state of an object (versions problem);
  • how to see what has changed since the time specified — the difference between then and now (diff problem).

Out of the box, PaperTrail allows you to get only the diff between two adjacent versions. Of course, you can build a custom solution on top of it, but will it be fast enough for production usage? I don’t think so (see below why).

Logidze was designed with the diff problem in mind. Not only does it solve both problems, but also it uses state-of-the-art features to solve them efficiently.

Logidze is a faster tool by and large and second to none when it comes to calculating diffs for ActiveRecord models.

Usage example

Suppose you are building an offline-first application (web, mobile or even desktop) which works with some data (let’s call it items) loaded from a server.

You want to have the items state synchronized every time your app is connected to the Internet. However, the data being too huge for everything to get reloaded, you consider loading only the changes made since the last sync.

Here comes Logidze. All you have to do is use the diff_from(timestamp) method:

class ItemsController < ApplicationController
  def diffs
    render json: Item.diff_from(params[:timestamp])
  end
end

It is not only dead simple but also incredibly fast! Check these benchmarks out for more details.

Needless to say, you can use it to solve the other versioning problem.

Implementation details

Application-level vs. database-level

What is the main difference between Logidze and other tools? It’s all about building the log data — unlike Audited and PaperTrail, Logidze uses the database itself, and not the application as these two do.

Pros

  • Application-agnostic.

You can use the same database from different applications (e.g. in micro-service architecture) effortlessly.

Logidze updates 1.5 and inserts 3 times faster than PaperTrail. Moreover, the update speed is only 1.5 times slower than triggerless. Not too bad, huh?

Cons

  • Database-dependent.

Currently, it supports only PostgreSQL 9.5+.

  • You have to reload data to access changes:
post  = Post.find(42)
post.log_size #=> 2
post.update!(name: 'New Name')
post.log_size #=> 2 still the same
post.reload.log_size #=> 3 !

Separate versions table vs. in-place log

Another difference is that Logidze stores changes in the record itself (in the log_data column). Just one column (per table), no additional tables at all.

The log_data is of JSONB type, which allows us to store and efficiently handle unstructured data.

Pros

  • No additional queries or JOINs.

This is how you get a copy of an object at the specified time:

old_post = post.at(2.days.ago) # no DB query at all!

It is also possible to get versions:

Post.where(active: true).at(1.month.ago)

And, as we’ve already mentioned, very fast diffs:

Post.where(created_at: Time.zone.today.all_day).diff_from(1.hour.ago)

Cons

  • Tracing all the model logs is hardly possible.

  • Because of the additional column, loading records requires a little bit more RAM — even if you don’t work with the changes.
    However, it’s way more memory-efficient when working with versions (see the benchmarks again).

Full snapshot vs. incremental log

An incremental log assumes you store only the changes made during the current operation. This approach reduces the memory usage drastically but requires more computation to restore a version or calculate a diff.

A full snapshot, on the other hand, does exactly what is states — it saves the whole record every time change has occurred. Thus if you want to store, let’s say, 10 versions per record, you need 10 times more disk space.

Logidze uses an incremental log.

Log format

The log_data column has the following format:

{
  "v": 2, // current record version,
  "h": // list of changes
    [
      {
        "v": 1,  // change number
        "ts": 1460805759352, // change timestamp in milliseconds
        "c": {
            "attr": "new value",  // updated fields with new values
            "attr2": "new value"
            }
        }
    ]
}

If you want to keep the log size under control, you can set a limit in the trigger definition. In this case, when there is no more room for a new change, the two oldest ones will get merged.

Diffing algorithm

The problem was as follows: there were two rows (OLD and NEW); we wanted to get the JSON, which represented their difference.

I considered several approaches:

  • iterating through keys using the FOR loop;
  • using the JSONB minus operator;
  • using the built-in hstore minus operator.

The hstore option turned out to be the fastest. Here:

changes := hstore_to_jsonb_loose(
  hstore(NEW.*) - hstore(OLD.*)
 );

Take a look at the hstore_to_jsonb_loose function. It typecasts for you. Moreover, does it rather fast!

You can run my benchmarks or just see the resulting table here.


Logidze is a free and open-source project, purposely designed to put an end to the data versioning problem. Just an experiment at the start, it has proved successful. However, the experiment is not finished yet. It is hosted on GitHub, and you are most welcome to contribute!

In case you’ve got a feature request to make or a bug to report, fill free to contact me through Issues.