“Fair” multi-tenant prioritization of Sidekiq jobs—and our gem for it!


In many backend applications, particularly complex, multi-tenant systems, managing the queue of background jobs can be difficult. This is especially true for applications that cater to diverse user bases, like an ecommerce app serving both smaller sellers and mega-merchants, where handling jobs in a “fair” way presents a real challenge. In this post, we’ll present one potential solution we developed while working with Rails and Sidekiq, along with a new Ruby gem that implements it: sidekiq-fair-tenant.

But first, some more context: a “background job” in this context refers to a unit of work, like processing a transaction, updating inventory, or generating reports. These jobs are typically placed in a job queue to be processed later to avoid blocking both the UI and application server.

However, without a prioritization mechanism, a large batch of jobs from a single tenant can monopolize resources. This can mean delays (minutes, or even hours) for other users’ jobs placed in the queue afterwards, and a massive degradation of the overall user experience. (It could possibly even violate SLOs and SLAs, depending on the terms of service, which is no good at all!)

For instance, to return to the previous example of the ecommerce platform: if a mega-merchant submits a request to update stock levels for hundreds of thousands of items, this could result in hundreds of thousands of jobs being enqueued. Without proper prioritization, this massive job batch can clog the system, delaying the processing of jobs from smaller merchants. This situation is, naturally, unfair and must be solved.

We’ve extracted our solution into a Ruby gem, sidekiq-fair-tenant, which implements fair job prioritization for multi-tenant Rails and Sidekiq applications. Setting it up takes just a couple of steps.
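First, add the gem to your Gemfile and run bundle install (a standard Bundler setup is assumed here):

# Gemfile
gem "sidekiq-fair-tenant"

Then wire it up with two small additions to your job classes: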

  1. Add fair_tenant_queues section to the sidekiq_options in your job class:

     class SomeJob
       sidekiq_options \
         queue: 'default',
    +    fair_tenant_queues: [
    +     { queue: 'throttled_2x', threshold: 100, per: 1.hour },
    +     { queue: 'throttled_4x', threshold:  10, per: 1.minute },
    +    ]
     end
  2. Add tenant detection logic into your job class:

     class SomeJob
    +  def self.fair_tenant(*_perform_arguments)
    +    # Return any string that will be used as tenant name
    +    "tenant_1"
    +  end
     end

And you’re pretty much good to go! (You can check the README for more details.)

Still, to understand the essence of this gem, let’s run through the story of how it came into this world!

To illustrate our solution, we’ll talk about our experience with one of our clients, Coveralls, which provides an application to monitor test coverage by revealing the parts of the code that aren’t covered by your test suite.

On this platform, customers pay based on the number of repositories or GitHub organizations they interact with, and these repositories can vary greatly in size and activity level. (A small but highly active repository can generate more “lightweight” tasks than a larger but less active one, which produces fewer “heavyweight” tasks.)

This results in a situation where a customer on a small payment tier (but with a large and active codebase) could potentially overwhelm the background job queue, causing longer wait times for customers on all other tiers.

Discovering and tackling the problem

Typically, these problems come to light when monitoring tools flag that the background job queues are backed up and that the service has slowed down. And, thereafter, a deeper dive often reveals that a single customer is responsible for this backlog.

For tech solutions built on Ruby on Rails and Sidekiq (meaning the underlying stack involves Ruby and Redis), the challenge is to manage these queues without compromising performance or storage. After some discussion, we started to delve into “fair queue” experiments to derive a Sidekiq-suitable approach.

The primary challenge was rooted in the unpredictable nature of job creation. In test environments, batches of jobs from different users are pre-determined, but in the real world, jobs are also distributed in time: it’s impossible to predict when another batch of jobs will arrive or how big it will be. This makes any “static” experiment less useful; we need to be able to look back into the job history for every user.

Another technical limitation is that queues in Sidekiq are strictly ordered: you can’t prioritize some jobs over others within the same queue. So you have to either use multiple queues or hack around scheduled jobs to achieve the desired effect. In this guide we are going to implement the auxiliary queue approach: “excess” jobs will be re-routed to “slower” queues.

The solution

We proposed a strategy to throttle “greedy” tenants’ jobs to give way to other tenants’ jobs. The first step was to define the threshold that constitutes “greediness”: a tentative threshold was set at 100 jobs a day for every user. If a tenant were to exceed this limit within a 24-hour sliding time window, their jobs would be deprioritized.

However, implementing this required careful consideration of exactly how to create this sliding window in Redis, as well as how to implement the throttling mechanism within Sidekiq’s existing architecture.

One way to implement a sliding window is to use Redis’ sorted set data structure. (Heck, sliding window rate limiters are even mentioned as a use case for it!)
Pros: it allows us to nicely handle job retries (we don’t use the quota to re-execute a failed job).
Cons: it requires space proportional to the number of jobs executed; this requirement may be too high for applications with billions of jobs per day.

Let’s take a closer look at how we can implement our solution with Redis’ sorted sets: for every tenant, we keep a sorted set in Redis that contains Sidekiq job identifiers (jid) as elements and the job enqueue time as the element score. This allows us to easily count enqueued jobs over arbitrary time intervals, so we can track multiple time windows with a single sorted set.
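As a minimal standalone sketch of this bookkeeping (using the redis-rb gem directly; the key name is purely illustrative), it boils down to a handful of sorted set commands:

require "redis"

redis = Redis.new
key = "fair_tenant:jobs:tenant_1" # illustrative key name

# Record an enqueued job: the member is the job id, the score is the enqueue time
redis.zadd(key, Time.now.to_i, "jid:0123abcd")

# How many jobs has this tenant enqueued in the last hour? (sliding window)
redis.zcount(key, Time.now.to_i - 3600, Time.now.to_i)

# Trim entries older than the largest window we care about (1 day here)
redis.zremrangebyscore(key, "-inf", Time.now.to_i - 86_400)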

Here’s the API we would like to have for application developers:

class HeavyJob
  include Sidekiq::Job

  sidekiq_options \
    queue: 'default',
    fair_tenant_queues: [
      { threshold: 100, per: 1.day,  queue: 'default_throttled' },
      { threshold:  40, per: 1.hour, queue: 'default_superslow' },
    ]

  def self.fair_tenant(klass, id) # same arguments as in `perform`
    record = (klass.is_a?(String) ? klass.safe_constantize : klass).find(id)
    record.user_id
  end

  def perform(klass, id)
    # job implementation
  end
end
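Enqueueing doesn’t change at all for application code; the re-routing happens transparently in the client middleware we’re about to write (the Repository model here is just an illustrative example):

HeavyJob.perform_async("Repository", 42)
# Normally lands in the 'default' queue; once this record's tenant exceeds a
# threshold, new jobs get re-routed to 'default_throttled' or 'default_superslow'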

To “re-route” jobs to other queues, we need to implement a Sidekiq “client” middleware, which will examine every job being enqueued for threshold violations and, if any are found, put it into a slower queue for execution instead of the main one:

module Sidekiq::FairTenant
  TENANT_ENQUEUES_KEY = "sidekiq-fair_tenant:enqueued:%<job_class>s:tenant:%<fair_tenant>s".freeze
  MAX_THROTTLING_WINDOW = 1.day

  class ClientMiddleware
    def call(worker, job, queue, redis_pool)
      # implementation will be here
    end
  end
end

First of all, we need to check that the current job is eligible for throttling (and skip if not):

return yield unless job["fair_tenant_queues"]&.any? # This job doesn't have throttling rules
return yield if queue != job["queue"] # Someone already re-routed this job

First, we check whether the job has any throttling rules defined. Everything passed to the sidekiq_options helper is available in the job argument of the middleware’s call method, so we can check for the presence of the job["fair_tenant_queues"] array. If it is missing or empty, this job doesn’t need any throttling, and we can just let it through by yielding control to the next middleware in the stack.
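For reference, here’s roughly what the job hash passed to the middleware looks like for the HeavyJob class above (the exact set of fields depends on the Sidekiq version, and the values here are illustrative):

{
  "class" => "HeavyJob",
  "queue" => "default",
  "args"  => ["Repository", 42],
  "jid"   => "0123abcd0123abcd0123abcd",
  "created_at" => 1700000000.0,
  # Custom options from sidekiq_options are merged into the payload as well;
  # nested keys may be symbols or strings, hence the symbolize_keys call later:
  "fair_tenant_queues" => [
    { threshold: 100, per: 1.day,  queue: "default_throttled" },
    { threshold: 40,  per: 1.hour, queue: "default_superslow" },
  ],
}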

Next, we check that the job is still in the queue defined in its job class, meaning that it hasn’t already been re-routed by a middleware earlier in the stack; we want to play nicely with other Sidekiq plugins.

Then, we need to ensure that we know the tenant identifier (or try to find it out):

worker = worker.is_a?(Class) ? worker : worker.constantize
job["fair_tenant"] ||= worker.fair_tenant(*job["args"]) if worker.respond_to?(:fair_tenant)
if job["fair_tenant"].blank?
  Rails.logger.warn "#{worker} with args #{job["args"].inspect} won't be throttled due to missing fair_tenant (#{job["fair_tenant"].inspect})"
  return yield
end

A tenant identifier can be specified in multiple ways: it can be provided explicitly on job enqueue (MyJob.set(fair_tenant: "tenant_1").perform_async) or calculated via the fair_tenant class-level method hook. If it’s not present at all, we just skip throttling for this job and log a warning, as this is most probably a bug.
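Here are both options side by side (SomeJob and the tenant naming scheme are illustrative):

# Option 1: pass the tenant explicitly at enqueue time
SomeJob.set(fair_tenant: "tenant_#{merchant.id}").perform_async(merchant.id)

# Option 2: derive the tenant from the perform arguments via the class-level hook
class SomeJob
  include Sidekiq::Job

  def self.fair_tenant(merchant_id)
    "tenant_#{merchant_id}"
  end

  def perform(merchant_id)
    # ...
  end
end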

If all the checks pass, it’s time to “register” our job in the sliding window, re-route it to another queue if needed, and finally allow it to be enqueued:

redis_pool.then do |redis|
  register_job(worker, job, queue, redis)
  job["queue"] = assign_queue(worker, job, queue, redis)
end

yield

In a nutshell, registration is just 3 Redis commands executed at once (in a transaction):

def register_job(worker, job, _queue, redis)
  fair_tenant = job["fair_tenant"]
  tenant_enqueues_key = TENANT_ENQUEUES_KEY % { job_class: worker, fair_tenant: fair_tenant }
  redis.multi do |tx|
    tx.zadd(tenant_enqueues_key, Time.current.to_i, "jid:#{job["jid"]}")
    tx.zremrangebyscore(tenant_enqueues_key, "-inf", MAX_THROTTLING_WINDOW.ago.to_i)
    tx.expire(tenant_enqueues_key, MAX_THROTTLING_WINDOW)
  end
end

First, we need to build the key name for the tenant’s sorted set. We use a Ruby format string for this, which allows us to easily change the key name format in the future if needed. Then we execute the following commands:

  • We open a transaction with the MULTI command, so all the commands will be executed at once.
  • We add the job identifier to the sorted set with ZADD; Redis will automatically create the set if it doesn’t exist.
  • Then, we trim the set, removing stale jobs (older than our maximum time window) with ZREMRANGEBYSCORE.
  • Next, we set an expiration on the key with the EXPIRE command: if the tenant becomes inactive, the whole sorted set with that tenant’s old job identifiers will be automatically removed by Redis after the maximum window has passed, so it won’t take up any space.
  • Finally, the Redis Ruby client will execute all the commands in the transaction with the EXEC command when the block is closed.

And that’s it!

After that, we need to decide whether the job needs to be re-routed, and for that, we’ll check whether the job violates any thresholds; if so, we’ll assign it to another queue:

# Chooses the last queue, for the most restrictive (threshold/time) rule that is met.
# Assumes the slowest queue, with most restrictive rule, comes last in the `fair_tenant_queues` array.
def assign_queue(worker, job, queue, redis)
  tenant_enqueues_key = TENANT_ENQUEUES_KEY % { job_class: worker, fair_tenant: job["fair_tenant"] }
  job["fair_tenant_queues"].map(&:symbolize_keys).filter do |threshold:, per: MAX_THROTTLING_WINDOW, **|
    threshold < redis.zcount(tenant_enqueues_key, per.ago.to_i, Time.current.to_i)
  end.last&.[](:queue) || queue
end

The main work here is done by the Redis ZCOUNT command, which counts the number of jobs enqueued within a time window from per seconds ago until now. If the number of jobs exceeds the threshold, we consider the rule matched and return the queue name from the last matching rule; otherwise, we return the original queue name. Always selecting the last matching rule allows us to order rules from least to most restrictive, which adds robustness to the configuration: if the existing throttling isn’t strict enough, we just add a new rule at the end of the array, with a smaller time window, pointing to a more throttled queue.
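To make the rule ordering concrete, here’s a hypothetical walk-through using the HeavyJob rules from above:

# fair_tenant_queues (least to most restrictive):
#   { threshold: 100, per: 1.day,  queue: 'default_throttled' }
#   { threshold:  40, per: 1.hour, queue: 'default_superslow' }
#
# Tenant enqueued 150 jobs in the last day, 20 of them in the last hour:
#   rule 1: 100 < 150 -> matches
#   rule 2:  40 < 20  -> doesn't match
#   => the job goes to 'default_throttled'
#
# Tenant enqueued 150 jobs in the last day, 60 of them in the last hour:
#   both rules match, and the last matching rule wins
#   => the job goes to 'default_superslow'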

Finally, we need to enable our client middleware both for clients (processes that enqueue jobs but don’t execute them) and for servers (Sidekiq worker processes), since the server also needs to re-route jobs that are enqueued from within other Sidekiq jobs.

# config/initializers/sidekiq.rb
Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add Sidekiq::FairTenant::ClientMiddleware
  end
end
Sidekiq.configure_server do |config|
  config.client_middleware do |chain|
    chain.add Sidekiq::FairTenant::ClientMiddleware
  end
end

Setting up the queues

The main mechanism that makes this solution work is Sidekiq’s weighted queues (see Advanced options for queues in the Sidekiq wiki), which let fast queues drain a few times faster than slow ones (e.g., weights of fast 4 and slow 1 will drain the fast queue roughly 4 times faster than the slow one):

:queues:
  - [general_queue, 6]
  - [general_queue_throttled, 3]
  - [general_queue_superslow, 1]

In this example, jobs in general_queue get a chance to be executed twice as often as jobs in general_queue_throttled, and six times as often as jobs in general_queue_superslow. If we take a look at the queue polling order calculation in Sidekiq::BasicFetch, we’ll see that every time a Sidekiq worker wants to take the next job for execution, it shuffles the queues to poll according to their weights: 60% of the time it will try to get a job from general_queue first, 30% of the time it will ask general_queue_throttled first, and 10% of the time it will try to pop from general_queue_superslow first. So, even when general_queue has a lot of waiting jobs, the throttled queues will still be processed, just at a slower pace.
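A rough sketch of what happens under the hood (heavily simplified; the real logic lives in Sidekiq::BasicFetch):

# With weighted configuration, queue names are repeated according to their weights...
QUEUES = ["general_queue"] * 6 +
         ["general_queue_throttled"] * 3 +
         ["general_queue_superslow"] * 1

# ...and before each blocking pop, the list is shuffled and de-duplicated, so
# higher-weighted queues are more likely to end up first in the polling order.
def polling_order
  QUEUES.shuffle.uniq
end

polling_order
# => e.g. ["general_queue", "general_queue_superslow", "general_queue_throttled"]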

Advantages:

  • If fast queues are empty, slow queues are processed at full speed (no artificial delays)
  • If fast queues are full, the slow queues are still processed, but at a slower rate (configurable), so the app also won’t “stall” for throttled users
  • Minimal changes to the app code are required

Disadvantages:

  • Since Sidekiq doesn’t support mixing strictly ordered and weighted queue modes (as stated in the Sidekiq wiki on queue configuration), you can’t make the same workers always drain some super-important queue first while using weights for the rest.
  • You have to keep track of all your queues and their weights, and manually keep the Sidekiq configuration for production and development in sync with the configuration in your job classes.

Results and feedback

Post-implementation, the general feedback has been positive! The primary objective was ensuring that “smaller” customers do not experience delays, and this goal was met: complaints from users about processing delays have disappeared.

At the same time, when larger, “greedier” customers voiced concerns about slower build times, they were typically offered more powerful plans, or even the potential of setting up a dedicated instance for their operations.

Using this approach, apps can ensure a consistent and positive user experience for all their customers.

And don’t forget the gem!

And as a reminder, we’ve done the heavy lifting for you, so there’s no need to copy code samples from this post; just be sure to check out sidekiq-fair-tenant!

The gem is designed to implement fair job prioritization in multi-tenant applications, and it also supports ActiveJob along with plain Sidekiq jobs.

At Evil Martians, we transform growth-stage startups into unicorns, build developer tools, and create open source products. If you’re ready to engage warp drive, give us a shout!
