Flaky tests, be gone: long-lasting relief for chronic CI retry irritation!


Every developer knows this pain: your test suite passes locally but fails on CI. You click “Retry” and hold your breath. It passes! But was it a real fix or just luck? Well now, no luck needed! We’ve helped dozens of developers from ClickFunnels, a leading sales funnel platform, go from flaky tests with ~80% success rates to 100%* reliability across their massive test suite (9k+ unit, 1k+ feature tests). And our Evil Martians formula can clear up your chronic CI retry irritation, too!

This handy formula clinically covers every known cause of test flakiness.

Evil Martians’ ability to dig deep and find root causes was game-changing. From flaky tests to mysterious CI deadlocks, they didn’t just fix symptoms. Their expertise pointed us towards a systematic approach that eliminated the problem at its source.

Zero tolerance for flaky tests

Before diving into fixing flaky tests, the first step is establishing a zero-tolerance policy. We need a quarantine system that immediately isolates flaky tests instead of allowing them to poison the CI pipeline.


Why quarantine works: Instead of hitting retry buttons and hoping for the best, a quarantine forces teams to acknowledge flaky tests exist. This prevents the “it’s fine” mentality that lets flakiness spread.

RSpec.configure do |config|
  config.filter_run_excluding :flaky if ENV["CI"]
end

RSpec.describe "flaky test", :flaky do
  it "works unreliably" do
    # This test is flaky and quarantined on CI until fixed
  end
end

At ClickFunnels, with its big test suite and multiple development teams, well-defined processes are essential to manage flaky tests at scale. The team committed to maintaining the quarantine by disabling tests promptly, holding weekly review meetings with code owners, and planning actions to fix or remove specs within specified timeframes.

This last point is really important: tests can’t stay quarantined indefinitely. The aim isn’t to quarantine tests away permanently; we want to treat them.

Here’s how to configure best practices and expose hidden flakiness before it grows and bites you in the long run:

RSpec.configure do |config|
  # Recommended setup: randomize test order and seed Ruby's RNG for reproducible randomness
  config.order = :random
  Kernel.srand config.seed

  # Run tests multiple times to expose hidden flakiness
  if ENV["DR_TEST_MODE"]
    config.around(:each) do |example|
      3.times { example.run }
    end
  end
end

DR_TEST_MODE=1 bundle exec rspec

Oh, and with the same retry trick we can try to reproduce flakiness faster than waiting for the whole CI to run:

RSpec.configure do |config|
  # Run focused tests multiple times to expose flakiness quicker
  config.around(:each, :flaky_focus) do |example|
    10.times { example.run }
  end
end

RSpec.describe "suspected flaky test", :flaky_focus do
  it "marks the test for several iterations" do
    # The flakiness is confirmed
  end
end

Found something? (Or are you just plain tired of suffering from known issues?) Time to diagnose the primary source of illness!

Flaky tests HATE the Evil Martians formula! After 30 years in the testing laboratory, I’ve seen everything from mild test hiccups to severe flaky infestations. Evil Martians is the only treatment I recommend!

Unit test flakiness sources

Global state is the most critical source because it’s so broad: it affects everything. When tests pass in isolation but fail in groups, you’re dealing with state that persists between tests and creates hidden dependencies.

Global state

Global, class and singleton variables

At ClickFunnels, as expected from a reputable engineering organization, all the standard cleanup procedures were in place. But they were dealing with a tricky case of state bleeding from the setup stage into tests: state persisting inside seeds that contaminated tests.

The issue was inside RequestStore, a gem for storing common per-request variables thread-locally (such as the current user, site settings, and so on). During seed loading, RequestStore was pre-populated with common data that bled into tests, requiring double cleanup:

RSpec.configure do |config|
  config.before(:each) do
    load_seeds

    # Purge the state after seed loading, before each test.
    # This ensures a blank state for each test.
    RequestStore.clear!
  end

  config.after(:each) do
    # Clean up the state after each test.
    # This allows preparing seeds from a blank state for each setup.
    RequestStore.clear!
  end
end

This double-clearing strategy eliminates a frequent source of global state contamination across the whole test suite. Traditional global variables also need their own cleanup strategies:

# Flaky: Variables that persist between tests
$debug_mode = false    # Global variables accumulate

class ReportGenerator
  @@shared_cache = {}  # Class variables persist
end

class HttpClient
  @instance = nil      # Class instance variables persist

  def self.instance
    @instance ||= new
  end
end

# Better: Explicit cleanup after a relevant test
after do
  $debug_mode = false
  ReportGenerator.class_variable_set(:@@shared_cache, {})
  HttpClient.instance_variable_set(:@instance, nil)
end

ClickFunnels had a smaller, more localized issue: service classes used class instance variables for caching (notably in their important Orders class), which required adding explicit clearing.

Classes themselves are also global state, so when tests define classes, they persist in Ruby’s global namespace and pollute subsequent tests:

# Flaky: Class definitions persist globally
it "allocates a new class" do
  class MockClass; end
end

# Better: Use mocks instead of allocating any classes
it "uses mocks" do
  mock_class = class_double("RealClass").as_stubbed_const
  allow(mock_class).to receive(:flaky?).and_return(false)
end

Global variables are particularly insidious because they’re invisible: you don’t see them declared in your test files, yet they quietly accumulate state between test runs. So pay attention to globals during development.

Also take care with gems; sometimes they’re just glorified global state (like RequestStore), or they keep memoized instances that persist between tests and need resetting.

And don’t forget about Rails’ cache, which silently accumulates state! If you enable caching through config, cached data persists across test runs; dodge the hardest known CS problem (cache invalidation) early:

# Better: Clear caches between tests
RSpec.configure do |config|
  config.after(:each) do
    Rails.cache.clear
  end
end

Global configuration and environment variables

Configuration and environment variables are stealthy state leakers that make tests pass or fail based on what ran before them. ClickFunnels had this exact issue with consider_all_requests_local bleeding between tests, since some tests required production-like error handling.

We need to reset configs and envs after every test to prevent contamination. The solution is strategic restoration using around blocks:

RSpec.shared_context "config helper" do
  # Temporarily change the value of a Rails configuration option.
  def with_config(config)
    original_config = config.map do |key, _|
      [key, Rails.application.config.public_send(key)]
    end
    config.each do |key, value|
      Rails.application.config.public_send("#{key}=", value)
    end
    yield
  ensure
    original_config.each do |key, value|
      Rails.application.config.public_send("#{key}=", value)
    end
  end
end

RSpec.configure do |config|
  config.include_context "config helper", :with_config
end

# Better: The config is reset after running the block
RSpec.describe "error pages", :with_config do
  it "handles error pages with production config" do
    with_config(consider_all_requests_local: false) do
      # It does not change the config globally
    end
  end
end

Other common leaky patterns are the ENV, locale, logger, and routes:

# Flaky: Changes persist to next test
it "changes environment variables" do
  ENV["API_KEY"] = "test-key"            # ENV persists
end

it "handles another locale" do
  I18n.locale = :pt                      # Locale leaks
end

it "modifies logger" do
  Rails.logger = Logger.new("test.log")  # Logger config is changed globally
end

it "modifies routes" do
  Rails.application.routes.draw do
    get "/special", to: "special#index"  # Permanently modified routes
  end
end

For environment variables, use the Climate Control gem for automatic restoration.
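
For example, with the climate_control gem, changes are scoped to a block and restored automatically:

# Better: ENV changes are reverted once the block exits
it "changes environment variables safely" do
  ClimateControl.modify(API_KEY: "test-key") do
    expect(ENV["API_KEY"]).to eq("test-key")
  end
  # Outside the block, the original value is back
end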

Time troubles

Sometimes state is truly global because it belongs to the outside world: time, date, and timezone are beyond our application’s control. Time-dependent tests are a classic source of flakiness: a test that passes at 2 PM might fail at 2 AM, or pass on Monday but fail on Sunday.

Everyone usually knows about this source of flakitude, so it’s about implementing the solution properly:

# Flaky: Depends on current time
it "fails due to time dependency" do
  expect(report.last_day_of_year?).to be(true)
end

# Better: use ActiveSupport time helpers or Timecop
around do |example|
  travel_to(Date.parse("2025-12-31")) { example.run }
end

it "checks last day of year consistently" do
  expect(report.last_day_of_year?).to be(true)
end

Why freeze time? Because tests should be deterministic. If your test depends on the current time, date, or timezone, it will behave differently every time it runs. A frozen time eliminates this variability.

Speaking of external factors, let’s talk about the database: the central component of most modern web apps.

Database state

Modifications outside transactions

Rails helps with database cleanup by wrapping each test in a database transaction and rolling it back automatically. This transactional approach seamlessly handles most data cleanup.

# Ensure transactional tests are enabled
RSpec.configure do |config|
  config.use_transactional_tests = true
end

However, before(:all) blocks create data outside this transaction, causing it to persist between tests.

# Flaky: Persists between tests
before(:all) do
  @doctor = create(:user, role: "doctor")
end

# Better: Use transactional setup
let(:doctor) { create(:user, role: "doctor") }

Usually, devs go with before(:all) for performance; for a safer alternative, use TestProf’s let_it_be:

# Configure default modifiers
TestProf::LetItBe.configure do |config|
  config.default_modifiers[:refind] = true
end

# Safer performance optimization
let_it_be(:admin) { create(:user, role: "admin") }

Additionally, if you’re testing migrations or code that works with temporary tables, note that DDL operations (CREATE, ALTER, DROP tables) are usually not protected by transactions and need manual cleanup, too.

Here’s why DDL is special: unlike DML (INSERT, UPDATE, DELETE), DDL operations often auto-commit and can’t be rolled back in many databases, making them especially problematic for test isolation.
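
A minimal sketch of such manual cleanup (the table name is illustrative):

# Better: DDL isn't rolled back, so drop created tables explicitly
RSpec.describe "temporary tables" do
  before do
    ActiveRecord::Base.connection.create_table(:tmp_reports, temporary: true)
  end

  after do
    ActiveRecord::Base.connection.drop_table(:tmp_reports, if_exists: true)
  end
end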

Seed data and memory contamination

Seeds and fixtures are another popular performance optimization; they preload common data once instead of creating it in every test. However, the objects created by these run-once snippets share memory state across tests.

Database changes are reverted by transactions, but objects loaded from seeds persist in memory with stale state; this is the same issue that let_it_be’s refind modifier solves:

module SeedHelpers
  def load_seeds
    @current_user = User.find_by!(role: "patient")
  end
end

RSpec.configure do |config|
  config.include SeedHelpers

  config.before(:suite) do
    load_seeds
  end
end

RSpec.describe "user" do
  # Flaky: Memory contamination
  it "disables user status" do
    @current_user.update!(active: false) # Database reverts, memory object stays stale
  end
end

TestProf solves this problem with the AnyFixture helper. It takes care of both cleanup and memory state management by automatically cleaning up at the end of test runs and providing helpers to refresh memory state between tests:

using TestProf::AnyFixture::DSL

RSpec.configure do |config|
  config.before(:suite) do
    fixture(:current_user) { create(:user, role: "patient") }
  end
end

RSpec.describe "user" do
  let(:patient) { fixture(:current_user) }

  # Better: Fresh from database every time, created only once
  it "disables user status" do
    patient.update!(active: false)
  end
end

For CI environments, you can skip database seeding entirely by caching data between runs. When using caching, always use checksums for invalidation to detect when cached data becomes stale. ClickFunnels does this extensively to save time and simultaneously ensure clean test environments:

- name: Cache test database
  uses: actions/cache@v4
  with:
    path: tmp/test_dump.sql
    key: ${{ runner.os }}-seeds-${{ hashFiles('db/**/*.rb') }}

- name: Restore or setup test database
  run: |
    if [ -f tmp/test_dump.sql ]; then
      RAILS_ENV=test bundle exec rails db:create
      psql test_database < tmp/test_dump.sql
    else
      RAILS_ENV=test bundle exec rails db:setup
      pg_dump test_database > tmp/test_dump.sql
    fi

That’s enough about the most common issues caused by performance optimizations. Now, let’s move on to generic database quirks.

Database ordering

Database queries without explicit ordering return results in an undefined order that can vary between runs. Tests that depend on implicit ordering will sometimes pass and sometimes fail, depending on recently inserted records or query planner decisions.

# Flaky: Depends on insertion order and query planner
it "fails due to implicit ordering" do
  expect(User.doctors.map(&:surname)).to eq(["Pasteur", "Fleming"])
end

# Better: Use contain_exactly for order-independent assertions
it "uses order-independent assertions" do
  expect(User.doctors.map(&:surname)).to contain_exactly("Pasteur", "Fleming")
end

# Better: Or specify explicit ordering when order matters
it "uses explicit ordering when needed" do
  expect(User.doctors.order(:created_at).first.surname).to eq("Pasteur")
end

Here’s the deterministic approach: always specify ordering when the order matters, or use order-independent assertions when it doesn’t.

Database primary key sequences

Another database quirk: even with transactional tests, database sequences (including primary key counters) are not reset and continue incrementing, causing tests to fail when they expect specific ID values.

# Flaky: Don't rely on specific IDs
it "fails due to specific ID dependency" do
  expect(post.user.id).to eq(1)
end

# Better: Test relationships and attributes
it "tests relationships instead of specific IDs" do
  expect(post.user).to eq(user)
end

# Better: For missing records, use a clearly non-existent ID, such as -1
it "uses a clearly non-existent ID for missing records" do
  expect { User.doctors.find(-1) }.to raise_error(ActiveRecord::RecordNotFound)
end

External system dependencies

Beyond database issues, your system also interacts with real-world infrastructure. When tests touch external systems (data stores, filesystems, APIs, queues, storage), they introduce contamination from outside your application’s controlled environment.

Feature flags: the ClickFunnels LaunchDarkly challenge

Let’s start with feature flags because ClickFunnels had this exact problem. ClickFunnels relies extensively on LaunchDarkly, with hundreds of flags controlling everything from UI experiments to critical business logic (a popular approach for huge projects).

Their tests were drifting from production reality because of the huge number of flags that weren’t synced between environments. This was especially problematic for smoke tests, which exercise large portions of the app and need realistic flag combinations, not basic stubs.

Because of this complexity, we crafted an extensive mocker system based on production dumps:

RSpec.configure do |config|
  config.before(:each) do
    @original_client = Rails.configuration.x.launch_darkly.client

    # Stub the original client to allow overriding feature flags
    # via `stub_feature_gate_check`,
    # and fall back to fetching the production feature flags from the dump
    # without sending anything to LD.
    stubbed_client = ClickFunnels::FeatureGates::StubbedClient.new(
      Rails.configuration.x.launch_darkly.client
    )
    Rails.configuration.x.launch_darkly.client = stubbed_client
  end

  config.after(:each) do
    Rails.configuration.x.launch_darkly.client = @original_client
  end
end

Note that Rails.configuration.x is Rails’ official namespace reserved for custom configuration, providing settings that are easy to use in code and swap in tests.

In general, we want to eliminate all external dependencies in tests. Feature flags are external services that introduce unpredictable behavior, so mock them deterministically using production data.
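
The stubbed client internals are project-specific, but a simplified sketch of a stub_feature_gate_check helper might look like this (assuming the LaunchDarkly Ruby SDK's variation API):

# Hypothetical helper: deterministic flag overrides on top of the stubbed client
module FeatureGateHelpers
  def stub_feature_gate_check(flag_key, enabled)
    client = Rails.configuration.x.launch_darkly.client
    allow(client).to receive(:variation)
      .with(flag_key, anything, anything)
      .and_return(enabled)
  end
end

RSpec.configure do |config|
  config.include FeatureGateHelpers
end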

External data stores

External data stores like Redis, Elasticsearch, RabbitMQ, and other message queues maintain state between tests if you’re using real integrations instead of mocked ones.

# Better: Clear external data stores between tests when using real integrations
RSpec.configure do |config|
  config.after(:each, :with_cache) do
    # Clear Redis
    MyApp.redis.flushdb
  end

  config.after(:each, :with_search) do
    # Clear Elasticsearch
    # (requires `action.destructive_requires_name = false` in config)
    Searchkick.client.indices.delete(index: "*")
    # or
    User.searchkick_index.delete
    Post.searchkick_index.delete
  end

  config.after(:each, :with_message_queue) do
    # Clear RabbitMQ queues
    channel = MyApp.bunny.create_channel
    %w[my_app.events.fanout].each do |exchange|
      channel.exchange(exchange, type: :fanout, durable: true).delete
    end
    %w[my_app.events].each do |queue|
      channel.queue(queue, durable: true).delete
    end
  ensure
    channel&.close unless channel&.closed?
  end
end

Why this matters: external stores don’t participate in Rails’ transactional test cleanup. Data written by one test remains visible to subsequent tests, creating hidden dependencies.

Job and mailer queues

Background jobs and emails accumulate in their queues between tests, causing unexpected behavior when tests check queue contents.

# Better: Clear queues between tests
RSpec.configure do |config|
  config.after(:each) do
    Sidekiq::Worker.clear_all # provided by sidekiq/testing
    ActionMailer::Base.deliveries.clear
  end
end

Filesystem changes

File uploads, temporary files, and directory changes persist on the filesystem between tests unless explicitly cleaned up.

# Better: Always clean up files in tests
RSpec.describe FileUploader do
  after do
    FileUtils.rm_rf(Rails.root.join("tmp", "test_uploads"))
  end
end

Consider using the FakeFS gem to mock filesystem operations entirely, eliminating cleanup concerns.
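
A minimal setup with FakeFS’s RSpec helpers:

# Better: No real disk writes, no cleanup needed
require "fakefs/spec_helpers"

RSpec.describe FileUploader do
  include FakeFS::SpecHelpers # activates FakeFS around each example

  it "writes without touching the real filesystem" do
    File.write("report.txt", "data")
    expect(File.read("report.txt")).to eq("data")
  end
end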

HTTP requests and external dependencies

This includes API calls, file storage, and gems that hide HTTP requests behind their facades. Many gems make network calls without you even realizing it.

Use WebMock to prevent any external HTTP calls and ensure test isolation:

# Better: Add WebMock to your test suite
gem "webmock"

RSpec.configure do |config|
  config.before(:suite) do
    WebMock.disable_net_connect!(allow_localhost: true)
  end
end

RSpec.describe ExternalClient do
  before do
    stub_request(:post, "https://example.com/")
      .with(
        body: hash_including(full_name: "Dr. Test")
      )
      .to_return(
        status: 200,
        body: {id: 1, full_name: "Dr. Test"}.to_json
      )
  end
end

VCR for complex HTTP interactions

For complex API interactions, use VCR cassettes to record and replay HTTP requests.

# Better: Add VCR for complex API testing
gem "vcr"

# By default, VCR matches requests based on URI and method which is pretty lax for POST requests.
VCR_MATCH_ON = %i[uri method body].freeze
VCR_MATCH_ON_LAX_DEFAULT = %i[uri method].freeze

VCR.configure do |config|
  config.cassette_library_dir = "spec/fixtures/vcr_cassettes"
  config.ignore_localhost = true
  config.hook_into :webmock
  config.configure_rspec_metadata!

  config.default_cassette_options = {
    match_requests_on: VCR_MATCH_ON,
    # Easy re-record cassettes when needed.
    record: ENV["VCR_RERECORD_CASSETTES"] ? :all : :once
  }

  config.filter_sensitive_data("<API_KEY>") { ENV["API_KEY"] } # Example
end

RSpec.describe ExternalClient, vcr: true do
  # Uses new strict matching (URI + method + body):
  # Good for GET requests with consistent params.
  it "fetches User" do
    user = client.fetch_user(1)

    expect(user[:full_name]).to eq("Dr. Test")
  end

  # Uses default lax matching (URI + method only):
  # Ignores body differences from Faker randomness.
  it "creates User", vcr: {match_requests_on: VCR_MATCH_ON_LAX_DEFAULT} do
    user = client.create_user("Dr. Mini Test", email: Faker::Internet.email)

    expect(user[:full_name]).to eq("Dr. Mini Test")
    expect(user[:email]).to be_present
  end
end

It’s easy to re-record all cassettes with VCR_RERECORD_CASSETTES. We’ve also updated the default config to include body matching for POST/PUT requests. This prevents overly permissive tests that match everything, while still keeping the more relaxed default constant available as an escape hatch. That option is handy when request bodies contain random data by design, which would otherwise break cassette reuse.

This VCR approach is particularly useful for integration tests with complex external APIs where manual mock management becomes unwieldy.

Beware that cassettes can become stale as APIs evolve, and re-recording can be a maintenance burden. With ClickFunnels, we solved this with automated CI that re-records all cassettes monthly via a scheduled job, opening PRs for developer review. Make sure your team has convenient access to the test API keys needed for cassette re-recording.
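
The exact CI setup depends on your platform; as a sketch, the scheduled job can boil down to a hypothetical rake task like this:

# Hypothetical task for a scheduled CI workflow to invoke
namespace :vcr do
  desc "Re-record all VCR cassettes"
  task rerecord: :environment do
    ENV["VCR_RERECORD_CASSETTES"] = "1"
    sh "bundle exec rspec --tag vcr"
  end
end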

Other network protocols

DNS lookups, email services, and raw sockets are usually overlooked and should be mocked too to prevent external dependencies:

# Better: Mock DNS lookups
before do
  allow(Resolv).to receive(:getaddress).with("example.com").and_return("192.0.2.1")
end

# Better: Disable actual E-Mail sending
Rails.application.configure do
  config.action_mailer.delivery_method = :test
end

# Better: Mock socket connections
before do
  FakeFS.activate!(io_mocks: true)
end

ClickFunnels previously relied on real DNS since it interacts heavily with domains (it’s a great site builder, after all!). Interestingly, this dependency went unnoticed for a long time, as major outages of public DNS resolvers are rare.

However, no matter how reliable your internet connection is, it’s still slow and brittle. We discovered this with TestProf and disabled DNS resolution in tests to remove external dependencies and improve performance.

Problematic test design

The last thing to consider is the quality of your tests. Tests themselves can introduce flakiness through poor design by being too permissive, overly strict, or making incorrect environment assumptions:

# Flaky: Generates random data that may violate constraints
it "generates too random data" do
  non_unique_email = Faker::Internet.email
  expect { user.update!(email: non_unique_email) }.not_to raise_error(ActiveRecord::RecordNotUnique)
end

# Flaky: Uses no-wait selector that may not find elements
it "cannot find a link by a strict selector" do
  page.first(".navigation-menu > a", wait: 0).click
end

# Flaky: Fails in CI because of incorrect assumptions (assumes fast disk)
it "makes incorrect timing assumptions" do
  expect { slow_file_operation }.to perform_under(1).sec
end

As we saw with DNS, faster tests are less flaky tests.

When tests run quickly, there’s less opportunity for timing issues, resource contention, and environmental variability.

Fast-running tests also provide immediate feedback to developers, making it easier to identify and fix issues early in the development cycle. Developer productivity + a long-term strategy for stability!

If you’re not sure where to start, consider profiling with TestProf first as it can uncover unexpected bottlenecks.
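
For example, TestProf’s event and stack profilers are just an environment variable away:

EVENT_PROF=sql.active_record bundle exec rspec
TEST_STACK_PROF=1 bundle exec rspec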

Feature test stability

Feature tests are inherently more prone to flakiness: they’re running in real browsers with JavaScript, making HTTP requests, and asserting multiple expectations across the full application stack.

The key is accepting this reality and building defenses against it.

Retries for browser tests only

Browser tests can legitimately fail due to temporary browser environment instability. Use automatic retries sparingly and ONLY for browser tests (rspec-retry or Minitest::Retry):

# Better: Browser retries ONLY - low retry count
gem "rspec-retry", require: "rspec/retry"

RSpec.configure do |config|
  config.verbose_retry = true
  config.default_retry_count = 0

  config.around(:each, type: :system) do |example|
    example.run_with_retry(retry: 2)
  end
end

We highly recommend avoiding retries in unit tests, as they can mask real issues. In feature specs, keep the retry count low to prevent mediocre tests from dragging out the entire test suite run.

Essential browser configuration

Since we’re working with a live browser environment, we need to prepare it properly.

The first important thing here is viewport consistency, eliminating the variability between different computers that breaks layout-dependent tests. Prefer the screen sizes most popular among your users for desktop and mobile devices.

This approach makes sure that your tests reflect real-world usage while maintaining stability across environments.

Also, do not forget to reset the viewport back to the default size if you’re using several sizes:

# Better: consistent viewport
TEST_DEVICES = {
  desktop_full_hd: [1920, 1080],
  mobile_hd: [720, 1280]
}
CAPYBARA_DEFAULT_WINDOW_SIZE = TEST_DEVICES.fetch(:desktop_full_hd)

Capybara.register_driver(:cuprite) do |app|
  options = {window_size: CAPYBARA_DEFAULT_WINDOW_SIZE}
  Capybara::Cuprite::Driver.new(app, options)
end

RSpec.configure do |config|
  config.after(:each, type: :system) do
    page.current_window.resize_to(*CAPYBARA_DEFAULT_WINDOW_SIZE)
  end
end

At ClickFunnels, one major refactoring step was eliminating unnecessary viewport manipulations from tests. While these didn’t affect CI stability, they caused occasional flakiness during local development, where developers run tests at different screen resolutions.

The second step is minimizing randomness by disabling fancy browser behaviors, like animations and transitions, that interfere with element interactions. We can also increase the default waiting time, as some complex applications require more time to load:

# Better: more ample time
Capybara.default_max_wait_time = ENV.fetch("CAPYBARA_DEFAULT_MAX_WAIT_TIME", "10").to_i

# Better: less animations and transitions
Capybara.disable_animation = true
Capybara.register_driver(:cuprite) do |app|
  options = {browser_options: {"disable-smooth-scrolling" => true}}
  Capybara::Cuprite::Driver.new(app, options)
end

From the app side, we should precompile assets on CI to minimize JavaScript slowdowns and CSS content jumps during test execution:

# Better: stable test run duration after precompilation
RSpec.configure do |config|
  config.before(:suite) do
    system("RAILS_ENV=test rails assets:precompile") if ENV["CI"]
  end
end

Finally, we also need to set up screenshots, which are built into Rails, to see the sources of inevitable failures (and don’t forget to persist them as CI artifacts after each run):

# Better: easy flaky test debugging
Capybara.save_path = Rails.root.join("tmp/screenshots")

Browser advanced state

Clean browser advanced state between tests:

RSpec.configure do |config|
  config.after(:each, type: :system) do
    # Better: Clear browser session and local storages
    page.clear_storage
    # or (non-Selenium)
    page.execute_script("sessionStorage.clear(); localStorage.clear();")
  end
end

Please note that some drivers and versions (e.g., Selenium) automatically clear this browser state between tests, so you may not need explicit cleanup.

Reliable selectors and JS synchronization

Use test selectors

The foundation of reliable browser tests starts with stable selectors that survive UI changes. Use semantic attributes designed for testing, not implementation details that expire faster than people who Google their symptoms:

# Better: Use semantic attributes designed for testing
Capybara.configure do |config|
  config.test_id = "data-testid"
end

# Support the semantic test attribute inside `find`
Capybara.add_selector(:test_id) do
  xpath do |locator|
    XPath.descendant[XPath.attr(Capybara.test_id) == locator]
  end
end

expect(page).to have_link(test_id: "evil-martians-formula")
# and
find(:test_id, "eliminate-flaky-tests").click

Always use waiting selectors

A common misconception is that wait time is the maximum time a test will spend waiting. In reality, Capybara repeats searches several times during the wait period, so successful matches typically complete much faster than the timeout.

To avoid flaky tests, always use waiting matchers and avoid non-waiting alternatives. Here are some common examples of flaky browser tests:

# Flaky: Non-waiting selectors fail if the element isn't immediately present
it "fails with non-waiting selectors" do
  page.first(".navigation-menu", wait: 0).click
end

# Flaky: Find waits for element to appear, but doesn't wait for content to load
it "doesn't wait for dynamic content properly" do
  expect(page.find(".dynamic-content")).to have_content("Loaded")
end

# Flaky: all returns the collection as soon as the first element matches; the last element might not exist yet
it "assumes all elements are ready immediately" do
  page.all(".lazy-load-images").last.click
end

# Flaky: If the spinner hasn't rendered yet, the test may pass (false positive)
it "checks spinner too early" do
  expect(page).not_to have_css(".loading-spinner")
end

And here are the corrected versions:

# Better: Waiting matcher that retries until found or timeout
it "uses proper waiting selectors" do
  page.first(".navigation-menu").click
end

# Better: Wait for dynamic content to load
it "waits for dynamic content" do
  expect(page).to have_css(".dynamic-content", text: "Loaded")
end

# Better: Wait for all images to appear when they load one by one
it "waits for all lazy-loaded images" do
  exact_image_count = Image.all.size
  expect(page).to have_css(".lazy-load-images", count: exact_image_count)
  page.all(".lazy-load-images").last.click
end

# Better: Wait for loading spinner to disappear
it "properly waits for spinner to disappear" do
  expect(page).to have_no_css(".loading-spinner")
end

Avoid sleep in all but a few exceptional cases; waiting selectors are more reliable and faster than arbitrary delays.

capybara-lockstep for JS synchronization

Modern web applications are heavily async: built with Hotwire, React, or Inertia, and plagued with lots of AJAX requests.

While Capybara can retry some failed actions, it doesn’t know about JavaScript or AJAX, so UI interactions may be attempted before elements are ready and network requests complete.

This was especially important for ClickFunnels, which is a platform for building highly dynamic websites with complex user-facing editors.

Instead of manual waiting or inefficient sleep, we decided to use capybara-lockstep. With the gem configured, Capybara waits for all JavaScript async interactions to complete before executing the next matcher, guaranteeing the page is ready:

gem "capybara-lockstep"

# Every layout entrypoint requires this magic line
<%= capybara_lockstep if defined?(Capybara::Lockstep) %>

# Block Capybara while Rails is busy (config/environments/test.rb)
config.middleware.insert_before 0, Capybara::Lockstep::Middleware

The gem synchronizes the frontend with the backend before each Capybara action. Basically, Capybara can no longer observe the page in an inconsistent state or while HTTP requests are in flight. However, it has to be included on all pages to work effectively.

The middleware is optional, but we recommend including it too, as it covers edge cases (like aborted requests still being processed by the backend during test cleanup hooks).

That said, it wasn’t an easy integration for ClickFunnels due to multiple entrypoints, plus SPA pages rendered by an external Node.js server served through the Rails app (used for customer-built page previews). In the end, we greatly appreciated this gem as system tests became much more reliable and less flaky.

Eventual consistency

This is controversial, but sometimes sleep is necessary for complex interactions. For example, smoke tests that use real job processing must wait for eventual consistency, and background jobs take time to propagate through the system:

# Controversial: User registration triggers welcome email job:
#                it's a critical app path to test
it "completes full user onboarding flow" do
  fill_in "Email", with: "user@example.com"
  click_button "Sign Up"

  # Background job processing requires actual wait time
  sleep 2

  # Check eventual consistency across systems
  expect(User.last.email_sent).to be(true)
  expect(AnalyticsTracker.events.last[:type]).to eq("email_sent")
end

But this is the exception, not the rule. In general, you’ll handle most timing issues without arbitrary delays if you’re using the previous tricks. As Dr. Test reminds us: “Use sleep sparingly or your tests risk becoming dependent on it.”
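
When your setup allows it, a more deterministic alternative is flushing jobs explicitly with ActiveJob’s test helpers (assuming the :test queue adapter) instead of sleeping:

# Better (when applicable): perform enqueued jobs synchronously, no sleep
include ActiveJob::TestHelper

it "completes full user onboarding flow" do
  fill_in "Email", with: "user@example.com"
  click_button "Sign Up"

  perform_enqueued_jobs

  expect(User.last.email_sent).to be(true)
end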

Stuck test run mitigation

Sometimes tests don’t just fail flakily; they freeze completely. No output, no progress, just silence until the CI times out. These “stuck tests” are particularly frustrating because they provide no debugging information.

ClickFunnels was experiencing exactly this: tests would run for 10+ minutes with no output before the CI finally killed them. Through investigation, we identified several main culprits.

Let’s start by adding some visibility into our frozen test runs, with all threads’ backtraces dumped by sigdump:

gem "sigdump"

# Output to STDERR (use `-` for STDOUT, default is /tmp/sigdump*)
ENV["SIGDUMP_PATH"] = "+"
# Use a shutdown signal to trigger sigdump,
# as we usually don't have access to SIGCONT on CI
ENV["SIGDUMP_SIGNAL"] = "TERM"

require "sigdump/setup" if ENV["CI"]

Then, we need to wait for a test to freeze to get a relevant dump for investigation (nothing captures modern software development better than a human idling, watching machines work). Sometimes it’s also necessary to use periodic dumps to capture the system state if CI doesn’t send meaningful signals on shutdown.
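
A sketch of such a periodic dump, assuming sigdump’s programmatic Sigdump.dump API:

# Dump all thread backtraces every minute on CI (interval is illustrative)
if ENV["CI"]
  Thread.new do
    loop do
      sleep 60
      Sigdump.dump("+") # "+" writes to STDERR, same as SIGDUMP_PATH
    end
  end
end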

ClickFunnels encountered two specific deadlock scenarios:

# First deadlock
Sigdump at 2025-09-15 17:59:33 +0000 process 6081 (bin/rails)
  Thread #<Thread:0x0000561088a2efb0 sleep_forever> status=sleep priority=0
      ...lib/active_support/concurrency/load_interlock_aware_monitor.rb:17:in `enter`

# Second deadlock
Sigdump at 2025-09-15 02:18:03 +0000 process 6086 (bin/rails)
  Thread #<Thread:0x000055c64f042fc0 sleep> status=sleep priority=0
      ...lib/capybara/node/base.rb:92:in `sleep`

The first involved multiple databases and the query cache: Rails can deadlock itself during a Capybara run. This is an old Rails issue that we fixed by temporarily disabling the query cache and planning an upgrade to Rails 7.1+.

The second case was trickier: a custom Capybara matcher with a loop that lacked timeout protection. When the condition broke, it waited indefinitely. Always use timeouts for looped conditions to prevent infinite waiting.
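
As a sketch, any custom waiting loop can be guarded with a deadline (wait_until is an illustrative helper, not a Capybara API):

# Better: Fail loudly instead of waiting forever
require "timeout"

def wait_until(timeout: Capybara.default_max_wait_time)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  until yield
    raise Timeout::Error, "condition not met in #{timeout}s" if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep 0.1
  end
end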

TLDR: The Evil Martians Formula, or your test suite deserves better

Flaky tests are often not considered that important, but think about the real cost! One developer loses focus every day dealing with them and retrying. Then, multiply that across your team, add the deployment delays, the lost confidence in CI, and the stress of “is this test failure real?”, and you’re looking at a massive productivity drain.

The Evil Martians formula eliminates this time waste entirely!

Here are the complete ingredients for eliminating flaky tests, approved for your consumption:

  • Zero-tolerance policy - immediate test isolation, systematic review process
  • Unit test flakiness sources - global state cleanup, database isolation, external dependency elimination and mocking, reliable test design
  • Feature test stability - strategic retries, browser environment setup, stable and waiting selectors, JS synchronization
  • Stuck test run mitigation - stuck test sigdumping, deadlock detection

Thankfully, ClickFunnels already recognizes this hidden drain on developer productivity and has invested in systematically eliminating it!

Book a call

Irina Nazarova, CEO at Evil Martians

The Evil Martians formula helped ClickFunnels eliminate flaky tests. We'll diagnose and cure your tests too!