Go integration testing with courage and coverage
Test coverage is one of the most important metrics of codebase maintainability (or health). At the same time, it’s probably the most controversial. Some developers enforce high coverage standards; others do not care about coverage at all. Finally, some, like me, sometimes refer to that magical number, the percentage of coverage, to learn from it. In this post, I’d like to share my recent findings on code coverage in Go.
Recently, I started integrating coverage reporting into my open-source projects. This was done mostly out of curiosity: I wanted to compare the actual numbers with my gut feelings. Being a test-driven developer, I’ve always written many tests, both functional tests and integration tests. Over time, coverage of my OSS projects has improved due to bug reports (for every problem reported, I write a failing test first). So, based on my own experience, I just assumed that for popular projects coverage is around ~90-95%. And this turned out to be true for all the projects I checked—except one: AnyCable-Go.
In this post, I’d like to share my story of integrating coverage tracking into AnyCable-Go and pushing the overall coverage from a red 54% to a much greener 75%. And all without writing a single Go test!
The path to 54%
Go is a beautiful, powerful language that helps us build and ship stuff pretty fast. You write code. It compiles. It works. Bin… Go! ☑️
I started building AnyCable-Go back in 2016 with these ideas in mind. It was fun. Then, I started adding new features, now with tests, and I found that all the productivity I had initially started with began vanishing quickly. Legacy design decisions made writing tests more difficult. Over time, I refactored a good portion of the codebase and wrote a bunch of unit and functional tests. So, I ended up with some test coverage, but I didn’t care about the actual number: my primary quality metric is the rate of bug reports—and it was low (with a lot of production users).
Recently, I started to play with Coveralls and their upgraded coverage reporter and decided to integrate it into the AnyCable-Go CI pipeline. I wondered how this magical number correlated with the project’s quality and perceived maintainability.
I already had a Coveralls account, so I only needed to add the -cover flag to the test command and drop a few configuration snippets into my GitHub Actions workflow:
jobs:
  test:
    # ..
    steps:
      # ...
      - name: Run tests
        run: make test
      - name: Report coverage
        uses: coverallsapp/github-action@v2
        with:
          file: coverage.out
          parallel: true
          flag-name: $-ubuntu
  test-macos:
    # ...
    steps:
      # ...
      - name: Run tests
        run: make test
      - name: Report coverage
        uses: coverallsapp/github-action@v2
        with:
          file: coverage.out
          parallel: true
          flag-name: $-macos
  coverage:
    runs-on: ubuntu-latest
    needs: [test, test-macos]
    steps:
      - uses: coverallsapp/github-action@v2
        with:
          parallel-finished: true
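On the Go side, make test only needs to produce the coverage.out profile referenced above. Here’s a minimal sketch of such a target (the actual Makefile may look slightly different):

# Makefile (sketch)
test:
	go test -count=1 -timeout=60s -race -cover -coverprofile=coverage.out ./...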
The full PR can be found here.
As the build finished, I finally learned the actual truth of the matter: 54%. That’s a number that would make me feel shame if I were to trust in coverage—but I never did.
How can a robust project safely ship new releases with not-so-good test coverage? To answer this question, we need to talk a bit about how this number is calculated.
Not all coverage is equally useful
Let’s talk a little about how we calculate code coverage and about the different types of tests.
Usually, we instrument our code during white-box testing and track the statements that have been executed. We compare the number of executed statements with the total number of statements in the codebase, and the resulting percentage is our no-longer-magical coverage value. White-box tests are tests that run in the same execution environment as the code they assess; thus, tests have direct access to the code under test. Unit tests are the most common example of white-box tests.
We can achieve decent coverage with white-box tests, even with just unit tests. However, there are some caveats. Unit tests often rely on fake objects, mocks, or stubs to isolate the code under test from the rest of the system. In Go, we use interfaces to replace real peers with fake ones. That drastically simplifies testing, but it can also make the coverage number less useful. For example, if we have a function that calls a method on an interface, the coverage tool will mark that line as executed. However, the actual implementation may be a no-op or return an unexpected (but type-compliant) value, and the test will still pass. This is a false positive. That, combined with a high coverage number, can give you false confidence about the software you’re writing.
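To make this concrete, here’s a contrived sketch (all the names below are invented for illustration): the call to Charge is reported as covered, yet the only implementation that ever runs in the test is a no-op fake, so the green line says nothing about the real integration.

// payment.go
package payment

// Gateway abstracts the real payment provider.
type Gateway interface {
	Charge(amount int) error
}

// Process charges the customer and reports whether the payment went through.
func Process(gw Gateway, amount int) bool {
	if err := gw.Charge(amount); err != nil {
		return false
	}
	return true
}

// payment_test.go
package payment

import "testing"

// fakeGateway always succeeds; none of the real provider's failure modes are exercised.
type fakeGateway struct{}

func (fakeGateway) Charge(int) error { return nil }

func TestProcess(t *testing.T) {
	if !Process(fakeGateway{}, 100) {
		t.Fatal("expected the payment to succeed")
	}
}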
Black-box (or end-to-end) testing is a different story. We cannot tamper with the internals, and everything is real. Yes, such tests are much more expensive to write and execute, but when they catch bugs that have leaked past unit tests, they’re priceless.
For AnyCable, we’ve utilized conformance testing from the very early days. Conformance testing is a special kind of black-box testing meant to verify the compliance of a system with a specification. In our case, that specification is Rails Action Cable. The collection of conformance tests (wrapped into a gem) ensures that AnyCable is compatible with Action Cable and can be used as a drop-in replacement.
Conformance tests give us enough confidence that the code works as expected in basic scenarios, ones that don’t involve high load.
For high-load scenarios, we rely on benchmarks. We usually conduct full-scale benchmarks manually, and only when we’re about to release a new major or minor version. To catch performance regressions earlier, we also have a set of benchmark tests that are executed on every commit. What is a benchmark test? It’s also a black-box test that is meant to verify the performance characteristics of the system under stress.
Let me give you an example of a benchmark test. It’s the first benchmark test I wrote for AnyCable-Go; its purpose is to ensure we have no leaking goroutines. The tests are written in Ruby—the best language for writing custom testing DSLs. Just take a look at the code:
launch :rpc, "bundle exec anyt --only-rpc"
wait_tcp 50051

launch :anycable, "./dist/anycable-go"
wait_tcp 8080

BENCHMARK_COMMAND = <<~CMD
  websocket-bench broadcast --concurrent 10 --sample-size 100 \
    --step-size 200 --payload-padding 200 --total-steps 3 \
    ws://localhost:8080/cable --server-type=actioncable
CMD

IDLE_ROUTINES_MAX = 200

results = 2.times.map do |i|
  run :bench, BENCHMARK_COMMAND

  # Give some time to cool down after the benchmark
  sleep 5

  gops(pid(:anycable))["goroutines"].tap do |num|
    fail "Failed to gops process: #{pid(:anycable)}" unless num

    if num > IDLE_ROUTINES_MAX
      fail "Too many goroutines: #{num}"
    end
  end
end

if (results[1] / results[0].to_f) > 1.1
  fail "Go routines leak detected: #{results[0]} -> #{results[1]}"
end
The test launches AnyCable-Go and runs the benchmarking tool twice. After each run, we read the number of goroutines (using gops) and compare each value with the threshold (and then with each other) to make sure that the number of goroutines stays constant.
Over time, we repurposed the custom test runner demonstrated above for other black-box tests not necessarily related to performance. For example, we have a scenario that verifies that an embedded NATS super-cluster works as expected. You can find it here.
To sum up, we, at AnyCable, have multiple types of black-box tests in place in order to sleep well at night. But all these tests and the corresponding black-box coverage are not reflected in the number we obtained at the beginning of this post (54%). How can we enable coverage profiling for compiled binaries we use in our black-box tests? Let’s look at a cool new feature Go 1.20 has to offer and find out!
Adding black-box coverage to the mix
How can we add black-box coverage reporting? TL;DR: We can now use the go build -cover flag to enable coverage profiling for compiled binaries.
Collecting coverage information from a compiled binary in Go has been a non-trivial task for a long time. The coverage profiling functionality was tied to testing, so the only workaround was to use a binary compiled for tests (go test -c) and make it dual-purpose via a bit of hackery—no fun at all.
In 2022, a proposal was submitted to add application coverage profiling support to Go. It was accepted, and the new -cover flag was added to the go build command in Go 1.20.
Application coverage works a bit differently compared to test coverage; with the former, results are stored in two binary files: covmeta.xxx and covcounter.yyy. The first one contains the program’s source code metadata (file paths, function names, and so forth). The second contains the actual coverage counters. Each time you run an instrumented binary, a new covcounter.xyz file with a unique name is created. Thus, you can re-use the same binary between runs and collect multiple coverage profiles.
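Locally, the whole flow looks roughly like this (the binary and package paths are illustrative):

# Build an instrumented binary (Go 1.20+)
go build -cover -o dist/anycable-go ./cmd/anycable-go

# Tell the binary where to put its coverage data; it is written on (graceful) exit
mkdir -p _icoverdir_
GOCOVERDIR=_icoverdir_ ./dist/anycable-go

# ...exercise the server, stop it, repeat if needed...

# Each run adds a new counters file next to the shared metadata file
ls _icoverdir_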
In addition to the -cover flag, Go 1.20 also introduced a new go tool covdata command to manipulate raw coverage data. You can read more about it in the official documentation.
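For instance, given the profiles collected in _icoverdir_ above, you can print a quick summary or convert them into the classic text format:

# Per-package coverage summary straight from the raw profiles
go tool covdata percent -i=_icoverdir_

# Convert to the text format understood by go tool cover (and coverage services)
go tool covdata textfmt -i=_icoverdir_ -o=coverage.out
go tool cover -func=coverage.out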
Let’s see how we can combine these new features to add black-box coverage reporting to our CI pipeline:
jobs:
  test-conformance:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        test_command:
          - benchmarks
          - test-features
          - test-conformance
    # ...
    env:
      # ...
      GOCOVERDIR: "_icoverdir_"
    steps:
      - name: Run tests
        run: |
          mkdir _icoverdir_
          make ${{ matrix.test_command }}
      - name: Format coverage
        run: |
          go tool covdata textfmt -i _icoverdir_ -o coverage.out
      - name: Report coverage
        uses: coverallsapp/github-action@v2
        with:
          file: coverage.out
          parallel: true
          format: golang
          flag-name: integration-${{ matrix.test_command }}
There are a few things to note here:
- We use the GOCOVERDIR environment variable to specify the directory where the coverage data will be stored. Without it, the coverage data won’t be saved anywhere.
- We must create the directory ourselves (mkdir _icoverdir_), which was a bit unexpected for me. Hopefully, this will be fixed in the future.
- Finally, we need to use go tool covdata to convert the coverage data into a text format that can be used with Coveralls.
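As a side note, go tool covdata can also merge profiles from several runs or suites before converting them, which is handy for getting a single local number (the directory names below are purely illustrative):

# Combine profiles from different test suites into one directory
mkdir -p _icoverdir_all
go tool covdata merge -i=_icoverdir_bench,_icoverdir_conformance -o=_icoverdir_all
go tool covdata percent -i=_icoverdir_all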
The full PR can be found here.
So, what was the final number combining all the coverage reports from unit, integration, and black-box tests? 71%! Not bad, right? By looking at the coverage reports, I was able to identify some uncovered (though battle-tested in production) scenarios, so I added a few more black-box scenarios and increased the overall coverage a bit more, up to 75%. This number better matches my expectations. The missing 25% is mostly related to platform-specific and auto-generated code (e.g., gRPC/protobuf files), so I’m not too worried about it.
What does this story teach us? Coverage is like statistics: if you torture the data long enough, it will confess to anything (a quip usually attributed to the economist Ronald Coase). Don’t take coverage too seriously: use it as a tool, not as a source of truth or confidence. And, of course, don’t forget to write tests!
Speaking of coverage percentages, Evil Martians have you covered, 100%! We’re ready to beam down and fix your earthling troubles, whether they’re related to frontend, product design, backend, devops, or something beyond. Drop us a message for more info!