Go integration testing with courage and coverage
Test coverage is one of the most important metrics of codebase maintainability (or health). At the same time, it’s probably the most controversial. Some developers enforce high coverage standards; others do not care about coverage at all. Finally, some, like me, sometimes refer to that magical number, the percentage of coverage, to learn from it. In this post, I’d like to share my recent findings on code coverage in Go.
Recently, I started integrating coverage reporting into my open-source projects. This was done mostly out of curiosity: I wanted to compare the actual numbers with my gut feelings. Being a test-driven developer, I’ve always written many tests, both functional tests and integration tests. Over time, coverage of my OSS projects has improved due to bug reports (for every problem reported, I write a failing test first). So, based on my own experience, I just assumed that for popular projects coverage is around ~90-95%. And this turned out to be true for all the projects I checked—except one: AnyCable-Go.
In this post, I’d like to share my story of integrating coverage tracking into AnyCable-Go and pushing the overall coverage from a red 54% to a much greener 75%. And all without writing a single Go test!
The path to 54%
Go is a beautiful, powerful language that helps us build and ship stuff pretty fast. You write code. It compiles. It works. Bin… Go! ☑️
I started building AnyCable-Go back in 2016 with these ideas in mind. It was fun. Then, I started adding new features, now with tests, and I found that all the productivity I had initially started with began vanishing quickly. Legacy design decisions made writing tests more difficult. Over time, I refactored a good portion of the codebase and wrote a bunch of unit and functional tests. So, I ended up with some test coverage, but I didn’t care about the actual number: my primary quality metric is the rate of bug reports—and it was low (with a lot of production users).
Recently, I started to play with Coveralls and their upgraded coverage reporter and decided to integrate it into the AnyCable-Go CI pipeline. I wondered how this magical number correlated with the project’s quality and perceived maintainability.
I already had a Coveralls account, so I only needed to add the -cover flag to the test command and drop a few configuration snippets into my GitHub Actions workflow:
jobs:
  test:
    # ..
    steps:
      # ...
      - name: Run tests
        run: make test
      - name: Report coverage
        uses: coverallsapp/github-action@v2
        with:
          file: coverage.out
          parallel: true
          flag-name: $-ubuntu
  test-macos:
    # ...
    steps:
      # ...
      - name: Run tests
        run: make test
      - name: Report coverage
        uses: coverallsapp/github-action@v2
        with:
          file: coverage.out
          parallel: true
          flag-name: $-macos
  coverage:
    runs-on: ubuntu-latest
    needs: [test, test-macos]
    steps:
      - uses: coverallsapp/github-action@v2
        with:
          parallel-finished: true
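On the Go side, make test only needs to produce the coverage.out profile referenced above. Here’s a minimal sketch of such a target (the actual Makefile may look slightly different):

# Makefile (sketch)
test:
	go test -count=1 -timeout=60s -race -cover -coverprofile=coverage.out ./...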
The full PR can be found here.
As the build finished, I finally learned the actual truth of the matter: 54%. That’s a number that would make me feel shame if I were to trust in coverage—but I never did.
How can a robust project safely ship new releases with not-so-good test coverage? To answer this question, we need to talk a bit about how this number is calculated.
Not all coverage is equally useful
Let’s talk a little about how we calculate code coverage and about the different types of tests.
Usually, we instrument our code during white-box testing and track the statements that have been executed. We compare the number of executed statements with the total number of statements in the codebase, and the resulting percentage is our no-longer-magical coverage value. White-box tests are tests that run in the same execution environment as the code they assess; thus, tests have direct access to the code under test. Unit tests are the most common example of white-box tests.
We can achieve decent coverage with white-box tests, even with just unit tests. However, there are some caveats. Unit tests often rely on fake objects, mocks, or stubs to isolate the code under test from the rest of the system. In Go, we use interfaces to replace real peers with fake ones. That drastically simplifies testing, but it can also make the coverage number less useful. For example, if we have a function that calls a method on an interface, the coverage tool will mark that line as executed. However, the actual implementation may be a no-op or return an unexpected (but type-compliant) value, and the test will still pass. This is a false positive. That, combined with a high coverage number, can give you false confidence about the software you’re writing.
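To make this concrete, here’s a contrived sketch (all the names below are invented for illustration): the call to Charge is reported as covered, yet the only implementation that ever runs in the test is a no-op fake, so the green line says nothing about the real integration.

// payment.go
package payment

// Gateway abstracts the real payment provider.
type Gateway interface {
	Charge(amount int) error
}

// Process charges the customer and reports whether the payment went through.
func Process(gw Gateway, amount int) bool {
	if err := gw.Charge(amount); err != nil {
		return false
	}
	return true
}

// payment_test.go
package payment

import "testing"

// fakeGateway always succeeds; none of the real provider's failure modes are exercised.
type fakeGateway struct{}

func (fakeGateway) Charge(int) error { return nil }

func TestProcess(t *testing.T) {
	if !Process(fakeGateway{}, 100) {
		t.Fatal("expected the payment to succeed")
	}
}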
Black-box (or end-to-end) testing is a different story. We cannot tamper with the internals, and everything is real. Yes, such tests are much more expensive to write and execute, but when they catch bugs that have leaked past unit tests, they’re priceless.
For AnyCable, we’ve utilized conformance testing from the very early days. Conformance testing is a special kind of black-box testing meant to verify the compliance of a system with a specification. In our case, that specification is Rails Action Cable. The collection of conformance tests (wrapped into a gem) ensures that AnyCable is compatible with Action Cable and can be used as a drop-in replacement.
Conformance tests give us enough confidence that the code works as expected in basic scenarios, ones that don’t involve high load.
For high-load scenarios, we rely on benchmarks. We usually conduct full-scale benchmarks manually, and only when we’re about to release a new major or minor version. To catch performance regressions earlier, we also have a set of benchmark tests that are executed on every commit. What is a benchmark test? It’s also a black-box test that is meant to verify the performance characteristics of the system under stress.
Let me give you an example of a benchmark test. It’s the first benchmark test I wrote for AnyCable-Go; its purpose is to ensure we have no leaking goroutines. The tests are written in Ruby—the best language for writing custom testing DSLs. Just take a look at the code:
launch :rpc, "bundle exec anyt --only-rpc"
wait_tcp 50051

launch :anycable, "./dist/anycable-go"
wait_tcp 8080

BENCHMARK_COMMAND = <<~CMD
  websocket-bench broadcast --concurrent 10 --sample-size 100 \
    --step-size 200 --payload-padding 200 --total-steps 3 \
    ws://localhost:8080/cable --server-type=actioncable
CMD

IDLE_ROUTINES_MAX = 200

results = 2.times.map do |i|
  run :bench, BENCHMARK_COMMAND

  # Give some time to cool down after the benchmark
  sleep 5

  gops(pid(:anycable))["goroutines"].tap do |num|
    fail "Failed to gops process: #{pid(:anycable)}" unless num

    if num > IDLE_ROUTINES_MAX
      fail "Too many goroutines: #{num}"
    end
  end
end

if (results[1] / results[0].to_f) > 1.1
  fail "Go routines leak detected: #{results[0]} -> #{results[1]}"
end
The test launches AnyCable-Go and runs the benchmarking tool twice. After each run, we read the number of goroutines (using gops) and compare each value with the threshold (and then with each other) to make sure that the number of goroutines stays constant.
Over time, we repurposed the custom test runner demonstrated above for other black-box tests not necessarily related to performance. For example, we have a scenario that verifies that an embedded NATS super-cluster works as expected. You can find it here.
To sum up, we, at AnyCable, have multiple types of black-box tests in place in order to sleep well at night. But all these tests and the corresponding black-box coverage are not reflected in the number we obtained at the beginning of this post (54%). How can we enable coverage profiling for compiled binaries we use in our black-box tests? Let’s look at a cool new feature Go 1.20 has to offer and find out!
Adding black-box coverage to the mix
How can we add black-box coverage reporting? TL;DR: We can now use the go build -cover flag to enable coverage profiling for compiled binaries.
Collecting coverage information from a compiled binary in Go has been a non-trivial task for a long time. The coverage profiling functionality was tied to testing, so the only workaround was to use a binary compiled for tests (go test -c) and make it dual-purpose via a bit of hackery—no fun at all.
In 2022, a proposal was submitted to add application coverage profiling support to Go. It was accepted, and the new -cover flag was added to the go build command in Go 1.20.
Application coverage works a bit differently compared to test coverage; with the former, results are stored in two binary files: covmeta.xxx and covcounter.yyy. The first one contains the program’s source code metadata (file paths, function names, and so forth). The second contains the actual coverage counters. Each time you run an instrumented binary, a new covcounter.xyz file with a unique name is created. Thus, you can re-use the same binary between runs and collect multiple coverage profiles.
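Locally, the whole flow looks roughly like this (the binary and package paths are illustrative):

# Build an instrumented binary (Go 1.20+)
go build -cover -o dist/anycable-go ./cmd/anycable-go

# Tell the binary where to put its coverage data; it is written on (graceful) exit
mkdir -p _icoverdir_
GOCOVERDIR=_icoverdir_ ./dist/anycable-go

# ...exercise the server, stop it, repeat if needed...

# Each run adds a new counters file next to the shared metadata file
ls _icoverdir_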
In addition to the -cover flag, Go 1.20 also introduced a new go tool covdata command to manipulate raw coverage data. You can read more about it in the official documentation.
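For instance, given the profiles collected in _icoverdir_ above, you can print a quick summary or convert them into the classic text format:

# Per-package coverage summary straight from the raw profiles
go tool covdata percent -i=_icoverdir_

# Convert to the text format understood by go tool cover (and coverage services)
go tool covdata textfmt -i=_icoverdir_ -o=coverage.out
go tool cover -func=coverage.out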
Let’s see how we can combine these new features to add black-box coverage reporting to our CI pipeline:
jobs:
  test-conformance:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        test_command:
          - benchmarks
          - test-features
          - test-conformance
    # ...
    env:
      # ...
      GOCOVERDIR: "_icoverdir_"
    steps:
      - name: Run tests
        run: |
          mkdir _icoverdir_
          make ${{ matrix.test_command }}
      - name: Format coverage
        run: |
          go tool covdata textfmt -i _icoverdir_ -o coverage.out
      - name: Report coverage
        uses: coverallsapp/github-action@v2
        with:
          file: coverage.out
          parallel: true
          format: golang
          flag-name: integration-${{ matrix.test_command }}
There are a few things to note here:
- We use the GOCOVERDIR environment variable to specify the directory where the coverage data will be stored. Without it, the coverage data won’t be saved anywhere.
- We must create the directory ourselves (mkdir _icoverdir_), which was a bit unexpected for me. Hopefully, this will be fixed in the future.
- Finally, we need to use go tool covdata to convert the coverage data into a text format that can be used with Coveralls.
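As a side note, go tool covdata can also merge profiles from several runs or suites before converting them, which is handy for getting a single local number (the directory names below are purely illustrative):

# Combine profiles from different test suites into one directory
mkdir -p _icoverdir_all
go tool covdata merge -i=_icoverdir_bench,_icoverdir_conformance -o=_icoverdir_all
go tool covdata percent -i=_icoverdir_all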
The full PR can be found here.
So, what was the final number combining all the coverage reports from unit, integration, and black-box tests? 71%! Not bad, right? By looking at the coverage reports, I was able to identify some uncovered (though battle-tested in production) scenarios, so I added a few more black-box scenarios and increased the overall coverage a bit more, up to 75%. This number better matches my expectations. The missing 25% is mostly related to platform-specific and auto-generated code (e.g., gRPC/protobuf files), so I’m not too worried about it.
What does this story teach us? Coverage is like statistics: if you torture the data long enough, it will confess to anything (a quip usually attributed to the economist Ronald Coase). Don’t take coverage too seriously: use it as a tool, not as a source of truth or confidence. And, of course, don’t forget to write tests!
Speaking of coverage percentages, Evil Martians have you covered, 100%! We’re ready to beam down and fix your earthling troubles, whether they’re related to frontend, product design, backend, devops, or something beyond. Drop us a message for more info!