Realtime diagnostic logging, or how to really spy on your Go web apps

Logging is essential for any application running out in the wild. Usually, we hit up logs retrospectively to learn about the past, but in some cases, real-time access to detailed insights on a particular user’s actions is the best (and, sometimes, only) way to identify a problem. Let’s see how we can implement this type of troubleshooting (or diagnostic) logging functionality with Go and its (still quite new) slog package!

What to do when users start complaining that a particular feature in your application isn’t working? First, you’d probably assess the symptoms and verify that users are truly observing an unexpected behavior. You can then check your error tracking service (nothing to see here), uptime, and vital metrics (all is well). Finally, the last source of any potentially useful information is hidden somewhere in the logs. Yet, your default production log level is “INFO” (right?), so, while there might be some warnings and errors, these logs are unlikely to have enough information about the problem. What to do next?

In this post, I’d like to explore different options for temporarily enabling verbose (and, usually, selective) logging for troubleshooting purposes, and for making this type of log easily accessible from anywhere and in real time. This kind of logging is sometimes referred to as diagnostic logging. A practical example of diagnostic logging can be seen with the Troubleshooting Logs feature in AnyCable+:

AnyCable+ Troubleshooting Logs in action

In this article, we’ll explore some different strategies for bringing diagnostic logging to a Go application backed by the slog package, covering the following topics in the process:

  • Updating global log level, with and without restarts
  • Slog is the way
  • Multi-handler slogging
  • Tuning spy handler for performance
  • The tale of IgnorePC
  • Turning spying into truly diagnostic logging

Updating global log level, with and without restarts

Let’s start with something straightforward and applicable to any application, not just those written in Go.

You can temporarily update the application log level to “DEBUG” and restart it. Now you have plenty of logs recorded and (I hope) a decent log collection and viewing service (that is, not a plain file) to search through for the relevant events. Problem solved, arrivederci!

Right? But what if you can’t afford to restart an application just for the sake of debugging? That’s exactly the situation with AnyCable, or any other application that serves thousands of persistent connections (for example, online event platforms). In cases like this, restarting the server would require all active clients to be re-initialized (re-establishing connections and restoring state). That’s bad from the user experience perspective, and it’s no better from the server perspective: massive re-connects (also known as connection avalanches) can result in significant load spikes.

So, restarting the server is not an option. If we used Erlang/Elixir (fun fact: the original AnyCable prototype was written in Erlang), we could reap the benefits of its hot code reloading feature. However, we’ve bet on Go, and so the only way for us to change the logging settings is by adding some custom logic to our application. So, let’s start writing some Go.

Slog is the way

We’ve been using structured logging in AnyCable since the early days. Unlike most applications back then, we didn’t choose Logrus, but apex/log (mostly because of this blog post). It served us well for many years and, even though it has seemingly been unmaintained in recent years, we didn’t bother to find a replacement. Why? Because amazing logging libraries had just started springing up like mushrooms after the rain; how do you pick one that will stand the test of time?

Luckily, the Go core team noticed this logging library proliferation problem, and in Go 1.21 we ended up with a standard structured logging solution: slog. The problem of choice disappeared, and we decided to migrate AnyCable to slog and build our diagnostic logging on top of it.

Let’s note the two primary concepts of the slog package:

  • slog.Logger is your logging entry point, the struct you invoke the Info(), Debug(), et al. methods on.

  • slog.Handler is an interface responsible for handling log entries (or records): writing to the logging destination, level filtering, formatting, etc.

Here’s a basic example of slog in use:

package main

import (
	"log/slog"
	"os"
)

func main() {
	opts := &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}
	logger := slog.New(
		slog.NewTextHandler(os.Stdout, opts),
	)
	// ...
}

Note that the logging level is a characteristic of the handler, not the logger instance. There is no logger.SetLogLevel(lvl) function or any similar mechanism (like in many logging libraries) to change the log level. So then, how can we change the log level dynamically?

There are many workarounds, but the one I prefer is as follows:

package main

import (
	"log/slog"
	"os"
)

var LOG_LEVEL = new(slog.LevelVar)

func main() {
	LOG_LEVEL.Set(slog.LevelInfo)

	opts := &slog.HandlerOptions{
		Level: LOG_LEVEL,
	}
	logger := slog.New(
		slog.NewTextHandler(os.Stdout, opts),
	)

	// ...
}

// SetLogLevel updates the application log level
func SetLogLevel(lvl slog.Level) {
	LOG_LEVEL.Set(lvl)
}

Here we’ve used the special slog.LevelVar type as a container for the current logging level. Now, if we want to change the logging level without interrupting the application, we can call the SetLogLevel function. To demonstrate, we can set up a SIGQUIT signal handler to toggle the log level between “DEBUG” and “INFO” (you can find a fully working example in this gist).
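The gist linked above contains the fully working example; what follows is my own minimal sketch of the same idea, just to illustrate the moving parts:

package main

import (
	"log/slog"
	"os"
	"os/signal"
	"syscall"
)

var LOG_LEVEL = new(slog.LevelVar)

// SetLogLevel updates the application log level
func SetLogLevel(lvl slog.Level) {
	LOG_LEVEL.Set(lvl)
}

func main() {
	LOG_LEVEL.Set(slog.LevelInfo)

	logger := slog.New(
		slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: LOG_LEVEL}),
	)
	slog.SetDefault(logger)

	// Toggle between INFO and DEBUG on SIGQUIT, no restart required
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGQUIT)

	go func() {
		for range sig {
			if LOG_LEVEL.Level() == slog.LevelDebug {
				SetLogLevel(slog.LevelInfo)
			} else {
				SetLogLevel(slog.LevelDebug)
			}
		}
	}()

	// ... run the application
}

With this in place, sending kill -QUIT to the process flips the level back and forth.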

This approach is good enough if we have easy access to the logs and are fine with constantly searching for a needle in a haystack, since turning on verbose logging globally results in tons of logs.

We developed the Troubleshooting Logs feature with two use cases in mind:

  • Providing easy log access for AnyCable+ users.
  • Making logs accessible via the AnyCable admin UI (in other words, providing an experience similar to AnyCable+ for users of the self-hosted version).

In both scenarios, it wasn’t possible to make any assumptions about how logs are collected. We needed an infrastructure-independent way to stream verbose logs on demand, so we started looking for ways to write logs to multiple outputs simultaneously.

Let’s see how to implement multi-output logging with slog.

Multi-handler slogging

We can build a multi-output logger using the fan-out pattern. Let’s take a look at an example:

package main

import (
	"context"
	"log/slog"
	"os"
)

type MultiHandler struct {
	left slog.Handler
	right slog.Handler
}

func (t *MultiHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return t.left.Enabled(ctx, level) || t.right.Enabled(ctx, level)
}

func (t *MultiHandler) Handle(ctx context.Context, r slog.Record) (err error) {
	if t.left.Enabled(ctx, r.Level) {
		err = t.left.Handle(ctx, r)
	}

	if t.right.Enabled(ctx, r.Level) {
		err = t.right.Handle(ctx, r)
	}

	return
}

func (t *MultiHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &MultiHandler{t.left.WithAttrs(attrs), t.right.WithAttrs(attrs)}
}

func (t *MultiHandler) WithGroup(name string) slog.Handler {
	return &MultiHandler{t.left.WithGroup(name), t.right.WithGroup(name)}
}

func main() {
	jsonHandler := slog.NewJSONHandler(
		os.Stdout,
		&slog.HandlerOptions{
			Level: slog.LevelInfo,
		},
	)

	textErrorHandler := slog.NewTextHandler(
		os.Stderr,
		&slog.HandlerOptions{
			Level: slog.LevelError,
		},
	)

	logger := slog.New(
		&MultiHandler{
			jsonHandler,
			textErrorHandler,
		},
	)
	// ...
}

Our MultiHandler must implement the slog.Handler interface:

  • The Enabled() function is a guard used by the logger to decide whether the handler wants to process this entry; if yes, the slog.Record struct is generated and passed to the Handle() function.
  • The WithAttrs() and WithGroup() functions are used to create copies of the handler with the given context attached.

Now, all we need is to create a custom spy handler that will consume all logs if, and only if, at least one user is requesting them. Since diagnostic logging can be used in production and during peak hours, there are important performance-related considerations:

  • Having a spy handler attached to a logger must not add noticeable performance overhead when it’s not active (i.e., when it’s not sending messages to users).
  • Once activated, the spy still must not result in a performance degradation.
  • The spy must be non-blocking (and it’s okay to skip some messages to achieve this).

Let’s share the steps we took to get a satisfying implementation!

Tuning spy handler for performance

Our non-blocking requirement naturally implies processing log entries in a separate goroutine and using a channel as a transport. We also need to keep track of the handler state, active or inactive. Accordingly, we came up with this initial implementation for the spy handler:

type SpyHandler struct {
	active bool
	ch     chan *slog.Record
}

func NewSpyHandler(callback func(*slog.Record)) *SpyHandler {
	h := &SpyHandler{
		ch: make(chan *slog.Record, 2048),
	}

	go func() {
		for r := range h.ch {
			callback(r)
		}
	}()

	return h
}

func (h *SpyHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return true
}

func (h *SpyHandler) Handle(ctx context.Context, r slog.Record) error {
	if h.active {
		h.enqueueRecord(&r)
	}

	return nil
}

func (h *SpyHandler) enqueueRecord(r *slog.Record) {
	select {
	case h.ch <- r:
	default:
	}
}

The important bits in the snippet above are as follows:

  • The usage of the select { ... default:} pattern avoids potentially blocking channel write operations (if the channel’s buffer is full, the default clause just drops the record).
  • The return true in the Enabled() function—our handler is supposed to consume all the logs, right?

Before we continue to polish this implementation, we’d better run some performance tests and check if this approach is viable at all. Luckily, writing performance tests (or benchmarks) is a no-brainer in Go:

func BenchmarkSpy(b *testing.B) {
	handler := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError})
	callback := func(r *slog.Record) {
		// Imitate some work to ensure that the spy is not blocking the logger
		time.Sleep(10 * time.Millisecond)
	}
	spy := NewSpyHandler(callback)

	configs := []struct {
		spy      *SpyHandler
		active   bool
	}{
		{spy, true},
		{spy, false},
		{nil, false},
	}

	for _, config := range configs {
		desc := "no spy"
		if config.spy != nil {
			desc = "active spy"
			if !config.active {
				desc = "inactive spy"
			}
		}
		b.Run(desc, func(b *testing.B) {
			var logger *slog.Logger

			if config.spy != nil {
				config.spy.active = config.active
				logger = slog.New(&MultiHandler{handler, config.spy})
			} else {
				logger = slog.New(handler)
			}

			b.ResetTimer()

			for i := 0; i < b.N; i++ {
				logger.Debug("test", "key", 1, "key2", "value2", "key3", 3.14)
			}
		})
	}
}

The results we obtained are as follows:

BenchmarkSpy
BenchmarkSpy/active_spy            401.4 ns/op
BenchmarkSpy/inactive_spy          266.0 ns/op
BenchmarkSpy/no_spy                8.142 ns/op

Wow! Spying on debug logs is almost 50x slower than having no spy. But that’s not surprising: if we set the log level to “DEBUG” for the base handler, we’ll see similar (or even higher) numbers in the “no spy” scenario. In other words, attaching an active spy is roughly equivalent to globally enabling verbose logs in our case. What’s more interesting is that, even when the spy is inactive, we can still observe noticeable overhead: hundreds of nanoseconds vs. single digits.

And while it’s okay to have some overhead when someone is reading diagnostic logs, always having overhead is unacceptable.

I decided to take a closer look at how slog works under the hood in order to figure out where this overhead may be coming from. Here’s the slog.Logger.log() function that’s called by functions like Info(), Debug(), and so on:

func (l *Logger) log(ctx context.Context, level Level, msg string, args ...any) {
	if !l.Enabled(ctx, level) {
		return
	}
	var pc uintptr
	if !internal.IgnorePC {
		var pcs [1]uintptr
		// skip [runtime.Callers, this function, this function's caller]
		runtime.Callers(3, pcs[:])
		pc = pcs[0]
	}
	r := NewRecord(time.Now(), level, msg, pc)
	r.Add(args...)
	if ctx == nil {
		ctx = context.Background()
	}
	_ = l.Handler().Handle(ctx, r)
}

You can see that the first thing it does is call the Enabled() function, which delegates to the handler’s Enabled(). Then, if the handler responds with “Yes, I’m enabled”, the log record is built in multiple phases. What did I realize after looking at this code? The two functions (Enabled() and Handle()) exist for a reason: building a log record isn’t a free operation, and building the record first, only to then check the log level (as we’d have to do if there were only a Handle() function), would be suboptimal from the performance point of view.

You’ve probably already realized the quick fix we need to apply to avoid spying overhead in the inactive state:

 func (h *SpyHandler) Enabled(ctx context.Context, level slog.Level) bool {
-  return true
+  return h.active
 }

And the benchmark results after this change are much better:

BenchmarkSpy/active_spy             409.9 ns/op
BenchmarkSpy/inactive_spy           8.939 ns/op
BenchmarkSpy/no_spy                 8.243 ns/op

We can see that the overhead is now within the margin of error.

Let me skip the part where we turn this proof-of-concept into a more robust solution (by replacing the Boolean active field with atomics for safe concurrent interest tracking and by introducing a buffer to the output).
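Just to give a taste of that direction, here’s a minimal sketch of concurrency-safe interest tracking (my simplified take, not the actual slog-spy code): the Boolean flag becomes an atomic counter of attached viewers, and Enabled() reports interest only while somebody is watching.

package main

import (
	"context"
	"log/slog"
	"sync"
	"sync/atomic"
)

// SpyHandler with concurrency-safe interest tracking (an illustration only;
// see palkan/slog-spy for the real implementation).
type SpyHandler struct {
	watchers atomic.Int64 // number of currently attached log viewers
	ch       chan *slog.Record
}

// Watch registers a viewer and returns a function that detaches it.
func (h *SpyHandler) Watch() (stop func()) {
	h.watchers.Add(1)

	var once sync.Once
	return func() {
		once.Do(func() { h.watchers.Add(-1) })
	}
}

// Enabled reports interest only while at least one viewer is attached,
// so an idle spy costs as little as before.
func (h *SpyHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return h.watchers.Load() > 0
}

Instead, I’d like to talk about another optimization that we found while looking at the slog source code.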

And hey, you can find the full version of the SpyHandler on GitHub: palkan/slog-spy. It’s been extracted from AnyCable+ and open-sourced, enjoy!

The tale of IgnorePC

Have you noticed this internal configuration parameter, internal.IgnorePC? It controls whether the caller’s program counter is attached to the log record; that’s what makes the AddSource: true option of the built-in handlers work.

Adding source lines to logs is a nice feature in development, but it’s not needed here in production. So, I decided to see if disabling this feature would bring any performance boost. To do that, I tweaked the slog source code to set the IgnorePC var to true and then ran the benchmarks again:

BenchmarkTracer/active_spy_with_ignorePC=false           393.2 ns/op
BenchmarkTracer/active_spy_with_ignorePC=true            259.3 ns/op
BenchmarkTracer/inactive_spy                             10.14 ns/op
BenchmarkTracer/no_spy                                   8.483 ns/op

The effect of removing the caller information from the log record turned out to be quite good. I wondered why there wasn’t an option to disable this feature (or why it isn’t automatically disabled when the AddSource option is set to false). There was an opportunity for a pull request and a contribution here, but… this fairy tale had no happy ending: the situation is already known to the Go team (there’s actually a benchmark for it in the slog package) and they decided not to overoptimize. Maybe that will change in the future. We’ll see.

For now, you can use some compiler-level hackery to change the value of IgnorePC in your project, like this:

import _ "unsafe" // the go:linkname directive requires importing unsafe

//go:linkname IgnorePC log/slog/internal.IgnorePC
var IgnorePC = true

Drop this spell somewhere in your code if you want to make logging faster. (Though, for most applications, slog is definitely already fast enough—the work and thought put into it is outstanding.)

Turning spying into truly diagnostic logging

So far, we’ve managed to create a logging abstraction that consumes all application logs (of all levels) only when we need them. In AnyCable, we pipe the JSON-formatted logs through the existing pub/sub functionality and expose them via server-sent events (so you can consume the stream via the EventSource API in the browser or with cURL from any server).
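To make that last part more concrete, here’s a rough sketch of what such a server-sent events endpoint could look like using nothing but the standard library. The spy.Subscribe() method (returning a channel of JSON-encoded records plus an unsubscribe function) is a stand-in I made up for this example; in AnyCable, the records travel through its pub/sub layer instead:

package main

import (
	"fmt"
	"net/http"
)

// streamLogs exposes spied log records as a server-sent events stream.
// spy.Subscribe is a hypothetical helper returning a channel of
// JSON-encoded records and an unsubscribe function.
func streamLogs(spy *SpyHandler) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}

		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		// Attach to the spy only for the duration of this request
		logs, stop := spy.Subscribe()
		defer stop()

		for {
			select {
			case <-r.Context().Done():
				return
			case entry := <-logs:
				// Each entry is a JSON-encoded log record
				fmt.Fprintf(w, "data: %s\n\n", entry)
				flusher.Flush()
			}
		}
	}
}

On the client side, curl -N or a browser EventSource can then tail the stream in real time.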

Currently, we implement log context filtering on the client side, i.e., every log viewer receives everything from the server. Diagnostic logging usually implies that this filtering happens on the producer side (that is, our server), so that no unnecessary logs are sent to the consumer.

…And that’s something we can try to achieve (up to a point) by enhancing our SpyHandler. More precisely, we can keep track of attached attributes (WithAttrs()) and consider them within the Enabled() function. Let me leave the full version to you as a fun bit of homework; one possible starting point is sketched below.
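Here’s one way that homework could begin (a deliberately simplistic sketch with made-up names like FilteringSpy and WatchWhere, not slog-spy’s actual API): the handler remembers the attributes each derived copy was created with, and a watcher registers a predicate that Enabled() consults.

package main

import (
	"context"
	"log/slog"
	"sync/atomic"
)

// FilteringSpy remembers the attributes a derived handler carries and lets a
// watcher register a predicate that is consulted in Enabled().
type FilteringSpy struct {
	attrs  []slog.Attr
	filter *atomic.Value // holds a func([]slog.Attr) bool, shared between copies
	ch     chan *slog.Record
}

func NewFilteringSpy() *FilteringSpy {
	return &FilteringSpy{filter: new(atomic.Value), ch: make(chan *slog.Record, 2048)}
}

// WatchWhere spies only on log contexts carrying key=value,
// e.g. WatchWhere("sid", "42") to follow a single session.
func (h *FilteringSpy) WatchWhere(key, value string) {
	h.filter.Store(func(attrs []slog.Attr) bool {
		for _, a := range attrs {
			if a.Key == key && a.Value.String() == value {
				return true
			}
		}
		return false
	})
}

func (h *FilteringSpy) Enabled(ctx context.Context, level slog.Level) bool {
	f, ok := h.filter.Load().(func([]slog.Attr) bool)
	return ok && f(h.attrs) // no filter registered means nobody is watching
}

func (h *FilteringSpy) Handle(ctx context.Context, r slog.Record) error {
	select {
	case h.ch <- &r:
	default: // drop the record rather than block the logger
	}
	return nil
}

func (h *FilteringSpy) WithAttrs(attrs []slog.Attr) slog.Handler {
	clone := *h
	clone.attrs = append(append([]slog.Attr{}, h.attrs...), attrs...)
	return &clone
}

func (h *FilteringSpy) WithGroup(name string) slog.Handler {
	return h // groups are ignored in this sketch
}

Happy logging!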
