# Choose your fighter: benchmarking 5 WebSocket servers for Node.js

> Evil Martians benchmarked five WebSocket servers for Node.js: Socket.io, uWebSockets.js, and AnyCable (OSS and Pro). How we caught our own load generator lying, and how to make WebSocket benchmark numbers honest.

- Date: 2026-06-24T00:00:00.000Z
- Authors: Irina Nazarova, Travis Turner
- Categories: Performance, Real-time, Open Source
- URL: https://evilmartians.com/chronicles/choose-your-fighter-benchmarking-5-websocket-servers-for-nodejs

---

I compared five ways to run WebSockets on Node.js: default Socket.io, Socket.io with Connection State Recovery, uWebSockets.js, AnyCable OSS, and AnyCable Pro. In this post, I share how a banned laptop, a lying load generator, and a stubborn throughput disparity taught me that the hardest part of benchmarking is getting your own measuring rig out of the frame.

Here's the result up front. Across the five Node.js WebSocket stacks I tested, raw latency is a wash, all within a few milliseconds of each other at 10K subscribers. Under pressure, they pull apart.

- AnyCable held 100% message delivery through simulated network drops, rode out deploy-time reconnect storms with no restart, and led broadcast throughput at scale.
- Default Socket.io lost about 15% of messages under those same drops and topped out near 120K connections on one box.
- uWebSockets.js stayed the leanest per connection but needed seconds to recover from a reconnect storm. 

The full scoreboard is further down. But getting to numbers I'd stand behind started with a laptop and five burned IP addresses.

---

*Evil Martians build and scale real-time systems, including AnyCable, which holds over 800,000 idle connections on a single 32 GB box. Tell us where your WebSocket layer buckles.* [Contact Evil Martians](https://evilmartians.com/contact-us)

---

## A tour of the Americas, sponsored by an abuse-detection system

AnyCable is our own open source real-time server, so yes, we had a horse in this race, which is exactly why I tried so hard to catch it lying. All five setups were deployed on identical Railway hardware (32 vCPU and 32 GB of RAM), in the same region and the same project. The plan was simple: open ten thousand WebSocket connections to each server, drop them, broadcast through them, and measure what survives.

The load came from my laptop.

It worked for a while. Then it stopped working, because opening ten thousand WebSocket connections to a host from a single residential IP looks, to any edge protection worth paying for, exactly like the thing that edge protection exists to stop. Railway's edge took one look at my apartment and decided I was a botnet. Fair enough.

I hadn't budgeted for the fact that the block landed on my whole IP, not just the project I was hammering. So, everything of ours that lives on Railway went dark with it. **Solaris** and **Fizzy**, two of our internal tools, were both unreachable from my desk. The benchmark had drawn first blood, and it was mine.

Luckily, that IP was old news: I was on a Latin American Ruby tour that week, in São Paulo to accompany Vladimir for his [Tropical on Rails keynote](/events/tropical-on-rails-keynote) and then on to the **Ruby meetup in Montevideo**, which served as a natural experiment in how many fresh IP addresses one person can burn through between talks:

- **Panama, airport layover on the way down.** Fresh IP. One good cup of geisha coffee, then blocked.
- **Montevideo, hotel WiFi before the Ruby meetup.** Fresh IP. Blocked.
- **São Paulo, hotel WiFi during Tropical on Rails.** Blocked.
- **Chicago, layover on the way home.** Blocked before I finished my McMuffin.
- **Home, where it had all started.** Still blocked, but the mobile IP worked.

I was running a distributed system's load test as a frequent-flyer program, and losing on every leg.

Even on a clean IP, the laptop was the wrong tool. A developer machine taps out around a thousand clients before local NAT and a single event loop become the bottleneck. I needed ten thousand, and the bans just made the deadline louder. The problem was that **my laptop was inside the experiment.**

## Lesson zero: get the measuring rig off your laptop

There are two reasons a load generator does not belong on your machine, and the IP ban is the less important one.

The important one is actually physics. When my laptop in a São Paulo hotel measures "WebSocket latency" to a server in a Railway region, the number it writes down includes hotel WiFi, the hotel's upstream ISP, a few thousand kilometers of fiber, and whatever Starlink or congestion was doing that minute. None of that is the WebSocket server. All of it is in the result. You cannot benchmark a server through a network you do not control and call the output a property of the server.

So I moved the load generator to where the servers live. A small service I call the **bench-runner** runs inside the same Railway project, on the internal network, next to the targets. My laptop's job shrank to one sentence: tell the bench-runner to start, then wait and collect the JSON.

```
                ┌──────────────────────────────────────────────┐
                │            Railway, single region            │
                │                                              │
   driver       │   socketio-server  ◄─── /_broadcast ┐        │
   (local) ───► │    (Node 22)                        │        │
                │                                     │        │
                │   anycable-go ◄── /_broadcast ──────┤        │
                │   (or anycable-go-pro)              │        │
                │                                     │        │
                │   uws-server     ◄── /_broadcast ───┤        │
                │                                     │        │
                │   bench-runner ◄───── POST ─────────┘        │
                │   (50× shards for 1M-scale tests)            │
                └──────────────────────────────────────────────┘
                         ▲
                         │  HTTPS, 5-min edge cap
                         │  → bench-runner returns 202 {jobId}
                         │  → driver polls /jobs/:id
```

Three rules came out of this, and they are the whole foundation of the setup:

- **The thing being measured is separate from the thing measuring.** The WebSocket layer is the subject. The bench-runner is the load generator. Put them in the same process and every dropped frame in the generator becomes a dropped frame in your result, with no way to tell which one you are looking at.
- **The driver runs local, the work runs remote.** The CLI on the laptop triggers and waits. All WebSocket load lives on the bench-runner, on the internal network, with no hotel between it and the target.
- **One clock owns latency.** `sentAt` is stamped when the bench-runner dispatches a publish, `receivedAt` when the subscriber callback fires. One process, one clock, no cross-host NTP arithmetic.
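
The third rule can be sketched as a small tracker that lives entirely inside the load generator. This is a minimal illustration, not the bench-runner's actual code: `createLatencyTracker`, `markSent`, and `markReceived` are hypothetical names, and the injectable `now` function is there only so the sketch can be exercised with a fake clock.

```ts
// Hypothetical sketch: both timestamps come from one clock in one process.
function createLatencyTracker(now: () => number = () => performance.now()) {
  const sentAt = new Map<string, number>();
  const samplesMs: number[] = [];

  return {
    samplesMs,

    // Call right before the publish is dispatched.
    markSent(messageId: string): void {
      sentAt.set(messageId, now());
    },

    // Call from each subscriber callback; one sample per delivery.
    // The entry is kept because many subscribers receive the same message.
    markReceived(messageId: string): void {
      const sent = sentAt.get(messageId);
      if (sent === undefined) return; // unknown id: skip rather than guess
      samplesMs.push(now() - sent);
    },
  };
}
```

`performance.now()` is monotonic, so a wall-clock adjustment in the middle of a run cannot produce a negative or inflated sample.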

Great. The rig was now respectable, the bans stopped, and the numbers came out clean.

The numbers were also wrong.

## The twist: the rig was still lying, it was just lying from inside Railway now

Here is the number that shipped on our own comparison page, and one that I believed for an embarrassingly long time: AnyCable's p99 latency at 10,000 subscribers was **234 ms**. We had even written the narrative around it. "AnyCable trades a little latency for delivery guarantees." It was a tidy story. The replay buffer costs you something, and here is what it costs.

The real server-side p99 was **11 ms**.

The other 223 milliseconds were never in the server. They were in the bench-runner.

All ten thousand subscriber clients lived in one Node.js process. At that scale, a single event loop spends its time deserializing ActionCable JSON envelopes for thousands of cables, one after another, sequentially. The WebSocket frames arrived at the socket on time. The `receivedAt` timestamp got stamped late, because the callback that stamps it was stuck in a queue behind nine thousand other callbacks. I was not measuring how fast the server delivered. I was measuring how fast a saturated Node event loop could acknowledge that delivery had already happened.
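
A toy model makes the mechanism concrete. The sketch below is not the bench-runner: it assumes every frame arrives at the same instant and that each subscriber callback costs a fixed slice of event-loop time. Both `serverDeliveryMs` and `perCallbackMs` are illustrative inputs, not measurements. The only point is that the stamped latency grows with a callback's position in the queue while the server's delivery time stays put.

```ts
// Toy model: frames arrive together, but one event loop runs callbacks one by one.
function stampedLatenciesMs(
  clients: number,
  serverDeliveryMs: number,
  perCallbackMs: number,
): number[] {
  const stamped: number[] = [];
  for (let i = 0; i < clients; i++) {
    // Callback i waits for the i callbacks ahead of it, then runs itself.
    stamped.push(serverDeliveryMs + (i + 1) * perCallbackMs);
  }
  return stamped; // already ascending, because the queue is FIFO
}

// Nearest-rank percentile over an ascending array.
function nearestRank(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1] ?? NaN;
}

// Same simulated server, two harness layouts (all inputs are made up).
const oneProcessP99 = nearestRank(stampedLatenciesMs(10_000, 5, 0.02), 99);
const perShardP99 = nearestRank(stampedLatenciesMs(250, 5, 0.02), 99);
console.log({ oneProcessP99, perShardP99 });
```

In this model the p99 gap between the two layouts comes entirely from queue length, since the simulated server is identical in both calls.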

The same cascade inflated the throughput test's p99 by roughly 10x. My load generator was the slowest thing in the building, and it was writing its own slowness down as the server's latency.

### How I caught it: give the rig slack, then watch what moves

I held the system under test perfectly still: ten thousand subscribers, one target server, identical server-side load. Then I gave the *measuring rig* more room. Instead of one process holding 10,000 clients, I spread the same 10,000 clients across **40 shards of 250 clients each**. Forty Node event loops, each comfortable, each with nothing better to do than stamp timestamps on time.

The server saw the exact same load. Only the harness changed.

- Latency p99: **234 ms → 11 ms**
- Throughput p99: **150 ms → 12 ms**

*Image: Same server underneath both bars. Only the measuring rig changed.*

When you give the measuring apparatus more headroom and the result changes while the system under test has not, you were measuring the apparatus. That is the whole trick. A benchmark number that improves when you add capacity to the load generator is a number that belonged to the load generator.
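
One way to turn that check into a habit is to compare the same metric across two harness layouts while the server-side load stays fixed. The helper below is a hypothetical sketch, not part of the published bench code: `harnessBound`, the `HarnessRun` shape, and the 1.5x tolerance are all assumptions chosen for illustration.

```ts
interface HarnessRun {
  shards: number;       // load-generator processes used
  clientsTotal: number; // must match between runs, so the server sees the same load
  p99Ms: number;
}

// True when adding harness capacity alone moved the number by more than `tolerance`x.
function harnessBound(tight: HarnessRun, roomy: HarnessRun, tolerance = 1.5): boolean {
  if (tight.clientsTotal !== roomy.clientsTotal) {
    throw new Error("Runs are not comparable: server-side load differs");
  }
  if (roomy.shards <= tight.shards) {
    throw new Error("Second run must give the harness more shards");
  }
  return tight.p99Ms > roomy.p99Ms * tolerance;
}

// The two runs described above: 1 process vs 40 shards, 10,000 clients each time.
const verdict = harnessBound(
  { shards: 1, clientsTotal: 10_000, p99Ms: 234 },
  { shards: 40, clientsTotal: 10_000, p99Ms: 11 },
);
console.log(verdict); // true: 234 > 11 * 1.5
```

A `true` here says nothing about the server yet. It only says the first number cannot be attributed to it.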

### Then I proved it a second way, because one method is a guess

The leeway trick told me the harness had been lying, but it did not tell me what the server actually costs. So I built a second, independent measurement that came at the question from a different angle: an OpenTelemetry-shaped tracer running inside the Railway network that timed only the server's broadcast path, the broker write plus the HTTP round trip, with no client-side cascade anywhere near it.

That path measured **under 11 ms p99 at 2,500 subscribers.** Two methods, two directions, same answer. The "broker write costs 200 ms" claim we had shipped was wrong by a factor of twenty.

We fixed the page. The hero verdict stopped saying "trades latency for reliability" and started saying what was true: server-side latency comparable to the alternatives, with a 4 ms broker-write premium that buys you the replay buffer.

## The counter-twist: sometimes you are wrong and the rig is right

It would be a clean morality tale if the lesson were "distrust your benchmark." But it's not.

AnyCable kept coming out slower than uWebSockets.js on raw throughput. My very first instinct, freshly burned by the latency cascade, was: test bug. Obviously, the harness is lying again. I chased it for several rounds. I even swapped the broker transport to NATS to rule out the in-memory path.

The gap did not move, because it was real. For that specific configuration, a single HTTP publisher feeding one `anycable-go` broadcaster, AnyCable genuinely trailed uWS's bare in-process emit. [Vladimir Dementyev](/martians/vladimir-dementyev), AnyCable's creator, reproduced it with `benchi`, an in-process benchmark with no network hop at all, and got the same shape. We added `benchi` to the bench-runner image and now report both numbers side by side on the page: the production-shape figure over HTTP, and the in-process ceiling.

So the discipline is not "the tool lies." The discipline is symmetric:

> Every surprising number gets a second, independent measurement that brackets it from a different angle. When the second method agrees, the number is real. When it disagrees, one of your two rigs is wrong and you get to find out which.

Two surprises, two opposite resolutions, and the same move settled both:

| Surprise | Reflex | Second measurement | Verdict |
| --- | --- | --- | --- |
| AnyCable p99 = 234 ms | "the server is slow" | 40-shard re-run + intra-network trace | **rig was lying**, server is 11 ms |
| AnyCable < uWS throughput | "the rig is lying" | in-process `benchi`, no network | **rig was right**, assumption was wrong |

If you only apply skepticism in one direction, you'll fix the flattering errors and keep the unflattering ones, or the reverse. So bracket everything.

## Field notes: the smaller traps that each cost me a night

The two big reversals get the headline. The traps below did not, although every one of them changed a number on the page.

### The port-pool ceiling, or: how my one banned laptop IP became fifty Railway IPs

Remember the IP ban? The fix for it turned out to be the same fix for a completely separate problem.

A Linux container has roughly 64K outbound ports per source IP, of which about 50K are usable after kernel reservations. One load-generator container therefore caps at about 50,000 WebSocket connections to a single target, no matter how much RAM it has. To reach a million, you need at least twenty source IPs.

The traditional answer is kernel tuning, `net.ipv4.ip_local_port_range` and friends, which needs root, is platform-specific, and turns your benchmark into a sysctl tutorial nobody will reproduce.

My answer: fifty bench-runner containers, each with its own source IP and its own port pool, with a coordinator that fans out the load and merges the results.

*Image: The fleet: fifty bench-runner shards, each with its own source IP, sharing one Railway project with the target servers.*

```bash
SHARDS=https://bench-runner-1.up.railway.app,https://bench-runner-2.up.railway.app,... \
PER_SHARD_N=20000 HOLD_SEC=120 RAMP_PER_SEC=200 \
  npm run bench:idle:multi
```

A thing that started as "my laptop has one IP and it is banned" ended as "the rig has fifty IPs and a port budget of two and a half million." Giving the harness slack, again. This is the same lesson as the latency cascade, it's just wearing a different hat.
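
The arithmetic behind that budget is short enough to write down. The sketch below uses the round figure of about 50,000 usable ports per source IP quoted above; the function names are made up for illustration.

```ts
const USABLE_PORTS_PER_IP = 50_000; // round figure from the text, not a kernel constant

// Fewest source IPs needed to open `targetConnections` sockets to one target.
function minSourceIps(targetConnections: number, portsPerIp = USABLE_PORTS_PER_IP): number {
  return Math.ceil(targetConnections / portsPerIp);
}

// Upper bound on outbound connections a fleet of shards can open to one target.
function portBudget(shards: number, portsPerIp = USABLE_PORTS_PER_IP): number {
  return shards * portsPerIp;
}

console.log(minSourceIps(1_000_000)); // 20
console.log(portBudget(50));          // 2500000
```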

### Equalize the variable you are not testing

My first jitter test set the offline window to one second. Every subscriber's TCP socket got force-closed, stayed down for a second, came back. I assumed it was the same disruption for everyone.

But it was not the same disruption for anyone. Each client library applies its own reconnect backoff:

| Setup | Mechanism | Effective offline window |
| --- | --- | --- |
| Default Socket.io | runner opens a fresh socket after the sleep | sleep length, exactly as set |
| Socket.io + CSR | library `reconnectionDelay: 2000` | ~2.0 to 5.0 s |
| AnyCable | `@anycable/core` Monitor backoff | ~2.0 s |
| uWS | custom wrapper matching socket.io-client | ~2.0 to 5.0 s |

Default Socket.io was the only config actually facing a one-second window, because the library never got to apply its backoff. So I floored the offline window at two seconds for everyone:

```ts
// lib/core/timing.ts
export const MIN_OFFLINE_MS = 2000;
```
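
Applying the floor is a one-line clamp. The helper below is a hypothetical sketch of how such a constant might be used, not the repository's code; `effectiveOfflineMs` is an invented name.

```ts
const MIN_OFFLINE_MS = 2000;

// Clamp the requested offline window so every setup faces at least the same outage.
function effectiveOfflineMs(requestedMs: number, floorMs = MIN_OFFLINE_MS): number {
  if (!Number.isFinite(requestedMs) || requestedMs < 0) {
    throw new RangeError(`Invalid offline window: ${requestedMs}`);
  }
  return Math.max(requestedMs, floorMs);
}

console.log(effectiveOfflineMs(1000)); // 2000
console.log(effectiveOfflineMs(5000)); // 5000
```

The clamp only raises short windows. A longer requested outage passes through unchanged, so tests that deliberately use a long window are unaffected.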

Equalizing flipped the headline. 

- Before: default Socket.io showed ~27% delivery, and CSR showed ~80% delivery with a terrifying 100-second p95 tail. 
- After: default Socket.io 84%, CSR 100% with a 1.97 s p95. 

That 100-second tail was never a CSR property; it was ten thousand clients all reconnecting inside the same one-second window while the box serialized the accept queue.

> When a benchmark hands you a dramatic result, ask what other variable is riding along with the one you think you are testing. Equalize it. If the result holds, it's real. If the shape changes, you were measuring the passenger.

### Keep the test shape identical, down to the HTTP request count

My first design had AnyCable publishing one HTTP POST per broadcast, while Socket.io did one POST that kicked off an in-process loop of emits. If the test is "100 broadcasts," every setup should do 100 broadcasts the same way. The asymmetric version baked about 3 ms of clock skew into one side and quietly favored it. I converted everyone to per-message HTTP publishing and kept the in-process path only as a labeled diagnostic. Symmetric, or the comparison is decoration.

### Merge distributions, do not average percentiles

Once load is spread across shards, the naive way to combine results is to average each shard's p99. That answer is wrong, because a percentile is not a mean. Each shard returns its sorted latency samples, and the driver concatenates, re-sorts, and recomputes over the union:

```ts
// lib/core/stats.ts
export function mergeJitterResults(label: string, shards: JitterResult[]) {
  const merged: number[] = [];
  for (const s of shards) for (const v of s.latencySamplesSorted!) merged.push(v);
  merged.sort((a, b) => a - b);
  // recompute p50 / p95 / p99 over the merged distribution
}
```
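
A small worked example shows why averaging goes wrong. The samples below are invented for illustration: one shard is uniformly fast and the other has a slow tail. `percentile` is a plain nearest-rank helper, not the repository's implementation.

```ts
// Nearest-rank percentile over an ascending array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1] ?? NaN;
}

// Invented samples (ms): shard A is uniformly fast, shard B has a slow tail.
const shardA = Array.from({ length: 100 }, () => 10);
const shardB = [
  ...Array.from({ length: 90 }, () => 10),
  ...Array.from({ length: 10 }, () => 500),
];

// Wrong: average the per-shard p99 values.
const averagedP99 = (percentile(shardA, 99) + percentile(shardB, 99)) / 2;

// Right: pool the raw samples, re-sort, then take the percentile once.
const pooled = [...shardA, ...shardB].sort((a, b) => a - b);
const mergedP99 = percentile(pooled, 99);

console.log({ averagedP99, mergedP99 }); // { averagedP99: 255, mergedP99: 500 }
```

The averaged figure of 255 ms is a value no client in either shard ever experienced, while the merged p99 reports the tail that 5% of the pooled samples actually hit.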

### Name what stopped each test

I pushed all five servers toward one million idle connections on one box. What stopped each one matters more than the count it reached:

- **Socket.io** stopped at **119,826** with memory and CPU near idle. One core was pinned handling handshakes serially. A single-threaded JS ceiling.
- **AnyCable OSS** reached **821,877**, using 28.3 GB at ~34 KB per connection, closing in on the 32 GB RAM wall.
- **AnyCable Pro** reached **822,037** using just 14.8 GB at ~18 KB per connection and 7.9% CPU, with roughly half the box still free. It did not hit a wall; the test run ended before it could.

*Image: Same headline number, opposite headroom: OSS near its RAM wall, Pro using half the box.*

AnyCable OSS and Pro stopped at almost the same number, and it meant opposite things: OSS was closing on its RAM wall, while Pro was using less than half the box. "It held N connections" is a useless fact without the reason it stopped at N.
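
The per-connection figures follow directly from the totals, and the same arithmetic exposes the headroom. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 KB = 1,000 bytes), which is the reading that reproduces the ~34 KB and ~18 KB figures above; `describeRun` and the `IdleRun` shape are hypothetical.

```ts
interface IdleRun {
  name: string;
  connections: number;
  usedGb: number;
}

const BOX_GB = 32; // RAM on the test box

// Derive memory per connection and the share of the box's RAM in use.
function describeRun(run: IdleRun) {
  const kbPerConnection = (run.usedGb * 1e9) / run.connections / 1e3;
  const ramUsedFraction = run.usedGb / BOX_GB;
  return { name: run.name, kbPerConnection, ramUsedFraction };
}

const oss = describeRun({ name: "AnyCable OSS", connections: 821_877, usedGb: 28.3 });
const pro = describeRun({ name: "AnyCable Pro", connections: 822_037, usedGb: 14.8 });

for (const r of [oss, pro]) {
  console.log(
    `${r.name}: ${r.kbPerConnection.toFixed(1)} KB/conn, ` +
      `${(r.ramUsedFraction * 100).toFixed(0)}% of RAM used`,
  );
}
```

Reporting the used fraction next to the connection count is what separates a run that hit a wall from one that simply ended.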

## The scoreboard

A methodology is only worth so much without numbers, so here's where the five servers actually landed once the rig was honest. The full, reproducible comparison lives on the [AnyCable Node.js WebSocket comparison page](https://anycable.io/compare/nodejs-websocket/); this is the top-line view.

*Image: The top-line scoreboard: latency, jitter delivery, reconnect survival, and footprint.*

Four questions, four answers:

- **Is it fast enough?** At 10K subscribers the leaders sit within a few milliseconds of each other: Socket.io + CSR at 3/7 ms p50/p99, uWS at 2/10 ms, AnyCable Pro at 3/11 ms. Nobody here is slow.
- **Does it survive bad networks?** Under WiFi drops, default Socket.io delivers 85% and uWS 87%, while Socket.io + CSR and AnyCable both deliver 100%.
- **Does it survive a deploy?** In a reconnect storm, Socket.io + CSR falls over past 10K, uWS takes 9 seconds to absorb 20K, and AnyCable rides it out with no restart.
- **What does it cost to run?** Per idle connection, a bare uWS wire is lightest at 5 KB, AnyCable Pro carries its replay buffer for 18 KB, and Socket.io needs ~52 KB and caps around 120K connections per node.

No single server wins every column. AnyCable's case is the combination: 100% delivery under jitter, connections that survive deploys, the lead on broadcast throughput at scale (10× lower p99 than uWS under parallel publishers), and 2.5× less RAM than Socket.io, with latency in line with everyone else.

## What "honest" turned out to mean

Some of the results did not flatter us, and they stayed. uWS holds more connections per gigabyte than AnyCable Pro at a million-connection scale, because a bare wire with no replay buffer is lighter than one that carries one. The page says so. CSR's replay tail is competitive with AnyCable's in the in-memory setup, and we kept that result the day the data showed it, even though our first draft had said otherwise.

Four rules survived the whole build:

1. Run every library in the comparison at its best, with its real defaults, never its weakest setting.
2. Keep the test shape byte-for-byte identical across every subject.
3. Publish the code, the parameters, and the raw results, not just the chart.
4. Rebuild the test the moment someone shows you a fair criticism.

And underneath all four, this is the thing that four countries' worth of IP bans really beat into me:

**Your benchmark is part of the experiment.** The load generator, the network between it and the target, the process count, the clock, the edge proxy, the source IP: every one of them can write its own behavior down as the server's. The work of benchmarking honestly is mostly the work of finding where your rig is secretly in the frame, and stepping it out, one banned laptop at a time.

The code is open. If you find a flaw, open an issue. We would rather fix it than leave a wrong number standing.

*Code: [github.com/anycable/nodejs-websocket-bench](https://github.com/anycable/nodejs-websocket-bench). The full comparison: [anycable.io/compare/nodejs-websocket](https://anycable.io/compare/nodejs-websocket).*

