200ms ± 500ms

30 Jun, 2026 · 11 min read
I once needed the SLA for an endpoint my dashboard leaned on, so I asked the team that owned it. Their lead came back with 200ms ± 500ms. Read that literally and the fastest responses arrive 300ms before the request is even sent. The number wasn’t malicious — it came straight out of the standard formulas. The formulas were wrong for the data, and that mistake is everywhere.

Statistics for programmers

This is about the statistics we actually run into — latencies, throughputs, sizes — and the one assumption that quietly wrecks most of it.

A number that can’t be right

200ms ± 500ms. The intent was “about 200 milliseconds, give or take.” But read a ± the way we all do — 200 ± 10 means “almost always between 190 and 210” — and ± 500 on 200 promises responses from −300ms to +700ms. Negative time is nonsense. When I pointed that out, the number was revised on the spot to “200, minus 200, plus 500” — an asymmetric 0-to-700ms. Better, except 0ms is impossible too: nothing returns instantly. Each patch produced another world that can’t exist, because the problem was never the size of the ± — it was folding a lopsided distribution into a symmetric “value ± error” at all.

And this wasn’t a one-off — confident, impossible numbers like this reach me again and again. Few of us are fluent in math, let alone statistics. So where did such a number come from — and what should it have been instead? Both answers start in the same place: stop summarizing, and look at the actual measurements.

Reproducing it

Take ten honest measurements, all clustered near 100ms:

100, 99, 101, 98, 102, 100, 98, 99, 102, 101

Run the usual numbers over them:

Statistic            Value (ms)
average              100
median               100
standard deviation   1.49

Everything agrees and everything looks respectable: 100ms ± 1.49ms. Tight, trustworthy, done.

One outlier

Now something hiccups — a garbage-collection (GC) pause, a cold cache, a retry — and one request takes 2,000ms. A single extra data point. Re-run the same formulas:

Statistic            With outlier (ms)   Before (ms)
average              273                 100
median               100                 100
standard deviation   573                 1.49

There it is again: 273ms ± 573ms, negative time and all. One point out of eleven moved the “average experience” from 100ms to 273ms, a number that none of the eleven requests actually saw, and inflated the spread roughly 380-fold (from 1.49 to 573). The median didn’t budge. The mean and the standard deviation both assume a symmetric, bell-shaped spread that our data never promised; the median assumes nothing about shape, so it stayed put.
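Both tables can be reproduced in a few lines. This is a minimal sketch, assuming the sample standard deviation (dividing by n - 1), which is the variant that matches the 1.49 and 573 above; the function names are mine, not from any library.

```javascript
// Arithmetic mean.
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Median: middle of the sorted copy, or the average of the two middle values.
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Sample standard deviation (n - 1 in the denominator).
// Assumption: this is the variant behind the numbers in the tables.
function sampleStdDev(xs) {
  const m = mean(xs);
  const sumSquares = xs.reduce((sum, x) => sum + (x - m) ** 2, 0);
  return Math.sqrt(sumSquares / (xs.length - 1));
}

const clean = [100, 99, 101, 98, 102, 100, 98, 99, 102, 101];
const withOutlier = [...clean, 2000];

for (const [label, xs] of [["clean", clean], ["with outlier", withOutlier]]) {
  console.log(label, {
    mean: Math.round(mean(xs)),
    median: median(xs),
    stdDev: sampleStdDev(xs).toFixed(2),
  });
}
```

The only difference between the two runs is the single 2,000ms value appended to the array.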

The tell isn’t always this loud, either. The same mistake hides in a perfectly respectable 200 ± 150 — no impossible negative to give it away, just a symmetric summary of something that was never symmetric.

Where it comes from

We were all taught the normal distribution first, and often only:

The normal distribution: a symmetric bell curve

The lesson usually arrives with three quiet suggestions: it’s the most common distribution, so approximate everything with it, and use its tidy formulas (mean, standard deviation) for our summaries. But the normal distribution has three properties baked in:

  • it’s real-valued — defined from −∞ to +∞;
  • it’s symmetric — equally likely above and below the mean;
  • it’s unimodal — one peak.

Real measurements rarely honor those. Latencies, payload sizes, and similar quantities are not symmetric: there’s a hard floor (nothing beats the speed of light or the algorithm’s lower bound) and a long right tail (anything can make a request slow). They cluster into a lopsided shape, often log-normal, sometimes heavier-tailed:

The log-normal distribution: the lopsided, long-tailed shape of typical response latencies

Counts behave differently again — requests per second, errors per hour, retries per job are discrete and never negative, nothing like a smooth bell.

Performance often comes in discrete levels (cache hit or miss, fast path or slow path), not a smooth spread, and plenty of distributions are multimodal. There are hundreds of named distributions; the point isn’t to tour the zoo, it’s that almost none of what we measure is the bell curve we reach for.

None of this makes the normal the enemy. It’s a genuinely good default: feed non-normal data into normal-theory formulas and the results are often surprisingly close anyway, because the central limit theorem keeps the average of enough independent samples nearly normal for any distribution with finite variance (worked through here). That’s why it’s everywhere, and reaching for it isn’t a blunder. The catch is narrower than the reputation: that resilience is about the mean. When the question is what a request actually feels like (the tail, the modes, the shape), the mean is the wrong summary, and the normal’s tidy formulas describe a curve the data never had.
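The claim about the mean can be checked with a small simulation. This is a rough sketch, not a proof: it assumes log-normal latencies with illustrative parameters (`mu = 4.6`, `sigma = 0.5`, a median near 100ms), uses a small seeded generator so runs are repeatable, and measures lopsidedness with sample skewness (0 for a symmetric shape, positive for a long right tail). All names here are my own.

```javascript
// Small seeded generator (mulberry32) so the run is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via the Box-Muller transform.
function normal(rng) {
  const u1 = 1 - rng(); // in (0, 1], avoids log(0)
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Log-normal draw: exponentiate a normal. mu and sigma are illustrative.
function logNormal(rng, mu = 4.6, sigma = 0.5) {
  return Math.exp(mu + sigma * normal(rng));
}

// Sample skewness: third central moment over the 1.5 power of the second.
function skewness(xs) {
  const n = xs.length;
  const m = xs.reduce((s, x) => s + x, 0) / n;
  const m2 = xs.reduce((s, x) => s + (x - m) ** 2, 0) / n;
  const m3 = xs.reduce((s, x) => s + (x - m) ** 3, 0) / n;
  return m3 / m2 ** 1.5;
}

const rng = mulberry32(42);

// Individual simulated latencies.
const raw = Array.from({ length: 20000 }, () => logNormal(rng));

// Means of 50 simulated latencies each.
const means = Array.from({ length: 2000 }, () => {
  let sum = 0;
  for (let i = 0; i < 50; i++) sum += logNormal(rng);
  return sum / 50;
});

console.log("skewness of raw latencies:  ", skewness(raw).toFixed(2));
console.log("skewness of 50-sample means:", skewness(means).toFixed(2));
```

The averages come out far closer to symmetric than the individual values. That near-normality belongs to the averages only; the individual requests stay as lopsided as they started.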

When the average lies

A second real case. A cache with a 90/10 hit rate: the fast path returns in 10ms, the slow path in 100ms. The average is 0.9 × 10 + 0.1 × 100 = 19ms. Looks fine on a dashboard.

But trace actual users:

  • 9 of 10 are served in 10ms,
  • 1 of 10 waits 100ms,
  • nobody is served in 19ms.

The “average” is a value no real user ever experiences, and that 100ms tail may be exactly the part that’s unacceptable. (The mean has its place — totals and capacity, where load is mean × count — but “typical experience” isn’t it.) This is a bimodal distribution — two peaks — and the mean falls in the empty valley between them:

A bimodal distribution: two peaks with a valley between them
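The cache arithmetic is easy to verify directly. A minimal sketch, assuming exactly 100 requests split 90/10 so the counts are whole numbers:

```javascript
// 100 requests at a 90/10 hit rate: 90 hits at 10 ms, 10 misses at 100 ms.
const requests = [...Array(90).fill(10), ...Array(10).fill(100)];

const mean = requests.reduce((sum, x) => sum + x, 0) / requests.length;

// Count how many requests actually took each distinct latency.
const counts = new Map();
for (const ms of requests) counts.set(ms, (counts.get(ms) ?? 0) + 1);

console.log(`mean: ${mean} ms`);
console.log(`requests served in exactly ${mean} ms: ${counts.get(mean) ?? 0}`);
for (const [ms, n] of counts) console.log(`${ms} ms: ${n} requests`);
```

The mean is 19ms, and the count of requests at 19ms is zero: every request lands on one of the two peaks.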

A real one: the tax-time mystery

At a major fintech company, every tax season the same complaints trickled in: the site worked fine until a customer logged in, and then their data took forever to appear. The reporters were small customers — shops that touched the service maybe once a year. It was sporadic, nobody could reproduce it on demand, and the in-house explanations ranged from “flaky internet” to “pure fiction.” But the reports kept piling up, and the servers showed nothing: no spikes, no errors, and the response-time stats looked perfectly normal.

So I instrumented the servers and plotted the actual response times. The bulk sat right where it should — and far off to the side was a small, separate cluster. A second mode the averages had quietly swallowed.

The cause: to cut costs, the backend had moved to tiered storage, and data nobody had touched in a long time was migrated to a cheaper, much slower tier. The first request paged it back, so any retry was fast — which is exactly why it could never be reproduced. The victims were precisely the once-a-year users; active customers, and our own testers who hammered the system daily, never left the fast tier, so to everyone in a position to look, the system was fine.

Two summary numbers said “normal.” The distribution said “a whole population is having a terrible time.” Only one of them was telling the truth.

Smell tests

A handful of signals make me stop and recheck before trusting a number:

  • A latency written as value ± error at all. 200 ± 500 or a tame-looking 200 ± 150 — either way, a symmetric error bar on a one-sided, long-tailed quantity claims a shape the data doesn’t have. The form alone makes me recheck.
  • A mean handed over without a median. Usually it means nobody looked at the shape — and the number may be wrong.
  • A mean and median that disagree. Far apart means the real distribution is skewed, multimodal, or both, and in any case not normal. Recheck.
  • A summary that contradicts what people feel — a flattering average next to complaints of multi-second waits. That gap is the data asking to be plotted; the histogram, or the percentiles behind it, shows what the average is hiding.
  • Outliers, in or out? Find out which, and ask why they’re there. An outlier usually has a story — sometimes the whole story, as the tax-time users were.
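Plotting does not need a charting library. Here is a rough sketch of a text histogram, assuming fixed-width buckets; the `histogram` helper and the sample data are made up for illustration.

```javascript
// Bucket values into fixed-width bins; returns [bucketStart, count] pairs
// sorted by bucketStart. Empty buckets are skipped.
function histogram(xs, bucketWidth) {
  const bins = new Map();
  for (const x of xs) {
    const start = Math.floor(x / bucketWidth) * bucketWidth;
    bins.set(start, (bins.get(start) ?? 0) + 1);
  }
  return [...bins.entries()].sort((a, b) => a[0] - b[0]);
}

// Illustrative data: a main cluster and a small, separate slow cluster.
const samples = [
  112, 125, 131, 140, 118, 152, 135, 127, 144, 160, 138, 121,
  2010, 2085, 2040,
];

for (const [start, count] of histogram(samples, 100)) {
  console.log(`${String(start).padStart(5)} ms | ${"#".repeat(count)}`);
}
```

A short bar far away from the main one is the kind of second cluster a mean absorbs without a trace.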

What to do instead

  • Collect enough first. Three samples decide nothing, and a tail percentile computed from 100 samples isn’t much better: a p99 from a hundred points leans on one or two data points, not a distribution. Modern hardware sorts a million values in a blink, so gather generously; go wild. (The returns do flatten eventually, but far past the 100 points people tend to settle for.)
  • Don’t assume the normal — check. It’s often a fine approximation, sometimes a resilient one, but that’s something to confirm on the data, not lean on by default.
  • Prefer distribution-free statistics. The median didn’t flinch at the outlier, and neither do the central percentiles. The tail ones — p95, p99 — aren’t robust; they move when the tail moves, which is exactly why we watch them. Quote p50 / p95 / p99 instead of mean ± standard deviation — that’s why latency service-level objectives (SLOs) and good benchmarks are written in percentiles.
  • In any critical case, plot the density. A histogram takes seconds and answers the only question that matters first: what shape is this?
  • Watch for multiple peaks. More than one means no single summary is honest — split the clusters and analyze them apart (the 90/10 cache is two populations wearing one number, and each, on its own, may well be normal).
  • Trim outliers only when they’re genuinely errors. A miskeyed value or a dropped probe, sure. But if the tail is real — the tax-time users were the tail — trimming it just deletes the problem you’re chasing.
  • Watch who’s in the sample. Our monitoring, our tests, and our most-active users tend to be the same convenient sample — so the population in trouble can be the one we never measure. The tax-time users above were invisible to everyone who used the system daily.
  • Advanced: when a closed-form formula doesn’t fit, bootstrap the statistic we care about straight from the data.
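The bootstrap in the last item can be sketched in a few lines. This is a minimal version of the percentile bootstrap under my own assumptions (2,000 resamples, a 95% interval, `Math.random` as the default source); `bootstrapInterval` is a hypothetical helper, not a library function.

```javascript
// Median: middle of the sorted copy, or the average of the two middle values.
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Percentile bootstrap: resample with replacement, recompute the statistic,
// then read the interval off the sorted results.
function bootstrapInterval(
  xs,
  statistic,
  { resamples = 2000, level = 0.95, rng = Math.random } = {},
) {
  const stats = [];
  for (let i = 0; i < resamples; i++) {
    // Same size as the original sample, drawn with replacement.
    const resample = Array.from(xs, () => xs[Math.floor(rng() * xs.length)]);
    stats.push(statistic(resample));
  }
  stats.sort((a, b) => a - b);
  const tail = (1 - level) / 2;
  const lo = stats[Math.floor(tail * (resamples - 1))];
  const hi = stats[Math.ceil((1 - tail) * (resamples - 1))];
  return [lo, hi];
}

const latencies = [100, 99, 101, 98, 102, 100, 98, 99, 102, 101, 2000];
const [lo, hi] = bootstrapInterval(latencies, median);
console.log(`median ${median(latencies)} ms, 95% interval ${lo} to ${hi} ms`);
```

Any function of an array can be passed as `statistic`, so the same helper works for a p95 or a trimmed mean. Results vary slightly from run to run unless a seeded `rng` is supplied.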

Computing percentiles is trivial

There’s rarely a reason to reach for a formula, because for data that fits in memory (or on disk) a percentile is trivial to read off. Sort the values; the p-th percentile is the one p% of the way up the sorted list. No distribution assumed, nothing fitted:

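class Percentiles {
  constructor(xs) {
    // Sort a copy once, numerically (the default sort compares as strings).
    this.sorted = xs.toSorted((a, b) => a - b);
  }
  // Nearest-rank percentile for p in [0, 100]; undefined for an empty sample.
  at(p) {
    const rank = Math.ceil((p / 100) * this.sorted.length);
    // Clamp so that p = 0 returns the minimum instead of indexing at -1.
    return this.sorted[Math.max(rank, 1) - 1];
  }
}

// `latencies` is an array of numbers, e.g. response times in ms.
const lat = new Percentiles(latencies);
lat.at(2.5);  // bottom of the central 95%
lat.at(50);   // the median
lat.at(97.5); // top of the central 95%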

Sort once, and every percentile after that is a cheap index. (Real implementations interpolate between neighbors for a smoother answer — the median of an even sample is the average of the two middle values — but the idea is the same.)
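For comparison, here is a sketch of the interpolating variant. It assumes one common convention, a fractional index of p/100 × (n − 1) into the sorted array with linear interpolation between the two neighbors; libraries differ in the exact rule, so treat this as one option rather than the definition.

```javascript
// Percentile with linear interpolation between the two nearest sorted values.
// Expects an already sorted array and p in [0, 100].
function percentile(sorted, p) {
  if (sorted.length === 0) return undefined;
  const rank = (p / 100) * (sorted.length - 1); // fractional index
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  const weight = rank - lo; // how far between the two neighbors
  return sorted[lo] + (sorted[hi] - sorted[lo]) * weight;
}

const sorted = [10, 20, 30, 40];
percentile(sorted, 50); // 25, halfway between the two middle values
percentile(sorted, 25); // 17.5
percentile(sorted, 0);  // 10, the minimum
```

On large samples the two variants agree closely; the difference only shows when neighbors are far apart, which is usually in the tail.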

That’s all it takes to replace that opening 200 ± 500 honestly. A 95% spread is just p2.5 to p97.5, with the median at p50: three real numbers off one sorted array, no symmetry assumed. (When it’s the tail that bites — latency SLOs — quote p95 or p99 instead.) And percentiles don’t care about shape: bell, skew, two humps, or dead flat, p95 still means the same thing. On a flat stretch “modes” and “peaks” stop meaning anything, yet the percentiles read exactly the same — which is the real reason to reach for them.

The genuinely hard case is the unbounded stream you can’t hold all at once, which is what sketches like t-digest and HDR histograms are for. But most of the time the data fits, and the honest answer is one sort() away.
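To give a feel for the streaming case, here is a deliberately simplified stand-in: a histogram with logarithmic buckets that keeps counts instead of values. It is not t-digest and not an HDR histogram, only an illustration of the trade they make, which is bounded memory in exchange for a bounded relative error. The class name and the default of 8 buckets per doubling are my own choices.

```javascript
// Fixed-memory histogram with logarithmic buckets. Memory grows with the
// range of the values, not with how many values are recorded.
class LogHistogram {
  constructor(bucketsPerDoubling = 8) {
    this.scale = bucketsPerDoubling;
    this.counts = new Map(); // bucket index -> count
    this.total = 0;
  }

  // Record one measurement. Assumes value > 0.
  record(value) {
    const idx = Math.floor(Math.log2(value) * this.scale);
    this.counts.set(idx, (this.counts.get(idx) ?? 0) + 1);
    this.total++;
  }

  // Estimate the p-th percentile as the upper edge of the bucket holding it.
  // Overestimates by at most a factor of 2^(1 / bucketsPerDoubling).
  percentile(p) {
    const target = Math.max(1, Math.ceil((p / 100) * this.total));
    let seen = 0;
    const indices = [...this.counts.keys()].sort((a, b) => a - b);
    for (const idx of indices) {
      seen += this.counts.get(idx);
      if (seen >= target) return 2 ** ((idx + 1) / this.scale);
    }
    return undefined; // nothing recorded yet
  }
}

const hist = new LogHistogram();
for (let ms = 1; ms <= 1000; ms++) hist.record(ms);
console.log("p50 estimate:", hist.percentile(50));
console.log("p99 estimate:", hist.percentile(99));
```

With 8 buckets per doubling the estimate is at most about 9% above the true value. Production sketches give tighter guarantees and support merging across hosts, which is the reason to use them instead of this.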

Summary

Back to that 200ms ± 500ms. Once we stopped patching the error bar and actually plotted the service, a real p95 fell out — and it was far less comfortable than “200ms.” That was the number I needed: I was driving a dashboard off that endpoint, and its slow tail was real, not theoretical. The shape had the answer the summary couldn’t.

The mistake isn’t using statistics; it’s borrowing the normal distribution’s formulas for data that isn’t normal. Latency isn’t symmetric, performance is often discrete, and real populations clump — so the mean and standard deviation describe a bell curve our data never agreed to be. Plot the shape first, prefer the median and percentiles, and split multimodal data into the clusters it’s actually made of. The honest summary is almost never a single number with a ± after it.

None of this retires the mean or the bell curve — they’re good tools, often the right ones. The skill is noticing when they aren’t, and reaching for the shape instead.

P.S.

This blog is a case in point. A new post’s traffic spikes the day it ships, then falls off: a sharp peak with a long right tail, never a bell curve. Across posts it’s the same story — a few busy ones and a long archive, each pulling a steady trickle. The mean views-per-post is, predictably, a number no post actually gets.