The Four Bottlenecks Every System Eventually Hits
The Framework That Actually Matters at Scale
When a system slows down under load, the cause is almost always one of four things: compute cycles, working set size, persistence I/O, or data movement. CPU, memory, disk, network. Not because engineers memorized a list—but because these are the only physical resources a program can exhaust.
The useful reframe is that each one represents a distinct failure mode under load:
- CPU — throughput of computation
- Memory — size of the hot working set
- Disk — persistence bandwidth and latency
- Network — cross-boundary coordination cost
Let's work through what each actually looks like in the wild.
CPU: When Cost Per Request Explodes
CPU bottlenecks are rarely about raw QPS. They're about work per request.
Take a feed ranking service at 2k QPS. Each request does feature extraction, ML scoring, and filtering across ~500 posts. Even at moderate traffic:
Total work = QPS × cost_per_request
If the cost per request is high, horizontal scaling gets very expensive, very fast.
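To put rough numbers on it, here's a quick sizing sketch. All of the constants are illustrative assumptions, not measurements from the service above:

```python
# Back-of-envelope CPU sizing. All numbers are illustrative assumptions.

QPS = 2_000                    # requests per second
CPU_MS_PER_REQUEST = 20        # assumed: feature extraction + scoring for ~500 posts
CORE_UTILIZATION_TARGET = 0.7  # leave headroom for spikes

# Total CPU-seconds of work arriving per wall-clock second.
cpu_seconds_per_second = QPS * CPU_MS_PER_REQUEST / 1000

# Cores needed so that each core stays below the utilization target.
cores_needed = cpu_seconds_per_second / CORE_UTILIZATION_TARGET

print(f"{cpu_seconds_per_second:.0f} CPU-seconds of work per second")
print(f"~{cores_needed:.0f} cores at {CORE_UTILIZATION_TARGET:.0%} utilization")
# 2,000 req/s x 20 ms = 40 CPU-seconds/second -> ~57 cores.
```

Note what the multiplication implies: double the per-request cost and you double the fleet.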
The usual fix is precomputation—compute the ranking offline and cache results. But that trade-off isn't free: you've shifted CPU cost onto memory (cache size) and disk (materialized views), and introduced staleness.
The same pattern appears in JSON-heavy APIs. At high throughput, CPU isn't burning on business logic—it's burning on serialization, validation, and copying. Switching to protobuf reduces CPU and network payload size, but increases schema coupling and deployment complexity.
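A minimal timing sketch of that serialization cost, using a made-up payload shape. The absolute numbers will vary by machine; the order of magnitude is the point:

```python
import json
import time

# A made-up feed payload: 500 posts with a few fields each.
posts = [
    {"id": i, "author": f"user{i % 1000}", "score": i * 0.001,
     "text": "x" * 200, "tags": ["a", "b", "c"]}
    for i in range(500)
]

N = 1_000
start = time.perf_counter()
for _ in range(N):
    body = json.dumps(posts)
elapsed = time.perf_counter() - start

per_request_ms = elapsed / N * 1000
print(f"{per_request_ms:.2f} ms CPU per response just for json.dumps")
print(f"payload size: {len(body) / 1024:.0f} KiB")
# At 2k QPS, multiply per_request_ms by 2,000 to get the CPU-seconds/second
# spent on serialization alone, before any business logic runs.
```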
Memory: The Working Set Boundary
Memory is not about how much RAM you have. It's about whether your working set fits in RAM.
If it doesn't, the OS starts paging to disk. Cache hit rates collapse. What looked like a memory problem becomes a disk problem.
A concrete example: cache user feeds at ~1 MB each for 5M active users. That's ~5 TB. No machine holds that. The result isn't graceful degradation—it's frequent evictions, low hit rates, and constant fallback to the database.
The fix isn't "add more memory." It's recognizing that cache is only useful when it matches the access distribution. Cache the top few percent of active users, store IDs instead of full objects, tune TTL and eviction policy aggressively.
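A sizing sketch using the numbers above. The 2% hot fraction and the ID-list size are assumptions for illustration:

```python
# Cache sizing sketch using the example's numbers. The access skew
# (top 2% of users) and ID-list shape are assumptions.

ACTIVE_USERS = 5_000_000
FULL_FEED_BYTES = 1 * 1024 * 1024   # ~1 MB per materialized feed
ID_LIST_BYTES = 8 * 500             # assumed: 500 post IDs at 8 bytes each

naive_total = ACTIVE_USERS * FULL_FEED_BYTES
print(f"full feeds for everyone: {naive_total / 1024**4:.1f} TiB")  # ~4.8 TiB

# Cache only the hottest slice, and store IDs instead of full objects.
HOT_FRACTION = 0.02                 # assumed: top 2% of users
hot_users = int(ACTIVE_USERS * HOT_FRACTION)
realistic_total = hot_users * ID_LIST_BYTES
print(f"ID lists for top {HOT_FRACTION:.0%}: {realistic_total / 1024**3:.1f} GiB")
```

Three orders of magnitude of difference, and the smaller cache actually fits in one machine's RAM.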
A subtler version: a JVM service showing increasing latency with no CPU spike. The culprit is often GC pressure from a large heap with high object churn. More memory → longer GC pause cycles → CPU stalls. Memory and CPU are coupled indirectly, and this surprises engineers who treat them as independent knobs.
Disk: Latency That Compounds
Disk is slow, but more importantly, it's multiplicatively slow under bad access patterns.
A 10k QPS OLTP system doing 3 reads and 1 write per request hits 40k IOPS before accounting for index misses or random access. On SSDs that's manageable—until it isn't.
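The arithmetic, with an assumed IOPS budget standing in for whatever your volume actually delivers:

```python
# IOPS budget check for the OLTP example above. The SSD budget is an
# assumed round number; real devices and cloud volumes vary widely.

QPS = 10_000
READS_PER_REQUEST = 3
WRITES_PER_REQUEST = 1
SSD_IOPS_BUDGET = 100_000   # assumption: usable random IOPS for the volume

base_iops = QPS * (READS_PER_REQUEST + WRITES_PER_REQUEST)
headroom = SSD_IOPS_BUDGET / base_iops

print(f"base load: {base_iops:,} IOPS")                 # 40,000
print(f"headroom before saturation: {headroom:.1f}x")   # 2.5x, not 10x
```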
Missing a single index turns a millisecond query into a full table scan. At load, that means threads pile up, connection pools exhaust, and the failure cascades. Plenty of production outages trace back to one table missing an index.
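An in-memory analogy, not a database benchmark: a hash index versus a linear scan over the same rows. The ratio is what matters, not the absolute times:

```python
import random
import time

# Analogy for indexed lookup vs full table scan: a dict (hash index)
# against a linear search over the same rows.

N_ROWS = 1_000_000
rows = [(i, f"user{i}") for i in range(N_ROWS)]
index = {key: value for key, value in rows}    # the "index"

target = random.randrange(N_ROWS)

start = time.perf_counter()
hit = index[target]                            # indexed: O(1)
indexed_us = (time.perf_counter() - start) * 1e6

start = time.perf_counter()
hit = next(v for k, v in rows if k == target)  # full scan: O(n)
scan_ms = (time.perf_counter() - start) * 1e3

print(f"indexed lookup: {indexed_us:.1f} us, full scan: {scan_ms:.1f} ms")
# The absolute numbers don't matter; the ratio is the outage.
```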
A frequently missed source: synchronous logging. Disk flush per request, at high volume, creates tail latency spikes that look completely unrelated to the actual bottleneck.
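One standard mitigation, sketched here with Python's stdlib queue-based logging: request threads enqueue records and a background listener thread does the file I/O. The file name is a placeholder:

```python
import logging
import logging.handlers
import queue

# Keep disk flushes off the request path: hand log records to an
# in-process queue and let a background thread do the file I/O.

log_queue = queue.Queue(maxsize=10_000)        # bounded: drop, don't block
file_handler = logging.FileHandler("app.log")  # the slow, flushing handler

# The listener thread is the only thing that touches disk.
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger = logging.getLogger("app")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)

logger.info("request handled")   # returns immediately, no disk flush here
listener.stop()                  # flushes remaining records on shutdown
```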
Network: The Cost of Distribution
Every remote call is a tax: latency, failure probability, coordination complexity.
In a microservice chain—API → Auth → User → Feed → Ranking → Ads—even at 10 ms per hop you're at 60–100 ms in the best case. Add retries and tail latency and you're at 300–500 ms easily.
Fan-out makes it worse. If your feed service calls 50 downstream services in parallel, the request completes at the speed of the slowest one. And the tail compounds: the chance that none of 50 parallel calls lands in its own slowest 1% is 0.99^50 ≈ 61%, so roughly 4 in 10 requests experience at least one dependency's P99 or worse. One flaky dependency punishes every user.
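A Monte Carlo sketch makes the effect visible. The latency distribution below is invented (lognormal, ~10 ms median); only the shape of the result matters:

```python
import math
import random

# Simulating fan-out tail latency. The downstream latency distribution
# is an assumption: lognormal with a ~10 ms median and a heavy tail.

random.seed(42)
FAN_OUT = 50
TRIALS = 20_000

def call_ms():
    return random.lognormvariate(math.log(10), 0.5)

single = sorted(call_ms() for _ in range(TRIALS))
fanout = sorted(max(call_ms() for _ in range(FAN_OUT)) for _ in range(TRIALS))

def p99(samples):
    return samples[int(len(samples) * 0.99)]

print(f"one call:       P50={single[len(single)//2]:5.1f} ms  P99={p99(single):5.1f} ms")
print(f"50-way fan-out: P50={fanout[len(fanout)//2]:5.1f} ms  P99={p99(fanout):5.1f} ms")
# The fan-out *median* lands near the single-call P99: under fan-out,
# the tail becomes the typical experience.
```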
The Part Most Engineers Miss: These Constraints Trade Off
Adding a cache:
- ↓ Disk reads
- ↑ Memory pressure
- ↑ CPU (serialization, cache logic)
Compressing responses:
- ↓ Network bandwidth
- ↑ CPU (compression cost, which at scale may become the new bottleneck; see the sketch after these lists)
Denormalizing data:
- ↓ Disk reads (fewer joins)
- ↑ Disk writes (duplication on every update)
Moving work to queues:
- ↓ CPU on the request path
- ↑ Disk (queue persistence) and Network (message passing)
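To make one of these trade-offs concrete, here's a rough measurement of the compression case. The payload shape and compression levels are assumptions; run it against your real responses before deciding:

```python
import gzip
import json
import time

# Measuring the compression trade-off: bytes saved on the wire vs CPU
# spent compressing. Payload shape is a made-up, highly repetitive example.

payload = json.dumps([{"id": i, "text": "lorem ipsum " * 20}
                      for i in range(500)]).encode()

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    cpu_ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(payload)//1024} KiB -> "
          f"{len(compressed)//1024} KiB in {cpu_ms:.1f} ms CPU")
# Whether this is a win depends on which resource is scarcer right now:
# network bytes or CPU milliseconds.
```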
Every architectural decision is a load transfer, not a load elimination. The question is never "how do I remove this bottleneck?" It's "which bottleneck is cheapest to scale?"
How to Think About It in a Design Review
When evaluating a system, run through four questions:
- Where is the work happening? → CPU
- What is the hot working set? → Memory
- What hits persistent storage? → Disk
- What crosses process or machine boundaries? → Network
Then ask the one question that exposes the real design: at 10× current load, which one saturates first?
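That question reduces to arithmetic once you have utilization numbers. A sketch with assumed values standing in for real monitoring data:

```python
# The 10x question as arithmetic. Current utilizations are assumptions
# standing in for real monitoring data; scaling is treated as linear.

current_utilization = {
    "cpu":     0.30,   # fraction of fleet CPU in use
    "memory":  0.55,   # hot working set / available RAM
    "disk":    0.40,   # IOPS used / IOPS budget
    "network": 0.08,   # bandwidth used / bandwidth available
}

SCALE = 10
for resource, used in sorted(current_utilization.items(),
                             key=lambda kv: -kv[1]):
    projected = used * SCALE
    flag = "SATURATES" if projected >= 1.0 else "ok"
    print(f"{resource:8s} {used:.0%} -> {projected:>5.0%}  {flag}")
# Under linear scaling, anything above 10% utilization saturates at 10x;
# here memory hits the wall first, long before network.
```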
Good systems don't eliminate bottlenecks. They make a deliberate choice about which one they're willing to pay for.