Back to the path
Level 5 advanced 27 min #rate-limiter#concurrency#algorithms#thread-safety#system-design#token-bucket#distributed-systems

Designing a Thread-Safe Rate Limiter

A rigorous walk through rate-limiting algorithms (token bucket, leaky bucket, fixed/sliding window), their trade-offs, and a correct thread-safe token-bucket implementation in three languages — including per-user keying, edge cases, free-threaded caveats, and the jump to a distributed Redis-backed design using an atomic Lua script.

The Problem, Framed for an Interview

“Design a rate limiter” is a deceptively deep prompt. The interviewer is probing three things at once: do you know the algorithms and their trade-offs, can you write correct concurrent code, and do you understand how the design breaks when it goes distributed. A weak candidate jumps straight to code. A strong candidate spends the first two minutes pinning down requirements out loud.

The contract you are building is one method:

allow(key) -> boolean        // may this request proceed right now?

Before writing anything, say these clarifying questions to your interviewer:

  • What is the limit shape? “N requests per window” (e.g. 100/min) or “sustained R requests/sec with burst capacity B”? These map to different algorithms.
  • What is the key? Per-user, per-IP, per-API-key, per-endpoint, or a composite like (user, endpoint)? This decides your map structure.
  • Single process or distributed fleet? A single in-memory limiter is ~30 lines; a correct distributed one needs an atomic store like Redis.
  • Fail-open or fail-closed? If the limiter’s backing store is down, do you let traffic through (availability) or block it (protection)?
  • Hard or soft limit? Reject with 429, or queue/throttle?

In an interview, explicitly state your assumptions: “I’ll build a per-user, in-process, thread-safe token bucket first because it handles bursts gracefully, then extend it to Redis for the multi-node case.” This signals you have a plan.

The Four Core Algorithms

Fixed Window Counter

Divide time into fixed windows (e.g. each calendar minute). Keep a counter per key per window; increment on each request; reject when the counter exceeds the limit; reset at the window boundary.

  • Memory: O(1) per key — just a count and a window id.
  • Time: O(1) per request.
  • Fatal flaw — boundary bursting. A client can send limit requests in the last instant of one window and limit more in the first instant of the next, achieving 2× the limit across a window-width span. With 100/min, a burst of 200 requests can land inside a ~2-second span straddling the minute boundary.

Sliding Window Log

Store a timestamp for every request, per key. On each request, evict timestamps older than now - window, then check if the remaining count is below the limit.

The data structure matters for complexity, so name it. Because requests arrive in (roughly) monotonically increasing time order, the natural structure is a deque (double-ended queue) of timestamps: append the new timestamp at the back, and evict expired entries from the front.

  • Memory: O(N) per key where N = requests currently in the window — expensive under high throughput.
  • Insert: O(1) amortized (append to back of the deque).
  • Evict: O(k) where k = number of now-expired entries popped from the front; amortized O(1) per request since each timestamp is pushed and popped at most once.
  • Exact, no boundary problem, but the memory cost is usually a dealbreaker at scale.

If you instead store timestamps in a sorted set / balanced BST / skip list (e.g. when out-of-order arrival is possible, or in Redis via ZADD/ZREMRANGEBYSCORE), insertion becomes O(log N) and range-eviction O(log N + k). Be precise about which structure you mean — a plain sorted array gives O(N) insertion because of the shift. The deque is preferred for in-order arrivals precisely because it is O(1).

Sliding Window Counter

A pragmatic approximation that fixes fixed-window bursting cheaply. Keep counts for the current and previous fixed windows, then weight the previous window by how much of it still overlaps the sliding window:

estimated = curr_count + prev_count * (overlap_fraction_of_prev_window)
  • Memory: O(1) per key (two counters).
  • Time: O(1) per request.
  • Slightly approximate (assumes uniform distribution within the previous window) but smooths the boundary spike. This is what Cloudflare popularized for HTTP rate limiting.

Token Bucket

A bucket holds up to capacity tokens. Tokens refill at a steady rate r (tokens/sec). Each request consumes one token; if a token is available it is allowed, otherwise rejected. The bucket caps at capacity, so unused allowance accumulates up to a burst ceiling.

  • Memory: O(1) per key (token count + last-refill timestamp).
  • Time: O(1) per request.
  • Allows controlled bursts (up to capacity) while enforcing a long-run average of r. This is the most popular general-purpose choice — it is what AWS API Gateway, Stripe, and most cloud APIs use.

Leaky Bucket

Think of a queue (the bucket) drained at a fixed rate. Requests enter the queue; the queue is processed at a constant outflow; if the queue is full, new requests overflow and are dropped.

  • Memory: O(queue size) per key.
  • Time: O(1) per enqueue/dequeue.
  • Smooths output to a constant rate — good when a downstream system needs a steady, predictable load. The trade-off: it does not allow bursts (by design), and queued requests add latency.
Key idea

Token bucket vs leaky bucket is the question that separates candidates. Both enforce an average rate. Token bucket allows bursts (spare tokens accumulate) and rejects excess immediately — good for client-facing APIs. Leaky bucket shapes traffic into a constant outflow and can queue — good for protecting a fragile downstream. If asked “which one,” tie it to whether bursts are desirable.

AlgorithmMem/keyBurst behaviorBoundary-safeNotes
Fixed windowO(1)allows 2× at edgesNoSimplest, flawed
Sliding logO(N)exactYesAccurate, costly
Sliding counterO(1)smoothed~YesCheap approximation
Token bucketO(1)allows burst up to capacityYesDefault choice
Leaky bucketO(queue)no burst, constant outYesTraffic shaping

Token Bucket: The Lazy-Refill Trick

You do not need a background thread ticking tokens into every bucket — that would be O(number of keys) work per tick and a thread-management nightmare. Instead, refill lazily on each access by computing elapsed time:

now      = monotonic_time()
elapsed  = now - last_refill
tokens   = min(capacity, tokens + elapsed * refill_rate)
last_refill = now
if tokens >= 1:
    tokens -= 1
    return ALLOW
else:
    return DENY

Two correctness details that interviewers look for:

  • Use a monotonic clock, not wall-clock time. Wall-clock can jump backward (NTP correction, leap seconds), which would make elapsed negative and corrupt the token count. The per-language monotonic APIs are performance.now() (JS/TS), time.monotonic() (Python), and std::chrono::steady_clock (C++). (In Java you would reach for System.nanoTime(), but we show no Java tab here.)
  • Clamp at capacity. Without min(capacity, ...), an idle bucket would accumulate unlimited tokens and allow an unbounded burst.
   refill rate r (tokens/sec)
            |
            v
       +---------+
       |  ░░░░   |  <- tokens (<= capacity)
       |  ░░░░   |
       +---------+
            |
       request consumes 1 token
            |
        token? --yes--> ALLOW
            |
           no --------> DENY (429)

Thread Safety

The refill-and-consume sequence is a read-modify-write on shared state (tokens, lastRefill). Two threads hitting the same bucket concurrently can both read tokens == 1, both decrement, and both be allowed — a lost-update race that lets capacity + 1 requests through. The critical section must be atomic.

The simplest correct approach is one lock per bucket (not one global lock — that serializes all users). Below, each TokenBucket owns its own mutex, so different users never contend with each other.

Note on tri-language parity. Python and C++ run real OS threads, so they show genuine Lock/std::mutex critical sections. The TypeScript version runs on a single-threaded event loop: because allow is fully synchronous (no await), it runs to completion without interleaving, so it is atomic by construction — but it provides no actual lock, and is therefore not a thread-safe counterpart in the same sense. The moment you move TS to worker_threads, or make allow async, you must reintroduce real synchronization (Atomics over a SharedArrayBuffer, or an async mutex). This asymmetry is intentional and worth calling out aloud in an interview.

// TypeScript: single-threaded event loop. 'allow' is SYNCHRONOUS (no await),
// so it runs to completion without interleaving => atomic by construction.
// CAVEAT: this is NOT a real lock. If 'allow' ever became async, the
// read-modify-write below would need re-guarding (an async mutex). For true
// parallelism (worker_threads) you'd need Atomics + SharedArrayBuffer.
class TokenBucket {
private tokens: number;
private lastRefill: number; // ms, monotonic

constructor(
  private readonly capacity: number,
  private readonly refillPerSec: number,
) {
  this.tokens = capacity;
  this.lastRefill = TokenBucket.now();
}

private static now(): number {
  return performance.now(); // monotonic
}

// Synchronous => no interleaving point exists inside this method.
allow(cost = 1): boolean {
  const now = TokenBucket.now();
  const elapsedSec = (now - this.lastRefill) / 1000;
  this.tokens = Math.min(
    this.capacity,
    this.tokens + elapsedSec * this.refillPerSec,
  );
  this.lastRefill = now;
  if (this.tokens >= cost) {
    this.tokens -= cost;
    return true;
  }
  return false;
}
}

class RateLimiter {
private buckets = new Map<string, TokenBucket>();
constructor(
  private readonly capacity: number,
  private readonly refillPerSec: number,
) {
  if (refillPerSec <= 0) throw new Error("refillPerSec must be > 0");
}

allow(key: string, cost = 1): boolean {
  if (cost > this.capacity) {
    // Unsatisfiable by construction: reject immediately, don't let a
    // caller spin retrying forever.
    throw new Error("cost exceeds bucket capacity");
  }
  let b = this.buckets.get(key);
  if (!b) {
    b = new TokenBucket(this.capacity, this.refillPerSec);
    this.buckets.set(key, b);
  }
  return b.allow(cost);
}
}

Why two levels of locking

There are two distinct shared resources: the map of buckets, and each bucket’s internal state. Conflating them under one global lock makes every user serialize behind every other user — your limiter becomes the bottleneck.

  • The map lock is held only briefly to look up or insert a bucket.
  • The bucket lock protects the refill/consume arithmetic and is independent per key.

Critically, in the C++ version, notice that b->allow() is called after releasing mapMtx_. Holding the map lock while doing per-bucket work would reintroduce global serialization. In Python the _map_lock is similarly released before allow runs.

Common pitfall

A subtle bug: don’t hold the coarse map lock while invoking the bucket’s own locking method. That nests locks and can both serialize all traffic and, if lock ordering is ever inconsistent elsewhere, deadlock. Acquire the map lock, grab a reference/shared_ptr, release it, then operate on the bucket.

Tip

For very high throughput you can replace the per-bucket mutex with a lock-free CAS loop on a packed 64-bit word encoding tokens + timestamp, or a striped lock (N mutexes, hash(key) % N) to bound mutex memory. Mention this as an optimization — but in an interview, write the correct mutex-based version first. Premature lock-free code is a red flag.

Per-User Keying and Memory Management

Map<key, TokenBucket> grows unbounded as new keys appear — a memory leak and a DoS vector (an attacker rotates keys to allocate buckets). Two standard fixes:

  • Eviction: wrap the map in an LRU cache (e.g. a fixed-capacity LinkedHashMap / cachetools.LRUCache / a hand-rolled LRU). A bucket evicted while full simply re-initializes full on next access — acceptable. Evicting a throttled bucket resets it to full, which briefly relaxes the limit; bound this risk by sizing the cache generously.
  • TTL sweeping: periodically remove buckets idle longer than capacity / refillPerSec. That value is the time to refill from empty to full, so it is a conservative upper bound on how long any bucket needs to be untouched before a fresh full bucket is behaviorally indistinguishable. A partially-filled idle bucket actually reaches full sooner, so using this bound is always safe (you only ever drop buckets that are already, or are guaranteed to soon be, equivalent to fresh).

In an interview, proactively raise unbounded key growth. Most candidates forget it, and it is exactly the kind of edge case that distinguishes SSE-level answers.

Class diagram

+------------------+        owns 0..*      +------------------+
|   RateLimiter    |---------------------->|   TokenBucket    |
+------------------+                       +------------------+
| - buckets: Map   |                       | - tokens         |
| - capacity       |                       | - lastRefill     |
| - refillPerSec   |                       | - capacity       |
| - mapLock        |                       | - refillPerSec   |
+------------------+                       | - lock           |
| + allow(key,cost)|                       +------------------+
+------------------+                       | + allow(cost)    |
                                           +------------------+

Edge Cases to Name Out Loud

  • Clock going backward: monotonic clock eliminates this. State why you chose it.
  • cost > capacity: a request asking for 10 tokens from a 5-capacity bucket can never be satisfied. The allow() code itself does not loop — a single call simply returns false (it performs one comparison and returns). The real failure mode is external: a client that retries such a request will be denied on every call, forever, spinning uselessly. The fix is to validate cost <= capacity up front and reject immediately with a clear error (as the RateLimiter.allow guards above do), rather than returning a retryable 429.
  • Fractional tokens: use double for tokens so that a refill rate of 0.5/sec works; comparing tokens >= 1 still gates whole requests.
  • Cold start / first request: initialize the bucket full (tokens = capacity) so a brand-new user isn’t penalized.
  • Thundering herd at refill: with token bucket this is naturally smoothed; with fixed window, all clients reset simultaneously and stampede — another point for token bucket.
  • Variable cost: weight expensive endpoints by passing cost > 1 (a search costing 5 tokens, a ping costing 1). The implementations above already support this.

Going Distributed: Redis

The in-process limiter breaks across N application nodes: each node keeps its own bucket in its own heap, so a user routed across all N nodes effectively gets up to N × capacity — the global limit is silently multiplied by the fleet size. You need a single source of truth that every node consults atomically.

The naive instinct — GET tokens / refill in app code / SET tokens — is racy across nodes: two nodes can GET the same value, both decide there is a token, and both SET, double-spending exactly like the in-process lost-update race, except now no local mutex can help because the state lives in Redis. Wrapping it in WATCH/MULTI (optimistic CAS) works but retries under contention.

The correct, single-round-trip approach is an atomic Lua script. Redis executes a script as one indivisible unit (Redis is single-threaded for command execution), so the read-modify-write of the token bucket happens with no interleaving — the distributed analogue of our per-bucket mutex.

// token_bucket.lua — executed atomically by Redis.
// KEYS[1] = bucket key (a Redis hash with fields 'tokens', 'ts')
// ARGV: capacity, refillPerSec, now (ms, from a single trusted source),
//       cost, ttlSeconds
// Returns 1 (allowed) or 0 (denied).
export const TOKEN_BUCKET_LUA = `
local key        = KEYS[1]
local capacity   = tonumber(ARGV[1])
local refill     = tonumber(ARGV[2])   -- tokens per second
local now        = tonumber(ARGV[3])   -- ms
local cost       = tonumber(ARGV[4])
local ttl        = tonumber(ARGV[5])

local data    = redis.call('HMGET', key, 'tokens', 'ts')
local tokens  = tonumber(data[1])
local last    = tonumber(data[2])

if tokens == nil then          -- cold start: full bucket
tokens = capacity
last   = now
end

local elapsed = math.max(0, (now - last) / 1000.0)  -- clamp >= 0
tokens = math.min(capacity, tokens + elapsed * refill)

local allowed = 0
if tokens >= cost then
tokens = tokens - cost
allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', key, ttl * 1000)   -- self-cleaning idle keys
return allowed
`;

// Application side (ioredis). One round trip; no WATCH/retry loop.
import Redis from "ioredis";

class DistributedRateLimiter {
private sha: string | null = null;
constructor(
  private readonly redis: Redis,
  private readonly capacity: number,
  private readonly refillPerSec: number,
  private readonly failOpen = true, // availability vs protection
) {}

async allow(key: string, cost = 1): Promise<boolean> {
  if (cost > this.capacity) throw new Error("cost exceeds capacity");
  const now = Date.now(); // single clock source; see clock-skew note
  const ttl = Math.ceil(this.capacity / this.refillPerSec);
  try {
    const res = await this.redis.eval(
      TOKEN_BUCKET_LUA, 1, key,
      String(this.capacity), String(this.refillPerSec),
      String(now), String(cost), String(ttl),
    );
    return res === 1;
  } catch (e) {
    // Store outage policy. fail-open = allow (favor availability),
    // fail-closed = deny (favor protecting the backend).
    return this.failOpen;
  }
}
}

Why the Lua script is the whole point

  • Atomicity: Redis runs the script start-to-finish without serving other commands, so the HMGET → refill → conditional HSET is a single indivisible critical section across the entire fleet. No WATCH/MULTI retry loop, no app-side mutex, one network round trip.
  • Lazy refill survives: the same elapsed-time math works; we store tokens and ts in a hash and PEXPIRE the key so idle buckets evict themselves — solving the unbounded-key problem for free at the store layer.
  • Clock source: notice the Lua uses now passed in from the application, not redis.call('TIME'), in this sketch. If your app nodes have skewed clocks, two nodes can disagree on now, slightly over- or under-crediting tokens. Prefer a single trusted clock: either pass redis.call('TIME') from inside the script (one clock for all nodes — but note scripts that read TIME are flagged non-deterministic for old-style replication; use effects/script replication, the default in Redis ≥ 5) or accept bounded skew. Always clamp elapsed at ≥ 0 so a backward jump can never mint tokens.
  • Fail-open vs fail-closed: if Redis is unreachable, fail_open = true favors availability (let traffic through, risking the backend) and fail_open = false favors protection (deny everything, risking a self-inflicted outage). State the choice and tie it to what the limiter guards.
Watch out

A fixed-window variant in Redis is even simpler — INCR the key, and on the first increment set EXPIRE window. But INCR + EXPIRE as two commands has a race: if the process dies between them, the key never expires and the user is throttled forever. Do it in a MULTI/transaction or a tiny Lua script so the INCR and EXPIRE are atomic. The token-bucket Lua above already folds expiry into the atomic unit.

Key idea

The distributed payoff, in one line for the interviewer: “In-process I protect the read-modify-write with a per-bucket mutex; distributed, I move that exact critical section into Redis as an atomic Lua script so all nodes share one consistent bucket — anything less (GET-then-SET from the app) reintroduces the lost-update race across the network.”

Assessment

Pass mark: 70% on the concept questions unlocks the next lesson.

1. A team uses a fixed-window counter limiting clients to 100 requests/minute. A client wants the maximum sustained burst it can push through in the shortest span. Roughly how many requests can it get accepted, and over what span?

2. In the in-process RateLimiter, why is bucket.allow() invoked AFTER the map lock is released rather than while holding it?

3. Which statements about taking the distributed token-bucket limiter to Redis are correct? Select all that apply. (select all)

4. Using a deque of timestamps for the Sliding Window Log with in-order arrivals, what are the correct per-request insert and amortized eviction costs?

Design problem 5

Design the thread-safety strategy for an in-process, per-user token-bucket rate limiter that must serve millions of distinct keys at high concurrency, then explain precisely what changes when the service scales to a fleet of N nodes. Address: the critical section and how it is protected per key; how the map of buckets is managed (concurrency + unbounded growth); two named edge cases and how you handle them; and the distributed correctness mechanism with one failure-mode policy.