Back to the path
Level 5 advanced 27 min #lld#design-patterns#strategy#factory#concurrency#rate-limiting#solid#retry#thread-safety

Designing an Extensible Notification Service

Build a production-grade, thread-safe Notification Service from the OOP up: pluggable channels via Strategy + Factory, templating, layered user preferences, retry with exponential backoff + jitter, and a lazily-refilled concurrent token-bucket rate limiter that is O(1) per check. A capstone that fuses every SOLID principle and pattern from earlier levels into one defensible class diagram, with explicit reasoning about exactly which lock guards which invariant.

Why this problem is a capstone

A notification service is the single most common “design something extensible” machine-coding prompt above SDE-1, and it is beloved by interviewers for one reason: it forces you to combine patterns rather than recite one. You will touch Strategy (channels), Factory (channel resolution), Template Method / templating, a Chain-style preference resolution, Decorator-ish retry wrapping, and a genuinely concurrent token-bucket rate limiter. The interviewer is watching whether your abstractions stay clean as you bolt on retries and throttling — the place where most candidates’ designs collapse into a god-class.

This lesson builds the design incrementally, justifies every seam against SOLID, and gets the concurrency correct (not hand-wavy). By the end you can defend the class diagram, reason about complexity, and explain exactly which lock protects which invariant.

Requirements we will commit to

Functional: send a notification to a user over one or more channels (email, SMS, push); render content from a named template + variables; respect per-user channel preferences and opt-outs; retry transient failures; rate-limit per channel (and optionally per user). Extensible: adding a new channel (e.g. Slack, WhatsApp) must not modify existing code (OCP).

Non-functional: thread-safe under concurrent senders; bounded retries with backoff; no unbounded memory growth; rate limiter must be O(1) per check.

In an interview, write these down first and get a nod. Half the points are for scoping. Explicitly ask: “Is delivery synchronous, or do we enqueue and a worker pool sends? Do we need delivery-status callbacks?” — then state your assumption so the design has a fixed target. We assume synchronous send called concurrently from a thread pool, and that channel/template registration happens once at startup (single-threaded), before any send — that assumption is what lets the factory registry stay unsynchronized (see the Factory section).

The core abstractions

Let’s name the nouns and verbs. The central seam is the channel: a thing that knows how to deliver a rendered message. Everything else orchestrates around it.

                    +------------------------+
                    |  NotificationService   |   <-- orchestrator (facade)
                    +------------------------+
                    | - factory              |
                    | - prefResolver         |
                    | - templateEngine       |
                    | - rateLimiter          |
                    | - retryPolicy          |
                    +-----------+------------+
                                |
            +-------------------+--------------------+
            v                   v                    v
   +----------------+  +-----------------+  +------------------+
   | ChannelFactory |  | TemplateEngine  |  | PreferenceResolver|
   +----------------+  +-----------------+  +------------------+
          | creates
          v
   +======================+   <<interface>> Strategy
   |  NotificationChannel |
   +======================+
   | + type(): ChannelType|
   | + send(msg): Result  |
   +======================+
        ^      ^       ^
        |      |       |
  +---------+ +-------+ +--------+
  |  Email  | |  SMS  | |  Push  |   <-- concrete strategies
  +---------+ +-------+ +--------+

   RateLimiter (TokenBucket)  ---- gates ----> send()
   RetryPolicy                ---- wraps ----> send()

Why a strategy interface and not a switch? Because send() over email vs SMS shares nothing but the signature — different SDKs, payload shapes, failure modes. A switch(channelType) inside one method violates OCP (you edit it for every new channel) and SRP (one class knows every provider). The interface gives each channel its own class, its own dependencies, its own tests.

Key idea

The litmus test for the channel abstraction: adding WhatsApp should be a new file plus one factory registration, touching zero existing channel code. If your design fails that test, the Strategy seam is in the wrong place.

The channel strategy + the message value object

Note the tri-language parity in SendResult.error: it is an optional field in all three languages — error? in TS, Optional[str] = None in Python, and std::optional<std::string> in C++. An empty-string sentinel would conflate “no error” with “empty error message,” so we model absence explicitly.

enum ChannelType { Email = "EMAIL", Sms = "SMS", Push = "PUSH" }

interface RenderedMessage {
readonly to: string;          // resolved address for THIS channel
readonly subject?: string;
readonly body: string;
}

interface SendResult {
readonly ok: boolean;
readonly retryable: boolean;  // distinguishes transient vs permanent failure
readonly error?: string;
}

interface NotificationChannel {
type(): ChannelType;
send(msg: RenderedMessage): SendResult;   // assume sync for the interview
}

class EmailChannel implements NotificationChannel {
constructor(private readonly smtp: SmtpClient) {}
type() { return ChannelType.Email; }
send(msg: RenderedMessage): SendResult {
  try {
    this.smtp.send(msg.to, msg.subject ?? "", msg.body);
    return { ok: true, retryable: false };
  } catch (e: any) {
    // 5xx / timeout => retryable; 4xx (bad address) => not
    return { ok: false, retryable: e.transient === true, error: String(e) };
  }
}
}

The retryable flag on SendResult is the small detail that separates a senior answer from a junior one: a retry policy must not retry a 400 invalid phone number. Encoding transient-vs-permanent at the channel boundary keeps the retry layer dumb and correct.

Common pitfall

A C++ class/struct definition must be terminated with a semicolon (};). Forgetting it produces error: expected ';' after class definition and is one of the most common machine-coding slips under time pressure. Every class body in this lesson ends with }; — train your eye to expect it.

The Factory: resolving type → channel

The orchestrator should never new EmailChannel(...). It asks a factory. This keeps wiring (which SMTP host, which Twilio key) out of the orchestration logic — Dependency Inversion: the service depends on the NotificationChannel abstraction and a registry, not on concrete constructors.

class ChannelFactory {
private readonly registry = new Map<ChannelType, NotificationChannel>();

// Startup-only: called single-threaded during wiring, before any send().
register(ch: NotificationChannel): void {
  this.registry.set(ch.type(), ch);   // singletons here; could be providers
}
get(type: ChannelType): NotificationChannel {
  const ch = this.registry.get(type);
  if (!ch) throw new Error(`No channel registered for ${type}`);
  return ch;
}
supported(): ChannelType[] { return [...this.registry.keys()]; }
}

This is a registry-based factory, not a switch-based one. The registry version satisfies OCP at runtime: register a new channel, done.

Watch out

C++ keyword clash. The method is register(...) in TS and Python, but register is a reserved (deprecated since C++17) keyword in C++, so the C++ version is named register_channel(...). Same role, different name — flag this kind of cross-language asymmetry out loud in an interview so the reviewer knows it’s deliberate, not a bug.

Tip

Why the registry needs no lock. We stated up front that registration is a startup-time, single-threaded operation that happens-before any send. Under that ordering, concurrent sends only read the map, and concurrent reads of an unmodified Map/dict/unordered_map are safe. If your requirements allowed runtime registration concurrent with sends, this map would need a RWLock (or a copy-on-write swap) — say which world you’re in. The channel instances are stateless singletons holding an SDK client; sharing one across threads is fine as long as the underlying SDK client is thread-safe — call that out explicitly.

Templating: separating content from delivery

Templates are an SRP play. The channel knows how to deliver; the template engine knows what to say. Mixing them means every copy change forces a code deploy in the channel class.

interface Template { readonly subject?: string; readonly body: string; }

class TemplateEngine {
constructor(private readonly templates: Map<string, Template>) {}

render(name: string, vars: Record<string, string>): { subject?: string; body: string } {
  const tpl = this.templates.get(name);
  if (!tpl) throw new Error(`Unknown template ${name}`);
  const sub = (s: string) =>
    s.replace(/\{\{(\w+)\}\}/g, (_, k) => vars[k] ?? `{{${k}}}`);
  return {
    subject: tpl.subject ? sub(tpl.subject) : undefined,
    body: sub(tpl.body),
  };
}
}

In a real system templates live in a store/CMS and may be localized; for the interview, an in-memory map with {{var}} interpolation is enough — but say that you’d swap the Map for a TemplateStore interface, preserving DIP.

User preferences: layered resolution

Preferences answer: which channels does this user accept for this notification category? The senior move is to model layers — a marketing opt-out beats a category preference beats a default — and resolve them in order. This is conceptually a small Chain of Responsibility / precedence stack.

LayerExampleWins over
Marketing opt-out (compliance)“user unsubscribed from marketing”category pref + default, for marketing only
Per-category preference”transactional: email+push; marketing: none”defaults
Channel availability”no phone number on file” → drop SMScategory pref
System default”all requested channels”nothing
Key idea

Transactional messages bypass the opt-out. A marketing unsubscribe legally and practically must not suppress transactional mail (password resets, receipts, security alerts). That’s why the field is named marketingOptOut, not globalOptOut, and the gate is marketingOptOut && cat === Marketing. The earlier draft mislabeled this as a “global opt-out that wins over everything” — the table above and the field name are now reconciled with the code. In an interview, state this carve-out explicitly; it’s a frequent compliance gotcha.

enum Category { Transactional = "TXN", Marketing = "MKT" }

interface UserPrefs {
readonly userId: string;
readonly marketingOptOut: boolean;                 // suppresses MARKETING only
readonly perCategory: Map<Category, Set<ChannelType>>;
readonly addresses: Map<ChannelType, string>;      // missing => unavailable
}

class PreferenceResolver {
// Returns the channels we may actually use, in delivery order.
resolve(p: UserPrefs, cat: Category, requested: ChannelType[]): {channel: ChannelType, to: string}[] {
  if (p.marketingOptOut && cat === Category.Marketing) return []; // compliance gate (marketing only)
  const allowed = p.perCategory.get(cat) ?? new Set(requested);
  const out: {channel: ChannelType, to: string}[] = [];
  for (const ch of requested) {
    if (!allowed.has(ch)) continue;            // category preference
    const to = p.addresses.get(ch);
    if (!to) continue;                          // availability
    out.push({ channel: ch, to });
  }
  return out;
}
}

The concurrent token-bucket rate limiter

This is the part the interviewer actually probes. We rate-limit per (channel) bucket — e.g. “at most 100 SMS/sec.” A token bucket allows short bursts (up to capacity) while bounding the long-run rate.

The algorithm, and why lazy refill is O(1)

A naive design spawns a timer per bucket to add tokens every tick — that’s a thread per bucket and wasted wakeups. The senior design uses lazy refill: store tokens and lastRefill timestamp; on each tryAcquire, compute how many tokens would have accrued since lastRefill based on elapsed time, cap at capacity, then try to spend one. No background threads, O(1) time and O(1) memory per bucket, evaluated only when traffic arrives.

 tryAcquire(now):
   elapsed   = now - lastRefill
   refill    = elapsed * ratePerSecond
   tokens    = min(capacity, tokens + refill)   # lazy catch-up
   lastRefill= now
   if tokens >= 1:  tokens -= 1; return true     # admitted
   else:            return false                  # throttled
Key idea

Use a monotonic clock, never wall-clock. Wall-clock time can jump backwards (NTP correction, DST, manual set), which would make elapsed negative and corrupt tokens. Use performance.now() / time.monotonic() / std::chrono::steady_clock. Defensively clamp elapsed to >= 0.

Thread-safety: exactly which lock guards which invariant

The invariant: tokens and lastRefill must be updated atomically together. The bug-prone path is the classic check-then-act race: two threads both read tokens == 1, both decide “yes,” both decrement → tokens == -1 and two requests admitted past the limit.

We protect the read-refill-spend sequence with one mutex per bucket. The mutex guards exactly two fields: tokens and lastRefill. The critical section is tiny (a few arithmetic ops — no I/O, no send() inside the lock), so contention is negligible and we never hold the lock across a network call. Per-bucket locks (not one global lock) mean SMS throttling never blocks email throttling. The outer Map/dict/unordered_map of buckets is, again, built at startup and only read at send time, so it needs no lock.

// Node is single-threaded for JS, so a TokenBucket needs no mutex there.
// We model the cross-language locking discipline explicitly with a guard,
// because in worker_threads / other runtimes the invariant still holds.
class TokenBucket {
private tokens: number;
private lastRefill: number;              // monotonic ms
constructor(
  private readonly capacity: number,
  private readonly ratePerSec: number,
  now: number = performance.now(),
) {
  this.tokens = capacity;
  this.lastRefill = now;
}
// Returns true if one token was spent. O(1).
tryAcquire(now: number = performance.now()): boolean {
  // --- critical section: guards {tokens, lastRefill} together ---
  const elapsed = Math.max(0, now - this.lastRefill) / 1000;
  this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.ratePerSec);
  this.lastRefill = now;
  if (this.tokens >= 1) { this.tokens -= 1; return true; }
  return false;
  // --- end critical section ---
}
}

class RateLimiter {
private readonly buckets = new Map<ChannelType, TokenBucket>();
constructor(config: { type: ChannelType; capacity: number; rate: number }[]) {
  for (const c of config) this.buckets.set(c.type, new TokenBucket(c.capacity, c.rate));
}
allow(type: ChannelType): boolean {
  const b = this.buckets.get(type);
  return b ? b.tryAcquire() : true;   // no bucket => unlimited
}
}
Common pitfall

Do not call the channel’s send() while holding the bucket mutex. The lock guards only the arithmetic on tokens/lastRefill; holding it across a multi-hundred-millisecond network call would serialize all senders behind one I/O and destroy throughput. Acquire token → release lock → send. This “short critical section, no I/O under lock” rule is the single most repeated concurrency point an interviewer wants to hear.

Retry with exponential backoff + jitter

The retry layer wraps a single channel send and re-attempts only when result.retryable is true, with a bounded number of attempts and a growing, randomized delay.

Why jitter? If a downstream provider blips and 10,000 senders all back off 1s, 2s, 4s in lockstep, they retry in synchronized waves — a thundering herd that re-overloads the provider. Full jitter (delay = random(0, base * 2^attempt)) spreads the retries uniformly. Always cap the delay (maxDelay) and the attempt count so retries are bounded (a stated non-functional requirement).

interface RetryConfig {
maxAttempts: number;     // total tries, including the first
baseDelayMs: number;
maxDelayMs: number;
}

class RetryPolicy {
constructor(private readonly cfg: RetryConfig,
            private readonly sleep: (ms: number) => void) {}

run(send: () => SendResult): SendResult {
  let last: SendResult = { ok: false, retryable: false };
  for (let attempt = 0; attempt < this.cfg.maxAttempts; attempt++) {
    last = send();
    if (last.ok || !last.retryable) return last;     // success or permanent
    if (attempt === this.cfg.maxAttempts - 1) break; // no sleep after last
    const exp = this.cfg.baseDelayMs * 2 ** attempt;
    const capped = Math.min(this.cfg.maxDelayMs, exp);
    const delay = Math.random() * capped;            // full jitter
    this.sleep(delay);
  }
  return last;   // exhausted: caller logs / dead-letters
}
}
Tip

Note std::mt19937 is not thread-safe; here each call to run may come from a different thread, so in production make the RNG thread_local or lock it. We keep one instance for brevity but call out the caveat — exactly the kind of honesty interviewers reward.

Putting it together: the orchestrator

The NotificationService is a thin facade that composes the pieces. Crucially it contains no business rules of its own — preferences, throttling, retrying, and delivery each live in their own object. That’s what keeps it from becoming a god-class.

interface NotifyRequest {
prefs: UserPrefs;
category: Category;
channels: ChannelType[];
templateName: string;
vars: Record<string, string>;
}

class NotificationService {
constructor(
  private readonly factory: ChannelFactory,
  private readonly prefs: PreferenceResolver,
  private readonly templates: TemplateEngine,
  private readonly limiter: RateLimiter,
  private readonly retry: RetryPolicy,
) {}

notify(req: NotifyRequest): Map<ChannelType, SendResult> {
  const results = new Map<ChannelType, SendResult>();
  const targets = this.prefs.resolve(req.prefs, req.category, req.channels);
  const { subject, body } = this.templates.render(req.templateName, req.vars);

  for (const { channel, to } of targets) {
    if (!this.limiter.allow(channel)) {
      results.set(channel, { ok: false, retryable: true, error: "rate_limited" });
      continue;                          // shed load; caller may re-enqueue
    }
    const ch = this.factory.get(channel);
    const msg: RenderedMessage = { to, subject, body };
    results.set(channel, this.retry.run(() => ch.send(msg)));
  }
  return results;
}
}

Notice the capture-by-reference lambda in the C++ retry call is safe only because run is synchronous and the lambda does not outlive the loop iteration. If send were async/deferred, you’d capture by value or with shared ownership — another good thing to say aloud.

Complexity summary

OperationTimeSpaceNotes
RateLimiter.allowO(1)O(1)/bucketlazy refill, no background thread
PreferenceResolver.resolveO(R)O(R)R = requested channels (tiny)
TemplateEngine.renderO(L)O(L)L = template length
RetryPolicy.runO(A) triesO(1)A = maxAttempts, bounded
notifyO(R·A)O(R)R channels, each up to A tries

Interview wrap-up checklist

  • Lead with requirements + the assumption (sync send, startup-time registration).
  • Justify the Strategy seam with the WhatsApp litmus test.
  • Use a registry factory, not a switch — OCP at runtime.
  • Model preferences as layers, and call out the transactional carve-out for marketing opt-out.
  • For the rate limiter: token bucket, lazy refill, monotonic clock, per-bucket mutex guarding only {tokens, lastRefill}, no I/O under lock, O(1).
  • For retries: only when retryable, bounded attempts, exponential backoff with full jitter, capped delay.
  • Keep the orchestrator a facade with no business rules.

Assessment

Pass mark: 70% on the concept questions unlocks the next lesson.

1. Why does the token-bucket rate limiter use *lazy* refill (computing accrued tokens on each `tryAcquire`) instead of a background timer thread that adds tokens every tick?

2. A marketing-opted-out user (`marketingOptOut = true`) triggers a password-reset (Transactional) notification. What does `PreferenceResolver.resolve` return, and why is that correct?

3. Which of the following are TRUE about the mutex in `TokenBucket.tryAcquire`? (select all that apply) (select all)

4. Why does the retry policy use *full jitter* (`delay = random(0, base * 2^attempt)`) rather than plain exponential backoff (`base * 2^attempt`)?

Design problem 5

Extend the design to support a NEW channel, WhatsApp, that has a provider-side constraint: it must respect BOTH the existing per-channel rate limit AND a stricter per-user limit (e.g. at most 1 WhatsApp message per user per 60s). Describe the classes/changes you'd add, where the per-user limit lives, how you keep it thread-safe and memory-bounded, and which existing code must change. Note any concurrency hazards.