The interview bar has changed
In 2024 a system design interview meant a URL shortener, a news feed or a payment gateway. In 2026 the same loop almost always includes an AI question: "Design ChatGPT for our internal docs", "Design a real-time voice agent" or "Design a multi-tenant LLM gateway". Hiring panels now expect you to reason about tokens per second, vector recall, hallucination rate and the cost of a single conversation, in addition to QPS, p99 latency and storage.
Why AI systems are not just web systems
Classic web architecture optimises for stateless requests over predictable data. LLM systems instead deal with:
- Non-determinism. The same input can produce a different output. Your tests, observability and SLOs all have to account for it.
- Token economics. Compute is metered per token, not per request. A 10× longer prompt costs roughly 10× more on the input side, and output tokens are billed on top of that.
- Long-running streams. Generation can take seconds per response, so you stream tokens, manage backpressure and survive disconnects.
- Memory and retrieval. Context windows are bounded, so almost every production system is partly an information-retrieval system.
- Safety and compliance. PII, hallucinations, jailbreaks and copyright are first-class design constraints, not afterthoughts.
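To make the token-economics point concrete, here is a minimal sketch of how the cost of a single conversation might be estimated. The `Pricing` class, the function names and the prices are all hypothetical, chosen for illustration rather than taken from any provider. The sketch also assumes that every turn resends the full history as its prompt, which is common for stateless chat APIs but not universal (prompt caching or summarisation would change the numbers).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per 1,000 tokens, not any real provider's rates.
    input_per_1k: float
    output_per_1k: float


def request_cost(pricing: Pricing, prompt_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output tokens are metered separately."""
    return (
        prompt_tokens * pricing.input_per_1k
        + output_tokens * pricing.output_per_1k
    ) / 1000


def conversation_cost(
    pricing: Pricing,
    system_tokens: int,
    turns: list[tuple[int, int]],
) -> float:
    """Cost of a multi-turn chat.

    `turns` is a list of (user_tokens, assistant_tokens) pairs.
    Assumption: each turn resends the whole history as its prompt.
    """
    history = system_tokens
    total = 0.0
    for user_tokens, assistant_tokens in turns:
        prompt = history + user_tokens
        total += request_cost(pricing, prompt, assistant_tokens)
        # The next turn's prompt includes this turn's question and answer.
        history = prompt + assistant_tokens
    return total


if __name__ == "__main__":
    pricing = Pricing(input_per_1k=1.0, output_per_1k=2.0)  # made-up numbers
    short = request_cost(pricing, prompt_tokens=1_000, output_tokens=500)
    long = request_cost(pricing, prompt_tokens=10_000, output_tokens=500)
    print(f"short prompt: ${short:.2f}, 10x prompt: ${long:.2f}")
    chat = conversation_cost(pricing, system_tokens=100, turns=[(50, 50)] * 10)
    print(f"10-turn chat: ${chat:.2f}")
```

Two things tend to stand out when you run numbers like these. The input side of the bill scales linearly with prompt length, so the total grows by less than 10× only because output tokens stay fixed. And because history is resent on every turn, the cost of a conversation grows faster than the number of turns.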
What you will get out of this masterclass
By the time you finish you will be able to (a) own the whiteboard during an AI system design interview, (b) ship a production RAG, agent or multimodal service that does not embarrass your team in week two, and (c) defend cost and reliability decisions with numbers, not vibes. The next two lessons set up the language we will use to do that consistently.