The improv-musician analogy
A jazz musician hears the last bar and plays the next note. Then the next, then the next. Each note is a prediction of what fits — but the chain of predictions becomes a solo nobody has ever heard before.
Generative AI works the same way. It does not "have an idea" up front. It predicts the next token, appends it, predicts again, and the chain becomes a paragraph, a function, a poem. Out of pure prediction emerges something that looks and feels like creation.
Generative vs discriminative
- Discriminative models answer "which class is this?" — spam vs not-spam, cat vs dog.
- Generative models answer "what comes next?" — and by repeating that question, produce open-ended output.
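LLMs, diffusion models for images, and audio models for speech are all generative. The output type changes, and so does the size of each step (a token, a denoising pass, an audio chunk); the idea of building the output through repeated prediction does not.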
How autoregressive text works
1. Tokenize the prompt.
2. Run the model to get a probability distribution over every possible next token (often 50k–200k options).
3. Sample one token from that distribution.
4. Append it. Go back to step 2 until you hit a stop token or max_tokens.
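The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real LLM: `TOY_MODEL`, `tokenize`, `next_token_distribution`, and `generate` are hypothetical names, the "model" is a hand-written lookup table that only looks at the last token, and the tokenizer just splits on whitespace.

```python
import random

# Hypothetical toy "model": a hand-written table of next-token probabilities.
# A real LLM computes this distribution with a neural network over the full context.
TOY_MODEL = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}

def tokenize(prompt):
    """Split on whitespace (real tokenizers use subword units)."""
    return ["<bos>"] + prompt.split()

def next_token_distribution(tokens):
    """Return next-token probabilities; this toy only conditions on the last token.
    Raises KeyError for words outside the toy vocabulary."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)           # seeded so runs are repeatable
    tokens = tokenize(prompt)
    for _ in range(max_tokens):         # hard cap on the number of generated tokens
        dist = next_token_distribution(tokens)
        # Sample one token in proportion to its probability.
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == "<eos>":            # stop token ends generation
            break
        tokens.append(token)            # append, then predict again
    return " ".join(tokens[1:])         # drop the <bos> marker

print(generate("the"))
```

Changing `seed` gives a different continuation from the same prompt, which is the whole point of sampling rather than always taking the most likely token.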
The whole "intelligence" of the output rides on two things: how good the distribution is (the model) and how you sample from it (decoding strategy).
The two knobs that change the vibe
| Setting | Low value | High value |
|---|---|---|
| Temperature | Near-greedy (fully greedy at 0), repetitive, "safe" | Creative, surprising, sometimes nonsense |
| Top-p (nucleus) | Only the most likely tokens | Long tail allowed, more variety |
Production tip: use temperature 0–0.3 for code, classification, and structured output, and 0.7–1.0 for creative prose. Adjust top-p before you crank temperature past 1.
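To make the two knobs concrete, here is a rough sketch of how they are commonly applied to a model's raw scores (logits). It assumes the usual definitions: temperature divides the logits before the softmax, and top-p keeps the smallest set of most-likely tokens whose cumulative probability reaches the threshold. The function names are hypothetical, and real inference libraries differ in details such as tie-breaking and the order in which filters are applied.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature. Temperature 0 is treated as greedy (argmax)."""
    if temperature <= 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return the index of one sampled token."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
print(apply_temperature(logits, 0.5))  # sharper: more mass on the top token
print(apply_temperature(logits, 2.0))  # flatter: more mass on the tail
```

Lower temperature concentrates probability on the top token, while lower top-p removes the tail outright, which is why trimming the tail is often the gentler first adjustment.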
Beyond text
- Diffusion — start with noise, iteratively denoise toward an image.
- Speech — generate audio tokens or waveform chunks autoregressively.
- Code — same as text, but the eval metric is "does it compile and pass tests".
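The "start with noise, iteratively denoise" idea can be sketched the same way. The example below is a deliberately simplified toy, not a real diffusion sampler: `toy_denoiser` is a hypothetical stand-in that already knows the clean signal (a trained network would have to predict it), and the blend schedule is an assumption chosen for clarity. Real samplers follow a learned noise schedule and usually re-inject noise at each step.

```python
import random

def toy_denoiser(x, target):
    """Hypothetical stand-in for a trained network that predicts the clean signal.
    Here it simply returns a known target instead of learning anything."""
    return list(target)

def denoise(target, steps=20, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted_clean = toy_denoiser(x, target)
        # Move a fraction of the way toward the prediction.
        # The fraction grows each step and reaches 1.0 on the last one.
        blend = 1.0 / (steps - step)
        x = [xi + blend * (ci - xi) for xi, ci in zip(x, predicted_clean)]
    return x

image = [0.0, 0.5, 1.0]                  # a three-"pixel" stand-in for an image
print(denoise(image))
```

The structure matches the text loop: begin with something uninformative, ask the model for a prediction, apply it, and repeat.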
The shape of the model differs. The "predict, sample, repeat" loop does not.