Understanding LLMs: The Map
From raw text to streaming response - every step of how large language models are built and how they process your prompts.
In the previous post, I explored how programming languages work - the compiler pipeline that turns source code into something a machine can execute. Seven stages, from lexer to code generation. That pipeline has been refined over decades and is well understood.
LLMs have a pipeline too. A different one. When I started pulling the thread - what happens between the moment you type a prompt and the moment text streams back? - I found something far more layered than I expected. Two distinct pipelines, fifteen steps, and a surprising amount of engineering that has nothing to do with neural networks.
This post is the map - the complete pipeline, from training run to streaming response.
Two Pipelines, One Model
An LLM has two lives. First it’s built (training), then it’s used (inference). These happen at different times, in different places, using different algorithms. But they share the same model - the weights learned during training are the weights used during inference.
When you type a prompt into Claude Code, you’re seeing inference. But everything the model knows - every pattern, every capability, every failure mode - was determined during training, weeks or months earlier.
Let’s walk through both.
Part I: How an LLM Gets Built
1. Data Collection and Curation
Everything starts with text. Massive amounts of it.
The typical corpus begins with web crawls - Common Crawl alone contains petabytes of raw HTML. But raw web data is noisy. The curation pipeline is where the real work happens: language identification filters to target languages, deduplication removes near-copies (critical - duplicate data degrades model quality measurably), quality classifiers score documents against a “Wikipedia-like” standard, and content filters remove toxic material, PII, and malware.
Then comes domain balancing - the ratio of web text to books to code to academic papers matters enormously. Too much code and the model talks like a compiler. Too little and it can’t write a function.
The output is a cleaned, deduplicated, balanced corpus - typically trillions of tokens. This process takes months.
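The deduplication pass is worth making concrete. Here's a minimal sketch of the exact-duplicate baseline - a hash set over document digests. Production pipelines go much further (MinHash/LSH for near-duplicates), but the shape of the filter is the same: stream documents through, keep only what you haven't seen.

```python
import hashlib

def dedup_exact(documents):
    """Drop byte-identical duplicates by content hash.

    Real curation pipelines add near-duplicate detection (MinHash/LSH)
    on top of this; here we show only the exact-match baseline.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["the cat sat", "the cat sat", "the dog ran"]
print(dedup_exact(docs))  # ['the cat sat', 'the dog ran']
```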
2. Tokenizer Training
Before the model can read a single word, text must become numbers. A tokenizer is trained on a representative sample of the corpus - not all of it, just enough for the frequency statistics to converge - using Byte Pair Encoding (BPE), a compression algorithm that discovers which character sequences appear frequently enough to deserve their own token.
One key decision happens before the tokenizer can be trained: vocabulary size. This is an architecture choice, not a linguistic one. LLaMA 1 chose 32K tokens. LLaMA 3 jumped to 128K. BLOOM chose 250,880 - a number divisible by 128 (GPU memory alignment) and by 4 (tensor parallelism). The vocabulary size is set, then BPE runs until it fills that many slots.
The result is a merge table - an ordered list of roughly 100,000 rules that define how text gets split. “the” becomes one token. “tokenization” becomes two (“token” + “ization”). This merge table is frozen and never changes again. Change it, and you retrain the entire model.
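A toy version of BPE training makes the merge table concrete. This is a character-level sketch on a five-word corpus - real tokenizers start from bytes and run hundreds of thousands of merges - but the loop is identical: count adjacent pairs, merge the most frequent pair everywhere, repeat.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
    Returns the ordered merge table - the list production tokenizers freeze."""
    words = [list(w) for w in corpus.split()]  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = []
        for w in words:                        # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_bpe("low lower lowest low low", 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```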
This isn’t theoretical. When Anthropic shipped Claude 4.7, they included a new tokenizer - a different merge table that produces smaller, more granular tokens. The same text now becomes 1.3-1.47x more tokens. That meant retraining the entire model from scratch - new merge table, new embedding layer, new weights. They accepted that cost because finer tokens gave the model more literal instruction following and fewer tool-call errors.
Our running example through the tokenizer:
"The bank by the river had no money"
→ [791, 7085, 553, 279, 15140, 1047, 912, 3300]
“bank” is now 7085. Just a number. The tokenizer has no idea it means two different things.
3. Pre-training
This is the main event - the most expensive step, often running for months on thousands of GPUs.
Two terms worth defining first. Tokens are the sub-word chunks a model thinks in - pieces like “cat”, “ sat”, “ believ”, “ able” - built from a sample of training text by repeatedly merging the most common adjacent pairs until you have a fixed set: roughly 50,000 for GPT-2-era models, 100,000 or more for newer ones. That set is the vocabulary. The corpus is the trillions of tokens of actual text the model reads during training. The vocabulary is the lens; the corpus is what you look at through it.
The training loop is deceptively simple in concept. Take a sequence of tokens. For each position, predict the next token. Compare the prediction to reality. Adjust the weights to make the prediction slightly better. Repeat billions of times.
But a token is just an integer - 7085 for “bank.” A neural network can’t do math on raw integers. So the model starts with an embedding table: one row per vocabulary token, each row a vector - a list of several thousand numbers. At initialization, these are random. No meaning, no intelligence - just noise. When token 7085 enters the model, the embedding step is just a table lookup: give me row 7085. That’s it.
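That lookup really is the whole step. A sketch - with a deliberately small embedding width, since real models use thousands of dimensions per row:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 100_000, 64      # dim is thousands in real models; tiny here
embedding_table = rng.normal(0.0, 0.02, size=(vocab_size, dim))

token_id = 7085                    # "bank" in our running example
vector = embedding_table[token_id] # the entire embedding step: one row lookup
print(vector.shape)                # (64,)
```

At initialization these rows are pure noise; training shapes them, because the table's weights get gradients like every other layer.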
These random vectors are what the model’s architecture receives. That architecture is a transformer - a stack of layers (typically 32-128, depending on model size), each containing two sub-components:
- Multi-head attention: lets each token look at every previous token and decide which ones are relevant. This is where “bank” would learn to attend to “river.”
- Feed-forward network: processes each token independently through a non-linear transformation. This is where most of the model’s knowledge gets stored.
Each token’s vector passes through all layers and exits as a prediction of what comes next. The difference between prediction and reality is the loss - and backpropagation adjusts every weight in every layer, including the embedding table itself, to reduce that loss. After billions of updates, the random numbers have been shaped into something meaningful. Tokens that behave similarly in text - “cat” and “kitten”, “big” and “large” - end up with similar vectors. Not because anyone designed it that way, but because similar vectors made the model better at predicting the next token.
The only training signal is next-token prediction. The model learns grammar, facts, reasoning, code, translation, and humor - all from predicting what word comes next.
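The training signal itself fits in a few lines. Before any training, the model has no preference among tokens, so the loss starts at log(vocab size) - a useful sanity check when watching a training run:

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy at one position: -log p(target token | context)."""
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_id])

vocab_size = 8                                     # toy vocabulary
logits = np.zeros(vocab_size)                      # untrained: no preference
loss = next_token_loss(logits, target_id=3)
print(round(float(loss), 3))                       # 2.079, i.e. log(8)
```

Every weight update nudges this number down; summed over trillions of positions, that is the entire pre-training objective.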
4. Post-Training and Alignment
The pre-trained model is a powerful text completer, but not a useful assistant. It will happily continue any text you give it, including toxic content, hallucinations, and rambling. Post-training transforms it into something helpful.
This happens in stages:
Supervised Fine-Tuning (SFT): The model trains on thousands of carefully written (instruction, response) pairs - examples of ideal assistant behavior. This teaches it the format of being helpful.
Preference Optimization: Humans (or AI systems) rank multiple model responses to the same prompt. The model then learns to prefer the higher-ranked responses. Two approaches dominate:
- RLHF (Reinforcement Learning from Human Feedback) - trains a separate reward model, then optimizes the LLM against it using reinforcement learning.
- DPO (Direct Preference Optimization) - skips the reward model entirely and trains directly on preference pairs. Simpler, increasingly popular.
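DPO's loss is compact enough to write out directly. A sketch for a single preference pair - the β value and log-probabilities here are illustrative numbers, not from any real training run:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO objective for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).

    Pushes the policy to rank the chosen response (w) above the
    rejected one (l) relative to a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen response more than the reference does,
# so the loss is below log(2) (the value at zero margin):
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0))
```

No reward model, no RL rollout - just a classification-style loss over preference pairs, which is why it's simpler to run than RLHF.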
Constitutional AI (Anthropic’s approach): Instead of relying solely on human feedback, the model critiques and revises its own responses according to a set of principles. This scales better than human annotation.
Reasoning Training (newer): For models like OpenAI’s o1 or DeepSeek-R1, a separate RL phase trains the model to produce explicit chains of reasoning before answering. This is a distinct step beyond general alignment.
5. Evaluation and Testing
Before deployment, the model runs through automated benchmarks (MMLU for knowledge, HumanEval for code, GSM8K for math), human evaluation (blind side-by-side comparisons), and safety testing (red-teaming, bias audits, capability evaluations for dangerous knowledge).
6. Deployment Preparation
The trained model weights are optimized for serving: quantization reduces numerical precision (32-bit to 8-bit or even 4-bit) to cut memory and increase speed with minimal quality loss. Serving infrastructure (vLLM, TensorRT-LLM) is configured for efficient multi-user serving.
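Quantization at its simplest is a scale factor plus rounding. A sketch of symmetric per-tensor int8 quantization - production schemes (per-channel scales, GPTQ, AWQ) are more sophisticated, but the memory math is the same 4x:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8: store int8 values plus one float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.nbytes, w.nbytes)  # 1000 4000 - a quarter of the memory
```

The maximum round-trip error is about half the scale, which is why quality loss stays small as long as the weight distribution has no extreme outliers.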
Part II: What Happens When You Send a Prompt
Here’s every step when you hit Enter in Claude Code. Most of this is invisible to you.
YOUR MACHINE (Claude Code)
│ You type: "The bank by the river had no money"
│ Claude Code sends raw text over HTTPS
│
└──→ ANTHROPIC'S SERVERS ───────────────────────────────────
7. API Gateway
Your request hits infrastructure first. Authentication, rate limiting, quota checks, request validation. If your API key is invalid or your rate limit is exceeded, you get rejected here - no GPU touched.
8. Prompt Assembly
The raw text you typed is just one piece. The server assembles the full prompt:
- System prompt: safety instructions, behavioral guidelines, current date
- Tool definitions: schemas for any tools the model can use (in Claude Code, this includes file reading, editing, bash execution, etc.)
- Conversation history: all previous turns in the conversation
- Your message: what you just typed
- Special tokens: delimiters that tell the model where each role’s message begins and ends
A typical Claude Code prompt is thousands of tokens before your message even appears.
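The assembly step can be sketched as string concatenation with role delimiters. The delimiter strings and field names below are placeholders - Claude's actual special tokens are not public - but the structure is the point:

```python
def assemble_prompt(system, tools, history, user_message):
    """Wrap each part in role delimiters before tokenization.

    The <|...|> markers here are hypothetical stand-ins for the
    model's real special tokens."""
    parts = [f"<|system|>{system}<|end|>"]
    for tool in tools:
        parts.append(f"<|tool|>{tool}<|end|>")
    for role, text in history:
        parts.append(f"<|{role}|>{text}<|end|>")
    parts.append(f"<|user|>{user_message}<|end|>")
    parts.append("<|assistant|>")  # cue the model to start responding
    return "".join(parts)

prompt = assemble_prompt(
    system="You are a helpful assistant.",
    tools=["read_file(path)"],
    history=[],
    user_message="The bank by the river had no money",
)
print(prompt)
```

Note the trailing assistant delimiter: the model doesn't "know" it should answer - the prompt literally ends where the assistant's turn begins, and next-token prediction does the rest.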
9. Tokenization
The assembled prompt is split into token IDs using the same merge table that was built during training. First, a regex pre-tokenizes the text into chunks (splitting contractions, separating numbers, isolating punctuation). Then BPE merge rules replay in priority order, turning each chunk into token IDs.
This runs on the CPU. It’s fast - millions of tokens per second.
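Applying a frozen merge table is a replay loop: at each step, perform whichever available merge has the highest priority (lowest rank). A sketch with a three-rule toy table - a real one has around 100K entries:

```python
def bpe_encode(chunk, merges):
    """Replay merge rules in priority order on one pre-tokenized chunk.

    `merges` maps a pair to its priority: lower = learned earlier =
    applied first."""
    tokens = list(chunk)
    while len(tokens) > 1:
        # Find the adjacent pair with the best (lowest) merge priority.
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break                                 # no mergeable pair left
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

merges = {("t", "h"): 0, ("th", "e"): 1, ("e", "r"): 2}
print(bpe_encode("the", merges))   # ['the']
print(bpe_encode("ther", merges))  # ['the', 'r']
```

The final step (not shown) maps each resulting string to its integer ID via a vocabulary table - that's where "bank" becomes 7085.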
10. Prefill (Processing the Prompt)
The token IDs enter the GPU. The entire prompt is processed in a single parallel forward pass - this is what GPUs excel at. Each token passes through:
- Embedding lookup: token ID 7085 becomes a vector of 4,096-12,288 dimensions
- Positional encoding (RoPE): rotation applied to encode where each token sits in the sequence
- Transformer layers (32-128 of them, each containing):
- Multi-head attention: each token attends to all previous tokens
- Feed-forward network: each token processed independently
- Residual connections and normalization
- KV cache populated: key and value vectors for every token, at every layer, are stored in GPU memory for later reuse
This phase is compute-bound - the bottleneck is raw computation speed. A long prompt means a slow Time to First Token (TTFT).
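The whole prefill attention pass, for a single head, fits in a short function - note that both the causal mask and the K, V outputs (the seed of the KV cache) appear here. Toy dimensions, one head, no RoPE:

```python
import numpy as np

def prefill_attention(x, Wq, Wk, Wv):
    """Single-head causal attention over the whole prompt in one pass.
    Returns outputs plus the K, V tensors that seed the KV cache."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, K, V

rng = np.random.default_rng(0)
seq, dim = 8, 16                        # 8 prompt tokens, toy dimension
x = rng.normal(size=(seq, dim))         # embedded prompt tokens
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
out, K_cache, V_cache = prefill_attention(x, Wq, Wk, Wv)
print(out.shape, K_cache.shape)         # (8, 16) (8, 16)
```

All eight positions are processed in one matrix multiply - that parallelism is exactly what makes prefill compute-bound and GPUs the right tool for it.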
11. Sampling the First Token
The final layer outputs a vector for the last token position. This vector is projected to vocabulary size (~100K dimensions) to produce logits - a raw score for every possible next token.
These logits are then shaped by:
- Temperature: controls randomness (lower = more deterministic)
- Top-k: keeps only the k most likely tokens
- Top-p (nucleus sampling): keeps the smallest set of tokens whose cumulative probability exceeds p
One token is sampled from the resulting distribution.
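The three filters compose in exactly that order. A numpy sketch - the parameter defaults here are illustrative, not any provider's actual settings:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """Shape logits with temperature, then top-k, then top-p, then sample."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most likely first
    sorted_probs = probs[order]
    keep = min(top_k, len(sorted_probs))          # top-k cutoff
    cumulative = np.cumsum(sorted_probs[:keep])
    # Top-p: smallest prefix whose cumulative mass reaches p.
    keep = min(keep, int(np.searchsorted(cumulative, top_p)) + 1)
    shortlist = sorted_probs[:keep] / sorted_probs[:keep].sum()
    return int(order[rng.choice(keep, p=shortlist)])

token = sample_token([2.0, 1.0, 0.1, -1.0], temperature=0.7,
                     top_k=3, top_p=0.95, rng=np.random.default_rng(0))
print(token)  # 0, 1, or 2 - never 3, which the filters removed
```

Set temperature near zero and the sampler collapses to greedy decoding; raise it and the tail of the distribution comes back into play.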
12. Decode Loop (Token by Token)
Now the autoregressive loop begins. Each new token is generated one at a time:
- Feed the new token through all transformer layers
- But only compute Q, K, V for this one token (not the whole sequence)
- Attend over the cached K, V from all previous tokens (this is why the KV cache exists)
- Append this token’s K, V to the cache
- Compute logits, sample next token
- Convert token ID to text, stream to client
- Check stop conditions (EOS token, stop sequence, max tokens)
- Repeat
This phase is memory-bandwidth-bound - each token requires reading the entire model’s weights from GPU memory, but does very little computation per parameter. This is why generation speed (tokens per second) is roughly constant regardless of prompt length.
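The asymmetry with prefill shows up clearly in code: one token's worth of Q, K, V, then attention over the entire cache. A single-head sketch with toy dimensions, assuming a cache left over from a prefill pass:

```python
import numpy as np

def decode_step(x_new, K_cache, V_cache, Wq, Wk, Wv):
    """One decode step: compute Q, K, V for the single new token only,
    attend over the cached K, V, and append to the cache."""
    q = x_new @ Wq                        # (dim,) - one token, not a matrix
    k = x_new @ Wk
    v = x_new @ Wv
    K = np.vstack([K_cache, k])           # grow the cache by one row
    V = np.vstack([V_cache, v])
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V, K, V

rng = np.random.default_rng(0)
dim = 16
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
K_cache = rng.normal(size=(8, dim))       # pretend prefill cached 8 tokens
V_cache = rng.normal(size=(8, dim))
out, K_cache, V_cache = decode_step(rng.normal(size=dim),
                                    K_cache, V_cache, Wq, Wk, Wv)
print(out.shape, K_cache.shape)           # (16,) (9, 16)
```

Without the cache, every decode step would recompute K and V for the whole sequence - the cache trades GPU memory for that redundant compute.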
13. Tool Use (When Applicable)
When the model decides to call a tool - say, reading a file - the pipeline interrupts:
- Model emits a structured tool call and stops generating
- Response streams back to Claude Code with stop reason “tool_use”
- Claude Code executes the tool locally (reads the file, runs the command, etc.)
- Tool result is sent back as a new message
- The server assembles a new prompt (original + tool call + tool result)
- Prefill + decode starts again from step 10
Each tool call is a full round-trip. Prompt caching avoids recomputing the KV cache for the unchanged prefix.
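From the client's side, that round-trip structure is a loop. The helper names (`call_model`, `run_tool`) and message shapes below are hypothetical - this is a sketch of the control flow, not the real Anthropic SDK:

```python
def agent_loop(messages, call_model, run_tool, max_rounds=10):
    """Keep calling the model until it answers without requesting a tool.

    `call_model` and `run_tool` are hypothetical stand-ins for an API
    client and a local tool executor. Each iteration is a full server
    round-trip: new prompt assembly, prefill, decode."""
    for _ in range(max_rounds):
        response = call_model(messages)
        if response["stop_reason"] != "tool_use":
            return response["text"]                   # final answer
        # Record the tool call, run it locally, feed the result back.
        messages.append({"role": "assistant", "content": response["tool_call"]})
        result = run_tool(response["tool_call"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("too many tool rounds")
```

The loop also makes the cost model obvious: an agent that makes ten tool calls pays for ten prefills over an ever-growing prompt, which is exactly the cost prompt caching attacks.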
14. Extended Thinking
When Claude uses extended thinking, it generates reasoning tokens before the visible response. This is the same autoregressive loop - there’s no separate “thinking module.” The model simply generates into a thinking region (with its own token budget), and those tokens become context that influences the final answer through attention.
15. Response Complete
The final token is generated. Output moderation runs a safety check on the response. Usage is counted (input tokens, output tokens, cached vs. uncached) for billing. The completion event streams back to your terminal.
The Running Example, End to End
From the moment you type “The bank by the river had no money” to the moment you see a response:
| Step | What Happens | Where |
|---|---|---|
| Prompt assembly | Your text joins system prompt, tools, history | Server CPU |
| Pre-tokenize | Regex splits into: [“The”, “ bank”, “ by”, “ the”, “ river”, “ had”, “ no”, “ money”] | Server CPU |
| BPE encode | Merge rules produce: [791, 7085, 553, 279, 15140, 1047, 912, 3300] | Server CPU |
| Embedding | 7085 becomes a vector: [0.23, -0.41, 0.87, …] (thousands of dimensions) | GPU |
| Attention | “bank” attends to “river” (high weight) and “money” (lower weight) - the model resolves that this is a riverbank, not a financial institution | GPU |
| Generation | Model predicts the most likely next token, one at a time | GPU |
| Decode + stream | Token IDs convert back to text, stream to your terminal | Server CPU |
“bank” started as the string b-a-n-k. It became the integer 7085. That integer became a point in high-dimensional space. Attention shifted that point toward “riverbank” by connecting it to “river.” And the model generated its response understanding the joke.
Every step in this pipeline exists because the previous step wasn’t enough. Tokenization alone doesn’t capture meaning - you need embeddings. Embeddings alone don’t capture context - you need attention. Attention alone doesn’t generate text - you need the autoregressive loop.
What’s Next
This post is the map. The next posts are the territory.
Each step above has enough depth for its own deep-dive. The series will explore them one by one, using “The bank by the river had no money” as the running example throughout. Tokenization, embeddings, attention, generation, training, alignment - each gets its own post.
The tokenizer doesn’t understand language. The embedding layer doesn’t understand context. The attention mechanism doesn’t generate text. But stacked together, they produce something that feels like understanding.
That’s the complete picture. Now let’s go deeper.
References
- Vaswani et al., “Attention Is All You Need” - The transformer architecture (2017)
- Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” - BPE for NLP (2016)
- Radford et al., “Language Models are Unsupervised Multitask Learners” - GPT-2 and byte-level BPE (2019)
- Touvron et al., “LLaMA” - Open model with well-documented training pipeline (2023)
- Ouyang et al., “Training language models to follow instructions” - InstructGPT / RLHF (2022)
- Rafailov et al., “Direct Preference Optimization” - DPO (2023)
- Bai et al., “Constitutional AI” - Anthropic’s alignment approach (2022)
- Dao et al., “FlashAttention” - IO-aware exact attention (2022)
- Kwon et al., “PagedAttention / vLLM” - Efficient KV cache management (2023)
- OpenAI, tiktoken source code - BPE tokenizer implementation
Co-written with AI. Credit the prose, blame the opinions.