The Mechanical Psychology of Large Language Models
A visual, interactive guide. Understand everything from raw data to "thinking" models.
SYSTEM_INIT: TRUE | VOCAB_SIZE: 100,277 | MODE: EXPLAIN
Building ChatGPT happens in three distinct phases β each one transforms the model fundamentally.
| Phase 1: Pre-Training (Base Model) |
Phase 2: Supervised Fine-Tuning (SFT) |
Phase 3: Reinforcement Learning (RL) |
|
|---|---|---|---|
| Human Metaphor | Reading every textbook in the world. | Studying worked examples. | Solving practice problems via trial-and-error. |
| Data Input | 15 Trillion raw internet tokens. | 100,000+ human-written conversation logs. | Verifiable math, code, and logic problems. |
| Model Output | Document Simulator (Autocomplete). | Helpful Assistant (Imitating Experts). | Thinking Entity (Discovering Strategies). |
Download and preprocess the internet. The FineWeb pipeline collects and cleans ~15 trillion tokens from Common Crawl (2.7 billion web pages since 2007).
β 44 TB of cleaned text from the internet
The FineWeb Pipeline
Buy cheap watches!!! Click here β bit.ly/spam
βββββββ personal data βββββββ
Lorem ipsum dolor sit amet⦠{repeated 500x}
The transformer architecture was introduced in 2017 by Vaswani et al. in "Attention Is All You Need." It relies on self-attention mechanismsβ¦
GPT doesn't read letters β it reads tokens. A token is a chunk of text mapped to a number. Text β UTF-8 Bytes β BPE Merges β Token IDs.
Type something to see it tokenized
β One word split into 2 tokens!
Step-by-step for: "Hi"
total tokens in GPT-4's vocabulary
A token ID is just a number β meaningless on its own. The first thing GPT does is convert each ID into a high-dimensional vector. Similar tokens land near each other in that vector space.
"Cat" is token 3466 and "Dog" is 3290. To a computer those numbers are no closer than 3466 and 999,999. The fix: map each token ID to a vector of ~12,288 numbers. That vector is the model's representation of meaning β and it's learned during training.
The Embedding Matrix (token β vector)
[0.21, -1.04, 0.88, 0.33, -0.71, β¦ 0.13]
12,288 dimensions
Famous Demonstration: Vector Arithmetic
Gender, plurality, tense, and even countryβcapital relationships emerge as directions in vector space β without anyone programming them in.
Numbers compose. You can add, average, project, and measure distance β all of which neural networks do trivially. A vector is the only kind of "meaning" a transformer can manipulate.
The angle between two vectors tells you how similar two tokens (or sentences, or documents) are. This is the math behind every semantic search, RAG system, and recommendation engine.
LLMs do not see characters. Text is compressed into token chunks β which creates blind spots.
ubiquitous
β The model sees 3 token chunks, NOT 10 individual letters
LLMs do not see characters. Raw bits are compressed into a fixed vocabulary (GPT-4's 100,277 tokens) to save compute. Individual letters are lost inside token chunks.
Because letters are fused into token chunks, models routinely fail at: "count the Rs in strawberry" or "print every third character of ubiquitous."
Why spaces matter
The NN is a giant math function. Tokens go in β probabilities come out. The "knowledge" lives in billions of weight parameters.
Simplified Neural Network
What the NN really is
Nested multiplication & addition of weights,
with Ο (activation functions) adding non-linearity.
Weight parameters (billions of these)
Each cell = one weight. Teal negative, purple positive.
Introduced in 2017, the Transformer replaced recurrent networks with a revolutionary mechanism called self-attention β letting every token "look at" every other token in parallel.
Before Transformers, language models used RNNs β processing text one word at a time left to right, like reading a sentence in strict order. The problem: by the time the model reaches the end of a long sentence, it has "forgotten" the beginning. Transformers solved this by processing all tokens at once, letting every position attend to every other position simultaneously.
Self-Attention: "The animal didn't cross the street because it was too tired"
Attention scores reveal that "it" attends to "animal" β the model learns grammatical co-reference without being told any grammar rules.
How Attention Works: Query Β· Key Β· Value
"What am I looking for?" β the current token broadcasts what type of information it needs from other positions.
"What do I contain?" β every token advertises its content. QΒ·Kα΅ gives a raw relevance score between every pair of tokens.
"What do I pass along?" β the actual information that gets mixed into the output, weighted by the softmax of the QΒ·K scores.
Scores are scaled by βdk to prevent vanishing gradients in large dimensions.
Multi-Head Attention: Looking from Many Angles
GPT-4 uses 96 attention heads per layer, each free to specialize in a different linguistic relationship.
Self-attention treats tokens as a set β "the dog bit the man" and "the man bit the dog" would look identical. Positional encoding injects order back in.
Attention is permutation-invariant β shuffling input tokens produces shuffled but otherwise identical outputs. That's a disaster for language: "Alice loves Bob" and "Bob loves Alice" mean different things. The architecture itself has no concept of "first," "second," or "next to."
The Fix: Add a Position Vector to Every Token Embedding
Position vectors built from sin/cos waves at different frequencies. Each dimension oscillates at a unique rate, so any position has a unique fingerprint and the model can compute relative offsets.
Used by Llama, GPT-NeoX, DeepSeek. Instead of adding a position vector, RoPE rotates the query and key vectors by an angle proportional to their position. Attention scores then naturally encode relative distance.
Why everyone switched: extrapolates to longer contexts than training, plays well with linear attention.
Generating token N+1 should require reprocessing all N previous tokens β but it doesn't. The KV cache is the single optimization that makes interactive ChatGPT possible.
To generate token #100, attention needs the Keys and Values of tokens 1β99. To generate token #101, it needs them again. If you recomputed K and V from scratch every step, generating a 1,000-token response would do ~500,000 redundant attention computations. ChatGPT would be unusably slow.
The Trick: Cache K and V from past tokens
Each new token only does one forward pass of new work β past KV vectors are reused as-is.
Generation is O(N) total work for an N-token response. Each new token costs roughly the same as the last.
Generation would be O(NΒ²) β every new token reprocesses the whole history. A 10,000-token response would be 100Γ more expensive than a 1,000-token one.
Every cached token stores K and V vectors at every layer for every attention head. For a 70B model with a 100K context, the KV cache alone can exceed 10 GB. This is why long-context inference is GPU-memory-bound, not compute-bound. Optimizations like multi-query attention (MQA), grouped-query attention (GQA), and FlashAttention exist primarily to shrink this cache.
Hidden inside every modern neural network are tiny "valves" that decide what information flows through and what gets blocked. They're called gates β and they show up everywhere from LSTMs to MoE routers to the activation functions in GPT-4.
A gate is a learned function that outputs a number between 0 and 1 β usually via sigmoid. That number is then multiplied with another signal. 0 = closed (block all), 1 = open (let everything through), anything in between is partial flow. The crucial property: it's differentiable, so the network can learn how open or closed each gate should be in every situation.
Ο is the sigmoid function. β is element-wise multiplication. The gate scales the signal, possibly to zero.
A Single Gate in Action
The gate learns when to open and when to close β based on the current input.
LSTMs use three gates per cell: forget (what to drop), input (what to add), output (what to expose). GRUs simplify this to two: reset and update. Gates solved the vanishing gradient problem in RNNs by giving the network explicit control over its memory.
Modern transformer feed-forward layers use a gated linear unit: one branch produces values, another produces gates that selectively scale them. Llama, Mistral, and Gemma all use SwiGLU β quietly responsible for ~1% accuracy gains over plain ReLU.
The router that picks which experts handle a token is a gate. It outputs a softmax over experts; only the top-k gates open. Same mathematical primitive β applied to routing instead of scaling.
LSTM's Three Gates (Classic Example)
Plain neural networks treat every input feature the same way at every step. Gates give the network conditional computation β the ability to look at the current input and decide what to attend to, what to remember, what to forget, and what to compute. Almost every "smart" neural network architecture of the past decade β LSTMs, attention, MoE, mixtures-of-depths β is some flavor of "add gates here."
Modern frontier models β GPT-4, Mixtral, DeepSeek-V3 β aren't dense. They're sparse: hundreds of billions of parameters, but only a fraction activate per token.
In a normal ("dense") transformer, every parameter participates in every token. In MoE, the feed-forward layer is replaced by N expert sub-networks plus a tiny router. The router picks the top-k experts (usually 2 of 8, or 8 of 64) for each token. Inference cost is proportional to active parameters, not total parameters.
Routing a Token Through Experts
8 experts of 7B each = ~47B total parameters, but only ~13B active per token. Quality of a 47B dense model, speed of a 13B.
671B total parameters, only 37B active. Trained for ~$6M β an order of magnitude cheaper than dense models of comparable quality.
Standard MoE has one router deciding which experts handle a token. MMoE (Google, 2018) gives every task its own router β letting one shared model serve many different objectives without them stepping on each other's toes.
In real-world ML systems (YouTube ranking, ad CTR + watch time, recommendations) you usually predict multiple things from one model. Plain shared-bottom models suffer "negative transfer" β improving Task A hurts Task B. Plain MoE has just one router, so it can't specialize per task. MMoE keeps the experts shared, but gives each task its own gate.
The Architecture
β each task pulls its own mixture β
Same experts, but each task pulls a different mixture from them.
When two tasks are correlated (CTR and watch time both reward engaging videos), their gates learn to pull from overlapping experts β free knowledge transfer. When tasks conflict (CTR rewards clickbait, watch time punishes it), their gates diverge β each task gets a different mixture. The model decides automatically which experts to share and which to specialize, with no manual architecture decisions.
Newer variants like PLE (Progressive Layered Extraction) and CGC add explicit "task-specific" experts alongside shared ones, addressing MMoE's tendency for some tasks to dominate the shared pool. Most modern recsys at scale (TikTok, Meta Ads, Pinterest) run some descendant of this family.
GPT-4o can see images, hear audio, and reply with both. Under the hood, the trick is shockingly simple: convert any input modality into tokens, then use the same transformer.
The Universal Recipe
An image is sliced into 14Γ14 pixel patches. Each patch is flattened and projected into the same embedding space as text tokens. A 224Γ224 image becomes 256 "image tokens" that flow into attention right alongside words.
Sound is converted to a spectrogram (image of frequency Γ time), then patched the same way. Or, like Whisper, mapped directly to a discrete codebook of "audio tokens."
The transformer doesn't actually care what its input means β it operates on vectors. So if you can convert pixels (or sound, or any signal) into vectors that share a space with text vectors, attention learns relationships across modalities: the word "cat" attends to the patch of fur in the image, just like it would attend to "feline" in a sentence.
Show the NN sequences of tokens. Have it predict the next one. Adjust weights when it's wrong. Repeat billions of times.
The Training Loop
Next Token Prediction Example
If actual was "mat" β small loss. If predicted "roof" β big loss β bigger weight update.
The training loop says "update weights when wrong." That's the magic step. Here's what's really happening β without the calculus.
Imagine the model's "wrongness" (loss) as a landscape: peaks where it's very wrong, valleys where it's correct. Training is just rolling a ball downhill. At every point you ask: which direction is steepest down? β that's the gradient. You take a small step that way, then check again. Repeat billions of times.
The Two-Phase Dance
Run input through the network. Compute prediction. Compare to truth. Get a loss number β a single scalar like 3.41.
Walk the network in reverse. The chain rule tells you, for every weight: "if you nudge this by 0.001, the loss changes by X." Each weight gets its own gradient.
Each weight slides downhill on the loss surface, one tiny step (Ξ· = learning rate) at a time.
For GPT-4-class models, this happens over ~10Β²β΅ FLOPs β every weight gets nudged trillions of times. The optimizer (Adam/AdamW) keeps a running memory of past gradients per weight, so updates adapt to each parameter individually. This is what training actually is: gradient descent at planetary scale.
Training is a single-minded race to drop one number β the loss. For language models, that number is almost always cross-entropy.
"Given the model's predicted probability distribution over the next token, how surprised was the model that the actual next token was the right one?" High surprise = high loss = big weight update.
Worked Example: Model sees "The cat sat on the ___"
Model says: P("mat") = 0.95
Truth: "mat"
Loss = βlog(0.95) = 0.05
Tiny update β model already knows.
Model says: P("roof") = 0.95, P("mat") = 0.001
Truth: "mat"
Loss = βlog(0.001) = 6.9
Huge update β model gets shoved hard.
For language modeling, only one yi is 1 (the true token); everything else is 0. The formula collapses to βlog(probability of correct token).
Researchers report perplexity, which is just exp(cross-entropy). It has a clean interpretation: "on average, how many tokens is the model effectively choosing between?" Perplexity 1 = the model is certain. Perplexity 100,000 = the model has no idea (uniform over the vocab). Modern models hit ~5 on natural text.
The NN outputs raw scores (logits). Softmax converts them into probabilities that sum to 1.
Drag the sliders β see probabilities update live
Logits (raw scores)
Probabilities (after softmax)
1. NN outputs logits:
mat: 2.8 floor: 1.2 table: 0.52. Apply e^x:
eΒ²Β·βΈ=16.4 eΒΉΒ·Β²=3.3 eβ°Β·β΅=1.63. Divide by sum (21.3):
mat: 77% floor: 15% table: 8%Softmax gives you a distribution over 100,000 possible next tokens. But you have to pick one. How you pick is the difference between a boring assistant and a creative one.
Divides logits before softmax. Low (0.2) sharpens the distribution β model picks the most likely token almost every time. High (1.5) flattens it β rare tokens get a fair shot.
T=0 β fully deterministic.
T=1 β raw model probabilities.
T=2 β near-random chaos.
Throw away every token outside the top k most likely. Then sample from those. k=1 is "always pick the best" (greedy). k=50 is the typical default.
Cheap but rigid β k doesn't adapt to confidence.
Keep just enough top tokens to cover p% of probability mass (e.g. 0.9). When the model is confident, only 1β2 tokens qualify. When unsure, 100+. Adapts naturally.
The default in most production APIs.
Same Prompt, Different Settings
GPT generates text one token at a time. Each new token is fed back in β autoregressive generation.
Autoregressive Token-by-Token Generation
Chat Demo β click to see it generate
Learning the weights
Expensive, done once, GPUs for weeks
Using the weights
Fast, done every time you chat
The neural network applies a strictly finite amount of processing power (layers) to predict each single token. No matter how hard the question, every next-token prediction gets the same fixed budget of computation.
The Core Problem
Imagine you're given 1 second to answer every question β whether it's "What's 2+2?" or "What's 17Γ24β156Γ·3?" Same time budget, wildly different difficulty.
That's exactly what happens inside GPT. The neural network runs through its layers once per token (a "forward pass"). A simple question and an impossibly hard question both get the exact same number of computational steps.
Example: "What is 17 Γ 24 β 156 Γ· 3?"
Path A: Single-token answer β
Model forced to cram multiply, divide, and subtract into one forward pass β overloaded β wrong answer
Path B: Step-by-step β
Each intermediate token gets its own forward pass β 3Γ more compute budget β correct
It's not magic β it's granting the model more compute. Every extra token the model writes is another full pass through billions of parameters. By forcing intermediate steps, you convert one impossible forward pass into many manageable ones. This is why "chain-of-thought" prompting dramatically improves accuracy on math, logic, and reasoning tasks.
"GPT-4 supports 128K tokens" is a marketing line. Under the hood, attention is quadratic in sequence length β the engineering it takes to make long contexts work is wild.
Self-attention computes a score between every pair of tokens. With N tokens, that's NΒ² pairs. Doubling context β 4Γ compute and memory. Going from 2K to 1M context isn't 500Γ harder β it's 250,000Γ harder if done naΓ―vely.
Compute Cost vs. Context Length
Each token only attends to the last 4K tokens β a "window" that slides. Used in Mistral. Loses true global view but stays linear.
Reorders attention math to fit in GPU SRAM. Same answers as naΓ―ve attention, 5β10Γ faster, much less memory. Universally adopted.
Only compute scores for a subset of pairs (local + a few global tokens). Approximate, but nearly linear. Powers Gemini and Claude long-context.
Even when the math works, the model's attention doesn't scale uniformly. Information stuffed in the middle of a 100K-token prompt is recalled much worse than information at the start or end. Long context β long-attention quality. Always put the most important context near the beginning or end of your prompt.
The Base Model becomes the starting point. Its weights are further trained (fine-tuned) on hundreds of thousands of curated, multi-turn conversation logs crafted by human experts.
Raw, Unformatted Data (Base Model)
Unstructured β just continues patterns
Structured Multi-Turn Conversation (SFT)
By injecting special control tokens (<|im_start|>), the model learns a structured protocol. It statistically internalizes the "Persona" of a helpful, truthful, and harmless assistant by imitating the expert worked examples.
During SFT, models mimic human experts who confidently provide correct answers. But when there are gaps in its memoryβ¦
β Known
Who is Tom Cruise?
Who is Genghis Khan?
β Unknown
Who is Orson Kovats?
π Hallucination
"He's a sci-fi writer."
"He's a minor league baseball player."
After SFT, the model can imitate experts. But imitation has a ceiling β you can only copy what humans already know. RL lets the model discover new strategies on its own through trial and error on problems with verifiable answers.
Human writes: "Q: What is 25Γ4? A: 100"
Model learns: copy that pattern.
Model tries 1000 solutions to "Solve XΒ²β5X+6=0"
Reward: β
if answer = {2,3} β otherwise
The RL Training Loop
Concrete Example: "Write a Python function that returns the nth Fibonacci number"
Attempt 1 β Wrong logic
Test: fib(6) β 720 β 8 β Reward: 0 β weights nudged AWAY from this path
Attempt 2 β Crashes
Test: fib(6) β RecursionError β Reward: 0 β weights nudged AWAY
Attempt 47 β Correct!
Test: fib(6) β 8 β fib(10) β 55 β β Reward: +1 β weights nudged TOWARD this path
Attempt 823 β Discovered an optimization humans didn't teach it!
Test: all pass + faster β Reward: +1 β this efficient strategy gets reinforced
What this looks like at the token level
Over millions of problems, the model learns which reasoning patterns lead to correct answers
RL only works when you can automatically check if the answer is right. Math has exact answers. Code can be run against test cases. That's why RL is applied to these domains first.
β
Verifiable
Math, code, logic puzzles, chess
β
Not Verifiable
Poetry, humor, summaries, advice
SFT is bottlenecked by human intelligence β a model can only be as good as the expert it imitates. RL changes this.
For unverifiable domains (poetry, jokes, summaries), we use RLHF β training a secondary AI to simulate human scoring.
Remember the RL section above? RL works when there's a verifiable answer β math has a correct solution, code either runs or doesn't. But what about tasks where "good" is subjective? Is this joke funny? Is this summary accurate? Is this response helpful? There's no equation to check. So OpenAI's solution: train a second neural network to pretend to be a human judge. This is RLHF β Reinforcement Learning from Human Feedback.
Step 1 β Collect Human Preferences
The LLM generates multiple responses to the same prompt. Real humans rank them from best to worst. Example: "Write a pelican joke" β 5 different jokes β Humans rank Joke #3 > Joke #1 > Joke #5 > β¦
Step 2 β Train a Reward Model
A separate, smaller neural network is trained on thousands of these human rankings. It learns to predict what a human would prefer. Given any LLM output, it produces a score from 0 to 1. It's an AI trying to imitate human taste.
Step 3 β Optimize the LLM Against the Reward Model
Now the main LLM is fine-tuned using RL β but instead of a math checker or a game engine, the "environment" is the Reward Model. The LLM generates text β the Reward Model scores it β the LLM adjusts its weights to get higher scores. This is the same RL loop, except the judge is fake.
Here's the fundamental problem: the Reward Model is not a real human. It's just another neural network with exploitable patterns. When you tell an RL agent to maximize a score, it will find every possible shortcut β including ones that look insane to humans.
Think of it like this: if a teacher grades essays by counting how many "smart-sounding" words appear, students will eventually stuff essays with jargon β gaming the rubric without writing anything meaningful. That's exactly what happens here.
The LLM discovers adversarial inputs β nonsensical token sequences that exploit blind spots in the Reward Model and trigger a perfect score, despite being complete gibberish to a real human.
RLHF is hard, slow, and unstable. In 2023, a paper called "Direct Preference Optimization" did the same job with no reward model and no RL β just a clever loss function. It's now the default for open models.
Skip the reward model entirely. Take human-labeled preference pairs (chosen response vs. rejected response) and feed them directly into a contrastive loss. Mathematically equivalent to RLHF's optimization target β but trained like ordinary supervised fine-tuning.
Pipeline Comparison
PPO (RLHF) β 4 components
DPO β 1 step
IPO, KTO, ORPO, SimPO β each tweaks the loss to fix specific DPO failure modes (over-optimization, length bias, etc.). The space is moving fast.
Llama 3, Mistral, Gemma β DPO or variant. OpenAI / Anthropic β still use PPO-flavored RL with custom infrastructure. The open-weights world has moved on; the frontier labs haven't fully.
Hiring humans to label millions of preference pairs is expensive and slow. What if the AI could grade itself, given a written set of principles?
Write a "constitution" β a list of plain-English principles like "responses should be helpful, honest, and avoid harm." Then have another LLM read each candidate response and judge it against the constitution. Use those AI judgments instead of human labels. Hence: RLAIF (Reinforcement Learning from AI Feedback).
The Self-Critique Loop
A 3B-parameter model that performs like a 70B one didn't get there by training on more text. It got there by learning from a bigger model β that's distillation.
Take a giant, expensive "teacher" model (Claude Opus, GPT-4, Llama 405B). Run it on millions of prompts. Use its outputs β or even its full output probability distributions β as training data for a much smaller "student" model. The student learns to mimic the teacher, capturing most of the capability at a fraction of the cost.
Distillation Flow
Use the teacher's final outputs as training labels β same format as SFT, just with AI-generated data instead of human.
Match the teacher's full probability distribution at every token. The student learns not just what the teacher said but how confident it was β much richer signal.
Distill only on a narrow domain (math, coding, customer support). The 1B-param student can match GPT-4 on the specialty while running on a phone.
Internet text is noisy. Teacher outputs are filtered, clean, on-task data β far more sample-efficient. A small model trained on 100K teacher conversations beats a small model trained on 100M raw web pages. This is why every Haiku-class, Mini-class, and Flash-class model exists: a frontier model raises a small one.
Models trained heavily with RL (like DeepSeek R1) learn that higher accuracy requires massively long "Chains of Thought."
Standard ChatGPT-style models answer instantly β they blurt out the first plausible-sounding response. But researchers discovered something: if you train a model with RL (where it gets rewarded only for correct final answers), it naturally starts producing longer, more deliberate reasoning before answering. Nobody programmed it to "think step by step" β it figured out on its own that slowing down = more reward.
Standard Model (Fast but brittle)
"The answer is 177 dots."
Jumps straight from question to answer. Like a student guessing on an exam without showing work. Often wrong on hard problems, but sounds confident.
Thinking Model (Slow but highly accurate)
Let's break this down. First, count the outer ring⦠1, 2, 3⦠that's 30. Now the inner ring⦠wait, let me recheck⦠1, 2, 3⦠28. So, outer is 30, inner is 28. Total = 30 + 28 = 58.
Works through the problem piece by piece. Catches its own mistakes. Like a student who actually shows their work β slower, but far more reliable.
This is the astonishing part: nobody taught the model these strategies. Backtracking ("wait, let me recheckβ¦"), self-correction ("that doesn't add upβ¦"), breaking problems into sub-steps β these are behaviors humans use when solving hard problems. The RL training process discovered them independently, purely because they lead to more correct answers. The model reinvented human problem-solving strategies from scratch.
LLMs don't compute β they pattern-match. Understanding this gap explains both their surprising reasoning power and their surprising arithmetic failures.
Ask an LLM to explain how mitosis works, debug a React component, or compare Keynesian vs. Austrian economics β it'll do brilliantly. Ask it what 3,847 Γ 291 is, and it might confidently give you the wrong number. How can a system that reasons about philosophy fail at arithmetic?
The answer: math requires exactness; LLMs are optimized for probability. These are fundamentally different objectives.
LLMs Don't See Numbers β They See Tokens
What you think it sees
12345
one numeric quantity
What it actually sees
token chunks β no value attached
When an LLM "adds" two numbers, it isn't performing a calculation. It's generating tokens that look like the result of a calculation. For small numbers, probability aligns with correctness. As numbers grow larger or structures become unfamiliar, that alignment silently breaks.
Chain of Thought β What It Actually Does (and Doesn't)
β Why CoT improves accuracy
Every intermediate step written is another full forward pass through billions of parameters. By generating reasoning tokens, you hand the model more compute budget β the problem is distributed across many token predictions instead of crammed into one impossible step.
β What CoT can't do
CoT doesn't give the model a calculator. It encourages intermediate tokens that resemble reasoning steps. The chain can look flawless while the final number is wrong β or contain subtle errors that sound completely convincing.
"Let me calculate: 3847 Γ 291.
3847 Γ 200 = 769,400 β
3847 Γ 91 = 346,230 β (forgot +3847Γ1)
Total = 1,115,630" β wrong intermediate β wrong result
Real-world reasoning is rarely about exact computation. It's about framing problems, breaking them down, comparing alternatives, and building coherent arguments. LLMs are trained on billions of examples of humans doing exactly this β in books, papers, debates, tutorials. They've absorbed the structure of thought.
β Decompose problems
"First consider X, then Yβ¦"
β Spot inconsistencies
"That contradicts what you saidβ¦"
β Compare approaches
"Option A trades speed for accuracyβ¦"
None of that requires exact arithmetic. It requires structure, language, and pattern recognition β which is exactly what LLMs are optimized for. They don't follow rules; they imitate patterns of rule-following. That difference matters a lot in math, but very little in reasoning.
The Fix: Division of Labor
This is why modern LLM systems pair language models with calculators, code interpreters, and search engines β each doing what it's actually built for.
No training, no fine-tuning, no weight updates β just examples in the prompt. Yet the model "learns" the new task. This is the most surprising emergent capability of large LLMs.
Show the model a handful of input/output examples in the prompt. Then give it a new input. It infers the pattern and continues correctly β without any gradient updates. The "learning" happens entirely inside the forward pass.
Few-Shot Prompting
No translation training. Three examples were enough.
No examples β just the task description. Works for common tasks the model has seen during training.
A single example. Often dramatically better than zero-shot for unusual formats.
3β10 examples. Hits a plateau quickly β past 5 or so, more examples often hurt.
Why does this work? Recent research suggests the transformer is implementing something like gradient descent inside its forward pass β using attention to "fit a tiny model" to the in-context examples on the fly. We're still figuring it out. ICL emerged on its own once models passed ~1B parameters; below that, it doesn't really work.
An LLM has two fundamentally different types of "memory" β and understanding the difference is the single most useful thing you can learn about using AI.
Imagine two scenarios: (A) Someone asks you about a book you read 6 months ago β you remember the gist, but details are fuzzy, and you might accidentally "remember" things that weren't actually in it. (B) Someone hands you the book open to the right page and says "read this paragraph and answer" β now you're perfectly accurate. An LLM works exactly the same way, with two distinct memory systems.
The Parameters
(Long-term Memory β The Fuzzy One)
Weights (Billions of Parameters): Everything the model "learned" during training is compressed into these numbers. But it's lossy β like trying to memorize the entire internet. The model has a general sense of things, but specific details get blurry or mixed up. This is why it confidently tells you fake facts. Prone to hallucination.
The Context Window
(Working Memory β The Perfect One)
Context Window (Active Tokens): This is the text you put directly in the prompt β your question, pasted documents, conversation history. The model can see this perfectly, like reading off a page right in front of it. No guessing, no fuzzy recall. Zero hallucination on this data.
Most people use ChatGPT as a search engine: "Tell me about X" β forcing the model to dig through its fuzzy long-term memory. Power users paste the actual document, data, or code into the prompt and say: "Given this, answer Y." The second approach is dramatically more reliable because you're using the model's perfect working memory instead of its unreliable long-term recall.
LLMs can't do mental arithmetic or recall niche facts reliably β so they emit special 'Tool' tokens to call external programs.
Here's something most people don't realize: GPT cannot actually do math. It doesn't have a calculator inside it. When you ask "what's 3,847 Γ 291?", it's not computing β it's pattern-matching what a math answer looks like based on training data. For simple problems it often gets lucky. For anything complex, it silently gets it wrong. Same for: counting characters in a word, looking up today's stock price, or checking if code actually runs.
The solution? Give it hands. Modern LLMs are trained to recognize when they're about to hit their limits and output a special hidden token that says: "I need to call an external tool." It's like a person who knows they're bad at math pulling out a calculator.
Tool Use Flow
The model writes code β a real computer runs it β the result is pasted back into the model's context window β the model incorporates the exact answer into its response.
When the model needs current information (today's weather, recent news, live prices), it searches the web and injects real-time results into its working memory. This turns the fuzzy "I think I rememberβ¦" into exact, verified facts.
Without it: "I believe the CEO is still Johnβ¦" (could be outdated)
With it: Searches β finds current data β gives correct answer
When the model needs to compute, count, or process data, it writes Python code and runs it on a real computer. The result is deterministic β a calculator never gets arithmetic wrong.
Without it: "3847 Γ 291 = 1,119,377" (guessing β often wrong)
With code: print(3847 * 291) β 1,119,477 (always correct)
"The model used a calculator" sounds magical. The actual mechanism is shockingly simple: the model emits structured JSON, your code reads it, your code calls the function, your code injects the result back. No mind-reading.
In your prompt, you describe the available tools as JSON schemas: get_weather(city: string), calculate(expression: string). The model has been fine-tuned to output a special structured response when it wants to use one. Your application parses that, executes it, and feeds the result back into the conversation.
A Full Round-Trip
During fine-tuning, the model saw thousands of conversations where the assistant correctly emitted structured tool calls when needed. It learned: "if the answer requires real-world action, output the JSON instead of guessing."
Anthropic's Model Context Protocol standardizes how any LLM connects to any tool β files, APIs, databases. Functions are no longer hardcoded per app; they're plugins the model can discover.
The single most useful technique built on top of LLMs. Instead of asking "what do you know about X?", you fetch the relevant documents first and stuff them into the prompt. Hallucinations drop dramatically.
Remember the operator's manual: "Feed it, don't quiz it." RAG is that rule turned into infrastructure. Instead of trusting the model's blurry parameter memory, you keep the source documents in a database and look them up at query time. The model only ever answers from text directly in its context window.
The RAG Pipeline
A chatbot answers and stops. An agent plans, acts, observes, and replans β over many turns, using tools, until a goal is reached. This is where 2025's frontier is.
"Reason + Act." The model alternates: think about what to do next, take an action, observe the result, think again. Each iteration is a full LLM call. Loops continue until the model decides it's done β or hits a step limit.
A Single Agent Step
One LLM with tool access, looping until done. Used by Cursor, Claude Code, ChatGPT with browsing.
Specialized agents (planner, coder, critic) hand off to each other. More expressive but harder to debug.
Try multiple action branches, score each, keep the best. AlphaGo-style. Emerging in coding agents.
A "safe" model is a model whose safety training holds. Both have failure modes β and once you understand the architecture, the failures are not surprising.
Safety training (RLHF / Constitutional) is a thin layer on top of a model that has read the entire internet β including everything it's not supposed to repeat. It's a persona, not a hard barrier. With enough creative prompting, that persona can be overridden.
User crafts a prompt that bypasses safety training to elicit forbidden output. Examples:
Attacker hides instructions in data the model will read. Examples:
An LLM has no notion of trust levels on tokens. The system prompt, the user prompt, the tool output β all flow into the same context window as undifferentiated text. Asking an LLM to "ignore instructions in retrieved documents" is asking it to draw a line that doesn't structurally exist. This is why prompt injection is closer to SQL injection in 1998 β a category of bug, not a single flaw, and not yet solved.
Now that you understand how GPT works under the hood, here are three practical rules that follow directly from the architecture. These aren't "prompting tips" β they're mechanical consequences of how the system is built.
Parameter weights are a blurry, lossy zip file.
Never test an LLM's memory. Instead, paste the actual documents, data, or source material directly into the prompt. The model's context window (working memory) is perfect β its parameter recall (long-term memory) is fuzzy. Treat it like a brilliant analyst who hasn't read the brief yet: hand them the brief.
Neural networks apply finite compute per token.
The model gets a fixed amount of "thinking" per output token. For complex questions, a one-word answer means almost no computation happened. Force it to think out loud β "explain step by step", "show your reasoning" β to give it the compute budget it needs to get the right answer.
Tokens blind LLMs to spelling; architecture blinds them to math.
The model can't natively count letters, do arithmetic, or know what happened yesterday. It can call a calculator, run code, or search the web β but sometimes needs a nudge. Explicitly tell it when precision matters.
When ChatGPT says "I thinkβ¦" or "I'm sorry, I don't knowβ¦" β it feels like you're talking to a person. That's the illusion. You're not. You're watching the output of a very sophisticated pattern-matching engine that was trained on billions of examples of humans writing things. It has learned to produce text that looks like it comes from a thoughtful person β but there is no person in there.
What It Feels Like
A Sentient Oracle
that understands you
What It Actually Is
A Statistical Engine
flipping billions of biased coins
Every conversation starts from zero. The model has no memory of you, no ongoing thoughts, no identity between sessions. "It" doesn't exist when you're not prompting it. What seems like personality is just a statistical pattern.
Caveat: ChatGPT the product now has a "Memory" feature β but it's an application-layer trick. User facts are stored in a database and injected into the context window at the start of each chat. The model itself still starts from zero; it just gets handed a cheat sheet.
During training, the model was fine-tuned on examples written by human contractors who followed labeling guidelines ("be helpful, be harmless, be honest"). So when you prompt it, you're activating a simulation of those specific people following those specific instructions. It's roleplaying as a helpful assistant because that's the character it was trained to play.
Every single word it generates is the result of a probability distribution β like a weighted dice roll. "The capital of France is ___" β 97% Paris, 1.5% Lyon, 0.5% Marseilleβ¦ It picks one. That's all generation ever is: billions of educated guesses in sequence.
There's a reason every lab kept making models bigger. In 2020, OpenAI published curves showing model loss falls predictably with more parameters, more data, more compute. Five years of progress is, mathematically, just riding those curves.
Loss is a power-law function of three things: parameters (N), training tokens (D), and compute (C). Plot loss vs. any of them on log-log axes β straight line. No magic threshold, no plateau in sight. Bigger always helped.
DeepMind showed Kaplan was wrong about the optimal mix. Most early models were under-trained β too many params, too few tokens. The compute-optimal recipe: roughly 20 training tokens per parameter. A 70B model wants ~1.4T tokens. This is why Llama 3 (15T tokens) blew past Llama 1 despite the same architecture β pure data.
Each 10Γ compute β ~constant fractional loss drop. Going from a 7B to a 70B model is a much smaller capability jump than 0.7B to 7B was.
GPT-4 reportedly cost ~$100M to train. GPT-5-class is ~$500M+. The economics of pure scaling are running into the limits of how much capital one company can spend.
o1 / R1 changed the conversation: spend more compute at inference (longer thinking), not training. A new scaling axis just opened up.
"Beats GPT-4 on MMLU" tells you almost nothing useful. Understanding why requires understanding what benchmarks actually measure β and what they don't.
Every token has a price. Understanding the cost structure of LLMs explains a lot about why some products are free, why others charge per request, and why "AGI by 2030" runs into electricity bills before it runs into algorithms.
Mostly GPU rental + electricity. Doubles every ~10 months.
Falls fast. Cheaper than search engine queries now.
Counterintuitively, for popular models inference dominates total spend. ChatGPT serves billions of requests per day. Training cost amortizes across that traffic in weeks. After that, every token is pure operational cost β and most of every dollar a frontier lab earns goes to running, not training.
A single GPT-4-class training run consumes on the order of 50 GWh β the annual electricity use of ~5,000 US homes. Inference at ChatGPT scale is estimated at hundreds of MW continuous. The bottleneck for the next generation of models isn't algorithms β it's data center power contracts. Microsoft, Google, and Amazon are now buying nuclear plants.
There are roughly two LLM ecosystems. One you call over an API and never see. The other you can download, modify, run on your laptop, and fine-tune in your basement. They're catching up to each other faster than anyone expected.
Proprietary weights, accessed via API. Best raw capability, often by a few months.
+ Best quality, easy to use, no infrastructure burden.
β Vendor lock-in, data leaves your network, opaque updates.
Weights downloadable. Can run locally, fine-tune, audit.
+ Privacy, control, no per-token cost, customizable.
β Need GPUs, infrastructure, lag the frontier by ~6β12 months.
Almost no "open" model is fully open. Open weights means you get the trained model. Open source would also include training code, training data, and full reproducibility β which only a handful of projects (OLMo, BLOOM) actually publish. Llama, Mistral, DeepSeek are open-weights but closed-data; their licenses also restrict some commercial uses.
Curated links to the best papers, blog posts, videos, and interactive tools for each section above.