The Mechanical Psychology of Large Language Models
A visual, interactive guide. Understand everything from raw data to "thinking" models.
SYSTEM_INIT: TRUE | VOCAB_SIZE: 100,277 | MODE: EXPLAIN
Building ChatGPT happens in three distinct phases, and each one transforms the model fundamentally.
| | Phase 1: Pre-Training (Base Model) | Phase 2: Supervised Fine-Tuning (SFT) | Phase 3: Reinforcement Learning (RL) |
|---|---|---|---|
| Human Metaphor | Reading every textbook in the world. | Studying worked examples. | Solving practice problems via trial-and-error. |
| Data Input | 15 Trillion raw internet tokens. | 100,000+ human-written conversation logs. | Verifiable math, code, and logic problems. |
| Model Output | Document Simulator (Autocomplete). | Helpful Assistant (Imitating Experts). | Thinking Entity (Discovering Strategies). |
Download and preprocess the internet. The FineWeb pipeline collects and cleans ~15 trillion tokens from Common Crawl (2.7 billion web pages since 2007).
≈ 44 TB of cleaned text from the internet
The FineWeb Pipeline
Buy cheap watches!!! Click here → bit.ly/spam
██████ personal data ██████
Lorem ipsum dolor sit amet… {repeated 500x}
The transformer architecture was introduced in 2017 by Vaswani et al. in "Attention Is All You Need." It relies on self-attention mechanisms…
GPT doesn't read letters, it reads tokens. A token is a chunk of text mapped to a number. Text → UTF-8 Bytes → BPE Merges → Token IDs.
Type something to see it tokenized
One word split into 2 tokens!
Step-by-step for: "Hi"
total tokens in GPT-4's vocabulary
LLMs do not see characters. Text is compressed into token chunks, which creates blind spots.
ubiquitous
The model sees 3 token chunks, NOT 10 individual letters
LLMs do not see characters. Raw bits are compressed into a fixed vocabulary (GPT-4's 100,277 tokens) to save compute. Individual letters are lost inside token chunks.
Because letters are fused into token chunks, models routinely fail at: "count the Rs in strawberry" or "print every third character of ubiquitous."
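For contrast, both failing tasks are trivial for ordinary code, which really does see individual characters. A minimal sketch (taking "every third character" to mean positions 0, 3, 6, 9):

```python
word = "strawberry"
print(word.count("r"))   # counts actual characters: 3

text = "ubiquitous"
print(text[::3])         # characters at positions 0, 3, 6, 9: "uqts"
```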
Why spaces matter
The NN is a giant math function. Tokens go in → probabilities come out. The "knowledge" lives in billions of weight parameters.
Simplified Neural Network
What the NN really is
Nested multiplication & addition of weights,
with σ (activation functions) adding non-linearity.
Weight parameters (billions of these)
Each cell = one weight. Teal negative, purple positive.
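A toy version of that "giant math function": two inputs, two hidden units, one output. The weights below are made up purely for illustration; a real model has billions of them, but the shape of the computation (nested multiply-add with σ between layers) is the same.

```python
import math

def sigmoid(x):
    # σ: the activation function adding non-linearity
    return 1 / (1 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    # Nested multiplication & addition of weights, with σ applied in between
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

# Made-up weights, standing in for the billions in a real model
w_hidden = [[0.5, -1.2], [0.8, 0.3]]
w_out = [1.0, -0.7]
print(forward([1.0, 2.0], w_hidden, w_out))
```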
Introduced in 2017, the Transformer replaced recurrent networks with a revolutionary mechanism called self-attention, letting every token "look at" every other token in parallel.
Before Transformers, language models used RNNs, processing text one word at a time left to right, like reading a sentence in strict order. The problem: by the time the model reaches the end of a long sentence, it has "forgotten" the beginning. Transformers solved this by processing all tokens at once, letting every position attend to every other position simultaneously.
Self-Attention: "The animal didn't cross the street because it was too tired"
Attention scores reveal that "it" attends to "animal": the model learns grammatical co-reference without being told any grammar rules.
How Attention Works: Query · Key · Value
"What am I looking for?" The current token broadcasts what type of information it needs from other positions.
"What do I contain?" Every token advertises its content. Q·Kᵀ gives a raw relevance score between every pair of tokens.
"What do I pass along?" The actual information that gets mixed into the output, weighted by the softmax of the Q·K scores.
Scores are divided by √dk before the softmax; without this scaling, dot products grow with dimension, saturating the softmax and causing vanishing gradients.
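The whole mechanism, softmax(Q·Kᵀ/√dk)·V, fits in a few lines. A plain-Python sketch with a toy 3-token sequence and 2-dimensional vectors (all numbers made up):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q·Kᵀ / √dk) · V
    dk = len(K[0])
    out = []
    for q in Q:
        # Relevance score of this query against every key, scaled by √dk
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dk) for k in K]
        weights = softmax(scores)
        # Mix the value vectors, weighted by attention
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy 3-token sequence, 2-dimensional Q/K/V
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))
```

Each output row is a blend of all three value vectors, weighted by how well that token's query matched every key.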
Multi-Head Attention: Looking from Many Angles
GPT-4 uses 96 attention heads per layer, each free to specialize in a different linguistic relationship.
Show the NN sequences of tokens. Have it predict the next one. Adjust weights when it's wrong. Repeat billions of times.
The Training Loop
Next Token Prediction Example
If actual was "mat" → small loss. If predicted "roof" → big loss → bigger weight update.
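The loss behind that loop is cross-entropy: the negative log of the probability the model assigned to the token that actually came next. A quick illustration:

```python
import math

def loss(p_actual):
    # Cross-entropy for next-token prediction: -log(probability of the actual token)
    return -math.log(p_actual)

print(loss(0.77))  # model was confident in the right token: small loss (~0.26)
print(loss(0.01))  # model bet on the wrong token: big loss (~4.6)
```

The bigger the loss, the bigger the gradient, and the bigger the weight update.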
The NN outputs raw scores (logits). Softmax converts them into probabilities that sum to 1.
Drag the sliders → see probabilities update live
Logits (raw scores)
Probabilities (after softmax)
1. NN outputs logits: mat: 2.8, floor: 1.2, table: 0.5
2. Apply e^x: e^2.8 = 16.4, e^1.2 = 3.3, e^0.5 = 1.6
3. Divide by sum (21.3): mat: 77%, floor: 15%, table: 8%

GPT generates text one token at a time. Each new token is fed back in: autoregressive generation.
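The softmax walkthrough above is a few lines of Python (a minimal sketch using the same logits):

```python
import math

def softmax(logits):
    # Convert raw scores into probabilities that sum to 1
    exps = {tok: math.exp(x) for tok, x in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax({"mat": 2.8, "floor": 1.2, "table": 0.5})
print(probs)  # ≈ 77% mat, 15% floor, 8% table
```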
Autoregressive Token-by-Token Generation
Chat Demo: click to see it generate
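Autoregressive generation in miniature. The "model" below is just a hypothetical lookup table of next-token probabilities, but the loop has the real shape: sample a token, append it, feed the sequence back in.

```python
import random

# Stand-in for the neural network: a made-up table of next-token distributions
NEXT = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 1.0},
}

def generate(token, n, rng):
    out = [token]
    for _ in range(n):
        dist = NEXT.get(out[-1])
        if dist is None:
            break  # no continuation known for this token
        # Sample the next token, then feed it back in as input
        toks, weights = zip(*dist.items())
        out.append(rng.choices(toks, weights)[0])
    return out

print(generate("the", 3, random.Random(0)))
```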
Learning the weights
Expensive, done once, GPUs for weeks
Using the weights
Fast, done every time you chat
The neural network applies a strictly finite amount of processing power (layers) to predict each single token. No matter how hard the question, every next-token prediction gets the same fixed budget of computation.
The Core Problem
Imagine you're given 1 second to answer every question, whether it's "What's 2+2?" or "What's 17×24−156÷3?" Same time budget, wildly different difficulty.
That's exactly what happens inside GPT. The neural network runs through its layers once per token (a "forward pass"). A simple question and an impossibly hard question both get the exact same number of computational steps.
Example: "What is 17 × 24 − 156 ÷ 3?"
Path A: Single-token answer ✗
Model forced to cram multiply, divide, and subtract into one forward pass → overloaded → wrong answer
Path B: Step-by-step ✓
Each intermediate token gets its own forward pass → 3× more compute budget → correct
It's not magic, it's granting the model more compute. Every extra token the model writes is another full pass through billions of parameters. By forcing intermediate steps, you convert one impossible forward pass into many manageable ones. This is why "chain-of-thought" prompting dramatically improves accuracy on math, logic, and reasoning tasks.
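The Path B decomposition, made literal: each line computes one intermediate value, the way each extra token buys the model one more forward pass.

```python
step1 = 17 * 24         # multiply first: 408
step2 = 156 / 3         # divide next: 52.0
answer = step1 - step2  # subtract last
print(answer)           # 356.0
```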
The Base Model becomes the starting point. Its weights are further trained (fine-tuned) on hundreds of thousands of curated, multi-turn conversation logs crafted by human experts.
Raw, Unformatted Data (Base Model)
Unstructured: just continues patterns
Structured Multi-Turn Conversation (SFT)
By injecting special control tokens (<|im_start|>), the model learns a structured protocol. It statistically internalizes the "Persona" of a helpful, truthful, and harmless assistant by imitating the expert worked examples.
During SFT, models mimic human experts who confidently provide correct answers. But when there are gaps in their memory…
✓ Known
Who is Tom Cruise?
Who is Genghis Khan?
✗ Unknown
Who is Orson Kovats?
Hallucination
"He's a sci-fi writer."
"He's a minor league baseball player."
After SFT, the model can imitate experts. But imitation has a ceiling: you can only copy what humans already know. RL lets the model discover new strategies on its own through trial and error on problems with verifiable answers.
Human writes: "Q: What is 25×4? A: 100"
Model learns: copy that pattern.
Model tries 1000 solutions to "Solve X²−5X+6=0"
Reward: ✓ if answer = {2, 3}, ✗ otherwise
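That reward is just an automatic checker. A sketch, assuming the model's answer arrives as a list of proposed roots:

```python
def reward(roots):
    # +1 only when the roots of X² − 5X + 6 = 0 are exactly {2, 3}
    return 1 if set(roots) == {2, 3} else 0

print(reward([3, 2]))  # 1: order doesn't matter
print(reward([1, 6]))  # 0: wrong roots, no partial credit
```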
The RL Training Loop
Concrete Example: "Write a Python function that returns the nth Fibonacci number"
Attempt 1: Wrong logic
Test: fib(6) → 720 ≠ 8 → Reward: 0 → weights nudged AWAY from this path
Attempt 2: Crashes
Test: fib(6) → RecursionError → Reward: 0 → weights nudged AWAY
Attempt 47: Correct!
Test: fib(6) → 8 ✓, fib(10) → 55 ✓ → Reward: +1 → weights nudged TOWARD this path
Attempt 823: Discovered an optimization humans didn't teach it!
Test: all pass + faster → Reward: +1 → this efficient strategy gets reinforced
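The grader behind those attempts could look like this sketch: run the candidate against known cases (1-indexed, so fib(6) = 8 and fib(10) = 55) and emit a binary reward, with crashes scoring zero just like wrong answers.

```python
def verify(candidate):
    try:
        return 1 if candidate(6) == 8 and candidate(10) == 55 else 0
    except Exception:
        return 0  # a crash (like Attempt 2's RecursionError) earns no reward

def fib(n):
    # An iterative solution of the kind Attempt 47 might have produced
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(verify(fib))            # 1
print(verify(lambda n: 720))  # 0: Attempt 1's wrong logic
```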
What this looks like at the token level
Over millions of problems, the model learns which reasoning patterns lead to correct answers
RL only works when you can automatically check if the answer is right. Math has exact answers. Code can be run against test cases. That's why RL is applied to these domains first.
✓
Verifiable
Math, code, logic puzzles, chess
✗
Not Verifiable
Poetry, humor, summaries, advice
SFT is bottlenecked by human intelligence: a model can only be as good as the expert it imitates. RL changes this.
For unverifiable domains (poetry, jokes, summaries), we use RLHF: training a secondary AI to simulate human scoring.
Remember the RL section above? RL works when there's a verifiable answer: math has a correct solution, code either runs or doesn't. But what about tasks where "good" is subjective? Is this joke funny? Is this summary accurate? Is this response helpful? There's no equation to check. So OpenAI's solution: train a second neural network to pretend to be a human judge. This is RLHF, Reinforcement Learning from Human Feedback.
Step 1: Collect Human Preferences
The LLM generates multiple responses to the same prompt. Real humans rank them from best to worst. Example: "Write a pelican joke" → 5 different jokes → Humans rank Joke #3 > Joke #1 > Joke #5 > …
Step 2: Train a Reward Model
A separate, smaller neural network is trained on thousands of these human rankings. It learns to predict what a human would prefer. Given any LLM output, it produces a score from 0 to 1. It's an AI trying to imitate human taste.
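A minimal sketch of how such a reward model can be fit: each response is reduced to a made-up feature vector, and a Bradley-Terry-style preference loss pushes the preferred response's score above the rejected one's. (Real reward models are full transformers; this shows only the training signal in miniature.)

```python
import math

def score(w, x):
    # Linear reward model: higher score = "a human would prefer this"
    return sum(wi * xi for wi, xi in zip(w, x))

def train(pairs, dim, lr=0.1, steps=200):
    w = [0.0] * dim
    for _ in range(steps):
        for preferred, rejected in pairs:
            # P(preferred beats rejected) = sigmoid(score difference)
            diff = score(w, preferred) - score(w, rejected)
            p = 1 / (1 + math.exp(-diff))
            grad = 1 - p  # gradient of the log-likelihood w.r.t. the difference
            for i in range(dim):
                w[i] += lr * grad * (preferred[i] - rejected[i])
    return w

# Toy rankings: humans preferred responses high in feature 0, low in feature 1
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]
w = train(pairs, dim=2)
```

After training, the model scores the preferred responses above the rejected ones, which is all the RL stage needs from it.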
Step 3: Optimize the LLM Against the Reward Model
Now the main LLM is fine-tuned using RL, but instead of a math checker or a game engine, the "environment" is the Reward Model. The LLM generates text → the Reward Model scores it → the LLM adjusts its weights to get higher scores. This is the same RL loop, except the judge is fake.
Here's the fundamental problem: the Reward Model is not a real human. It's just another neural network with exploitable patterns. When you tell an RL agent to maximize a score, it will find every possible shortcut, including ones that look insane to humans.
Think of it like this: if a teacher grades essays by counting how many "smart-sounding" words appear, students will eventually stuff essays with jargon, gaming the rubric without writing anything meaningful. That's exactly what happens here.
The LLM discovers adversarial inputs: nonsensical token sequences that exploit blind spots in the Reward Model and trigger a perfect score, despite being complete gibberish to a real human.
Models trained heavily with RL (like DeepSeek R1) learn that higher accuracy requires massively long "Chains of Thought."
Standard ChatGPT-style models answer instantly: they blurt out the first plausible-sounding response. But researchers discovered something: if you train a model with RL (where it gets rewarded only for correct final answers), it naturally starts producing longer, more deliberate reasoning before answering. Nobody programmed it to "think step by step"; it figured out on its own that slowing down = more reward.
Standard Model (Fast but brittle)
"The answer is 177 dots."
Jumps straight from question to answer. Like a student guessing on an exam without showing work. Often wrong on hard problems, but sounds confident.
Thinking Model (Slow but highly accurate)
Let's break this down. First, count the outer ring… 1, 2, 3… that's 30. Now the inner ring… wait, let me recheck… 1, 2, 3… 28. So, outer is 30, inner is 28. Total = 30 + 28 = 58.
Works through the problem piece by piece. Catches its own mistakes. Like a student who actually shows their work: slower, but far more reliable.
This is the astonishing part: nobody taught the model these strategies. Backtracking ("wait, let me recheck…"), self-correction ("that doesn't add up…"), breaking problems into sub-steps: these are behaviors humans use when solving hard problems. The RL training process discovered them independently, purely because they lead to more correct answers. The model reinvented human problem-solving strategies from scratch.
LLMs don't compute; they pattern-match. Understanding this gap explains both their surprising reasoning power and their surprising arithmetic failures.
Ask an LLM to explain how mitosis works, debug a React component, or compare Keynesian vs. Austrian economics, and it'll do brilliantly. Ask it what 3,847 × 291 is, and it might confidently give you the wrong number. How can a system that reasons about philosophy fail at arithmetic?
The answer: math requires exactness; LLMs are optimized for probability. These are fundamentally different objectives.
LLMs Don't See Numbers: They See Tokens
What you think it sees
12345
one numeric quantity
What it actually sees
token chunks, no value attached
When an LLM "adds" two numbers, it isn't performing a calculation. It's generating tokens that look like the result of a calculation. For small numbers, probability aligns with correctness. As numbers grow larger or structures become unfamiliar, that alignment silently breaks.
Chain of Thought: What It Actually Does (and Doesn't)
✓ Why CoT improves accuracy
Every intermediate step written is another full forward pass through billions of parameters. By generating reasoning tokens, you hand the model more compute budget: the problem is distributed across many token predictions instead of crammed into one impossible step.
✗ What CoT can't do
CoT doesn't give the model a calculator. It encourages intermediate tokens that resemble reasoning steps. The chain can look flawless while the final number is wrong, or contain subtle errors that sound completely convincing.
"Let me calculate: 3847 × 291.
3847 × 200 = 769,400 ✓
3847 × 91 = 346,230 ✗ (forgot +3847×1)
Total = 1,115,630" → wrong intermediate → wrong result
Real-world reasoning is rarely about exact computation. It's about framing problems, breaking them down, comparing alternatives, and building coherent arguments. LLMs are trained on billions of examples of humans doing exactly this: in books, papers, debates, tutorials. They've absorbed the structure of thought.
✓ Decompose problems
"First consider X, then Y…"
✓ Spot inconsistencies
"That contradicts what you said…"
✓ Compare approaches
"Option A trades speed for accuracy…"
None of that requires exact arithmetic. It requires structure, language, and pattern recognition, which is exactly what LLMs are optimized for. They don't follow rules; they imitate patterns of rule-following. That difference matters a lot in math, but very little in reasoning.
The Fix: Division of Labor
This is why modern LLM systems pair language models with calculators, code interpreters, and search engines β each doing what it's actually built for.
An LLM has two fundamentally different types of "memory", and understanding the difference is the single most useful thing you can learn about using AI.
Imagine two scenarios: (A) Someone asks you about a book you read 6 months ago: you remember the gist, but details are fuzzy, and you might accidentally "remember" things that weren't actually in it. (B) Someone hands you the book open to the right page and says "read this paragraph and answer": now you're perfectly accurate. An LLM works exactly the same way, with two distinct memory systems.
The Parameters
(Long-term Memory: The Fuzzy One)
Weights (Billions of Parameters): Everything the model "learned" during training is compressed into these numbers. But it's lossy, like trying to memorize the entire internet. The model has a general sense of things, but specific details get blurry or mixed up. This is why it confidently tells you fake facts. Prone to hallucination.
The Context Window
(Working Memory: The Perfect One)
Context Window (Active Tokens): This is the text you put directly in the prompt β your question, pasted documents, conversation history. The model can see this perfectly, like reading off a page right in front of it. No guessing, no fuzzy recall. Zero hallucination on this data.
Most people use ChatGPT as a search engine: "Tell me about X", forcing the model to dig through its fuzzy long-term memory. Power users paste the actual document, data, or code into the prompt and say: "Given this, answer Y." The second approach is dramatically more reliable because you're using the model's perfect working memory instead of its unreliable long-term recall.
LLMs can't do mental arithmetic or recall niche facts reliably, so they emit special "Tool" tokens to call external programs.
Here's something most people don't realize: GPT cannot actually do math. It doesn't have a calculator inside it. When you ask "what's 3,847 × 291?", it's not computing; it's pattern-matching what a math answer looks like based on training data. For simple problems it often gets lucky. For anything complex, it silently gets it wrong. Same for: counting characters in a word, looking up today's stock price, or checking if code actually runs.
The solution? Give it hands. Modern LLMs are trained to recognize when they're about to hit their limits and output a special hidden token that says: "I need to call an external tool." It's like a person who knows they're bad at math pulling out a calculator.
Tool Use Flow
The model writes code → a real computer runs it → the result is pasted back into the model's context window → the model incorporates the exact answer into its response.
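A sketch of that flow, with a hypothetical TOOL:python: marker standing in for the model's special hidden token and a stubbed-out model (real systems use structured tool-call messages, not string prefixes):

```python
def model(prompt):
    # Stub for the LLM: it "recognizes" it needs a calculator and
    # emits a tool-call marker instead of guessing digits.
    return "TOOL:python:3847 * 291"

def run_turn(prompt):
    out = model(prompt)
    if out.startswith("TOOL:python:"):
        expr = out[len("TOOL:python:"):]
        # A real interpreter computes the exact answer.
        # (Never eval untrusted input outside a sandbox; this is a demo.)
        result = eval(expr)
        # The result is pasted back into context for the final reply.
        return f"3,847 × 291 = {result:,}"
    return out

print(run_turn("What is 3,847 × 291?"))  # 3,847 × 291 = 1,119,477
```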
When the model needs current information (today's weather, recent news, live prices), it searches the web and injects real-time results into its working memory. This turns the fuzzy "I think I remember…" into exact, verified facts.
Without it: "I believe the CEO is still John…" (could be outdated)
With it: Searches → finds current data → gives correct answer
When the model needs to compute, count, or process data, it writes Python code and runs it on a real computer. The result is deterministic: a calculator never gets arithmetic wrong.
Without it: "3847 × 291 = 1,119,377" (guessing, often wrong)
With code: print(3847 * 291) → 1,119,477 (always correct)
Now that you understand how GPT works under the hood, here are three practical rules that follow directly from the architecture. These aren't "prompting tips"; they're mechanical consequences of how the system is built.
Parameter weights are a blurry, lossy zip file.
Never test an LLM's memory. Instead, paste the actual documents, data, or source material directly into the prompt. The model's context window (working memory) is perfect; its parameter recall (long-term memory) is fuzzy. Treat it like a brilliant analyst who hasn't read the brief yet: hand them the brief.
Neural networks apply finite compute per token.
The model gets a fixed amount of "thinking" per output token. For complex questions, a one-word answer means almost no computation happened. Force it to think out loud ("explain step by step", "show your reasoning") to give it the compute budget it needs to get the right answer.
Tokens blind LLMs to spelling; architecture blinds them to math.
The model can't natively count letters, do arithmetic, or know what happened yesterday. It can call a calculator, run code, or search the web, but sometimes needs a nudge. Explicitly tell it when precision matters.
When ChatGPT says "I think…" or "I'm sorry, I don't know…", it feels like you're talking to a person. That's the illusion. You're not. You're watching the output of a very sophisticated pattern-matching engine that was trained on billions of examples of humans writing things. It has learned to produce text that looks like it comes from a thoughtful person, but there is no person in there.
What It Feels Like
A Sentient Oracle
that understands you
What It Actually Is
A Statistical Engine
flipping billions of biased coins
Every conversation starts from zero. The model has no memory of you, no ongoing thoughts, no identity between sessions. "It" doesn't exist when you're not prompting it. What seems like personality is just a statistical pattern.
Caveat: ChatGPT the product now has a "Memory" feature, but it's an application-layer trick. User facts are stored in a database and injected into the context window at the start of each chat. The model itself still starts from zero; it just gets handed a cheat sheet.
During training, the model was fine-tuned on examples written by human contractors who followed labeling guidelines ("be helpful, be harmless, be honest"). So when you prompt it, you're activating a simulation of those specific people following those specific instructions. It's roleplaying as a helpful assistant because that's the character it was trained to play.
Every single word it generates is the result of a probability distribution, like a weighted dice roll. "The capital of France is ___" → 97% Paris, 1.5% Lyon, 0.5% Marseille… It picks one. That's all generation ever is: billions of educated guesses in sequence.
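That weighted dice roll is literally one library call. Sampling the distribution from the text many times shows Paris dominating while the alternatives still occasionally appear:

```python
import random

dist = {"Paris": 0.97, "Lyon": 0.015, "Marseille": 0.005}
rng = random.Random(42)  # fixed seed so the demo is repeatable

# 1,000 independent "generations" of the blank token
samples = rng.choices(list(dist), weights=list(dist.values()), k=1000)
print(samples.count("Paris"))  # the overwhelming majority
```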
Curated links to the best papers, blog posts, videos, and interactive tools for each section above.