The Mechanical Psychology of Large Language Models
A visual, interactive guide. Understand everything from raw data to "thinking" models.
SYSTEM_INIT: TRUE | VOCAB_SIZE: 100,277 | MODE: EXPLAIN
Building ChatGPT happens in three distinct phases, and each one transforms the model fundamentally.
| | Phase 1: Pre-Training (Base Model) | Phase 2: Supervised Fine-Tuning (SFT) | Phase 3: Reinforcement Learning (RL) |
|---|---|---|---|
| Human Metaphor | Reading every textbook in the world. | Studying worked examples. | Solving practice problems via trial-and-error. |
| Data Input | 15 Trillion raw internet tokens. | 100,000+ human-written conversation logs. | Verifiable math, code, and logic problems. |
| Model Output | Document Simulator (Autocomplete). | Helpful Assistant (Imitating Experts). | Thinking Entity (Discovering Strategies). |
Download and preprocess the internet. The FineWeb pipeline collects and cleans ~15 trillion tokens from Common Crawl (2.7 billion web pages since 2007).
≈ 44 TB of cleaned text from the internet
The FineWeb Pipeline
Buy cheap watches!!! Click here → bit.ly/spam
██████ personal data ██████
Lorem ipsum dolor sit amet… {repeated 500x}
The transformer architecture was introduced in 2017 by Vaswani et al. in "Attention Is All You Need." It relies on self-attention mechanisms…
GPT doesn't read letters, it reads tokens. A token is a chunk of text mapped to a number. Text → UTF-8 Bytes → BPE Merges → Token IDs.
Type something to see it tokenized
One word split into 2 tokens!
Step-by-step for: "Hi"
total tokens in GPT-4's vocabulary
LLMs do not see characters. Text is compressed into token chunks, which creates blind spots.
ubiquitous
The model sees 3 token chunks, NOT 10 individual letters
LLMs do not see characters. Raw bits are compressed into a fixed vocabulary (GPT-4's 100,277 tokens) to save compute. Individual letters are lost inside token chunks.
Because letters are fused into token chunks, models routinely fail at: "count the Rs in strawberry" or "print every third character of ubiquitous."
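For contrast, both failing tasks are trivial for ordinary code, which really does see individual characters. A minimal sketch (taking "every third character" to mean positions 0, 3, 6, 9):

```python
word = "strawberry"
print(word.count("r"))   # counts actual characters: 3

text = "ubiquitous"
print(text[::3])         # characters at positions 0, 3, 6, 9: "uqts"
```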
Why spaces matter
The NN is a giant math function. Tokens go in → probabilities come out. The "knowledge" lives in billions of weight parameters.
Simplified Neural Network
What the NN really is
Nested multiplication & addition of weights,
with σ (activation functions) adding non-linearity.
Weight parameters (billions of these)
Each cell = one weight. Teal negative, purple positive.
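A toy version of that "giant math function": two inputs, two hidden units, one output. The weights below are made up purely for illustration; a real model has billions of them, but the shape of the computation (nested multiply-add with σ between layers) is the same.

```python
import math

def sigmoid(x):
    # σ: the activation function adding non-linearity
    return 1 / (1 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    # Nested multiplication & addition of weights, with σ applied in between
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

# Made-up weights, standing in for the billions in a real model
w_hidden = [[0.5, -1.2], [0.8, 0.3]]
w_out = [1.0, -0.7]
print(forward([1.0, 2.0], w_hidden, w_out))
```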
Introduced in 2017, the Transformer replaced recurrent networks with a revolutionary mechanism called self-attention, letting every token "look at" every other token in parallel.
Before Transformers, language models used RNNs, processing text one word at a time left to right, like reading a sentence in strict order. The problem: by the time the model reaches the end of a long sentence, it has "forgotten" the beginning. Transformers solved this by processing all tokens at once, letting every position attend to every other position simultaneously.
Self-Attention: "The animal didn't cross the street because it was too tired"
Attention scores reveal that "it" attends to "animal": the model learns grammatical co-reference without being told any grammar rules.
How Attention Works: Query · Key · Value
"What am I looking for?" The current token broadcasts what type of information it needs from other positions.
"What do I contain?" Every token advertises its content. Q·Kᵀ gives a raw relevance score between every pair of tokens.
"What do I pass along?" The actual information that gets mixed into the output, weighted by the softmax of the Q·K scores.
Scores are divided by √dk before the softmax; without this scaling, dot products grow with dimension, saturating the softmax and causing vanishing gradients.
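The whole mechanism, softmax(Q·Kᵀ/√dk)·V, fits in a few lines. A plain-Python sketch with a toy 3-token sequence and 2-dimensional vectors (all numbers made up):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q·Kᵀ / √dk) · V
    dk = len(K[0])
    out = []
    for q in Q:
        # Relevance score of this query against every key, scaled by √dk
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dk) for k in K]
        weights = softmax(scores)
        # Mix the value vectors, weighted by attention
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy 3-token sequence, 2-dimensional Q/K/V
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))
```

Each output row is a blend of all three value vectors, weighted by how well that token's query matched every key.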
Multi-Head Attention: Looking from Many Angles
GPT-4 uses 96 attention heads per layer, each free to specialize in a different linguistic relationship.
Show the NN sequences of tokens. Have it predict the next one. Adjust weights when it's wrong. Repeat billions of times.
The Training Loop
Next Token Prediction Example
If actual was "mat" → small loss. If predicted "roof" → big loss → bigger weight update.
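The loss behind that loop is cross-entropy: the negative log of the probability the model assigned to the token that actually came next. A quick illustration:

```python
import math

def loss(p_actual):
    # Cross-entropy for next-token prediction: -log(probability of the actual token)
    return -math.log(p_actual)

print(loss(0.77))  # model was confident in the right token: small loss (~0.26)
print(loss(0.01))  # model bet on the wrong token: big loss (~4.6)
```

The bigger the loss, the bigger the gradient, and the bigger the weight update.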
The NN outputs raw scores (logits). Softmax converts them into probabilities that sum to 1.
Drag the sliders → see probabilities update live
Logits (raw scores)
Probabilities (after softmax)
1. NN outputs logits: mat: 2.8, floor: 1.2, table: 0.5
2. Apply e^x: e^2.8 = 16.4, e^1.2 = 3.3, e^0.5 = 1.6
3. Divide by sum (21.3): mat: 77%, floor: 15%, table: 8%

GPT generates text one token at a time. Each new token is fed back in: autoregressive generation.
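The softmax walkthrough above is a few lines of Python (a minimal sketch using the same logits):

```python
import math

def softmax(logits):
    # Convert raw scores into probabilities that sum to 1
    exps = {tok: math.exp(x) for tok, x in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax({"mat": 2.8, "floor": 1.2, "table": 0.5})
print(probs)  # ≈ 77% mat, 15% floor, 8% table
```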
Autoregressive Token-by-Token Generation
Chat Demo: click to see it generate
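Autoregressive generation in miniature. The "model" below is just a hypothetical lookup table of next-token probabilities, but the loop has the real shape: sample a token, append it, feed the sequence back in.

```python
import random

# Stand-in for the neural network: a made-up table of next-token distributions
NEXT = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 1.0},
}

def generate(token, n, rng):
    out = [token]
    for _ in range(n):
        dist = NEXT.get(out[-1])
        if dist is None:
            break  # no continuation known for this token
        # Sample the next token, then feed it back in as input
        toks, weights = zip(*dist.items())
        out.append(rng.choices(toks, weights)[0])
    return out

print(generate("the", 3, random.Random(0)))
```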
Learning the weights
Expensive, done once, GPUs for weeks
Using the weights
Fast, done every time you chat
The neural network applies a strictly finite amount of processing power (layers) to predict each single token. No matter how hard the question, every next-token prediction gets the same fixed budget of computation.
The Core Problem
Imagine you're given 1 second to answer every question, whether it's "What's 2+2?" or "What's 17×24−156÷3?" Same time budget, wildly different difficulty.
That's exactly what happens inside GPT. The neural network runs through its layers once per token (a "forward pass"). A simple question and an impossibly hard question both get the exact same number of computational steps.
Example: "What is 17 × 24 − 156 ÷ 3?"
Path A: Single-token answer ✗
Model forced to cram multiply, divide, and subtract into one forward pass → overloaded → wrong answer
Path B: Step-by-step ✓
Each intermediate token gets its own forward pass → 3× more compute budget → correct
It's not magic, it's granting the model more compute. Every extra token the model writes is another full pass through billions of parameters. By forcing intermediate steps, you convert one impossible forward pass into many manageable ones. This is why "chain-of-thought" prompting dramatically improves accuracy on math, logic, and reasoning tasks.
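The Path B decomposition, made literal: each line computes one intermediate value, the way each extra token buys the model one more forward pass.

```python
step1 = 17 * 24         # multiply first: 408
step2 = 156 / 3         # divide next: 52.0
answer = step1 - step2  # subtract last
print(answer)           # 356.0
```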
The Base Model becomes the starting point. Its weights are further trained (fine-tuned) on hundreds of thousands of curated, multi-turn conversation logs crafted by human experts.
Raw, Unformatted Data (Base Model)
Unstructured: just continues patterns
Structured Multi-Turn Conversation (SFT)
By injecting special control tokens (<|im_start|>), the model learns a structured protocol. It statistically internalizes the "Persona" of a helpful, truthful, and harmless assistant by imitating the expert worked examples.
During SFT, models mimic human experts who confidently provide correct answers. But when there are gaps in their memory…
✓ Known
Who is Tom Cruise?
Who is Genghis Khan?
✗ Unknown
Who is Orson Kovats?
Hallucination
"He's a sci-fi writer."
"He's a minor league baseball player."
After SFT, the model can imitate experts. But imitation has a ceiling: you can only copy what humans already know. RL lets the model discover new strategies on its own through trial and error on problems with verifiable answers.
Human writes: "Q: What is 25×4? A: 100"
Model learns: copy that pattern.
Model tries 1000 solutions to "Solve X²−5X+6=0"
Reward: ✓ if answer = {2, 3}, ✗ otherwise
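That reward is just an automatic checker. A sketch, assuming the model's answer arrives as a list of proposed roots:

```python
def reward(roots):
    # +1 only when the roots of X² − 5X + 6 = 0 are exactly {2, 3}
    return 1 if set(roots) == {2, 3} else 0

print(reward([3, 2]))  # 1: order doesn't matter
print(reward([1, 6]))  # 0: wrong roots, no partial credit
```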
The RL Training Loop
Concrete Example: "Write a Python function that returns the nth Fibonacci number"
Attempt 1: Wrong logic
Test: fib(6) → 720 ≠ 8 → Reward: 0 → weights nudged AWAY from this path
Attempt 2: Crashes
Test: fib(6) → RecursionError → Reward: 0 → weights nudged AWAY
Attempt 47: Correct!
Test: fib(6) → 8 ✓, fib(10) → 55 ✓ → Reward: +1 → weights nudged TOWARD this path
Attempt 823: Discovered an optimization humans didn't teach it!
Test: all pass + faster → Reward: +1 → this efficient strategy gets reinforced
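The grader behind those attempts could look like this sketch: run the candidate against known cases (1-indexed, so fib(6) = 8 and fib(10) = 55) and emit a binary reward, with crashes scoring zero just like wrong answers.

```python
def verify(candidate):
    try:
        return 1 if candidate(6) == 8 and candidate(10) == 55 else 0
    except Exception:
        return 0  # a crash (like Attempt 2's RecursionError) earns no reward

def fib(n):
    # An iterative solution of the kind Attempt 47 might have produced
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(verify(fib))            # 1
print(verify(lambda n: 720))  # 0: Attempt 1's wrong logic
```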
What this looks like at the token level
Over millions of problems, the model learns which reasoning patterns lead to correct answers
RL only works when you can automatically check if the answer is right. Math has exact answers. Code can be run against test cases. That's why RL is applied to these domains first.
✓
Verifiable
Math, code, logic puzzles, chess
✗
Not Verifiable
Poetry, humor, summaries, advice
SFT is bottlenecked by human intelligence: a model can only be as good as the expert it imitates. RL changes this.
For unverifiable domains (poetry, jokes, summaries), we use RLHF: training a secondary AI to simulate human scoring.
Remember the RL section above? RL works when there's a verifiable answer: math has a correct solution, code either runs or doesn't. But what about tasks where "good" is subjective? Is this joke funny? Is this summary accurate? Is this response helpful? There's no equation to check. So OpenAI's solution: train a second neural network to pretend to be a human judge. This is RLHF, Reinforcement Learning from Human Feedback.
Step 1: Collect Human Preferences
The LLM generates multiple responses to the same prompt. Real humans rank them from best to worst. Example: "Write a pelican joke" → 5 different jokes → Humans rank Joke #3 > Joke #1 > Joke #5 > …
Step 2: Train a Reward Model
A separate, smaller neural network is trained on thousands of these human rankings. It learns to predict what a human would prefer. Given any LLM output, it produces a score from 0 to 1. It's an AI trying to imitate human taste.
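A minimal sketch of how such a reward model can be fit: each response is reduced to a made-up feature vector, and a Bradley-Terry-style preference loss pushes the preferred response's score above the rejected one's. (Real reward models are full transformers; this shows only the training signal in miniature.)

```python
import math

def score(w, x):
    # Linear reward model: higher score = "a human would prefer this"
    return sum(wi * xi for wi, xi in zip(w, x))

def train(pairs, dim, lr=0.1, steps=200):
    w = [0.0] * dim
    for _ in range(steps):
        for preferred, rejected in pairs:
            # P(preferred beats rejected) = sigmoid(score difference)
            diff = score(w, preferred) - score(w, rejected)
            p = 1 / (1 + math.exp(-diff))
            grad = 1 - p  # gradient of the log-likelihood w.r.t. the difference
            for i in range(dim):
                w[i] += lr * grad * (preferred[i] - rejected[i])
    return w

# Toy rankings: humans preferred responses high in feature 0, low in feature 1
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]
w = train(pairs, dim=2)
```

After training, the model scores the preferred responses above the rejected ones, which is all the RL stage needs from it.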
Step 3: Optimize the LLM Against the Reward Model
Now the main LLM is fine-tuned using RL, but instead of a math checker or a game engine, the "environment" is the Reward Model. The LLM generates text → the Reward Model scores it → the LLM adjusts its weights to get higher scores. This is the same RL loop, except the judge is fake.
Here's the fundamental problem: the Reward Model is not a real human. It's just another neural network with exploitable patterns. When you tell an RL agent to maximize a score, it will find every possible shortcut, including ones that look insane to humans.
Think of it like this: if a teacher grades essays by counting how many "smart-sounding" words appear, students will eventually stuff essays with jargon, gaming the rubric without writing anything meaningful. That's exactly what happens here.
The LLM discovers adversarial inputs: nonsensical token sequences that exploit blind spots in the Reward Model and trigger a perfect score, despite being complete gibberish to a real human.
Models trained heavily with RL (like DeepSeek R1) learn that higher accuracy requires massively long "Chains of Thought."
Standard ChatGPT-style models answer instantly: they blurt out the first plausible-sounding response. But researchers discovered something: if you train a model with RL (where it gets rewarded only for correct final answers), it naturally starts producing longer, more deliberate reasoning before answering. Nobody programmed it to "think step by step"; it figured out on its own that slowing down = more reward.
Standard Model (Fast but brittle)
"The answer is 177 dots."
Jumps straight from question to answer. Like a student guessing on an exam without showing work. Often wrong on hard problems, but sounds confident.
Thinking Model (Slow but highly accurate)
Let's break this down. First, count the outer ring… 1, 2, 3… that's 30. Now the inner ring… wait, let me recheck… 1, 2, 3… 28. So, outer is 30, inner is 28. Total = 30 + 28 = 58.
Works through the problem piece by piece. Catches its own mistakes. Like a student who actually shows their work: slower, but far more reliable.
This is the astonishing part: nobody taught the model these strategies. Backtracking ("wait, let me recheck…"), self-correction ("that doesn't add up…"), breaking problems into sub-steps: these are behaviors humans use when solving hard problems. The RL training process discovered them independently, purely because they lead to more correct answers. The model reinvented human problem-solving strategies from scratch.
LLMs don't compute; they pattern-match. Understanding this gap explains both their surprising reasoning power and their surprising arithmetic failures.
Ask an LLM to explain how mitosis works, debug a React component, or compare Keynesian vs. Austrian economics, and it'll do brilliantly. Ask it what 3,847 × 291 is, and it might confidently give you the wrong number. How can a system that reasons about philosophy fail at arithmetic?
The answer: math requires exactness; LLMs are optimized for probability. These are fundamentally different objectives.
LLMs Don't See Numbers: They See Tokens
What you think it sees
12345
one numeric quantity
What it actually sees
token chunks, no value attached
When an LLM "adds" two numbers, it isn't performing a calculation. It's generating tokens that look like the result of a calculation. For small numbers, probability aligns with correctness. As numbers grow larger or structures become unfamiliar, that alignment silently breaks.
Chain of Thought: What It Actually Does (and Doesn't)
✓ Why CoT improves accuracy
Every intermediate step written is another full forward pass through billions of parameters. By generating reasoning tokens, you hand the model more compute budget: the problem is distributed across many token predictions instead of crammed into one impossible step.
✗ What CoT can't do
CoT doesn't give the model a calculator. It encourages intermediate tokens that resemble reasoning steps. The chain can look flawless while the final number is wrong, or contain subtle errors that sound completely convincing.
"Let me calculate: 3847 × 291.
3847 × 200 = 769,400 ✓
3847 × 91 = 346,230 ✗ (forgot +3847×1)
Total = 1,115,630" → wrong intermediate → wrong result
Real-world reasoning is rarely about exact computation. It's about framing problems, breaking them down, comparing alternatives, and building coherent arguments. LLMs are trained on billions of examples of humans doing exactly this: in books, papers, debates, tutorials. They've absorbed the structure of thought.
✓ Decompose problems
"First consider X, then Y…"
✓ Spot inconsistencies
"That contradicts what you said…"
✓ Compare approaches
"Option A trades speed for accuracy…"
None of that requires exact arithmetic. It requires structure, language, and pattern recognition, which is exactly what LLMs are optimized for. They don't follow rules; they imitate patterns of rule-following. That difference matters a lot in math, but very little in reasoning.
The Fix: Division of Labor
This is why modern LLM systems pair language models with calculators, code interpreters, and search engines β each doing what it's actually built for.
An LLM has two fundamentally different types of "memory", and understanding the difference is the single most useful thing you can learn about using AI.
Imagine two scenarios: (A) Someone asks you about a book you read 6 months ago: you remember the gist, but details are fuzzy, and you might accidentally "remember" things that weren't actually in it. (B) Someone hands you the book open to the right page and says "read this paragraph and answer": now you're perfectly accurate. An LLM works exactly the same way, with two distinct memory systems.
The Parameters
(Long-term Memory: The Fuzzy One)
Weights (Billions of Parameters): Everything the model "learned" during training is compressed into these numbers. But it's lossy, like trying to memorize the entire internet. The model has a general sense of things, but specific details get blurry or mixed up. This is why it confidently tells you fake facts. Prone to hallucination.
The Context Window
(Working Memory: The Perfect One)
Context Window (Active Tokens): This is the text you put directly in the prompt β your question, pasted documents, conversation history. The model can see this perfectly, like reading off a page right in front of it. No guessing, no fuzzy recall. Zero hallucination on this data.
Most people use ChatGPT as a search engine: "Tell me about X", forcing the model to dig through its fuzzy long-term memory. Power users paste the actual document, data, or code into the prompt and say: "Given this, answer Y." The second approach is dramatically more reliable because you're using the model's perfect working memory instead of its unreliable long-term recall.
LLMs can't do mental arithmetic or recall niche facts reliably, so they emit special "Tool" tokens to call external programs.
Here's something most people don't realize: GPT cannot actually do math. It doesn't have a calculator inside it. When you ask "what's 3,847 × 291?", it's not computing; it's pattern-matching what a math answer looks like based on training data. For simple problems it often gets lucky. For anything complex, it silently gets it wrong. Same for: counting characters in a word, looking up today's stock price, or checking if code actually runs.
The solution? Give it hands. Modern LLMs are trained to recognize when they're about to hit their limits and output a special hidden token that says: "I need to call an external tool." It's like a person who knows they're bad at math pulling out a calculator.
Tool Use Flow
The model writes code → a real computer runs it → the result is pasted back into the model's context window → the model incorporates the exact answer into its response.
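A sketch of that flow, with a hypothetical TOOL:python: marker standing in for the model's special hidden token and a stubbed-out model (real systems use structured tool-call messages, not string prefixes):

```python
def model(prompt):
    # Stub for the LLM: it "recognizes" it needs a calculator and
    # emits a tool-call marker instead of guessing digits.
    return "TOOL:python:3847 * 291"

def run_turn(prompt):
    out = model(prompt)
    if out.startswith("TOOL:python:"):
        expr = out[len("TOOL:python:"):]
        # A real interpreter computes the exact answer.
        # (Never eval untrusted input outside a sandbox; this is a demo.)
        result = eval(expr)
        # The result is pasted back into context for the final reply.
        return f"3,847 × 291 = {result:,}"
    return out

print(run_turn("What is 3,847 × 291?"))  # 3,847 × 291 = 1,119,477
```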
When the model needs current information (today's weather, recent news, live prices), it searches the web and injects real-time results into its working memory. This turns the fuzzy "I think I remember…" into exact, verified facts.
Without it: "I believe the CEO is still John…" (could be outdated)
With it: Searches → finds current data → gives correct answer
When the model needs to compute, count, or process data, it writes Python code and runs it on a real computer. The result is deterministic: a calculator never gets arithmetic wrong.
Without it: "3847 × 291 = 1,119,377" (guessing, often wrong)
With code: print(3847 * 291) → 1,119,477 (always correct)
Now that you understand how GPT works under the hood, here are three practical rules that follow directly from the architecture. These aren't "prompting tips"; they're mechanical consequences of how the system is built.
Parameter weights are a blurry, lossy zip file.
Never test an LLM's memory. Instead, paste the actual documents, data, or source material directly into the prompt. The model's context window (working memory) is perfect; its parameter recall (long-term memory) is fuzzy. Treat it like a brilliant analyst who hasn't read the brief yet: hand them the brief.
Neural networks apply finite compute per token.
The model gets a fixed amount of "thinking" per output token. For complex questions, a one-word answer means almost no computation happened. Force it to think out loud ("explain step by step", "show your reasoning") to give it the compute budget it needs to get the right answer.
Tokens blind LLMs to spelling; architecture blinds them to math.
The model can't natively count letters, do arithmetic, or know what happened yesterday. It can call a calculator, run code, or search the web, but sometimes needs a nudge. Explicitly tell it when precision matters.
When ChatGPT says "I think…" or "I'm sorry, I don't know…", it feels like you're talking to a person. That's the illusion. You're not. You're watching the output of a very sophisticated pattern-matching engine that was trained on billions of examples of humans writing things. It has learned to produce text that looks like it comes from a thoughtful person, but there is no person in there.
What It Feels Like
A Sentient Oracle
that understands you
What It Actually Is
A Statistical Engine
flipping billions of biased coins
Every conversation starts from zero. The model has no memory of you, no ongoing thoughts, no identity between sessions. "It" doesn't exist when you're not prompting it. What seems like personality is just a statistical pattern.
Caveat: ChatGPT the product now has a "Memory" feature, but it's an application-layer trick. User facts are stored in a database and injected into the context window at the start of each chat. The model itself still starts from zero; it just gets handed a cheat sheet.
During training, the model was fine-tuned on examples written by human contractors who followed labeling guidelines ("be helpful, be harmless, be honest"). So when you prompt it, you're activating a simulation of those specific people following those specific instructions. It's roleplaying as a helpful assistant because that's the character it was trained to play.
Every single word it generates is the result of a probability distribution, like a weighted dice roll. "The capital of France is ___" → 97% Paris, 1.5% Lyon, 0.5% Marseille… It picks one. That's all generation ever is: billions of educated guesses in sequence.
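That weighted dice roll is literally one library call. Sampling the distribution from the text many times shows Paris dominating while the alternatives still occasionally appear:

```python
import random

dist = {"Paris": 0.97, "Lyon": 0.015, "Marseille": 0.005}
rng = random.Random(42)  # fixed seed so the demo is repeatable

# 1,000 independent "generations" of the blank token
samples = rng.choices(list(dist), weights=list(dist.values()), k=1000)
print(samples.count("Paris"))  # the overwhelming majority
```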
Curated links to the best papers, blog posts, videos, and interactive tools for each section above.