Inside the Token Tumbler

The Mechanical Psychology of Large Language Models

A visual, interactive guide. Understand everything from raw data to "thinking" models.

SYSTEM_INIT: TRUE  |  VOCAB_SIZE: 100,277  |  MODE: EXPLAIN

Overview

The Evolutionary Arc: Schooling a Statistical Engine

Building ChatGPT happens in three distinct phases — each one transforms the model fundamentally.

Phase 1: Pre-Training
(Base Model)
Phase 2: Supervised Fine-Tuning
(SFT)
Phase 3: Reinforcement Learning
(RL)
Human Metaphor: Reading every textbook in the world | Studying worked examples | Solving practice problems via trial-and-error
Data Input: 15 trillion raw internet tokens | 100,000+ human-written conversation logs | Verifiable math, code, and logic problems
Model Output: Document Simulator (Autocomplete) | Helpful Assistant (Imitating Experts) | Thinking Entity (Discovering Strategies)
Phase 1

Building the Internet Document Simulator

Download and preprocess the internet. The FineWeb pipeline collects and cleans ~15 trillion tokens from Common Crawl (2.7 billion web pages since 2007).

15T tokens

≈ 44 TB of cleaned text from the internet

The FineWeb Pipeline

🌐 Common Crawl (2.7B pages) → 🔗 URL Filtering → 📄 Text Extraction (strip HTML/CSS) → 🗣️ Language Filter (>65% English) → 🔍 Gopher Filtering → 🧬 MinHash Dedup → 🕵️ PII Removal → 💾 FineWeb (44TB / 15T tokens)
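A toy version of this pipeline can be sketched in a few lines of Python. Everything here is a deliberately crude stand-in: `looks_english` approximates FineWeb's language classifier with an ASCII-ratio heuristic, `scrub_pii` only catches e-mail addresses, and exact MD5 hashing stands in for MinHash near-duplicate detection.

```python
import hashlib
import re

def looks_english(text, threshold=0.65):
    """Crude stand-in for FineWeb's language classifier: the fraction
    of ASCII characters must clear the threshold."""
    return bool(text) and sum(c.isascii() for c in text) / len(text) > threshold

def scrub_pii(text):
    """Blank out obvious e-mail addresses (real pipelines do far more)."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REMOVED]", text)

def dedup(pages):
    """Exact-hash dedup; FineWeb's MinHash also catches near-duplicates."""
    seen, unique = set(), []
    for page in pages:
        h = hashlib.md5(page.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(page)
    return unique

def clean(pages):
    pages = [p for p in pages if looks_english(p)]
    pages = [scrub_pii(p) for p in pages]
    return dedup(pages)

corpus = [
    "The transformer architecture was introduced in 2017.",
    "The transformer architecture was introduced in 2017.",  # exact dupe
    "Contact me at alice@example.com for details.",
]
print(clean(corpus))  # 2 pages survive; the e-mail is gone
```

The real pipeline runs these stages over billions of pages; the order (filter, scrub, dedup) is the same idea at a vastly different scale.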
Key insight: The resulting "Base Model" (e.g., Llama 3 405B Base) is not an assistant. It is a pure, lossy statistical compression of the filtered internet. It cannot answer questions — it can only continue patterns.

❌ Before Filtering

Buy cheap watches!!! Click here → bit.ly/spam
███████ personal data ███████
Lorem ipsum dolor sit amet… {repeated 500x}

✅ After Filtering

The transformer architecture was introduced in 2017 by Vaswani et al. in "Attention Is All You Need." It relies on self-attention mechanisms…

🔗 Read the FineWeb Blog Post
Fundamentals

What Are Tokens?

GPT doesn't read letters — it reads tokens. A token is a chunk of text mapped to a number. Text → UTF-8 Bytes → BPE Merges → Token IDs.


Example: "Hello world"

Hello → 9906 | world → 1917

Example: "Tokenization"

Token → 3963 | ization → 2065

↑ One word split into 2 tokens!

Step-by-step for: "Hi"

Hi
↓
UTF-8 bytes: 72 105
↓
BPE merges common pairs → "Hi" = one token
↓
Token ID: 17250
100,277

total tokens in GPT-4's vocabulary
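The Text → UTF-8 Bytes → BPE Merges chain can be demonstrated with a toy byte-pair encoder. This is a sketch of the merge rule only, not GPT-4's real tokenizer or its actual merge table:

```python
def most_frequent_pair(ids):
    """Count every adjacent pair of ids in the sequence."""
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return max(counts, key=counts.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with one new token id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("Hi Hi Hi".encode("utf-8"))  # text -> UTF-8 bytes
print(ids[:2])  # [72, 105]: the bytes for "Hi"

pair = most_frequent_pair(ids)  # (72, 105): "H","i" is the commonest pair
ids = merge(ids, pair, 256)     # fuse it into the next free token id
print(ids)  # [256, 32, 256, 32, 256]: "Hi" is now a single token
```

Real tokenizers simply repeat this merge step roughly 100,000 times over a huge corpus, which is how the vocabulary grows from 256 raw bytes to 100,277 tokens.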

🔗 Try the TikTokenizer
Limitation

The Atoms of Thought: Why Models Can't Spell

LLMs do not see characters. Text is compressed into token chunks — which creates blind spots.

ubiquitous

[ubi]
[quit]
[ous]

↑ The model sees 3 token chunks, NOT 10 individual letters

The Tokenization Bottleneck

LLMs do not see characters. Raw text bytes are compressed into a fixed vocabulary (GPT-4's 100,277 tokens) to save compute. Individual letters are lost inside token chunks.

The Spelling Blindspot

Because letters are fused into token chunks, models routinely fail at: "count the Rs in strawberry" or "print every third character of ubiquitous."

Why spaces matter

hello + world = 2 tokens
hello_ + _world (with spaces) = entirely different token IDs
Core

Neural Network Internals

The NN is a giant math function. Tokens go in → probabilities come out. The "knowledge" lives in billions of weight parameters.

Simplified Neural Network

[Diagram: input token IDs (860, 287, 11579, …) flow through hidden layers to an output distribution — the 0.12, cat 0.41, dog 0.08, …100K more. Each connection has a "weight" — GPT-4 has ~1.8 trillion weights.]

What the NN really is

f(x) = σ(W₃ · σ(W₂ · σ(W₁ · x + b₁) + b₂) + b₃)

Nested multiplication & addition of weights,
with σ (activation functions) adding non-linearity.
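The nested formula can be written directly as code. A minimal sketch in pure Python, with tanh standing in for σ and tiny random matrices standing in for the billions of trained weights (a real language model would leave the last layer as raw logits for softmax instead of applying σ):

```python
import math
import random

def sigma(v):
    """Element-wise non-linearity (tanh stands in for sigma)."""
    return [math.tanh(x) for x in v]

def layer(W, b, x):
    """One W * x + b step: plain matrix-vector multiply plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def f(x, params):
    """f(x) = sigma(W3 * sigma(W2 * sigma(W1 * x + b1) + b2) + b3)"""
    for W, b in params:
        x = sigma(layer(W, b, x))
    return x

random.seed(0)

def rand_layer(n_out, n_in):
    """Tiny random weights stand in for billions of trained ones."""
    W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

params = [rand_layer(4, 3), rand_layer(4, 4), rand_layer(2, 4)]
print(f([0.5, -1.0, 2.0], params))  # two values in (-1, 1)
```

Training is nothing more than nudging the numbers inside `params` until `f` produces useful next-token scores.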

Weight parameters (billions of these)

Each cell = one weight. Teal negative, purple positive.

Architecture

The Transformer: Attention Is All You Need

Introduced in 2017, the Transformer replaced recurrent networks with a revolutionary mechanism called self-attention — letting every token "look at" every other token in parallel.

Why the Transformer Was Revolutionary

Before Transformers, language models used RNNs — processing text one word at a time left to right, like reading a sentence in strict order. The problem: by the time the model reaches the end of a long sentence, it has "forgotten" the beginning. Transformers solved this by processing all tokens at once, letting every position attend to every other position simultaneously.

Self-Attention: "The animal didn't cross the street because it was too tired"

[Attention visualization: the query token "it" attends most strongly to "animal", resolving the co-reference.]

Attention scores reveal that "it" attends to "animal" — the model learns grammatical co-reference without being told any grammar rules.

How Attention Works: Query Β· Key Β· Value

πŸ” Query (Q)

"What am I looking for?" β€” the current token broadcasts what type of information it needs from other positions.

πŸ—οΈ Key (K)

"What do I contain?" β€” every token advertises its content. QΒ·Kα΅€ gives a raw relevance score between every pair of tokens.

πŸ“¦ Value (V)

"What do I pass along?" β€” the actual information that gets mixed into the output, weighted by the softmax of the QΒ·K scores.

Attention(Q,K,V) = softmax(QKT / √dk) · V

Scores are scaled by √dk to prevent vanishing gradients in large dimensions.
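The formula fits in a few lines of pure Python. This sketch is a single attention head over toy 2-dimensional vectors: no batching, no causal masking, and no learned Q/K/V projection matrices.

```python
import math

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]  # numerically stable
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V, one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # raw relevance of this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # one attention row, sums to 1
        # mix the value vectors, weighted by attention
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# 3 tokens, d_k = 2: toy vectors invented for illustration
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.1, 0.0], [0.0, 0.1], [1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention(Q, K, V))
```

Each output row is a convex mixture of the value vectors; multi-head attention simply runs several copies of this with different learned projections and concatenates the results.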

Multi-Head Attention: Looking from Many Angles

Head 1: syntax · Head 2: co-reference · Head 3: semantics · … Head N: position
↓ concat + linear
Rich token representations

GPT-4 uses 96 attention heads per layer, each free to specialize in a different linguistic relationship.

⛔ Old: Recurrent Networks (RNN/LSTM)

  • Processes tokens one at a time (sequential)
  • Forgets distant context (vanishing gradient)
  • Cannot be parallelized → slow to train
  • Max useful context: ~1,000 tokens

✅ New: Transformer

  • Processes all tokens in parallel
  • Every token can attend to every other token
  • Massively parallelizable → enables GPU scaling
  • Context windows of 128K–1M+ tokens today
Why everything is now a Transformer: The parallel architecture maps perfectly onto GPU hardware. Training a 70B parameter model on RNNs would take years; on Transformers it takes weeks. This architectural choice is why scaling LLMs became feasible at all.
🔗 "Attention Is All You Need" — Original Paper (Vaswani et al. 2017)
Core

Training β€” How GPT Learns

Show the NN sequences of tokens. Have it predict the next one. Adjust weights when it's wrong. Repeat billions of times.

The Training Loop

1. Input Tokens — [The, cat, sat, on] → 2. NN Predicts — next token = ? → 3. Compare — predicted vs. actual → 4. Compute Loss — how wrong was it? → 5. Update Weights — backpropagation ↩

Next Token Prediction Example

The cat sat on the ___
mat 45% · floor 22% · table 15% · roof 8% · …100K others 10%

If actual was "mat" → small loss. If predicted "roof" → big loss → bigger weight update.

🔗 Interactive 3D LLM Visualization (bbycroft.net)
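The loop can be sketched numerically. A minimal, hand-rolled version of steps 3 through 5 for a single prediction, using the standard cross-entropy loss and its well-known gradient (probability minus one-hot) applied to the logits directly; a real trainer backpropagates this gradient through all the weights instead. The logit values are invented for illustration.

```python
import math

def softmax(logits):
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(probs, target_idx):
    """Cross-entropy: -log(probability given to the actual next token)."""
    return -math.log(probs[target_idx])

vocab = ["mat", "floor", "table", "roof"]
logits = [2.0, 1.0, 0.5, -1.0]   # step 2: raw scores for the next token
probs = softmax(logits)          # turn scores into probabilities

print(loss(probs, vocab.index("mat")))   # small loss: "mat" already likely
print(loss(probs, vocab.index("roof")))  # big loss: confident and wrong

# steps 4-5: nudge the logits with the softmax + cross-entropy gradient,
# d(loss)/d(logit_i) = prob_i - 1 when i is the target, else prob_i
target = vocab.index("mat")
lr = 0.5
logits = [z - lr * (p - (1.0 if i == target else 0.0))
          for i, (z, p) in enumerate(zip(logits, probs))]
print(softmax(logits)[target])  # probability of "mat" went up
```

One such update is tiny; pre-training repeats it across trillions of token positions.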
Core

Softmax — Raw Scores → Probabilities

The NN outputs raw scores (logits). Softmax converts them into probabilities that sum to 1.

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)

Worked example: logits in, probabilities out

Logits (raw scores): 2.8, 1.2, 0.5, −1.0, −2.5
↓ softmax
Probabilities: 75% · 15% · 8% · 2% · <1% (with all five logits in the sum)

Step-by-step example

1. NN outputs logits:

mat: 2.8 floor: 1.2 table: 0.5

2. Apply e^x:

e^2.8 = 16.4   e^1.2 = 3.3   e^0.5 = 1.6

3. Divide by sum (21.3):

mat: 77% floor: 15% table: 8%
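The same three-step calculation, as runnable code (the rounded values match the percentages above):

```python
import math

def softmax(logits):
    """softmax(z_i) = e^(z_i) / sum_j e^(z_j)"""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.8, 1.2, 0.5])     # logits for mat, floor, table
print([round(p, 3) for p in probs])  # [0.768, 0.155, 0.077]
print(sum(probs))                    # 1.0 (up to float rounding)
```

Subtracting the maximum logit before exponentiating (as production code does) leaves the result unchanged but avoids overflow on large logits.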
Core

Inference β€” Generating Text

GPT generates text one token at a time. Each new token is fed back in — autoregressive generation.

Autoregressive Token-by-Token Generation

Step 1: The cat sat on
Step 2: The cat sat on the
Step 3: The cat sat on the mat
Step 4: The cat sat on the mat.
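A sketch of that loop, with a hard-coded bigram lookup table standing in for the real network's 100K-way probability distribution (the table and probabilities are invented for illustration):

```python
import random

# Toy "model": next-token probabilities for each current token
model = {
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"on": 1.0},
    "ran": {".": 1.0},
    "on":  {"the": 1.0},
    "the": {"mat": 0.7, "floor": 0.3},
    "mat": {".": 1.0},
    "floor": {".": 1.0},
}

def generate(prompt, max_tokens=10, seed=0):
    """Autoregressive loop: sample one token, append it, feed it back in."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = model.get(tokens[-1])
        if dist is None:  # "." has no continuation: stop
            break
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens

print(" ".join(generate(["The"])))  # e.g. "The cat sat on the mat ."
```

Note the only state between steps is the growing token list itself, which is exactly why the context window is the model's entire working memory.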


Training

Learning the weights
Expensive, done once, GPUs for weeks

Inference

Using the weights
Fast, done every time you chat

Limitation

The Token-Compute Limit: Models Need Space to Think

The neural network applies a strictly finite amount of processing power (layers) to predict each single token. No matter how hard the question, every next-token prediction gets the same fixed budget of computation.

The Core Problem

Imagine you're given 1 second to answer every question — whether it's "What's 2+2?" or "What's 17×24−156÷3?" Same time budget, wildly different difficulty.

That's exactly what happens inside GPT. The neural network runs through its layers once per token (a "forward pass"). A simple question and an impossibly hard question both get the exact same number of computational steps.

Example: "What is 17 × 24 − 156 ÷ 3?"

Path A: Single-token answer ❌

[372]

Model forced to cram multiply, divide, and subtract into one forward pass → overloaded → wrong answer

Path B: Step-by-step ✅

[17][×][24][=][408]
[156][÷][3][=][52]
[408][−][52][=][356]

Each intermediate token gets its own forward pass → 3× more compute budget → correct

Why "Think step-by-step" actually works

It's not magic — it's granting the model more compute. Every extra token the model writes is another full pass through billions of parameters. By forcing intermediate steps, you convert one impossible forward pass into many manageable ones. This is why "chain-of-thought" prompting dramatically improves accuracy on math, logic, and reasoning tasks.
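Path B, executed literally: each assignment below plays the role of one batch of intermediate tokens, with the sub-result written down before the next step begins.

```python
# "What is 17 * 24 - 156 / 3?", decomposed as in Path B
step1 = 17 * 24          # 408
step2 = 156 // 3         # 52
answer = step1 - step2   # 408 - 52 = 356
print(answer)            # 356
```

The single-token Path A is the equivalent of demanding `print(17 * 24 - 156 // 3)` be evaluated with no scratch space at all.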

Rule of Operation: Complex reasoning must be distributed across a long sequence of intermediate tokens. Force the model to "show its work" to grant it the compute time to succeed.
Phase 2

Supervised Fine-Tuning (SFT)

The Base Model becomes the starting point. Its weights are further trained (fine-tuned) on hundreds of thousands of curated, multi-turn conversation logs crafted by human experts.

Raw, Unformatted Data (Base Model)

token chunk data · text blob · raw html · noise · mess

Unstructured — just continues patterns

Structured Multi-Turn Conversation (SFT)

<|im_start|>user
What is 2+2?
<|im_end|>
<|im_start|>assistant
2 + 2 is 4.
<|im_end|>

The Persona Shift

By injecting special control tokens (<|im_start|>), the model learns a structured protocol. It statistically internalizes the "Persona" of a helpful, truthful, and harmless assistant by imitating the expert worked examples.
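A sketch of how such a conversation is serialized before training or inference. The exact protocol varies by model family; this follows the ChatML-style tokens shown in the example above:

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render(conversation):
    """Serialize (role, text) turns into the control-token protocol."""
    parts = [f"{IM_START}{role}\n{text}\n{IM_END}"
             for role, text in conversation]
    # A trailing assistant header cues the model to generate the reply:
    parts.append(f"{IM_START}assistant\n")
    return "\n".join(parts)

print(render([("user", "What is 2+2?")]))
```

To the network this is still just next-token prediction; the control tokens are simply extra vocabulary entries whose statistics it learned during SFT.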

Limitation

The Hallucination Reflex: The Urge to Imitate Confidence

During SFT, models mimic human experts who confidently provide correct answers. But when there are gaps in its memory…

✅ Known

Who is Tom Cruise?

Who is Genghis Khan?

→

❓ Unknown

Who is Orson Kovats?

→

🎭 Hallucination

"He's a sci-fi writer."

"He's a minor league baseball player."

Key Insight: When faced with a gap in its parameter memory, an unmitigated model doesn't know how to say "I don't know." It statistically imitates the confident tone of its training data. Modern models require deliberate "knowledge boundary" probing to learn the refusal reflex.
Phase 3

Reinforcement Learning (RL)

After SFT, the model can imitate experts. But imitation has a ceiling — you can only copy what humans already know. RL lets the model discover new strategies on its own through trial and error on problems with verifiable answers.

🎓 SFT — Learning by Imitation

Human writes: "Q: What is 25×4? A: 100"
Model learns: copy that pattern.

Ceiling = Best human example in the dataset

🎯 RL — Learning by Doing

Model tries 1000 solutions to "Solve X²−5X+6=0"
Reward: ✅ if answer = {2,3}   ❌ otherwise

Ceiling = None — model can surpass humans

The RL Training Loop

1. Pick a Problem — with a known answer → 2. Generate Many — 1000+ attempts → 3. Grade Each — correct or wrong? → 4. Reward / Penalize — reinforce ✅ paths → 5. Update Weights — make ✅ more likely ↩

Concrete Example: "Write a Python function that returns the nth Fibonacci number"

❌

Attempt 1 — Wrong logic

def fib(n):
  if n <= 1: return 1
  return n * fib(n-1)   # that's factorial, not Fibonacci!

Test: fib(6) → 720 ≠ 8 → Reward: 0 — weights nudged AWAY from this path

❌

Attempt 2 — Crashes

def fib(n):
  return fib(n-1) + fib(n-2)   # no base case → infinite recursion

Test: fib(6) → RecursionError → Reward: 0 — weights nudged AWAY

✅

Attempt 47 — Correct!

def fib(n):
  if n <= 1: return n          # base case
  return fib(n-1) + fib(n-2)   # correct recursion

Test: fib(6) → 8 ✅ fib(10) → 55 ✅ → Reward: +1 — weights nudged TOWARD this path

⭐

Attempt 823 — Discovered an optimization humans didn't teach it!

def fib(n):
  a, b = 0, 1            # O(n) iterative
  for _ in range(n):
    a, b = b, a + b
  return a               # faster, no stack overflow

Test: all pass + faster → Reward: +1 — this efficient strategy gets reinforced

What this looks like at the token level

[def][fib][(n)][return][n*][fib...] ❌ Wrong answer → penalize
[def][fib][(n)][return][fib(n-1)][+fib...] ❌ Crashes → penalize
[def][fib][(n)][if][n<=1][return][n][...] ✅ Correct → reinforce
[def][fib][(n)][a,b][=0,1][for][...] ⭐ Novel strategy → reinforce strongly

Over millions of problems, the model learns which reasoning patterns lead to correct answers
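The grading step can be sketched directly: run each candidate against verifiable test cases, catch crashes, and emit a 0/1 reward. The two attempts below mirror Attempt 1 and Attempt 47; the wrong-logic one is given a base case here so it terminates and returns factorials, as its test output assumes.

```python
def grade(candidate):
    """Reward = 1 only if every verifiable test case passes; crashes = 0."""
    tests = [(0, 0), (1, 1), (6, 8), (10, 55)]
    try:
        return 1 if all(candidate(n) == want for n, want in tests) else 0
    except Exception:  # RecursionError, TypeError, ...
        return 0

def attempt_wrong(n):
    # Attempt 1's logic (plus a base case so it terminates): factorial!
    return 1 if n <= 1 else n * attempt_wrong(n - 1)

def attempt_correct(n):
    if n <= 1:
        return n
    return attempt_correct(n - 1) + attempt_correct(n - 2)

print(grade(attempt_wrong), grade(attempt_correct))  # 0 1
```

Because the grader is fully automatic, it can score millions of attempts with no human in the loop, which is what makes verifiable-domain RL scale.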

🔑 Why "verifiable" is the key word

RL only works when you can automatically check if the answer is right. Math has exact answers. Code can be run against test cases. That's why RL is applied to these domains first.

✅

Verifiable

Math, code, logic puzzles, chess

❌

Not Verifiable

Poetry, humor, summaries, advice

The Mechanism: By generating thousands of attempts and reinforcing only the ones that produce correct answers, the model independently discovers which cognitive strategies actually work — including strategies no human ever taught it.
Potential

Breaking the Human Ceiling: The "Move 37" Potential

SFT is bottlenecked by human intelligence — a model can only be as good as the expert it imitates. RL changes this.

[Chart: Skill Level vs. Training Time/Data — SFT plateaus at the Human Expert Ceiling, while RL keeps climbing past it (★ Move 37).]
The RL Advantage: RL optimizes for the outcome (winning, solving) rather than the process (imitating). It discovers alien, highly efficient strategies — paths of logic completely unknown to human experts.
Caveat

The RLHF Illusion: Gaming the Simulator

For unverifiable domains (poetry, jokes, summaries), we use RLHF — training a secondary AI to simulate human scoring.

Why RLHF Exists

Remember the RL section above? RL works when there's a verifiable answer — math has a correct solution, code either runs or doesn't. But what about tasks where "good" is subjective? Is this joke funny? Is this summary accurate? Is this response helpful? There's no equation to check. So OpenAI's solution: train a second neural network to pretend to be a human judge. This is RLHF — Reinforcement Learning from Human Feedback.

The 3-Step RLHF Pipeline

👀 1. Human ranks 5 pelican jokes ⇒ 🤖 2. Reward Model simulates human tastes ⇒ 🎯 3. LLM optimizes against the Reward Model

Step 1 — Collect Human Preferences

The LLM generates multiple responses to the same prompt. Real humans rank them from best to worst. Example: "Write a pelican joke" → 5 different jokes → Humans rank Joke #3 > Joke #1 > Joke #5 > …

Step 2 — Train a Reward Model

A separate, smaller neural network is trained on thousands of these human rankings. It learns to predict what a human would prefer. Given any LLM output, it produces a score from 0 to 1. It's an AI trying to imitate human taste.

Step 3 — Optimize the LLM Against the Reward Model

Now the main LLM is fine-tuned using RL — but instead of a math checker or a game engine, the "environment" is the Reward Model. The LLM generates text → the Reward Model scores it → the LLM adjusts its weights to get higher scores. This is the same RL loop, except the judge is fake.
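Step 2 can be sketched with a deliberately crude reward model: two hand-picked text features stand in for a neural network's learned representation, and the pairwise rankings are fit with the Bradley-Terry logistic loss (the standard formulation for preference learning). The jokes, features, and learning rate are all invented for illustration.

```python
import math

def features(text):
    """Toy featurizer: length and '!' count stand in for a neural
    reward model's learned representation."""
    return [len(text) / 100, text.count("!")]

w = [0.0, 0.0]  # reward-model weights, learned from rankings below

def reward(text):
    return sum(wi * fi for wi, fi in zip(w, features(text)))

# Human preference data: (preferred joke, rejected joke)
prefs = [
    ("Why did the pelican cross the road? To prove he wasn't chicken.",
     "pelican joke!!!"),
] * 50

lr = 0.5
for good, bad in prefs:
    # Bradley-Terry: P(good beats bad) = sigmoid(r_good - r_bad)
    margin = reward(good) - reward(bad)
    p = 1.0 / (1.0 + math.exp(-margin))
    grad = p - 1.0  # logistic-loss gradient: push the margin up
    for i, (fg, fb) in enumerate(zip(features(good), features(bad))):
        w[i] -= lr * grad * (fg - fb)

print(reward("a long joke with a real setup and punchline"))  # higher
print(reward("joke!!!"))                                      # lower
```

Note what it actually learned: "longer is better, exclamation marks are worse." A policy optimized against this proxy will pad its outputs, which is precisely the gaming failure described next.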

The Adversarial Cliff β€” Why This Breaks

Here's the fundamental problem: the Reward Model is not a real human. It's just another neural network with exploitable patterns. When you tell an RL agent to maximize a score, it will find every possible shortcut β€” including ones that look insane to humans.

Think of it like this: if a teacher grades essays by counting how many "smart-sounding" words appear, students will eventually stuff essays with jargon β€” gaming the rubric without writing anything meaningful. That's exactly what happens here.

The LLM discovers adversarial inputs β€” nonsensical token sequences that exploit blind spots in the Reward Model and trigger a perfect score, despite being complete gibberish to a real human.

"the the the the the" = Reward Model Score: 1.0 (Perfect) 🀯
A human would score this 0. The Reward Model is fooled.
Bottom Line: RLHF is a useful but fragile fine-tuning trick. It makes models sound more helpful and polite, but it's not true intelligence improvement. The model is learning to please a simulated judge, not to genuinely reason better. This is why RLHF models need constant guardrails and why companies keep the reward model tightly constrained.
Frontier

The Emergence of 'Thinking' Models

Models trained heavily with RL (like DeepSeek R1) learn that higher accuracy requires massively long "Chains of Thought."

What Changed?

Standard ChatGPT-style models answer instantly — they blurt out the first plausible-sounding response. But researchers discovered something: if you train a model with RL (where it gets rewarded only for correct final answers), it naturally starts producing longer, more deliberate reasoning before answering. Nobody programmed it to "think step by step" — it figured out on its own that slowing down = more reward.

The Difference in Practice

Standard Model (Fast but brittle)

"The answer is 177 dots."

Jumps straight from question to answer. Like a student guessing on an exam without showing work. Often wrong on hard problems, but sounds confident.

Thinking Model (Slow but highly accurate)

Let's break this down. First, count the outer ring… 1, 2, 3… that's 30. Now the inner ring… wait, let me recheck… 1, 2, 3… 28. So, outer is 30, inner is 28. Total = 30 + 28 = 58.

<think> Wait, let me reevaluate… If I backtrack here… Setting up an equation… </think>

Works through the problem piece by piece. Catches its own mistakes. Like a student who actually shows their work — slower, but far more reliable.

Why "Emergent"?

This is the astonishing part: nobody taught the model these strategies. Backtracking ("wait, let me recheck…"), self-correction ("that doesn't add up…"), breaking problems into sub-steps — these are behaviors humans use when solving hard problems. The RL training process discovered them independently, purely because they lead to more correct answers. The model reinvented human problem-solving strategies from scratch.

Key Insight: The optimization process naturally discovers human-like cognitive strategies — backtracking, double-checking, reframing — without any human explicitly hardcoding these behaviors. More thinking tokens = more compute = better answers.
Reasoning

Chain of Thought: Why LLMs Are Bad at Math but Great at Reasoning

LLMs don't compute — they pattern-match. Understanding this gap explains both their surprising reasoning power and their surprising arithmetic failures.

The Paradox

Ask an LLM to explain how mitosis works, debug a React component, or compare Keynesian vs. Austrian economics — it'll do brilliantly. Ask it what 3,847 × 291 is, and it might confidently give you the wrong number. How can a system that reasons about philosophy fail at arithmetic?

The answer: math requires exactness; LLMs are optimized for probability. These are fundamentally different objectives.

LLMs Don't See Numbers — They See Tokens

What you think it sees

12345

one numeric quantity

What it actually sees

[123][45]

token chunks — no value attached

When an LLM "adds" two numbers, it isn't performing a calculation. It's generating tokens that look like the result of a calculation. For small numbers, probability aligns with correctness. As numbers grow larger or structures become unfamiliar, that alignment silently breaks.

🔒 Symbolic Math (Calculators)

  • Manipulates symbols with strict rules
  • Result is guaranteed correct if rules apply
  • Zero tolerance for approximation
  • Can execute — but cannot explain

🧠 Neural Reasoning (LLMs)

  • Learns patterns of rule-following from data
  • Result is statistically likely, not guaranteed
  • Excellent at fuzzy, contextual, language-driven tasks
  • Can explain, compare, and adapt β€” flexibly

Chain of Thought — What It Actually Does (and Doesn't)

✓ Why CoT improves accuracy

Every intermediate step written is another full forward pass through billions of parameters. By generating reasoning tokens, you hand the model more compute budget — the problem is distributed across many token predictions instead of crammed into one impossible step.

Not magic — it's more compute. "Think step by step" grants extra forward passes, each refining the answer further.

✗ What CoT can't do

CoT doesn't give the model a calculator. It encourages intermediate tokens that resemble reasoning steps. The chain can look flawless while the final number is wrong — or contain subtle errors that sound completely convincing.

"Let me calculate: 3847 × 291.
3847 × 200 = 769,400 ✓
3847 × 91 = 346,230 ✗ (forgot +3847×1)
Total = 1,115,630" ← wrong intermediate → wrong result

Why LLMs Are Still Excellent Reasoners

Real-world reasoning is rarely about exact computation. It's about framing problems, breaking them down, comparing alternatives, and building coherent arguments. LLMs are trained on billions of examples of humans doing exactly this — in books, papers, debates, tutorials. They've absorbed the structure of thought.

✓ Decompose problems

"First consider X, then Y…"

✓ Spot inconsistencies

"That contradicts what you said…"

✓ Compare approaches

"Option A trades speed for accuracy…"

None of that requires exact arithmetic. It requires structure, language, and pattern recognition — which is exactly what LLMs are optimized for. They don't follow rules; they imitate patterns of rule-following. That difference matters a lot in math, but very little in reasoning.

The Fix: Division of Labor

🧠 LLM handles: framing, explanation, decision-making
+
🖥️ Tools handle: precision, guarantees, exact computation
=
⭐ Best of both: reliable, explainable, and exact

This is why modern LLM systems pair language models with calculators, code interpreters, and search engines β€” each doing what it's actually built for.

Bottom Line: LLMs aren't bad at math because they're unintelligent. They're bad at math because math demands exactness, and LLMs are built for probability. They reason well because reasoning in the real world is fuzzy, contextual, and language-driven. Pair them with the right tools, and that difference becomes a strength, not a weakness.
🔗 Why LLMs Are Bad at Math but Great at Reasoning — Jainul Trivedi
Architecture

Cognitive Architecture: Vague Recollection vs. Working Memory

An LLM has two fundamentally different types of "memory" — and understanding the difference is the single most useful thing you can learn about using AI.

The Human Analogy

Imagine two scenarios: (A) Someone asks you about a book you read 6 months ago — you remember the gist, but details are fuzzy, and you might accidentally "remember" things that weren't actually in it. (B) Someone hands you the book open to the right page and says "read this paragraph and answer" — now you're perfectly accurate. An LLM works exactly the same way, with two distinct memory systems.

The Parameters
(Long-term Memory — The Fuzzy One)

🧠

Weights (Billions of Parameters): Everything the model "learned" during training is compressed into these numbers. But it's lossy — like trying to memorize the entire internet. The model has a general sense of things, but specific details get blurry or mixed up. This is why it confidently tells you fake facts. Prone to hallucination.

Example: "What year was X founded?" → Model recalls ~2015 from fuzzy memory → might say 2014 or 2016 with full confidence

The Context Window
(Working Memory — The Perfect One)

📋

Context Window (Active Tokens): This is the text you put directly in the prompt — your question, pasted documents, conversation history. The model can see this perfectly, like reading off a page right in front of it. No guessing, no fuzzy recall. Zero hallucination on this data.

Example: "Here's the Wikipedia article: [paste]. What year was X founded?" → Model reads directly → answers correctly every time

Why This Matters for You

Most people use ChatGPT as a search engine: "Tell me about X" — forcing the model to dig through its fuzzy long-term memory. Power users paste the actual document, data, or code into the prompt and say: "Given this, answer Y." The second approach is dramatically more reliable because you're using the model's perfect working memory instead of its unreliable long-term recall.

Rule of Thumb: Never ask a model to recall facts from memory when you can simply paste the source material into the prompt. Context window = reliable. Parameters = fuzzy guessing.
Capabilities

Cognitive Prosthetics: Bypassing the Network's Flaws

LLMs can't do mental arithmetic or recall niche facts reliably — so they emit special 'Tool' tokens to call external programs.

Why Tools Exist

Here's something most people don't realize: GPT cannot actually do math. It doesn't have a calculator inside it. When you ask "what's 3,847 × 291?", it's not computing — it's pattern-matching what a math answer looks like based on training data. For simple problems it often gets lucky. For anything complex, it silently gets it wrong. Same for: counting characters in a word, looking up today's stock price, or checking if code actually runs.

The solution? Give it hands. Modern LLMs are trained to recognize when they're about to hit their limits and output a special hidden token that says: "I need to call an external tool." It's like a person who knows they're bad at math pulling out a calculator.

How Tool Use Actually Works

Tool Use Flow

💬 1. Prompt Input & Working Memory — "How many dots? [177]"
→
🧠 2. LLM Engine & Tool Decision — <|python_start|>
→
🖥️ 3. External Terminal & Execution — > len(dots) → 177
→
💉 4. Inject Answer — 177

The model writes code → a real computer runs it → the result is pasted back into the model's context window → the model incorporates the exact answer into its response.
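The flow can be sketched as a harness around a stand-in "model". Everything here is hypothetical: `fake_llm` is a two-line stub where a real network would generate the tokens, the closing `<|python_end|>` token name is assumed, and `exec()` stands in for a proper sandbox.

```python
import contextlib
import io
import re

def fake_llm(context):
    """Hypothetical stub standing in for the model: it 'decides' to emit
    a tool-call token instead of guessing at arithmetic."""
    if "TOOL_RESULT" in context:
        result = re.search(r"TOOL_RESULT: (\S+)", context).group(1)
        return f"The answer is {result}."
    return "<|python_start|>print(3847 * 291)<|python_end|>"

def run_turn(prompt):
    context = prompt
    reply = fake_llm(context)
    match = re.search(r"<\|python_start\|>(.*?)<\|python_end\|>", reply)
    if match:
        # a real system runs this in a sandbox; exec() is only for the sketch
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(match.group(1))
        # inject the tool output back into the context window
        context += f"\nTOOL_RESULT: {buf.getvalue().strip()}"
        reply = fake_llm(context)
    return reply

print(run_turn("What is 3847 x 291?"))  # The answer is 1119477.
```

The key design point: the model never "knows" the answer. It only ever sees the tool's output sitting in its context window, where recall is perfect.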

The Two Main Prosthetics

πŸ” Web Search

When the model needs current information (today's weather, recent news, live prices), it searches the web and injects real-time results into its working memory. This turns the fuzzy "I think I remember…" into exact, verified facts.

Without it: "I believe the CEO is still John…" (could be outdated)
With it: Searches β†’ finds current data β†’ gives correct answer

🐍 Code Interpreter

When the model needs to compute, count, or process data, it writes Python code and runs it on a real computer. The result is deterministic β€” a calculator never gets arithmetic wrong.

Without it: "3847 Γ— 291 = 1,119,377" (guessing β€” often wrong)
With code: print(3847 * 291) β†’ 1,119,477 (always correct)

Practical Tip: If your task involves math, counting, dates, or current facts β€” explicitly tell the model to use tools. Say "use Python to calculate" or "search the web for this." Don't trust the model's fuzzy internal abilities for anything requiring precision.
Practical

The Operator's Manual: Prompting for Mechanical Realities

Now that you understand how GPT works under the hood, here are three practical rules that follow directly from the architecture. These aren't "prompting tips" β€” they're mechanical consequences of how the system is built.

Rule 1: Feed It, Don't Quiz It

Parameter weights are a blurry, lossy zip file.

Never test an LLM's memory. Instead, paste the actual documents, data, or source material directly into the prompt. The model's context window (working memory) is perfect — its parameter recall (long-term memory) is fuzzy. Treat it like a brilliant analyst who hasn't read the brief yet: hand them the brief.

❌ "What did the Q3 report say about revenue?"
✅ "Here's the Q3 report: [paste]. What does it say about revenue?"

Rule 2: Make It Show Its Work

Neural networks apply finite compute per token.

The model gets a fixed amount of "thinking" per output token. For complex questions, a one-word answer means almost no computation happened. Force it to think out loud — "explain step by step", "show your reasoning" — to give it the compute budget it needs to get the right answer.

❌ "Is this contract risky? Answer yes or no."
✅ "Analyze this contract clause by clause. For each, explain the risk. Then give your overall assessment."

Rule 3: Tell It to Use Tools

Tokens blind LLMs to spelling; architecture blinds them to math.

The model can't natively count letters, do arithmetic, or know what happened yesterday. It can call a calculator, run code, or search the web — but it sometimes needs a nudge. Explicitly tell it when precision matters.

❌ "How many r's in 'strawberry'?"
✅ "Use Python to count how many r's are in 'strawberry'."
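The second prompt works because Python operates on characters, not tokens. The code the model would emit and have executed is a one-liner:

```python
word = "strawberry"
print(word.count("r"))                              # 3
print([i for i, c in enumerate(word) if c == "r"])  # [2, 7, 8]
```

The tokenized model guesses; the interpreter just counts.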
Reality

Dispel the Magic: You Are Talking to a Simulation

The Core Misconception

When ChatGPT says "I think…" or "I'm sorry, I don't know…" — it feels like you're talking to a person. That's the illusion. You're not. You're watching the output of a very sophisticated pattern-matching engine that was trained on billions of examples of humans writing things. It has learned to produce text that looks like it comes from a thoughtful person — but there is no person in there.

What It Feels Like

🧠

A Sentient Oracle
that understands you

What It Actually Is

🎰

A Statistical Engine
flipping billions of biased coins

No Persistent Self

Every conversation starts from zero. The model has no memory of you, no ongoing thoughts, no identity between sessions. "It" doesn't exist when you're not prompting it. What seems like personality is just a statistical pattern.

Caveat: ChatGPT the product now has a "Memory" feature — but it's an application-layer trick. User facts are stored in a database and injected into the context window at the start of each chat. The model itself still starts from zero; it just gets handed a cheat sheet.

Simulating a Contractor

During training, the model was fine-tuned on examples written by human contractors who followed labeling guidelines ("be helpful, be harmless, be honest"). So when you prompt it, you're activating a simulation of those specific people following those specific instructions. It's roleplaying as a helpful assistant because that's the character it was trained to play.

Biased Coin Flips

Every single word it generates is the result of a probability distribution — like a weighted dice roll. "The capital of France is ___" → 97% Paris, 1.5% Lyon, 0.5% Marseille… It picks one. That's all generation ever is: billions of educated guesses in sequence.

Why This Matters: Understanding that you're operating a tool, not conversing with a being, changes how you use it. You stop asking "does it understand me?" and start asking "how do I structure this input to get the best statistical output?" That shift in mindset is what separates casual users from power users.
Learn More

Resources β€” Go Deeper on Every Topic

Curated links to the best papers, blog posts, videos, and interactive tools for each section above.

📥 Pretraining & Data

🔀 Tokenization

📊 Training, Softmax & Inference

🎯 RL, RLHF & Thinking Models

🛠️ Tool Use & Prompting

🎰 Big Picture & Philosophy