
How Large Language Models Actually Work (Explained Simply)
Understanding how ChatGPT and Claude work—even at a high level—helps you use them more effectively and spot their limitations.
No mathematics required: This explanation focuses on concepts, not equations. You'll understand how ChatGPT and Claude work without needing a computer science background.
Imagine you're at a party, and someone asks you to continue this sentence: "The cat sat on the..." Your mind immediately generates possibilities—mat, floor, chair, windowsill. You're not consulting a database of cat-sitting locations. You're drawing on patterns you've absorbed from years of reading and conversation, predicting what words typically follow in that context.
This is, fundamentally, what large language models do. The difference is scale. Where you've read thousands of books and articles, these models have processed trillions of words. Where you might consider a handful of possibilities, they evaluate probabilities across their entire vocabulary. But the core mechanism—predicting what comes next based on learned patterns—is surprisingly similar.
Understanding this mechanism, even at a high level, transforms how you use these tools. It clarifies why they excel at certain tasks and fail spectacularly at others. It explains behaviors that seem mysterious at first—the confident fabrications, the inconsistent outputs, the occasional brilliant insights mixed with basic errors. Once you understand what's actually happening under the hood, these tools become both more useful and less mysterious.
The Core Mechanism: A Sophisticated Autocomplete
When you type on your phone and it suggests the next word, that's a miniature version of what large language models do. Your phone's keyboard has learned from your texting patterns and common phrases to predict what you'll type next. A large language model works on the same principle, but it has been trained on such a vast corpus of text that it can predict not just common phrases, but complex arguments, code snippets, creative narratives, and technical explanations.
Here's what actually happens when you ask ChatGPT or Claude a question. The model receives your prompt and begins generating a response one small piece at a time. Each piece—called a token, which might be a word, part of a word, or a punctuation mark—is selected based on what the model predicts should come next given everything that's come before.
At each step, the model doesn't just pick the single most likely next token. Instead, it calculates a probability distribution across its entire vocabulary of tens of thousands of tokens. "The" might have a 15% probability of coming next, "a" might have 12%, "this" might have 8%, and so on. The model then samples from this distribution, which is why the same prompt can produce different responses on different runs.
Prompt: "The capital of France is"
Probability Distribution:
Selected token: "Paris" (sampled from this distribution)
This process repeats token by token until the model generates a stopping signal or reaches a length limit. The response you receive—whether it's a single sentence or several paragraphs—emerged from this iterative prediction process, each token building on all the tokens that came before.
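To make the loop concrete, here is a minimal sketch in Python. The toy "model" below is a hand-written lookup table standing in for a real neural network, and the tiny vocabulary and probabilities are invented for illustration; the point is the shape of the process: predict a distribution, sample a token, append it, repeat until a stopping signal.

```python
import random

# Toy stand-in for a trained model: maps a context string to a probability
# distribution over a tiny vocabulary. A real model computes this with a
# neural network over tens of thousands of tokens.
def toy_next_token_distribution(context: str) -> dict[str, float]:
    if context.endswith("The cat sat on the"):
        return {"mat": 0.5, "floor": 0.2, "chair": 0.2, "windowsill": 0.1}
    return {".": 1.0}  # end the sentence for any other context

def generate(prompt: str, max_tokens: int = 10) -> str:
    text = prompt
    for _ in range(max_tokens):
        dist = toy_next_token_distribution(text)
        # Sample the next token in proportion to its probability,
        # which is why repeated runs can produce different outputs.
        token = random.choices(list(dist), weights=dist.values())[0]
        if token == ".":          # treat "." as the stopping signal
            return text + "."
        text += " " + token
    return text

print(generate("The cat sat on the"))
```

Run it a few times and you will occasionally get "floor" or "chair" instead of "mat"; that is the same sampling behaviour that makes real models give different answers to the same prompt.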
How Models Learn: From Raw Text to Coherent Responses
Before a model can predict what comes next, it needs to learn patterns from existing text. This learning happens in distinct phases, each shaping different aspects of the model's behavior.
Pre-training: Learning the Structure of Language
Think of pre-training as immersive language learning. Instead of memorizing vocabulary lists and grammar rules, you'd learn by reading millions of books, articles, websites, and conversations in that language. You'd absorb not just words, but patterns—how sentences are structured, how ideas flow, how different contexts require different phrasings, how technical discussions differ from casual chat.
This is what happens during pre-training. The model processes enormous text corpora—web pages, digitized books, academic papers, code repositories, forum discussions. It learns by trying to predict masked or subsequent words in these texts, adjusting billions of internal parameters (the "weights" of the neural network) to get better at this prediction task.
Crucially, this phase is unsupervised. No human is labeling which sentences are good or which facts are true. The model simply learns statistical patterns: given this context, these tokens tend to follow. This is why pre-trained models acquire both impressive capabilities—fluent language generation, reasoning patterns, factual knowledge—and significant limitations. They learn whatever patterns exist in training data, including biases, misconceptions, and false information that appeared in that data.
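If it helps to see the objective written down, here is a small illustrative sketch of what "predicting the next word" means as a training task. It uses word-level tokens and skips the neural network and loss calculation entirely; real pre-training works on subword tokens and adjusts billions of parameters to make the correct next token more probable.

```python
# A minimal sketch of the pre-training objective: every position in the
# training text becomes a "predict the next token" example. (Illustrative
# word-level tokens; real models use subword tokens and train a neural
# network to put high probability on the correct continuation.)
text = "the cat sat on the mat"
tokens = text.split()

training_examples = [
    (tokens[:i], tokens[i])          # (context so far, token to predict)
    for i in range(1, len(tokens))
]

for context, target in training_examples:
    print(f"context={' '.join(context)!r:24} target={target!r}")
```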
"Pre-training gives the model literacy—the ability to read and write fluently in the language. But literacy isn't the same as wisdom, accuracy, or helpfulness. Those qualities require additional training."
— Common description of pre-training's scope
Fine-tuning: Learning to Be Helpful
A purely pre-trained model is like a brilliant mimic with no particular goal. Ask it to help you write an email, and it might continue your prompt as if you were writing a novel about someone writing an email. It would be linguistically coherent but behaviorally unhelpful because it was trained only to continue text patterns, not to assist users.
Fine-tuning addresses this by teaching the model preferred behaviors. In one common approach, called Reinforcement Learning from Human Feedback (RLHF), humans rate model outputs on helpfulness, harmlessness, and honesty. The model learns to generate responses that score higher on these dimensions. This is why ChatGPT attempts to answer your question rather than just continuing your text in an unexpected direction, why Claude tries to be balanced and nuanced, why models hedge when uncertain rather than confidently stating whatever is linguistically plausible.
Fine-tuning doesn't add new knowledge—the model still works from patterns learned during pre-training. But it shapes how that knowledge gets expressed, steering the model toward being a helpful assistant rather than merely a text completion engine.
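One way to picture the RLHF signal is the pairwise comparison at its core: a human marks one response as better than another, and a separate reward model is trained so that preferred responses score higher. The sketch below shows only that scoring rule, with made-up numbers; it is an illustration of the idea, not the full RLHF pipeline.

```python
import math

# Sketch of the pairwise preference signal used in RLHF-style fine-tuning.
# A human picks the better of two responses; the reward model is trained so
# that the preferred response gets a higher score. The scores here are
# invented; in practice they come from a separate neural network.
def preference_loss(score_preferred: float, score_rejected: float) -> float:
    # Bradley-Terry style loss: small when the preferred response already
    # scores well above the rejected one, large when the ranking is wrong.
    return -math.log(1 / (1 + math.exp(-(score_preferred - score_rejected))))

print(preference_loss(2.0, 0.5))   # correct ranking -> low loss (~0.20)
print(preference_loss(0.5, 2.0))   # wrong ranking   -> high loss (~1.70)
```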
Key Concepts That Shape Model Behavior
Tokens: The Currency of Language Models
When you think about text, you probably think in words. Language models think in tokens. A token might be an entire word, a chunk of a word, or even a single character, depending on how common that sequence is in the training data. Common words like "the" or "and" are typically single tokens. Uncommon words might be split: "understanding" might be one token, while "bioengineering" might be split into "bio" and "engineering."
This matters practically because pricing and limits are measured in tokens, not words. When a service advertises "32,000 token context window," that's roughly 24,000 words—but it varies depending on your specific text. Technical writing with specialized terminology tends to use more tokens per word than casual conversation because rare terms get split into multiple tokens.
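If you want to see tokenization for yourself, OpenAI's open-source tiktoken library splits text the way GPT-4-era models do. This is a quick sketch assuming tiktoken is installed and using the cl100k_base encoding; other providers use different tokenizers, so treat the counts as approximate.

```python
# Rough token counting with OpenAI's tiktoken library (pip install tiktoken).
# The "cl100k_base" encoding is used by GPT-4-era models; other models use
# different tokenizers, so counts vary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in [
    "The cat sat on the mat.",
    "Pharmacokinetic bioengineering heuristics",   # rare terms split into more tokens
]:
    tokens = enc.encode(text)
    print(f"{len(text.split()):2d} words -> {len(tokens):2d} tokens: {text}")
```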
Why Tokenization Affects Model Behavior
The way text gets split into tokens can affect model performance in subtle ways. Languages that use non-Latin scripts often require more tokens to represent the same amount of information, making them more expensive to process and potentially reducing model effectiveness in those languages.
Similarly, code in languages with verbose syntax uses more tokens than compact languages, affecting how much code fits in the context window and how expensive it is to process.
Context Window: The Scope of Attention
Imagine trying to summarize a book, but you can only look at 20 pages at a time. You could read the first 20 pages and summarize them, then read pages 21-40 and summarize those, but you'd struggle to see connections between page 1 and page 35 because you can't view them simultaneously. The context window is analogous—it defines how much text the model can "see" at once.
Early language models had context windows of just a few thousand tokens—maybe a page or two of text. This limited their usefulness for tasks requiring long-range coherence. Modern models have dramatically expanded context windows: GPT-4 Turbo supports 128,000 tokens, Claude 3.5 Sonnet supports 200,000 tokens. These capacities enable new use cases like analyzing entire codebases, processing multi-chapter documents, or maintaining coherent conversations across hundreds of exchanges.
However, larger context windows come with tradeoffs. They're more computationally expensive to process, which means higher costs and slower response times. Models can also struggle to maintain consistent attention across very long contexts—details from early in a 200,000-token conversation might get less weight than more recent context, even if those early details are important.
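Applications deal with this limit by trimming what they send. The sketch below shows one simple, hypothetical strategy: keep the most recent messages that fit within a token budget and drop the oldest. The token estimate here is a crude word-count rule of thumb, not a real tokenizer.

```python
# Hypothetical sketch of how a chat application keeps a conversation inside
# the model's context window: drop the oldest messages once a token budget
# is exceeded.
def estimate_tokens(message: str) -> int:
    # Rough rule of thumb: about 0.75 words per token in English prose.
    return int(len(message.split()) / 0.75)

def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    kept, total = [], 0
    for message in reversed(messages):        # walk from newest to oldest
        cost = estimate_tokens(message)
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    return list(reversed(kept))               # restore chronological order

history = ["first message ...", "second message ...", "third message ..."]
print(trim_history(history, max_tokens=8))
```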
Parameters: The Model's Memory Capacity
When you hear "GPT-4 has over a trillion parameters," what does that actually mean? Parameters are the learned values that encode the model's knowledge—essentially, the "settings" that determine how the model transforms input into output. You can think of them like the strength of connections in a brain: some connections are strong (represented by large parameter values), reinforcing patterns the model saw frequently during training, while others are weak, representing rare or uncertain patterns.
More parameters generally enable more capability. A model with 7 billion parameters might handle basic conversation and simple tasks, while a model with 175 billion parameters can engage in sophisticated reasoning, write complex code, and demonstrate nuanced understanding across many domains. But this relationship isn't linear—doubling parameters doesn't double capability—and more parameters require more computational resources to run.
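Parameter counts also translate directly into hardware requirements. A rough back-of-the-envelope calculation, assuming 2 bytes per parameter (16-bit weights) and counting only the weights themselves, looks like this:

```python
# Back-of-the-envelope memory needed just to store a model's parameters,
# assuming 2 bytes per parameter (16-bit weights). Real deployments also
# need memory for activations, caches, and overhead, so treat these as
# lower bounds.
for params in [7e9, 70e9, 175e9]:
    gigabytes = params * 2 / 1e9
    print(f"{params / 1e9:>5.0f}B parameters -> ~{gigabytes:,.0f} GB of weights")
```

This is part of why small models can run on a single workstation while the largest ones need clusters of specialised hardware.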
Temperature: Tuning Randomness and Creativity
Remember that at each step, the model calculates a probability distribution across possible next tokens. Temperature controls how the model samples from that distribution. Low temperature (close to 0) makes the model consistently pick high-probability tokens, resulting in focused, predictable, near-deterministic outputs. High temperature (1.0 or above) flattens the probability distribution, making the model more likely to pick lower-probability tokens, resulting in creative but less predictable outputs.
For factual tasks—answering questions, summarizing documents, extracting information—you want low temperature. The most probable completion is usually the most accurate one. For creative tasks—brainstorming, writing fiction, generating diverse examples—higher temperature introduces useful variety. The model will explore less obvious possibilities, sometimes producing surprising and valuable results, though with higher risk of incoherence.
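Under the hood, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before converting them to probabilities. Here is a small sketch with invented logit values showing how low temperature concentrates probability on the top option while high temperature spreads it out:

```python
import math

# Sketch of how temperature reshapes the next-token distribution: logits are
# divided by the temperature before the softmax. Logit values are invented
# for illustration.
def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

logits = {"mat": 2.0, "floor": 1.0, "chair": 0.5}
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 2) for tok, p in probs.items()})
# Low temperature concentrates almost all probability on "mat";
# high temperature spreads it more evenly across the options.
```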
Why Models Behave the Way They Do
Understanding the architecture clarifies many behaviors that seem puzzling when you first encounter AI systems. These aren't bugs or mysteries—they're natural consequences of the prediction-based approach.
The Hallucination Problem
When a lawyer used ChatGPT to research case law and the model fabricated six non-existent legal cases, complete with plausible citations and judicial opinions, this wasn't a malfunction. The model did exactly what it was designed to do: predict plausible next tokens based on learned patterns. It had learned what legal citations look like, what judicial reasoning sounds like, what case names typically follow certain patterns. When asked to provide supporting cases, it generated text that fit those patterns, with no mechanism to verify whether specific cases actually exist.
This is the hallucination problem in essence. Models don't distinguish between recalled facts and plausible fabrications because fundamentally, they're not recalling facts—they're predicting tokens. If false information is linguistically plausible, it might be generated. There's no fact-checking layer, no verification step. It's pattern matching all the way down.
Confident-Sounding Uncertainty
Models often sound confident even when wrong because confidence in phrasing is common in training data. Academic papers state findings definitively. News articles present information as fact. Expert commentary uses authoritative language. The model learned that this is how informative text is typically structured, so it generates similarly confident-sounding output regardless of the actual reliability of the content.
Fine-tuning can encourage models to express uncertainty—you'll notice modern assistants more often say "I'm not certain, but..." or "This information might be outdated..."—but the underlying architecture still generates token probabilities based on linguistic patterns, not epistemic confidence about factual accuracy.
Why Context Is Everything
Each token is predicted based on all the tokens that came before it in the context. Better context enables better predictions. This is why prompt engineering matters so much. When you provide detailed context—specifying format, tone, audience, constraints—you're giving the model more signal about what tokens should follow. Vague prompts leave the model to infer from limited information, often leading to generic or off-target responses.
Knowledge Cutoffs and Temporal Blindness
The model's knowledge comes entirely from training data. Events, publications, or developments after the training cutoff don't exist in the patterns the model learned. If you ask about something recent, the model faces a choice: admit it has no information, or extrapolate based on patterns from before the cutoff. Models are increasingly trained to recognize and acknowledge knowledge cutoffs, but they can still generate plausible-sounding content about events they've never encountered, especially if you ask in ways that presuppose their knowledge.
What This Architecture Enables and Constrains
The prediction-based architecture creates a specific capability profile. Models excel at certain tasks while remaining fundamentally limited in others, regardless of how much training data or compute you provide.
Where Models Excel
Language models are extraordinarily good at generating fluent, coherent text across diverse styles and formats. They can write formal business emails and casual messages, technical documentation and creative fiction, Python code and legal contracts—all with appropriate conventions and structures. This isn't because they "understand" these forms in a deep sense, but because they've absorbed the patterns that characterize each style from millions of examples.
They excel at transformation tasks: summarizing long documents, translating between languages, converting bullet points into prose, restructuring arguments, reformatting data. These tasks leverage the model's strength in pattern recognition and generation while minimizing hallucination risk because the source material is provided rather than recalled.
They're effective at following complex instructions with multiple constraints. "Write a formal email declining a job offer, expressing gratitude for the opportunity, mentioning the competing offer without details, maintaining warmth, and keeping it under 150 words" involves juggling multiple requirements simultaneously—a task that plays to models' ability to weight various constraints when predicting each next token.
Where Models Struggle
Verifying truth requires mechanisms that current language models lack. They can generate text that looks like fact-checking, but they can't actually verify whether a claim is true. There's no connection to authoritative sources, no reasoning system that can evaluate evidence, no database to query. Some systems address this by adding retrieval components that look up information before generating responses, but the core language model itself is blind to truth versus plausible falsity.
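Retrieval-augmented setups work around this by fetching trusted text first and instructing the model to answer from it. The sketch below is purely illustrative: search_documents and ask_model are hypothetical placeholders for whatever search index and model API you actually use.

```python
# Illustrative sketch of retrieval-augmented generation. Both helper
# functions are hypothetical placeholders, not a real library API.
def search_documents(query: str) -> list[str]:
    # In practice: a keyword or vector search over a trusted document store.
    return ["(relevant passage retrieved from your document store)"]

def ask_model(prompt: str) -> str:
    # In practice: a call to whichever language model API you use.
    return "(model's answer, grounded in the supplied passages)"

def answer_with_sources(question: str) -> str:
    passages = search_documents(question)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    return ask_model(prompt)

print(answer_with_sources("What does the report say about revenue?"))
```

The model is still only predicting tokens; the retrieval step just puts the right material in front of it.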
Real-time information is inaccessible without additional tools. The model knows patterns from training data, full stop. If you ask about today's weather, current stock prices, or breaking news, a pure language model will either admit ignorance or extrapolate from patterns (like typical weather for this time of year), not access actual current data.
Models don't learn from conversations in the traditional sense. Within a single conversation, they can reference what was said earlier—that's using the context window, not learning. But once the conversation ends, nothing is retained. The model returns to its base state. This differs fundamentally from human learning, where each interaction can update our knowledge and change future behavior. Language models require explicit retraining to update their parameters.
Consistent outputs across runs aren't guaranteed because of the probabilistic sampling process. The same prompt can yield different responses, sometimes subtly different, sometimes radically so. This is intentional—it enables creativity and variety—but it means you can't expect reproducibility without special measures like setting temperature to zero or using seed parameters.
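If you need outputs to be as repeatable as possible, most APIs let you turn the sampling dial down. The sketch below uses the OpenAI Python client as one example (it assumes the openai package is installed and an API key is configured); temperature=0 makes sampling effectively greedy, and the seed parameter is documented as best-effort, so even this does not guarantee byte-identical responses.

```python
# Sketch of requesting more reproducible output from the OpenAI API
# (pip install openai, API key in the OPENAI_API_KEY environment variable).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",   # example model name; substitute whichever you use
    messages=[{"role": "user", "content": "Summarise why the sky is blue in one sentence."}],
    temperature=0,         # always favour the highest-probability tokens
    seed=42,               # best-effort determinism, not a strict guarantee
)
print(response.choices[0].message.content)
```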
Practical Implications for Users
Understanding how these systems work isn't just academically interesting—it should change how you use them.
Prompt Engineering Becomes Obvious
Once you understand that the model predicts based on context, the principles of effective prompting become intuitive. Providing specific context improves predictions. Showing examples demonstrates the pattern you want. Specifying constraints guides the probability distributions toward outputs that meet your requirements. You're not trying to communicate intent to an intelligence that will figure out what you want—you're shaping the context to make desired outputs more probable.
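This is also why few-shot prompting works: showing worked examples makes the pattern you want the most probable continuation. A minimal sketch, with invented reviews and labels:

```python
# Sketch of a few-shot prompt: show the model worked examples of the
# pattern you want before the real input. Reviews and labels are invented
# for illustration.
examples = [
    ("The delivery arrived two days late and the box was damaged.", "negative"),
    ("Setup took five minutes and it worked immediately.", "positive"),
]
new_review = "The manual was confusing, but support sorted it out quickly."

prompt = "Classify each review as positive or negative.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {new_review}\nSentiment:"

print(prompt)   # send this string to whichever model API you use
```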
Verification Becomes Non-Negotiable
Knowing that the model has no truth-verification mechanism makes clear that verification must be external. For anything factual that matters, you need to check. This isn't a temporary limitation awaiting a software update—it's inherent to the architecture. Models might get better at expressing uncertainty or citing sources, but the underlying generation process remains probabilistic pattern matching, not fact retrieval.
Task Selection Becomes Strategic
Understanding what models can and can't do helps you route tasks appropriately. Use AI for drafting, not for final factual content without review. Use it for transformation and synthesis, not for memory and retrieval. Use it for generating options, not for making decisions that require judgment about truth or value. The tool becomes more valuable when you understand its fundamental constraints.
The Bottom Line
Large language models are sophisticated pattern-matching systems that predict what text should come next based on statistical patterns learned from massive text corpora. They're remarkably capable at language generation, transformation, and following complex instructions. But they're not knowledge bases, reasoning engines, or truth-verification systems.
This architecture explains both their impressive capabilities and their fundamental limitations. The same mechanism that enables fluent, context-appropriate responses also enables confident fabrication. The same training process that captures knowledge from billions of documents also captures biases and false information from those documents. The same probabilistic sampling that allows creativity also prevents guaranteed consistency.
Understanding this mechanism helps you use these tools effectively. Leverage their strengths in language generation while compensating for their limitations in factual accuracy and reasoning. Provide good context, verify factual claims, and structure workflows that capture value while managing risk. The tools are powerful, but they're most powerful when you understand what's actually happening beneath the surface.
Key Takeaways
Core mechanism: Models predict the next token based on statistical patterns learned from training data, repeating this process to generate full responses.
Training shapes behavior: Pre-training provides language fluency and knowledge patterns, while fine-tuning teaches helpful, harmless, and honest response behaviors.
No truth verification: Plausible text isn't necessarily true text. The architecture has no mechanism to check factual accuracy—it only predicts linguistically likely continuations.
Context is critical: Better prompts provide better prediction context, directly improving output quality. Specificity and examples guide the model toward desired outputs.
Probabilistic by design: Same input can produce different outputs because models sample from probability distributions rather than deterministically selecting the highest-probability option.