Explainstuff.mebeta
All concepts
AI-Driven Developmentbeginner6 min

Large Language Models, in Plain Terms

A trained text-predictor: feed it context, and it guesses the most likely next words, one at a time.

Phones have done this for years: you type "I'll be there in five" and the keyboard offers "minutes." A large language model is, at heart, the same idea scaled up enormously — a system that looks at the text so far and predicts what is likely to come next. That sounds almost too simple to power something that writes essays and code, but the surprise of the last few years is that when you make that predictor big enough and train it on enough text, it gets startlingly good at it.

It predicts the next token

Models don't actually work in whole words; they work in tokens — small chunks of text, often a word or a piece of one. "Predicting" means: given all the tokens it has seen so far, the model estimates which token is most likely to come next, picks one, and adds it to the text. There is no grand plan or sentence laid out in advance. The reason a model can answer a question or write a function is that, across a vast amount of training text, the most likely continuation of a well-posed question is its answer. Good predictions, stacked one after another, look a lot like thinking.

How it works

Everything you send — your question, any instructions, any files — becomes a sequence of tokens that the model reads all at once. It then produces one token, appends it to that sequence, and repeats: read everything, predict the next token, add it, read again. That loop is why output appears to stream out word by word and always reads left to right — each new token is generated with full sight of everything before it but nothing after. The diagram below shows the shape of it: tokens go in, the model predicts, and tokens come out the other side.

Tokens in, next-token prediction out
text in, next-token prediction out
Prompt tokens
Language Model
Predicted tokens
The model reads the prompt as tokens and predicts the most likely next tokens, one at a time.
Note

In our stack — the models doing this prediction are Anthropic's Claude models. When Claude Code works on your project, it bundles your request and the relevant code into tokens, sends them to a Claude model, and streams back the predicted tokens — which might be an explanation, a patch, or a decision to call a tool. The model itself stays the same between requests; all the project-specific knowledge rides along in what gets sent.

Trained once, then it just runs

It helps to separate two very different phases. Training is the slow, expensive process where the model's internal settings are tuned by exposure to huge amounts of text — this happens once, ahead of time, in a data center. Inference is what happens when you actually use it: the finished model takes your input and predicts tokens. The key thing to internalize is that inference does not change the model. Asking it something today teaches it nothing for tomorrow; the model that answers your next question is byte-for-byte the same one.

No memory between calls

This is the consequence that trips people up most. Because using the model doesn't change it, a model has no memory of past conversations. Each call starts cold. If it seems to "remember" what you said three messages ago, that's only because all of those earlier messages are being re-sent to it every single time. Anything the model needs to know — the conversation so far, the contents of a file, your preferences — has to be packed into the input on each call. That input has a size limit, which is exactly what the context window lesson is about.

Key takeaways

  • A large language model is a text predictor: given some context, it estimates the most likely next token and emits it, over and over.
  • Tokens are the small chunks of text the model reads and writes — roughly word-sized pieces, not whole sentences.
  • Training and running are two separate phases: a model is trained once, then used (inference) many times without learning anything new.
  • The model has no memory between calls — anything it should 'know' must be inside the context you send each time.
  • Output is generated one token at a time, with each new token fed back in to predict the next, which is why it reads left to right.

Keep going