Pre-training: learning language from text
The first stage of training is pre-training. The model reads an enormous amount of text, typically trillions of words drawn from books, websites, code repositories, academic papers, and public records. It learns by predicting the next word, over and over, adjusting its internal parameters to become better at prediction. After this stage, the model has absorbed the statistical structure of language: grammar, facts, reasoning patterns, style, tone, and the relationships between concepts. It can complete any text in a way that is statistically consistent with its training data. But it has no particular inclination to be helpful, to answer questions, or to follow instructions. It is a text-completion engine, powerful but undirected.
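The next-word objective can be illustrated with a toy counting model. This is a deliberately tiny stand-in for real pre-training, which uses neural networks over token sequences rather than bigram counts over whole words; the corpus below is invented for illustration.

```python
from collections import Counter, defaultdict

# A toy "model" that learns by counting which word follows which.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # observe each next-word occurrence

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return counts[word].most_common(1)[0][0]

print(predict_next("on"))   # "the" — the only word that follows "on" here
```

A real pre-trained model does the same thing at vastly greater scale: its parameters encode, in compressed form, which continuations are probable given everything that came before.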
Instruction tuning: learning to follow directions
A pre-trained model does not naturally behave like an assistant. If you type a question, it might continue the text as if it were writing a web page that happens to contain that question, rather than answering it. Instruction tuning changes this. The model is trained on curated examples of instruction-response pairs: "summarise this article" followed by a good summary, "translate this sentence" followed by a correct translation. After instruction tuning, the model understands that a prompt is something to respond to, not merely to continue. Instruction tuning is what turns a text-completion engine into something that feels like a conversational partner.
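The instruction-response pairs described above can be sketched as data. The field names and the `### Instruction:` template below are illustrative conventions only, not any particular model family's actual format.

```python
# Hypothetical instruction-tuning examples (contents invented).
examples = [
    {"instruction": "Summarise this article: ...",
     "response": "The article argues that ..."},
    {"instruction": "Translate to French: Good morning",
     "response": "Bonjour"},
]

def format_example(ex):
    # One common convention: tag the instruction and response so the
    # model learns that a prompt is something to answer, not continue.
    return (f"### Instruction:\n{ex['instruction']}\n"
            f"### Response:\n{ex['response']}")

print(format_example(examples[1]))
```

Training on many such formatted pairs is what teaches the model the prompt-then-answer structure of a conversation.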
Alignment: optimising for human preference
After instruction tuning, the model can follow directions, but its responses may be dry, unhelpful, or occasionally harmful. Alignment training adjusts the model further, typically using human feedback. Human evaluators compare pairs of responses and indicate which one they prefer. The model is then trained to produce responses that match those preferences. Alignment is where the model's characteristic behaviour gets shaped: the tendency to be thorough, to hedge, to be encouraging, to avoid controversy. The critical insight is that alignment optimises for what evaluators perceive as good, not for what is externally verifiable as correct. An aligned model will often produce a confident, well-structured, reassuring answer even when the honest answer is "I do not have enough information to answer reliably." It will produce a long, detailed response even when a short one would serve better, because evaluators tend to rate longer responses as more helpful. The verbosity, the hedging, the reluctance to say "no": these are not properties of the technology. They are properties of what human evaluators rewarded during alignment.
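The pairwise-preference step can be sketched with the Bradley-Terry-style objective commonly used in reward modelling. The scores below are invented for illustration; in a real pipeline they would come from a learned reward model scoring the two responses.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """-log sigmoid(r_pref - r_rej): small when the preferred response
    already scores higher, large when the model ranks the pair the
    wrong way, pushing its parameters toward the evaluators' choice."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Model already agrees with the evaluator: low loss.
agree = preference_loss(2.0, 0.5)
# Model prefers the response the evaluator rejected: high loss.
disagree = preference_loss(0.5, 2.0)
print(round(agree, 3), round(disagree, 3))
```

Note what the loss measures: agreement with the evaluator's choice, not correctness. Nothing in the objective checks the response against reality, which is exactly the gap the next section examines.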
The gap between preferred and correct
The distinction between preferred and correct deserves attention. A model trained purely on prediction has no opinion about what is good; it predicts probable text. A model trained on human preferences has a strong opinion about what is good, but that opinion reflects what evaluators liked, not what is objectively right. Evaluators like fluency, so the model is fluent even when wrong. Evaluators like thoroughness, so the model over-explains even when brevity is appropriate. Evaluators dislike blunt refusals, so the model attempts an answer even when it should decline. When a chat model is agreeable to a fault, produces unnecessary caveats, or gives you what you seem to want to hear rather than what you need to know, you are seeing alignment training in action. You can compensate: instruct the model to be terse, to express uncertainty explicitly, to prioritise accuracy over agreeableness.
Distillation: transferring capability to smaller models
Training a large model is extraordinarily expensive. Distillation is the process of using a large, capable model (the teacher) to train a smaller, cheaper model (the student). The student model is trained not on the original text data but on the outputs of the teacher model. It learns to approximate the teacher's behaviour at a fraction of the size and cost. Distilled models are central to practical deployment because they make it feasible to run capable models on modest hardware, on local machines, or at high volume without prohibitive API costs. The trade-off is that the student never quite matches the teacher. It captures the most common patterns well but loses some of the nuance and edge-case handling. For well-defined, focused tasks, a distilled model can be as good as its teacher. For open-ended, complex reasoning, the gap becomes noticeable. Choosing between a large model and its distilled variant is a recurring practical decision: do you need the full capability, or is the smaller model sufficient for this particular task?
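The student-imitates-teacher idea can be sketched with the KL-divergence loss commonly used in distillation. The two next-token distributions below are made up; a real setup would compare distributions over an entire vocabulary, position by position.

```python
import math

# Invented next-token distributions for one position in a sequence.
teacher = {"yes": 0.70, "probably": 0.20, "no": 0.10}
student = {"yes": 0.60, "probably": 0.25, "no": 0.15}

def kl_divergence(p, q):
    """KL(p || q): how far the student's distribution q is from the
    teacher's p. Zero only when the two match exactly."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

print(round(kl_divergence(teacher, student), 4))  # small but nonzero
```

Training the student on the teacher's full distribution, rather than on single "correct" tokens from raw text, is what lets it absorb the teacher's behaviour so efficiently; the residual divergence that never quite reaches zero is the nuance lost in compression.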
Why this matters for your work
Every behaviour you observe in a language model is the product of these training stages. The broad knowledge comes from pre-training. The ability to follow your instructions comes from instruction tuning. The tendency to be agreeable, verbose, and evasive about uncertainty comes from alignment. The availability of fast, cheap models that run locally comes from distillation. None of these are fixed. When you write a system prompt that says "be direct, express uncertainty when you are uncertain, and keep responses under three sentences", you are overriding the alignment defaults with your own preferences. When you choose a small distilled model for a classification task instead of a large model, you are making an informed trade-off between capability and cost. Understanding the training pipeline gives you the vocabulary to diagnose why a model behaves the way it does and the confidence to change it.
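Overriding the alignment defaults can be sketched as a chat request. The role/content message structure below is a widely used convention, but the exact field names vary by provider, so treat this as illustrative rather than a specific API.

```python
# A system prompt that counters the alignment-trained defaults:
# verbosity, agreeableness, and evasiveness about uncertainty.
messages = [
    {"role": "system",
     "content": ("Be direct. Express uncertainty when you are uncertain. "
                 "Keep responses under three sentences.")},
    {"role": "user",
     "content": "Is Python dynamically typed?"},
]
# Sent to a chat model, this biases responses toward terse, calibrated
# answers instead of the long, reassuring defaults rewarded in alignment.
```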
Examples
Alignment-induced verbosity
You ask a model a yes-or-no question: "Is Python dynamically typed?" The model responds with four paragraphs explaining dynamic typing, static typing, type hints, and the historical evolution of Python's type system. The answer to your question is "yes", but the alignment training rewarded thoroughness over brevity. Adding "respond with only yes or no" to your prompt produces the one-word answer you wanted.
Distilled model for a focused task
You need to classify 50,000 customer feedback entries into five categories. You test a large frontier model and a distilled model one-tenth its size. Both achieve 94% accuracy on your test set. The large model costs 40 times more per request and is six times slower. For this task, the distilled model is the right choice because the capability gap is negligible and the cost and speed difference is decisive.
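The arithmetic behind that decision can be made explicit. Only the 40x cost and 6x speed ratios come from the example above; the distilled model's per-request price and latency are assumed baseline figures.

```python
n_requests = 50_000
small_cost_per_req = 0.0001  # assumed price for the distilled model, $
small_secs_per_req = 0.5     # assumed latency for the distilled model, s

small_cost = n_requests * small_cost_per_req
large_cost = small_cost * 40                          # 40x more expensive
small_hours = n_requests * small_secs_per_req / 3600
large_hours = small_hours * 6                         # 6x slower

print(f"distilled: ${small_cost:,.2f}, {small_hours:.1f} h")
print(f"frontier:  ${large_cost:,.2f}, {large_hours:.1f} h")
```

At equal accuracy, the comparison reduces to pure cost and time, and the ratios dominate whatever baseline figures you plug in.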