A Field Guide to Model Post-training
Most AI products are using the same general-purpose models. The moat isn’t the model — it’s the feedback loop. Companies that capture user signal and feed it back into training are building products other people can’t reproduce with a prompt.
This is part one of a two-part series. Part one covers the mechanics: the techniques, the hardware, the trade-offs. Part two walks through real post-training runs and where they actually moved the needle.
You don’t need to have trained a model to follow along. By the end, PEFT, RLHF, LoRA, and GRPO should feel concrete.
A map of post-training
Imagine a 2D grid. The x-axis is correctness — is the model giving the right kind of answer? The y-axis is preference — among correct answers, which one do people want? A freshly pre-trained model sits near the origin. Post-training moves it.
The two axes need different tools.
SFT: Teaching correctness
General-purpose models look great in a demo and stumble in production — not because they’re wrong, but because the error margin matters. A model that hits 90% in a notebook becomes a liability at scale when the task needs precise domain knowledge, a strict output format, or a narrow skill set.
That’s the gap SFT closes.
SFT trains on (prompt → response) pairs with cross-entropy loss applied only to the assistant tokens: the model isn’t penalized for how it reads the input, only for what it generates. People call it instruction tuning when the goal is general instruction-following, domain fine-tuning when the goal is specialization. [1] was the early proof at scale: labelers preferred outputs from a 1.3B SFT+RLHF model over those of the 175B base GPT-3, and the 175B SFT+RLHF model was preferred over few-shot 175B GPT-3 in 71% of human comparisons.
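To make "loss applied only to the assistant tokens" concrete, here is a minimal sketch in plain Python. It assumes you already have, for each position, the probability the model gave to the correct token; the names `masked_cross_entropy`, `token_probs`, and `loss_mask` are illustrative, not taken from any particular training library.

```python
import math

def masked_cross_entropy(token_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1.

    token_probs[i]: probability the model assigned to the correct token i.
    loss_mask[i]:   1 for assistant tokens, 0 for system/user tokens.
    """
    if len(token_probs) != len(loss_mask):
        raise ValueError("token_probs and loss_mask must have the same length")
    n_trained = sum(loss_mask)
    if n_trained == 0:
        raise ValueError("no assistant tokens to train on")
    total = -sum(math.log(p) for p, m in zip(token_probs, loss_mask) if m)
    return total / n_trained

# Hypothetical sequence: 3 prompt tokens followed by 2 assistant tokens.
probs = [0.01, 0.02, 0.05, 0.5, 0.25]
mask = [0, 0, 0, 1, 1]
loss = masked_cross_entropy(probs, mask)
```

The low probabilities on the three prompt tokens never enter the loss. Only the two assistant positions contribute.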
A training sample:
{ "messages": [
{ "role": "system", "content": "You are a medical coding assistant." },
{ "role": "user", "content": "ICD-10 code for Type 2 diabetes?" },
{ "role": "assistant", "content": "E11.9 — Type 2 diabetes mellitus without complications." }
]}
The format is trivial. What you put in it is the whole game.
How much data do you really need?
Less than you’d expect, but quality dominates quantity. [2] fine-tuned a 65B model on just 1,000 hand-curated examples and got responses judged equivalent to or better than GPT-4’s in 43% of comparisons. The argument: almost everything the model knows was learned in pre-training, so alignment needs surprisingly little clean signal. [3] pushed further: selecting just 5% of a dataset using an Instruction-Following Difficulty score outperformed training on the full set and beat WizardLM by ~10% on standard benchmarks. The question isn’t how much, it’s how well selected.
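A rough sketch of score-based selection, under two assumptions worth flagging: that the difficulty score is the ratio of the response loss with the prompt to the response loss without it, and that the highest-scoring samples are the ones to keep. The field names and helper functions are hypothetical, and the losses are taken as precomputed.

```python
def ifd_score(loss_given_prompt, loss_alone):
    """Assumed score: response loss with the prompt / response loss without it."""
    if loss_alone <= 0:
        raise ValueError("loss_alone must be positive")
    return loss_given_prompt / loss_alone

def select_top_fraction(samples, fraction=0.05):
    """Keep the highest-scoring fraction of samples (at least one)."""
    ranked = sorted(
        samples,
        key=lambda s: ifd_score(s["loss_given_prompt"], s["loss_alone"]),
        reverse=True,
    )
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]

# Hypothetical precomputed losses for four samples.
samples = [
    {"id": "a", "loss_given_prompt": 0.9, "loss_alone": 1.0},
    {"id": "b", "loss_given_prompt": 0.2, "loss_alone": 1.0},
    {"id": "c", "loss_given_prompt": 1.2, "loss_alone": 2.0},
    {"id": "d", "loss_given_prompt": 0.5, "loss_alone": 2.0},
]
selected = select_top_fraction(samples, fraction=0.5)
```

A score near 1 means the prompt barely helped the model predict the response, which is the sense in which the sample counts as difficult.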
PEFT: Same goal, fraction of the cost
The biggest problem with full fine-tuning isn’t the concept — it’s the bill. Updating every parameter in a 7B or 70B model means significant compute, long training runs, and a separate copy of weights for every variant you want to serve.
PEFT sidesteps this by freezing most of the model and retraining only a small fraction of the weights — typically less than 2%. The quality trade-off is surprisingly small: PEFT recovers 90–95% of full fine-tuning performance while cutting memory 10–20×.
The most widely used PEFT method in production is LoRA.
LoRA: Low-rank adapters
LoRA ([5], not the Google font) leaves the original weight matrices frozen and injects small trainable adapter matrices alongside them. The adapters have a low rank r that controls how many parameters are actually trained. After training, the adapters can be merged into the base weights (mathematically just an addition), producing a model that serves like any fully fine-tuned one, with zero inference overhead.
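A toy illustration of the merge, using the update W + (alpha / r) · B · A from the LoRA paper. The matrices and the helper names (`matmul`, `add`, `scale`) are made up for the example; real adapters are far larger and live inside a framework.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(m, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[s * x for x in row] for row in m]

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen base weight, 2x3
B = [[0.5], [-1.0]]                     # trainable adapter, 2x1
A = [[1.0, 0.0, 2.0]]                   # trainable adapter, 1x3
alpha, r = 2.0, 1                       # scaling factor and rank
x = [[1.0], [1.0], [1.0]]               # input column vector

delta = scale(matmul(B, A), alpha / r)

# Unmerged: base path plus adapter path, computed separately.
unmerged = add(matmul(W, x), matmul(delta, x))

# Merged: fold the adapter into the weight once, then one matmul.
merged = matmul(add(W, delta), x)
```

Both paths produce the same output, which is why merging costs nothing at inference time.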
A LoRA run uses the same data format as SFT:
{ "messages": [
{ "role": "system", "content": "You are a contract review assistant." },
{ "role": "user", "content": "Does this clause include an indemnification obligation?" },
{ "role": "assistant", "content": "Yes — Section 4.2 contains a mutual indemnification clause covering third-party IP claims." }
]}
The key hyperparameter is rank r. Practical guidance: start low, and increase only if the model underfits.
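For a sense of scale, here is the arithmetic for a single weight matrix. A rank-r adapter on a d_out × d_in matrix adds r · (d_in + d_out) trainable parameters. The 4096 × 4096 shape below is an assumed example, not a figure from any specific model.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters in one adapter pair: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)

def trainable_fraction(d_out, d_in, r):
    """Adapter parameters as a fraction of the frozen matrix's parameters."""
    return lora_param_count(d_out, d_in, r) / (d_out * d_in)

# Assumed example: one 4096 x 4096 projection matrix at a few ranks.
for rank in (4, 8, 16, 64):
    frac = trainable_fraction(4096, 4096, rank)
    print(f"r={rank:>2}: {lora_param_count(4096, 4096, rank):>7} params ({frac:.2%})")
```

Doubling the rank doubles the trainable parameters, so starting low keeps the memory footprint small until the model shows it needs more capacity.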
Multi-LoRA in production
The most useful production property of LoRA is multi-adapter serving. Because adapters are small and separate from the base model, a single deployment can serve hundreds of adapters at once (one per customer, domain, or use case) without keeping a full model copy for each. Fireworks and vLLM support this natively; [6] showed serving thousands of concurrent adapters on a single base model. The trade-off: dynamically swapping unmerged adapters adds ~10–30% to prompt processing vs. a merged model. If you’re serving exactly one variant and latency is critical, merge the adapter into the base weights at deploy time.
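The shape of the idea, as a toy: one shared base weight, a dictionary of small adapters, and the adapter chosen per request. This is only a sketch. The class and method names are hypothetical, and production servers batch requests across adapters with custom kernels rather than looping like this.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

class MultiLoRA:
    """One frozen base weight shared by many named low-rank adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight
        self.adapters = {}

    def register(self, name, B, A, scale=1.0):
        self.adapters[name] = (B, A, scale)

    def forward(self, x, adapter=None):
        y = matvec(self.base_weight, x)
        if adapter is None:
            return y
        B, A, scale = self.adapters[adapter]
        # Low-rank path: project down to rank r, then back up.
        delta = matvec(B, matvec(A, x))
        return [yi + scale * di for yi, di in zip(y, delta)]

server = MultiLoRA(base_weight=[[1.0, 0.0], [0.0, 1.0]])
server.register("legal", B=[[1.0], [0.0]], A=[[1.0, 1.0]])
server.register("medical", B=[[0.0], [1.0]], A=[[1.0, -1.0]])

x = [2.0, 3.0]
legal_out = server.forward(x, adapter="legal")
medical_out = server.forward(x, adapter="medical")
base_out = server.forward(x)
```

The extra low-rank path on every request is where the unmerged overhead comes from. Merging removes it at the cost of one weight copy per variant.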
If you want the mechanics, CodeEmporium’s video is the clearest walkthrough I’ve seen.
Preference: from correct to preferred
Even when the model is generating correct responses, users prefer some correct outputs over others — tone, depth, format, restraint. SFT rarely captures that. Preference optimization does. The goal shifts from training for correctness to training for alignment.
Two camps:
- RL-based — learn from live feedback during training (RLHF/PPO, GRPO).
- Static preference — learn from a fixed dataset of human preferences collected beforehand (DPO, KTO).
In practice you almost always start with static methods. RL comes later — only when you need a ceiling offline methods can’t reach.
RLHF / PPO: the original recipe
[1] is the original preference optimization pipeline. Two stages: (1) train a reward model on human preference data — which of these two responses is better? (2) optimize the language model against that reward via Proximal Policy Optimization (PPO). The model samples during training, gets scored, updates weights — an online loop.
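Stage (1) usually comes down to a pairwise loss: push the reward of the preferred response above the reward of the other one. A minimal sketch, assuming scalar rewards have already been produced by the reward model; the function names are illustrative.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected))."""
    return softplus(-(reward_chosen - reward_rejected))

# Hypothetical reward-model scores for one comparison.
tie = pairwise_reward_loss(1.0, 1.0)      # model can't tell them apart
correct = pairwise_reward_loss(2.0, 0.0)  # ranks the chosen response higher
wrong = pairwise_reward_loss(0.0, 2.0)    # ranks the rejected response higher
```

The loss shrinks as the reward gap grows in the right direction, so the reward model learns a ranking rather than absolute scores.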
It works. It’s also expensive, unstable, and operationally complex. You’re juggling two models, sampling live, and tuning a finicky RL algorithm. Most teams don’t start here.
DPO: skip the reward model
[7] reframes RLHF as a classification problem. The key insight: the reward model in RLHF is implicit in the language model itself — you don’t have to train it separately. DPO drops the reward model entirely and optimizes directly on (chosen, rejected) pairs with a classification loss. No RL, no live sampling, no second model.
It matches or beats PPO-based RLHF on summarization and dialogue while being substantially simpler.
{
"prompt": "Explain neural networks to a 10-year-old.",
"chosen": "Think of it like a brain made of math — lots of tiny switches that learn to turn on or off based on examples.",
"rejected": "Neural networks are computational systems loosely inspired by biological neural networks that constitute animal brains."
}
The model learns to increase the probability of chosen relative to rejected.
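A sketch of the per-pair DPO loss, assuming you already have summed log-probabilities of each response under the policy being trained and under a frozen reference model. The parameter names and the `beta=0.1` default are assumptions for illustration.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the summed log-probability of a full response.
    """
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))

# Hypothetical log-probabilities for one (chosen, rejected) pair.
at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)  # policy equals reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)  # chosen up, rejected down
worse = dpo_loss(-6.0, -6.0, -5.0, -7.0)     # chosen down, rejected up
```

The reference terms are what keep the policy from drifting arbitrarily: only movement relative to the reference model changes the loss.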
KTO: when you only have thumbs
[8] solves a practical problem: most teams don’t have pairwise preference data. They have logs — responses users liked or didn’t. KTO was designed for exactly this. Grounded in Kahneman–Tversky prospect theory, it optimizes directly on binary feedback (desirable / undesirable) without matched pairs. It matches or exceeds DPO across model scales from 1B → 30B.
{
"prompt": "Summarize this customer complaint.",
"completion": "The customer is frustrated about a delayed shipment and is requesting a refund.",
"label": true
}
true = desirable, false = not. No paired responses needed.
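A simplified sketch of the per-example KTO loss as I understand it from [8]. The reference point `kl_baseline` is passed in as a plain number here, whereas the paper estimates it from the batch. The parameter names and defaults are assumptions.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def kto_loss(policy_logp, ref_logp, desirable,
             kl_baseline=0.0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one (prompt, completion, label) record.

    Desirable completions are rewarded for rising above the baseline,
    undesirable ones for falling below it.
    """
    log_ratio = policy_logp - ref_logp
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_baseline)))
    return lambda_u * (1.0 - sigmoid(beta * (kl_baseline - log_ratio)))

# Hypothetical log-probabilities for a single completion.
neutral = kto_loss(-5.0, -5.0, desirable=True)
liked_and_boosted = kto_loss(-4.0, -5.0, desirable=True)
disliked_but_boosted = kto_loss(-4.0, -5.0, desirable=False)
```

Each record is scored on its own, which is why no matched pairs are needed. The `lambda_d` and `lambda_u` weights give a knob for imbalanced thumbs-up and thumbs-down counts.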
GRPO: RL without the critic
GRPO is RL-based, but it does one thing differently from PPO: no value function. Instead of maintaining a separate critic network, GRPO samples a group of outputs for each prompt and uses their average reward as the baseline. This cuts memory 40–60% vs. PPO, making large-scale RL training tractable.
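The group baseline can be sketched in a few lines. This version also divides by the group's standard deviation, which is how I recall the published formulation normalizing rewards. Treat that step, the `eps` term, and the function name as assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its own group.

    rewards: one scalar reward per output sampled for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: four sampled answers to one math prompt,
# scored 1.0 if the final answer verified and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, there is no signal to learn from.
no_signal = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline comes from sibling samples rather than a learned critic, there is no second network to store or train.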
It was central to [9]: trained with GRPO, DeepSeek-R1 rivals OpenAI’s o1 on reasoning benchmarks while requiring ~147K H800 GPU-hours — an order of magnitude less than comparable models. Notably, DeepSeek-R1-Zero used pure RL with GRPO and no SFT at all, producing emergent behaviors like self-correction mid-reasoning.
Pick the right tool
The decision usually compresses to: what data do I have, and what am I optimizing for?
| Situation | Recommended | Why |
|---|---|---|
| Domain adaptation (medical, legal, e-commerce) | SFT + LoRA | Efficient domain shift without full retraining |
| Enforce output format or style | SFT + LoRA | Format is a correctness problem — SFT is the right axis |
| Limited compute / single-GPU fine-tuning | QLoRA | 4-bit quantization fits a 65B model on a single 48GB GPU |
| Serving multiple fine-tuned variants | Multi-LoRA | One base, many adapters, shared GPU |
| Cold start — only binary feedback | KTO | No pairwise data needed, works on 👍/👎 logs |
| Pairwise preference alignment | DPO | Simpler than RLHF, matches PPO on most tasks |
| Improve helpfulness or tone from existing logs | KTO | Binary signal is enough; no annotation pipeline |
| Full safety alignment at scale | RLHF / PPO | Multi-objective optimization, highest ceiling |
| Reasoning tasks (math, code) | GRPO | Verifiable reward, 40–60% less memory than PPO |
| Prevent catastrophic forgetting | LoRA / PEFT | Frozen base weights preserve general capabilities |
The headline: start with LoRA + SFT for correctness, layer in DPO or KTO for preference once you have signal, and only reach for RLHF or GRPO when an offline method has plateaued and you have the infra to absorb the cost.
Part two picks up from here — real runs, real numbers, and what actually moved.
- [1] Ouyang et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
- [2] Zhou et al. (2023). LIMA: Less Is More for Alignment.
- [3] Li et al. (2024). From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection. NAACL.
- [4] Wang et al. (2025). Catastrophic forgetting scales with model size in fine-tuning.
- [5] Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
- [6] Sheng et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters.
- [7] Rafailov et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.
- [8] Ethayarajh et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML.
- [9] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
