A Field Guide to Model Post-training

The mechanics behind turning a general-purpose model into one that fits your product — SFT, LoRA, DPO, KTO, GRPO, and when each one earns its place.

May 19, 2026 · 12 min read · llm, post-training

Most AI products are using the same general-purpose models. The moat isn’t the model — it’s the feedback loop. Companies that capture user signal and feed it back into training are building products other people can’t reproduce with a prompt.

This is part one of a two-part series. Part one covers the mechanics: the techniques, the hardware, the trade-offs. Part two walks through real post-training runs and where they actually moved the needle.

You don’t need to have trained a model to follow along. By the end, PEFT, RLHF, LoRA, and GRPO should feel concrete.

A map of post-training

Imagine a 2D grid. The x-axis is correctness — is the model giving the right kind of answer? The y-axis is preference — among correct answers, which one do people want? A freshly pre-trained model sits near the origin. Post-training moves it.

[Figure: Technique landscape. One axis runs from supervised signal to RL signal; the other from fewer weights modified to more weights modified. Selected technique: LoRA, an SFT (or DPO/KTO) objective with ~1% of weights trained. The default for production fine-tuning when compute is a constraint or you serve many variants.]

The two axes need different tools.

x-axis — correctness. Teach the model what a good response looks like. This is supervised fine-tuning (SFT). Full fine-tuning (FFT) updates every weight; PEFT methods like LoRA train less than 2% of them and recover most of the quality.

y-axis — preference. Among correct answers, lift the ones people prefer. Two camps: RL-based (RLHF, GRPO) — online, higher ceiling, more expensive — and static (DPO, KTO) — offline, cheaper, easier to start with.

SFT: Teaching correctness

General-purpose models look great in a demo and stumble in production, not because they’re usually wrong, but because the remaining error rate matters. A model that is right 90% of the time in a notebook becomes a liability at scale when the task needs precise domain knowledge, a strict output format, or a narrow skill set.

That’s the gap SFT closes.

SFT trains on (prompt → response) pairs with cross-entropy loss applied only to the assistant tokens: the model isn’t penalized for how it reads the input, only for what it generates. People call it instruction tuning when the goal is general instruction-following, and domain fine-tuning when the goal is specialization. InstructGPT [1] was the early proof at scale: labelers preferred outputs from a 1.3B SFT+RLHF model over those of the 175B base GPT-3, and the 175B version was preferred over few-shot GPT-3 in 71% of comparisons.
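To make the masking concrete, here is a minimal sketch in plain Python. The per-token log-probabilities and the 0/1 mask are made-up illustrative values, and `masked_cross_entropy` is a hypothetical helper, not a library function; real training code would compute the same quantity on tensors.

```python
import math

def masked_cross_entropy(token_logprobs, loss_mask):
    """Average negative log-likelihood over tokens whose mask is 1.

    token_logprobs: log-probability the model assigned to each target token.
    loss_mask: 1 for assistant tokens, 0 for system/user tokens.
    """
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have equal length")
    kept = [-lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    if not kept:
        raise ValueError("no assistant tokens to compute a loss on")
    return sum(kept) / len(kept)

# Illustrative values: two prompt tokens followed by two assistant tokens.
token_logprobs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
loss_mask = [0, 0, 1, 1]

# Only the last two tokens contribute, however badly the prompt tokens score.
sft_loss = masked_cross_entropy(token_logprobs, loss_mask)
print(round(sft_loss, 4))
```

The two prompt tokens have the worst log-probabilities in the example, yet they have no effect on the loss.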

A training sample:

{ "messages": [
  { "role": "system",    "content": "You are a medical coding assistant." },
  { "role": "user",      "content": "ICD-10 code for Type 2 diabetes?" },
  { "role": "assistant", "content": "E11.9 — Type 2 diabetes mellitus without complications." }
]}

The format is trivial. What you put in it is the whole game.

How much data do you really need?

Less than you’d expect, but quality dominates quantity. LIMA [2] fine-tuned on just 1,000 hand-curated examples and produced responses judged equivalent to or better than GPT-4’s in 43% of comparisons. The argument is that almost everything the model knows was learned in pre-training, so alignment needs surprisingly little clean signal. Li et al. [3] pushed further: selecting just 5% of a dataset using an Instruction-Following Difficulty score outperformed training on the full set and beat WizardLM by ~10% on standard benchmarks. The question isn’t how much, it’s how well selected.
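A rough sketch of what score-based selection can look like. As I understand [3], the Instruction-Following Difficulty score is the ratio of the model’s loss on the answer given the instruction to its loss on the answer alone. The code assumes both losses were already computed; the field names `cond_loss` and `uncond_loss` and both helper names are hypothetical, and dropping scores of 1 or more is my reading of the paper rather than a verified detail.

```python
def ifd_score(cond_loss, uncond_loss):
    """Ratio of answer loss with the instruction to answer loss without it."""
    return cond_loss / uncond_loss

def select_by_ifd(samples, fraction=0.05):
    """Keep the highest-IFD samples, up to `fraction` of the dataset."""
    scored = []
    for sample in samples:
        score = ifd_score(sample["cond_loss"], sample["uncond_loss"])
        # A score of 1 or more means the instruction did not help at all,
        # so the pair is treated as misaligned and skipped (assumption).
        if score < 1.0:
            scored.append((score, sample))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, round(len(samples) * fraction))
    return [sample for _, sample in scored[:keep]]

# Made-up losses for 20 samples; uncond_loss is fixed for readability.
samples = [{"id": i, "cond_loss": i + 1, "uncond_loss": 20} for i in range(20)]
selected = select_by_ifd(samples, fraction=0.10)
print([s["id"] for s in selected])
```

Sample 19 has a score of exactly 1 and is filtered out, so the two hardest remaining samples (18 and 17) are kept.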

PEFT: Same goal, fraction of the cost

The biggest problem with full fine-tuning isn’t the concept — it’s the bill. Updating every parameter in a 7B or 70B model means significant compute, long training runs, and a separate copy of weights for every variant you want to serve.

PEFT sidesteps this by freezing most of the model and retraining only a small fraction of the weights — typically less than 2%. The quality trade-off is surprisingly small: PEFT recovers 90–95% of full fine-tuning performance while cutting memory 10–20×.

The most widely used PEFT method in production is LoRA.

LoRA: Low-rank adapters

LoRA (Low-Rank Adaptation [5], not the Google font) leaves the original weight matrices frozen and injects small trainable adapter matrices alongside them. The adapters have a low rank r that controls how many parameters are actually trained. After training, the adapters can be merged into the base weights, which is mathematically just an addition, producing a single set of weights that behaves like a fully fine-tuned model with no added inference overhead.
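The merge step is easy to check by hand. This is a toy sketch in plain Python, assuming the usual LoRA formulation where the update is B·A scaled by alpha / r; the matrices are tiny made-up values and `matmul` and `lora_merge` are hypothetical helpers, not a library API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 2x2 layer with a rank-1 adapter (illustrative numbers only).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights, d x k
B = [[1.0], [2.0]]             # trainable, d x r
A = [[3.0, 4.0]]               # trainable, r x k
merged = lora_merge(W, A, B, alpha=2, r=1)

x = [[1.0], [1.0]]             # one input column vector
# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged and unmerged paths produce the same output.
assert matmul(merged, x) == unmerged
```

The merged matrix has the same shape as the original, which is why serving it costs nothing extra.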

A LoRA run uses the same data format as SFT:

{ "messages": [
  { "role": "system",    "content": "You are a contract review assistant." },
  { "role": "user",      "content": "Does this clause include an indemnification obligation?" },
  { "role": "assistant", "content": "Yes — Section 4.2 contains a mutual indemnification clause covering third-party IP claims." }
]}

The key hyperparameter is rank r. Practical guidance:

r = 4–8: Simple style or format adjustments. Start here.
r = 16: Most domain-specific tasks. Sensible default.
r = 32–64: Large domain shifts or multi-task settings.

Start low, increase only if the model underfits.
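To see what rank costs, here is a back-of-the-envelope count for one adapted matrix. The 4096 × 4096 shape is an assumed example, not tied to any specific model, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in one adapter pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

d = 4096          # assumed width of a square projection matrix
full = d * d      # parameters in the frozen matrix itself

for r in (4, 8, 16, 32, 64):
    trainable = lora_trainable_params(d, d, r)
    print(f"r={r:>2}: {trainable:>7,} trainable params, "
          f"{trainable / full:.2%} of the matrix")
```

Trainable parameters grow linearly with r, and at r = 16 the adapter is under 1% of the matrix it sits beside. The fraction for a whole model also depends on which matrices get adapters.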

Multi-LoRA in production

The most useful production property of LoRA is multi-adapter serving. Because adapters are small and separate from the base model, a single deployment can serve hundreds of adapters at once (one per customer, domain, or use case) without keeping a full model copy for each. Fireworks and vLLM support this natively; S-LoRA [6] showed serving thousands of concurrent adapters on a single base model. The trade-off: dynamically swapping unmerged adapters adds ~10–30% to prompt processing vs. a merged model. If you’re serving exactly one variant and latency is critical, merge the adapter into the base weights at deploy time.
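The idea can be sketched as one shared base matrix and a dictionary of unmerged adapters picked per request. This is a toy illustration only: `MultiLoRALayer` and the adapter ids are hypothetical, and it is not how vLLM or S-LoRA implement batching.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class MultiLoRALayer:
    """One shared base matrix plus named, unmerged low-rank adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight
        self.adapters = {}  # adapter_id -> (A, B, scale)

    def register(self, adapter_id, A, B, alpha, r):
        self.adapters[adapter_id] = (A, B, alpha / r)

    def forward(self, x, adapter_id=None):
        out = matvec(self.base_weight, x)
        if adapter_id is None:
            return out  # plain base model
        A, B, scale = self.adapters[adapter_id]
        # Extra work per request: two small matrix-vector products.
        delta = matvec(B, matvec(A, x))
        return [o + scale * d for o, d in zip(out, delta)]

layer = MultiLoRALayer([[1.0, 0.0], [0.0, 1.0]])
layer.register("customer-a", A=[[1.0, 0.0]], B=[[1.0], [0.0]], alpha=1, r=1)
layer.register("customer-b", A=[[0.0, 1.0]], B=[[0.0], [2.0]], alpha=1, r=1)

x = [1.0, 1.0]
print(layer.forward(x))                # base only
print(layer.forward(x, "customer-a"))  # base + adapter a
print(layer.forward(x, "customer-b"))  # base + adapter b
```

The base weights are stored once; each extra variant costs only its A and B matrices, plus the adapter math at request time.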

If you want the mechanics, CodeEmporium’s video is the clearest walkthrough I’ve seen.

Preference: from correct to preferred

Even when the model is generating correct responses, users prefer some correct outputs over others — tone, depth, format, restraint. SFT rarely captures that. Preference optimization does. The goal shifts from training for correctness to training for alignment.

Two camps:

RL-based — learn from live feedback during training (RLHF/PPO, GRPO).
Static preference — learn from a fixed dataset of human preferences collected beforehand (DPO, KTO).

In practice you almost always start with static methods. RL comes later — only when you need a ceiling offline methods can’t reach.

RLHF / PPO: the original recipe

RLHF with PPO, as used for InstructGPT [1], is the original preference optimization pipeline. Two stages: (1) train a reward model on human preference data (which of these two responses is better?), then (2) optimize the language model against that reward via Proximal Policy Optimization (PPO). The model samples during training, gets scored, and updates its weights: an online loop.

It works. It’s also expensive, unstable, and operationally complex. You’re juggling two models, sampling live, and tuning a finicky RL algorithm. Most teams don’t start here.
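Stage (1) usually comes down to a pairwise loss on the reward model’s scores. A minimal sketch, assuming the standard Bradley–Terry form, -log σ(r_chosen - r_rejected); the reward values are made up and `reward_model_loss` is a hypothetical helper.

```python
import math

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Written as log(1 + exp(-margin)); fine for the small margins used here.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward model scores the chosen response higher.
for chosen, rejected in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(chosen, rejected, round(reward_model_loss(chosen, rejected), 4))
```

Stage (2) then treats this trained scorer as the reward signal for PPO, which is where the second model and the live sampling come from.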

DPO: skip the reward model

DPO [7] reframes RLHF as a classification problem. The key insight: the reward model in RLHF is implicit in the language model itself, so you don’t have to train it separately. DPO drops the reward model entirely and optimizes directly on (chosen, rejected) pairs with a classification loss. No RL, no live sampling, no second model.

It matches or beats PPO-based RLHF on summarization and dialogue while being substantially simpler.

{
  "prompt": "Explain neural networks to a 10-year-old.",
  "chosen": "Think of it like a brain made of math — lots of tiny switches that learn to turn on or off based on examples.",
  "rejected": "Neural networks are computational systems loosely inspired by biological neural networks that constitute animal brains."
}

The model learns to increase the probability of chosen relative to rejected.
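Here is a sketch of that objective for a single pair, assuming the standard DPO loss from [7]: a logistic loss on the difference of policy-vs-reference log-ratios, scaled by β. The log-probabilities are made-up sequence-level values and `dpo_loss` is a hypothetical helper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-logits))

# Policy identical to the frozen reference: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised `chosen` and lowered `rejected` relative to the reference.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(round(start, 4), round(later, 4))
```

The reference model is only there to measure how far the policy has moved; nothing is sampled and no separate reward model is scored.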

KTO: when you only have thumbs

KTO [8] solves a practical problem: most teams don’t have pairwise preference data. They have logs of responses users liked or didn’t. KTO was designed for exactly this. Grounded in Kahneman–Tversky prospect theory, it optimizes directly on binary feedback (desirable / undesirable) without matched pairs. It matches or exceeds DPO across model scales from 1B to 30B.

{
  "prompt": "Summarize this customer complaint.",
  "completion": "The customer is frustrated about a delayed shipment and is requesting a refund.",
  "label": true
}

true = desirable, false = not. No paired responses needed.
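A sketch of the per-example loss, as I understand it from [8]. It assumes the KL reference point is supplied as a number (`kl_estimate`; the paper estimates it from the batch), and the log-probabilities, default weights, and `kto_loss` itself are illustrative rather than a library API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(policy_logp, ref_logp, desirable, kl_estimate=0.0,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one (prompt, completion, label) example.

    Desirable completions are pushed above the reference point,
    undesirable ones below it.
    """
    log_ratio = policy_logp - ref_logp
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_estimate)))
    return lambda_u * (1.0 - sigmoid(beta * (kl_estimate - log_ratio)))

# Same completion, policy now rates it higher than the reference does.
liked = kto_loss(policy_logp=-9.0, ref_logp=-10.0, desirable=True)
disliked = kto_loss(policy_logp=-9.0, ref_logp=-10.0, desirable=False)
print(round(liked, 4), round(disliked, 4))
```

Each example is scored on its own, which is why a single thumbs-up or thumbs-down is enough.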

GRPO: RL without the critic

GRPO is RL-based, but it does one thing differently from PPO: no value function. Instead of maintaining a separate critic network, GRPO samples a group of outputs for each prompt and uses their average reward as the baseline. This cuts memory 40–60% vs. PPO, making large-scale RL training tractable.

GRPO was central to DeepSeek-R1 [9], which rivals OpenAI’s o1 on reasoning benchmarks while requiring ~147K H800 GPU-hours, an order of magnitude less than comparable models. Notably, DeepSeek-R1-Zero used pure RL with GRPO and no SFT at all, producing emergent behaviors like self-correction mid-reasoning.
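The group baseline described above fits in a few lines. This sketch also divides by the group’s standard deviation, which is how I understand the usual GRPO formulation; using the population standard deviation and the `eps` value are my assumptions, and the rewards are made up.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output against its own group, no critic needed."""
    baseline = mean(rewards)   # the group average stands in for a value function
    spread = pstdev(rewards)   # population std of the group (assumption)
    return [(r - baseline) / (spread + eps) for r in rewards]

# Four sampled answers to one math prompt: 1.0 if verified correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Correct answers get a positive advantage and incorrect ones a negative advantage, relative to the group average, with no second network involved.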

Pick the right tool

The decision usually compresses to: what data do I have, and what am I optimizing for?

Situation	Recommended	Why
Domain adaptation (medical, legal, e-commerce)	SFT + LoRA	Efficient domain shift without full retraining
Enforce output format or style	SFT + LoRA	Format is a correctness problem — SFT is the right axis
Limited compute / single-GPU fine-tuning	QLoRA	4-bit quantization fits a 65B model on a single 48GB GPU
Serving multiple fine-tuned variants	Multi-LoRA	One base, many adapters, shared GPU
Cold start — only binary feedback	KTO	No pairwise data needed, works on 👍/👎 logs
Pairwise preference alignment	DPO	Simpler than RLHF, matches PPO on most tasks
Improve helpfulness or tone from existing logs	KTO	Binary signal is enough; no annotation pipeline
Full safety alignment at scale	RLHF / PPO	Multi-objective optimization, highest ceiling
Reasoning tasks (math, code)	GRPO	Verifiable reward, 40–60% less memory than PPO
Prevent catastrophic forgetting	LoRA / PEFT	Frozen base weights preserve general capabilities

Decision table for common post-training situations
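The QLoRA row is easy to sanity-check with arithmetic. This counts the base weights only and ignores quantization constants, adapter weights, activations, and optimizer state, so treat it as a lower bound; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params_65b = 65e9

fp16_gb = weight_memory_gb(params_65b, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(params_65b, 4)    # 4-bit weights

print(fp16_gb, nf4_gb)
```

At 16 bits the weights alone need 130 GB; at 4 bits they need 32.5 GB, which leaves headroom on a 48GB card for everything the estimate leaves out.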

The headline: start with LoRA + SFT for correctness, layer in DPO or KTO for preference once you have signal, and only reach for RLHF or GRPO when an offline method has plateaued and you have the infra to absorb the cost.

Part two picks up from here — real runs, real numbers, and what actually moved.

References

[1] Ouyang et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
[2] Zhou et al. (2023). LIMA: Less Is More for Alignment.
[3] Li et al. (2024). From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection. NAACL.
[4] Wang et al. (2025). Catastrophic forgetting scales with model size in fine-tuning.
[5] Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
[6] Sheng et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters.
[7] Rafailov et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.
[8] Ethayarajh et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML.
[9] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
