
LLM Post-Training Explained: SFT, DPO, and GRPO

ai.rs Mar 13, 2026

What Is Post-Training?

When a company like Meta or Mistral releases a model family such as Llama, it ships two versions: a base model and an instruct model. The base model is the raw output of pre-training — it can autocomplete text but can't follow instructions, answer questions, or hold a conversation. The instruct model does all of that.

The difference is post-training: the set of techniques applied after pre-training that transform a text-completion engine into an AI assistant.

If pre-training is like giving someone a library of books to read, post-training is teaching them how to have a conversation about what they've read.

Post-Training vs. Fine-Tuning

These terms overlap but aren't identical:

                Post-Training                          Fine-Tuning
Goal            General-purpose assistant              Task-specific expert
Data size       1M+ samples                            10K-1M samples
Who does it     Model providers (Meta, Mistral, etc.)  End users and businesses
Output          Instruct/chat model                    Domain-adapted model
Techniques      SFT + DPO + RL                         Usually SFT only

Post-training is what turns Llama into Llama-Instruct. Fine-tuning is what turns Llama-Instruct into your custom product assistant. They use the same underlying methods (especially SFT), but at different scales and for different purposes.

The Three-Stage Pipeline

Modern post-training follows a three-stage pipeline, each building on the previous:

Base Model (autocomplete)
    → SFT   (follows instructions)
    → DPO   (prefers good responses)
    → GRPO  (reasons step-by-step)
    → Aligned Model

Stage 1: Supervised Fine-Tuning (SFT)

SFT is the most intuitive stage. You show the model thousands of instruction-response pairs and train it to produce similar outputs.

What It Does

A base model given "What is the capital of France?" might continue with "What is the capital of Germany? What is..." — it's autocompleting, not answering. After SFT, it responds: "The capital of France is Paris."

SFT teaches three capabilities:

  • Instruction following — Understanding what the user is asking
  • Format compliance — Responding in the expected structure (chat, JSON, code)
  • Knowledge activation — Surfacing relevant knowledge from pre-training

Training Approaches

There are three ways to run SFT, each with different trade-offs:

Method            Quality                    VRAM                  Speed   When to Use
Full Fine-Tuning  Best                       Very high (2x model)  Slow    You have multiple A100s
LoRA              Near-full                  High (1x model + 5%)  Fast    Default choice for most teams
QLoRA             Good (slight degradation)  Low (0.25x model)     Medium  Consumer GPUs, prototyping

LoRA (Low-Rank Adaptation) is the standard for most practical work. It freezes the base model weights and trains small adapter matrices (~2% of total parameters), achieving near-full quality at a fraction of the compute.
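
That "~2% of total parameters" figure is easy to sanity-check for a single weight matrix. The sketch below computes the adapter fraction from the LoRA factorization; the 4096×4096 shape and rank 16 are illustrative values, and the exact total fraction depends on the rank you choose and which matrices get adapters.

```python
def lora_param_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of extra trainable parameters LoRA adds to one
    frozen weight matrix of shape (d_out, d_in).

    LoRA models the weight update dW as B @ A, where
    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    base = d_out * d_in              # frozen parameters in W
    adapter = rank * (d_in + d_out)  # trainable parameters in A and B
    return adapter / base

# An illustrative 4096x4096 attention projection with rank 16:
# 131,072 adapter params vs ~16.8M frozen params.
print(f"{lora_param_fraction(4096, 4096, 16):.2%}")  # → 0.78%
```

Because the adapter size grows with rank times the sum of the dimensions rather than their product, even generous ranks stay a small fraction of the frozen weights.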

QLoRA goes further by quantizing the base model to 4-bit precision, cutting VRAM by 4x. The trade-off is a small quality drop — good enough for experimentation, but production models typically use LoRA or full fine-tuning.

Key Parameters

These are the training parameters that matter most for SFT:

  • Learning rate: 1e-5 to 5e-5 (too high = catastrophic forgetting, too low = no learning)
  • Epochs: 3-5 (more isn't better — the model overfits quickly on small datasets)
  • Batch size: 8-16 (larger batches smooth gradients but need more VRAM)
  • Max sequence length: 2048-8192 tokens (longer = more context but slower training)
  • Optimizer: AdamW with weight decay 0.01
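
The optimizer bullet can be made concrete by writing out a single-parameter AdamW update in plain Python. This is a teaching sketch of the decoupled-weight-decay update, not a replacement for torch.optim.AdamW; the learning rate and weight decay below match the ranges listed above.

```python
import math

def adamw_step(param, grad, state, lr=2e-5, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    `state` carries the step count and the moving averages of the
    gradient (m) and squared gradient (v) between calls.
    """
    state["t"] += 1
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    # Decoupled weight decay: shrink the weight directly instead of
    # folding decay into the gradient -- the "W" in AdamW.
    param -= lr * weight_decay * param
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param

state = {"t": 0, "m": 0.0, "v": 0.0}
w = adamw_step(1.0, grad=0.5, state=state)
```

The decoupling matters at these small learning rates: weight decay acts as a gentle pull toward zero that is independent of the gradient scale, which helps limit drift away from the pre-trained weights.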

Dataset Quality Matters More Than Size

The three pillars of a good SFT dataset:

  1. Accuracy — Every response must be correct. One wrong answer teaches the model to hallucinate.
  2. Diversity — Cover the full range of tasks: Q&A, reasoning, coding, math, creative writing.
  3. Complexity — Include multi-step reasoning, not just simple factual recall.

A curated dataset of 50K high-quality samples outperforms a noisy dataset of 500K every time.


Stage 2: Direct Preference Optimization (DPO)

SFT teaches the model to produce reasonable responses. DPO teaches it which response is better when there are multiple valid options.

The Core Idea

DPO works with preference pairs — for each prompt, you provide a chosen (good) response and a rejected (bad) response:

Prompt: "Explain quantum computing"
Chosen: [clear, accurate, well-structured explanation]
Rejected: [vague, overly technical, or slightly wrong explanation]

The training objective widens the probability gap between chosen and rejected responses. The model learns not just what to say, but what not to say.
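
That objective fits in a few lines. The sketch below takes each response's total log-probability under the policy and under a frozen reference model, and applies the standard β-scaled log-sigmoid loss (β is a hyperparameter, commonly around 0.1); a real implementation operates on batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability a model assigns to a
    response. Minimizing the loss widens the margin by which the
    policy prefers the chosen response over the rejected one,
    relative to the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1 / (1 + math.exp(-logits)))  # -log(sigmoid(logits))

# Before training, policy == reference, so the loss starts at log(2):
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# Once the policy prefers the chosen response more strongly than the
# reference does, the loss drops below log(2):
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Note that only the margins relative to the reference model matter, which is what keeps the policy from drifting arbitrarily far from its SFT starting point.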

Why Not Just More SFT?

SFT has a ceiling. It teaches the model to imitate training data, but it can't distinguish between good-enough and excellent responses. DPO adds a quality signal that pushes the model toward the better end of its capability range.

Concretely:

  • SFT: "Here's how to respond to this type of question"
  • DPO: "Between these two responses, this one is better because..."

The Policy Drift Problem

DPO has an important pitfall: off-policy data. If your preference data was generated by a different model (say, GPT-4), there's a mismatch between what that model would say and what your model would say. The training signal becomes noisy.

The solution is on-policy data generation: use your own model to generate responses, then have them judged:

Prompt → Your Model generates 2+ responses
                    ↓
           LLM Jury ranks them
                    ↓
         Best = Chosen, Worst = Rejected
                    ↓
              Train with DPO

This creates a tighter feedback loop — the model learns from its own mistakes rather than from another model's outputs.
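
The loop above can be sketched in a few lines. Here `generate` and `judge` are hypothetical stand-ins for your model's sampling call and an LLM-jury scoring call; the toy versions at the bottom exist only to show the data flow.

```python
def build_preference_pair(prompt, generate, judge, n=4):
    """On-policy preference pair: sample n responses from *your own*
    model, score each with a judge, and keep the best and worst.

    `generate(prompt, n)` returns n candidate responses;
    `judge(prompt, response)` returns a numeric quality score.
    """
    responses = generate(prompt, n)
    ranked = sorted(responses, key=lambda r: judge(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

# Toy stand-ins: a fake sampler and a judge that prefers higher versions.
fake_generate = lambda p, n: [f"answer v{i}" for i in range(n)]
fake_judge = lambda p, r: int(r.split("v")[1])
pair = build_preference_pair("Explain DPO", fake_generate, fake_judge)
```

Discarding the middle-ranked responses is deliberate: the widest quality gap in the group gives the cleanest training signal per pair.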

State-of-the-Art DPO Techniques

Recent improvements that push DPO further:

  • Length normalization — Prevents the model from learning that longer = better
  • Anchored preference optimization — Adds a reference anchor to stabilize training
  • Refine chosen answers — Use a stronger model to polish the "chosen" response before training
  • Rubric-based scoring — Rate responses on specific criteria (accuracy, helpfulness, safety) instead of binary better/worse
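
Of these, length normalization is the simplest to illustrate: divide each response's summed log-probability by its token count before computing the preference margin, so response length no longer skews the implicit reward.

```python
def length_normalized_logp(token_logps):
    """Average per-token log-probability of a response.

    Using this instead of the raw sum when computing preference
    margins removes length as a confound: a long response is no
    longer penalized (or favored) just for being long.
    """
    return sum(token_logps) / len(token_logps)

# Same per-token quality, different lengths -> same normalized score:
short = [-1.5, -0.5]              # sum = -2.0
long = [-1.5, -0.5, -1.5, -0.5]   # sum = -4.0
assert length_normalized_logp(short) == length_normalized_logp(long)
```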

Stage 3: Reinforcement Learning (GRPO)

The newest and most powerful stage. While SFT teaches imitation and DPO teaches preference, RL teaches the model to reason — to try multiple approaches and learn which thinking patterns lead to correct answers.

What Is GRPO?

Group Relative Policy Optimization (GRPO) was introduced by DeepSeek and powers models like DeepSeek-R1. Unlike PPO, the traditional RL method that requires a separate critic model, GRPO is simpler:

  1. Given a prompt, sample a group of responses (e.g., 8 completions)
  2. Score each response with a reward function
  3. Normalize scores within the group to compute advantages
  4. Update the model to produce more high-scoring and fewer low-scoring responses

The key insight: by comparing responses within a group, GRPO doesn't need an absolute value estimate. It just needs to know which responses in the batch were relatively better.
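
Step 3 of the recipe is just a z-score within the group. The sketch below shows the normalization on a list of scalar rewards; the group size of 8 and the binary rewards are illustrative.

```python
def group_advantages(rewards):
    """GRPO-style advantages: normalize each reward against the
    group's mean and standard deviation, so only *relative* quality
    within the sampled group matters.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:           # all responses scored the same
        return [0.0] * n   # -> no learning signal from this group
    return [(r - mean) / std for r in rewards]

# 8 sampled completions scored by a binary correctness reward.
# Correct answers get a positive advantage, wrong ones negative:
advs = group_advantages([1, 0, 0, 1, 0, 0, 0, 0])
```

The degenerate case is worth noticing: if every response in the group gets the same reward (all correct or all wrong), the advantages are all zero and the prompt contributes nothing — which is why GRPO prompt sets are curated to sit at the edge of the model's ability.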

Reward Functions

The reward function is what drives learning. There are two categories:

Rule-based rewards (easy to implement):

  • Math: Does the answer match the correct solution?
  • Code: Does it pass the test cases?
  • Format: Does it follow the requested structure?

Model-based rewards (harder, more general):

  • A separate LLM judges response quality
  • More flexible but introduces another model's biases

For most practical applications, rule-based rewards work best because they give an unambiguous signal. This is why RL has been most successful for math and code — the reward is binary (correct or not).
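
A minimal rule-based reward for math might look like the sketch below. The "Answer: <number>" convention is an assumption for illustration — use whatever final-answer format your prompts actually enforce.

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary rule-based reward: 1.0 if the model's final answer
    matches the reference answer, else 0.0.

    Assumes completions end with a line like 'Answer: 42'
    (an illustrative convention, not a standard).
    """
    match = re.search(r"Answer:\s*(-?[\d.,]+)", completion)
    if not match:
        return 0.0  # unparseable output earns no reward
    predicted = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if predicted == gold_answer else 0.0

print(math_reward("3 * 4 = 12, so\nAnswer: 12", "12"))  # → 1.0
print(math_reward("I think it's probably 12.", "12"))   # → 0.0
```

The strictness is a feature: a completion that fails to state its answer in the required format scores zero, so format compliance gets trained alongside correctness.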

Why RL Matters

RL is what gives models like DeepSeek-R1 and OpenAI o1 their reasoning abilities. The model learns to:

  • Break problems into steps
  • Try multiple approaches
  • Verify its own work
  • Backtrack when a path isn't working

This emergent behavior doesn't come from SFT (you'd need millions of perfect chain-of-thought examples) or DPO (preference pairs don't capture reasoning processes well). RL lets the model discover reasoning strategies through trial and error.


The Three Eras of Post-Training

Post-training has evolved rapidly:

SFT Era (2017-2023)

Started with the original Transformer paper and RLHF from InstructGPT. The focus was on making models follow instructions at all. Key models: GPT-3.5, early ChatGPT.

DPO Era (2023-2024)

DPO removed the complexity of RLHF by eliminating the separate reward model. Alignment became accessible to smaller teams. Key models: Zephyr, Intel's NeuralChat, early Llama fine-tunes.

RL Era (2025+)

DeepSeek-R1 proved that pure RL could produce breakthrough reasoning capabilities. GRPO became the standard. Key models: DeepSeek-R1, QwQ, Kimi k1.5.


Practical Considerations

When Do You Need Post-Training vs. Fine-Tuning?

Most developers don't need to run the full post-training pipeline. Here's a decision tree:

  1. Start with an instruct model — Someone already did post-training for you
  2. Try RAG first — Inject domain knowledge at inference time
  3. Fine-tune with SFT if you need: specific tone/voice, domain-specific formatting, or consistent behavior patterns
  4. Consider DPO if: your model produces decent responses but lacks consistency in quality
  5. Consider RL only if: you have a clear reward signal (code correctness, math accuracy) and significant compute

Tools of the Trade

Tool                 Best For                        Complexity
Unsloth              SFT and DPO, beginner-friendly  Low
TRL (Hugging Face)   Full pipeline including GRPO    Medium
OpenRLHF             Large-scale distributed RL      High
torchtune (PyTorch)  SFT with native PyTorch         Medium

For most teams, Unsloth for SFT/DPO and TRL for GRPO covers the full pipeline.

The Cost Spectrum

Stage  Compute          Data Required              Typical Duration
SFT    1 GPU, hours     10K-100K samples           3-8 hours
DPO    1-2 GPUs, hours  10K-50K preference pairs   4-12 hours
GRPO   4-8+ GPUs, days  Prompts + reward function  1-7 days

SFT is accessible to anyone with a single GPU. DPO adds moderate cost. RL requires serious infrastructure — this is why it's mostly done by labs and well-funded teams.


Pros and Cons

Pros of Post-Training

  • Transforms capability — A base model is nearly useless for end users; post-training makes it practical
  • Composable stages — Each stage addresses a different weakness; you can stop at any stage
  • SFT is accessible — Anyone with a GPU and good data can fine-tune a model in hours
  • RL unlocks reasoning — Capabilities that can't be taught through imitation alone
  • Open tooling — Unsloth, TRL, and others make the full pipeline available to everyone

Cons of Post-Training

  • Data quality is everything — Bad training data makes the model worse, not better
  • Catastrophic forgetting — Aggressive training can destroy pre-trained knowledge
  • RL is expensive — Full GRPO requires multi-GPU setups and days of compute
  • Alignment tax — Safety training can reduce raw capability (the model becomes cautious)
  • Evaluation is hard — Unlike pre-training loss, post-training quality is subjective and task-dependent
  • Policy drift — DPO with off-policy data produces unreliable results

Key Takeaways

  1. Post-training is the bridge between a raw language model and a useful AI assistant
  2. Three stages: SFT (follow instructions) → DPO (prefer better responses) → RL (learn to reason)
  3. Start with instruct models — Don't reinvent the wheel unless you have specific requirements
  4. SFT is the most practical stage for business fine-tuning with LoRA
  5. RL is the frontier — It's how the best reasoning models are built, but it requires significant resources
  6. Dataset quality > quantity — Always

For a deeper dive into fine-tuning for your specific use case, see Why Train Your Own LLM and What Is Fine-Tuning?


This article draws on Maxime Labonne's presentation "Introduction to Post-Training Techniques" and current research from DeepSeek, Hugging Face, and the open-source ML community.
