
Synthetic Data for Fine-Tuning: How to Generate Your Own Training Set

ai.rs Mar 16, 2026

The Data Bottleneck

You've read about fine-tuning and post-training. You understand SFT, DPO, and LoRA. You have a GPU ready. But when you sit down to actually train a model, you hit the real wall: you need thousands of high-quality training samples, and you have maybe a hundred.

Manual data creation is slow. A skilled annotator produces 20-50 instruction-response pairs per hour. At that rate, a 10,000-sample dataset takes 200-500 hours of human labor — months of work before training even begins.

Synthetic data generation solves this. Instead of writing every sample by hand, you use LLMs to generate, judge, and filter training data at scale. The result: 10,000+ samples in hours, not months.

The Synthetic Data Pipeline

The modern synthetic data pipeline has five stages:

Seed Prompts → Policy Model → LLM Jury → Heuristic Filter → Training Dataset
   (100s)       (generates)    (ranks)      (quality gate)     (10,000s)

Each stage has a specific role, and getting any one wrong poisons the entire dataset.

Stage 1: Seed Prompts

Everything starts with prompts — the questions and instructions your model will learn to handle. You need diverse, realistic prompts that cover your target domain.

Where to get seed prompts:

  • Existing customer data — Real questions from support tickets, search logs, or chat history
  • Manual curation — Write 100-500 high-quality prompts covering key scenarios
  • Prompt evolution — Use an LLM to create variations of your seeds
  • Public datasets — Alpaca, ShareGPT, UltraChat as starting points (filter for relevance)

Prompt evolution example:

Seed: "What's the best laptop for video editing under $1500?"

Evolved variants:
→ "I need a laptop for 4K video editing. Budget is flexible but under $2000."
→ "Compare the MacBook Pro M3 and Dell XPS 16 for Premiere Pro workflows."
→ "What specs matter most for DaVinci Resolve — RAM, GPU, or CPU?"
→ "I edit YouTube videos as a side hustle. What's the minimum I should spend?"

From 100 seed prompts, evolution can generate 1,000-5,000 diverse variants. The key is ensuring they span different intents (compare, recommend, explain, troubleshoot), complexity levels (simple factual to multi-step reasoning), and edge cases (out-of-scope requests, ambiguous queries).
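Prompt evolution needs surprisingly little code. Here's a minimal sketch: `generate` stands in for whatever LLM client you use, and the operator prompts are illustrative, loosely in the spirit of Evol-Instruct rather than canonical templates:

```python
import random

# Evolution operators: each rewrites a seed along one axis (constraints,
# comparison, vagueness, troubleshooting). Wording here is illustrative.
OPERATORS = [
    "Rewrite this question with a stricter constraint (budget, deadline, or spec):",
    "Rewrite this question to compare two concrete alternatives:",
    "Rewrite this question as a vaguer, more casual version:",
    "Rewrite this question as a multi-step troubleshooting request:",
]

def evolve_prompts(seeds, generate, variants_per_seed=10):
    """Expand a small seed set into a larger, more diverse prompt pool.
    `generate` is any callable that sends text to an LLM and returns text."""
    evolved = []
    for seed in seeds:
        for _ in range(variants_per_seed):
            operator = random.choice(OPERATORS)
            evolved.append(generate(f"{operator}\n\n{seed}"))
    return evolved
```

Passing the LLM call in as a callable keeps the evolution logic testable without an API key.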

Stage 2: Response Generation

For each prompt, generate multiple responses. This is where the pipeline splits depending on whether you're creating SFT data or preference data.

For SFT data — Generate one high-quality response per prompt:

for prompt in seed_prompts:
    response = model.generate(
        prompt,
        temperature=0.7,  # Some creativity
        max_tokens=2048
    )
    dataset.append({"instruction": prompt, "response": response})

For DPO preference data — Generate multiple responses and rank them:

for prompt in seed_prompts:
    responses = [
        model.generate(prompt, temperature=0.9)  # Higher temp = more variety
        for _ in range(4)  # 4 candidates per prompt
    ]
    # Judge picks best and worst → chosen/rejected pair (see Stage 3)
    scores = [judge.score(prompt, r) for r in responses]
    preference_pairs.append({
        "prompt": prompt,
        "chosen": responses[scores.index(max(scores))],
        "rejected": responses[scores.index(min(scores))],
    })

Which model to use for generation:

Strategy                              Pros                      Cons
Use a stronger model (GPT-4, Claude)  Higher quality responses  Off-policy for DPO, API costs
Use your own model                    On-policy (best for DPO)  Quality ceiling = current model
Mix both                              Best of both worlds       More complex pipeline

For SFT data, using a stronger model is fine — you're teaching your model to imitate good responses. For DPO, you should use your own model (on-policy) to avoid the policy drift problem discussed in the post-training article.

Stage 3: LLM-as-Judge

Raw generated responses vary in quality. The LLM jury scores and ranks them:

Prompt: [the user's question]
Response A: [candidate 1]
Response B: [candidate 2]

Evaluate both responses on:
1. Accuracy (0-5): Are the facts correct?
2. Helpfulness (0-5): Does it address the user's need?
3. Clarity (0-5): Is it well-structured and easy to follow?
4. Safety (0-5): Does it avoid harmful content?

Which response is better overall? Explain why.

Important: Use a judge model different from the one that generated the responses. If a model judges its own output, it's biased toward its own style regardless of quality.

Rubric-based scoring outperforms simple "which is better" judgments. When the judge evaluates on specific criteria, the signal is clearer and more consistent.
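A sketch of rubric-based pair judging, with one extra trick: querying the judge twice with the response order swapped cancels out position bias (judges tend to favor whichever response appears first). `ask_judge` is a hypothetical wrapper around your judge model that returns the JSON the rubric asks for:

```python
import json

RUBRIC = (
    "Evaluate both responses on accuracy, helpfulness, clarity, and safety "
    '(0-5 each). Reply with JSON only: {"a": {...}, "b": {...}}.'
)

def judge_pair(prompt, resp_a, resp_b, ask_judge):
    """Score a pair twice, once in each order, to cancel position bias."""
    def score(first, second):
        raw = ask_judge(
            f"{RUBRIC}\n\nPrompt: {prompt}\n"
            f"Response A: {first}\nResponse B: {second}"
        )
        parsed = json.loads(raw)
        return sum(parsed["a"].values()), sum(parsed["b"].values())

    a1, b1 = score(resp_a, resp_b)       # original order
    b2, a2 = score(resp_b, resp_a)       # swapped order
    return (a1 + a2) / 2, (b1 + b2) / 2  # position-debiased totals
```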

Stage 4: Heuristic Filtering

Even with LLM judging, some samples are bad. Apply hard filters:

  • Length ratio — Reject pairs where chosen and rejected are nearly identical length (no learning signal)
  • Score threshold — Drop responses scoring below 3/5 on any criterion
  • Deduplication — Remove near-duplicate prompts (cosine similarity > 0.95)
  • Format compliance — Ensure responses match expected structure
  • Toxicity filter — Run a classifier to catch harmful content the judge missed

Expect to drop 20-40% of generated samples at this stage. That's normal and desirable — aggressive filtering produces a cleaner dataset.
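The filters above fit in a few lines. A minimal sketch, with illustrative thresholds; `difflib` similarity stands in for the embedding cosine check (swap in real embeddings for production):

```python
from difflib import SequenceMatcher

def filter_dataset(samples, min_score=3, len_ratio_cut=0.95, dup_cut=0.95):
    """Apply hard quality gates to judged preference samples.
    Each sample is assumed to hold 'prompt', 'chosen', 'rejected', and a
    per-criterion 'scores' dict; adapt the keys to your own schema."""
    kept, seen_prompts = [], []
    for s in samples:
        # Score threshold: drop anything below 3/5 on any rubric criterion
        if any(v < min_score for v in s["scores"].values()):
            continue
        # Length ratio: near-identical chosen/rejected lengths = weak signal
        lens = sorted([len(s["chosen"]), len(s["rejected"])])
        if lens[1] and lens[0] / lens[1] > len_ratio_cut:
            continue
        # Deduplication: skip prompts too similar to ones already kept
        if any(SequenceMatcher(None, s["prompt"], p).ratio() > dup_cut
               for p in seen_prompts):
            continue
        seen_prompts.append(s["prompt"])
        kept.append(s)
    return kept
```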

Stage 5: Optional Refinement

A recent improvement: after selecting the "chosen" response, pass it through a refiner model that polishes it further:

Here is a response to a user question. Improve it while keeping
the same core content. Fix any errors, improve clarity, and ensure
the tone is helpful and professional.

[original chosen response]

This consistently improves DPO training because the chosen response becomes genuinely better, not just the least-bad option from the batch.
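As a sketch, the refinement pass is one prompt template and a loop. `generate` is a hypothetical wrapper around your refiner model:

```python
REFINE_PROMPT = (
    "Here is a response to a user question. Improve it while keeping the "
    "same core content. Fix any errors, improve clarity, and ensure the "
    "tone is helpful and professional.\n\n{response}"
)

def refine_chosen(pairs, generate):
    """Polish each chosen response in place before training."""
    for pair in pairs:
        pair["chosen"] = generate(REFINE_PROMPT.format(response=pair["chosen"]))
    return pairs
```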


Practical Example: Building a Product Q&A Dataset

Let's walk through generating a 10,000-sample SFT dataset for an e-commerce product assistant.

Step 1: Collect Seeds (200 prompts)

Sources:

  • 80 from customer support tickets
  • 60 hand-written covering product categories
  • 60 evolved from the first 140

Step 2: Evolve to 2,500 Prompts

Use an LLM to generate 10-15 variants per seed prompt, varying:

  • Product category
  • Customer intent (buy, compare, troubleshoot, return)
  • Specificity (vague vs. detailed)
  • Tone (casual, urgent, professional)

Step 3: Generate Responses

Use a strong model (Claude/GPT-4) with your product catalog in context via RAG:

system_prompt = """You are a product expert for [store name].
Use only the product information provided. If a product doesn't
exist in the catalog, say so. Never make up products or prices."""

for prompt in evolved_prompts:
    products = rag_search(prompt, top_k=5)
    response = generate(
        system=system_prompt,
        context=products,
        user=prompt,
        temperature=0.7
    )
    dataset.append({"instruction": prompt, "response": response})

Step 4: Judge and Filter

Run each response through the jury:

  • Score on accuracy, helpfulness, product knowledge, format
  • Drop responses scoring < 3 on accuracy (these contain hallucinations)
  • Drop near-duplicates

Result: ~2,000 high-quality SFT samples from 2,500 prompts (80% pass rate)

Step 5: Augment with Multi-Turn

Convert single-turn Q&As into conversations:

for sample in sft_data[:500]:
    follow_up = generate_follow_up(sample["instruction"], sample["response"])
    continuation = generate(context=sample, user=follow_up)
    # Creates a 2-turn conversation alongside the original sample
    multi_turn.append({
        "turns": [sample["instruction"], sample["response"],
                  follow_up, continuation]
    })

Final dataset: ~2,000 single-turn + ~500 multi-turn = ~2,500 samples

Repeat the cycle three more times with different prompt evolution seeds, and you have your 10,000-sample dataset.


Common Pitfalls

1. Model Collapse

If you train on your own model's output, then generate more data, then train again — each cycle amplifies the model's biases. After 3-4 iterations, responses become repetitive and quality degrades.

Fix: Always use fresh seed prompts and mix in human-written samples (even 10-20% human data prevents collapse).

2. Reward Hacking in Preference Data

The LLM judge has predictable preferences: longer responses, bullet points, hedging language ("It's important to note..."). Models learn to game these signals instead of improving actual quality.

Fix: Use length-normalized scoring. Penalize filler phrases. Score on rubrics, not vibes.
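A minimal sketch of length-normalized scoring. The reference length, the square-root softening, and the filler-phrase regex are all illustrative tuning knobs, not established values:

```python
import math
import re

# Hedging phrases the judge tends to over-reward; extend for your domain
FILLER = re.compile(r"it'?s (important|worth) (to note|noting)", re.I)

def normalized_score(raw_score, response, ref_len=300, penalty=0.5):
    """Scale the raw rubric score down as the response grows past a
    reference length, and subtract a flat penalty per filler phrase."""
    length_factor = min(1.0, ref_len / max(len(response), 1))
    # sqrt softens the penalty: a 4x-too-long answer loses half, not 75%
    score = raw_score * math.sqrt(length_factor)
    score -= penalty * len(FILLER.findall(response))
    return max(score, 0.0)
```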

3. Distribution Mismatch

Your synthetic prompts might not match what real users actually ask. If you train on academic-style questions but users ask casual ones, the model struggles.

Fix: Start with real user data as seeds. Validate synthetic prompts against actual query logs.

4. Contamination

If the generating model was trained on your evaluation benchmark, it will produce responses that look correct on your evals but fail on real tasks.

Fix: Hold out a manually-created test set that no model has seen. Evaluate on real user satisfaction, not benchmark scores.


Tools for Synthetic Data Generation

Tool                   Best For                               Notes
distilabel (Argilla)   Full pipelines, production use         Most complete framework; supports UltraFeedback-style pipelines
Magpie (Hugging Face)  Extracting instruction data from LLMs  Clever technique: uses the model's chat template to elicit natural instructions
Self-Instruct          Quick SFT data from seeds              The original paper's approach; simple but effective
Evol-Instruct          Increasing prompt complexity           WizardLM's approach: iteratively make prompts harder
Your own scripts       Custom pipelines                       50 lines of Python + an API key is often enough

For most teams, distilabel is the right starting point — it handles the full pipeline (generation, judging, filtering) with built-in support for multiple LLM providers.


How Much Data Do You Need?

Goal                SFT Samples  Preference Pairs  Notes
Tone/style change   1K-5K        Not needed        Smallest useful dataset
Domain adaptation   5K-20K       5K-10K            The sweet spot for most businesses
New capability      20K-100K     10K-50K           Teaching the model something fundamentally new
Full post-training  100K+        50K+              What model providers do; you probably don't need this

Start with 5K samples and evaluate. Add more data only when you can identify specific gaps in performance — more data without direction just adds noise.


Key Takeaways

  1. Synthetic data removes the data bottleneck — Generate 10,000+ samples in hours instead of months
  2. Quality > quantity — Aggressive filtering (drop 20-40%) produces better models than keeping everything
  3. Use a different model as judge — Self-evaluation introduces bias
  4. Mix in human data — Even 10-20% prevents model collapse across iterations
  5. Start with real user prompts — Synthetic diversity means nothing if the distribution doesn't match reality
  6. Iterate small — Start with 5K samples, evaluate, identify gaps, then scale up

For the full context on how this fits into model training, see LLM Post-Training Explained and Why Train Your Own LLM.
