When to Fine-Tune (and When Not To)
Fine-tuning is not the first tool you should reach for. Before you fine-tune:
- Try better prompting (few-shot, chain-of-thought)
- Try RAG (retrieval-augmented generation)
- Only then consider fine-tuning
Fine-tuning shines when you need consistent formatting, domain-specific terminology, or behavior that’s hard to describe in a prompt.
Data Preparation
This is the most important phase. Garbage in, garbage out applies 10x to fine-tuning.
```python
# Quality filter: remove short, repetitive, or low-signal examples
def quality_filter(example):
    # Drop outputs too short to carry signal
    if len(example["output"]) < 50:
        return False
    # Drop examples that merely echo the input
    if example["output"] == example["input"]:
        return False
    return True

dataset = dataset.filter(quality_filter)
```
Rule of thumb: 500-1,000 high-quality examples beat 10,000 noisy ones every time.
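Beyond per-example filtering, exact and near-duplicate outputs are usually worth removing too. A minimal hash-based dedup sketch (the normalization scheme here is an illustrative assumption, not a prescription):

```python
import hashlib

def dedup(examples):
    """Keep only the first occurrence of each normalized output."""
    seen, kept = set(), []
    for ex in examples:
        # Normalize lightly so trivial variants hash identically
        key = hashlib.sha256(ex["output"].strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

data = [
    {"input": "a", "output": "The model responded well."},
    {"input": "b", "output": "the model responded well. "},  # near-duplicate after normalization
    {"input": "c", "output": "A different answer."},
]
print(len(dedup(data)))  # 2
```

For larger corpora you would swap the exact-hash key for a fuzzier one (e.g. MinHash), but the keep-first-occurrence structure stays the same.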
Training Configuration
For a 7B parameter model, these settings have worked consistently:
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 3 |
| Batch size | 4 (with gradient accumulation 4) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Warmup ratio | 0.03 |
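The table above translates directly into a `peft` + `transformers` configuration. This is a sketch, not a drop-in script: `target_modules` and `lora_dropout` are assumptions (they are model-specific and not part of the table), and the output directory name is arbitrary.

```python
# LoRA fine-tuning config mirroring the table above.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,                    # assumed; tune per model
    target_modules=["q_proj", "v_proj"],  # assumed; depends on architecture
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="finetune-out",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # effective batch size: 4 * 4 = 16
    warmup_ratio=0.03,
)

# Wire up with: model = get_peft_model(base_model, lora_config),
# then pass training_args to a transformers Trainer (or trl's SFTTrainer).
```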
Evaluation Strategy
Don’t just vibe-check your model. Build a proper evaluation suite:
- Automated metrics — exact-match or schema checks for format compliance; BLEU/ROUGE for surface overlap with references
- LLM-as-judge — Use a stronger model to evaluate output quality on a rubric
- Human evaluation — For the final call, nothing beats domain experts reviewing outputs
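For the automated tier, a format-compliance check is often more informative than overlap metrics. A minimal sketch, assuming for illustration that the model is supposed to emit JSON with a hypothetical `answer`/`confidence` schema:

```python
import json

def format_compliance(outputs):
    """Fraction of outputs that parse as JSON with the required keys."""
    required = {"answer", "confidence"}  # hypothetical schema
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and required <= obj.keys():
            ok += 1
    return ok / len(outputs) if outputs else 0.0

outputs = [
    '{"answer": "42", "confidence": 0.9}',
    '{"answer": "missing confidence"}',   # valid JSON, wrong schema
    'not json at all',                    # parse failure
    '{"answer": "yes", "confidence": 0.7}',
]
print(format_compliance(outputs))  # 0.5
```

Run this over a fixed held-out prompt set after every training run, so regressions in formatting show up as a single number moving.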
Cost Reality Check
Fine-tuning a 7B model on 1,000 examples with LoRA takes about 30 minutes on a single A100. That's roughly $1.50 in cloud compute (half an hour at about $3/hour on-demand). The expensive part is preparing the data — expect to spend 20-40 hours on data curation for a production model.
Deployment
Serve with vLLM for best throughput. Merge LoRA weights into the base model before serving to avoid adapter overhead at inference time. Monitor for drift — models degrade as the world changes around them.
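"Monitor for drift" can be as simple as tracking an eval metric over time and alerting when its rolling mean falls meaningfully below the launch baseline. A minimal sketch — the window size and tolerance here are illustrative assumptions:

```python
from collections import deque

class DriftMonitor:
    """Flag drift when the rolling mean of an eval score drops below baseline."""

    def __init__(self, baseline, window=50, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score):
        self.scores.append(score)

    def drifted(self):
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.90, window=5)
for s in [0.88, 0.86, 0.84, 0.82, 0.80]:  # steadily declining eval scores
    monitor.record(s)
print(monitor.drifted())  # True: rolling mean 0.84 < 0.90 - 0.05
```

In production the scores would come from the automated evaluation suite above, run periodically against live traffic samples.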