When to Fine-Tune (and When Not To)
Fine-tuning is not the first tool you should reach for. Before you fine-tune:
- Try better prompting (few-shot, chain-of-thought)
- Try RAG (retrieval-augmented generation)
- Only then consider fine-tuning
Fine-tuning shines when you need consistent formatting, domain-specific terminology, or behavior that’s hard to describe in a prompt.
Data Preparation
This is the most important phase. Garbage in, garbage out applies 10x to fine-tuning.
```python
# Quality filter: remove short, repetitive, or low-signal examples
def quality_filter(example):
    # Drop outputs too short to carry signal
    if len(example["output"]) < 50:
        return False
    # Drop examples that merely echo the input
    if example["output"] == example["input"]:
        return False
    return True

dataset = dataset.filter(quality_filter)
```
Rule of thumb: 500-1,000 high-quality examples beat 10,000 noisy ones every time.
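Beyond per-example filtering, exact and near-duplicate outputs are usually worth removing too. A minimal hash-based dedup sketch (the normalization scheme here is an illustrative assumption, not a prescription):

```python
import hashlib

def dedup(examples):
    """Keep only the first occurrence of each normalized output."""
    seen, kept = set(), []
    for ex in examples:
        # Normalize lightly so trivial variants hash identically
        key = hashlib.sha256(ex["output"].strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

data = [
    {"input": "a", "output": "The model responded well."},
    {"input": "b", "output": "the model responded well. "},  # near-duplicate after normalization
    {"input": "c", "output": "A different answer."},
]
print(len(dedup(data)))  # 2
```

For larger corpora you would swap the exact-hash key for a fuzzier one (e.g. MinHash), but the keep-first-occurrence structure stays the same.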
Training Configuration
For a 7B parameter model, these settings have worked consistently:
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Epochs | 3 |
| Batch size | 4 (with gradient accumulation 4) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Warmup ratio | 0.03 |
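The table above translates directly into a `peft` + `transformers` configuration. This is a sketch, not a drop-in script: `target_modules` and `lora_dropout` are assumptions (they are model-specific and not part of the table), and the output directory name is arbitrary.

```python
# LoRA fine-tuning config mirroring the table above.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,                    # assumed; tune per model
    target_modules=["q_proj", "v_proj"],  # assumed; depends on architecture
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="finetune-out",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # effective batch size: 4 * 4 = 16
    warmup_ratio=0.03,
)

# Wire up with: model = get_peft_model(base_model, lora_config),
# then pass training_args to a transformers Trainer (or trl's SFTTrainer).
```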
Evaluation Strategy
Don’t just vibe-check your model. Build a proper evaluation suite:
- Automated metrics — exact-match or schema checks for format compliance; BLEU/ROUGE for surface overlap with references
- LLM-as-judge — Use a stronger model to evaluate output quality on a rubric
- Human evaluation — For the final call, nothing beats domain experts reviewing outputs
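For the automated tier, a format-compliance check is often more informative than overlap metrics. A minimal sketch, assuming for illustration that the model is supposed to emit JSON with a hypothetical `answer`/`confidence` schema:

```python
import json

def format_compliance(outputs):
    """Fraction of outputs that parse as JSON with the required keys."""
    required = {"answer", "confidence"}  # hypothetical schema
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and required <= obj.keys():
            ok += 1
    return ok / len(outputs) if outputs else 0.0

outputs = [
    '{"answer": "42", "confidence": 0.9}',
    '{"answer": "missing confidence"}',   # valid JSON, wrong schema
    'not json at all',                    # parse failure
    '{"answer": "yes", "confidence": 0.7}',
]
print(format_compliance(outputs))  # 0.5
```

Run this over a fixed held-out prompt set after every training run, so regressions in formatting show up as a single number moving.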
Cost Reality Check
Fine-tuning a 7B model on 1,000 examples with LoRA takes about 30 minutes on a single A100. That's roughly $1.50 in cloud compute (half an hour at about $3/hour on-demand). The expensive part is preparing the data — expect to spend 20-40 hours on data curation for a production model.
Deployment
Serve with vLLM for best throughput. Merge LoRA weights into the base model before serving to avoid adapter overhead at inference time. Monitor for drift — models degrade as the world changes around them.
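"Monitor for drift" can be as simple as tracking an eval metric over time and alerting when its rolling mean falls meaningfully below the launch baseline. A minimal sketch — the window size and tolerance here are illustrative assumptions:

```python
from collections import deque

class DriftMonitor:
    """Flag drift when the rolling mean of an eval score drops below baseline."""

    def __init__(self, baseline, window=50, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score):
        self.scores.append(score)

    def drifted(self):
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.90, window=5)
for s in [0.88, 0.86, 0.84, 0.82, 0.80]:  # steadily declining eval scores
    monitor.record(s)
print(monitor.drifted())  # True: rolling mean 0.84 < 0.90 - 0.05
```

In production the scores would come from the automated evaluation suite above, run periodically against live traffic samples.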