LLM fine tuning: when to do it, and when not to
A decision-led guide to LLM fine tuning: when NOT to do it, fine tuning vs RAG vs prompt engineering, the real cost, and a runnable QLoRA recipe with eval.
Most of the LLM fine tuning projects we get asked to run shouldn't happen. The team has a model that's getting something wrong, someone read that fine-tuning fixes it, and a GPU budget appears. Three weeks later they've spent real money to land roughly where a better prompt and a retrieval layer would have put them in an afternoon. Fine-tuning is a powerful tool with a narrow job, and the gap between what it's sold as and what it actually does is where most of the wasted money lives.
So this guide is deliberately backwards. We've shipped model-selection and customization work across 200+ engagements, and the single highest-leverage thing we do is talk teams out of fine-tuning when it's the wrong tool. We'll start with when NOT to fine-tune, then put fine tuning vs RAG vs prompt engineering side by side so you can see what each one actually changes, then count the real cost. Only after that do we get into the how (LoRA, QLoRA, PEFT, data prep, and a runnable recipe), because the how is the easy part once you're sure you should be doing it at all.
LLM fine tuning in one paragraph, and the question to ask before you start
LLM fine tuning is the practice of taking a pre-trained model and continuing its training on a smaller, task-specific dataset so that its weights shift toward the behavior you want. That's the whole idea, and the operative word is behavior. Fine-tuning is very good at changing how a model responds: its format, its tone, the shape of its outputs, the way it handles one narrow task you do over and over. It's bad, expensive, and unreliable at changing what a model knows. If your complaint is "the model gives the wrong format" or "it won't stick to our house style," fine-tuning is plausibly the right tool. If your complaint is "the model doesn't know our 4,000 internal documents," fine-tuning is the wrong tool wearing the right tool's clothes, and the right tool is retrieval. The one question to ask before you start is therefore blunt: am I trying to change how the model behaves, or what it knows? Get that wrong and everything downstream is wasted effort, because you'll spend a training budget teaching a model to memorize facts it will still hallucinate, when you could have handed it those facts at inference time and been done. Almost every regret we see traces back to skipping this one question.
When NOT to fine-tune an LLM: the decision rule most teams skip
Here's the rule we run before any training job gets approved, and it's the same rule we'd give anyone deciding when to fine-tune LLM systems at all. Classify the problem first. If it's a knowledge problem (the model needs facts it doesn't have), the answer is retrieval, not fine-tuning. If it's a capability problem (the model genuinely can't reason through the task at all), the answer is a stronger model, not fine-tuning a weaker one. Fine-tuning only earns its place when it's a behavior problem: the model can do the task and has the knowledge, but won't do it in the shape, tone, or reliability you need. That's a narrow slice, and most requests don't land in it.
There are four situations where we actively tell teams to stop. First, when the goal is to inject knowledge: a model doesn't reliably memorize facts from a few thousand examples, and even when it appears to, it'll confabulate around the edges. Second, when the requirement is changing fast: a fine-tune freezes a behavior, so if your spec shifts weekly you'll be re-training weekly. Third, when you haven't exhausted prompting: a strong system prompt with a few well-chosen examples closes most behavior gaps for free, and you should prove a prompt-only baseline can't do the job before you train. Fourth, when nobody has budgeted for maintenance, which we'll cost out below. If you're tempted to skip the prompt-first step, our AI readiness assessment framework treats "have you exhausted the cheap options" as a gate before any model-training line item gets approved.
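The gate described above can be written down as a small function, which makes it harder to skip a step. This is a minimal sketch, not a formal framework: the `Request` fields and the returned strings are hypothetical names invented for illustration, and the order of the checks simply follows the order of the text.

```python
from dataclasses import dataclass

@dataclass
class Request:
    """Hypothetical intake record for a proposed fine-tune (field names are illustrative)."""
    problem: str                  # "knowledge", "capability", or "behavior"
    prompt_baseline_failed: bool  # was a strong prompt-only baseline tried and found insufficient?
    spec_changes_weekly: bool     # is the requirement still moving fast?
    maintenance_budgeted: bool    # is re-tuning on future base models funded?

def recommend(req: Request) -> str:
    """Apply the classify-first rule, then the four stop conditions."""
    if req.problem == "knowledge":
        return "use retrieval (RAG)"
    if req.problem == "capability":
        return "use a stronger base model"
    if req.problem != "behavior":
        raise ValueError(f"unknown problem type: {req.problem!r}")
    if not req.prompt_baseline_failed:
        return "build a prompt-only baseline first"
    if req.spec_changes_weekly:
        return "keep iterating on the prompt until the spec settles"
    if not req.maintenance_budgeted:
        return "budget for re-tuning before training"
    return "fine-tuning is a reasonable candidate"

print(recommend(Request("behavior", True, False, True)))
```

Only a request that survives every check reaches the last line, which matches the claim that most requests never land in the fine-tuning slice.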
Fine tuning vs RAG vs prompt engineering: what each one actually changes
The fine tuning vs RAG debate gets framed as either/or, and that framing is the mistake. They change different things and the strongest systems use them together. Prompt engineering changes the instructions the model sees at inference; it's the cheapest lever and the first one to pull. RAG changes the context the model sees at inference by retrieving relevant facts and pasting them into the prompt; it's how you give a model knowledge it didn't train on, kept fresh without retraining. Fine-tuning changes the weights themselves; it's how you bake in a behavior so durably that you don't have to spell it out every call. The honest 2026 answer to fine tuning vs prompt engineering is: try the prompt first, almost always, because it costs nothing to iterate and you can change it in production without a training run.
Fine-tuning bakes a behavior into the model permanently: format, tone, a narrow task. Best when the behavior is stable and you call it at high volume, since a tuned small model can be cheaper per call than a frontier model. But it freezes you on one base model, can't absorb new facts, and re-training is a real project. Wrong tool for anything that changes weekly or anything knowledge-shaped.
RAG injects fresh, specific knowledge by retrieving documents and placing them in the prompt at inference time, using pgvector, Pinecone, or Qdrant. New facts appear the moment you index them, no training run, and you ride every base-model upgrade for free. The right answer whenever the complaint is "the model doesn't know our data." Often combined with a light fine-tune that fixes the output format around the retrieved context.
Put all three on one map and the division of labor is clear. The three approaches aren't rivals; they're a ladder you climb only as far as you need to. The way model behavior shifts under each lever is its own topic, in the same family as the design tradeoffs we cover in our piece on diffusion versus flow-based generative models, where the choice of objective changes everything downstream.
| Approach | What it changes | Best for | Where it misleads |
|---|---|---|---|
| Prompt engineering | Instructions at inference | Behavior tweaks, fast iteration | People stop too early and assume it can't scale |
| RAG | Context (retrieved facts) at inference | Knowledge the model lacks, freshness | Treated as an either/or vs fine-tuning |
| LoRA / QLoRA fine-tune | A small set of adapter weights | Stable behavior, high-volume narrow task | Sold as a knowledge fix; it isn't one |
| Full fine-tune | All the weights | Research budgets, deep domain shift | Overkill and a maintenance trap for most |
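To make the division of labor concrete, here is a toy sketch of a single request showing where each lever acts: the system prompt is the prompt-engineering lever, the retrieved context is the RAG lever, and the model name is where a fine-tune would be swapped in. Everything in it is an assumption for illustration. The word-overlap retriever stands in for a real vector store, and `build_request`, the sample documents, and the model name `base-8b` are hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set used by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]

def build_request(question, documents, system_prompt, model_name):
    """Assemble one chat request; each argument corresponds to one lever."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return {
        "model": model_name,  # fine-tuning lever: which weights answer
        "messages": [
            {"role": "system", "content": system_prompt},  # prompt lever
            {"role": "user",                                # RAG lever
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }

docs = [
    "Refunds are issued within 14 days of a return.",
    "Support hours are 9am to 5pm on weekdays.",
    "Shipping is free on orders over 50 dollars.",
]
req = build_request(
    "How many days do refunds take?",
    docs,
    system_prompt="Answer in one sentence, using only the context.",
    model_name="base-8b",
)
print(req["messages"][1]["content"])
```

Changing the system prompt or the document list takes effect on the next call. Changing the weights behind `model_name` is the only lever that needs a training run.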
The real cost of LLM fine tuning: data, GPU hours, and the maintenance tax
Vendor guides love to say PEFT "saves VRAM" and leave cost there, because the VRAM line is the only one that got cheap. The expensive lines didn't. The GPU hours for a QLoRA run on an 8B-class model are genuinely small in 2026, often a few dollars of rented compute, and that's the number people anchor on. The data is where the money actually goes: assembling a few hundred to a few thousand high-quality, strictly-formatted examples, deduplicating them, and holding out a clean eval split is human-labeling work that dwarfs the compute. Then there's the line nobody puts in the proposal, the maintenance tax: the base model you tuned gets superseded roughly every quarter, and your fine-tune is now frozen on an older model while the off-the-shelf frontier walks past it. A prompt-plus-RAG system rides that upgrade for free; a fine-tune has to be re-run on the new base, re-evaluated, and re-deployed. Budget for the second and third re-tune, or the first one was a sunk cost with a short shelf life.
When fine-tuning is the right call: behavior, format, and narrow tasks
All of that honesty isn't an argument against fine-tuning; it's an argument for using it where it genuinely wins. There are three jobs fine-tuning does better than anything else. The first is rigid output structure: when you need a model to emit a specific JSON shape or query dialect every single time, a fine-tune internalizes that format more reliably than a prompt that the model occasionally ignores under pressure. The second is voice and style at scale: a consistent brand persona or a domain register that would otherwise cost you a long, expensive system prompt on every call. The third is a narrow, high-volume task where a tuned small open-weight model matches a frontier model's quality at a fraction of the per-call cost: classification, extraction, routing, the unglamorous workhorses. That last one is where the economics flip in fine-tuning's favor: when you're calling the model millions of times, a cheaper tuned 8B model that's good enough beats an expensive frontier model that's overqualified. The typical shape on a well-scoped classification task in 2026: a QLoRA-tuned 8B model can land within a point or two of a much larger frontier model's accuracy on a held-out set, at a small fraction of the per-call cost, but only after a prompt-only baseline was tried first and proven insufficient. Measure that gap on your own task before you bank on it.
| Your workload | Start here | Add if needed | Fine-tune only when |
|---|---|---|---|
| Answer questions over internal docs | RAG + prompt | Better retrieval / reranking | Format won't hold (rare) |
| Emit strict JSON / a query dialect | Prompt + schema-constrained decoding | Few-shot examples | Drift persists at volume |
| High-volume classification / extraction | Prompt on a frontier model | RAG for edge cases | Per-call cost dominates the bill |
| Consistent brand voice / persona | System prompt | Style examples in context | Prompt is too long / leaks |
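The "per-call cost dominates the bill" condition can be checked with a break-even calculation before any training starts. The sketch below is illustrative only: every dollar figure is a hypothetical placeholder, not a measurement or a quoted price, and the function name is invented here. Costs are expressed per million calls to keep the arithmetic exact.

```python
def breakeven_million_calls(fixed_cost, frontier_per_million, tuned_per_million):
    """Millions of calls needed before the tuned model's savings repay its fixed cost.

    Returns None when the tuned model is not cheaper per call, in which
    case the fixed cost is never recovered.
    """
    saving = frontier_per_million - tuned_per_million
    if saving <= 0:
        return None
    return fixed_cost / saving

# All figures below are hypothetical placeholders; substitute your own.
data_labeling = 4500   # assembling and reviewing the dataset
cost_per_tune = 500    # compute plus re-evaluation and redeployment, per run
planned_tunes = 3      # the first tune plus two re-tunes on newer base models

fixed = data_labeling + cost_per_tune * planned_tunes
result = breakeven_million_calls(fixed, frontier_per_million=4000, tuned_per_million=1000)
print(f"fixed cost: ${fixed}, break-even at {result} million calls")
```

Counting the planned re-tunes in the fixed cost is the design choice that matters: leaving them out makes the break-even point look closer than it is.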
The LLM fine tuning method map: full, LoRA, QLoRA, and PEFT
Once you've decided to fine-tune, the next choice is how, and the methods differ mostly in how many weights they touch. Full fine-tuning updates every parameter in the model. It's the most powerful and the most expensive: you need enough VRAM to hold the whole model plus its optimizer states, which for anything past 8B means multiple high-end GPUs, and you get one giant checkpoint per task. PEFT, parameter-efficient fine-tuning, is the family that fixed this by freezing the base model and training only a tiny set of new weights. LoRA, the dominant PEFT method, injects small low-rank adapter matrices alongside the frozen weights and trains only those, often a tiny fraction of the full parameter count. QLoRA goes further by quantizing the frozen base to 4-bit so it fits in a fraction of the memory, then training LoRA adapters on top. The Hugging Face PEFT library, with bitsandbytes for the quantization, is the standard implementation, and wrappers like Unsloth and Axolotl make it a config file rather than a research project.
| Method | Trainable params | VRAM (8B base) | Use when | Watch out |
|---|---|---|---|---|
| Full fine-tune | All weights | Multi-GPU (80GB+) | Deep domain shift, research budget | Heaviest maintenance tax; rarely needed |
| LoRA | Adapters only | ~24GB (fp16 base) | Standard PEFT, ample GPU | Adapter rank too low underfits |
| QLoRA | Adapters only | ~12-16GB (4-bit base) | The 2026 production default | 4-bit can lose a little accuracy |
| Prompt / prefix tuning | Soft prompts | Minimal | Light steering, many tasks one base | Weaker than LoRA for real behavior change |
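The memory gap between a 16-bit and a 4-bit base comes down to simple arithmetic on the stored weights. The sketch below covers the weights only, and the helper name is an assumption made for this example. Activations, adapter gradients, optimizer state, and framework overhead come on top, which is presumably why the table's figures sit above these raw numbers.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the stored weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # an 8B-class model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # LoRA over a 16-bit base
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # QLoRA over a 4-bit base

print(f"16-bit base weights: {fp16_gb:.0f} GB")
print(f"4-bit base weights:  {nf4_gb:.0f} GB")
```

The remaining headroom depends heavily on sequence length and batch size, so treat any VRAM figure as a starting estimate to confirm on your own run.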
How QLoRA actually works: low-rank adapters and 4-bit base weights
QLoRA is two ideas stacked. The first is the LoRA insight: a weight update during fine-tuning has low intrinsic rank, so instead of learning a full dense update matrix you learn two small matrices, A and B, whose product approximates it. You freeze the original weights entirely and train only A and B, which is why the trainable parameter count collapses to a sliver. The second idea is quantization: you load the frozen base model in 4-bit precision so it occupies roughly a quarter of the memory, while keeping the small adapter weights in higher precision where the gradients need it. The frozen 4-bit base does the heavy lifting of language understanding; the tiny trainable adapters learn your specific behavior on top. This is not a free lunch on paper but it is close to one in practice: the 2023 QLoRA paper from Dettmers and colleagues reported its Guanaco model reaching 99.3% of ChatGPT's score on the Vicuna benchmark after 24 hours on a single GPU, with 4-bit adapter fine-tuning recovering essentially full 16-bit fine-tuning quality. In 2026 this is what makes the math work on cheap hardware: a QLoRA run on an 8B-class open-weight model fits in roughly 12-16GB of VRAM, a single rented L4 or a free Colab T4 with Unsloth, not a multi-GPU cluster. At inference you either keep the adapter separate and load it over the base, or merge it back into the weights for a single deployable model. The diagram below is the whole mechanism.
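The two-matrix update described above can also be sketched numerically. This is a dependency-free toy, not how PEFT implements it: the matrix sizes are tiny and arbitrary, the helper names are invented here, and the zero initialization of B with an `alpha / r` scale follows the convention of the original LoRA formulation. The 4096 x 4096 layer in the parameter count is a hypothetical size.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_counts(d_out, d_in, r):
    """Parameters in a dense update versus in the two LoRA factors."""
    return d_out * d_in, r * (d_in + d_out)

def merged_weight(W, A, B, alpha, r):
    """Fold the adapter into the base: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero at start

# With B at zero the adapter contributes nothing, so training starts from the base model.
assert merged_weight(W, A, B, alpha, r) == W

# Hypothetical 4096 x 4096 projection with rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(f"dense update: {full:,} params, LoRA factors: {lora:,} params ({lora / full:.2%})")
```

`merged_weight` is also the merge step mentioned for deployment: once training is done, the product of B and A is added into the base weight and the adapter no longer needs to be loaded separately.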
Data preparation: the part that decides whether your fine-tune works
If a fine-tune fails, the data is the cause far more often than the hyperparameters. The practitioner literature has said the same thing since the 2023 LIMA result and it still holds in 2026: a few hundred strictly-formatted, high-quality examples beat thousands of noisy ones. For supervised fine-tuning, your dataset is a list of prompt-and-response pairs in JSON Lines format, one example per line, with the exact structure you want the model to learn at inference encoded into every record. Three rules carry most of the weight. Format every example identically, including the chat template and any system message, because the model learns the shape as much as the content. Deduplicate hard, since near-duplicate examples quietly bias the model and inflate your eval. And split a clean held-out set before you train, never after, so your evaluation isn't measuring memorization. The snippet below is the actual record shape for an instruction-style SFT dataset.
# One SFT example per line. Keep the chat template identical to inference.
# A few hundred clean records beat thousands of noisy ones (LIMA, 2023 -> still true 2026).
import json, hashlib, random

def to_record(system, user, assistant):
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},  # the target behavior
        ]
    }

# Placeholder: supply your own loader that returns dicts with
# "system", "user", and "assistant" keys.
raw = load_your_labeled_pairs()  # human-reviewed, in-format examples

# Dedup on a hash of the normalized user turn. This catches repeats that differ
# only in case or surrounding whitespace; looser near-duplicates need fuzzy matching.
seen, clean = set(), []
for r in raw:
    key = hashlib.sha256(r["user"].strip().lower().encode()).hexdigest()
    if key in seen:
        continue
    seen.add(key)
    clean.append(to_record(r["system"], r["user"], r["assistant"]))

# Hold out the eval split BEFORE training, never after.
random.seed(42)
random.shuffle(clean)
split = int(len(clean) * 0.9)
train, eval_set = clean[:split], clean[split:]

for name, rows in (("train", train), ("eval", eval_set)):
    with open(f"{name}.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

print(f"train={len(train)} eval={len(eval_set)}")

A runnable QLoRA fine-tuning recipe you can adapt this week
Here's the part the vendor explainers leave out: actual code you can run. The three variants below do the same QLoRA job three ways. Unsloth is the fastest path and the lowest memory, which is why it's the default for a single rented GPU or a free Colab. The Transformers plus TRL plus PEFT stack is the canonical Hugging Face approach when you want full control. Axolotl turns the whole thing into a YAML config, which is the cleanest option once you're running many jobs and want them reproducible. Pick one, point it at the JSONL you built above, and you have a tuned 8B model in an hour or two on hardware that costs a few dollars.
# Fastest, lowest-VRAM path. Fits an 8B QLoRA in ~12-16GB (a free Colab T4 works).
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-8B-bnb-4bit",  # 4-bit base, frozen
    max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32, lora_dropout=0.0,  # the LoRA adapters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
ds = load_dataset("json", data_files="train.jsonl", split="train")

SFTTrainer(
    model=model, tokenizer=tok, train_dataset=ds,
    args=SFTConfig(per_device_train_batch_size=2,
                   gradient_accumulation_steps=4,
                   num_train_epochs=2, learning_rate=2e-4,
                   output_dir="out"),
).train()
model.save_pretrained("adapter")  # tiny: just the LoRA weights

# Canonical Hugging Face stack. More control, slightly more memory.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# bfloat16 compute needs an Ampere-or-newer GPU; on older cards such as a T4,
# torch.float16 is the usual substitute.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
ds = load_dataset("json", data_files="train.jsonl", split="train")

SFTTrainer(model=base, peft_config=lora, train_dataset=ds,
           processing_class=tok,
           args=SFTConfig(num_train_epochs=2, learning_rate=2e-4,
                          per_device_train_batch_size=2, output_dir="out")
).train()

# Config-driven. Reproducible across many jobs: `axolotl train qlora.yaml`
base_model: Qwen/Qwen3-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
  - path: train.jsonl
    type: chat_template
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 2
learning_rate: 0.0002
val_set_size: 0.1  # hold-out eval split
output_dir: ./out

Evaluating a fine-tuned model: proving it beat the prompt-only baseline
A fine-tune isn't finished when the loss curve flattens; it's finished when you've proven it beats the cheaper thing you'd otherwise ship. That means one comparison above all others: your tuned model versus the same base model with a strong prompt, scored on a held-out golden set you built from real examples, not the training data. Run both through the same eval harness (Ragas, promptfoo, Inspect from the UK AISI, or LangSmith) and look at win-rate on your actual task, not generic benchmark scores. If the tuned model doesn't clearly beat the prompt-only baseline on your golden set, you've just spent a budget to ship the prompt. The discipline here is the same one we apply when scoring any model decision, the kind of evaluation rigor that underpins our work on agent systems in the multi-agent orchestration patterns guide, where a tuned small model often serves as one cheap, reliable tool inside a larger system.
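One way to make "clearly beats" testable is a paired win-rate with a bootstrap interval over the golden set. The sketch below is an assumption-laden illustration rather than the output of any of the harnesses named above. The per-example scores are made-up 0/1 grades, the function names are hypothetical, and the 95% interval with ties counted as half a win is one reasonable convention among several.

```python
import random

def win_rate(tuned_scores, baseline_scores):
    """Share of examples where the tuned model scores higher; ties count as half."""
    if len(tuned_scores) != len(baseline_scores):
        raise ValueError("both models must be scored on the same examples")
    wins = sum(1.0 if t > b else 0.5 if t == b else 0.0
               for t, b in zip(tuned_scores, baseline_scores))
    return wins / len(tuned_scores)

def bootstrap_interval(tuned_scores, baseline_scores, n_resamples=2000, seed=0):
    """95% percentile interval for the win-rate, resampling examples in pairs."""
    rng = random.Random(seed)
    pairs = list(zip(tuned_scores, baseline_scores))
    rates = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]
        t, b = zip(*sample)
        rates.append(win_rate(t, b))
    rates.sort()
    return rates[int(0.025 * n_resamples)], rates[int(0.975 * n_resamples) - 1]

# Made-up per-example grades on the same golden set (1 = correct, 0 = wrong).
tuned = [1, 1, 0, 1, 1, 1, 0, 1]
baseline = [1, 0, 0, 1, 0, 1, 1, 0]

rate = win_rate(tuned, baseline)
low, high = bootstrap_interval(tuned, baseline)
print(f"win-rate={rate:.3f}, 95% interval=({low:.3f}, {high:.3f})")
print("clear win" if low > 0.5 else "not clearly better than the prompt-only baseline")
```

With a golden set this small the interval will be wide, which is the point: a win-rate above one half on a handful of examples is not yet evidence that the fine-tune earned its cost.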
Catastrophic forgetting and the production gotchas nobody warns you about
The failure mode that surprises teams most is catastrophic forgetting: you fine-tune a model to be excellent at your narrow task and it quietly gets worse at everything else, including general instruction-following it used to do for free. PEFT methods like LoRA reduce this because the base weights stay frozen, but they don't eliminate it, and an over-trained adapter can still drag the model away from its general competence. The other gotchas cluster around deployment and drift, and they're the ones that turn a working fine-tune into a maintenance headache months later.
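A simple guard against forgetting is to score the base model and the tuned model on a few general-purpose suites as well as the target task, and flag any suite that drops by more than a tolerance. The sketch below assumes you already have such scores. The suite names, the scores, and the 0.02 tolerance are hypothetical values chosen for illustration.

```python
def forgetting_report(before, after, tolerance=0.02):
    """Return the suites whose score dropped by more than `tolerance` after tuning."""
    regressions = {}
    for suite, base_score in before.items():
        if suite not in after:
            raise KeyError(f"tuned model was not scored on {suite!r}")
        drop = base_score - after[suite]
        if drop > tolerance:
            regressions[suite] = round(drop, 4)
    return regressions

# Hypothetical accuracy-style scores in [0, 1] for the same suites, before and after tuning.
base_scores = {"instruction_following": 0.90, "summarization": 0.80, "target_task": 0.70}
tuned_scores = {"instruction_following": 0.84, "summarization": 0.79, "target_task": 0.93}

report = forgetting_report(base_scores, tuned_scores)
print(report if report else "no regressions beyond tolerance")
```

Running the same check after every re-tune gives an early signal when an adapter has been trained long enough to start eroding general competence.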
FAQ — LLM fine tuning, in the practitioner's vocabulary
What is LLM fine tuning?
LLM fine tuning is continuing the training of a pre-trained model on a smaller, task-specific dataset so its weights shift toward a behavior you want: a format, a tone, or a narrow task. It changes how a model behaves, not what it knows. For new knowledge, retrieval (RAG) is the better and cheaper tool.
Fine tuning vs RAG: which should I use?
They solve different problems and often work together. Use RAG when the model lacks facts: it retrieves and injects them at inference with no training run, and it stays fresh. Use fine-tuning when the model has the knowledge but won't behave in the shape you need. If your complaint is "it doesn't know our data," that's RAG. If it's "it won't stick to our format," that may be a fine-tune.
Fine tuning vs prompt engineering: is fine-tuning worth it?
Try the prompt first, almost always. A strong system prompt with a few examples closes most behavior gaps for free, and you can change it in production without retraining. Fine-tuning earns its place only when a prompt-only baseline provably can't do the job, or when you call the model at such high volume that a cheaper tuned small model beats an expensive frontier model on cost.
When should you NOT fine-tune an LLM?
Don't fine-tune to inject knowledge (use RAG), don't fine-tune when the requirement changes weekly (you'll re-train weekly), don't fine-tune before exhausting prompt engineering, and don't fine-tune if nobody has budgeted for the maintenance tax of re-tuning on every base-model upgrade, which in 2026 is roughly quarterly.
What is the difference between LoRA, QLoRA, and PEFT?
PEFT (parameter-efficient fine-tuning) is the umbrella family that freezes the base model and trains only a few new weights. LoRA is the dominant PEFT method: it trains small low-rank adapter matrices, often a tiny fraction of the parameters. QLoRA is LoRA with the frozen base quantized to 4-bit, which is what lets an 8B model fine-tune in roughly 12-16GB of VRAM on a single GPU. QLoRA is the 2026 production default.
How much data do I need to fine-tune an LLM?
Less than you think, but cleaner. A few hundred strictly-formatted, high-quality, deduplicated examples beat thousands of noisy ones, a finding the literature has repeated since the 2023 LIMA result. Format every example exactly as it appears at inference, deduplicate hard, and hold out a clean eval split before training so you measure learning, not memorization.
Fine-tune on evidence, not on hype.
We start every model-customization engagement by trying to talk you out of training, and only fine-tune when prompt and RAG provably can't do the job. If you're weighing fine-tuning, talk to the team that has run these decisions across 200+ engagements.