LLM fine tuning: when to do it, and when not to

Most of the LLM fine tuning projects we get asked to run shouldn't happen. The team has a model that's getting something wrong, someone read that fine-tuning fixes it, and a GPU budget appears. Three weeks later they've spent real money to land roughly where a better prompt and a retrieval layer would have put them in an afternoon. Fine-tuning is a powerful tool with a narrow job, and the gap between what it's sold as and what it actually does is where most of the wasted money lives.

So this guide is deliberately backwards. We've shipped model-selection and customization work across 200+ engagements, and the single highest-leverage thing we do is talk teams out of fine-tuning when it's the wrong tool. We'll start with when NOT to fine-tune, then put fine-tuning, RAG, and prompt engineering side by side so you can see what each one actually changes, then count the real cost. Only after that do we get into the how (LoRA, QLoRA, PEFT, data prep, and a runnable recipe), because the how is the easy part once you're sure you should be doing it at all.

LLM fine tuning in one paragraph, and the question to ask before you start

LLM fine tuning is the practice of taking a pre-trained model and continuing its training on a smaller, task-specific dataset so that its weights shift toward the behavior you want. That's the whole idea, and the operative word is behavior. Fine-tuning is very good at changing how a model responds: its format, its tone, the shape of its outputs, the way it handles one narrow task you do over and over. It's bad, expensive, and unreliable at changing what a model knows. If your complaint is "the model gives the wrong format" or "it won't stick to our house style," fine-tuning is plausibly the right tool. If your complaint is "the model doesn't know our 4,000 internal documents," fine-tuning is the wrong tool wearing the right tool's clothes, and the right tool is retrieval. The one question to ask before you start is therefore blunt: am I trying to change how the model behaves, or what it knows? Get that wrong and everything downstream is wasted effort, because you'll spend a training budget teaching a model to memorize facts it will still hallucinate, when you could have handed it those facts at inference time and been done. Almost every regret we see traces back to skipping this one question.

When NOT to fine-tune an LLM: the decision rule most teams skip

Here's the rule we run before any training job gets approved, and it's the same rule we'd give anyone deciding when to fine-tune an LLM at all. Classify the problem first. If it's a knowledge problem (the model needs facts it doesn't have), the answer is retrieval, not fine-tuning. If it's a capability problem (the model genuinely can't reason through the task at all), the answer is a stronger model, not fine-tuning a weaker one. Fine-tuning only earns its place when it's a behavior problem: the model can do the task and has the knowledge, but won't do it in the shape, tone, or reliability you need. That's a narrow slice, and most requests don't land in it.

The fine-tune-or-not decision fork

Classify the problem before you classify the solution. Knowledge problems go to retrieval; capability gaps go to a bigger model; only behavior problems justify a fine-tune. Most requests never reach the right-hand branch.
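The fork can be written down as a small lookup. This is a minimal sketch rather than a formal procedure; the `Problem` enum and the `recommend` function are names invented here for illustration.

```python
from enum import Enum

class Problem(Enum):
    KNOWLEDGE = "knowledge"    # the model lacks facts
    CAPABILITY = "capability"  # the model cannot reason through the task
    BEHAVIOR = "behavior"      # right knowledge, wrong shape, tone, or reliability

def recommend(problem: Problem, prompt_baseline_failed: bool = False) -> str:
    """Map a classified problem to the cheapest lever that addresses it."""
    if problem is Problem.KNOWLEDGE:
        return "retrieval (RAG)"
    if problem is Problem.CAPABILITY:
        return "stronger base model"
    # Behavior problems justify a fine-tune only after prompting has been tried.
    return "fine-tune" if prompt_baseline_failed else "prompt engineering first"

for p in Problem:
    print(p.value, "->", recommend(p))
```

Only the last branch can ever return a fine-tune, and only once a prompt-only baseline has been shown to fall short.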

There are four situations where we actively tell teams to stop. First, when the goal is to inject knowledge: a model doesn't reliably memorize facts from a few thousand examples, and even when it appears to, it'll confabulate around the edges. Second, when the requirement is changing fast: a fine-tune freezes a behavior, so if your spec shifts weekly you'll be re-training weekly. Third, when you haven't exhausted prompting: a strong system prompt with a few well-chosen examples closes most behavior gaps for free, and you should prove a prompt-only baseline can't do the job before you train. Fourth, when nobody has budgeted for maintenance, which we'll cost out below. If you're tempted to skip the prompt-first step, our AI readiness assessment framework treats "have you exhausted the cheap options" as a gate before any model-training line item gets approved.
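A prompt-only baseline is cheap to assemble. The sketch below builds a chat-style message list from a system prompt and a few worked examples. The `build_baseline_prompt` helper and the ticket-routing content are hypothetical, and the role/content layout is the common chat message format rather than any one vendor's API.

```python
def build_baseline_prompt(system, examples, user_input):
    """Return chat messages: system prompt, few-shot pairs, then the real input."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages

# Hypothetical task: route support tickets to a queue, answering in strict JSON.
messages = build_baseline_prompt(
    system='Classify the ticket. Reply with JSON only: {"queue": "<billing|tech|other>"}',
    examples=[
        ("I was charged twice this month.", '{"queue": "billing"}'),
        ("The app crashes when I log in.", '{"queue": "tech"}'),
    ],
    user_input="How do I update my card details?",
)
print(len(messages))
```

If a baseline like this already holds the format on your held-out examples, there is nothing left for a training run to fix.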

Fine tuning vs RAG vs prompt engineering: what each one actually changes

The fine-tuning vs RAG debate gets framed as either/or, and that framing is the mistake. They change different things, and the strongest systems use them together. Prompt engineering changes the instructions the model sees at inference; it's the cheapest lever and the first one to pull. RAG changes the context the model sees at inference by retrieving relevant facts and pasting them into the prompt; it's how you give a model knowledge it didn't train on, kept fresh without retraining. Fine-tuning changes the weights themselves; it's how you bake in a behavior so durably that you don't have to spell it out every call. The honest 2026 answer to fine-tuning vs prompt engineering is: try the prompt first, almost always, because it costs nothing to iterate and you can change it in production without a training run.

Fine-tuning (changes the weights)

Bakes a behavior into the model permanently: format, tone, a narrow task. Best when the behavior is stable and you call it at high volume, since a tuned small model can be cheaper per call than a frontier model. But it freezes you on one base model, can't absorb new facts, and re-training is a real project. Wrong tool for anything that changes weekly or anything knowledge-shaped.

RAG (changes the context at inference)

Injects fresh, specific knowledge by retrieving documents and placing them in the prompt at inference time, using pgvector, Pinecone, or Qdrant. New facts appear the moment you index them, no training run, and you ride every base-model upgrade for free. The right answer whenever the complaint is "the model doesn't know our data." Often combined with a light fine-tune that fixes the output format around the retrieved context.

Put all three on one map and the division of labor is clear. The three approaches aren't rivals; they're a ladder you climb only as far as you need to. The way model behavior shifts under each lever is its own topic, in the same family as the design tradeoffs we cover in our piece on diffusion versus flow-based generative models, where the choice of objective changes everything downstream.

Approach	What it changes	Best for	Where it misleads
Prompt engineering	Instructions at inference	Behavior tweaks, fast iteration	People stop too early and assume it can't scale
RAG	Context (retrieved facts) at inference	Knowledge the model lacks, freshness	Treated as an either/or vs fine-tuning
LoRA / QLoRA fine-tune	A small set of adapter weights	Stable behavior, high-volume narrow task	Sold as a knowledge fix; it isn't one
Full fine-tune	All the weights	Research budgets, deep domain shift	Overkill and a maintenance trap for most

The three customization levers, what each changes, and the failure mode each one hides.
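To make the RAG row concrete, here is a toy retrieve-then-prompt loop. It is a sketch under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store such as pgvector, and the documents are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(query, documents, k=2):
    """Paste the retrieved documents into the prompt as context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [  # hypothetical internal documents
    "Refunds are processed within 5 business days.",
    "The VPN requires the corporate certificate.",
    "Expense reports are due on the first Monday of each month.",
]
print(build_rag_prompt("When are refunds processed?", docs, k=1))
```

Adding a fact means appending to `docs` and re-indexing; no weights change, which is the property the table credits RAG with.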

The real cost of LLM fine tuning: data, GPU hours, and the maintenance tax

Vendor guides love to say PEFT "saves VRAM" and leave cost there, because the VRAM line is the only one that got cheap. The expensive lines didn't. The GPU hours for a QLoRA run on an 8B-class model are genuinely small in 2026, often a few dollars of rented compute, and that's the number people anchor on. The data is where the money actually goes: assembling a few hundred to a few thousand high-quality, strictly-formatted examples, deduplicating them, and holding out a clean eval split is human-labeling work that dwarfs the compute. Then there's the line nobody puts in the proposal, the maintenance tax: the base model you tuned gets superseded roughly every quarter, and your fine-tune is now frozen on an older model while the off-the-shelf frontier walks past it. A prompt-plus-RAG system rides that upgrade for free; a fine-tune has to be re-run on the new base, re-evaluated, and re-deployed. Budget for the second and third re-tune, or the first one was a sunk cost with a short shelf life.

Relative total cost of ownership, not just compute

Approach	Relative cost	What drives it
Prompt engineering	10	Iterate in production, no training run
RAG	35	Indexing + retrieval infra, rides upgrades free
QLoRA fine-tune	60	Small compute, large data + re-tune tax
Full fine-tune	100	Multi-GPU compute + the heaviest maintenance tax
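The maintenance tax is easiest to see as arithmetic. The sketch below is illustrative only: `total_cost` is a hypothetical helper, and the unit costs are placeholders, not figures from our engagements or from the comparison above. The one input taken from the text is the cadence of roughly four base-model upgrades a year.

```python
def total_cost(build_cost, retune_cost, base_upgrades_per_year, years=1.0):
    """Build cost plus one re-tune for every base-model upgrade over the period."""
    return build_cost + retune_cost * base_upgrades_per_year * years

# Placeholder figures in arbitrary units; substitute your own.
qlora = total_cost(build_cost=50, retune_cost=20, base_upgrades_per_year=4)
prompt_rag = total_cost(build_cost=35, retune_cost=0, base_upgrades_per_year=4)
print(qlora, prompt_rag)
```

With these placeholders the re-tunes cost more than the original build within a year, while the prompt-plus-RAG line stays flat because its re-tune cost is zero.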

When fine-tuning is the right call: behavior, format, and narrow tasks

All of that honesty isn't an argument against fine-tuning; it's an argument for using it where it genuinely wins. There are three jobs fine-tuning does better than anything else. The first is rigid output structure: when you need a model to emit a specific JSON shape or query dialect every single time, a fine-tune internalizes that format more reliably than a prompt that the model occasionally ignores under pressure. The second is voice and style at scale: a consistent brand persona or a domain register that would otherwise cost you a long, expensive system prompt on every call. The third is a narrow, high-volume task where a tuned small open-weight model matches a frontier model's quality at a fraction of the per-call cost: classification, extraction, routing, the unglamorous workhorses. That last one is where the economics flip in fine-tuning's favor: when you're calling the model millions of times, a cheaper tuned 8B model that's good enough beats an expensive frontier model that's overqualified. The typical shape on a well-scoped classification task in 2026: a QLoRA-tuned 8B model can land within a point or two of a much larger frontier model's accuracy on a held-out set, at a small fraction of the per-call cost, but only after a prompt-only baseline was tried first and proven insufficient. Measure that gap on your own task before you bank on it.

Your workload	Start here	Add if needed	Fine-tune only when
Answer questions over internal docs	RAG + prompt	Better retrieval / reranking	Format won't hold (rare)
Emit strict JSON / a query dialect	Prompt + schema-constrained decoding	Few-shot examples	Drift persists at volume
High-volume classification / extraction	Prompt on a frontier model	RAG for edge cases	Per-call cost dominates the bill
Consistent brand voice / persona	System prompt	Style examples in context	Prompt is too long / leaks

Map the workload to the cheapest lever that solves it, then climb only as far as the evidence forces you. Fine-tuning is the top rung, not the default.
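"Per-call cost dominates the bill" can be checked with a break-even calculation before any training starts. This is a rough sketch: `breakeven_calls` is a hypothetical helper, the prices are invented, and the fixed cost should include data work and expected re-tunes, not just compute.

```python
import math

def breakeven_calls(fine_tune_fixed_cost, frontier_cost_per_call, tuned_cost_per_call):
    """Calls needed before per-call savings repay the fixed cost of the fine-tune."""
    saving = frontier_cost_per_call - tuned_cost_per_call
    if saving <= 0:
        return None  # the tuned model never pays for itself
    return math.ceil(fine_tune_fixed_cost / saving)

# Invented prices: 20,000 fixed, 0.01 per frontier call, 0.001 per tuned call.
calls = breakeven_calls(20_000, 0.01, 0.001)
print(f"break-even after about {calls:,} calls")
```

If your projected volume over the life of one base model sits below that number, the frontier model with a good prompt is the cheaper system.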

The LLM fine tuning method map: full, LoRA, QLoRA, and PEFT

Once you've decided to fine-tune, the next choice is how, and the methods differ mostly in how many weights they touch. Full fine-tuning updates every parameter in the model. It's the most powerful and the most expensive: you need enough VRAM to hold the whole model plus its optimizer states, which for anything past 8B means multiple high-end GPUs, and you get one giant checkpoint per task. PEFT, parameter-efficient fine-tuning, is the family that fixed this by freezing the base model and training only a tiny set of new weights. LoRA, the dominant PEFT method, injects small low-rank adapter matrices alongside the frozen weights and trains only those, often a tiny fraction of the full parameter count. QLoRA goes further by quantizing the frozen base to 4-bit so it fits in a fraction of the memory, then training LoRA adapters on top. The Hugging Face PEFT library, with bitsandbytes for the quantization, is the standard implementation, and wrappers like Unsloth and Axolotl make it a config file rather than a research project.

Method	Trainable params	VRAM (8B base)	Use when	Watch out
Full fine-tune	All weights	Multi-GPU (80GB+)	Deep domain shift, research budget	Heaviest maintenance tax; rarely needed
LoRA	Adapters only	~24GB (fp16 base)	Standard PEFT, ample GPU	Adapter rank too low underfits
QLoRA	Adapters only	~12-16GB (4-bit base)	The 2026 production default	4-bit can lose a little accuracy
Prompt / prefix tuning	Soft prompts	Minimal	Light steering, many tasks one base	Weaker than LoRA for real behavior change

The fine-tuning methods that matter in 2026, by how much of the model they touch.
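Two back-of-envelope numbers sit behind that table. The sketch below counts the parameters LoRA adds to a single weight matrix and the memory needed for the weights alone. The 4096 hidden size is an illustrative assumption, not a property of any specific model, and `lora_params` and `weight_memory_gb` are names chosen here.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank

def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

# One 4096 x 4096 projection at rank 16.
full = 4096 * 4096
adapter = lora_params(4096, 4096, rank=16)
print(f"adapter / full = {adapter / full:.4%}")

for bits in (16, 4):
    print(f"8B weights at {bits}-bit: {weight_memory_gb(8e9, bits):.0f} GB")
```

The weights-only figure is a floor. The VRAM ranges in the table are higher because a training run also has to hold activations, adapter gradients, and optimizer state.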

How QLoRA actually works: low-rank adapters and 4-bit base weights

QLoRA is two ideas stacked. The first is the LoRA insight: the weight update learned during fine-tuning tends to have low intrinsic rank, so instead of learning a full dense update matrix you learn two small matrices, A and B, whose product approximates it. You freeze the original weights entirely and train only A and B, which is why the trainable parameter count collapses to a sliver. The second idea is quantization: you load the frozen base model in 4-bit precision so it occupies roughly a quarter of the memory its 16-bit weights would, while keeping the small adapter weights in higher precision where the gradients need it. The frozen 4-bit base does the heavy lifting of language understanding; the tiny trainable adapters learn your specific behavior on top. This is not a free lunch on paper but it is close to one in practice: the 2023 QLoRA paper from Dettmers and colleagues reported its Guanaco model reaching 99.3% of ChatGPT's score on the Vicuna benchmark after 24 hours on a single GPU, with 4-bit adapter fine-tuning recovering essentially full 16-bit fine-tuning quality. In 2026 this is what makes the math work on cheap hardware: a QLoRA run on an 8B-class open-weight model fits in roughly 12-16GB of VRAM, a single rented L4 or a free Colab T4 with Unsloth, not a multi-GPU cluster. At inference you either keep the adapter separate and load it over the base, or merge it back into the weights for a single deployable model. The diagram below is the whole mechanism.

QLoRA: frozen 4-bit base, trainable low-rank adapters

The base model is quantized to 4-bit and frozen. Only the low-rank A and B adapter matrices are trained. Their product is added to the frozen layer's output. This is why a few million trainable parameters can re-shape an 8-billion-parameter model on a single GPU.
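The adapter arithmetic fits in a few lines of plain Python. This sketch uses a tiny 2x2 layer with a rank-1 adapter and made-up values, applies the alpha/r scaling from the LoRA formulation, and leaves out quantization entirely: the frozen `W` here is ordinary floats. It also shows that keeping the adapter separate and merging it into the weights give the same output.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y, scale=1.0):
    """Elementwise X + scale * Y."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would train."""
    return add(matmul(W, x), matmul(B, matmul(A, x)), scale=alpha / r)

def merge(W, A, B, alpha, r):
    """Fold the adapter into the weights: W' = W + (alpha / r) * B A."""
    return add(W, matmul(B, A), scale=alpha / r)

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (stored in 4-bit in real QLoRA)
A = [[1.0, 2.0]]              # r x d_in
B = [[0.5], [0.0]]            # d_out x r
x = [[1.0], [1.0]]            # one input column vector

separate = lora_forward(x, W, A, B, alpha=2, r=1)
merged = matmul(merge(W, A, B, alpha=2, r=1), x)
print(separate, merged)
```

Because the two paths agree, the choice between serving a separate adapter and shipping a merged model is an operational one, not a quality one, at least before any re-quantization of the merged weights.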

Data preparation: the part that decides whether your fine-tune works

If a fine-tune fails, the data is the cause far more often than the hyperparameters. The practitioner literature has said the same thing since the 2023 LIMA result and it still holds in 2026: a few hundred strictly-formatted, high-quality examples beat thousands of noisy ones. For supervised fine-tuning, your dataset is a list of prompt-and-response pairs in JSON Lines format, one example per line, with the exact structure you want the model to learn at inference encoded into every record. Three rules carry most of the weight. Format every example identically, including the chat template and any system message, because the model learns the shape as much as the content. Deduplicate hard, since near-duplicate examples quietly bias the model and inflate your eval. And split a clean held-out set before you train, never after, so your evaluation isn't measuring memorization. The snippet below is the actual record shape for an instruction-style SFT dataset.

prepare_sft_data.py python

# One SFT example per line. Keep the chat template identical to inference.
# A few hundred clean records beat thousands of noisy ones (LIMA, 2023 -> still true 2026).
import json, hashlib, random

def to_record(system, user, assistant):
    return {
        "messages": [
            {"role": "system",    "content": system},
            {"role": "user",      "content": user},
            {"role": "assistant", "content": assistant},  # the target behavior
        ]
    }

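# Placeholder: supply your own loader. Each item needs "system", "user", "assistant" keys.
raw = load_your_labeled_pairs()  # human-reviewed, in-format examples

# Dedup on a hash of the normalized user turn. This catches repeats that differ only
# in case or surrounding whitespace; true near-duplicates need fuzzy matching on top.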
seen, clean = set(), []
for r in raw:
    key = hashlib.sha256(r["user"].strip().lower().encode()).hexdigest()
    if key in seen:
        continue
    seen.add(key)
    clean.append(to_record(r["system"], r["user"], r["assistant"]))

# Hold out the eval split BEFORE training, never after.
random.seed(42)
random.shuffle(clean)
split = int(len(clean) * 0.9)
train, eval_set = clean[:split], clean[split:]

for name, rows in (("train", train), ("eval", eval_set)):
    with open(f"{name}.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

print(f"train={len(train)}  eval={len(eval_set)}")

A supervised fine-tuning dataset is just well-formed JSONL with a clean train/eval split. The structure of every record matters more than the count.
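Before training, it is worth checking mechanically that every line has the same shape and that no user turn sits in both splits. The sketch below assumes the three-message record shape shown above; `validate_record`, `user_turn`, and `leaked_user_turns` are hypothetical helper names, and the sample records are invented.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_record(line):
    """Return None if the line is a well-formed SFT record, else a reason string."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return "not valid JSON"
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
        return "missing 'messages' list"
    if [m.get("role") for m in messages] != EXPECTED_ROLES:
        return "roles must be system, user, assistant in that order"
    if any(not isinstance(m.get("content"), str) or not m["content"].strip() for m in messages):
        return "empty content"
    return None

def user_turn(line):
    """Normalized user turn, matching the strip-and-lowercase used for dedup."""
    return json.loads(line)["messages"][1]["content"].strip().lower()

def leaked_user_turns(train_lines, eval_lines):
    """User turns that appear in both splits."""
    return {user_turn(l) for l in train_lines} & {user_turn(l) for l in eval_lines}

good = json.dumps({"messages": [
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "Ping?"},
    {"role": "assistant", "content": "Pong."},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "Ping?"}]})

print(validate_record(good), "|", validate_record(bad))
print(leaked_user_turns([good], [good]))
```

Run `validate_record` over every line of `train.jsonl` and `eval.jsonl`, and treat any non-empty result from `leaked_user_turns` as a reason to rebuild the split.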

The supervised fine-tuning loop, end to end

Curate data (JSONL, dedup) -> Hold out eval (before training) -> Load 4-bit base (bitsandbytes) -> Train adapters (LoRA, TRL) -> Evaluate (vs prompt baseline) -> Merge / serve (vLLM, Ollama)

A runnable QLoRA fine-tuning recipe you can adapt this week

Here's the part the vendor explainers leave out: actual code you can run. The three variants below do the same QLoRA job three ways. Unsloth is the fastest path and the lowest memory, which is why it's the default for a single rented GPU or a free Colab. The Transformers plus TRL plus PEFT stack is the canonical Hugging Face approach when you want full control. Axolotl turns the whole thing into a YAML config, which is the cleanest option once you're running many jobs and want them reproducible. Pick one, point it at the JSONL you built above, and you have a tuned 8B model in an hour or two on hardware that costs a few dollars.

Unsloth | Transformers + TRL + PEFT | Axolotl (YAML)

train_unsloth.py python

# Fastest, lowest-VRAM path. Fits an 8B QLoRA in ~12-16GB (a free Colab T4 works).
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-8B-bnb-4bit",   # 4-bit base, frozen
    max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32, lora_dropout=0.0,  # the LoRA adapters
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
)

ds = load_dataset("json", data_files="train.jsonl", split="train")
SFTTrainer(
    model=model, tokenizer=tok, train_dataset=ds,
    args=SFTConfig(per_device_train_batch_size=2,
                   gradient_accumulation_steps=4,
                   num_train_epochs=2, learning_rate=2e-4,
                   output_dir="out"),
).train()
model.save_pretrained("adapter")  # tiny: just the LoRA weights

train_trl.py python

# Canonical Hugging Face stack. More control, slightly more memory.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

bnb = BitsAndBytesConfig(load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj"])

ds = load_dataset("json", data_files="train.jsonl", split="train")
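trainer = SFTTrainer(model=base, peft_config=lora, train_dataset=ds,
    processing_class=tok,
    args=SFTConfig(num_train_epochs=2, learning_rate=2e-4,
                   per_device_train_batch_size=2, output_dir="out"))
trainer.train()
trainer.save_model("adapter")  # save explicitly: a short run may end before the first periodic checkpoint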

qlora.yaml yaml

# Config-driven. Reproducible across many jobs: `axolotl train qlora.yaml`
base_model: Qwen/Qwen3-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]

datasets:
  - path: train.jsonl
    type: chat_template

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 2
learning_rate: 0.0002
val_set_size: 0.1        # extra validation split taken from train.jsonl; eval.jsonl stays untouched
output_dir: ./out

Evaluating a fine-tuned model: proving it beat the prompt-only baseline

A fine-tune isn't finished when the loss curve flattens; it's finished when you've proven it beats the cheaper thing you'd otherwise ship. That means one comparison above all others: your tuned model versus the same base model with a strong prompt, scored on a held-out golden set you built from real examples, not the training data. Run both through the same eval harness (Ragas, promptfoo, Inspect from the UK AISI, or LangSmith) and look at win-rate on your actual task, not generic benchmark scores. If the tuned model doesn't clearly beat the prompt-only baseline on your golden set, you've just spent a budget to ship the prompt. The discipline here is the same one we apply when scoring any model decision, the kind of evaluation rigor that underpins our work on agent systems in the multi-agent orchestration patterns guide, where a tuned small model often serves as one cheap, reliable tool inside a larger system.

What a trustworthy 2026 fine-tune eval looks like

Element	Target	What it means
Golden-set items	100+	Real, held-out, never in training
The baseline	vs prompt	Same base + strong prompt, head to head
The metric	win-rate	On your task, not generic benchmarks
Judge setup	blind	Position-randomized, family-isolated
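A blind, position-randomized win-rate can be sketched in a few lines. Everything model-shaped here is a stand-in: `tuned`, `baseline`, and `judge` are caller-supplied functions, and the demo judge is a trivial format check rather than an LLM. Family isolation, which we read as using a judge from a different model family than either contestant, is a deployment choice the code cannot show.

```python
import random

def win_rate(golden_set, tuned, baseline, judge, seed=0):
    """Fraction of items where the judge prefers the tuned model's answer.

    `tuned` and `baseline` map an item to an answer; `judge(item, first, second)`
    returns 0 or 1, the position of the answer it prefers.
    """
    rng = random.Random(seed)
    wins = 0
    for item in golden_set:
        answers = [("tuned", tuned(item)), ("baseline", baseline(item))]
        rng.shuffle(answers)  # hide which system produced which answer
        picked = judge(item, answers[0][1], answers[1][1])
        wins += answers[picked][0] == "tuned"
    return wins / len(golden_set)

# Stand-ins: the tuned model emits JSON, the baseline emits prose,
# and the judge prefers whichever answer looks like JSON.
golden = ["refund request", "login bug", "feature idea", "invoice typo"]
tuned = lambda item: '{"label": "ok"}'
baseline = lambda item: "Label: ok"
judge = lambda item, first, second: 0 if first.startswith("{") else 1

print(win_rate(golden, tuned, baseline, judge))
```

Because the judge only ever sees two unlabeled answers in shuffled order, a preference for the first or second slot cannot systematically favor either system.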

Catastrophic forgetting and the production gotchas nobody warns you about

The failure mode that surprises teams most is catastrophic forgetting: you fine-tune a model to be excellent at your narrow task and it quietly gets worse at everything else, including general instruction-following it used to do for free. PEFT methods like LoRA reduce this because the base weights stay frozen, but they don't eliminate it, and an over-trained adapter can still drag the model away from its general competence. The other gotchas cluster around deployment and drift, and they're the ones that turn a working fine-tune into a maintenance headache months later.

Engineer note

We learned the over-fitting lesson on a classification tune that looked perfect in eval and then fell apart in production. The model had memorized the surface phrasing of our training examples instead of the underlying rule, so it nailed anything that looked like the training set and flailed on real-world variants. Two things fixed it: we cut the epochs from four to two, because the model was past the point of learning the task and into memorizing it, and we rebuilt the golden set from production traffic rather than the same distribution the training data came from. The other habit we keep is to always serve the prompt-only baseline in shadow for the first few weeks. If the tuned model ever underperforms the baseline on live traffic, we can flip back instantly, and roughly one tune in five doesn't survive that test. Treat a fine-tune as a hypothesis you keep trying to falsify, not a deliverable you sign off and forget.
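The shadow-baseline habit reduces to a simple comparison on live traffic. The sketch below is one way to frame it, not the check we run verbatim: `should_roll_back`, the sample-size floor, and the margin are all assumptions, and the per-request scores are assumed to come from whatever quality metric you already trust.

```python
def should_roll_back(tuned_scores, baseline_scores, min_samples=100, margin=0.0):
    """True when the tuned model's mean score trails the shadow baseline's.

    Both lists hold per-request quality scores for the same live requests.
    """
    if len(tuned_scores) != len(baseline_scores):
        raise ValueError("score lists must cover the same requests")
    if len(tuned_scores) < min_samples:
        return False  # not enough evidence yet
    tuned_mean = sum(tuned_scores) / len(tuned_scores)
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    return tuned_mean < baseline_mean - margin

# Invented scores: the tuned model underperforms the baseline on live traffic.
print(should_roll_back([0.5] * 100, [0.8] * 100))
```

A positive `margin` makes the rule tolerate small gaps, which helps when the quality metric itself is noisy.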

FAQ — LLM fine tuning, in the practitioner's vocabulary

What is LLM fine tuning?

LLM fine tuning is continuing the training of a pre-trained model on a smaller, task-specific dataset so its weights shift toward a behavior you want: a format, a tone, or a narrow task. It changes how a model behaves, not what it knows. For new knowledge, retrieval (RAG) is the better and cheaper tool.

Fine tuning vs RAG: which should I use?

They solve different problems and often work together. Use RAG when the model lacks facts: it retrieves and injects them at inference with no training run, and it stays fresh. Use fine-tuning when the model has the knowledge but won't behave in the shape you need. If your complaint is "it doesn't know our data," that's RAG. If it's "it won't stick to our format," that may be a fine-tune.

Fine tuning vs prompt engineering: is fine-tuning worth it?

Try the prompt first, almost always. A strong system prompt with a few examples closes most behavior gaps for free, and you can change it in production without retraining. Fine-tuning earns its place only when a prompt-only baseline provably can't do the job, or when you call the model at such high volume that a cheaper tuned small model beats an expensive frontier model on cost.

When should you NOT fine-tune an LLM?

Don't fine-tune to inject knowledge (use RAG), don't fine-tune when the requirement changes weekly (you'll re-train weekly), don't fine-tune before exhausting prompt engineering, and don't fine-tune if nobody has budgeted for the maintenance tax of re-tuning on every base-model upgrade, which in 2026 is roughly quarterly.

What is the difference between LoRA, QLoRA, and PEFT?

PEFT (parameter-efficient fine-tuning) is the umbrella family that freezes the base model and trains only a few new weights. LoRA is the dominant PEFT method: it trains small low-rank adapter matrices, often a tiny fraction of the parameters. QLoRA is LoRA with the frozen base quantized to 4-bit, which is what lets an 8B model fine-tune in roughly 12-16GB of VRAM on a single GPU. QLoRA is the 2026 production default.

How much data do I need to fine-tune an LLM?

Less than you think, but cleaner. A few hundred strictly-formatted, high-quality, deduplicated examples beat thousands of noisy ones, a finding the literature has repeated since the 2023 LIMA result. Format every example exactly as it appears at inference, deduplicate hard, and hold out a clean eval split before training so you measure learning, not memorization.

LLM DEVELOPMENT

Fine-tune on evidence, not on hype.

We start every model-customization engagement by trying to talk you out of training, and only fine-tune when prompt and RAG provably can't do the job. If you're weighing fine-tuning, talk to the team that has run these decisions across 200+ engagements.
