# LLM fine tuning: when to do it, and when not to

> A decision-led guide to LLM fine tuning: when NOT to do it, fine tuning vs RAG vs prompt engineering, the real cost, and a runnable QLoRA recipe with eval.

**HTML version:** https://www.paiteq.com/blog/llm-fine-tuning/
**Published:** 2026-06-12T04:26:27.679Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~13 min


---

Most of the LLM fine tuning projects we get asked to run shouldn't happen. The team has a model that's getting something wrong, someone read that fine-tuning fixes it, and a GPU budget appears. Three weeks later they've spent real money to land roughly where a better prompt and a retrieval layer would have put them in an afternoon. Fine-tuning is a powerful tool with a narrow job, and the gap between what it's sold as and what it actually does is where most of the wasted money lives.
So this guide is deliberately backwards. We've shipped model-selection and customization work across 200+ engagements, and the single highest-leverage thing we do is talk teams out of fine-tuning when it's the wrong tool. We'll start with when NOT to fine-tune, then put fine tuning vs RAG vs prompt engineering side by side so you can see what each one actually changes, then count the real cost. Only after that do we get into the how, LoRA, QLoRA, PEFT, data prep, and a runnable recipe, because the how is the easy part once you're sure you should be doing it at all.

## LLM fine tuning in one paragraph, and the question to ask before you start

LLM fine tuning is the practice of taking a pre-trained model and continuing its training on a smaller, task-specific dataset so that its weights shift toward the behavior you want. That's the whole idea, and the operative word is behavior. Fine-tuning is very good at changing how a model responds: its format, its tone, the shape of its outputs, the way it handles one narrow task you do over and over. It's bad, expensive, and unreliable at changing what a model knows. If your complaint is "the model gives the wrong format" or "it won't stick to our house style," fine-tuning is plausibly the right tool. If your complaint is "the model doesn't know our 4,000 internal documents," fine-tuning is the wrong tool wearing the right tool's clothes, and the right tool is retrieval. The one question to ask before you start is therefore blunt: am I trying to change how the model behaves, or what it knows? Get that wrong and everything downstream is wasted effort, because you'll spend a training budget teaching a model to memorize facts it will still hallucinate, when you could have handed it those facts at inference time and been done. Almost every regret we see traces back to skipping this one question.

## When NOT to fine-tune an LLM: the decision rule most teams skip

Here's the rule we run before any training job gets approved, and it's the same rule we'd give anyone deciding when to fine tune llm systems at all. Classify the problem first. If it's a knowledge problem, the model needs facts it doesn't have, the answer is retrieval, not fine-tuning. If it's a capability problem, the model genuinely can't reason through the task at all, the answer is a stronger model, not fine-tuning a weaker one. Fine-tuning only earns its place when it's a behavior problem: the model can do the task and has the knowledge, but won't do it in the shape, tone, or reliability you need. That's a narrow slice, and most requests don't land in it.
There are four situations where we actively tell teams to stop. First, when the goal is to inject knowledge: a model doesn't reliably memorize facts from a few thousand examples, and even when it appears to, it'll confabulate around the edges. Second, when the requirement is changing fast: a fine-tune freezes a behavior, so if your spec shifts weekly you'll be re-training weekly. Third, when you haven't exhausted prompting: a strong system prompt with a few well-chosen examples closes most behavior gaps for free, and you should prove a prompt-only baseline can't do the job before you train. Fourth, when nobody has budgeted for maintenance, which we'll cost out below. If you're tempted to skip the prompt-first step, our [AI readiness assessment](/blog/ai-readiness-assessment/) framework treats "have you exhausted the cheap options" as a gate before any model-training line item gets approved.

## Fine tuning vs RAG vs prompt engineering: what each one actually changes

The fine tuning vs rag debate gets framed as either/or, and that framing is the mistake. They change different things and the strongest systems use them together. Prompt engineering changes the instructions the model sees at inference; it's the cheapest lever and the first one to pull. RAG changes the context the model sees at inference by retrieving relevant facts and pasting them into the prompt; it's how you give a model knowledge it didn't train on, kept fresh without retraining. Fine-tuning changes the weights themselves; it's how you bake in a behavior so durably that you don't have to spell it out every call. The honest 2026 answer to fine tuning vs prompt engineering is: try the prompt first, almost always, because it costs nothing to iterate and you can change it in production without a training run.
Put all three on one map and the division of labor is clear. The three approaches aren't rivals; they're a ladder you climb only as far as you need to. The way model behavior shifts under each lever is its own topic, in the same family as the design tradeoffs we cover in our piece on [diffusion versus flow-based generative models](/blog/diffusion-vs-flow-models/), where the choice of objective changes everything downstream.

## The real cost of LLM fine tuning: data, GPU hours, and the maintenance tax

Vendor guides love to say PEFT "saves VRAM" and leave cost there, because the VRAM line is the only one that got cheap. The expensive lines didn't. The GPU hours for a QLoRA run on an 8B-class model are genuinely small in 2026, often a few dollars of rented compute, and that's the number people anchor on. The data is where the money actually goes: assembling a few hundred to a few thousand high-quality, strictly-formatted examples, deduplicating them, and holding out a clean eval split is human-labeling work that dwarfs the compute. Then there's the line nobody puts in the proposal, the maintenance tax: the base model you tuned gets superseded roughly every quarter, and your fine-tune is now frozen on an older model while the off-the-shelf frontier walks past it. A prompt-plus-RAG system rides that upgrade for free; a fine-tune has to be re-run on the new base, re-evaluated, and re-deployed. Budget for the second and third re-tune, or the first one was a sunk cost with a short shelf life.

## When fine-tuning is the right call: behavior, format, and narrow tasks

All of that honesty isn't an argument against fine-tuning; it's an argument for using it where it genuinely wins. There are three jobs fine-tuning does better than anything else. The first is rigid output structure: when you need a model to emit a specific JSON shape or query dialect every single time, a fine-tune internalizes that format more reliably than a prompt that the model occasionally ignores under pressure. The second is voice and style at scale: a consistent brand persona or a domain register that would otherwise cost you a long, expensive system prompt on every call. The third is a narrow, high-volume task where a tuned small open-weight model matches a frontier model's quality at a fraction of the per-call cost, classification, extraction, routing, the unglamorous workhorses. That last one is where the economics flip in fine-tuning's favor: when you're calling the model millions of times, a cheaper tuned 8B model that's good enough beats an expensive frontier model that's overqualified. The typical shape on a well-scoped classification task in 2026: a QLoRA-tuned 8B model can land within a point or two of a much larger frontier model's accuracy on a held-out set, at a small fraction of the per-call cost, but only after a prompt-only baseline was tried first and proven insufficient. Measure that gap on your own task before you bank on it.

> [!NOTE] (rich block: callout)

## The LLM fine tuning method map: full, LoRA, QLoRA, and PEFT

Once you've decided to fine-tune, the next choice is how, and the methods differ mostly in how many weights they touch. Full fine-tuning updates every parameter in the model. It's the most powerful and the most expensive: you need enough VRAM to hold the whole model plus its optimizer states, which for anything past 8B means multiple high-end GPUs, and you get one giant checkpoint per task. PEFT, parameter-efficient fine-tuning, is the family that fixed this by freezing the base model and training only a tiny set of new weights. LoRA, the dominant PEFT method, injects small low-rank adapter matrices alongside the frozen weights and trains only those, often a tiny fraction of the full parameter count. QLoRA goes further by quantizing the frozen base to 4-bit so it fits in a fraction of the memory, then training LoRA adapters on top. The Hugging Face PEFT library, with bitsandbytes for the quantization, is the standard implementation, and wrappers like Unsloth and Axolotl make it a config file rather than a research project.

## How QLoRA actually works: low-rank adapters and 4-bit base weights

QLoRA is two ideas stacked. The first is the LoRA insight: a weight update during fine-tuning has low intrinsic rank, so instead of learning a full dense update matrix you learn two small matrices, A and B, whose product approximates it. You freeze the original weights entirely and train only A and B, which is why the trainable parameter count collapses to a sliver. The second idea is quantization: you load the frozen base model in 4-bit precision so it occupies roughly a quarter of the memory, while keeping the small adapter weights in higher precision where the gradients need it. The frozen 4-bit base does the heavy lifting of language understanding; the tiny trainable adapters learn your specific behavior on top. This is not a free lunch on paper but it is close to one in practice: the 2023 QLoRA paper from Dettmers and colleagues reported its Guanaco model reaching 99.3% of ChatGPT's score on the Vicuna benchmark after 24 hours on a single GPU, with 4-bit adapter fine-tuning recovering essentially full 16-bit fine-tuning quality. In 2026 this is what makes the math work on cheap hardware: a QLoRA run on an 8B-class open-weight model fits in roughly 12-16GB of VRAM, a single rented L4 or a free Colab T4 with Unsloth, not a multi-GPU cluster. At inference you either keep the adapter separate and load it over the base, or merge it back into the weights for a single deployable model. The diagram below is the whole mechanism.

## Data preparation: the part that decides whether your fine-tune works

If a fine-tune fails, the data is the cause far more often than the hyperparameters. The practitioner literature has said the same thing since the 2023 LIMA result and it still holds in 2026: a few hundred strictly-formatted, high-quality examples beat thousands of noisy ones. For supervised fine-tuning, your dataset is a list of prompt-and-response pairs in JSON Lines format, one example per line, with the exact structure you want the model to learn at inference encoded into every record. Three rules carry most of the weight. Format every example identically, including the chat template and any system message, because the model learns the shape as much as the content. Deduplicate hard, since near-duplicate examples quietly bias the model and inflate your eval. And split a clean held-out set before you train, never after, so your evaluation isn't measuring memorization. The snippet below is the actual record shape for an instruction-style SFT dataset.

## A runnable QLoRA fine-tuning recipe you can adapt this week

Here's the part the vendor explainers leave out: actual code you can run. The three variants below do the same QLoRA job three ways. Unsloth is the fastest path and the lowest memory, which is why it's the default for a single rented GPU or a free Colab. The Transformers plus TRL plus PEFT stack is the canonical Hugging Face approach when you want full control. Axolotl turns the whole thing into a YAML config, which is the cleanest option once you're running many jobs and want them reproducible. Pick one, point it at the JSONL you built above, and you have a tuned 8B model in an hour or two on hardware that costs a few dollars.

## Evaluating a fine-tuned model: proving it beat the prompt-only baseline

A fine-tune isn't finished when the loss curve flattens; it's finished when you've proven it beats the cheaper thing you'd otherwise ship. That means one comparison above all others: your tuned model versus the same base model with a strong prompt, scored on a held-out golden set you built from real examples, not the training data. Run both through the same eval harness, Ragas, promptfoo, Inspect from the UK AISI, or LangSmith, and look at win-rate on your actual task, not generic benchmark scores. If the tuned model doesn't clearly beat the prompt-only baseline on your golden set, you've just spent a budget to ship the prompt. The discipline here is the same one we apply when scoring any model decision, the kind of evaluation rigor that underpins our work on agent systems in the [multi-agent orchestration patterns](/blog/multi-agent-orchestration-patterns/) guide, where a tuned small model often serves as one cheap, reliable tool inside a larger system.

## Catastrophic forgetting and the production gotchas nobody warns you about

The failure mode that surprises teams most is catastrophic forgetting: you fine-tune a model to be excellent at your narrow task and it quietly gets worse at everything else, including general instruction-following it used to do for free. PEFT methods like LoRA reduce this because the base weights stay frozen, but they don't eliminate it, and an over-trained adapter can still drag the model away from its general competence. The other gotchas cluster around deployment and drift, and they're the ones that turn a working fine-tune into a maintenance headache months later.

## FAQ — LLM fine tuning, in the practitioner's vocabulary

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/