1. Fine-Tune vs RAG vs Prompting: The Decision
Before any GPU cycles, answer the right question: do you actually need to fine-tune?
| Approach | When It Works | Cost | When It Fails |
|---|---|---|---|
| Prompt Engineering | Task is well-covered in base model training; few-shot examples sufficient | Near-zero | Task requires specialised knowledge not in training data; consistent output format is critical |
| RAG (Retrieval-Augmented Generation) | Task requires access to current, proprietary, or private knowledge | Low–Medium (infrastructure) | Task requires a different reasoning style or output format — knowledge is fine, behaviour isn't |
| Fine-Tuning | Specific output style/format; domain-specific reasoning; consistent tone; confidential data cannot leave org | Medium–High (GPU time + expertise) | Task simply needs knowledge lookup; small dataset; when base model works well with prompting |
Rule of thumb: Try prompting first for a week. If the model consistently fails despite good prompts, try RAG. If the issue is behaviour/style rather than knowledge, fine-tune. Many "fine-tuning" projects discover that a carefully written system prompt plus a handful of few-shot examples achieves 90% of the goal at zero cost.
2. LoRA: Low-Rank Adaptation Explained
Full fine-tuning updates all billions of parameters in a model. For a 7B Llama model, gradients and Adam optimizer states add tens of gigabytes on top of the ~14GB of fp16 weights — roughly 60GB in total, impossible on consumer hardware. LoRA's insight: the weight updates needed for a specific task live in a much lower-dimensional subspace.
Mathematically, instead of updating weight matrix $W \in \mathbb{R}^{d \times k}$ directly, LoRA freezes $W$ and adds a low-rank decomposition:
$$W' = W + BA$$
Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$. Only $A$ and $B$ are trained. With $r = 16$, LoRA adds ~0.1–1% of the original parameter count — typically 4–40 million trainable parameters for a 7B model, versus 7 billion for full fine-tuning.
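To make those numbers concrete, here is a back-of-the-envelope parameter count for a single 4096×4096 projection matrix (dimensions assumed, typical of 7B-scale models):

```python
# Parameter count: full fine-tuning vs LoRA for one weight matrix.
d, k, r = 4096, 4096, 16

full = d * k             # parameters updated by full fine-tuning
lora = d * r + r * k     # parameters in B (d×r) plus A (r×k)

print(full)              # 16777216
print(lora)              # 131072
print(lora / full)       # 0.0078125 → ~0.8% per adapted matrix
```

Summed over all adapted matrices in the model, this is where the ~0.1–1% figure comes from.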
Key LoRA hyperparameters:
- `r` (rank): Dimensionality of the low-rank matrices. Higher = more capacity but more memory. Typical: 8–64.
- `lora_alpha`: Scaling factor, usually set equal to `r` or twice `r`. Effectively scales the adapter's contribution (and thus the learning rate's impact).
- `target_modules`: Which weight matrices to apply LoRA to. Minimum: query and value attention projections (`q_proj`, `v_proj`). For stronger adaptation: all attention + MLP layers.
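A toy forward pass makes the roles of the rank, the `alpha/r` scaling, and the zero-initialised `B` matrix visible — with `B` at zero, the adapted model starts out exactly equal to the base model. A pure-Python sketch with tiny illustrative dimensions:

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

r, alpha = 1, 2
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight (d×k)
A = [[0.1, 0.2, 0.3]]        # r×k, trainable (small random init in practice)
B = [[0.0], [0.0], [0.0]]    # d×r, zero init → adapter contributes nothing at start

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x)
    Wx = matvec(W, x)
    BAx = matvec(B, matvec(A, x))
    return [wi + (alpha / r) * bi for wi, bi in zip(Wx, BAx)]

print(lora_forward([1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0] — identical to W x before training
```

The zero init of `B` is what guarantees training starts from the unmodified base model; gradients then move `A` and `B` away from the identity.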
3. QLoRA: Quantized LoRA
QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit NormalFloat quantization of the base model weights. The base model is loaded in 4-bit (instead of 16 or 32-bit), reducing memory by 4–8×. LoRA adapters are trained in 16-bit precision. This allows fine-tuning a 70B model on a single A100 80GB GPU — previously requiring a multi-GPU setup.
In practice: QLoRA slightly degrades final model quality compared to LoRA at 16-bit, but the gap is small for most tasks and the hardware savings are enormous. For most use cases, QLoRA is the recommended starting point.
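The memory arithmetic behind that claim is simple (weights only — activations, KV cache, and the 16-bit LoRA adapters add overhead on top):

```python
# Weight memory for a 7B-parameter model at different precisions.
params = 7e9

fp32_gb = params * 4 / 1e9    # 28.0 GB — full precision
fp16_gb = params * 2 / 1e9    # 14.0 GB — half precision
nf4_gb  = params * 0.5 / 1e9  # 3.5 GB — 4 bits/param, ignoring quantization constants

print(fp16_gb, nf4_gb)  # 14.0 3.5
```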
4. PEFT Library: Practical Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,          # saves ~0.4 bits/param extra
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",                       # auto-distribute across GPUs
)

# Apply LoRA adapters
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,884,224 || trainable%: 0.52
```
5. Unsloth: 2× Faster Fine-Tuning
Unsloth is an open-source library that reimplements model kernels with hand-written Triton code, achieving ~2× faster training and ~60% less VRAM usage compared to standard Hugging Face + PEFT. As of 2026 it supports Llama 3.x, Mistral, Gemma 2, Qwen 2.5, and others:
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype=None,          # auto-detect: bfloat16 on Ampere+
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,      # 0 is optimised by Unsloth
    bias="none",
    use_gradient_checkpointing="unsloth",  # memory saving
    random_state=42,
)
```
6. Axolotl: Config-Based Training
Axolotl wraps the full fine-tuning pipeline into a YAML configuration, eliminating repetitive Python boilerplate. Ideal for running experiments and for production training pipelines:
```yaml
# config.yml
base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: my-org/my-instruction-dataset
    type: alpaca                   # or sharegpt, chatml, etc.

sequence_len: 4096
sample_packing: true               # pack multiple short samples into one sequence
val_set_size: 0.05
output_dir: ./outputs/llama-3-1-ft

num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4     # effective batch size = 8
learning_rate: 0.0002
lr_scheduler: cosine
warmup_steps: 50
optimizer: adamw_8bit              # 8-bit Adam: same quality, 50% less VRAM

# Run with:
#   axolotl train config.yml
```
7. Dataset Preparation
Dataset quality beats quantity. 1,000 high-quality, diverse instruction-response pairs routinely outperform 100,000 low-quality examples. The most common format is instruction tuning (Alpaca/ShareGPT format):
```json
[
  {
    "instruction": "Extract all product names and prices from the following invoice text.",
    "input": "Invoice #1042\nProduct: Widget Pro x2 @ $24.99 each\nProduct: Gadget Plus x1 @ $89.00",
    "output": "Products:\n- Widget Pro: $24.99 (×2)\n- Gadget Plus: $89.00 (×1)"
  },
  ...
]
```
Data quality checklist:
- Remove duplicates (exact and near-duplicate — use MinHash LSH or SimHash)
- Filter by response length — remove one-word answers and unusually long outliers
- Quality filter: GPT-4o ratings or reward-model scoring can filter out low-quality pairs
- Balance across task types if multi-task — avoid one task dominating and causing catastrophic forgetting
- Ensure diversity: diversity of wording, instruction types, complexity levels, and domains
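The first two checklist items can be sketched as a simple cleaning pass — exact dedup via hashing plus a response-length filter. Near-duplicate detection (MinHash LSH) is omitted for brevity; thresholds are illustrative:

```python
import hashlib

def clean_dataset(pairs, min_words=3, max_words=1000):
    """Drop exact duplicates and responses that are too short or too long."""
    seen, kept = set(), []
    for ex in pairs:
        key = hashlib.sha256(
            (ex["instruction"] + "\x00" + ex["output"]).encode()
        ).hexdigest()
        n_words = len(ex["output"].split())
        if key in seen or not (min_words <= n_words <= max_words):
            continue
        seen.add(key)
        kept.append(ex)
    return kept

data = [
    {"instruction": "Say hi", "output": "Hello there, friend!"},
    {"instruction": "Say hi", "output": "Hello there, friend!"},  # exact duplicate
    {"instruction": "Explain", "output": "No."},                  # one-word answer
]
print(len(clean_dataset(data)))  # 1
```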
8. Training Configuration
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning rate | 1e-4 – 3e-4 | Higher than full fine-tuning; cosine decay schedule recommended |
| Batch size (effective) | 8–32 | Use gradient accumulation to achieve effective batch size without VRAM scaling |
| Epochs | 1–5 | LLMs overfit with more epochs on small datasets. <3 is usually optimal |
| Max sequence length | 2048–8192 | Higher = more VRAM. Start with 4096 for most instruction tasks |
| LoRA r | 8–64 | Higher r = more capacity. For complex tasks, try 32–64 |
| Warmup | 50–100 steps | Gradually increase LR to prevent loss spikes at training start |
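The learning-rate behaviour described in the table — linear warmup into a cosine decay — can be sketched as a schedule function (name and defaults are illustrative; trainers like Axolotl configure this for you):

```python
import math

def lr_at(step, peak_lr=2e-4, warmup_steps=50, total_steps=1000):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0))     # 0.0 — start from zero to avoid loss spikes
print(lr_at(50))    # 0.0002 — peak after warmup
print(lr_at(1000))  # ~0.0 — decayed to zero at the end of training
```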
9. Evaluation and Benchmarks
Evaluation is the hardest part of LLM fine-tuning. Perplexity on a held-out set measures training fit but not real-world task quality. Use task-specific evaluation:
- Exact match / F1: For extraction and classification tasks. Deterministic and cheap.
- LLM-as-judge: Use GPT-4o or Claude to rate model outputs on a 1–5 scale for helpfulness, accuracy, and format adherence. Correlates well with human judgement at a fraction of the cost.
- MT-Bench / AlpacaEval: Standard benchmarks for instruction-following quality using LLM-as-judge methodology.
- Human eval: Irreplaceable for final quality gates before deployment. Budget 200–500 samples rated by domain experts.
- Regression testing: Maintain a fixed set of examples where the base model already performs well — this ensures fine-tuning hasn't degraded capability through catastrophic forgetting.
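For the first bullet, a minimal SQuAD-style token-overlap F1 between a prediction and a reference looks like this (a common formulation; normalisation here is just lowercasing):

```python
def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall over tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    common, remaining = 0, list(g)
    for tok in p:
        if tok in remaining:
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Widget Pro $24.99", "widget pro: $24.99"))  # 0.666… (2 of 3 tokens match)
print(token_f1("a b", "a b"))                               # 1.0
```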
10. Hardware Requirements by Model Size
| Model Size | Full Fine-Tune | LoRA (16-bit) | QLoRA (4-bit) | Min Hardware |
|---|---|---|---|---|
| 1B–3B | ~24GB VRAM | ~8GB VRAM | ~4GB VRAM | RTX 3060 12GB / free Colab T4 |
| 7B–8B | ~60GB VRAM | ~16GB VRAM | ~8GB VRAM | RTX 3090 24GB / RTX 4090 |
| 13B | ~120GB VRAM | ~32GB VRAM | ~14GB VRAM | RTX 3090 24GB + gradient checkpointing |
| 70B | Multi-GPU cluster | ~160GB VRAM | ~40GB VRAM | A100 80GB × 1 (QLoRA) |
11. Frequently Asked Questions
Will my fine-tuned model forget general knowledge?
Yes — this is catastrophic forgetting. When fine-tuned on a narrow task, models tend to degrade on everything else. Mitigation: use LoRA (the base weights stay frozen, which greatly limits forgetting, and you can always detach the adapter to recover the original model), mix your task data with ~5–10% general instruction data, and keep fine-tuning epochs low (<3). Full fine-tuning carries the highest forgetting risk — another reason to prefer LoRA.
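The data-mixing mitigation can be sketched as follows (function name and ratio are illustrative; the examples here are stand-in integers rather than real instruction pairs):

```python
import random

def mix_datasets(task_data, general_data, general_frac=0.08, seed=42):
    """Blend ~8% general instruction data into task data to curb forgetting."""
    rng = random.Random(seed)
    # Solve n / (len(task_data) + n) = general_frac for n general examples.
    n_general = round(len(task_data) * general_frac / (1 - general_frac))
    mixed = task_data + rng.sample(general_data, min(n_general, len(general_data)))
    rng.shuffle(mixed)
    return mixed

task = list(range(100))            # stand-ins for task-specific examples
general = list(range(1000, 2000))  # stand-ins for general instruction data
mixed = mix_datasets(task, general)
print(len(mixed))  # 109 → 9 of 109 (~8%) are general examples
```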
Can I publish my fine-tuned model on Hugging Face?
Depends on the base model's licence. The Llama 3.x community licence permits publication and commercial use unless your products exceed 700M monthly active users — check the terms carefully. Mistral's open models use the permissive Apache 2.0 licence; Gemma ships under Google's own Gemma terms of use. You can publish just the LoRA adapter weights (a few MB) with model.save_pretrained() — users merge them with the base model weights locally.
12. Glossary
- LoRA (Low-Rank Adaptation)
- An efficient fine-tuning technique that freezes base model weights and trains small low-rank adapter matrices, reducing trainable parameters by 99%.
- QLoRA
- LoRA combined with 4-bit NormalFloat quantization of the base model, enabling fine-tuning on consumer GPUs.
- PEFT
- Parameter-Efficient Fine-Tuning — a Hugging Face library implementing LoRA, QLoRA, and other efficient fine-tuning methods.
- Instruction Tuning
- Fine-tuning on (instruction, response) pairs to teach a base LLM to follow instructions and be helpful.
- Catastrophic Forgetting
- The tendency of neural networks to forget previously learned information when trained on new data.
- Rank (r)
- In LoRA, the dimension of the low-rank matrices; controls the capacity of the adaptation.
13. References & Further Reading
- LoRA paper — Hu et al. (2021)
- QLoRA paper — Dettmers et al. (2023)
- Hugging Face PEFT Library
- Unsloth — Fast LLM Fine-Tuning
- Axolotl — Config-Based Training
Start with Unsloth's Google Colab notebooks — they get LoRA fine-tuning running on a free T4 GPU in under 10 minutes. Once you've run a small experiment and seen the training loss curve, the concepts click immediately.