LLM Fine-Tuning: LoRA, QLoRA & PEFT Without a Supercomputer

Fine-tuning a 7–70 billion parameter language model was, until 2023, something only organisations with GPU clusters could do. LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) changed the economics entirely: a 7B model can now be fine-tuned on a single RTX 3090 or even a free Google Colab T4 GPU. This guide walks through every decision in the fine-tuning pipeline — whether to fine-tune at all, dataset preparation, training configuration, and rigorous evaluation — with concrete code examples using the Hugging Face ecosystem and Unsloth.

1. Fine-Tune vs RAG vs Prompting: The Decision

Before any GPU cycles, answer the right question: do you actually need to fine-tune?

| Approach | When It Works | Cost | When It Fails |
|---|---|---|---|
| Prompt Engineering | Task is well-covered in base model training; few-shot examples sufficient | Near-zero | Task requires specialised knowledge not in training data; consistent output format is critical |
| RAG (Retrieval-Augmented Generation) | Task requires access to current, proprietary, or private knowledge | Low–Medium (infrastructure) | Task requires a different reasoning style or output format — knowledge is fine, behaviour isn't |
| Fine-Tuning | Specific output style/format; domain-specific reasoning; consistent tone; confidential data cannot leave org | Medium–High (GPU time + expertise) | Task simply needs knowledge lookup; small dataset; base model works well with prompting |

Rule of thumb: try prompting first for a week. If the model consistently fails despite good prompts, try RAG. If the issue is behaviour/style rather than knowledge, fine-tune. Many "fine-tuning" projects discover that a well-crafted system prompt with a handful of examples achieves 90% of the goal at zero cost.

2. LoRA: Low-Rank Adaptation Explained

Full fine-tuning updates all billions of parameters in a model. For a 7B Llama model in mixed precision, that means ~14GB of weights plus gradients and Adam optimizer states on top — well over 60GB in total, impossible on consumer hardware. LoRA's insight: the weight updates needed for a specific task live in a much lower-dimensional subspace.

Mathematically, instead of updating weight matrix $W \in \mathbb{R}^{d \times k}$ directly, LoRA freezes $W$ and adds a low-rank decomposition:

$$W' = W + BA$$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$. Only $A$ and $B$ are trained. With $r = 16$, LoRA adds ~0.1–1% of the original parameter count — typically 4–40 million trainable parameters for a 7B model, versus 7 billion for full fine-tuning.
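These parameter counts can be verified with simple arithmetic. A quick sketch using the published Llama 3.1 8B projection shapes (hidden size 4096, grouped-query KV dimension 1024, MLP dimension 14336, 32 layers):

```python
def lora_params(d_in, d_out, r):
    # A is (r x d_in), B is (d_out x r): r * (d_in + d_out) trainable weights
    return r * (d_in + d_out)

# Per-layer projection shapes (d_in, d_out) for Llama 3.1 8B
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

r = 16
per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * 32  # 32 transformer layers
print(f"{total:,}")  # 41,943,040
```

The same arithmetic shows why targeting only attention projections shrinks the adapter substantially: the three MLP matrices account for roughly two-thirds of the per-layer count.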

Key LoRA hyperparameters:

  • r (rank): Dimensionality of the low-rank matrices. Higher = more capacity but more memory. Typical: 8–64.
  • lora_alpha: Scaling factor. The LoRA update is multiplied by lora_alpha / r, so raising alpha effectively raises the adapter's learning rate. Usually set equal to r or twice r.
  • target_modules: Which weight matrices to apply LoRA to. Minimum: query and value attention projections (q_proj, v_proj). For stronger adaptation: all attention + MLP layers.
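A toy forward pass makes the mechanics concrete: the frozen path $Wx$ plus the trained low-rank path $B(Ax)$, scaled by alpha/r. Pure Python with tiny dimensions, for illustration only:

```python
# Toy LoRA forward pass: y = W x + (alpha / r) * B (A x), with W frozen.
# Dimensions are tiny (d = k = 2, r = 1) purely to keep the arithmetic visible.

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (identity here)
A = [[0.5, 0.5]]               # shape (r, k): trained
B = [[2.0], [0.0]]             # shape (d, r): trained (LoRA initialises B to
                               # zero so training starts at the base model;
                               # nonzero here for illustration)
alpha, r = 2, 1
scale = alpha / r

x = [1.0, 1.0]
base = matvec(W, x)              # frozen path
delta = matvec(B, matvec(A, x))  # low-rank trained path
y = [b + scale * d for b, d in zip(base, delta)]
print(y)  # [5.0, 1.0]
```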

3. QLoRA: Quantized LoRA

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit NormalFloat quantization of the base model weights. The base model is loaded in 4-bit (instead of 16 or 32-bit), reducing memory by 4–8×. LoRA adapters are trained in 16-bit precision. This allows fine-tuning a 70B model on a single A100 80GB GPU — previously requiring a multi-GPU setup.

In practice: QLoRA slightly degrades final model quality compared to LoRA at 16-bit, but the gap is small for most tasks and the hardware savings are enormous. For most use cases, QLoRA is the recommended starting point.
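The headline saving is easy to estimate. A weights-only sketch for an 8B model (activations, LoRA adapters, and optimizer state excluded; the ~0.4 bits/param double-quantization saving is the figure quoted in the QLoRA paper):

```python
# Back-of-envelope weight memory for an 8B-parameter base model.
params = 8_000_000_000
gib = 1024 ** 3

fp16 = params * 2 / gib                  # 16-bit: 2 bytes per parameter
nf4 = params * 0.5 / gib                 # 4-bit NF4: 0.5 bytes per parameter
nf4_dq = params * (0.5 - 0.4 / 8) / gib  # double quantization: ~0.4 bits/param less

print(f"bf16: {fp16:.1f} GiB, NF4: {nf4:.1f} GiB, NF4+DQ: {nf4_dq:.2f} GiB")
```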

4. PEFT Library: Practical Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,    # saves ~0.4 bits/param extra
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",                 # auto-distribute across GPUs
)

# Apply LoRA adapters
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,884,224 || trainable%: 0.52

5. Unsloth: 2× Faster Fine-Tuning

Unsloth is an open-source library that reimplements model kernels in hand-written Triton, achieving roughly 2× faster training and ~60% less VRAM usage than standard Hugging Face + PEFT. As of 2026 it supports Llama 3.x, Mistral, Gemma 2, Qwen 2.5, and others:

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype=None,       # auto-detect: bfloat16 on Ampere+
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_alpha=16,
    lora_dropout=0,   # 0 is optimised by Unsloth
    bias="none",
    use_gradient_checkpointing="unsloth",  # memory saving
    random_state=42,
)

6. Axolotl: Config-Based Training

Axolotl wraps the full fine-tuning pipeline into a YAML configuration, eliminating repetitive Python boilerplate. Ideal for running experiments and for production training pipelines:

# config.yml
base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: my-org/my-instruction-dataset
    type: alpaca        # or sharegpt, chatml, etc.

sequence_len: 4096
sample_packing: true   # pack multiple short samples into one sequence
val_set_size: 0.05

output_dir: ./outputs/llama-3-1-ft
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4   # effective batch size = 8
learning_rate: 0.0002
lr_scheduler: cosine
warmup_steps: 50
optimizer: adamw_8bit   # 8-bit Adam: same quality, 50% less VRAM

# Run with:
# axolotl train config.yml
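The micro_batch_size / gradient_accumulation_steps interaction in the config above can be sketched in a few lines of plain Python (toy scalar "gradients", illustrative only):

```python
# Gradient accumulation sketch: one optimizer step per `accum_steps` micro-batches,
# so the effective batch size is micro_batch_size * accum_steps (2 * 4 = 8 here)
# without ever holding the activations for 8 samples at once.
accum_steps = 4
lr = 2e-4

weight = 0.0
accumulated = 0.0
micro_batch_grads = [0.4, 0.2, 0.3, 0.1, 0.5, 0.1, 0.2, 0.2]  # toy per-micro-batch grads

updates = 0
for step, g in enumerate(micro_batch_grads, start=1):
    accumulated += g / accum_steps   # average over the accumulation window
    if step % accum_steps == 0:
        weight -= lr * accumulated   # one optimizer step per window
        accumulated = 0.0
        updates += 1

print(updates)  # 2 optimizer steps from 8 micro-batches
```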

7. Dataset Preparation

Dataset quality beats quantity: 1,000 high-quality, diverse instruction-response pairs commonly outperform 100,000 low-quality examples. The most common format is instruction tuning (Alpaca/ShareGPT format):

[
  {
    "instruction": "Extract all product names and prices from the following invoice text.",
    "input": "Invoice #1042\nProduct: Widget Pro x2 @ $24.99 each\nProduct: Gadget Plus x1 @ $89.00",
    "output": "Products:\n- Widget Pro: $24.99 (×2)\n- Gadget Plus: $89.00 (×1)"
  },
  ...
]
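Before training, each record is rendered into a single prompt string. A sketch of one common Alpaca-style template (the template below is a convention, not the only option; match whatever your training framework or base model expects):

```python
# Render an Alpaca-style record into a single training prompt.
def format_alpaca(example: dict) -> str:
    if example.get("input"):
        return ("### Instruction:\n{instruction}\n\n"
                "### Input:\n{input}\n\n"
                "### Response:\n{output}").format(**example)
    return ("### Instruction:\n{instruction}\n\n"
            "### Response:\n{output}").format(**example)

record = {
    "instruction": "Extract all product names and prices from the following invoice text.",
    "input": "Invoice #1042\nProduct: Widget Pro x2 @ $24.99 each",
    "output": "Products:\n- Widget Pro: $24.99 (x2)",
}
prompt = format_alpaca(record)
print(prompt)
```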

Data quality checklist:

  • Remove duplicates (exact and near-duplicate — use MinHash LSH or SimHash)
  • Filter by response length — remove one-word answers and unusually long outliers
  • Quality filter: GPT-4o ratings or Reward model scoring can filter low-quality pairs
  • Balance across task types if multi-task — avoid one task dominating and causing catastrophic forgetting
  • Ensure diversity: diversity of wording, instruction types, complexity levels, and domains
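The near-duplicate step can be approximated cheaply. A sketch using exact Jaccard similarity over word 3-gram shingles (MinHash LSH, mentioned above, approximates the same measure sub-quadratically; exact pairwise Jaccard is fine up to a few tens of thousands of examples):

```python
# Simplified near-duplicate check: Jaccard similarity over word 3-gram shingles.
def shingles(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

a = "Extract all product names and prices from the invoice text."
b = "Extract all the product names and prices from the invoice text."
print(round(jaccard(a, b), 2))  # 0.55 -- flag pairs above a chosen threshold for review
```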

8. Training Configuration

| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning rate | 1e-4 – 3e-4 | Higher than full fine-tuning; cosine decay schedule recommended |
| Batch size (effective) | 8–32 | Use gradient accumulation to reach the effective batch size without scaling VRAM |
| Epochs | 1–5 | LLMs overfit with more epochs on small datasets; fewer than 3 is usually optimal |
| Max sequence length | 2048–8192 | Higher = more VRAM; start with 4096 for most instruction tasks |
| LoRA r | 8–64 | Higher r = more capacity; for complex tasks, try 32–64 |
| Warmup | 50–100 steps | Gradually increase LR to prevent loss spikes at training start |
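The warmup-plus-cosine schedule recommended above looks like this (illustrative sketch; transformers' get_cosine_schedule_with_warmup produces the same shape):

```python
import math

# Linear warmup from 0 to peak_lr, then cosine decay to 0.
def lr_at(step, total_steps, warmup_steps=50, peak_lr=2e-4):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

total = 1000
print(lr_at(25, total))    # halfway through warmup: 1e-4
print(lr_at(50, total))    # peak learning rate: 2e-4
print(lr_at(1000, total))  # end of training: 0.0
```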

9. Evaluation and Benchmarks

Evaluation is the hardest part of LLM fine-tuning. Perplexity on a held-out set measures training fit but not real-world task quality. Use task-specific evaluation:

  • Exact match / F1: For extraction and classification tasks. Deterministic and cheap.
  • LLM-as-judge: Use GPT-4o or Claude to rate model outputs on a 1–5 scale for helpfulness, accuracy, and format adherence. Correlates well with human judgement at a fraction of the cost.
  • MT-Bench / AlpacaEval: Standard benchmarks for instruction-following quality using LLM-as-judge methodology.
  • Human eval: Irreplaceable for final quality gates before deployment. Budget 200–500 samples rated by domain experts.
  • Regression testing: Maintains a fixed set of examples where the base model already performs well — ensures fine-tuning hasn't degraded capability through catastrophic forgetting.
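The deterministic metrics are straightforward to implement. A sketch of exact match and token-level F1 following the SQuAD-style definitions (case-insensitive, whitespace tokenisation):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Widget Pro: $24.99", "widget pro: $24.99"))              # True
print(round(token_f1("Widget Pro costs $24.99", "Widget Pro: $24.99"), 2))  # 0.57
```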

10. Hardware Requirements by Model Size

| Model Size | Full Fine-Tune | LoRA (16-bit) | QLoRA (4-bit) | Min Hardware |
|---|---|---|---|---|
| 1B–3B | ~24GB VRAM | ~8GB VRAM | ~4GB VRAM | RTX 3060 12GB / free Colab T4 |
| 7B–8B | ~60GB VRAM | ~16GB VRAM | ~8GB VRAM | RTX 3090 24GB / RTX 4090 |
| 13B | ~120GB VRAM | ~32GB VRAM | ~14GB VRAM | RTX 3090 24GB + gradient checkpointing |
| 70B | Multi-GPU cluster | ~160GB VRAM | ~40GB VRAM | A100 80GB × 1 (QLoRA) |

11. Frequently Asked Questions

Will my fine-tuned model forget general knowledge?

Yes — this is catastrophic forgetting. When fine-tuned on a narrow task, models tend to degrade on everything else. Mitigations: use LoRA (the base weights stay frozen, so the original model can always be recovered by removing the adapter, and forgetting is far milder than with full fine-tuning), mix your task data with ~5–10% general instruction data, and keep fine-tuning epochs low (<3). Full fine-tuning carries the highest forgetting risk — another reason to prefer LoRA.
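The data-mixing mitigation can be sketched in a few lines (the dataset variables here are hypothetical; real pipelines operate on lists of records rather than strings):

```python
import random

# Mix ~10% general instruction data into a narrow task dataset to reduce
# catastrophic forgetting.
def mix_datasets(task_data, general_data, general_fraction=0.10, seed=42):
    rng = random.Random(seed)
    # How many general examples make up `general_fraction` of the final mix
    n_general = int(len(task_data) * general_fraction / (1 - general_fraction))
    mixed = task_data + rng.sample(general_data, min(n_general, len(general_data)))
    rng.shuffle(mixed)
    return mixed

task = [f"task-{i}" for i in range(900)]
general = [f"general-{i}" for i in range(5000)]
mixed = mix_datasets(task, general)
print(len(mixed))  # 1000: 900 task examples + 100 general
```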

Can I publish my fine-tuned model on Hugging Face?

Depends on the base model's licence. Llama 3.x's community licence permits commercial use and redistribution, but requires a separate licence once your products exceed 700M monthly active users — check the terms carefully. Mistral's open-weight models use the permissive Apache 2.0 licence; Gemma ships under Google's own Gemma terms of use. You can publish just the LoRA adapter weights (a few MB) with model.save_pretrained() — users merge them with the base model weights locally.

12. Glossary

LoRA (Low-Rank Adaptation)
An efficient fine-tuning technique that freezes base model weights and trains small low-rank adapter matrices, reducing trainable parameters by 99%.
QLoRA
LoRA combined with 4-bit NormalFloat quantization of the base model, enabling fine-tuning on consumer GPUs.
PEFT
Parameter-Efficient Fine-Tuning — a Hugging Face library implementing LoRA, QLoRA, and other efficient fine-tuning methods.
Instruction Tuning
Fine-tuning on (instruction, response) pairs to teach a base LLM to follow instructions and be helpful.
Catastrophic Forgetting
The tendency of neural networks to forget previously learned information when trained on new data.
Rank (r)
In LoRA, the dimension of the low-rank matrices; controls the capacity of the adaptation.

13. References & Further Reading

Start with Unsloth's Google Colab notebooks — they get LoRA fine-tuning running on a free T4 GPU in under 10 minutes. Once you've run a small experiment and seen the training loss curve, the concepts click immediately.