SumGuy's Ramblings

LLM Fine-Tuning for Mortals: LoRA, QLoRA, and Your Gaming GPU

So you’ve been playing with ChatGPT, Claude, or some open-source model running on your machine, and you’ve hit a wall. No matter how cleverly you write your system prompt, the model keeps doing that thing you hate. Maybe it won’t stop being corporate. Maybe it doesn’t understand your domain. Maybe you need it to output JSON in a very specific format and it keeps hallucinating extra fields like an overenthusiastic intern.

You’ve heard whispers of “fine-tuning” in Discord servers and Reddit threads. People talk about it the way medieval villagers talked about dragons — with a mixture of fear, respect, and the vague suspicion that you need to be a wizard to attempt it.

Good news: you don’t. Fine-tuning is more accessible than it’s ever been, and you can do it on the same GPU you use to play Elden Ring. Let’s demystify this thing.

First, Let’s Get Our Terms Straight

Before we start throwing acronyms around like confetti, let’s establish what fine-tuning actually is and how it compares to the other ways you can make an LLM do what you want.

Prompt Engineering

This is the “just ask nicely” approach. You write a system prompt, provide examples, use chain-of-thought, and pray. It’s free, it’s fast, and for a shocking number of use cases, it’s enough.

When to use it: When the model already knows how to do the thing — you just need to steer it. Think of it as giving directions to a driver who already knows how to drive.

Limitations: Context window limits, inconsistent behavior, and some tasks that models just can’t be prompted into doing well no matter how creative you get.

Retrieval-Augmented Generation (RAG)

RAG is the “give the model a cheat sheet” approach. You stuff relevant documents into the context window at inference time so the model can reference them. Great for knowledge-intensive tasks where the model needs access to your specific data.

When to use it: When the model needs to know facts it wasn’t trained on — your company’s documentation, product specs, legal documents, etc.

Limitations: You’re still limited by context window size, retrieval quality matters enormously, and the model’s behavior doesn’t change — just its available information. RAG can’t teach a model to write in a different style or follow a different output format consistently.

Fine-Tuning

This is the “actually change the model’s brain” approach. You take a pre-trained model and continue training it on your specific data so it internalizes new patterns, behaviors, styles, or knowledge.

When to use it: When you need the model to consistently behave differently — match a writing style, follow a specific output format, handle domain-specific tasks, or respond in ways that can’t be reliably prompted.

Limitations: Requires training data and compute, and it can go wrong: catastrophic forgetting, overfitting, or accidentally teaching the model to be worse at everything except your specific task.

Think of it this way: prompt engineering is giving the model instructions, RAG is giving the model a textbook, and fine-tuning is sending the model to school.

The VRAM Problem (And Why Full Fine-Tuning Is Not for You)

Here’s the part where we talk about why you’re not just going to do regular, full fine-tuning.

A 7-billion parameter model in full precision (float32) takes about 28 GB just to load the weights. Training requires storing optimizer states and gradients too, so you’re looking at roughly 3-4x the model size in VRAM. That’s 84-112 GB of VRAM for a 7B model. A 70B model? Forget about it. You’d need a small cluster of A100s, and those cost more than most people’s cars.

Your RTX 4090 has 24 GB of VRAM. Your RTX 3080 has 10 GB. See the problem?

This is where LoRA enters the chat.

LoRA: The Adapter That Changed Everything

LoRA stands for Low-Rank Adaptation of Large Language Models, which is a fancy way of saying: “What if we didn’t retrain the whole model? What if we just added some small, trainable modules on top and left the original weights frozen?”

Here’s the intuition. A neural network layer is basically a big matrix multiplication. For a 7B model, these matrices can be enormous — think 4096 x 4096. Full fine-tuning updates every single value in these matrices. That’s expensive.

LoRA’s insight is that the changes you need to make to these matrices during fine-tuning are usually low-rank — meaning they can be approximated by multiplying two much smaller matrices together. Instead of updating a 4096 x 4096 matrix (about 16.7 million parameters), you decompose the update into two matrices: one that’s 4096 x 16 and another that’s 16 x 4096 (about 131,000 parameters). That’s a 99.2% reduction in trainable parameters.
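To make that math concrete, here's a toy sketch in plain NumPy (not the actual peft implementation) of the low-rank update:

```python
import numpy as np

d, r = 4096, 16

# Full fine-tuning updates every entry of the d x d weight matrix.
full_params = d * d                  # 16,777,216

# LoRA trains two low-rank factors instead: B (d x r) and A (r x d).
lora_params = d * r + r * d          # 131,072 -> ~99.2% fewer trainable params

# The effective weight at inference is W + scale * (B @ A).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))    # small stand-in for a frozen base weight
B = np.zeros((64, 4))                # B is initialized to zero...
A = rng.standard_normal((4, 64))     # ...so the initial update B @ A is zero
scale = 8 / 4                        # lora_alpha / r

W_eff = W + scale * (B @ A)
assert np.allclose(W_eff, W)         # an untrained adapter changes nothing
print(full_params, lora_params)
```

Because B starts at zero, adding LoRA to a model changes nothing until training pushes the factors away from zero, which is exactly why the base model's behavior is preserved at the start of a run.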

The Analogy

Imagine you have a massive oil painting that you want to modify. Full fine-tuning is like repainting the entire canvas from scratch. LoRA is like placing a thin transparent overlay on top and painting only the changes you need. The original painting stays untouched, and you can swap overlays in and out whenever you want.

This has several beautiful consequences:

  1. Way less VRAM — you’re only storing gradients and optimizer states for the tiny adapter matrices, not the whole model.
  2. Way faster training — fewer parameters to update means each training step is faster.
  3. The base model stays intact — no catastrophic forgetting. Your adapter is modular. Don’t like it? Remove it. Want a different one? Swap it in.
  4. Tiny file sizes — a LoRA adapter for a 7B model might be 10-50 MB, compared to the 14+ GB base model. Easy to share, version, and store.

Key LoRA Hyperparameters

When you configure LoRA, you’ll encounter a few important settings:

  1. r (rank): the inner dimension of the two adapter matrices. Higher rank means more trainable parameters and more expressive adapters, at the cost of VRAM and a higher risk of overfitting. Values of 8-64 are common.
  2. lora_alpha: a scaling factor for the adapter’s contribution (the update is scaled by alpha / r). Setting alpha equal to the rank, or twice the rank, is a common starting point.
  3. lora_dropout: dropout applied inside the adapter during training as light regularization. 0.05-0.1 is typical.
  4. target_modules: which layers get adapters. Targeting all the attention projections (q, k, v, o) plus the MLP layers usually beats attention-only.

QLoRA: When Even LoRA Is Too Chunky

LoRA drastically reduces the number of trainable parameters, but you still need to load the entire base model into VRAM for the forward pass. A 7B model in float16 is still about 14 GB. That eats most of your RTX 4090’s VRAM before training even starts.

QLoRA (Quantized LoRA) solves this by loading the base model in 4-bit quantization. That same 7B model now takes about 3.5-4 GB of VRAM. The LoRA adapters themselves are still trained in higher precision (typically bfloat16), but since they’re tiny, that’s not a problem.

QLoRA uses a technique called NormalFloat4 (NF4) quantization, which is specifically designed to be optimal for normally distributed weights (which neural network weights tend to be). It also uses double quantization — quantizing the quantization constants themselves — to squeeze out even more savings.
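If you're on plain Hugging Face transformers rather than Unsloth, this NF4 setup is exposed through bitsandbytes. A minimal config sketch (the model name matches the walkthrough below; treat this as a starting point, not gospel):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on top of 4-bit weights
)

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```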

The VRAM Math

Let’s do some rough napkin math for a 7B model:

Approach                  Model VRAM   Training Overhead   Total VRAM
Full fine-tune (fp32)     ~28 GB       ~56 GB              ~84 GB
Full fine-tune (fp16)     ~14 GB       ~28 GB              ~42 GB
LoRA (fp16 base)          ~14 GB       ~1-2 GB             ~16 GB
QLoRA (4-bit base)        ~4 GB        ~1-2 GB             ~6 GB

Six gigabytes. Your RTX 3060 with 12 GB of VRAM can handle that and still have room for KDE to eat some VRAM in the background. You can fine-tune a 7B model on a mid-range gaming GPU. Welcome to the future.
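The same napkin math, as a rough helper you can reuse for other model sizes (the multipliers are estimates matching the table above, not measurements):

```python
def estimate_vram_gb(params_billion, bytes_per_param, overhead_multiplier):
    """Rough VRAM estimate: weights plus training overhead.

    bytes_per_param: 4 for fp32, 2 for fp16/bf16, ~0.5 for 4-bit.
    overhead_multiplier: gradients + optimizer states relative to the
    weights (~2x for full Adam fine-tuning; near zero for LoRA/QLoRA,
    where only the tiny adapters are trained).
    """
    weights_gb = params_billion * bytes_per_param  # 1B params * N bytes ≈ N GB
    return weights_gb * (1 + overhead_multiplier)

print(estimate_vram_gb(7, 4, 2))      # full fp32: 84.0
print(estimate_vram_gb(7, 2, 2))      # full fp16: 42.0
print(estimate_vram_gb(7, 0.5, 0.6))  # QLoRA 4-bit: ~5.6
```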

For larger models, QLoRA really shines: a 13B model drops to roughly 8-10 GB, a 33B model fits on a single 24 GB card, and the original QLoRA paper fine-tuned a 65B model on a single 48 GB GPU, something that would otherwise take several hundred gigabytes of VRAM.

Preparing Your Dataset

This is the part people rush through, and it’s the part that matters most. Your model will only be as good as the data you train it on. Garbage in, garbage out — this cliché exists for a reason.

Format

The standard format for instruction fine-tuning is a collection of conversations or instruction-response pairs. The most common formats are:

Alpaca format:

{
  "instruction": "Summarize the following text in one sentence.",
  "input": "The quick brown fox jumped over the lazy dog while the farmer watched from the porch.",
  "output": "A fox jumped over a dog while a farmer observed."
}

ChatML / conversational format:

{
  "conversations": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"}
  ]
}

ShareGPT format (common in the community):

{
  "conversations": [
    {"from": "human", "value": "Explain quantum computing simply."},
    {"from": "gpt", "value": "Imagine regular computers use coins that are either heads or tails..."}
  ]
}

Most training frameworks can handle any of these formats with the right configuration. Pick one and be consistent.
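If your data arrives in ShareGPT shape but your trainer wants role/content messages, the conversion is mechanical. A small helper sketch (the role mapping here is an assumption; check your framework's docs for edge cases):

```python
def sharegpt_to_messages(record):
    """Convert a ShareGPT-style record to role/content messages."""
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    return [
        {"role": role_map[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]

record = {
    "conversations": [
        {"from": "human", "value": "Explain quantum computing simply."},
        {"from": "gpt", "value": "Imagine regular computers use coins..."},
    ]
}
messages = sharegpt_to_messages(record)
print(messages[0]["role"])   # user
print(messages[1]["role"])   # assistant
```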

Dataset Size

How much data do you need? It depends (the universal answer in machine learning), but here are some rough guidelines:

  1. Style and tone transfer: a few hundred examples can be enough.
  2. Consistent output formats (JSON schemas, templates): roughly 500-1,000 examples.
  3. Domain-specific behavior: typically a few thousand examples.
  4. Genuinely new knowledge: tens of thousands, and even then fine-tuning may be the wrong tool. For pure facts, reach for RAG first.

Quality matters far more than quantity. 200 carefully crafted, diverse, high-quality examples will outperform 10,000 noisy, repetitive, low-quality ones. Every. Single. Time.

Common Dataset Mistakes

  1. Too homogeneous — if every example is basically the same task with slightly different inputs, the model will overfit to that pattern and become useless at everything else.
  2. Too noisy — typos, contradictions, wrong answers, inconsistent formatting. The model will learn these mistakes too.
  3. Too short — if all your examples are one-liners, the model may struggle to generate longer outputs.
  4. No system prompts — if you want the model to follow system prompts at inference time, include them in your training data.
  5. Forgetting to shuffle — if your dataset is ordered by topic or difficulty, the model might “forget” earlier topics as it trains on later ones.
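A quick pre-flight script catches most of these before you burn GPU time. A sketch for Alpaca-style records (the field names are assumptions; adapt them to your format):

```python
import random

def preflight(examples, seed=42):
    """Flag empty and duplicate outputs, then shuffle in place."""
    problems = []
    seen = set()
    for i, ex in enumerate(examples):
        out = ex.get("output", "")
        if not out.strip():
            problems.append(f"example {i}: empty output")
        elif out in seen:
            problems.append(f"example {i}: duplicate output")
        seen.add(out)
    random.Random(seed).shuffle(examples)  # never train on topic-ordered data
    return problems

data = [
    {"instruction": "a", "output": "answer one"},
    {"instruction": "b", "output": ""},
    {"instruction": "c", "output": "answer one"},
]
print(preflight(data))  # flags the empty output and the duplicate
```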

The Hugging Face Ecosystem

If you’re doing open-source LLM work, Hugging Face is your home base. It’s like GitHub for ML models, datasets, and tools. Here’s what you’ll use:

  1. transformers: loading models and tokenizers, generation.
  2. peft: Parameter-Efficient Fine-Tuning, the library that implements LoRA.
  3. trl: training utilities, including the SFTTrainer used below.
  4. datasets: loading and transforming training data.
  5. bitsandbytes: the 4-bit and 8-bit quantization backend behind QLoRA.
  6. accelerate: device placement and distributed-training plumbing.

Install the core stack:

pip install torch transformers peft trl datasets bitsandbytes accelerate

Unsloth: The Speed Demon

Unsloth deserves its own section because it’s genuinely impressive. It’s an open-source library that optimizes the fine-tuning process to be 2-5x faster and use 50-70% less VRAM compared to standard Hugging Face training.

How? It rewrites the forward and backward passes of popular model architectures using custom Triton kernels, avoids unnecessary memory allocations, and fuses operations that the standard implementation runs separately. The result is that a training run that would take 4 hours on a 4090 might take 90 minutes with Unsloth, using less memory.

The best part: it’s mostly a drop-in replacement. You change a few imports and function calls and everything else stays the same.

pip install unsloth

Unsloth supports most popular architectures: Llama, Mistral, Phi, Gemma, Qwen, and more. If you’re fine-tuning on a single consumer GPU, there’s almost no reason not to use it.

Practical Walkthrough: Fine-Tuning with QLoRA and Unsloth

Alright, enough theory. Let’s actually fine-tune a model. We’ll use Llama 3.1 8B as our base, QLoRA for memory efficiency, and Unsloth for speed. This entire process can run on an RTX 3090 or 4090.

Step 1: Load the Model

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,           # Auto-detect (will use bf16 if supported)
    load_in_4bit=True,    # QLoRA: load base model in 4-bit
)

That load_in_4bit=True is doing a lot of heavy lifting. It’s invoking the entire QLoRA quantization pipeline under the hood.

Step 2: Add LoRA Adapters

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                     # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Saves even more VRAM
)

At this point, only about 1-2% of the model’s parameters are trainable. The rest are frozen in 4-bit quantization, sipping VRAM like a gentleman.

Step 3: Prepare Your Dataset

from datasets import load_dataset

# Load a dataset from Hugging Face Hub (or use your own)
dataset = load_dataset("your-username/your-dataset", split="train")

# Or load from a local JSON file
# dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

You’ll need to format your data into the chat template your model expects. For Llama 3.1 Instruct:

def format_chat(example):
    messages = example["conversations"]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat)

Step 4: Configure Training

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,    # Effective batch size = 8
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="./outputs",
        optim="adamw_8bit",              # 8-bit optimizer saves more VRAM
        seed=42,
    ),
)

Step 5: Train

trainer.train()

That’s it. Go get coffee. On a 4090 with Unsloth, a dataset of 1,000 examples at 2048 token max length will typically finish in 15-30 minutes. A 10,000-example dataset might take 1-3 hours. Your GPU fans will sound like a small aircraft, and that’s perfectly normal.

Step 6: Save and Test

# Save the LoRA adapter (small file, typically 20-80 MB)
model.save_pretrained("./my-fine-tuned-adapter")
tokenizer.save_pretrained("./my-fine-tuned-adapter")

# Test it
FastLanguageModel.for_inference(model)  # switch Unsloth into inference mode

messages = [
    {"role": "user", "content": "Your test prompt here"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,  # temperature has no effect without sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluating Your Fine-Tuned Model

Training is only half the battle. You need to know whether your fine-tuned model actually improved or if you just spent two hours teaching it to be confidently wrong.

Quick Sanity Checks

  1. Run your test prompts — create a set of 20-30 representative prompts and compare the outputs before and after fine-tuning. Do this qualitatively. Read the outputs. Use your human brain.
  2. Check for overfitting — if the model starts regurgitating training examples verbatim, you’ve overfit. Reduce epochs, increase dropout, or add more diverse data.
  3. Check for catastrophic forgetting — test the model on general tasks it could do before fine-tuning. If it’s forgotten how to do basic things, you may have trained too aggressively.
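One cheap way to quantify the regurgitation check is n-gram overlap between a model output and the training set. A sketch (whitespace tokens as a stand-in for real tokenization):

```python
def ngram_overlap(generated, training_texts, n=8):
    """Fraction of the generated text's word n-grams that appear verbatim
    in any training example. Values near 1.0 suggest memorization."""
    def ngrams(text):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    gen_grams = ngrams(generated)
    if not gen_grams:
        return 0.0
    train_grams = set()
    for t in training_texts:
        train_grams |= ngrams(t)
    return len(gen_grams & train_grams) / len(gen_grams)

train = ["the quick brown fox jumped over the lazy dog while the farmer watched"]
memorized = "the quick brown fox jumped over the lazy dog while the farmer watched"
novel = "foxes are small wild canines found on most continents around the world"
print(ngram_overlap(memorized, train))  # 1.0 -> likely regurgitation
print(ngram_overlap(novel, train))      # 0.0 -> fine
```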

Metrics

When you want numbers to go with the vibes:

  1. Held-out loss / perplexity: keep 5-10% of your data out of training and track loss on it. Eval loss rising while training loss falls is the classic overfitting signature.
  2. Task-specific metrics: exact-match or schema-validation rate for structured output, test-suite pass rate for code, and so on.
  3. LLM-as-judge: have a stronger model grade outputs side by side. Noisy, but it scales better than reading everything yourself.
  4. General benchmarks: a quick run on something like MMLU helps confirm you haven’t nuked broad capabilities.

Merging Adapters

Once you’re happy with your LoRA adapter, you might want to merge it back into the base model. This creates a standalone model that doesn’t need the adapter loaded separately at inference time.

# Merge LoRA into base model
merged_model = model.merge_and_unload()

# Save the full merged model
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")

With Unsloth, you can also export directly to GGUF format for use with llama.cpp and Ollama:

# Save as GGUF for llama.cpp / Ollama
model.save_pretrained_gguf(
    "./my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of quality and size
)

Now you can run your fine-tuned model locally with ollama run or llama-server and never think about Python again. Until next time.

When to Merge vs. Keep Separate

Merge when you’re deploying one model for one job: a single artifact is simpler to serve, and llama.cpp/Ollama workflows expect it. Keep the adapter separate when several fine-tunes share one base model (load the base once, swap adapters), when you’re still iterating on the adapter, or when you want the 20-80 MB adapter file to stay easy to version and share.
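If you keep the adapter separate, loading it at inference time with plain peft looks roughly like this (a sketch; the adapter path reuses the one from Step 6, and you'd normally add a quantization config as shown earlier):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model once...
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Meta-Llama-3.1-8B-Instruct")

# ...then attach the trained LoRA adapter on top.
model = PeftModel.from_pretrained(base, "./my-fine-tuned-adapter")
```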

Common Pitfalls and How to Avoid Them

Because learning from other people’s mistakes is cheaper than making your own.

1. Learning Rate Too High

If your training loss spikes or oscillates wildly, your learning rate is too high. For QLoRA, start with 2e-4 and work down. If you’re seeing instability, try 1e-4 or 5e-5.

2. Too Many Epochs

More is not always better. For small datasets (under 1,000 examples), 1-3 epochs is often enough. For larger datasets, 1 epoch might suffice. Watch your training loss — if it plateaus, stop. If it starts going up, you’ve gone too far.

3. Wrong Chat Template

Every model family has its own chat template (Llama uses <|start_header_id|> tokens, Mistral uses [INST] tokens, ChatML uses <|im_start|> tokens). Using the wrong template during training means the model won’t respond correctly to prompts formatted with the correct template at inference time. Always use tokenizer.apply_chat_template().

4. Sequence Length Mismatch

If your training examples are longer than your max_seq_length, they’ll be truncated silently. If they’re much shorter, you’re wasting compute on padding. Check your data distribution and set the sequence length accordingly.
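Checking the length distribution takes a few lines. This sketch uses a whitespace split so it runs standalone; in practice, pass your real tokenizer (e.g. lambda t: tokenizer(t)["input_ids"]):

```python
import statistics

def length_report(texts, max_seq_length, tokenize=str.split):
    """Summarize token lengths and count silently-truncated examples."""
    lengths = [len(tokenize(t)) for t in texts]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "truncated": sum(n > max_seq_length for n in lengths),
    }

texts = [
    "short example",
    "a much longer training example that runs on and on past the limit",
    "tiny",
]
print(length_report(texts, max_seq_length=8))
# If 'truncated' is nonzero, raise max_seq_length or trim those examples.
```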

5. Not Enough Diversity

If you fine-tune on 500 examples that are all slight variations of the same task, you’ll get a model that’s really good at that one task and noticeably worse at everything else. Include some general-purpose instruction-following data to maintain broad capabilities.

6. VRAM OOM During Training

If you hit an out-of-memory error mid-training:

  1. Reduce per_device_train_batch_size (and raise gradient_accumulation_steps to keep the effective batch size).
  2. Reduce max_seq_length if your data allows it.
  3. Make sure gradient checkpointing is enabled.
  4. Use an 8-bit or paged optimizer (adamw_8bit, paged_adamw_8bit).
  5. Lower the LoRA rank, or switch to a smaller base model.

7. Forgetting to Test Before and After

Always establish a baseline. Run your evaluation prompts on the base model before fine-tuning so you can actually measure whether your fine-tuning improved things. “It feels better” is not a metric.

Wrapping Up

Fine-tuning used to be something only Big Tech and well-funded startups could do. LoRA and QLoRA changed that equation completely. With a gaming GPU, some curated training data, and a free afternoon, you can create a model that’s specifically tuned to your needs.

The workflow is simpler than you think:

  1. Curate quality data in a standard format.
  2. Load a base model in 4-bit with QLoRA.
  3. Attach LoRA adapters to the attention and MLP layers.
  4. Train with SFTTrainer (use Unsloth for speed).
  5. Evaluate — qualitatively and quantitatively.
  6. Merge or export for deployment.

Is it magic? No. Is it accessible? Absolutely. The barrier to entry has dropped from “needs a GPU cluster” to “needs a decent gaming PC,” and the tools keep getting better.

Now go forth and fine-tune something. Just… maybe start with a small model and a small dataset. Your electricity bill will thank you.

