LoRA and QLoRA fine-tuning in practice — when, how, and how…

LoRA and QLoRA fine-tuning in practice — when, how, and how much

When a company decides to pursue fine-tuning, the first question is usually: “which GPU should we buy?” That question is asked too soon. Before getting to hardware and budget, it’s worth understanding what LoRA and QLoRA actually do and why they change the entire economics of this approach.

What LoRA and QLoRA actually do

Full fine-tuning modifies all model weights. For a 7B model, that’s tens of billions of parameters, each stored as a floating-point number. Such training requires at least 40 GB VRAM and is practically inaccessible without a GPU cluster.

LoRA (Low-Rank Adaptation) decomposes the weight update into two much smaller low-rank matrices. Instead of modifying the entire weight matrix W (e.g., 4096 × 4096), it trains two matrices A and B of size 4096 × r and r × 4096, where r is a hyperparameter between 4 and 64. During inference, the adapter is simply added to the original weights or loaded separately. The original model remains unchanged.

QLoRA adds quantization: the original model is loaded in 4-bit precision (NF4), reducing its VRAM footprint fourfold. LoRA adapters are trained in 16-bit, so gradients are computed in higher precision. The resulting model is slightly slower during inference in its quantized version, but in many tasks, the quality difference compared to full fine-tuning is marginal.

The boundary between LoRA and QLoRA is simple: if VRAM is a limiting factor, start with QLoRA.

How much VRAM you actually need

Approach	7B Model	13B Model	70B Model
Full fine-tuning	over 40 GB	over 80 GB	beyond single GPU reach
LoRA (bf16)	18-24 GB	32-40 GB	80+ GB (A100 ×2)
QLoRA (4-bit NF4)	8-12 GB	14-18 GB	48-56 GB (A100 ×1)
Typical consumer GPU	RTX 3090/4090 (24 GB)	RTX 4090 or A6000	unavailable without cloud

Numbers assume batch size 1-4 with gradient checkpointing. For comparison: an RTX 4090 with 24 GB VRAM comfortably handles QLoRA on models up to 13B, LoRA on 7B without quantization, and full fine-tuning on 7B only with aggressive gradient checkpointing and small batch sizes, which extends training time.

A separate consideration is self-hosting the finished adapter. After training, a LoRA adapter is a file of a few dozen megabytes that is applied to the base model. For inference alone, you only need as much VRAM as the base model occupies, without training overhead.

Training data: quality over quantity

The most common mistake is collecting a thousand mediocre examples instead of three hundred precise ones. A few proven principles:

Minimum thresholds. Below 200 input-output pairs, fine-tuning rarely yields stable results. A range of 300-800 high-quality examples is sufficient for narrow tasks (classification, extraction, generation in a fixed template). For broader behavioral changes: 1000 or more.

Hold-out from the start. Set aside 10-20% of data as a test set before training begins. Never use these pairs for training or hyperparameter selection. This is the only metric you can trust when deciding on deployment.

Consistency over diversity. If you’re training a model to generate reports in a specific format, every example should adhere to that format. A few exceptions in the training data can “teach” the model that the format is optional.

Personal data. If training pairs contain PII, RODO and likely a DPIA apply. After training, removing specific data from weights is technically difficult, complicating the right to be forgotten in a way that RAG avoids. We recommend anonymization before training or an architecture where training data never leaves your infrastructure.

▶Is my data suitable for LoRA fine-tuning?sandbox · reasoning

Workflow: from data to deployed adapter

Below is the pattern we use at Cashcrown for production deployments.

1. Baseline evaluation. Before training, measure how the base model performs on your task using the hold-out set. This is your reference point—without it, you won’t know if fine-tuning improved anything.

2. Data preparation. Standardized format (e.g., system instruction + input + output in Alpaca or ChatML format), PII anonymization, label consistency verification by at least two people.

3. LoRA or QLoRA training. Popular stack: transformers + peft + bitsandbytes (for QLoRA) + trl (trainer). Key hyperparameters: rank r (start with 16), alpha (typically 2×r), learning rate (1e-4 to 3e-4), number of epochs (3-5 for small datasets). Log every run with date, base model, and data checksum.

4. Hold-out evaluation. For classification: F1 per class, confusion matrix. For generation: ROUGE-L, BERTScore, and if possible, human evaluation of a 50-100 sample. Compare with the baseline from step 1.

5. Human decision. This isn’t an automated step. Someone responsible for the product reviews the results and decides whether the adapter moves to deployment. For high-risk systems (AI Act, Annex III), this step requires documentation.

6. Adapter deployment. A LoRA adapter can be served via Ollama (GGUF + adapter), vLLM (native PEFT), or as a separate container with the base model and loaded adapter. After deployment, monitor drift. If the query distribution changes significantly compared to training data, hold-out metrics become unreliable.

Realistic cost and time ranges

TCO of fine-tuning isn’t just GPU time.

Cost Item	Range (7B model, ~500 pairs)	Notes
Data preparation	15-40 engineering hours	quality verification, PII anonymization
QLoRA training	1-4 GPU hours (RTX 4090 locally) or $5-20 in cloud	depends on sequence length and epochs
Evaluation and iterations	10-25 hours	2-4 hyperparameter rounds, human evaluation
Deployment and monitoring	5-15 hours	CI for adapter, alert threshold on metrics
Maintenance (quarterly)	5-10 hours	retraining after drift, new data

Full deployment from scratch to production rarely takes less than 3-4 weeks. Projects with clean, ready data can take as little as 2 weeks. Projects requiring data preparation from scratch exceed 6 weeks.

Compare this with migrating from API to your own model: the break-even calculation is similar and worth doing before committing resources.

What fine-tuning doesn’t solve

Fine-tuning changes how the model behaves, not what the model knows. This distinction from the article when fine-tuning makes sense is worth repeating in the context of LoRA and QLoRA.

An adapter trained on 2024 examples won’t know 2026 regulations. A model after fine-tuning still hallucinates facts not present in its weights. That’s a job for RAG, not an adapter. The best production architectures we see combine light fine-tuning (style, format) with RAG (fresh facts per call). Details of this pattern are in the article RAG vs fine-tuning.

Fine-tuning isn’t a security mechanism either. An adapter won’t replace guardrails, doesn’t eliminate prompt injection, and doesn’t reliably restrict the model to a specific domain. Security is built in layers, outside the model weights. More on hardware and local environments in the article local LLM: what hardware and GPU.

FAQ

What’s the practical difference between LoRA and QLoRA?

LoRA trains small adapter matrices while keeping the base model in full precision (bf16 or fp16). QLoRA additionally quantizes the base model to 4-bit NF4 before training, reducing VRAM usage by another 50-60%. The quality of the resulting adapter is similar for most classification and generation tasks; differences appear with very long sequences or high numerical precision requirements.

How many training examples do I need for LoRA?

For a narrow task (classification, extraction, generation in a fixed template), a realistic minimum is 300-500 high-quality pairs with a 10-20% hold-out set. Below 200 pairs, the risk of unstable training or overfitting is high. For broader behavioral changes (tone adjustment, handling multiple intents), you need 1000 pairs or more. Label quality matters more than quantity.

Can I run QLoRA on a standard company laptop?

On a laptop without a dedicated GPU: no. QLoRA on a 7B model requires a graphics card with at least 12 GB VRAM (e.g., RTX 3080/3090/4080/4090) or access to a cloud instance. Training on a laptop with integrated graphics is technically possible on CPU, but takes days instead of hours and is usually impractical. Alternatives include cloud computing: RunPod, Lambda Labs, Google Colab Pro.

How do I deploy a LoRA adapter in production?

A LoRA adapter is a file of a few dozen megabytes (a set of matrices) that is applied to the base model. In practice: convert to GGUF format with embedded adapter (llama.cpp + --lora) or use vLLM with native PEFT support. Load the base model once, swap adapters without restarting in some frameworks. Versioning is necessary for both the adapter and the base model it was trained on—version mismatch leads to hard-to-debug errors.

When does LoRA fine-tuning not make sense?

When the problem is access to up-to-date factual knowledge—this is a job for RAG, not fine-tuning. When you have fewer than 200 training pairs or their quality is low. When you update the knowledge base more often than once a quarter, as each change would require retraining. When you lack resources for evaluation and maintaining multiple adapter versions. In these cases, AI model selection and RAG are cheaper and faster to start with.