When a company decides to pursue fine-tuning, the first question is usually: “which GPU should we buy?” That question is asked too soon. Before getting to hardware and budget, it’s worth understanding what LoRA and QLoRA actually do and why they change the entire economics of this approach.
What LoRA and QLoRA actually do
Full fine-tuning modifies all model weights. For a 7B model, that’s 7 billion parameters, each stored as a floating-point number. Once gradients and optimizer state are counted, such training needs well over 40 GB of VRAM and is practically inaccessible without a GPU cluster.
LoRA (Low-Rank Adaptation) decomposes the weight update into two much smaller low-rank matrices. Instead of modifying the entire weight matrix W (e.g., 4096 × 4096), it trains two matrices A and B of size 4096 × r and r × 4096, where r is a hyperparameter between 4 and 64. During inference, the adapter is simply added to the original weights or loaded separately. The original model remains unchanged.
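To make the size difference concrete, here is a minimal arithmetic sketch based on the 4096 × 4096 example above. The function names are illustrative, and r = 16 is just one value picked from the range mentioned.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters touched when the whole weight matrix is trained."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors (d_out x r and r x d_in)."""
    return d_out * r + r * d_in


d, r = 4096, 16  # r = 16 is an assumed example value
full = full_update_params(d, d)
lora = lora_update_params(d, d, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints: full: 16,777,216  lora: 131,072  ratio: 0.78%
```

Under these assumptions, the adapter trains under 1% of the parameters of that single matrix, which is where most of the memory savings during training come from.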
QLoRA adds quantization: the original model is loaded in 4-bit precision (NF4), reducing its VRAM footprint fourfold. LoRA adapters are trained in 16-bit, so gradients are computed in higher precision. The resulting model is slightly slower during inference in its quantized version, but in many tasks, the quality difference compared to full fine-tuning is marginal.
The choice between LoRA and QLoRA is simple: if VRAM is the limiting factor, start with QLoRA.
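The "fourfold" figure follows directly from bits per parameter. A rough sketch, counting the weights only: it ignores activations, gradients, optimizer state, and the small overhead of quantization constants, so real usage is higher in both cases.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # a 7B model
bf16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit: {bf16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB, ratio: {bf16_gb / nf4_gb:.0f}x")
# prints: 16-bit: 14.0 GB, 4-bit: 3.5 GB, ratio: 4x
```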
How much VRAM you actually need
| Approach | 7B Model | 13B Model | 70B Model |
|---|---|---|---|
| Full fine-tuning | over 40 GB | over 80 GB | beyond single GPU reach |
| LoRA (bf16) | 18-24 GB | 32-40 GB | 140+ GB (A100 80 GB ×2) |
| QLoRA (4-bit NF4) | 8-12 GB | 14-18 GB | 48-56 GB (A100 ×1) |
| Typical consumer GPU | RTX 3090/4090 (24 GB) | RTX 4090 or A6000 | unavailable without cloud |
Numbers assume batch size 1-4 with gradient checkpointing. For comparison: an RTX 4090 with 24 GB VRAM comfortably handles QLoRA on models up to 13B and LoRA on 7B without quantization. Full fine-tuning of a 7B model does not fit in 24 GB on its own; it becomes realistic only when aggressive gradient checkpointing and small batch sizes are combined with offloading optimizer state to CPU memory, all of which extends training time.
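As a quick sanity check, the table can be turned into a lookup. This is only a sketch: it encodes the upper bounds of the LoRA and QLoRA rows for the 7B and 13B columns, and the names `VRAM_UPPER_GB` and `approaches_that_fit` are made up for illustration.

```python
# Upper VRAM estimates in GB, copied from the 7B and 13B columns of the table.
VRAM_UPPER_GB = {
    "7B": {"LoRA (bf16)": 24, "QLoRA (4-bit NF4)": 12},
    "13B": {"LoRA (bf16)": 40, "QLoRA (4-bit NF4)": 18},
}


def approaches_that_fit(model_size: str, vram_gb: float) -> list[str]:
    """Return the approaches whose upper VRAM estimate fits in vram_gb."""
    return [
        name
        for name, needed_gb in VRAM_UPPER_GB[model_size].items()
        if needed_gb <= vram_gb
    ]


print(approaches_that_fit("7B", 24))   # ['LoRA (bf16)', 'QLoRA (4-bit NF4)']
print(approaches_that_fit("13B", 24))  # ['QLoRA (4-bit NF4)']
```

Using the upper bound of each range is a deliberately conservative choice; longer sequences or larger batches can still push real usage past these estimates.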
A separate consideration is self-hosting the finished adapter. After training, a LoRA adapter is a file of a few dozen megabytes that is applied to the base model. For inference alone, you only need as much VRAM as the base model occupies, without training overhead.
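The "few dozen megabytes" figure can be checked with a back-of-the-envelope estimate. The sketch below assumes a 7B-class architecture with 32 layers and hidden size 4096, adapters on square hidden × hidden projection matrices, r = 16, and 16-bit storage. All of these are assumptions for illustration, not values from a specific training run.

```python
def adapter_size_mb(
    n_layers: int,
    hidden: int,
    r: int,
    matrices_per_layer: int,
    bytes_per_param: int = 2,  # 16-bit storage (assumed)
) -> float:
    """Approximate adapter file size in MB (1 MB = 1e6 bytes)."""
    params_per_matrix = 2 * hidden * r  # two factors: hidden x r and r x hidden
    total_params = params_per_matrix * matrices_per_layer * n_layers
    return total_params * bytes_per_param / 1e6


two_targets = adapter_size_mb(n_layers=32, hidden=4096, r=16, matrices_per_layer=2)
four_targets = adapter_size_mb(n_layers=32, hidden=4096, r=16, matrices_per_layer=4)
print(f"{two_targets:.1f} MB, {four_targets:.1f} MB")
# prints: 16.8 MB, 33.6 MB
```

Higher ranks or more adapted matrices per layer scale the file size linearly, but it stays tiny compared with the base model.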
Training data: quality over quantity
The most common mistake is collecting a thousand mediocre examples instead of three hundred precise ones. A few proven principles:
Minimum thresholds. Below 200 input-output pairs, fine-tuning rarely yields stable results. A range of 300-800 high-quality examples is sufficient for narrow tasks (classification, extraction, generation in a fixed template). For broader behavioral changes: 1000 or more.
Hold-out from the start. Set aside 10-20% of data as a test set before training begins. Never use these pairs for training or hyperparameter selection. This is the only metric you can trust when deciding on deployment.
Consistency over diversity. If you’re training a model to generate reports in a specific format, every example should adhere to that format. A few exceptions in the training data can “teach” the model that the format is optional.
Personal data. If training pairs contain PII, the GDPR applies and a DPIA is likely required. After training, removing specific data from weights is technically difficult, complicating the right to be forgotten in a way that RAG avoids. We recommend anonymization before training or an architecture where training data never leaves your infrastructure.
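The hold-out principle above can be sketched as a one-time, seeded split. The function name, the 15% fraction (picked from the 10-20% range), and the seed are illustrative choices, and the example pairs are synthetic placeholders.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_holdout(
    pairs: Sequence[T], holdout_fraction: float = 0.15, seed: int = 42
) -> tuple[list[T], list[T]]:
    """Split pairs into (train, holdout) with a fixed seed for reproducibility."""
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)  # local RNG, does not touch global state
    n_holdout = max(1, round(len(pairs) * holdout_fraction))
    holdout_idx = set(indices[:n_holdout])
    train = [p for i, p in enumerate(pairs) if i not in holdout_idx]
    holdout = [p for i, p in enumerate(pairs) if i in holdout_idx]
    return train, holdout


# Synthetic placeholder data standing in for real input-output pairs.
pairs = [(f"input {i}", f"output {i}") for i in range(500)]
train, holdout = split_holdout(pairs)
print(len(train), len(holdout))  # 425 75
```

In practice it is safer to run the split once, write the hold-out set to its own file, and never regenerate it, so that later data additions cannot leak test examples into training.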
Workflow: from data to deployed adapter
Below is the pattern we use at Cashcrown for production deployments.
1. Baseline evaluation. Before training, measure how the base model performs on your task using the hold-out set. This is your reference point—without it, you won’t know if fine-tuning improved anything.
2. Data preparation. Standardized format (e.g., system instruction + input + output in Alpaca or ChatML format), PII anonymization, label consistency verification by at least two people.
3. LoRA or QLoRA training. Popular stack: transformers + peft + bitsandbytes (for QLoRA) + trl (trainer). Key hyperparameters: rank r (start with 16), alpha (typically 2×r), learning rate (1e-4 to 3e-4), number of epochs (3-5 for small datasets). Log every run with date, base model, and data checksum.
4. Hold-out evaluation. For classification: F1 per class, confusion matrix. For generation: ROUGE-L, BERTScore, and if possible, human evaluation of a 50-100 sample. Compare with the baseline from step 1.
5. Human decision. This isn’t an automated step. Someone responsible for the product reviews the results and decides whether the adapter moves to deployment. For high-risk systems (AI Act, Annex III), this step requires documentation.
6. Adapter deployment. A LoRA adapter can be served via Ollama (GGUF + adapter), vLLM (native PEFT), or as a separate container with the base model and loaded adapter. After deployment, monitor drift. If the query distribution changes significantly compared to training data, hold-out metrics become unreliable.
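The logging advice in step 3 can be sketched as a small run record. This is a minimal example using only the standard library; the record layout, the function names, and the base model identifier `example-org/example-7b` are hypothetical, and the hyperparameter defaults are simply the starting values listed above.

```python
import hashlib
import json
from datetime import date


def data_checksum(raw: bytes) -> str:
    """SHA-256 of the training data, so a run can be tied to an exact dataset."""
    return hashlib.sha256(raw).hexdigest()


def build_run_record(
    base_model: str,
    training_data: bytes,
    rank: int = 16,
    learning_rate: float = 2e-4,
    epochs: int = 3,
) -> dict:
    """Collect date, base model, data checksum, and hyperparameters for one run."""
    return {
        "date": date.today().isoformat(),
        "base_model": base_model,
        "data_sha256": data_checksum(training_data),
        "hyperparameters": {
            "r": rank,
            "alpha": 2 * rank,  # the "typically 2 x r" rule of thumb
            "learning_rate": learning_rate,
            "epochs": epochs,
        },
    }


# Placeholder bytes standing in for the contents of a real training file.
example_data = b'{"input": "example", "output": "example"}\n'
record = build_run_record("example-org/example-7b", example_data)
print(json.dumps(record, indent=2))
```

Storing this record next to the adapter file makes it possible to tell later which base model and which dataset version a given adapter belongs to.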
Realistic cost and time ranges
TCO of fine-tuning isn’t just GPU time.
| Cost Item | Range (7B model, ~500 pairs) | Notes |
|---|---|---|
| Data preparation | 15-40 engineering hours | quality verification, PII anonymization |
| QLoRA training | 1-4 GPU hours (RTX 4090 locally) or $5-20 in cloud | depends on sequence length and epochs |
| Evaluation and iterations | 10-25 hours | 2-4 hyperparameter rounds, human evaluation |
| Deployment and monitoring | 5-15 hours | CI for adapter, alert threshold on metrics |
| Maintenance (quarterly) | 5-10 hours | retraining after drift, new data |
Full deployment from scratch to production rarely takes less than 3-4 weeks. Projects with clean, ready data can take as little as 2 weeks. Projects requiring data preparation from scratch exceed 6 weeks.
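To see how the items in the table add up, here is a rough sketch that sums the one-off ranges. The hour ranges and the cloud training cost come from the table; the hourly rate of 100 USD is a purely hypothetical figure to be replaced with your own.

```python
# (low, high) engineering hours for one-off items, copied from the table.
ONE_OFF_HOURS = {
    "data_preparation": (15, 40),
    "evaluation_and_iterations": (10, 25),
    "deployment_and_monitoring": (5, 15),
}
CLOUD_TRAINING_USD = (5, 20)  # QLoRA training in the cloud, from the table


def total_range(ranges) -> tuple[int, int]:
    """Sum the low ends and the high ends of several (low, high) ranges."""
    lows, highs = zip(*ranges)
    return sum(lows), sum(highs)


def one_off_cost_usd(hourly_rate_usd: float) -> tuple[float, float]:
    """One-off cost range: engineering time plus cloud GPU time."""
    low_h, high_h = total_range(ONE_OFF_HOURS.values())
    return (
        low_h * hourly_rate_usd + CLOUD_TRAINING_USD[0],
        high_h * hourly_rate_usd + CLOUD_TRAINING_USD[1],
    )


print(total_range(ONE_OFF_HOURS.values()))  # (30, 80)
print(one_off_cost_usd(100))                # (3005, 8020), at the assumed rate
```

At that assumed rate, GPU time is well under 1% of the one-off cost; quarterly maintenance (5-10 hours) comes on top of this.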
Compare this with migrating from API to your own model: the break-even calculation is similar and worth doing before committing resources.
What fine-tuning doesn’t solve
Fine-tuning changes how the model behaves, not what the model knows. This distinction, made in the article on when fine-tuning makes sense, is worth repeating in the context of LoRA and QLoRA.
An adapter trained on 2024 examples won’t know 2026 regulations. A model after fine-tuning still hallucinates facts not present in its weights. That’s a job for RAG, not an adapter. The best production architectures we see combine light fine-tuning (style, format) with RAG (fresh facts per call). Details of this pattern are in the article RAG vs fine-tuning.
Fine-tuning isn’t a security mechanism either. An adapter won’t replace guardrails, doesn’t eliminate prompt injection, and doesn’t reliably restrict the model to a specific domain. Security is built in layers, outside the model weights. More on hardware and local environments in the article local LLM: what hardware and GPU.
FAQ
What’s the practical difference between LoRA and QLoRA?
LoRA trains small adapter matrices while keeping the base model in full precision (bf16 or fp16). QLoRA additionally quantizes the base model to 4-bit NF4 before training, reducing VRAM usage by another 50-60%. The quality of the resulting adapter is similar for most classification and generation tasks; differences appear with very long sequences or high numerical precision requirements.
How many training examples do I need for LoRA?
For a narrow task (classification, extraction, generation in a fixed template), a realistic minimum is 300-500 high-quality pairs with a 10-20% hold-out set. Below 200 pairs, the risk of unstable training or overfitting is high. For broader behavioral changes (tone adjustment, handling multiple intents), you need 1000 pairs or more. Label quality matters more than quantity.
Can I run QLoRA on a standard company laptop?
On a laptop without a dedicated GPU: no. QLoRA on a 7B model requires a graphics card with at least 12 GB VRAM (e.g., RTX 3080 12 GB, 3090, 4080, or 4090) or access to a cloud instance. Training on a laptop with integrated graphics is technically possible on CPU, but takes days instead of hours and is usually impractical. Cloud alternatives include RunPod, Lambda Labs, and Google Colab Pro.
How do I deploy a LoRA adapter in production?
A LoRA adapter is a file of a few dozen megabytes (a set of matrices) that is applied to the base model. In practice: convert the adapter to GGUF and load it with llama.cpp’s --lora option, or use vLLM with native PEFT adapter support. Some frameworks let you load the base model once and swap adapters without restarting. Versioning is necessary for both the adapter and the base model it was trained on, because a version mismatch leads to hard-to-debug errors.
When does LoRA fine-tuning not make sense?
When the problem is access to up-to-date factual knowledge: this is a job for RAG, not fine-tuning. When you have fewer than 200 training pairs or their quality is low. When you update the knowledge base more often than once a quarter, as each change would require retraining. When you lack resources for evaluation and for maintaining multiple adapter versions. In these cases, AI model selection and RAG are cheaper and faster to start with.