A law firm in Warsaw wanted to deploy an assistant for contract analysis. The first question was technical: is a server with an RTX 3090 enough? The second was more to the point: what model does this task actually need? The answer to the first depends on the answer to the second. Without model requirements, the hardware decision is a gamble.
Below, I describe how to connect these dots—from model requirements through hardware to a real cost breakdown for Polish companies considering self-hosting LLMs.
Why VRAM, Not CPU or RAM?
Local LLMs can run on CPU (e.g., via llama.cpp), but generation speed on CPU alone is typically 2–8 tokens per second for a 7B model. For production applications where users wait for responses, this is too slow: users generally find output comfortable to follow from about 20–30 tokens per second.
A GPU accelerates inference through parallel matrix multiplication, the operation that dominates LLM computation. However, the key factors aren’t raw compute (TFLOPS) but two other values:
VRAM (GPU memory) determines whether the model loads at all. If the model doesn’t fit in VRAM, it’s offloaded to RAM or disk, and speed drops by an order of magnitude. The first question when choosing a GPU is always: will the entire model fit in VRAM?
Memory bandwidth determines how quickly the GPU reads model weights during generation. Generating each token requires reading the entire model from memory. A GPU with 800 GB/s bandwidth generates tokens faster than one with 400 GB/s, even with the same number of cores. When comparing GPUs, check bandwidth, not just TFLOPS.
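The bandwidth argument can be turned into a rough ceiling: if every token requires reading all weights once, single-stream speed cannot exceed bandwidth divided by model size. The sketch below assumes a dense model and ignores compute, KV-cache reads, and kernel overhead, so real numbers will be lower; the function name is illustrative.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumes each generated token reads all weights from VRAM exactly once,
    so the ceiling is memory bandwidth divided by model size.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# 7B model in Q4_K_M: about 4.1 GB of weights (value from the table in this article)
fast = max_tokens_per_second(800, 4.1)
slow = max_tokens_per_second(400, 4.1)
print(f"800 GB/s: at most {fast:.0f} tok/s; 400 GB/s: at most {slow:.0f} tok/s")
```

Doubling bandwidth doubles the ceiling regardless of core count, which is why bandwidth is the number to compare first.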
How Much VRAM Does Each Model Need?
VRAM requirements depend on model size and quantization. Quantization reduces weight precision from 16 bits to 8, 4, or even 3 bits, shrinking model size with an acceptable quality drop.
| Model | Q4_K_M (VRAM) | Q8 (VRAM) | BF16 (VRAM) |
|---|---|---|---|
| 7B (e.g., Mistral 7B) | 4.1 GB | 7.7 GB | 14 GB |
| 13B | 7.9 GB | 14 GB | 26 GB |
| 34B | 20 GB | 34 GB | 68 GB |
| 70B | 40 GB | 70 GB | 140 GB |
| 8×7B MoE (e.g., Mixtral) | 26 GB | 47 GB | 93 GB |
Values are approximate; actual requirements vary with context window length and implementation. Add a buffer for the KV cache and activations: with an 8K context, that’s an extra 1–2 GB; with 32K, it’s 4–8 GB. Large contexts quickly consume VRAM.
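The table and the context buffer can be approximated with two small formulas. This is a sketch under stated assumptions: 4.7 bits per weight is an assumed effective rate for Q4_K_M chosen to match the 7B row above, the KV cache is stored in 16-bit values, and the architecture numbers (32 layers, 8 KV heads, head dimension 128) are those published for Mistral 7B. Activations and framework overhead come on top.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer per cached token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Assumed effective rate for Q4_K_M; 16 bits for BF16
print(f"7B Q4_K_M weights: {weights_gb(7, 4.7):.1f} GB")
print(f"7B BF16 weights:   {weights_gb(7, 16):.1f} GB")

# Mistral 7B uses grouped-query attention: 32 layers, 8 KV heads, head_dim 128
for ctx in (8192, 32768):
    print(f"KV cache at {ctx} tokens: {kv_cache_gb(32, 8, 128, ctx):.2f} GB")
```

Models without grouped-query attention keep one KV head per attention head, so their cache grows several times faster for the same context length.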
GPU Overview for Local LLMs
The 2026 GPU market can be divided into three classes for LLM use cases:
Consumer GPUs (RTX 30xx/40xx) are affordable but have limited VRAM (up to 24 GB in RTX 3090/4090). The RTX 4090’s bandwidth is 1,008 GB/s, which makes it the faster of the two 24 GB cards for LLMs (the RTX 3090 reaches 936 GB/s). Limitation: the RTX 4090 has no NVLink connector (NVLink on consumer cards ended with the RTX 3090 generation), so multi-GPU setups with it communicate over PCIe. A new RTX 4090 costs 8,000–10,000 PLN.
Professional GPUs (RTX A-series, L40S, RTX 6000 Ada) offer more VRAM (48–80 GB), ECC, NVLink support on some models, and drivers optimized for uptime. The RTX 6000 Ada (48 GB VRAM, ~24,000 PLN) can run a 34B model in Q8 (about 34 GB) or a 70B model in Q4. The L40S (48 GB) is its server-grade equivalent.
Datacenter GPUs (H100, A100, H200) offer the most memory (80–141 GB) and highest bandwidth (HBM3, ~3.35 TB/s for H100 SXM), but prices start at roughly 100,000 PLN for an A100 80G and are higher for newer cards, so they are typically available only in the cloud or via leasing.
| GPU | VRAM | Bandwidth | Estimated Price | Max Model (Q4) |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 1,008 GB/s | ~9,000 PLN | 34B (short context) |
| RTX 3090 | 24 GB | 936 GB/s | ~5,000 PLN (used) | 13B comfortably |
| RTX 6000 Ada | 48 GB | 960 GB/s | ~24,000 PLN | 70B (Q4) |
| 2× RTX 4090 | 48 GB (total) | 2×1,008 GB/s | ~18,000 PLN | 70B (Q4) |
| L40S | 48 GB | 864 GB/s | ~40,000 PLN | 70B (Q4) |
| A100 80G | 80 GB | 2,000 GB/s | ~100,000+ PLN | 70B (Q8) |
For Polish companies looking to run 7B–13B models for RAG and classification tasks, the RTX 4090 or a used RTX 3090 offers the best price-to-performance ratio. For 70B models, you’ll need either two 24 GB GPUs (NVLink or PCIe) or a single 48 GB+ GPU.
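The fit check behind the "Max Model" column can be sketched as a lookup. The sizes and VRAM figures are copied from the tables in this article; the 2 GB default context buffer and the function name are assumptions for illustration.

```python
# Q4_K_M weight sizes (GB) and VRAM per card (GB), from the tables in this article
MODEL_Q4_GB = {"7B": 4.1, "13B": 7.9, "34B": 20, "70B": 40}
GPU_VRAM_GB = {
    "RTX 4090": 24,
    "RTX 3090": 24,
    "RTX 6000 Ada": 48,
    "L40S": 48,
    "A100 80G": 80,
}


def gpus_that_fit(model: str, context_buffer_gb: float = 2.0) -> list[str]:
    """Return single GPUs whose VRAM covers the Q4 weights plus a context buffer."""
    need = MODEL_Q4_GB[model] + context_buffer_gb
    return [gpu for gpu, vram in GPU_VRAM_GB.items() if vram >= need]


print("70B Q4, 2 GB buffer: ", gpus_that_fit("70B"))
print("70B Q4, 10 GB buffer:", gpus_that_fit("70B", context_buffer_gb=10))
print("13B Q4, 2 GB buffer: ", gpus_that_fit("13B"))
```

Raising the buffer shows how quickly a long context pushes a 70B model off the 48 GB cards.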
Multi-GPU: When to Combine GPUs?
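Two GPUs can be combined in two ways:
NVLink (only select NVIDIA GPUs; among consumer cards the RTX 3090 has the connector, the RTX 4090 does not) adds a direct high-speed link between the two cards. It doesn’t literally merge their memory, but inference frameworks can split the model across both GPUs and exchange data over the link, so a 70B Q4 model on 2×24 GB behaves much like it has 48 GB available. Link bandwidth is ~112 GB/s on the RTX 3090 bridge and up to ~600 GB/s on datacenter cards such as the A100, which keeps communication from becoming a bottleneck.
PCIe (standard multi-GPU setups) offers no direct link, so data exchanged between GPUs crosses the PCIe bus (~16 GB/s for PCIe 3.0 x16, ~32 GB/s for PCIe 4.0 x16). A large model can still be split by layers across the cards (llama.cpp and vLLM both support this), because only activations cross the bus between layers, but splits that synchronize inside every layer (tensor parallelism) slow down noticeably. PCIe setups are most useful for throughput: multiple parallel queries served by different GPUs.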
For production systems handling multiple parallel queries, a 2–4 GPU PCIe setup with the same model can increase the number of queries per second without scaling VRAM. Each GPU runs its own model instance; an LLM router distributes traffic.
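The routing idea, one model instance per GPU with traffic spread across them, can be sketched as a round-robin selector. The backend URLs and the class name are hypothetical; a production router would also track health and queue depth.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hands out backends in rotation, one per incoming request."""

    def __init__(self, backends: list[str]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = cycle(backends)

    def next_backend(self) -> str:
        return next(self._backends)


# Hypothetical addresses: one inference server pinned to each GPU
router = RoundRobinRouter(["http://127.0.0.1:8001", "http://127.0.0.1:8002"])
for request_id in range(4):
    print(request_id, "->", router.next_backend())
```

Because each instance holds a full copy of the model, this scales queries per second but not the largest model that fits.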
CPU and RAM: Their Role in Local LLM Systems
CPU and RAM play secondary but not marginal roles.
System RAM is needed to load the model into VRAM (the model is first loaded into RAM, then copied to the GPU) and to handle orchestration layers: API servers, RAG pipelines, and query preprocessing. The minimum is twice the VRAM; practically, 64 GB for single-GPU setups and 128 GB for multi-GPU.
CPU is a bottleneck only in CPU-only mode (no GPU) or during intensive preprocessing (tokenizing large documents, embedding on CPU). For API servers with GPU-accelerated inference, any modern server-grade (e.g., AMD EPYC, Intel Xeon) or desktop (AMD Ryzen 9, Intel Core i9) processor suffices. Core count matters for API parallelism, not generation speed.
NVMe SSD speeds up model loading at server startup. A 7B model is a ~4 GB file; 70B models are ~40 GB. Loading from an HDD can take minutes; from NVMe, it’s seconds. For systems with multiple dynamically switched models (like our OpenClaw router), fast storage reduces readiness time.
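The minutes-versus-seconds difference follows from file size divided by sequential read speed. The throughput figures below (150 MB/s for an HDD, 3,000 MB/s for NVMe) are assumed round numbers, not measurements; check your own drives.

```python
def load_seconds(file_gb: float, read_mb_per_s: float) -> float:
    """Time to read a model file sequentially (1 GB = 1000 MB)."""
    return file_gb * 1000 / read_mb_per_s


# Assumed sequential read speeds in MB/s
DRIVES = {"HDD": 150, "NVMe": 3000}

for name, speed in DRIVES.items():
    print(f"70B Q4 (40 GB) from {name}: {load_seconds(40, speed):.0f} s")
```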
Quantization: How to Reduce VRAM Without Losing Quality
Fine-tuning and quantization are two ways to optimize models for hardware. Quantization is the simpler path for most deployments.
The most popular format in 2026 is GGUF with Q4_K_M or Q5_K_M quantization (supported by llama.cpp, Ollama, LM Studio). Q4_K_M reduces VRAM by ~70% compared to BF16 (4.1 GB instead of 14 GB for a 7B model) with a 1–3% quality loss on general benchmarks. For specialized tasks (law, finance, medicine), test empirically: degradation can be higher in niche domains than general benchmarks suggest.
GPTQ and AWQ are 4-bit quantization methods aimed at NVIDIA GPUs, served through optimized CUDA kernels (for example in vLLM). They are often faster than GGUF at the same VRAM footprint but need more setup: a matching CUDA toolchain and a compatible serving stack. Useful for production NVIDIA servers.
bitsandbytes 4-bit (QLoRA) is used for fine-tuning on GPUs with limited VRAM, not for serving. Don’t confuse it with inference formats.
Practical rule: Start with Q4_K_M, test quality on your question set, and upgrade to Q5 or Q8 only if you measure a tangible drop in accuracy. For 7B models, the difference between Q4 and Q8 is usually minimal for business tasks.
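The practical rule can be written as a small selection function: try the smallest quantization first and move up only when the measured drop exceeds a tolerance. The accuracy figures and the 2-point tolerance below are made-up illustration values; in practice they come from running your own question set.

```python
def pick_quantization(accuracy: dict[str, float],
                      reference: str = "Q8",
                      max_drop: float = 0.02,
                      order: tuple[str, ...] = ("Q4_K_M", "Q5_K_M", "Q8")) -> str:
    """Return the smallest quantization whose accuracy is within max_drop of the reference."""
    baseline = accuracy[reference]
    for quant in order:  # ordered from smallest to largest VRAM footprint
        if baseline - accuracy[quant] <= max_drop:
            return quant
    return reference


# Hypothetical accuracy measured on an in-house test set
measured = {"Q4_K_M": 0.91, "Q5_K_M": 0.93, "Q8": 0.94}
print(pick_quantization(measured))
```

With these sample numbers Q4_K_M loses three points against Q8, so the function moves up one step; with a one-point loss it would stay on Q4_K_M.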
Local LLM Costs vs. Cloud: When Does Self-Hosting Pay Off?
The decision to deploy a local LLM is financial, not just technical. Here’s a comparison for a typical Polish deployment:
Production setup for a 13B model: server with RTX 4090 (9,000 PLN), 128 GB RAM (3,000 PLN), AMD Ryzen 9 + motherboard + PSU (5,000 PLN), 2 TB NVMe (800 PLN). Total hardware cost: ~18,000–22,000 PLN upfront. Add configuration and administration time (typically 2–4 days of engineering work to start, 1–2 hours weekly for maintenance).
RTX 4090 power consumption is ~350–400 W under full load. At Polish electricity rates (~0.80 PLN/kWh) and 8 hours of daily use, that’s 84–96 kWh, or ~67–77 PLN per month.
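The electricity estimate is simple enough to recompute for your own tariff and duty cycle. The sketch assumes a 30-day month and counts only the GPU, not the rest of the server.

```python
def monthly_energy_cost_pln(power_watts: float, hours_per_day: float,
                            price_pln_per_kwh: float, days: int = 30) -> float:
    """Monthly electricity cost for a device running at a constant power draw."""
    kwh = power_watts / 1000 * hours_per_day * days
    return kwh * price_pln_per_kwh


for watts in (350, 400):
    cost = monthly_energy_cost_pln(watts, hours_per_day=8, price_pln_per_kwh=0.80)
    print(f"{watts} W, 8 h/day: {cost:.2f} PLN per month")
```

Running the card around the clock triples the figure, which still leaves electricity a minor line item next to the hardware.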
Cloud API comparison: For 100,000 short queries per month (input ~200 tokens, output ~300 tokens), the cost of a GPT-4o-class model API is ~1,500–3,000 PLN monthly. A local 13B-class LLM handles the same volume for just the electricity cost (a few dozen PLN) after the hardware investment is recouped.
Break-even point for local LLMs vs. cloud APIs: At 100,000 queries per month, hardware pays for itself in roughly 6–15 months (18,000–22,000 PLN of hardware against 1,500–3,000 PLN of monthly API spend). Below 20,000 queries per month, cloud APIs are usually cheaper when accounting for administration time. Generate a precise cost estimate for your scope with the inference cost calculator.
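The break-even arithmetic divides the upfront hardware cost by the monthly saving. The sketch plugs in the ranges quoted in this section plus an assumed 70 PLN of monthly electricity; it leaves out administration time, so the result is sensitive to which end of each range you pick and to how you price engineering hours.

```python
def break_even_months(hardware_pln: float, monthly_api_pln: float,
                      monthly_local_pln: float) -> float:
    """Months until cumulative API spend exceeds hardware plus local running costs."""
    saving = monthly_api_pln - monthly_local_pln
    if saving <= 0:
        return float("inf")  # local never pays off at this volume
    return hardware_pln / saving


ELECTRICITY_PLN = 70  # assumed monthly running cost

best = break_even_months(18_000, 3_000, ELECTRICITY_PLN)
worst = break_even_months(22_000, 1_500, ELECTRICITY_PLN)
print(f"best case: {best:.1f} months, worst case: {worst:.1f} months")
```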
RODO and Data Residency: When Self-Hosting Is a Requirement, Not a Choice
For many Polish companies, hardware is secondary to regulatory concerns. RODO (the Polish name for the GDPR) mandates control over personal data processed by AI systems. If queries sent to an LLM contain personal data (client names, PESEL, addresses), sending them to an external cloud API requires:
- Signing a data processing agreement (DPA) with the API provider,
- Verifying that the provider’s servers are in the EEA or that there’s a legal basis for transfers to third countries,
- Conducting a DPIA if the processing is high-risk.
Self-hosting eliminates these requirements for the generation layer: the LLM runs on your infrastructure, and data never leaves the company. We cover this approach in detail in the article on self-hosted LLMs and RODO. Regardless of hardware choice, PII should be masked before being sent to the model—this rule applies to both local LLMs and APIs.
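Masking PII before a prompt reaches the model can start with pattern matching. This is a minimal sketch covering only two patterns, 11-digit PESEL numbers and email addresses; names and street addresses need entity recognition, which is not shown, and the placeholder tokens and function name are illustrative.

```python
import re

# PESEL is an 11-digit identifier; this pattern does not verify its checksum
PESEL_RE = re.compile(r"\b\d{11}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask_pii(text: str) -> str:
    """Replace PESEL numbers and email addresses with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PESEL_RE.sub("[PESEL]", text)
    return text


# Sample data for illustration only
sample = "Client: Jan Kowalski, PESEL 44051401359, contact jan.kowalski@example.com"
print(mask_pii(sample))
```

The client name passes through unchanged here, which is exactly the gap a named-entity step would have to close.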
FAQ
Is an RTX 4090 Enough to Run a Local LLM for a Company?
An RTX 4090 with 24 GB VRAM can handle a 7B model in full precision, a 13B model in Q8, or up to 34B in Q4 quantization with a short context. For most business use cases (RAG assistants, document classification, FAQ responses) a 7B or 13B model is sufficient, and the RTX 4090 generates responses at 50–80 tokens per second. If you need a 70B model (e.g., for more complex reasoning), you’ll need either two 24 GB GPUs with the model split across them or a single 48 GB+ GPU.
How Much VRAM Does a 70B Model Need?
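A 70B model in Q4_K_M quantization takes about 40–42 GB VRAM. Add a buffer for context: with an 8K window, that’s 2–4 GB; with 32K, it’s 8–12 GB. The minimum hardware is two 24 GB GPUs such as the RTX 3090 or 4090 (48 GB total, with the model split across both cards) or a single 48 GB professional GPU, like the RTX 6000 Ada or L40S. At 48 GB the fit is tight, so keep the context short. On two RTX 3090s an NVLink bridge speeds up communication between the cards; the RTX 4090 has no NVLink, so the split runs over PCIe, which works with layer-wise splitting but is slower for tensor-parallel serving.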
Can You Run a Local LLM Without a GPU, Using Only CPU?
Yes, tools like llama.cpp allow running LLMs on CPU alone. However, generation speed is 2–8 tokens per second for 7B models, which is too slow for production applications (chat, assistants). CPU mode is useful for testing, prototyping, and batch tasks without time constraints (e.g., overnight batch summarization). For production throughput, a GPU is necessary.
How to Choose Between Q4 and Q8 Quantization?
Q4_K_M reduces VRAM by over half compared to full precision, with a 1–3% quality loss on general benchmarks. Q8 reduces VRAM by ~50% with less than 1% quality loss. The recommended starting point is Q4_K_M: it fits more model into available VRAM, and for most business tasks, the quality difference is negligible. Upgrade to Q5 or Q8 only if you measure a concrete drop in accuracy on your test set, not based on abstract benchmarks. Model selection and quantization are covered in the article on local vs. API LLM costs.
What Software Supports Local LLMs on GPU?
The most popular options in 2026 are: Ollama (easiest setup, OpenAI-compatible API, supports GGUF), vLLM (production server for GPTQ/AWQ, optimized for throughput, primarily targets NVIDIA CUDA GPUs), llama.cpp with HTTP server (flexible, supports CPU and GPU, GGUF format), and LM Studio (GUI for prototyping). For production environments, Ollama or vLLM with an LLM router to manage traffic and fall back to the cloud during peak loads is a proven pattern. Architecture details are in the article on company knowledge-based GPT.
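The fall-back-to-cloud pattern can be sketched independently of any particular server. The two backends below are hypothetical stand-ins passed in as plain functions, not the Ollama or vLLM client APIs; `BackendBusy` is an illustrative exception the local wrapper would raise on a full queue or a timeout.

```python
from typing import Callable


class BackendBusy(Exception):
    """Raised by a backend wrapper when it cannot take the request right now."""


def generate_with_fallback(prompt: str,
                           local: Callable[[str], str],
                           cloud: Callable[[str], str]) -> tuple[str, str]:
    """Try the local model first; use the cloud backend only if local is busy."""
    try:
        return "local", local(prompt)
    except BackendBusy:
        return "cloud", cloud(prompt)


# Stand-in backends for demonstration
def local_ok(prompt: str) -> str:
    return f"[local answer to: {prompt}]"


def local_overloaded(prompt: str) -> str:
    raise BackendBusy("queue full")


def cloud_api(prompt: str) -> str:
    return f"[cloud answer to: {prompt}]"


print(generate_with_fallback("Summarize clause 7", local_ok, cloud_api))
print(generate_with_fallback("Summarize clause 7", local_overloaded, cloud_api))
```

Any prompt that can take the cloud path should already be PII-masked, since the fallback sends it outside the company's infrastructure.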