A SaaS company with 40,000 active users receives an LLM API bill exceeding 30,000 PLN per month. The legal team flags that data processed by the external API requires an audit under RODO (the Polish name for the GDPR). The CTO asks: wouldn’t it be cheaper to run our own model? This is the question we hear every week in 2026. The answer isn’t simple, but it is calculable.
Below, I outline what such a migration looks like from the perspective of decisions, architecture, and risk. No promises of “saving X% post-migration.” Just concrete thresholds, typical pitfalls, and what actually needs redesigning.
When migration makes sense—and when it doesn’t
Migration to your own model is an infrastructure project, not an optimization. Before deciding, ask four questions:
Is the cost of tokens your dominant expense? External APIs bill per token, so every query adds to the bill. At scale, this adds up to tens of thousands of PLN monthly. But if generative AI is a marginal part of your product (e.g., a few hundred queries per day), the operational costs of self-hosting can easily outweigh token savings.
Can your data leave your infrastructure? Health, financial, legal, and HR data—all categories requiring special protection under RODO or professional secrecy—favor self-hosting even at low volumes. Here, the decision is legal, not just economic.
Is a general-purpose model sufficient, or do you need specialization? Deploying an open-weight model straight from a repository (e.g., Mistral, Llama, Qwen) is still a general-purpose model. If your task requires domain knowledge, migration alone isn’t enough—you’ll need fine-tuning or a RAG architecture on your knowledge base.
Do you have the expertise to maintain infrastructure? Your own model means servers, monitoring, security updates, and incident response. API-first offloads this cost. Without GPU infrastructure experience, migration without external support is an operational risk.
| Signal | API-first | Self-hosting |
|---|---|---|
| Query volume / month | below 30,000 | above 80,000 |
| Sensitive data (RODO / professional secrecy) | acceptable with DPA | self-hosting required |
| Required uptime and latency | external SLA | your responsibility |
| Capex budget (GPU / server) | none | from 15,000 PLN up |
| ML/infra expertise in team | not required | required or outsourced |
| Need for model customization | limited | full control |
What “your own model” actually means
“Your own model” doesn’t mean building one from scratch. Training an LLM from the ground up costs hundreds of thousands to millions of dollars in compute, which is out of reach for most companies. In practice, “your own model” means one of three approaches:
Deploy open-weight from Hugging Face / repository. You download model weights (Mistral, Llama 3, Qwen 2.5, Gemma) and run them on your GPU via Ollama, vLLM, or llama.cpp. The model is publicly available, but inference happens on your infrastructure. No data leaves your environment.
Fine-tuning on your data. You take an open-weight model and adapt it to your dataset (QLoRA, LoRA). Result: a model that knows your style, terminology, and response format. When this makes sense—and when RAG is enough—is covered in when fine-tuning makes sense.
Distillation from a large model to a small one. You generate synthetic training data from a large model (API) and train a smaller model on it. Outcome: a compact, specialized model that’s cheap to maintain. Requires careful dataset design and quality validation.
In every case, it’s not a “URL swap” but an architecture change. Router, guardrails, prompt formats, error handling, monitoring—everything must adapt to the new model’s characteristics.
How open-weight differs from API in engineering terms
External APIs and local models aren’t the same interface with a different address. The differences are fundamental:
Determinism and versioning. APIs can change model behavior without notice (backend updates). Local models have pinned weights—behavior is stable until you update the version. For production systems, this matters: customers notice when an assistant starts responding differently.
Latency and throughput. External APIs have network latency and share bandwidth across clients. Local models have inference latency tied to GPU and are dedicated to your traffic. At low volume, APIs are faster. At high, concentrated volume, self-hosted models win on latency.
Context window and memory. Large API models offer context windows up to 128K–200K tokens. Open-weight models have varying limits (usually 8K–128K depending on version). If your system relies on very long contexts, check the specific model’s limits before migrating.
Response format and structured output. External APIs often natively enforce JSON schemas. With local models, you implement this in-app: schema validation, parser with repair, retries on structure errors.
Cost of tokens vs. compute cost. APIs bill per token. With a self-hosted model, you pay for power and hardware amortization instead. These costs are largely fixed regardless of query volume: cheaper at scale, more expensive at low volume.
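For the structured-output point above, in-app enforcement might look roughly like the sketch below. It is only an illustration under stated assumptions: `REQUIRED_FIELDS` is a made-up schema, and `generate` stands for whatever function calls your local model (Ollama, vLLM, or anything else). It uses only the Python standard library.

```python
import json
from typing import Callable

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"category": str, "confidence": float}


def extract_json(text: str) -> dict:
    """Repair step: parse the span from the first '{' to the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError


def validate(obj: dict) -> dict:
    """Check required fields and types; accept an int where a float is expected."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in obj:
            raise ValueError(f"missing field: {name}")
        value = obj[name]
        if expected is float and type(value) is int:
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"field {name} should be {expected.__name__}")
        obj[name] = value
    return obj


def generate_structured(generate: Callable[[str], str], prompt: str,
                        max_attempts: int = 3) -> dict:
    """Call the model, validate the reply, and retry with the error on failure."""
    current, last_error = prompt, None
    for _ in range(max_attempts):
        raw = generate(current)
        try:
            return validate(extract_json(raw))
        except ValueError as exc:
            last_error = exc
            current = (f"{prompt}\n\nYour previous answer was invalid ({exc}). "
                       "Reply with JSON only.")
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

In production you would likely swap the hand-rolled `validate` for a proper schema library and count retries in your metrics, since the share of first-attempt valid outputs is one of the numbers worth benchmarking per model.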
Post-migration architecture: what needs redesigning
Experience from dozens of migrations shows the same components always need rework.
Model router. With an external API, you typically call a single model. After migrating to self-hosting, you typically run a fleet of models with varying sizes and costs. An LLM router classifies queries by complexity and routes simple ones to a small model (fast and cheap), while complex ones go to a large model. Without routing, you lose the economies of scale of your own infrastructure.
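To make the routing idea concrete, here is a deliberately naive sketch. The model names, the keyword list, and the word-count threshold are all assumptions for illustration; a real router would more likely use a small classifier trained on your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Hypothetical model names and heuristics, not recommendations.
SMALL_MODEL = "small-8b"
LARGE_MODEL = "large-70b"
REASONING_HINTS = ("explain why", "compare", "step by step", "analyze")


def route(prompt: str, max_small_words: int = 200) -> Route:
    """Send long or reasoning-heavy prompts to the large model, the rest to the small one."""
    text = prompt.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return Route(LARGE_MODEL, "reasoning keyword")
    if len(text.split()) > max_small_words:
        return Route(LARGE_MODEL, "long prompt")
    return Route(SMALL_MODEL, "short, simple prompt")
```

Returning the reason alongside the model name makes it easier to audit routing decisions later in your observability stack.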
Guardrails and filtering. External APIs have built-in content moderation. Local models don’t. You must implement guardrails yourself: input filtering (injection, prompt attacks, sensitive data), output filtering (PII in responses, topic scope), escalation traps for human handoff. Without this layer, local models are less secure than APIs.
PII and masking. Paradox: your own model processes data locally, so the risk of external data transfer is zero. But this doesn’t exempt you from masking personal data before the model—per RODO’s data minimization principle. Data should reach the model with masked identifiers and only be unmasked in the output.
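A minimal sketch of the mask-then-unmask flow follows. The two regular expressions are simplified assumptions that cover only e-mail addresses and phone-like numbers; names, national ID numbers, and addresses would need dedicated detection.

```python
import re

# Simplified, illustrative patterns; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected identifiers with placeholders; return text and the mapping."""
    mapping: dict[str, str] = {}

    def substitute(kind: str):
        def _sub(match: re.Match) -> str:
            value = match.group(0)
            for token, original in mapping.items():
                if original == value:  # reuse the placeholder for repeated values
                    return token
            token = f"<{kind}_{len(mapping) + 1}>"
            mapping[token] = value
            return token
        return _sub

    text = EMAIL_RE.sub(substitute("EMAIL"), text)
    text = PHONE_RE.sub(substitute("PHONE"), text)
    return text, mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore original values in the model's output."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The mapping lives only in application memory for the duration of the request, so the model and its logs see placeholders rather than the identifiers themselves.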
Observability and monitoring. External APIs log queries and metrics on their side—often invisible to you. Your own model requires its own observability stack: query logs (PII-free), performance metrics, quality drift alerts. More on what and how to measure in monitoring AI agent quality.
Fallback and degradation. APIs have redundancy on the provider’s side. Your infrastructure needs a fallback plan: what happens when the GPU is unavailable? The service should degrade gracefully—switch to a smaller model or notify users of unavailability, rather than failing silently.
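Graceful degradation can be sketched as an ordered chain of backends. The backend callables and the notice text below are placeholders; in practice each callable would wrap a request to a large local model, a smaller one, or the retained external API.

```python
from typing import Callable, Sequence

# Placeholder notice shown to users when every backend is down.
UNAVAILABLE_MESSAGE = "The assistant is temporarily unavailable. Please try again later."


class AllBackendsFailed(Exception):
    """Raised when every backend in the chain raised an error."""


def generate_with_fallback(
    prompt: str,
    backends: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try backends in order; return (backend name, answer) from the first that works."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {type(exc).__name__}")
    raise AllBackendsFailed("; ".join(errors))


def answer_or_notice(prompt: str,
                     backends: Sequence[tuple[str, Callable[[str], str]]]) -> str:
    """Return an answer, or an explicit notice instead of failing silently."""
    try:
        _, answer = generate_with_fallback(prompt, backends)
        return answer
    except AllBackendsFailed:
        return UNAVAILABLE_MESSAGE
```

Returning the backend name lets you alert when traffic is being served by the fallback for longer than expected.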
Migration process step by step
A practical timeline for a company migrating from an external API to self-hosting:
Step 1: Audit current usage (Week 1). Gather API logs: prompt length distribution, response length distribution, task types (classification, generation, extraction, summarization), error distribution. This informs model selection. Use the ROI calculator for a preliminary cost analysis.
Step 2: Model selection and benchmarking (Weeks 1–2). Pick 2–3 open-weight candidate models sized for your tasks. Run benchmarks on 50–100 real production queries. Measure: response quality, latency, percentage of correct structured outputs. Don’t choose based on general benchmarks; test on your data.
Step 3: Infrastructure setup (Weeks 2–3). GPU server, Ollama or vLLM configuration, networking, backups. Detailed hardware selection is covered in local LLMs and GPU hardware. At this stage, deploy the router, guardrails, and observability layer.
Step 4: Parallel pilot—shadow mode (Weeks 3–4). For at least a week, route traffic to both the API and your model. Compare responses. Don’t cut the API—it’s your fallback. Shadow mode reveals quality differences benchmarks miss.
Step 5: Gradual traffic switch (Weeks 4–6). 10% traffic to your model, monitor; 25%, monitor; 50%, monitor; 100%. At each threshold, pause for 48 hours and verify metrics: error rate, latency, quality scores (if you have a feedback loop).
Step 6: API shutdown (Week 6+). Only after shadow mode and ramp-up confirm stability. Keep API access as an emergency fallback for another 30 days.
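Steps 4 and 5 above can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: `api` and `own` stand for functions calling the external API and your model, the character-level similarity from `difflib` is only a crude first signal, and hashing the user ID into 100 buckets is one common way to keep each user on the same side of the split as the percentage grows.

```python
import difflib
import hashlib
from typing import Callable


def shadow_call(prompt: str, api: Callable[[str], str],
                own: Callable[[str], str], log: list) -> str:
    """Shadow mode: serve the API answer, run your model on the side, log a comparison."""
    answer = api(prompt)
    entry = {"prompt_chars": len(prompt), "similarity": None, "error": None}
    try:
        candidate = own(prompt)
        ratio = difflib.SequenceMatcher(None, answer, candidate).ratio()
        entry["similarity"] = round(ratio, 3)
    except Exception as exc:
        # A shadow failure must never affect the user-facing answer.
        entry["error"] = type(exc).__name__
    log.append(entry)  # no prompt text is stored, only its length
    return answer


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_own_model(user_id: str, rollout_percent: int) -> bool:
    """Gradual switch: users in buckets below the threshold go to your model."""
    return bucket(user_id) < rollout_percent
```

Because the bucket is derived from the user ID, raising `rollout_percent` from 10 to 25 only adds users; nobody who was already on your model flips back to the API mid-rollout.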
Costs and risks no one talks about
Migration to self-hosting has hidden costs not visible in “token cost vs. server cost” comparisons:
People cost. GPU infrastructure requires maintenance. Updates, monitoring, incidents, model version rotation. Typically 0.2–0.5 FTE of an engineer. For small teams, this is the dominant cost.
Startup cost. Hardware (GPU server) is a one-time purchase or a lease. A purchase of 15,000–60,000 PLN (depending on GPU), amortized over 3–4 years, must factor into the calculation. Run the numbers in the inference cost calculator before deciding.
Quality drift on model updates. Switching to a new model version can change system behavior. Every update requires re-benchmarking and regression testing. External APIs hide this problem—sometimes solving it, sometimes silently passing it to your system.
AI Act risk. AI systems classified as high-risk require technical documentation, system registration, and transparency compliance. Self-hosting doesn’t exempt you; it increases your responsibility as the deployer. Before migrating a high-risk system, review the DPIA and technical documentation with a lawyer.
Migration from API to your own model is a decision justified by numbers, not intuition. The entry point is covered in where to start with AI implementation—there, we describe how to frame the question before choosing a tool.
FAQ
At what volume does migrating to your own model become cost-effective?
The cost-effectiveness threshold depends on the chosen model and context length. Roughly: for short prompts and responses (200–500 tokens), the threshold is 50,000–80,000 queries per month. For long contexts (RAG, document summarization), the threshold drops to 20,000–40,000 queries because each query carries many more tokens, so the API bill per query is higher. Calculate this specifically in the ROI calculator before deciding.
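The arithmetic behind such a threshold is simple enough to sketch. Every number in the usage example below is a placeholder: the hardware price, amortization period, and FTE share sit inside the ranges mentioned in this article, while the engineer cost, power cost, and token price are pure assumptions you should replace with your own.

```python
def api_cost_per_month(queries: int, tokens_per_query: int,
                       price_per_1k_tokens: float) -> float:
    """API bill: scales linearly with the total number of tokens."""
    return queries * tokens_per_query / 1000 * price_per_1k_tokens


def self_hosted_cost_per_month(hardware_price: float, amortization_months: int,
                               power_and_hosting: float, engineer_fte: float,
                               engineer_monthly_cost: float) -> float:
    """Self-hosting bill: treated here as fixed, independent of query volume."""
    return (hardware_price / amortization_months
            + power_and_hosting
            + engineer_fte * engineer_monthly_cost)


def break_even_queries(fixed_monthly: float, tokens_per_query: int,
                       price_per_1k_tokens: float) -> float:
    """Monthly query volume at which the API bill equals the fixed self-hosting cost."""
    return fixed_monthly / (tokens_per_query / 1000 * price_per_1k_tokens)


# Placeholder inputs (all in PLN), for illustration only.
fixed = self_hosted_cost_per_month(
    hardware_price=36_000, amortization_months=36,
    power_and_hosting=500, engineer_fte=0.25, engineer_monthly_cost=20_000,
)
threshold = break_even_queries(fixed, tokens_per_query=1000, price_per_1k_tokens=0.1)
```

With these made-up inputs the fixed cost comes to 6,500 PLN per month and the break-even lands at roughly 65,000 queries. Note how the people cost dominates the fixed side, and how doubling `tokens_per_query` halves the threshold, which is why long-context workloads reach break-even sooner.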
Does migrating to your own model ensure RODO compliance?
Self-hosting eliminates data transfer to external providers, which simplifies compliance: no need to sign a DPA with the API provider or worry about data residency. But RODO still applies: data minimization, PII masking before the model, logs without personal data, the right to erase conversation history. Your own model isn’t automatically RODO-compliant. It requires the same mechanisms, just at a different address.
How big is the quality difference between open-weight and GPT-4-class API?
In 2026, leading open-weight models (Llama 3.3 70B, Qwen 2.5 72B, Mistral Large) match GPT-4-class quality on most business tasks: classification, extraction, summarization, structured text generation. The gap appears in tasks requiring complex multi-step reasoning and up-to-date knowledge. For most enterprise systems, the difference is acceptable or negligible. Verify this with a benchmark on your own data.
What to do with prompts that work well in API but poorly in a local model?
Prompts written for a specific API model aren’t one-to-one portable. Open-weight models have different chat templates and respond differently to system instructions. Usually, a few iterations suffice: adjust the format (system/user/assistant roles), remove instructions the model ignores, add examples (few-shot) where the API model didn’t need them. Enforcing structured output with schema validation also helps, rather than relying on the model’s default response format.
Is migration worth it if we primarily use RAG?
Yes, and it’s often the simplest migration. In a RAG architecture, the model is just a response generator working from retrieved fragments, so quality depends more on the index and embeddings than on model size. A smaller, local model (7B–13B) handles this well, and embedding models like BGE-M3 can run locally as well. Token costs for RAG are high (long context with fragments), so the self-hosting cost-effectiveness threshold is lower than for generation without retrieved context.