A company orders an AI agent deployment. They receive a project quote: 30,000–80,000 PLN. They sign the contract. The agent goes live. Six months later, the finance department asks: "How much does this agent cost per month?" No one has a ready answer. Cloud invoices grew gradually. The engineer’s time spent updating the knowledge base wasn’t tracked separately. Monitoring was part of the general IT project.
This is a typical scenario, not an exception. The TCO (Total Cost of Ownership) of an AI agent is rarely calculated before deployment and almost never measured correctly during the first two quarters. Below, I describe how to change that.
Five Categories of AI Agent Operational Costs
AI agent TCO isn’t just about the API bill. Each of the following categories is a separate cost center with its own growth dynamics.
| Category | What’s Included | Dynamics |
|---|---|---|
| Inference (tokens) | LLM call cost per query multiplied by volume | Linear in query volume; superlinear when prompt complexity also grows |
| Infrastructure | Server, vector database, cache, network | Stepwise (jumps at volume thresholds) |
| Knowledge Base Maintenance | Reindexing, versioning, document audit | Fixed monthly, with spikes during product changes |
| Monitoring and Oversight | Engineering time, golden set tests, alerts, human oversight | Fixed, decreases as processes mature |
| Compliance and Security | Logs with TTL, audit trail, guardrail reviews, GDPR (RODO) | Predictable and fixed, increases during regulator audits |
The most common mistake in TCO calculations is counting only token costs. These typically account for 20–40% of the total operational cost. The rest is infrastructure and human labor, and that part determines profitability over a one-year horizon.
Inference Cost: How to Calculate Tokens Across Different Architectures
Inference is the cost of calling the language model. It depends on three variables: the number of queries, prompt length, and model pricing.
In a RAG system, the prompt consists of the system prompt (fixed, usually 200–600 tokens), the context retrieved from the database (top-k fragments, usually 800–2,000 tokens), the conversation history (grows during the conversation), and the user’s query itself. For a typical customer service question, the input prompt is 1,200–3,000 tokens, with an output of 200–600 tokens.
Monthly calculation for 5,000 queries using an API model:
- Prompt input: 5,000 × 2,000 tokens = 10 million input tokens
- Output: 5,000 × 400 tokens = 2 million output tokens
- At a rate of 2 USD / 1M input + 6 USD / 1M output: 10 × 2 + 2 × 6 = 32 USD monthly for a mid-tier model
- For a premium model (8 USD / 1M input, 24 USD / 1M output): 128 USD
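The arithmetic in the list above can be reproduced with a few lines of Python. This is a minimal sketch: the `ModelPricing` class and `monthly_inference_cost` function are illustrative names, and the rates are the example rates from the text, not any provider's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Price in USD per one million tokens (example rates from the text)."""
    input_per_million: float
    output_per_million: float


def monthly_inference_cost(queries: int, input_tokens: int,
                           output_tokens: int, pricing: ModelPricing) -> float:
    """Monthly cost in USD for a fixed average prompt and output size."""
    input_millions = queries * input_tokens / 1_000_000
    output_millions = queries * output_tokens / 1_000_000
    return (input_millions * pricing.input_per_million
            + output_millions * pricing.output_per_million)


MID_TIER = ModelPricing(input_per_million=2.0, output_per_million=6.0)
PREMIUM = ModelPricing(input_per_million=8.0, output_per_million=24.0)

# 5,000 queries, 2,000 input tokens and 400 output tokens per query
mid = monthly_inference_cost(5_000, 2_000, 400, MID_TIER)
premium = monthly_inference_cost(5_000, 2_000, 400, PREMIUM)
print(f"mid-tier: {mid:.2f} USD, premium: {premium:.2f} USD")
```

Swapping in your own average token counts and the current rates of your provider gives a first estimate before any optimization is applied.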
The difference between a mid-tier and a premium model at the same volume is fourfold. The article on token cost optimization describes techniques (prompt caching, model routing, context shortening) that reduce this cost by 30–60% without losing quality.
With self-hosting, the per-token cost drops to zero (you pay for the GPU, not per call), but server costs appear instead. For an agent with 5,000 monthly queries, self-hosting typically breaks even against the API only after 12–18 months. Over a shorter horizon, the cloud is cheaper.
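A break-even horizon of this kind can be estimated with a simple comparison of cumulative costs. The sketch below assumes a one-time setup cost for self-hosting and flat monthly costs on both sides; the function name and all figures in the example call are placeholders, not numbers from this article.

```python
import math
from typing import Optional


def break_even_months(upfront_cost: float, selfhost_monthly: float,
                      api_monthly: float) -> Optional[int]:
    """Months after which cumulative self-hosting cost no longer exceeds
    cumulative API cost. Returns None if self-hosting never catches up."""
    monthly_saving = api_monthly - selfhost_monthly
    if monthly_saving <= 0:
        return None  # self-hosting costs at least as much every month
    return math.ceil(upfront_cost / monthly_saving)


# Placeholder figures (assumptions for illustration only):
# 3,000 upfront for hardware and setup, 100 monthly to run the server,
# 300 monthly for the all-in API alternative (tokens plus managed services).
months = break_even_months(upfront_cost=3_000, selfhost_monthly=100, api_monthly=300)
print(months)
```

The model ignores volume growth and engineering time for maintenance, both of which shift the result, so treat it as a first filter rather than a decision.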
Infrastructure: What You Pay Beyond Tokens
AI agent infrastructure comprises several components that rarely appear in initial cost estimates.
The vector database stores embeddings of the knowledge base. Its cost depends on the number of vectors and the required availability. For a database of 10,000 documents (typical for a mid-sized company’s knowledge base), managed Qdrant or Pinecone costs 30–80 USD monthly. Self-hosted Qdrant on a dedicated server eliminates this cost but requires instance maintenance.
A cache for semantic search results and prompts (Redis or Valkey) is a one-time deployment cost with low operational expenses. With a well-designed cache, hit rates reach 25–40% for repeated questions, directly reducing inference costs.
The application server for the agent API (Python/FastAPI or Node) runs on a VPS costing 60–150 USD monthly at volumes up to 50,000 monthly queries, or on serverless with per-request pricing.
Monitoring and observability (Prometheus, Grafana, or equivalent) adds 20–50 USD monthly in the cloud or can be configured on your own infrastructure. A detailed description of the monitoring architecture is available in the article on AI agent quality monitoring.
The total infrastructure cost for an agent handling 5,000–20,000 monthly queries is realistically 150–400 USD monthly for cloud solutions and 80–200 USD for self-hosting (excluding server amortization).
Knowledge Base Maintenance: The Hidden Cost That Grows Over Time
The agent’s knowledge base ages. Prices change. Procedures are updated. New products enter the offering. Each such change requires updating documents and reindexing the vector database.
Reindexing cost has two components: the cost of computing new embeddings (per-token charges with an API model, GPU time with a local model such as BGE-M3) and the labor cost of preparing, verifying, and publishing the updated documents.
For a company updating its offering quarterly with a database of 500–2,000 documents, reindexing takes 2–4 hours monthly plus embedding costs (typically 5–20 USD for a full reindex with API, zero with a local model). This sounds minor, but with poor document organization, verification time can grow to 10–20 hours.
The article on RAG knowledge updates and versioning describes how to build an incremental reindexing pipeline that reduces this cost by 60–70% by updating only changed fragments, not the entire database.
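One common way to update only changed fragments is to store a content hash next to each indexed chunk and compare hashes on every run. The sketch below illustrates that idea under stated assumptions: chunk IDs are stable across document versions, and `chunk_hash` and `plan_reindex` are hypothetical helper names, not part of any vector database API.

```python
import hashlib
from typing import Dict, List, Tuple


def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_chunks: Dict[str, str],
                 indexed_hashes: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare current chunks (id -> text) with stored hashes (id -> hash).

    Returns the chunk IDs that need (re)embedding and the IDs to delete.
    """
    to_embed = [cid for cid, text in current_chunks.items()
                if indexed_hashes.get(cid) != chunk_hash(text)]
    to_delete = [cid for cid in indexed_hashes if cid not in current_chunks]
    return to_embed, to_delete


# State of the index after the previous run
previous = {"a": "Price: 10 PLN", "b": "Returns within 14 days", "c": "Old promo"}
indexed = {cid: chunk_hash(text) for cid, text in previous.items()}

# Documents today: "a" changed, "b" unchanged, "c" removed, "d" added
current = {"a": "Price: 12 PLN", "b": "Returns within 14 days", "d": "New product"}
to_embed, to_delete = plan_reindex(current, indexed)
print(to_embed, to_delete)
```

Only the chunks in `to_embed` incur embedding cost, which is where the savings over a full reindex come from.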
Monitoring and Human Oversight: The Cost That Doesn’t Disappear
AI agent monitoring isn’t a one-time deployment. It’s a continuous operational cost with two components: automated (alerts, regression tests) and human (escalation reviews, quality audits, incident response).
The automated part is relatively cheap: once configured, alerts and golden set tests run automatically. The cost is a few hours monthly for reviewing results and responding to anomalies.
The human component depends on scale and application area. For a customer service agent handling 200 cases daily, typical oversight time is 3–6 hours weekly: reviewing escalations, checking response samples, updating the golden set when errors are detected. At 2,000 cases daily, this becomes 15–25 hours weekly for a dedicated person.
Human oversight for systems covered by the AI Act isn’t optional. The article on AI agent security describes oversight requirements and how to document the audit trail required by the regulator.
Compliance and Security Costs
Compliance with GDPR (RODO) and the AI Act generates costs that many decision-makers overlook in initial TCO calculations.
Logs with TTL: Storing operational logs with appropriate retention periods and data deletion mechanisms (right to be forgotten) requires infrastructure and processes. The cost is mainly engineering time for implementation and monthly reviews.
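The two mechanisms named above, retention-based expiry and deletion on request, can be sketched as two filters over log entries. This is an illustration only: the entry layout (`user_id`, `timestamp`, `event`), the function names, and the 90-day retention period are assumptions, and a production system would run the equivalent operations inside its log store.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def purge_expired(logs: List[Dict], retention_days: int,
                  now: Optional[datetime] = None) -> List[Dict]:
    """Keep only entries younger than the retention period (TTL)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [entry for entry in logs if entry["timestamp"] >= cutoff]


def erase_subject(logs: List[Dict], user_id: str) -> List[Dict]:
    """Drop every entry belonging to one user (right to be forgotten)."""
    return [entry for entry in logs if entry["user_id"] != user_id]


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
logs = [
    {"user_id": "u1", "timestamp": datetime(2025, 1, 10, tzinfo=timezone.utc), "event": "query"},
    {"user_id": "u2", "timestamp": datetime(2025, 6, 1, tzinfo=timezone.utc), "event": "query"},
    {"user_id": "u1", "timestamp": datetime(2025, 6, 20, tzinfo=timezone.utc), "event": "escalation"},
]

kept = purge_expired(logs, retention_days=90, now=now)  # 90 days is an assumed policy
after_erasure = erase_subject(kept, "u1")
print(len(kept), len(after_erasure))
```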
DPIA (Data Protection Impact Assessment) for agents processing personal data is a one-time cost at deployment (4–16 hours of work with a lawyer or GDPR (RODO) specialist), plus an update with every significant architecture change. Details of the obligations are described in the article on AI Act and RODO 2026.
Penetration testing of guardrails for agents with access to external systems (CRM, ERP, databases) takes 2–4 hours quarterly as an internal review; high-risk deployments call for an external audit instead.
The total compliance cost for a typical B2B agent is 500–2,000 PLN annually in labor hours, plus potential external review costs.
Cost Benchmark: Three Deployment Scenarios
Below are three scenarios showing realistic monthly TCO at different deployment scales. The numbers assume a cloud model (API) with managed infrastructure and 8 labor hours of monthly oversight.
| Component | FAQ Agent (2,000 queries/month) | Customer Service Agent (10,000 queries/month) | Multi-Step Agent (5,000 queries/month) |
|---|---|---|---|
| Inference (tokens) | 15–40 PLN | 100–300 PLN | 200–600 PLN |
| Infrastructure | 150–300 PLN | 300–600 PLN | 400–800 PLN |
| Knowledge Base Maintenance | 200–400 PLN | 400–800 PLN | 600–1,200 PLN |
| Monitoring and Oversight | 300–600 PLN | 600–1,200 PLN | 800–1,600 PLN |
| Compliance | 80–150 PLN | 150–300 PLN | 200–400 PLN |
| Total TCO | 745–1,490 PLN | 1,550–3,200 PLN | 2,200–4,600 PLN |
The multi-step agent has higher inference costs than the FAQ agent despite a comparable volume, and higher than the customer service agent at half the volume, because each step of the ReAct loop generates a separate LLM call. Agent architecture affects TCO more than query volume does.
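The totals in the table are plain sums of the component ranges, which makes the benchmark easy to adapt to your own figures. A minimal sketch follows, with the table's PLN ranges copied in as `(low, high)` tuples; the dictionary keys and the `total_tco` function are illustrative names.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (low, high) monthly cost in PLN

SCENARIOS: Dict[str, Dict[str, Range]] = {
    "FAQ agent": {
        "inference": (15, 40), "infrastructure": (150, 300),
        "knowledge_base": (200, 400), "oversight": (300, 600),
        "compliance": (80, 150),
    },
    "Customer service agent": {
        "inference": (100, 300), "infrastructure": (300, 600),
        "knowledge_base": (400, 800), "oversight": (600, 1_200),
        "compliance": (150, 300),
    },
    "Multi-step agent": {
        "inference": (200, 600), "infrastructure": (400, 800),
        "knowledge_base": (600, 1_200), "oversight": (800, 1_600),
        "compliance": (200, 400),
    },
}


def total_tco(components: Dict[str, Range]) -> Range:
    """Sum the low ends and the high ends of all component ranges."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


for name, components in SCENARIOS.items():
    low, high = total_tco(components)
    print(f"{name}: {low}-{high} PLN monthly")
```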
How to Reduce TCO Without Compromising Quality
Three architectural changes with the biggest impact on TCO:
Model router directs simple queries (classification, FAQ) to a cheaper model and complex ones (multi-step, analytical) to a more expensive model. Inference cost reduction is typically 30–55% with proper configuration. Router construction details are described in the article on migrating from API to your own AI model.
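The routing decision itself can be as simple as a handful of rules. The sketch below is a deliberately naive heuristic, not the router described in the linked article: the model names, the keyword list, and the word-count threshold are all hypothetical, and real routers often use a small classifier instead.

```python
from dataclasses import dataclass

CHEAP_MODEL = "cheap-model"      # hypothetical model identifier
PREMIUM_MODEL = "premium-model"  # hypothetical model identifier

# Assumed signals of analytical or multi-step intent
COMPLEX_HINTS = ("compare", "analyze", "calculate", "step by step")


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route_query(query: str, needs_tools: bool = False,
                max_simple_words: int = 30) -> Route:
    """Pick a model tier from cheap, rule-based signals."""
    text = query.lower()
    if needs_tools:
        return Route(PREMIUM_MODEL, "multi-step tool use")
    if any(hint in text for hint in COMPLEX_HINTS):
        return Route(PREMIUM_MODEL, "analytical wording")
    if len(text.split()) > max_simple_words:
        return Route(PREMIUM_MODEL, "long query")
    return Route(CHEAP_MODEL, "simple query")


print(route_query("What are your opening hours?"))
print(route_query("Compare plan A and plan B for a 50-person team"))
```

Whatever the routing logic, it is worth logging the `reason` field so that misrouted queries can be found during quality reviews.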
Prompt caching for fixed prompt fragments (system prompt, RAG headers, guardrails instructions) reduces token costs by 20–40% at volumes above 1,000 daily queries. Most API providers have supported this feature natively since 2025.
RAG context shortening through better reranking and top-k fragment filtering reduces prompt size without losing response quality. Instead of passing 5 fragments of 500 tokens each, a more precise reranker selects the 2 best. The article on RAG quality evaluation describes how to measure retrieval precision and when investing in a better reranker pays off in reduced token costs.
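The 5-versus-2 fragment example above can be sketched as a selection step applied to reranker output. The scores, the `min_score` threshold, and the function names below are assumptions for illustration; the fixed 500 tokens per fragment mirrors the example in the text rather than a real tokenizer count.

```python
from typing import List, Tuple


def select_context(scored: List[Tuple[str, float]], top_k: int = 2,
                   min_score: float = 0.5) -> List[str]:
    """Keep at most top_k fragments, and only those above min_score."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [text for text, score in ranked[:top_k] if score >= min_score]


def context_tokens(fragment_count: int, tokens_per_fragment: int = 500) -> int:
    """Approximate context size, assuming equally sized fragments."""
    return fragment_count * tokens_per_fragment


# Five retrieved fragments with hypothetical reranker scores
candidates = [("frag A", 0.91), ("frag B", 0.34), ("frag C", 0.78),
              ("frag D", 0.41), ("frag E", 0.12)]

selected = select_context(candidates)
saved = context_tokens(len(candidates)) - context_tokens(len(selected))
print(selected, f"saves {saved} input tokens per query")
```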
FAQ
How much does it cost monthly to maintain an AI agent for a small business?
For a small business with 1,000–3,000 monthly queries and a narrow scope (FAQ, statuses, simple classifications), the real operational cost is 600–1,800 PLN monthly. This amount mainly consists of infrastructure (150–300 PLN) and oversight time (3–5 hours monthly). Token costs at this volume are marginal. A calculation for a specific scope is provided by the ROI calculator.
What’s included in AI agent TCO that isn’t in the deployment price?
The deployment price typically covers architecture design, agent development, initial knowledge base population, and testing. It doesn’t cover monthly inference costs (tokens), post-handover infrastructure maintenance, regular knowledge base updates, oversight and monitoring time, or compliance costs (DPIA, GDPR (RODO) logs). These elements make up TCO and determine profitability over a 12–24 month horizon. A pre-deployment assessment is facilitated by the readiness evaluation tool.
When is self-hosting an AI agent cheaper than a cloud API?
Self-hosting reduces the per-token cost to zero but adds other costs: a server (GPU or high-end CPU), model and infrastructure maintenance, and security updates. The break-even point typically appears at volumes above 20,000–50,000 monthly queries, or when data-residency and GDPR (RODO) requirements mandate self-hosting regardless of economics. At lower volumes, the API is cheaper overall, even with higher per-token costs. Break-even analysis details are described in the article on migrating from API to your own model.
How to control token costs when volume grows faster than planned?
Three control mechanisms: (1) daily per-user or per-endpoint limits in the LLM router stop uncontrolled cost growth before alerts even fire; (2) a model router automatically directs simple queries to a cheaper model when volume exceeds a threshold; (3) a semantic cache for repeated questions reduces actual call volume by 20–40%. Without these mechanisms, sudden volume spikes (viral traffic, new channel integration) can double the monthly bill within a week. The article on AI deployment planning step-by-step describes how to build these safeguards from day one.
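The first mechanism, a daily per-user limit, can be sketched as a small in-memory counter. The class name, the token-based limit, and the in-process storage are assumptions for illustration; a deployed router would more likely keep these counters in a shared store such as Redis so that all instances see the same usage.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Optional, Tuple


class DailyTokenBudget:
    """Rejects requests once a user exceeds a daily token allowance."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        # (user_id, day) -> tokens consumed so far on that day
        self._usage: Dict[Tuple[str, date], int] = defaultdict(int)

    def try_consume(self, user_id: str, tokens: int,
                    today: Optional[date] = None) -> bool:
        """Record usage and return True, or return False if over budget."""
        day = today or date.today()
        key = (user_id, day)
        if self._usage[key] + tokens > self.daily_limit:
            return False
        self._usage[key] += tokens
        return True


budget = DailyTokenBudget(daily_limit=10_000)  # assumed limit
monday = date(2025, 6, 2)
tuesday = date(2025, 6, 3)

first = budget.try_consume("user-1", 6_000, today=monday)    # within budget
second = budget.try_consume("user-1", 6_000, today=monday)   # would exceed it
next_day = budget.try_consume("user-1", 6_000, today=tuesday)  # counter resets
print(first, second, next_day)
```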
How does the AI Act affect AI agent operational costs?
The AI Act adds costs mainly in three areas: documentation and DPIA at deployment and at each update, the audit trail (decision logs with retention), and human-oversight requirements for high-risk systems. For most B2B agents (customer service, FAQ, classification), requirements are moderate. For agents in high-risk sectors (healthcare, finance, HR), compliance costs increase total TCO by 20–40%. A detailed breakdown of obligations by sector is provided in the article on AI Act and high-risk systems.