AI voice assistant for phone: replacing IVR with honesty

A company deploys an IVR in 2014, and in 2026 its customers still hear a menu. In several deployments we analyzed, 35 to 60 percent of callers hung up before reaching a consultant. At Cashcrown, we tested dozens of voice agent architectures on Polish-language call datasets. Below, we describe what we measured.

Pipeline: what one conversation turn looks like

Every exchange in a conversation with a voice agent goes through four stages:

STT (Speech-to-Text): The microphone or phone line sends the audio stream to an ASR model. The model converts speech to text. On a good phone line, the Word Error Rate for Polish is 5 to 12 percent with models like Whisper large-v3 or commercial APIs. On a noisy street or with a weak mobile connection, WER jumps to 20 to 35 percent. This is the harsh reality of Polish telephony, and no marketing will change it.
Intent classification: The transcript text goes to a language model, which assigns it to one of the predefined categories (check status, schedule appointment, opening hours, no match). The agent uses tool-use here: it calls a function in the CRM, calendar, or FAQ database depending on the intent.
Response with content: The agent retrieves data from the system (shipment status, available slots) and composes a response. Short text, 1 to 3 sentences. The longer the agent speaks, the higher the risk the customer will interrupt.
TTS (Text-to-Speech): The response text goes to a voice synthesizer. Modern TTS models (ElevenLabs, Azure Neural TTS, OpenAI TTS) sound natural in Polish. The synthesis delay itself is 80 to 200 ms when streaming the first tokens.
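
The four stages can be sketched as one function that runs them in order and times each one. This is a minimal illustration, not a production pipeline: `run_turn`, `TurnResult`, and the `fake_*` stage functions are hypothetical names, and the stubs stand in for a real ASR model, LLM, backend, and TTS engine.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TurnResult:
    transcript: str
    intent: str
    reply_text: str
    audio: bytes
    timings_ms: Dict[str, float] = field(default_factory=dict)


def run_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    classify_intent: Callable[[str], str],
    lookup: Callable[[str, str], str],
    synthesize: Callable[[str], bytes],
) -> TurnResult:
    """Run one conversation turn and record how long each stage took."""
    timings: Dict[str, float] = {}

    def timed(name, fn, *args):
        start = time.perf_counter()
        out = fn(*args)
        timings[name] = (time.perf_counter() - start) * 1000.0
        return out

    transcript = timed("stt", transcribe, audio_in)
    intent = timed("intent", classify_intent, transcript)
    reply_text = timed("lookup", lookup, intent, transcript)
    audio_out = timed("tts", synthesize, reply_text)
    return TurnResult(transcript, intent, reply_text, audio_out, timings)


# Stub stages (hypothetical) so the sketch runs without ASR, LLM, or TTS.
def fake_stt(audio: bytes) -> str:
    return "gdzie jest moja paczka"


def fake_intent(text: str) -> str:
    return "check_status" if "paczka" in text else "no_match"


def fake_lookup(intent: str, text: str) -> str:
    if intent == "check_status":
        return "Your parcel is out for delivery today."
    return "Let me connect you with a consultant."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(b"\x00\x01", fake_stt, fake_intent, fake_lookup, fake_tts)
print(result.intent, result.reply_text, sorted(result.timings_ms))
```

Passing the stages in as callables keeps the per-stage timing in one place, so swapping a local model for a cloud API changes only the argument, not the measurement code.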

The entire loop (STT, intent classification, system query, and the start of TTS playback) should complete in 0.8 to 1.5 seconds from the end of the customer’s utterance. This is the voice equivalent of TTFT (Time to First Token): the moment the customer hears the first word of the response.

Latency budget: where time is spent

The table below shows how time is distributed in a realistic local deployment (faster-whisper on GPU) and a cloud variant (commercial API):

Stage	Local (GPU)	Cloud (API)
STT (2 to 5 sec. audio)	150 to 300 ms	300 to 600 ms
Intent classification (small LLM, 7B)	200 to 500 ms	150 to 400 ms
Query to system (CRM/DB)	50 to 200 ms	50 to 200 ms
TTS (first word, streaming)	80 to 200 ms	100 to 250 ms
Total (median)	480 to 1200 ms	600 to 1450 ms

Values are ranges from internal tests, not guarantees. Every deployment requires its own measurement because SIP trunk, WebRTC, and PSTN gateways have different jitter characteristics. If the total regularly exceeds 2.5 seconds, customers interpret the silence as a dropped call, and the abandon rate spikes.
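
The totals in the table are the sums of the per-stage bounds, which a short script can verify. The sketch below copies the stage ranges from the table and adds a hypothetical helper, `over_limit_share`, for checking your own measured turn latencies against the 2.5-second limit. The function and variable names are illustrative assumptions.

```python
# Stage ranges in milliseconds, copied from the table above: (low, high).
BUDGET_MS = {
    "local": {"stt": (150, 300), "intent": (200, 500), "query": (50, 200), "tts": (80, 200)},
    "cloud": {"stt": (300, 600), "intent": (150, 400), "query": (50, 200), "tts": (100, 250)},
}
SILENCE_LIMIT_MS = 2500  # above this, callers tend to assume a dropped call


def total_range(stages):
    """Sum the lower and upper bounds of all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def over_limit_share(measured_totals_ms, limit_ms=SILENCE_LIMIT_MS):
    """Fraction of measured turns whose total latency exceeds the limit."""
    if not measured_totals_ms:
        return 0.0
    slow = sum(1 for t in measured_totals_ms if t > limit_ms)
    return slow / len(measured_totals_ms)


for variant, stages in BUDGET_MS.items():
    print(variant, total_range(stages))
# local (480, 1200)
# cloud (600, 1450)
```

Note that summing the upper bounds is a conservative estimate: in a real deployment the stages rarely all hit their worst case in the same turn, which is one more reason to measure end-to-end latency directly.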

Barge-in: the customer speaks before the agent finishes

Classic IVR blocks customer input during message playback. Production-grade voice agents support barge-in: the customer can interrupt the agent mid-sentence and start speaking. The agent stops synthesis and processes the new utterance.

Barge-in requires Voice Activity Detection (VAD) with a carefully set sensitivity threshold. Too low a threshold causes background noise or on-hold music to trigger false detections. Too high a threshold misses quiet customer utterances. The setting requires testing with recordings from the target environment, not synthetic audio.
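
The threshold trade-off can be illustrated with a deliberately simple energy-based detector. This is a sketch under stated assumptions: production systems usually rely on a trained VAD model rather than raw RMS energy, the function names are hypothetical, and the synthetic tones only demonstrate the mechanism. Actual tuning still needs recordings from the target environment.

```python
import math
from typing import Iterable, List, Sequence


def frame_rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_barge_in(frames: Iterable[Sequence[int]], threshold: float, min_frames: int = 3) -> int:
    """Return the index of the frame where barge-in fires, or -1.

    Barge-in fires once `min_frames` consecutive frames exceed `threshold`,
    which filters out short clicks and pops.
    """
    run = 0
    for i, frame in enumerate(frames):
        run = run + 1 if frame_rms(frame) > threshold else 0
        if run >= min_frames:
            return i
    return -1


def tone(amplitude: int, n: int = 160) -> List[int]:
    """Synthetic 200 Hz frame: 160 samples is 20 ms at 8 kHz."""
    return [int(amplitude * math.sin(2 * math.pi * 200 * k / 8000)) for k in range(n)]


hold_music = [tone(800)] * 10                       # steady low-level background
quiet_caller = [tone(800)] * 4 + [tone(2500)] * 6   # caller starts at frame 4

for threshold in (300, 1200, 4000):
    print(threshold, detect_barge_in(hold_music, threshold), detect_barge_in(quiet_caller, threshold))
# 300 2 2     threshold too low: background alone triggers barge-in
# 1200 -1 6   background ignored, caller detected three frames after starting
# 4000 -1 -1  threshold too high: the quiet caller is never detected
```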

Barge-in is crucial for the conversation to feel natural, and its absence is one of the signals customers use to recognize an old architecture.

What the agent handles well and what requires a human

There’s no point in deploying a voice agent for everything. The boundary between automation and escalation to a human must be designed intentionally, not discovered in production.

The agent handles well:

Shipment, order, or ticket status (read from CRM or logistics system)
Opening hours, addresses, basic product information
Scheduling and rescheduling appointments in a calendar (with idempotent protection against double booking)
Simple FAQ: what’s needed for a visit, how long a decision takes, how to cancel a subscription (information, not action)
Initial routing: the agent asks what the customer is calling about before connecting to the right department
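
For the scheduling item, one common way to get idempotent protection is an idempotency key derived from the call and the requested slot, so that a retried tool call returns the existing booking instead of creating a second one. The sketch below assumes an in-memory store; the `Calendar` class and its `book` method are hypothetical, and a real deployment would enforce the same constraint in the calendar backend or database.

```python
import hashlib
from typing import Dict, Tuple


class Calendar:
    """In-memory stand-in for a real calendar backend (hypothetical)."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}   # idempotency key -> booking id
        self._by_slot: Dict[str, str] = {}  # slot -> booking id

    def book(self, customer_id: str, slot: str, call_id: str) -> Tuple[str, bool]:
        """Book a slot and return (booking_id, created).

        Repeating the same request within the same call returns the
        original booking with created=False instead of booking twice.
        """
        raw = f"{call_id}|{customer_id}|{slot}".encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key in self._by_key:
            return self._by_key[key], False
        if slot in self._by_slot:
            raise ValueError(f"slot {slot} is already taken")
        booking_id = f"bk-{len(self._by_key) + 1}"
        self._by_key[key] = booking_id
        self._by_slot[slot] = booking_id
        return booking_id, True


cal = Calendar()
print(cal.book("cust-7", "2026-03-02T10:00", "call-A"))  # ('bk-1', True)
print(cal.book("cust-7", "2026-03-02T10:00", "call-A"))  # ('bk-1', False), retry is harmless
```

The retry case matters in voice because a timeout between the LLM and the calendar often triggers a second tool call for the same utterance.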

The agent MUST hand off to a human (human-handoff):

Complaints and grievances, especially when the customer is clearly frustrated or speaking loudly
Any customer request for a human, at any point in the conversation
Financial matters: refunds, plan changes, any account operations
Personal data: changing PESEL, address, payment details
Ambiguous situations where intent classification confidence is below the threshold (e.g., 0.75)
The customer sounds tearful, frightened, or mentions a crisis situation

Hard rule: no irreversible action may be performed by the agent without confirmation by a human or two-step identity verification. Canceling a contract, changing a bank account, deleting an account: these are not tasks for a voice agent working solo.
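
The escalation rules above can be expressed as a single decision function evaluated on every turn. This is a sketch, not a complete policy: the intent labels, the `TurnSignals` fields, and the reason strings are hypothetical, and the sketch takes the stricter reading of the hard rule by always routing irreversible actions to a human rather than offering two-step verification.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.75  # example value from the list above

# Hypothetical intent labels; map them to your own classifier's categories.
HUMAN_ONLY_INTENTS = {"complaint", "refund", "plan_change", "change_personal_data"}
IRREVERSIBLE_INTENTS = {"cancel_contract", "change_bank_account", "delete_account"}


@dataclass
class TurnSignals:
    intent: str
    confidence: float
    asked_for_human: bool = False
    distress_detected: bool = False


def handoff_reason(sig: TurnSignals) -> Optional[str]:
    """Return why the call must go to a human, or None if the agent may continue."""
    if sig.asked_for_human:
        return "customer_request"
    if sig.distress_detected:
        return "distress"
    if sig.intent in IRREVERSIBLE_INTENTS:
        return "irreversible_action"
    if sig.intent in HUMAN_ONLY_INTENTS:
        return "human_only_topic"
    if sig.confidence < CONFIDENCE_THRESHOLD:
        return "low_confidence"
    return None


print(handoff_reason(TurnSignals("check_status", 0.92)))                        # None
print(handoff_reason(TurnSignals("check_status", 0.60)))                        # low_confidence
print(handoff_reason(TurnSignals("check_status", 0.95, asked_for_human=True)))  # customer_request
print(handoff_reason(TurnSignals("delete_account", 0.99)))                      # irreversible_action
```

The explicit customer request is checked first, so no other signal can override it. That ordering mirrors the rule that a request for a human applies at any point in the conversation.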

Guardrails in the voice layer differ from chat. There’s no way to show the customer a link or button. The only escalation path is verbal: the agent announces the transfer, and the customer reaches a human within 30 seconds.

Polish ASR: honest limitations

Polish is challenging for ASR models for several reasons: rich inflection (the same content expressed differently morphologically), long compound words, and regional accents. On top of that, phone lines have limited bandwidth (8 kHz in classic PSTN), which strips the model of some acoustic information.

What this means in practice:

First and last names have a higher WER than general sentences. The same name may appear in the transcript in several spelling variants depending on pronunciation and speaker accent.
Street names, cities, and postal codes are error-prone. The agent should not rely on voice dictation as the only way to input addresses.
Numbers spoken in groups (e.g., phone numbers) are transcribed more reliably than individually. It’s worth asking customers to provide digits in pairs.
Background noise (wind, voices, music) degrades transcription quality more than it does in English, where models have more training data from difficult conditions.

A reasonable policy is: when ASR signals low confidence in the transcript, the agent asks for a repeat once, and after a second failure, escalates to a human without further attempts. A loop with three requests for repetition ruins the conversation experience more than a direct connection to a consultant.
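
The repeat-once-then-escalate policy fits in a few lines. In the sketch below, `ASR_CONFIDENCE_FLOOR` is a hypothetical value (the article does not give one) and the function names are illustrative. The point is only the control flow.

```python
MAX_REPEAT_REQUESTS = 1      # ask once; a second low-confidence result escalates
ASR_CONFIDENCE_FLOOR = 0.6   # hypothetical value; tune it on your own recordings


def next_action(asr_confidence: float, repeats_so_far: int) -> str:
    """Decide what to do with one transcript: proceed, ask again, or escalate."""
    if asr_confidence >= ASR_CONFIDENCE_FLOOR:
        return "proceed"
    if repeats_so_far < MAX_REPEAT_REQUESTS:
        return "ask_repeat"
    return "escalate"


def simulate(confidences):
    """Walk through successive ASR confidences for one question."""
    repeats = 0
    actions = []
    for confidence in confidences:
        action = next_action(confidence, repeats)
        actions.append(action)
        if action != "ask_repeat":
            break
        repeats += 1
    return actions


print(simulate([0.8]))            # ['proceed']
print(simulate([0.4, 0.9]))       # ['ask_repeat', 'proceed']
print(simulate([0.4, 0.3, 0.9]))  # ['ask_repeat', 'escalate']
```

In the last case the third attempt never happens: the second failure goes straight to a consultant.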

Monitoring: what to measure after deployment

Deploying a voice agent without an observability layer is operating blind. Key metrics:

Containment rate: The percentage of calls resolved by the agent without transfer to a human. For simple services (statuses, hours), a realistic target is 50 to 70 percent. A higher result without manual call verification may mean the agent ended the call instead of handling it properly.
Transfer rate: The percentage of customers requesting a consultant. A high transfer rate (above 40 percent) suggests that the agent’s scope is too narrow or that the confidence threshold for escalation is set too high, so too many calls fall below it.
Abandon rate: The percentage of customers hanging up before getting an answer. A direct indicator of poor experience or excessive latency.
WER on production samples: Regular reviews of 50 to 100 random calls by a human, with manual assessment of transcription quality. ASR degrades when the calling population or acoustic conditions change.
Unrecognized intents: The percentage of calls without a match to any category. An increase in this metric signals new types of questions the agent doesn’t handle.
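
Most of these metrics reduce to counting call outcomes. The sketch below assumes a hypothetical call-log schema in which each record carries an `outcome` ("resolved", "transferred", or "abandoned") and the classified `intent`. The four sample records are made up for illustration, not measured data.

```python
from collections import Counter
from typing import Dict, List


def call_metrics(calls: List[Dict[str, str]]) -> Dict[str, float]:
    """Compute outcome rates as fractions of all calls."""
    total = len(calls)
    if total == 0:
        return {}
    outcomes = Counter(c["outcome"] for c in calls)
    no_match = sum(1 for c in calls if c["intent"] == "no_match")
    return {
        "containment_rate": outcomes["resolved"] / total,
        "transfer_rate": outcomes["transferred"] / total,
        "abandon_rate": outcomes["abandoned"] / total,
        "unrecognized_intent_rate": no_match / total,
    }


# Made-up sample records, for illustration only.
sample = [
    {"outcome": "resolved", "intent": "check_status"},
    {"outcome": "resolved", "intent": "opening_hours"},
    {"outcome": "transferred", "intent": "no_match"},
    {"outcome": "abandoned", "intent": "check_status"},
]
print(call_metrics(sample))
# {'containment_rate': 0.5, 'transfer_rate': 0.25, 'abandon_rate': 0.25, 'unrecognized_intent_rate': 0.25}
```

A "resolved" label is only as trustworthy as the process that assigns it, which is why the containment rate needs the manual call verification mentioned above.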

Monitoring is detailed in the article on AI classification and routing of requests. The general monitoring architecture for agents is in the article on AI customer service automation.

RODO and AI Act: what’s mandatory

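A phone conversation with an AI agent involves personal data from the first second. A voice recording is personal data under RODO (the Polish name for GDPR), and it becomes special-category biometric data once it is processed to uniquely identify the speaker, for example through voiceprint verification.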

Mandatory elements of deployment:

Disclosure of AI identity at the start of the conversation (AI Act requirement from August 2, 2026): the customer must know they’re speaking with an automated system before providing any data.
Masking PII before sending the transcript to an external LLM: PESEL numbers, payment card details, and other identifying data must be captured by NER and replaced with tokens before analysis by a cloud model.
Recording retention in line with data storage policy: Recordings cannot be kept without a legal basis and retention period.
Path for exercising the right to data erasure: Recordings and transcripts of a specific customer must be locatable and erasable upon request.
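
The masking step can be illustrated with a rule-based first pass. To be clear about the substitution: the list above names NER, while the sketch below uses regular expressions plus checksum validation (the PESEL check digit and the Luhn algorithm for card numbers), which covers only structured identifiers. It assumes the transcript already contains numerals rather than spelled-out digits. `mask_pii` and its token names are hypothetical.

```python
import re

PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(digits: str) -> bool:
    """Check the PESEL control digit (11th digit) against the first ten."""
    if len(digits) != 11 or not digits.isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(digits, PESEL_WEIGHTS))
    return (10 - total % 10) % 10 == int(digits[10])


def passes_luhn(digits: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_pii(text: str) -> str:
    """Replace valid card numbers and PESEL numbers with placeholder tokens."""

    def replace_card(m):
        digits = re.sub(r"[ -]", "", m.group())
        return "[CARD]" if passes_luhn(digits) else m.group()

    def replace_pesel(m):
        return "[PESEL]" if is_valid_pesel(m.group()) else m.group()

    # Cards first: 13 to 19 digits, optionally separated by spaces or hyphens.
    text = re.sub(r"\b\d(?:[ -]?\d){12,18}\b", replace_card, text)
    return re.sub(r"\b\d{11}\b", replace_pesel, text)


print(mask_pii("PESEL 44051401359, karta 4111 1111 1111 1111"))
# PESEL [PESEL], karta [CARD]
```

Checksum validation reduces false positives, but it also lets a mistranscribed PESEL (one wrong digit) pass through unmasked. A stricter variant would mask every 11-digit run, and names and addresses still require an NER model.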

For deployments with local voice processing, the data-residency risk is minimal. For cloud variants, a Data Processing Agreement (DPA) with the ASR and TTS provider is necessary.

The differences between a voice agent and a chatbot in terms of architecture and design decisions are described in the article voice AI vs chatbot. Broader context on voice AI for businesses can be found in the article voice AI for businesses.

FAQ

What’s a realistic latency for a voice agent in a Polish deployment?

In a local variant (GPU, faster-whisper + small 7B model + streaming TTS), the median full loop is 480 to 1200 ms. In a cloud variant (commercial API), it’s 600 to 1450 ms. Values above 2.5 seconds cause a noticeable increase in abandon rate. Every deployment requires its own measurements on the target infrastructure because phone line jitter and network delays heavily impact the final result.

Does a voice agent work well with Polish accents and dialects?

It depends on the ASR model and training dataset. Models like Whisper large-v3 and commercial APIs (Azure, Google) perform decently with standard Polish, but WER increases with regional accents. A benchmark on a sample of 200 to 500 recordings from your customers is mandatory before deciding on the architecture. Don’t rely on the provider’s general benchmarks.
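
For such a benchmark, WER is the word-level edit distance between a human reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal implementation with hypothetical function names. It lowercases and splits on whitespace only, so punctuation and number formatting should be normalized beforehand.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance, keeping only the previous row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Pooled WER: total edits divided by total reference words."""
    edits = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        edits += edit_distance(ref, hypothesis.lower().split())
        words += len(ref)
    if words == 0:
        raise ValueError("no reference words")
    return edits / words


ref = "proszę o status przesyłki numer pięć"
hyp = "proszę status przesyłki numer dziewięć"
print(round(word_error_rate(ref, hyp), 3))  # 0.333 (one deletion, one substitution, six words)
```

Pooling edits across the whole sample, as `corpus_wer` does, avoids giving very short utterances the same weight as long ones.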

Which conversations MUST the agent hand off to a human?

Any complaint with emotions, any customer request for a human (immediately), any irreversible actions (canceling a contract, changing payment details, deleting an account), and situations where the intent classifier has low confidence. The lack of a clear escalation path is the most common mistake in early deployments. The customer should be able to request a consultant at any time and reach a human within 30 seconds.

Can a voice agent accept payments over the phone?

Not without additional safeguards. Accepting payment card details over voice requires PCI DSS compliance, which is a separate and complex requirement. The approach used in practice is redirecting the customer to a payment page via SMS or email instead of dictating the card number to the agent. Changing bank account details via a voice agent without human confirmation is unacceptable.

How much does deploying a voice agent cost for a small business?

The cost depends on call volume and chosen architecture. A cloud variant (external ASR + LLM API + TTS API) has a low entry threshold, but per-call costs rise with volume. At 100 to 200 calls per day, the local variant becomes cost-effective after 6 to 12 months. A realistic cost estimate for your scenario is provided by the ROI calculator. AI deployment in call centers, including voice bots, is detailed in the article AI in call centers.
