Realtime Voice and Audio Agents

Voice agents are not just chatbots with text-to-speech. They are realtime systems with audio streaming, turn detection, interruption handling, tool calls, and strict latency budgets.

The voice-agent loop

microphone audio
  -> voice activity detection
  -> speech understanding
  -> model reasoning
  -> optional tool calls
  -> streamed speech response
  -> interruption handling
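
One way to read the loop above is as a small turn-taking state machine. The sketch below is illustrative only: the state names, the event names (`speech_start`, `speech_end`, `first_audio`, `playback_done`), and the transition table are assumptions made for this example, not part of any particular voice framework.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    LISTENING = auto()   # waiting for or receiving user speech
    THINKING = auto()    # model reasoning and optional tool calls
    SPEAKING = auto()    # streaming synthesized speech to the user


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.LISTENING, "speech_end"): State.THINKING,
    (State.THINKING, "first_audio"): State.SPEAKING,
    (State.THINKING, "speech_start"): State.LISTENING,   # user resumed talking
    (State.SPEAKING, "speech_start"): State.LISTENING,   # barge-in
    (State.SPEAKING, "playback_done"): State.LISTENING,
}


def step(state: State, event: str) -> State:
    """Return the next state; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events: List[str], state: State = State.LISTENING) -> State:
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["speech_end", "first_audio", "speech_start"])` ends in `State.LISTENING`: the user interrupted while the agent was speaking, so the agent goes back to listening.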

Core components

Component	Job
VAD (voice activity detection)	detect when the user starts and stops speaking
ASR (automatic speech recognition)	convert speech to text or features
Realtime model	reason over audio/text context
Tool runtime	call APIs during the conversation
TTS (text-to-speech)	generate the spoken response
Barge-in	stop speaking when the user interrupts
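
To make the VAD row concrete, here is a minimal energy-based sketch. It assumes audio arrives as frames of float samples in [-1.0, 1.0]. The `threshold` and `hangover_frames` defaults are placeholder values chosen for illustration, and production systems typically use a trained VAD model rather than a fixed energy threshold.

```python
import math
from typing import Iterable, Iterator, List, Tuple


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_turns(
    frames: Iterable[List[float]],
    threshold: float = 0.02,      # assumed energy threshold, tune per microphone
    hangover_frames: int = 3,     # quiet frames required before ending a turn
) -> Iterator[Tuple[str, int]]:
    """Yield ("start", i) when speech begins and ("end", i) when it stops.

    "end" fires only after `hangover_frames` consecutive quiet frames,
    so short pauses inside a sentence do not end the turn.
    """
    speaking = False
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            quiet = 0
            if not speaking:
                speaking = True
                yield ("start", i)
        elif speaking:
            quiet += 1
            if quiet >= hangover_frames:
                speaking = False
                quiet = 0
                yield ("end", i)
```

The hangover count is the main trade-off: a larger value avoids cutting the user off mid-sentence but adds directly to the time needed to detect the end of a turn.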

Latency budget

Humans notice conversational delay quickly, so measure each stage of a turn rather than only the total. Track:

time to detect end of turn
model first-token latency
tool-call latency
speech synthesis first-audio latency
network jitter
interruption latency
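
The stages above can be recorded per turn and compared against a budget. The sketch below is a simplification under stated assumptions: the field names are hypothetical, the stages are treated as sequential and additive (in a fully streamed pipeline they overlap), and network jitter and interruption latency are left out because they are better tracked as separate distributions.

```python
from dataclasses import asdict, dataclass


@dataclass
class TurnLatency:
    """Per-turn stage latencies in milliseconds. Field names are illustrative."""
    end_of_turn_ms: float   # user stops speaking -> end of turn detected
    first_token_ms: float   # request sent -> first model token
    tool_call_ms: float     # total time spent in tool calls (0 if none)
    first_audio_ms: float   # first text available -> first synthesized audio

    def response_ms(self) -> float:
        """Approximate time from the user going quiet to the agent's first audio."""
        return (self.end_of_turn_ms + self.first_token_ms
                + self.tool_call_ms + self.first_audio_ms)

    def slowest_stage(self) -> str:
        """Name of the stage that contributed the most latency."""
        stages = asdict(self)
        return max(stages, key=stages.get)


def over_budget(turn: TurnLatency, budget_ms: float) -> bool:
    """True if the turn's total response time exceeds the budget."""
    return turn.response_ms() > budget_ms
```

With `TurnLatency(300, 250, 0, 150)`, `response_ms()` is 700 and `slowest_stage()` is `"end_of_turn_ms"`, which points at turn detection as the first place to look for savings.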

Design tips

Use streaming everywhere.
Keep tool calls short and cancellable.
Confirm before high-impact actions.
Summarize long context instead of replaying the whole call.
Use short spoken responses.
Handle silence, background noise, and overlapping speech.
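
As a sketch of "short and cancellable" tool calls, the example below wraps a tool coroutine with both a timeout and a cancel signal using Python's `asyncio`. The names `slow_lookup`, `call_tool_with_limits`, and `demo_barge_in` are hypothetical, and the 0.05 s timer merely simulates the user interrupting.

```python
import asyncio


async def slow_lookup(query: str) -> str:
    """Hypothetical tool standing in for a slow API call."""
    await asyncio.sleep(5)
    return f"result for {query}"


async def call_tool_with_limits(coro, timeout_s: float, cancel: asyncio.Event):
    """Run a tool call, stopping on timeout or when `cancel` is set.

    Returns the tool result, or None if it was cancelled or timed out.
    """
    tool_task = asyncio.ensure_future(coro)
    cancel_task = asyncio.ensure_future(cancel.wait())
    done, pending = await asyncio.wait(
        {tool_task, cancel_task},
        timeout=timeout_s,
        return_when=asyncio.FIRST_COMPLETED,
    )
    # Cancel whatever is still running and wait for it to unwind.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    if tool_task in done:
        return tool_task.result()
    return None


async def demo_barge_in():
    """Simulate the user interrupting 50 ms into a slow tool call."""
    cancel = asyncio.Event()
    asyncio.get_running_loop().call_later(0.05, cancel.set)
    return await call_tool_with_limits(
        slow_lookup("order status"), timeout_s=2.0, cancel=cancel
    )
```

Running `asyncio.run(demo_barge_in())` returns `None` shortly after the simulated interruption instead of waiting five seconds for a result the user no longer wants. Cancelling the task only helps if the tool itself is safe to abandon partway, so side-effecting tools need their own rollback or idempotency story.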

Voice safety

A natural-sounding voice can make an agent seem more trustworthy than it is. Add:

explicit confirmations
call recording disclosure where required
safe escalation to humans
transcript review
rate limits
abuse monitoring
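
Explicit confirmation can be enforced in code rather than left to the prompt. The sketch below gates a pending tool call on a spoken confirmation. The tool names in `HIGH_IMPACT_TOOLS` and the phrases in `AFFIRMATIVE` are made-up examples, and a real system would need a more robust way to interpret the user's reply.

```python
from typing import Optional

# Hypothetical tool names that should never run without a spoken confirmation.
HIGH_IMPACT_TOOLS = {"transfer_funds", "cancel_subscription"}

# Assumed set of replies that count as a clear "yes".
AFFIRMATIVE = {"yes", "yes please", "confirm", "go ahead"}


def normalize(utterance: str) -> str:
    """Lowercase a transcript and trim surrounding spaces and punctuation."""
    return utterance.lower().strip(" .!?,")


def decide(tool_name: str, user_reply: Optional[str]) -> str:
    """Return "run", "ask", or "abort" for a pending tool call."""
    if tool_name not in HIGH_IMPACT_TOOLS:
        return "run"
    if user_reply is None:
        return "ask"      # prompt the user and wait for a reply
    if normalize(user_reply) in AFFIRMATIVE:
        return "run"
    return "abort"        # anything unclear counts as a refusal
```

Treating every ambiguous reply as a refusal is a deliberate choice here: a misheard "yes" is more costly than asking the user to repeat the request.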

Knowledge check

Q1: Why is interruption handling important?
Users naturally talk over voice assistants; the system must stop speaking and update context quickly.

Q2: Why should voice tools be cancellable?
Because users can change intent mid-conversation and long actions can create unsafe or stale outcomes.