Realtime Voice and Audio Agents

Design low-latency voice agents with streaming audio, turn detection, interruptions, and tool use

26 min read · voice agents · realtime · audio · multimodal

Voice agents are not just chatbots with text-to-speech. They are realtime systems with audio streaming, turn detection, interruption handling, tool calls, and strict latency budgets.

The voice-agent loop

microphone audio
  -> voice activity detection
  -> speech understanding
  -> model reasoning
  -> optional tool calls
  -> streamed speech response
  -> interruption handling
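
One way to read the loop above is as a small state machine. The sketch below is illustrative only: the three states and the event names (`user_stopped`, `first_audio_ready`, `user_started`, `playback_done`) are assumptions made for this example, not part of any particular voice API. It shows how an interruption during speech returns the agent to listening.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    LISTENING = auto()  # waiting for or receiving user speech
    THINKING = auto()   # model reasoning, possibly calling tools
    SPEAKING = auto()   # streaming synthesized audio to the user


# Hypothetical event names; a real system would derive them from VAD,
# model, and audio-output callbacks.
TRANSITIONS = {
    (State.LISTENING, "user_stopped"): State.THINKING,
    (State.THINKING, "first_audio_ready"): State.SPEAKING,
    (State.THINKING, "user_started"): State.LISTENING,  # user resumes before the reply
    (State.SPEAKING, "user_started"): State.LISTENING,  # barge-in
    (State.SPEAKING, "playback_done"): State.LISTENING,
}


def step(state: State, event: str) -> State:
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events: List[str], state: State = State.LISTENING) -> List[State]:
    """Replay a sequence of events and return every state visited, in order."""
    visited = [state]
    for event in events:
        state = step(state, event)
        visited.append(state)
    return visited


if __name__ == "__main__":
    # The user speaks, the agent starts answering, then the user interrupts.
    for s in run(["user_stopped", "first_audio_ready", "user_started"]):
        print(s.name)
```

Keeping the transitions in one table makes it easy to check that every state has a path back to listening.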

Core components

Component        Job
VAD              Detect when the user starts and stops speaking
ASR              Convert speech to text or features
Realtime model   Reason over audio/text context
Tool runtime     Call APIs during the conversation
TTS              Generate the spoken response
Barge-in         Stop speaking when the user interrupts
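
To make the VAD row concrete, here is a minimal energy-based sketch. It is a teaching example, not a production detector: real systems usually rely on trained VAD models, and the `threshold` and `hangover_frames` defaults below are arbitrary assumptions. A turn starts on the first loud frame and ends only after several consecutive quiet frames, so short pauses do not cut the user off.

```python
import math
from typing import Iterable, Iterator, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_turns(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.02,   # assumed energy threshold, tune per microphone
    hangover_frames: int = 3,  # quiet frames required before ending a turn
) -> Iterator[Tuple[str, int]]:
    """Yield ("start", i) and ("end", i) events with the frame index i."""
    speaking = False
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            quiet = 0
            if not speaking:
                speaking = True
                yield ("start", i)
        elif speaking:
            quiet += 1
            if quiet >= hangover_frames:
                speaking = False
                quiet = 0
                yield ("end", i)


if __name__ == "__main__":
    silence, speech = [0.0] * 160, [0.1] * 160
    stream = [silence, speech, speech, silence, speech] + [silence] * 4
    print(list(detect_turns(stream)))
```

The hangover count trades responsiveness against false end-of-turn events: a larger value tolerates longer pauses but adds delay before the agent replies.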

Latency budget

Listeners notice conversational delay quickly, so measure each stage separately instead of only the end-to-end time. Track:

  • time to detect end of turn
  • model first-token latency
  • tool-call latency
  • speech synthesis first-audio latency
  • network jitter
  • interruption latency
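
The metrics above can be collected per turn with a small helper like the following sketch. The class name `LatencyTracker`, the stage names, and the 800 ms budget are placeholders chosen for illustration; pick a budget from your own measurements.

```python
import time
from contextlib import contextmanager
from typing import Dict, Optional


class LatencyTracker:
    """Collects per-stage durations in milliseconds for one conversational turn."""

    def __init__(self, budget_ms: float) -> None:
        self.budget_ms = budget_ms
        self.stages_ms: Dict[str, float] = {}

    def record(self, name: str, ms: float) -> None:
        """Add an externally measured duration to a stage."""
        self.stages_ms[name] = self.stages_ms.get(name, 0.0) + ms

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and record it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, (time.perf_counter() - start) * 1000.0)

    def total_ms(self) -> float:
        return sum(self.stages_ms.values())

    def over_budget(self) -> bool:
        return self.total_ms() > self.budget_ms

    def slowest(self) -> Optional[str]:
        """Name of the stage with the largest recorded time, if any."""
        if not self.stages_ms:
            return None
        return max(self.stages_ms, key=self.stages_ms.get)


if __name__ == "__main__":
    tracker = LatencyTracker(budget_ms=800)  # placeholder budget
    with tracker.stage("model_first_token"):
        time.sleep(0.01)  # stand-in for the real model call
    print(tracker.stages_ms, tracker.over_budget(), tracker.slowest())
```

Reporting the slowest stage alongside the total shows where to optimize first when a turn goes over budget.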

Design tips

  • Use streaming everywhere.
  • Keep tool calls short and cancellable.
  • Confirm before high-impact actions.
  • Summarize long context instead of replaying the whole call.
  • Use short spoken responses.
  • Handle silence, background noise, and overlapping speech.
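
As an illustration of "short and cancellable" tool calls, the sketch below wraps a tool coroutine so that it stops on a timeout or when the user interrupts. It uses only the standard `asyncio` library; `slow_lookup` and `call_tool` are hypothetical names for this example, and the timeout values are arbitrary.

```python
import asyncio
from typing import Any, Awaitable, Tuple


async def slow_lookup(query: str) -> str:
    """Hypothetical tool standing in for a slow network call."""
    await asyncio.sleep(5)
    return f"result for {query}"


async def call_tool(
    coro: Awaitable[Any], timeout_s: float, cancel: asyncio.Event
) -> Tuple[str, Any]:
    """Run a tool, stopping on timeout or when `cancel` is set.

    Returns (status, value) where status is "ok", "timeout", or "cancelled".
    """
    task = asyncio.ensure_future(coro)
    waiter = asyncio.ensure_future(cancel.wait())
    done, _ = await asyncio.wait(
        {task, waiter}, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED
    )
    if task in done:
        waiter.cancel()
        await asyncio.gather(waiter, return_exceptions=True)
        return ("ok", task.result())
    # The tool did not finish: cancel both tasks and wait for them to unwind.
    task.cancel()
    waiter.cancel()
    await asyncio.gather(task, waiter, return_exceptions=True)
    return ("cancelled" if cancel.is_set() else "timeout", None)


async def demo() -> Tuple[str, Any]:
    cancel = asyncio.Event()
    # Simulate the user interrupting 50 ms into the tool call.
    asyncio.get_running_loop().call_later(0.05, cancel.set)
    return await call_tool(slow_lookup("order status"), timeout_s=2.0, cancel=cancel)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning a status instead of raising lets the agent say something useful, for example acknowledging the interruption, rather than failing silently.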

Voice safety

A natural-sounding voice can make an agent seem more trustworthy than it is. Add these safeguards:

  • explicit confirmations
  • call recording disclosure where required
  • safe escalation to humans
  • transcript review
  • rate limits
  • abuse monitoring
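
Explicit confirmation can be sketched as a gate that holds high-impact tool calls until the user clearly says yes. Everything named here (`ConfirmationGate`, `PendingAction`, the tool names, and the list of affirmative phrases) is an assumption for illustration; a real agent would need more robust intent detection than exact phrase matching.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed phrases that count as an explicit yes.
AFFIRMATIVE = {"yes", "yes please", "confirm", "go ahead"}


@dataclass
class PendingAction:
    tool: str
    args: dict


class ConfirmationGate:
    """Holds high-impact tool calls until the user explicitly confirms."""

    def __init__(self, high_impact: Iterable[str]) -> None:
        self.high_impact = set(high_impact)
        self.pending: Optional[PendingAction] = None

    def request(self, tool: str, args: dict) -> str:
        """Return "execute" for low-impact tools; otherwise hold and return "confirm"."""
        if tool in self.high_impact:
            self.pending = PendingAction(tool, args)
            return "confirm"
        return "execute"

    def resolve(self, user_reply: str) -> Optional[PendingAction]:
        """Release the held action only on an explicit yes; any other reply drops it."""
        action, self.pending = self.pending, None
        if action is not None and user_reply.strip().lower() in AFFIRMATIVE:
            return action
        return None


if __name__ == "__main__":
    gate = ConfirmationGate({"transfer_funds"})  # hypothetical tool name
    print(gate.request("transfer_funds", {"amount": 50}))
    print(gate.resolve("go ahead"))
```

Dropping the pending action on any unclear reply is the conservative choice: the agent has to ask again rather than act on an ambiguous answer.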

Knowledge check

Q1: Why is interruption handling important?
Users naturally talk over voice assistants; the system must stop speaking and update context quickly.

Q2: Why should voice tools be cancellable?
Because users can change intent mid-conversation and long actions can create unsafe or stale outcomes.