How We Achieved Sub-800ms Voice AI Latency for Indian Accents
Getting voice AI to feel natural requires sub-800ms end-to-end latency. Here is the exact technical stack and optimisations that got us there — and what we tried that did not work.
Why 800ms?
Research shows callers hang up after 2 seconds of silence. With 800ms latency, you have 1,200ms of buffer — enough for a natural pause without sounding broken.
Getting to 800ms end-to-end (from end of user speech to start of AI voice playback) required optimising every step of the pipeline.
Our Pipeline
Twilio WebSocket (mulaw 8kHz)
→ Deepgram Nova-2 streaming STT (~160ms)
→ Turn detection (500ms silence threshold)
→ LLM streaming (Claude Haiku ~180ms to first token)
→ ElevenLabs streaming TTS (~60ms to first byte)
→ Twilio audio playback
The Key Optimisations
1. Parallel processing
Don't wait for LLM to finish — start TTS as soon as the first sentence segment arrives. This saved ~400ms.
2. Streaming everything
Deepgram streams transcripts. We send partial transcripts to the LLM after 200ms of inactivity. The LLM starts generating before the user finishes speaking.
3. Prompt caching
LiteLLM prompt caching for the system prompt reduces LLM latency by ~40%.
4. Regional STT for Indian accents
Deepgram Nova-2 accuracy for Indian English: 84%. Sarvam AI for Hindi: 91%. We route by detected language.
What Didn't Work
- ✓OpenAI Whisper: Too slow (800ms+ just for STT)
- ✓ElevenLabs Flash v2: Artifacts on Indian English
- ✓Groq: Fast but quality inconsistent at high load
Current Benchmark
P50 latency: 720ms. P95: 980ms. P99: 1,340ms.
The P99 cases are network issues on 2G/Edge connections. We now detect poor connections and switch to a lighter TTS voice.
Writing about AI automation, India SMBs, and building products that work for the next billion users.
Ready to try it for your business?
7-day free trial. No credit card. Setup in 30 minutes.
Start Free Trial