A startup shipped a voice-first customer support agent. The pitch: "talk to us like a person." Users called in about billing disputes -- issues that required reading line items, comparing dates, and referencing policy documents. The voice agent read back a 200-word paragraph. Users interrupted to ask "wait, what was the third charge?" The agent had no way to let them scroll back. Average call time tripled. Customer satisfaction dropped 18%.
The interface was wrong for the task. The model was fine. The engineering was competent. But voice cannot do what a scrollable transcript can. This lesson teaches you how to make that decision before you build.
Chat and voice are not interchangeable skins over the same logic. They impose fundamentally different constraints on memory, latency, and error recovery.
Chat is asynchronous by nature. The user types, waits, reads. They can re-read. They can scroll up. They can copy-paste. The interaction has a built-in paper trail that reduces cognitive load. The conversation state is visible on screen at all times.
Voice is synchronous by nature. The user speaks; the agent listens and responds in real time. There is no scrollback. There is no "let me re-read that." If the agent fumbles, the user heard it happen. The conversation state lives only in working memory.
This distinction drives every engineering choice that follows.
| Scenario | Why Chat Works |
|----------|----------------|
| Complex information delivery | Users need to reference, copy, or re-read |
| Code or structured data | Formatting matters -- markdown, tables, JSON |
| Multi-step workflows | Users need to see progress and go back |
| Sensitive topics | Users want time to think before responding |
| Noisy environments | Audio input/output is unreliable |
| Accessibility (visual) | Screen readers handle text well |
Chat has a lower engineering bar. Streaming text over SSE is well-understood. The AI SDK's `useChat` hook handles the streaming protocol, message state, and error handling. You can ship a production chat interface in a day.
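To see why the chat side is so tractable, here is a minimal sketch of the client-side half of SSE streaming: splitting a raw `text/event-stream` buffer into the `data:` payloads a chat UI would append to the visible message. This is a generic event-stream parser for illustration, not the AI SDK's internal wire protocol.

```typescript
// Minimal SSE frame parser: events are separated by blank lines, and
// each event carries its payload on "data:" lines. "[DONE]" is a common
// (but not universal) terminator convention, assumed here.
function parseSseData(buffer: string): string[] {
  return buffer
    .split("\n\n") // one event per blank-line-separated block
    .flatMap((event) =>
      event
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).trim())
    )
    .filter((payload) => payload.length > 0 && payload !== "[DONE]");
}

// Three streamed chunks followed by the terminator:
const raw = "data: Hel\n\ndata: lo\n\ndata: [DONE]\n\n";
console.log(parseSseData(raw).join("")); // "Hello"
```

In production you would read chunks from a `ReadableStream` and buffer partial events across chunk boundaries, but the framing itself is this simple -- which is exactly why chat ships fast.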
| Scenario | Why Voice Works |
|----------|-----------------|
| Hands-free operation | Driving, cooking, exercising |
| Speed of input | Speaking is 3-4x faster than typing |
| Emotional connection | Voice creates intimacy and trust |
| Accessibility (motor) | Users with motor impairments or low literacy |
| Onboarding and guided flows | A voice agent can lead naturally |
| Short, focused queries | "What is my next meeting?" |
Voice demands more engineering investment. You need real-time audio transport (WebRTC), speech-to-text, text-to-speech, voice activity detection, turn-taking, interruption handling, and total mouth-to-ear latency under 1 second to feel natural. Each of these is its own failure mode.
The acceptable latency differs by modality. Get this wrong and the interface feels broken regardless of response quality.
| Metric | Chat Target | Voice Target |
|--------|-------------|--------------|
| Time to first token (TTFT) | Under 500ms | Under 400ms (LLM stage) |
| Total response delivery | Under 2s for first sentence | Under 1000ms mouth-to-ear |
| Tool execution visible to user | Acceptable up to 3s with indicator | Must be under 500ms or use filler |
| Reconnection after drop | Background, user may not notice | Immediate -- silence is failure |
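The 1000ms mouth-to-ear target is a budget, not a single number: every pipeline stage spends part of it. A sketch of that decomposition, where every stage value except the LLM TTFT target is an illustrative assumption:

```typescript
// Mouth-to-ear latency budget for a voice pipeline. All stage values
// are illustrative assumptions except the LLM TTFT target (under 400ms).
const voiceBudgetMs = {
  vadAndEndpointing: 200, // detecting that the user stopped speaking (assumed)
  stt: 150,               // streaming speech-to-text finalization (assumed)
  llmTtft: 400,           // LLM time to first token (target)
  ttsFirstAudio: 150,     // first synthesized audio chunk (assumed)
  network: 100,           // WebRTC transport in both directions (assumed)
};

const totalMs = Object.values(voiceBudgetMs).reduce((a, b) => a + b, 0);
console.log(totalMs); // 1000 -- already at the edge of "feels natural"
```

The point of the exercise: the LLM alone can consume 40% of the budget, so every other stage must stream and overlap rather than run sequentially to completion.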
A 200ms delay in chat is invisible. A 200ms delay in voice is noticeable. A 2-second delay in chat is acceptable. A 2-second delay in voice feels broken.
The most robust systems offer both. This is what celestino.ai runs in production.
```
[User] --text---> [Chat API]     --streamText----> [Response]
[User] --voice--> [LiveKit Room] --STT/LLM/TTS--> [Audio Response]
                        |
                 [Shared Session]
                        |
             [Same DB, Same History]
```
Users start in chat. When they click the microphone button, the app transitions to a full-screen voice experience powered by LiveKit. Both modalities share the same session ID, the same conversation history in Supabase, and the same RAG knowledge base.
The key engineering decision: the voice agent syncs its transcripts back to chat. When the user returns from voice mode, they see the full conversation -- both what they typed and what they said. This solves voice's biggest weakness (no paper trail) without sacrificing its strengths.
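The sync itself can be as simple as writing both modalities into one timeline. A sketch, assuming a hypothetical `Message` shape -- the field names here are illustrative, not the actual Supabase schema:

```typescript
// Hypothetical message shape for the shared conversation history.
interface Message {
  role: "user" | "assistant";
  content: string;
  modality: "chat" | "voice"; // where this turn originated
  ts: number;                 // epoch milliseconds
}

// Merge typed and spoken turns into one timeline, so returning from
// voice mode shows the full conversation in order.
function mergeHistory(chat: Message[], voice: Message[]): Message[] {
  return [...chat, ...voice].sort((a, b) => a.ts - b.ts);
}

const merged = mergeHistory(
  [{ role: "user", content: "What was the third charge?", modality: "chat", ts: 1 }],
  [{ role: "assistant", content: "A $12.40 renewal.", modality: "voice", ts: 2 }],
);
console.log(merged.map((m) => m.modality)); // ["chat", "voice"]
```

Tagging each turn with its originating modality also lets the UI render spoken turns differently (e.g. with a playback icon) without maintaining two histories.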
Voice is significantly more expensive to operate:
| Cost Factor | Chat | Voice |
|-------------|------|-------|
| STT | None | $0.006/min (Deepgram, ElevenLabs Scribe) |
| TTS | None | $0.015-$0.10 per 1,000 chars (ElevenLabs) |
| Real-time infra | SSE over HTTP | WebRTC TURN servers, media routing |
| LLM inference | Same | Same (voice adds audio conversion on both ends) |
| Relative cost per conversation | 1x | 3-5x |
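A back-of-envelope estimate using the per-unit prices above makes the markup concrete. The conversation length, character count, and mid-range TTS price are assumptions for illustration:

```typescript
// Voice cost added on top of identical LLM inference, per conversation.
const minutes = 5;          // assumed conversation length
const ttsChars = 3000;      // assumed characters of agent speech
const sttPerMin = 0.006;    // STT price from the table, $/min
const ttsPer1kChars = 0.05; // assumed mid-range of the table's $0.015-$0.10

const voiceSurcharge = minutes * sttPerMin + (ttsChars / 1000) * ttsPer1kChars;
console.log(voiceSurcharge.toFixed(3)); // "0.180" per conversation, before infra
```

Pennies per conversation sounds small until it multiplies across thousands of daily calls plus WebRTC infrastructure -- which is where the 3-5x relative cost comes from.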
For a bootstrapped product, start with chat. Add voice when you have proven the conversational UX works and have revenue to fund the infrastructure.
When choosing between chat, voice, or hybrid, answer these five questions:
Before writing any code, create a modality decision document for your agent:
This document becomes the engineering spec that drives your architecture decisions for the rest of the course.
You have decided which modality to build. But a powerful model behind the wrong conversation structure is still a bad product. Next, we cover Conversation Design for AI Agents -- the principles that determine whether users trust your agent, regardless of whether they are typing or talking.