A startup shipped a voice-first customer support agent. The pitch: "talk to us like a person." Users called in about billing disputes -- issues that required reading line items, comparing dates, and referencing policy documents. The voice agent read back a 200-word paragraph. Users interrupted to ask "wait, what was the third charge?" The agent had no way to let them scroll back. Average call time tripled. Customer satisfaction dropped 18%.
The interface was wrong for the task. The model was fine. The engineering was competent. But voice cannot do what a scrollable transcript can. This lesson teaches you how to make that decision before you build.
Chat and voice are not interchangeable skins over the same logic. They impose fundamentally different constraints on memory, latency, and error recovery.
Chat is asynchronous by nature. The user types, waits, reads. They can re-read. They can scroll up. They can copy-paste. The interaction has a built-in paper trail that reduces cognitive load. The conversation state is visible on screen at all times.
Voice is synchronous by nature. The user speaks; the agent listens and responds in real time. There is no scrollback. There is no "let me re-read that." If the agent fumbles, the user heard it happen. The conversation state lives only in working memory.
This distinction drives every engineering choice that follows.
| Scenario | Why Chat Works |
|----------|----------------|
| Complex information delivery | Users need to reference, copy, or re-read |
| Code or structured data | Formatting matters -- markdown, tables, JSON |
| Multi-step workflows | Users need to see progress and go back |
| Sensitive topics | Users want time to think before responding |
| Noisy environments | Audio input/output is unreliable |
| Accessibility (visual) | Screen readers handle text well |
Chat has a lower engineering bar. Streaming text over SSE is well-understood. The AI SDK's `useChat` hook handles the streaming protocol, message state, and error handling. You can ship a production chat interface in a day.
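To see why the chat side is so tractable, here is a minimal sketch of the client-side half of SSE streaming: splitting a raw `text/event-stream` buffer into the `data:` payloads a chat UI would append to the visible message. This is a generic event-stream parser for illustration, not the AI SDK's internal wire protocol.

```typescript
// Minimal SSE frame parser: events are separated by blank lines, and
// each event carries its payload on "data:" lines. "[DONE]" is a common
// (but not universal) terminator convention, assumed here.
function parseSseData(buffer: string): string[] {
  return buffer
    .split("\n\n") // one event per blank-line-separated block
    .flatMap((event) =>
      event
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).trim())
    )
    .filter((payload) => payload.length > 0 && payload !== "[DONE]");
}

// Three streamed chunks followed by the terminator:
const raw = "data: Hel\n\ndata: lo\n\ndata: [DONE]\n\n";
console.log(parseSseData(raw).join("")); // "Hello"
```

In production you would read chunks from a `ReadableStream` and buffer partial events across chunk boundaries, but the framing itself is this simple -- which is exactly why chat ships fast.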
| Scenario | Why Voice Works |
|----------|-----------------|
| Hands-free operation | Driving, cooking, exercising |
| Speed of input | Speaking is 3-4x faster than typing |
| Emotional connection | Voice creates intimacy and trust |
| Accessibility (motor) | Users with motor impairments or low literacy |
| Onboarding and guided flows | A voice agent can lead naturally |
| Short, focused queries | "What is my next meeting?" |
Voice demands more engineering investment. You need real-time audio transport (WebRTC), speech-to-text, text-to-speech, voice activity detection, turn-taking, interruption handling, and total mouth-to-ear latency under 1 second to feel natural. Each of these is its own failure mode.
The acceptable latency differs by modality. Get this wrong and the interface feels broken regardless of response quality.
| Metric | Chat Target | Voice Target |
|--------|-------------|--------------|
| Time to first token (TTFT) | Under 500ms | Under 400ms (LLM stage) |
| Total response delivery | Under 2s for first sentence | Under 1000ms mouth-to-ear |
| Tool execution visible to user | Acceptable up to 3s with indicator | Must be under 500ms or use filler |
| Reconnection after drop | Background, user may not notice | Immediate -- silence is failure |
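The 1000ms mouth-to-ear target is a budget, not a single number: every pipeline stage spends part of it. A sketch of that decomposition, where every stage value except the LLM TTFT target is an illustrative assumption:

```typescript
// Mouth-to-ear latency budget for a voice pipeline. All stage values
// are illustrative assumptions except the LLM TTFT target (under 400ms).
const voiceBudgetMs = {
  vadAndEndpointing: 200, // detecting that the user stopped speaking (assumed)
  stt: 150,               // streaming speech-to-text finalization (assumed)
  llmTtft: 400,           // LLM time to first token (target)
  ttsFirstAudio: 150,     // first synthesized audio chunk (assumed)
  network: 100,           // WebRTC transport in both directions (assumed)
};

const totalMs = Object.values(voiceBudgetMs).reduce((a, b) => a + b, 0);
console.log(totalMs); // 1000 -- already at the edge of "feels natural"
```

The point of the exercise: the LLM alone can consume 40% of the budget, so every other stage must stream and overlap rather than run sequentially to completion.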
A 200ms delay in chat is invisible. A 200ms delay in voice is noticeable. A 2-second delay in chat is acceptable. A 2-second delay in voice feels broken.
The most robust systems offer both. This is what celestino.ai runs in production.
```
[User] --text---> [Chat API]     --streamText----> [Response]
[User] --voice--> [LiveKit Room] --STT/LLM/TTS--> [Audio Response]
                        |
                 [Shared Session]
                        |
             [Same DB, Same History]
```
Users start in chat. When they click the microphone button, the app transitions to a full-screen voice experience powered by LiveKit. Both modalities share the same session ID, the same conversation history in Supabase, and the same RAG knowledge base.
The key engineering decision: the voice agent syncs its transcripts back to chat. When the user returns from voice mode, they see the full conversation -- both what they typed and what they said. This solves voice's biggest weakness (no paper trail) without sacrificing its strengths.
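The sync itself can be as simple as writing both modalities into one timeline. A sketch, assuming a hypothetical `Message` shape -- the field names here are illustrative, not the actual Supabase schema:

```typescript
// Hypothetical message shape for the shared conversation history.
interface Message {
  role: "user" | "assistant";
  content: string;
  modality: "chat" | "voice"; // where this turn originated
  ts: number;                 // epoch milliseconds
}

// Merge typed and spoken turns into one timeline, so returning from
// voice mode shows the full conversation in order.
function mergeHistory(chat: Message[], voice: Message[]): Message[] {
  return [...chat, ...voice].sort((a, b) => a.ts - b.ts);
}

const merged = mergeHistory(
  [{ role: "user", content: "What was the third charge?", modality: "chat", ts: 1 }],
  [{ role: "assistant", content: "A $12.40 renewal.", modality: "voice", ts: 2 }],
);
console.log(merged.map((m) => m.modality)); // ["chat", "voice"]
```

Tagging each turn with its originating modality also lets the UI render spoken turns differently (e.g. with a playback icon) without maintaining two histories.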
Voice is significantly more expensive to operate:
| Cost Factor | Chat | Voice |
|-------------|------|-------|
| STT | None | $0.006/min (Deepgram, ElevenLabs Scribe) |
| TTS | None | $0.015-$0.10 per 1,000 chars (ElevenLabs) |
| Real-time infra | SSE over HTTP | WebRTC TURN servers, media routing |
| LLM inference | Same | Same (voice adds audio conversion on both ends) |
| Relative cost per conversation | 1x | 3-5x |
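A back-of-envelope estimate using the per-unit prices above makes the markup concrete. The conversation length, character count, and mid-range TTS price are assumptions for illustration:

```typescript
// Voice cost added on top of identical LLM inference, per conversation.
const minutes = 5;          // assumed conversation length
const ttsChars = 3000;      // assumed characters of agent speech
const sttPerMin = 0.006;    // STT price from the table, $/min
const ttsPer1kChars = 0.05; // assumed mid-range of the table's $0.015-$0.10

const voiceSurcharge = minutes * sttPerMin + (ttsChars / 1000) * ttsPer1kChars;
console.log(voiceSurcharge.toFixed(3)); // "0.180" per conversation, before infra
```

Pennies per conversation sounds small until it multiplies across thousands of daily calls plus WebRTC infrastructure -- which is where the 3-5x relative cost comes from.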
For a bootstrapped product, start with chat. Add voice when you have proven the conversational UX works and have revenue to fund the infrastructure.
When choosing between chat, voice, or hybrid, answer these five questions:
Before writing any code, create a modality decision document for your agent:
This document becomes the engineering spec that drives your architecture decisions for the rest of the course.
You have decided which modality to build. But a powerful model behind the wrong conversation structure is still a bad product. Next, we cover Conversation Design for AI Agents -- the principles that determine whether users trust your agent, regardless of whether they are typing or talking.