A team used OpenAI's Realtime API for their voice agent. It worked well -- until they needed a specific voice that OpenAI did not offer. Their brand required a particular vocal quality, and the limited voice options were a dealbreaker. They also discovered that their Spanish-speaking users got poor transcription accuracy because the built-in STT was tuned for English. They could not swap the STT without swapping the entire model. They were locked in.
The pipeline approach solves this. Instead of one model doing everything, you compose a pipeline from best-in-class components: choose the STT that handles your languages, the LLM that fits your latency budget, and the TTS that sounds like your brand. LiveKit handles the real-time transport, turn detection, and interruption management. You control every stage.
A LiveKit voice agent follows this flow:
[User's Microphone]
     | Audio frames
     v
+---------+
|   VAD   |   -- Voice Activity Detection
+----+----+
     | Speech detected
     v
+---------+
|   STT   |   -- Speech-to-Text (e.g., ElevenLabs Scribe)
+----+----+
     | Transcript text
     v
+---------+
|   LLM   |   -- Language Model (e.g., Gemini 2.5 Flash)
+----+----+
     | Response tokens
     v
+---------+
|   TTS   |   -- Text-to-Speech (e.g., ElevenLabs Flash v2.5)
+----+----+
     | Audio frames
     v
[User's Speaker]
Each stage is independent. You can swap ElevenLabs for Deepgram STT, or Gemini for Claude, or Cartesia for ElevenLabs TTS -- without touching the rest of the pipeline.
LiveKit Agents 1.0 introduced AgentSession as the unified orchestrator. Here is the production configuration from celestino.ai:
import { fileURLToPath } from 'node:url';
import {
  cli, JobContext, WorkerOptions,
  defineAgent, voice, llm, inference,
} from '@livekit/agents';
import { z } from 'zod';

// `systemPrompt` and `retrieveContext` are application-specific helpers,
// defined elsewhere in the celestino.ai codebase.

export default defineAgent({
  entry: async (ctx: JobContext) => {
    await ctx.connect();
    const participant = await ctx.waitForParticipant();

    // Initialize each pipeline stage independently
    const stt = new inference.STT({
      model: 'elevenlabs/scribe_v2_realtime',
      language: 'en',
    });
    const llmModel = new inference.LLM({
      model: 'google/gemini-2.5-flash',
    });
    const tts = new inference.TTS({
      model: 'elevenlabs/eleven_flash_v2_5',
      voice: 'cjVigY5qzO86Huf0OWal',
      language: 'en',
    });

    // Create the agent with instructions and tools
    const agent = new voice.Agent({
      instructions: systemPrompt,
      tools: {
        search: llm.tool({
          description: 'Search the knowledge base',
          parameters: z.object({
            query: z.string().describe('The search query'),
          }),
          execute: async ({ query }) => {
            const docs = await retrieveContext(query);
            return docs.map((d) => d.content).join('\n\n');
          },
        }),
      },
    });

    // Create the session -- this wires up the full pipeline
    const session = new voice.AgentSession({
      stt,
      llm: llmModel,
      tts,
    });

    // Start: connects the pipeline to the room
    await session.start({
      agent,
      room: ctx.room,
      inputOptions: {
        participantIdentity: participant.identity,
      },
    });
  },
});

// Register the agent with the worker (standard agents-js entrypoint)
cli.runApp(new WorkerOptions({ agent: fileURLToPath(import.meta.url) }));
The defineAgent + AgentSession pattern separates concerns. The agent defines what to say (instructions, tools). The session defines how to process audio (STT, LLM, TTS pipeline stages). Swapping a provider means changing one line, not rewriting the agent.
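For example, switching the STT stage for the Spanish-speaking users from the opening story is a small config change. (The `deepgram/nova-3` model ID below is illustrative -- confirm exact model identifiers in your inference provider's catalog.)

```typescript
// Illustrative: swap only the STT stage; LLM and TTS are untouched.
const stt = new inference.STT({
  model: 'deepgram/nova-3', // was 'elevenlabs/scribe_v2_realtime'
  language: 'es',           // pick the language your users actually speak
});
```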
VAD determines when the user is speaking versus when there is background noise. Without it, your agent tries to transcribe silence, dog barks, and keyboard clicks.
import * as silero from '@livekit/agents-plugin-silero';

const vad = await silero.VAD.load();

const session = new voice.AgentSession({
  stt,
  llm: llmModel,
  tts,
  vad, // Silero VAD filters non-speech audio
});
Silero VAD is a neural network specifically trained to distinguish speech from noise. It runs locally (no API call), adding negligible latency. Without VAD, you will see phantom transcripts from environmental noise -- a significant problem in non-studio environments.
Turn detection answers: "Has the user finished speaking?" Get this wrong and either the agent interrupts mid-sentence (too aggressive) or there is a long, awkward pause after every utterance (too conservative).
LiveKit provides multiple turn detection modes:
// Option 1: STT-based (uses the STT model's endpoint detection)
const turnDetection = 'stt';

// Option 2: Multilingual neural turn detector (use instead of Option 1)
// import * as livekitPlugin from '@livekit/agents-plugin-livekit';
// const turnDetection = new livekitPlugin.turnDetector.MultilingualModel();
You fine-tune endpointing and interruption behavior through voiceOptions:
const session = new voice.AgentSession({
  stt,
  llm: llmModel,
  tts,
  vad,
  turnDetection,
  voiceOptions: {
    minEndpointingDelay: 1000,    // Minimum silence (ms) before responding
    maxEndpointingDelay: 5000,    // Maximum wait time (ms)
    minInterruptionDuration: 800, // How long (ms) user must speak to interrupt
    minInterruptionWords: 2,      // Minimum words to count as interruption
    preemptiveGeneration: true,   // Start LLM while user may still be talking
  },
});
These values come from real user testing. The minEndpointingDelay of 1000ms is generous -- it prevents the agent from cutting off users who pause to think. The minInterruptionWords of 2 prevents single-syllable backchannels ("mm", "yeah") from being treated as interruptions.
Production voice agents encounter background noise, other voices, and echo. LiveKit provides noise cancellation as a pipeline input option:
import { BackgroundVoiceCancellation } from '@livekit/noise-cancellation-node';

const noiseCancellation = BackgroundVoiceCancellation();

await session.start({
  agent,
  room: ctx.room,
  inputOptions: {
    participantIdentity: participant.identity,
    noiseCancellation, // Filters background voices and noise
  },
});
This runs on the server side, filtering the audio before it reaches STT. The result is dramatically better transcription accuracy in non-ideal environments -- coffee shops, open offices, rooms with other speakers.
One of the hardest problems in hybrid agents is keeping voice and chat in sync. The solution is syncing messages through LiveKit's data channel and your database simultaneously:
import { v4 as uuidv4 } from 'uuid';
import type { Room } from '@livekit/rtc-node';
import { llm, voice } from '@livekit/agents';

// `saveMessage` is the application's own persistence helper.

class UnifiedAgent extends voice.Agent {
  private room: Room;

  // Reuse voice.Agent's own options type so this stays valid across versions
  constructor(opts: ConstructorParameters<typeof voice.Agent>[0], room: Room) {
    super(opts);
    this.room = room;
  }

  async syncMessage(msg: llm.ChatMessage) {
    const content = msg.textContent;
    if (!content) return;

    const payload = {
      id: uuidv4(),
      role: msg.role === 'user' ? 'user' : 'assistant',
      content,
      createdAt: new Date().toISOString(),
    };

    // 1. Save to database (persistent state)
    await saveMessage(this.room.name, payload);

    // 2. Broadcast to frontend via data channel (real-time sync)
    const data = new TextEncoder().encode(
      JSON.stringify({ type: 'chat_update', message: payload }),
    );
    await this.room.localParticipant.publishData(data, { reliable: true });
  }
}
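The class above only defines syncMessage; something still has to call it whenever a new conversation item is finalized. One way to wire that up -- assuming the session emits a conversation-item event, as recent agents-js releases do (check the event name against your installed version) -- is:

```typescript
// Sketch: forward each finalized conversation item to syncMessage.
// Event name and payload shape are assumptions -- verify in your agents-js version.
session.on('conversationItemAdded', (ev) => {
  void unifiedAgent.syncMessage(ev.item);
});
```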
On the frontend, listen for data channel messages and append them to the chat:
room.on(RoomEvent.DataReceived, (payload) => {
  const data = JSON.parse(new TextDecoder().decode(payload));
  if (data.type === 'chat_update' && data.message) {
    setMessages((prev) => [...prev, data.message]);
  }
});
Speak in voice mode, see transcripts in chat mode. The session is continuous across modalities because the conversation state is shared.
A voice agent that forgets previous conversations is frustrating. Loading conversation history into the LLM context gives the agent memory:
const history = await loadRecentMessages(roomName, userId, 20);

const chatCtx = llm.ChatContext.empty();
for (const item of history) {
  chatCtx.addMessage({
    role: item.role,
    content: item.content,
    id: item.id,
  });
}

const agent = new voice.Agent({
  instructions: systemPrompt,
  chatCtx, // Pre-loaded conversation history
  tools: { /* ... */ },
});
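The snippet above assumes a loadRecentMessages helper and its backing store. Here is a minimal in-memory sketch showing the expected contract (a real deployment would query your database instead): return the most recent `limit` messages for a room and user, oldest first, so they append to the ChatContext in conversation order.

```typescript
// Hypothetical sketch of the persistence helpers -- in-memory for illustration.
interface StoredMessage {
  id: string;
  roomName: string;
  userId: string;
  role: 'user' | 'assistant';
  content: string;
  createdAt: string; // ISO timestamp
}

const messageStore: StoredMessage[] = [];

async function saveMessage(msg: StoredMessage): Promise<void> {
  messageStore.push(msg);
}

// Most recent `limit` messages, returned oldest-first so they can be
// appended to the ChatContext in conversation order.
async function loadRecentMessages(
  roomName: string,
  userId: string,
  limit: number,
): Promise<StoredMessage[]> {
  return messageStore
    .filter((m) => m.roomName === roomName && m.userId === userId)
    .sort((a, b) => a.createdAt.localeCompare(b.createdAt))
    .slice(-limit);
}
```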
"Last time we talked about your experience with LiveKit" is a dramatically better opening than "Hello, how can I help you?" Memory is what makes a conversation feel like a conversation instead of a series of disconnected queries.
For a voice agent to feel natural, total mouth-to-ear latency should be under 1 second.
| Stage | Target | Notes |
|-------|--------|-------|
| VAD + audio capture | 50-100ms | Depends on buffer size |
| STT | 100-300ms | Streaming STT is faster |
| LLM (time to first token) | 200-400ms | Model and prompt size dependent |
| TTS (time to first byte) | 75-150ms | ElevenLabs Flash: ~100ms |
| Audio transport (WebRTC) | 50-100ms | Depends on geography |
| Total | 475-1050ms | |
The biggest lever is LLM time to first token. Use the fastest model that meets your quality bar. Gemini 2.5 Flash was chosen for celestino.ai specifically because of its low latency -- not because it is the most capable model available.
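To check your pipeline against the budget table, record a timestamp at each stage boundary and compare the gaps. This helper is not a LiveKit API -- just a minimal sketch you can feed from your own event hooks (e.g. when STT emits a transcript, when the LLM emits its first token):

```typescript
// Minimal per-stage latency tracker for mouth-to-ear budget checks.
class StageTimer {
  private marks = new Map<string, number>();

  // Record when a stage boundary was reached (defaults to now).
  mark(stage: string, timeMs: number = Date.now()): void {
    this.marks.set(stage, timeMs);
  }

  // Elapsed milliseconds between two marked stage boundaries.
  between(from: string, to: string): number {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    if (a === undefined || b === undefined) {
      throw new Error(`missing mark: ${a === undefined ? from : to}`);
    }
    return b - a;
  }
}

// Usage: mark boundaries as your pipeline emits events, then compare
// the gaps against the budget table above.
const timer = new StageTimer();
timer.mark('speechEnd', 0);
timer.mark('transcript', 250); // STT done
timer.mark('firstToken', 600); // LLM time to first token
timer.mark('firstAudio', 720); // TTS time to first byte
console.log(timer.between('speechEnd', 'firstAudio')); // 720
```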
Build a LiveKit voice agent from scratch:

1. Scaffold the agent with defineAgent and AgentSession.
2. Wire up inference.STT, inference.LLM, and inference.TTS.
3. Configure voiceOptions with the endpointing values from this lesson.
4. Add a knowledge-base tool with llm.tool.
5. Pre-load chatCtx for warm starts.
6. Log pipeline events (UserInputTranscribed, SpeechCreated). Compare against the budget table.

You have a working voice pipeline with modular components, noise cancellation, and chat sync. But every component in this pipeline will fail at some point. Next, we cover Error Handling and Graceful Degradation -- building agents that recover from provider outages, transcription failures, and connection drops without the user hearing silence.