A team used OpenAI's Realtime API for their voice agent. It worked well -- until they needed a specific voice that OpenAI did not offer. Their brand required a particular vocal quality, and the limited voice options were a dealbreaker. They also discovered that their Spanish-speaking users got poor transcription accuracy because the built-in STT was tuned for English. They could not swap the STT without swapping the entire model. They were locked in.
The pipeline approach solves this. Instead of one model doing everything, you compose a pipeline from best-in-class components: choose the STT that handles your languages, the LLM that fits your latency budget, and the TTS that sounds like your brand. LiveKit handles the real-time transport, turn detection, and interruption management. You control every stage.
A LiveKit voice agent follows this flow:
[User's Microphone]
     | Audio frames
     v
+---------+
|   VAD   |   -- Voice Activity Detection
+----+----+
     | Speech detected
     v
+---------+
|   STT   |   -- Speech-to-Text (e.g., ElevenLabs Scribe)
+----+----+
     | Transcript text
     v
+---------+
|   LLM   |   -- Language Model (e.g., Gemini 2.5 Flash)
+----+----+
     | Response tokens
     v
+---------+
|   TTS   |   -- Text-to-Speech (e.g., ElevenLabs Flash v2.5)
+----+----+
     | Audio frames
     v
[User's Speaker]
Each stage is independent. You can swap ElevenLabs for Deepgram STT, or Gemini for Claude, or Cartesia for ElevenLabs TTS -- without touching the rest of the pipeline.
LiveKit Agents 1.0 introduced AgentSession as the unified orchestrator. Here is the production configuration from celestino.ai:
import { fileURLToPath } from 'node:url';
import {
  cli, JobContext, WorkerOptions,
  defineAgent, voice, llm, inference,
} from '@livekit/agents';
import { z } from 'zod';

// `systemPrompt` and `retrieveContext` are application-specific helpers,
// defined elsewhere in the celestino.ai codebase.

export default defineAgent({
  entry: async (ctx: JobContext) => {
    await ctx.connect();
    const participant = await ctx.waitForParticipant();

    // Initialize each pipeline stage independently
    const stt = new inference.STT({
      model: 'elevenlabs/scribe_v2_realtime',
      language: 'en',
    });
    const llmModel = new inference.LLM({
      model: 'google/gemini-2.5-flash',
    });
    const tts = new inference.TTS({
      model: 'elevenlabs/eleven_flash_v2_5',
      voice: 'cjVigY5qzO86Huf0OWal',
      language: 'en',
    });

    // Create the agent with instructions and tools
    const agent = new voice.Agent({
      instructions: systemPrompt,
      tools: {
        search: llm.tool({
          description: 'Search the knowledge base',
          parameters: z.object({
            query: z.string().describe('The search query'),
          }),
          execute: async ({ query }) => {
            const docs = await retrieveContext(query);
            return docs.map((d) => d.content).join('\n\n');
          },
        }),
      },
    });

    // Create the session -- this wires up the full pipeline
    const session = new voice.AgentSession({
      stt,
      llm: llmModel,
      tts,
    });

    // Start: connects the pipeline to the room
    await session.start({
      agent,
      room: ctx.room,
      inputOptions: {
        participantIdentity: participant.identity,
      },
    });
  },
});

// Register the agent with the worker (standard agents-js entrypoint)
cli.runApp(new WorkerOptions({ agent: fileURLToPath(import.meta.url) }));
The defineAgent + AgentSession pattern separates concerns. The agent defines what to say (instructions, tools). The session defines how to process audio (STT, LLM, TTS pipeline stages). Swapping a provider means changing one line, not rewriting the agent.
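For example, switching the STT stage for the Spanish-speaking users from the opening story is a small config change. (The `deepgram/nova-3` model ID below is illustrative -- confirm exact model identifiers in your inference provider's catalog.)

```typescript
// Illustrative: swap only the STT stage; LLM and TTS are untouched.
const stt = new inference.STT({
  model: 'deepgram/nova-3', // was 'elevenlabs/scribe_v2_realtime'
  language: 'es',           // pick the language your users actually speak
});
```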
VAD determines when the user is speaking versus when there is background noise. Without it, your agent tries to transcribe silence, dog barks, and keyboard clicks.
import * as silero from '@livekit/agents-plugin-silero';

const vad = await silero.VAD.load();

const session = new voice.AgentSession({
  stt,
  llm: llmModel,
  tts,
  vad, // Silero VAD filters non-speech audio
});
Silero VAD is a neural network specifically trained to distinguish speech from noise. It runs locally (no API call), adding negligible latency. Without VAD, you will see phantom transcripts from environmental noise -- a significant problem in non-studio environments.
Turn detection answers: "Has the user finished speaking?" Get this wrong and either the agent interrupts mid-sentence (too aggressive) or there is a long, awkward pause after every utterance (too conservative).
LiveKit provides multiple turn detection modes:
// Option 1: STT-based (uses the STT model's endpoint detection)
const turnDetection = 'stt';

// Option 2: Multilingual neural turn detector (use instead of Option 1)
// import * as livekitPlugin from '@livekit/agents-plugin-livekit';
// const turnDetection = new livekitPlugin.turnDetector.MultilingualModel();
You fine-tune endpointing and interruption behavior through voiceOptions:
const session = new voice.AgentSession({
  stt,
  llm: llmModel,
  tts,
  vad,
  turnDetection,
  voiceOptions: {
    minEndpointingDelay: 1000,    // Minimum silence (ms) before responding
    maxEndpointingDelay: 5000,    // Maximum wait time (ms)
    minInterruptionDuration: 800, // How long (ms) user must speak to interrupt
    minInterruptionWords: 2,      // Minimum words to count as interruption
    preemptiveGeneration: true,   // Start LLM while user may still be talking
  },
});
These values come from real user testing. The minEndpointingDelay of 1000ms is generous -- it prevents the agent from cutting off users who pause to think. The minInterruptionWords of 2 prevents single-syllable backchannels ("mm", "yeah") from being treated as interruptions.
Production voice agents encounter background noise, other voices, and echo. LiveKit provides noise cancellation as a pipeline input option:
import { BackgroundVoiceCancellation } from '@livekit/noise-cancellation-node';

const noiseCancellation = BackgroundVoiceCancellation();

await session.start({
  agent,
  room: ctx.room,
  inputOptions: {
    participantIdentity: participant.identity,
    noiseCancellation, // Filters background voices and noise
  },
});
This runs on the server side, filtering the audio before it reaches STT. The result is dramatically better transcription accuracy in non-ideal environments -- coffee shops, open offices, rooms with other speakers.
One of the hardest problems in hybrid agents is keeping voice and chat in sync. The solution is syncing messages through LiveKit's data channel and your database simultaneously:
import { v4 as uuidv4 } from 'uuid';
import type { Room } from '@livekit/rtc-node';
import { llm, voice } from '@livekit/agents';

// `saveMessage` is the application's own persistence helper.

class UnifiedAgent extends voice.Agent {
  private room: Room;

  // Reuse voice.Agent's own options type so this stays valid across versions
  constructor(opts: ConstructorParameters<typeof voice.Agent>[0], room: Room) {
    super(opts);
    this.room = room;
  }

  async syncMessage(msg: llm.ChatMessage) {
    const content = msg.textContent;
    if (!content) return;

    const payload = {
      id: uuidv4(),
      role: msg.role === 'user' ? 'user' : 'assistant',
      content,
      createdAt: new Date().toISOString(),
    };

    // 1. Save to database (persistent state)
    await saveMessage(this.room.name, payload);

    // 2. Broadcast to frontend via data channel (real-time sync)
    const data = new TextEncoder().encode(
      JSON.stringify({ type: 'chat_update', message: payload }),
    );
    await this.room.localParticipant.publishData(data, { reliable: true });
  }
}
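The class above only defines syncMessage; something still has to call it whenever a new conversation item is finalized. One way to wire that up -- assuming the session emits a conversation-item event, as recent agents-js releases do (check the event name against your installed version) -- is:

```typescript
// Sketch: forward each finalized conversation item to syncMessage.
// Event name and payload shape are assumptions -- verify in your agents-js version.
session.on('conversationItemAdded', (ev) => {
  void unifiedAgent.syncMessage(ev.item);
});
```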
On the frontend, listen for data channel messages and append them to the chat:
room.on(RoomEvent.DataReceived, (payload) => {
  const data = JSON.parse(new TextDecoder().decode(payload));
  if (data.type === 'chat_update' && data.message) {
    setMessages((prev) => [...prev, data.message]);
  }
});
Speak in voice mode, see transcripts in chat mode. The session is continuous across modalities because the conversation state is shared.
A voice agent that forgets previous conversations is frustrating. Loading conversation history into the LLM context gives the agent memory:
const history = await loadRecentMessages(roomName, userId, 20);

const chatCtx = llm.ChatContext.empty();
for (const item of history) {
  chatCtx.addMessage({
    role: item.role,
    content: item.content,
    id: item.id,
  });
}

const agent = new voice.Agent({
  instructions: systemPrompt,
  chatCtx, // Pre-loaded conversation history
  tools: { /* ... */ },
});
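The snippet above assumes a loadRecentMessages helper and its backing store. Here is a minimal in-memory sketch showing the expected contract (a real deployment would query your database instead): return the most recent `limit` messages for a room and user, oldest first, so they append to the ChatContext in conversation order.

```typescript
// Hypothetical sketch of the persistence helpers -- in-memory for illustration.
interface StoredMessage {
  id: string;
  roomName: string;
  userId: string;
  role: 'user' | 'assistant';
  content: string;
  createdAt: string; // ISO timestamp
}

const messageStore: StoredMessage[] = [];

async function saveMessage(msg: StoredMessage): Promise<void> {
  messageStore.push(msg);
}

// Most recent `limit` messages, returned oldest-first so they can be
// appended to the ChatContext in conversation order.
async function loadRecentMessages(
  roomName: string,
  userId: string,
  limit: number,
): Promise<StoredMessage[]> {
  return messageStore
    .filter((m) => m.roomName === roomName && m.userId === userId)
    .sort((a, b) => a.createdAt.localeCompare(b.createdAt))
    .slice(-limit);
}
```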
"Last time we talked about your experience with LiveKit" is a dramatically better opening than "Hello, how can I help you?" Memory is what makes a conversation feel like a conversation instead of a series of disconnected queries.
For a voice agent to feel natural, total mouth-to-ear latency should be under 1 second.
| Stage | Target | Notes |
|-------|--------|-------|
| VAD + audio capture | 50-100ms | Depends on buffer size |
| STT | 100-300ms | Streaming STT is faster |
| LLM (time to first token) | 200-400ms | Model and prompt size dependent |
| TTS (time to first byte) | 75-150ms | ElevenLabs Flash: ~100ms |
| Audio transport (WebRTC) | 50-100ms | Depends on geography |
| Total | 475-1050ms | |
The biggest lever is LLM time to first token. Use the fastest model that meets your quality bar. Gemini 2.5 Flash was chosen for celestino.ai specifically because of its low latency -- not because it is the most capable model available.
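To check your pipeline against the budget table, record a timestamp at each stage boundary and compare the gaps. This helper is not a LiveKit API -- just a minimal sketch you can feed from your own event hooks (e.g. when STT emits a transcript, when the LLM emits its first token):

```typescript
// Minimal per-stage latency tracker for mouth-to-ear budget checks.
class StageTimer {
  private marks = new Map<string, number>();

  // Record when a stage boundary was reached (defaults to now).
  mark(stage: string, timeMs: number = Date.now()): void {
    this.marks.set(stage, timeMs);
  }

  // Elapsed milliseconds between two marked stage boundaries.
  between(from: string, to: string): number {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    if (a === undefined || b === undefined) {
      throw new Error(`missing mark: ${a === undefined ? from : to}`);
    }
    return b - a;
  }
}

// Usage: mark boundaries as your pipeline emits events, then compare
// the gaps against the budget table above.
const timer = new StageTimer();
timer.mark('speechEnd', 0);
timer.mark('transcript', 250); // STT done
timer.mark('firstToken', 600); // LLM time to first token
timer.mark('firstAudio', 720); // TTS time to first byte
console.log(timer.between('speechEnd', 'firstAudio')); // 720
```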
Build a LiveKit voice agent from scratch:

1. Scaffold the agent with defineAgent and AgentSession.
2. Wire up inference.STT, inference.LLM, and inference.TTS.
3. Configure voiceOptions with the endpointing values from this lesson.
4. Add a knowledge-base tool with llm.tool.
5. Pre-load chatCtx for warm starts.
6. Log pipeline events (UserInputTranscribed, SpeechCreated). Compare against the budget table.

You have a working voice pipeline with modular components, noise cancellation, and chat sync. But every component in this pipeline will fail at some point. Next, we cover Error Handling and Graceful Degradation -- building agents that recover from provider outages, transcription failures, and connection drops without the user hearing silence.