A team built a chat agent that worked perfectly in development. The model returned answers in under 2 seconds. Then they deployed it. Users would type a question and stare at a blank screen for 2-3 seconds while the entire response was generated server-side and sent as one block. Users clicked away before seeing the answer. Bounce rate on the chat page was 40%.
The fix was not a faster model. It was streaming -- sending tokens to the client as they are generated. The first word appears in under 500ms. The user sees the response forming in real time. That visual feedback is enough to keep them engaged through a 3-second generation. Streaming is not a nice-to-have. It is the minimum bar for chat UX.
The AI SDK provides useChat on the client and streamText on the server. Together, they handle the streaming protocol, message state management, and error handling.
Here is the minimal version:
```tsx
// Client: app/page.tsx
'use client';

import { useState } from 'react';
import { useChat } from '@ai-sdk/react';

export default function Chat() {
  // useChat manages messages and streaming status; the input field
  // is ordinary local state.
  const [input, setInput] = useState('');
  const { messages, sendMessage, status } = useChat();

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong>{' '}
          {m.parts.map((p) => (p.type === 'text' ? p.text : '')).join('')}
        </div>
      ))}
      <form
        onSubmit={(e) => {
          e.preventDefault();
          sendMessage({ text: input });
          setInput('');
        }}
      >
        <input value={input} onChange={(e) => setInput(e.target.value)} />
        <button type="submit" disabled={status === 'streaming'}>
          Send
        </button>
      </form>
    </div>
  );
}
```
```ts
// Server: app/api/chat/route.ts
import { streamText, convertToModelMessages, type UIMessage } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages }: { messages: UIMessage[] } = await request.json();

  const result = streamText({
    model: google('gemini-2.5-flash'),
    // useChat sends UI messages; convert them to model messages.
    messages: convertToModelMessages(messages),
  });

  return result.toUIMessageStreamResponse();
}
```
This works. But it handles nothing that matters in production: no session management, no conversation history, no rate limiting, no custom data channels. The rest of this lesson builds the production version.
In production, you need control over what gets sent to the server and what comes back. The AI SDK's DefaultChatTransport lets you customize request preparation and response handling.
Here is how celestino.ai configures its transport:
```tsx
// Inside the chat component. DefaultChatTransport is imported from 'ai';
// sessionIdRef, setSessionId, and setRateLimitInfo are the component's
// own ref and state setters.
const transport = useMemo(
  () =>
    new DefaultChatTransport({
      api: '/api/chat',
      // Send only the latest message plus the session ID; the server
      // rebuilds the rest of the history from the database.
      prepareSendMessagesRequest: ({ messages }) => ({
        body: {
          sessionId: sessionIdRef.current ?? undefined,
          message: messages[messages.length - 1],
        },
      }),
      fetch: async (input, init) => {
        const response = await fetch(input, init);
        // Extract custom headers from the response
        const sessionHeader = response.headers.get('X-Session-Id');
        if (sessionHeader) setSessionId(sessionHeader);
        const remaining = response.headers.get('X-RateLimit-Remaining');
        const limit = response.headers.get('X-RateLimit-Limit');
        if (remaining && limit) {
          setRateLimitInfo({
            remaining: Number(remaining),
            limit: Number(limit),
          });
        }
        return response;
      },
    }),
  []
);
```
Three important patterns:

- prepareSendMessagesRequest sends only the last message plus the session ID. The server rebuilds the full history from the database, so request payloads stay small as conversations grow.
- The custom fetch wrapper reads metadata out of response headers (session ID, rate-limit counters) without touching how the stream body is consumed.
- useMemo with an empty dependency array keeps the transport instance stable across renders, so useChat is not reinitialized on every keystroke.
The server side is where streaming gets interesting. The AI SDK provides createUIMessageStream for building custom streaming responses with metadata, tool results, and control flow.
```ts
import {
  createUIMessageStream,
  createUIMessageStreamResponse,
  consumeStream,
  streamText,
  convertToModelMessages,
} from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { message, sessionId } = await request.json();

  // Load conversation history from the database
  const history = await loadHistory(sessionId);
  const allMessages = [...history, message];
  const modelMessages = convertToModelMessages(allMessages);

  // Extract the plain text of the incoming UI message from its parts
  const messageText = message.parts
    .filter((p) => p.type === 'text')
    .map((p) => p.text)
    .join('');

  // Build system prompt with RAG context
  const ragContext = await retrieveContext(messageText);
  const systemPrompt = buildSystemPrompt(ragContext);

  const stream = createUIMessageStream({
    originalMessages: allMessages,
    execute: ({ writer }) => {
      // Send metadata before the response starts
      writer.write({
        type: 'data-rate-limit',
        data: { remaining: 14, limit: 15 },
        transient: true, // Do not persist in message history
      });

      const result = streamText({
        model: google('gemini-2.5-flash'),
        system: systemPrompt,
        messages: modelMessages,
        onFinish: async ({ text }) => {
          // Persist both messages to the database
          await Promise.all([
            logMessage(sessionId, 'user', messageText),
            logMessage(sessionId, 'assistant', text),
          ]);
        },
      });

      writer.merge(result.toUIMessageStream());
    },
  });

  return createUIMessageStreamResponse({
    stream,
    consumeSseStream: consumeStream,
    headers: {
      'X-Session-Id': sessionId,
      'X-RateLimit-Remaining': '14',
    },
  });
}
```
The key concepts:
- writer.write with transient: true: sends data to the client that does not become part of the message history. Use this for rate limits, session metadata, and progress indicators.
- writer.merge: pipes the streamText result into the UI stream. This is what actually sends tokens to the client.
- onFinish callback: runs after the full response is generated. Use this for database writes, analytics, memory updates -- anything that needs the complete text.

A production chat needs persistent history. Conversations are state machines -- they have memory that spans sessions. The pattern:
```tsx
// Client: load history on mount
useEffect(() => {
  const loadHistory = async () => {
    const response = await fetch(`/api/chat/history?limit=${PAGE_SIZE}`);
    const data = await response.json();
    if (data.sessionId) setSessionId(data.sessionId);
    if (data.messages) setMessages(data.messages);
    setHasMore(Boolean(data.hasMore));
  };
  loadHistory();
}, []);
```
Use cursor-based pagination to prevent loading the entire conversation on every page load. For long-running agents with hundreds of turns, this is essential.
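The cursor logic itself is simple enough to sketch as a pure function. The message shape and function name below are illustrative assumptions, not part of the AI SDK:

```typescript
// Hypothetical shape of a stored chat message.
type StoredMessage = { id: number; role: 'user' | 'assistant'; text: string };

// Return the newest `limit` messages older than `cursor` (a message id),
// plus the cursor and hasMore flag for fetching the next (older) page.
function paginateHistory(
  messages: StoredMessage[], // assumed sorted oldest-to-newest by id
  limit: number,
  cursor?: number
) {
  const older =
    cursor === undefined ? messages : messages.filter((m) => m.id < cursor);
  const page = older.slice(-limit); // newest `limit` of the remaining messages
  return {
    messages: page,
    nextCursor: page.length > 0 ? page[0].id : undefined,
    hasMore: older.length > limit,
  };
}
```

The first request returns the newest page; each "load older messages" request passes nextCursor, so the client never refetches pages it already has. In production the filter and slice would be a database query (e.g. `WHERE id < cursor ORDER BY id DESC LIMIT n`).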
The status field from useChat tells you the current state of the conversation:
```ts
const { status } = useChat({ transport });

// status values:
// 'ready'     - Idle, waiting for input
// 'submitted' - Request sent, waiting for first token
// 'streaming' - Tokens arriving
// 'error'     - Something went wrong

const isLoading = status === 'streaming' || status === 'submitted';
```
Use this to disable the input field during streaming, show a typing indicator, and prevent duplicate submissions:
```tsx
{isLoading && (
  <div role="status" aria-label="Assistant is typing">
    <span className="typing-dot" />
    <span className="typing-dot" />
    <span className="typing-dot" />
  </div>
)}
```
The AI SDK supports custom data parts -- structured data that rides alongside the text stream. This is how you send information from server to client without a separate API call.
```tsx
const { messages, status } = useChat({
  transport,
  onData: (dataPart) => {
    if (dataPart.type === 'data-rate-limit') {
      const { remaining, limit } = dataPart.data;
      setRateLimitInfo({ remaining, limit });
    }
  },
});
```
This pattern replaces the need for polling endpoints or WebSocket side-channels for metadata. Rate limits, session state, feature flags -- anything the client needs to know during the conversation can flow through the data channel.
| Stage | Target | What Affects It |
|-------|--------|-----------------|
| Client to server | Under 50ms | Network, payload size |
| Server processing (RAG, history) | Under 200ms | Database queries, embedding search |
| LLM time to first token | Under 500ms | Model size, prompt length |
| Token delivery rate | 30+ tokens/sec | Model, streaming implementation |
| User-perceived TTFT | Under 750ms | Sum of above |
If your time to first token exceeds 1 second consistently, investigate server-side processing time first. RAG retrieval and history loading are the usual culprits -- run them in parallel with Promise.all.
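As a sketch of that parallelization, with stub implementations standing in for your real database and embedding queries (loadHistory and retrieveContext are assumed names, not SDK exports):

```typescript
// Stub data access functions; replace with real database / embedding queries.
async function loadHistory(sessionId: string): Promise<string[]> {
  return [`history:${sessionId}`];
}
async function retrieveContext(query: string): Promise<string> {
  return `context:${query}`;
}

// Sequential awaits cost historyTime + ragTime. Promise.all costs
// max(historyTime, ragTime) -- often the difference between a ~400ms
// and a ~200ms server-processing stage.
async function prepareRequest(sessionId: string, query: string) {
  const [history, ragContext] = await Promise.all([
    loadHistory(sessionId),
    retrieveContext(query),
  ]);
  return { history, ragContext };
}
```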
Build a streaming chat with session persistence:
- useChat with a DefaultChatTransport that sends a session ID in the request body.
- createUIMessageStream with writer.write for a custom data part (rate limit or session metadata).
- onFinish to persist messages to a database (Supabase, Postgres, or even a JSON file for prototyping).
- onData on the client to display the custom data part in the UI.

Test by opening two tabs with the same session ID. Both should load the same conversation history.
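For the prototyping option, persistence can be as simple as one JSON file per session. This logMessage is a hypothetical stand-in matching the name used in the route handler earlier, not a library function:

```typescript
import { promises as fs } from 'node:fs';

// Prototype-only persistence: append a message to a per-session JSON file.
// Fine for a single local process; move to a real database before two tabs
// become two servers.
async function logMessage(
  sessionId: string,
  role: 'user' | 'assistant',
  text: string,
  dir = '.'
) {
  const file = `${dir}/chat-${sessionId}.json`;
  let messages: { role: string; text: string; at: string }[] = [];
  try {
    messages = JSON.parse(await fs.readFile(file, 'utf8'));
  } catch {
    // First message for this session: start with an empty history.
  }
  messages.push({ role, text, at: new Date().toISOString() });
  await fs.writeFile(file, JSON.stringify(messages, null, 2));
  return messages.length;
}
```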
- useChat + streamText handle the protocol. Your job is everything around it: sessions, history, metadata.
- createUIMessageStream with writer.write enables metadata streaming alongside text.
- Persist messages in onFinish, not during streaming -- you need the complete response.

You have a streaming chat with session management and persistent history. But a chat agent that can only produce text is limited. Next, we cover Tool Use and Structured Outputs -- giving your agent the ability to call functions, query databases, and return typed data that your application can consume directly.