A team built a chat agent that worked perfectly in development. The model returned answers in under 2 seconds. Then they deployed it. Users would type a question and stare at a blank screen for 2-3 seconds while the entire response was generated server-side and sent as one block. Users clicked away before seeing the answer. Bounce rate on the chat page was 40%.
The fix was not a faster model. It was streaming -- sending tokens to the client as they are generated. The first word appears in under 500ms. The user sees the response forming in real time. That visual feedback is enough to keep them engaged through a 3-second generation. Streaming is not a nice-to-have. It is the minimum bar for chat UX.
The AI SDK provides useChat on the client and streamText on the server. Together, they handle the streaming protocol, message state management, and error handling.
Here is the minimal version:
```tsx
// Client: app/page.tsx
'use client';

import { useState } from 'react';
import { useChat } from '@ai-sdk/react';

export default function Chat() {
  // useChat manages messages and streaming status; the input field
  // is ordinary local state.
  const [input, setInput] = useState('');
  const { messages, sendMessage, status } = useChat();

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong>{' '}
          {m.parts.map((p) => (p.type === 'text' ? p.text : '')).join('')}
        </div>
      ))}
      <form
        onSubmit={(e) => {
          e.preventDefault();
          sendMessage({ text: input });
          setInput('');
        }}
      >
        <input value={input} onChange={(e) => setInput(e.target.value)} />
        <button type="submit" disabled={status === 'streaming'}>
          Send
        </button>
      </form>
    </div>
  );
}
```
```ts
// Server: app/api/chat/route.ts
import { streamText, convertToModelMessages, type UIMessage } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages }: { messages: UIMessage[] } = await request.json();

  const result = streamText({
    model: google('gemini-2.5-flash'),
    // useChat sends UI messages; convert them to model messages.
    messages: convertToModelMessages(messages),
  });

  return result.toUIMessageStreamResponse();
}
```
This works. But it handles nothing that matters in production: no session management, no conversation history, no rate limiting, no custom data channels. The rest of this lesson builds the production version.
In production, you need control over what gets sent to the server and what comes back. The AI SDK's DefaultChatTransport lets you customize request preparation and response handling.
Here is how celestino.ai configures its transport:
```tsx
// Inside the chat component. DefaultChatTransport is imported from 'ai';
// sessionIdRef, setSessionId, and setRateLimitInfo are the component's
// own ref and state setters.
const transport = useMemo(
  () =>
    new DefaultChatTransport({
      api: '/api/chat',
      // Send only the latest message plus the session ID; the server
      // rebuilds the rest of the history from the database.
      prepareSendMessagesRequest: ({ messages }) => ({
        body: {
          sessionId: sessionIdRef.current ?? undefined,
          message: messages[messages.length - 1],
        },
      }),
      fetch: async (input, init) => {
        const response = await fetch(input, init);
        // Extract custom headers from the response
        const sessionHeader = response.headers.get('X-Session-Id');
        if (sessionHeader) setSessionId(sessionHeader);
        const remaining = response.headers.get('X-RateLimit-Remaining');
        const limit = response.headers.get('X-RateLimit-Limit');
        if (remaining && limit) {
          setRateLimitInfo({
            remaining: Number(remaining),
            limit: Number(limit),
          });
        }
        return response;
      },
    }),
  []
);
```
Three important patterns:

- prepareSendMessagesRequest sends only the last message plus the session ID. The server rebuilds the full history from the database, so request payloads stay small as conversations grow.
- The custom fetch wrapper reads metadata out of response headers (session ID, rate-limit counters) without touching how the stream body is consumed.
- useMemo with an empty dependency array keeps the transport instance stable across renders, so useChat is not reinitialized on every keystroke.
The server side is where streaming gets interesting. The AI SDK provides createUIMessageStream for building custom streaming responses with metadata, tool results, and control flow.
```ts
import {
  createUIMessageStream,
  createUIMessageStreamResponse,
  consumeStream,
  streamText,
  convertToModelMessages,
} from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { message, sessionId } = await request.json();

  // Load conversation history from the database
  const history = await loadHistory(sessionId);
  const allMessages = [...history, message];
  const modelMessages = convertToModelMessages(allMessages);

  // Extract the plain text of the incoming UI message from its parts
  const messageText = message.parts
    .filter((p) => p.type === 'text')
    .map((p) => p.text)
    .join('');

  // Build system prompt with RAG context
  const ragContext = await retrieveContext(messageText);
  const systemPrompt = buildSystemPrompt(ragContext);

  const stream = createUIMessageStream({
    originalMessages: allMessages,
    execute: ({ writer }) => {
      // Send metadata before the response starts
      writer.write({
        type: 'data-rate-limit',
        data: { remaining: 14, limit: 15 },
        transient: true, // Do not persist in message history
      });

      const result = streamText({
        model: google('gemini-2.5-flash'),
        system: systemPrompt,
        messages: modelMessages,
        onFinish: async ({ text }) => {
          // Persist both messages to the database
          await Promise.all([
            logMessage(sessionId, 'user', messageText),
            logMessage(sessionId, 'assistant', text),
          ]);
        },
      });

      writer.merge(result.toUIMessageStream());
    },
  });

  return createUIMessageStreamResponse({
    stream,
    consumeSseStream: consumeStream,
    headers: {
      'X-Session-Id': sessionId,
      'X-RateLimit-Remaining': '14',
    },
  });
}
```
The key concepts:
- writer.write with transient: true: sends data to the client that does not become part of the message history. Use this for rate limits, session metadata, and progress indicators.
- writer.merge: pipes the streamText result into the UI stream. This is what actually sends tokens to the client.
- onFinish callback: runs after the full response is generated. Use this for database writes, analytics, memory updates -- anything that needs the complete text.

A production chat needs persistent history. Conversations are state machines -- they have memory that spans sessions. The pattern:
```tsx
// Client: load history on mount
useEffect(() => {
  const loadHistory = async () => {
    const response = await fetch(`/api/chat/history?limit=${PAGE_SIZE}`);
    const data = await response.json();
    if (data.sessionId) setSessionId(data.sessionId);
    if (data.messages) setMessages(data.messages);
    setHasMore(Boolean(data.hasMore));
  };
  loadHistory();
}, []);
```
Use cursor-based pagination to prevent loading the entire conversation on every page load. For long-running agents with hundreds of turns, this is essential.
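The cursor logic itself is simple enough to sketch as a pure function. The message shape and function name below are illustrative assumptions, not part of the AI SDK:

```typescript
// Hypothetical shape of a stored chat message.
type StoredMessage = { id: number; role: 'user' | 'assistant'; text: string };

// Return the newest `limit` messages older than `cursor` (a message id),
// plus the cursor and hasMore flag for fetching the next (older) page.
function paginateHistory(
  messages: StoredMessage[], // assumed sorted oldest-to-newest by id
  limit: number,
  cursor?: number
) {
  const older =
    cursor === undefined ? messages : messages.filter((m) => m.id < cursor);
  const page = older.slice(-limit); // newest `limit` of the remaining messages
  return {
    messages: page,
    nextCursor: page.length > 0 ? page[0].id : undefined,
    hasMore: older.length > limit,
  };
}
```

The first request returns the newest page; each "load older messages" request passes nextCursor, so the client never refetches pages it already has. In production the filter and slice would be a database query (e.g. `WHERE id < cursor ORDER BY id DESC LIMIT n`).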
The status field from useChat tells you the current state of the conversation:
```ts
const { status } = useChat({ transport });

// status values:
// 'ready'     - Idle, waiting for input
// 'submitted' - Request sent, waiting for first token
// 'streaming' - Tokens arriving
// 'error'     - Something went wrong

const isLoading = status === 'streaming' || status === 'submitted';
```
Use this to disable the input field during streaming, show a typing indicator, and prevent duplicate submissions:
```tsx
{isLoading && (
  <div role="status" aria-label="Assistant is typing">
    <span className="typing-dot" />
    <span className="typing-dot" />
    <span className="typing-dot" />
  </div>
)}
```
The AI SDK supports custom data parts -- structured data that rides alongside the text stream. This is how you send information from server to client without a separate API call.
```tsx
const { messages, status } = useChat({
  transport,
  onData: (dataPart) => {
    if (dataPart.type === 'data-rate-limit') {
      const { remaining, limit } = dataPart.data;
      setRateLimitInfo({ remaining, limit });
    }
  },
});
```
This pattern replaces the need for polling endpoints or WebSocket side-channels for metadata. Rate limits, session state, feature flags -- anything the client needs to know during the conversation can flow through the data channel.
| Stage | Target | What Affects It |
|-------|--------|-----------------|
| Client to server | Under 50ms | Network, payload size |
| Server processing (RAG, history) | Under 200ms | Database queries, embedding search |
| LLM time to first token | Under 500ms | Model size, prompt length |
| Token delivery rate | 30+ tokens/sec | Model, streaming implementation |
| User-perceived TTFT | Under 750ms | Sum of above |
If your time to first token exceeds 1 second consistently, investigate server-side processing time first. RAG retrieval and history loading are the usual culprits -- run them in parallel with Promise.all.
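As a sketch of that parallelization, with stub implementations standing in for your real database and embedding queries (loadHistory and retrieveContext are assumed names, not SDK exports):

```typescript
// Stub data access functions; replace with real database / embedding queries.
async function loadHistory(sessionId: string): Promise<string[]> {
  return [`history:${sessionId}`];
}
async function retrieveContext(query: string): Promise<string> {
  return `context:${query}`;
}

// Sequential awaits cost historyTime + ragTime. Promise.all costs
// max(historyTime, ragTime) -- often the difference between a ~400ms
// and a ~200ms server-processing stage.
async function prepareRequest(sessionId: string, query: string) {
  const [history, ragContext] = await Promise.all([
    loadHistory(sessionId),
    retrieveContext(query),
  ]);
  return { history, ragContext };
}
```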
Build a streaming chat with session persistence:
- useChat with a DefaultChatTransport that sends a session ID in the request body.
- createUIMessageStream with writer.write for a custom data part (rate limit or session metadata).
- onFinish to persist messages to a database (Supabase, Postgres, or even a JSON file for prototyping).
- onData on the client to display the custom data part in the UI.

Test by opening two tabs with the same session ID. Both should load the same conversation history.
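For the prototyping option, persistence can be as simple as one JSON file per session. This logMessage is a hypothetical stand-in matching the name used in the route handler earlier, not a library function:

```typescript
import { promises as fs } from 'node:fs';

// Prototype-only persistence: append a message to a per-session JSON file.
// Fine for a single local process; move to a real database before two tabs
// become two servers.
async function logMessage(
  sessionId: string,
  role: 'user' | 'assistant',
  text: string,
  dir = '.'
) {
  const file = `${dir}/chat-${sessionId}.json`;
  let messages: { role: string; text: string; at: string }[] = [];
  try {
    messages = JSON.parse(await fs.readFile(file, 'utf8'));
  } catch {
    // First message for this session: start with an empty history.
  }
  messages.push({ role, text, at: new Date().toISOString() });
  await fs.writeFile(file, JSON.stringify(messages, null, 2));
  return messages.length;
}
```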
- useChat + streamText handle the protocol. Your job is everything around it: sessions, history, metadata.
- createUIMessageStream with writer.write enables metadata streaming alongside text.
- Persist messages in onFinish, not during streaming -- you need the complete response.

You have a streaming chat with session management and persistent history. But a chat agent that can only produce text is limited. Next, we cover Tool Use and Structured Outputs -- giving your agent the ability to call functions, query databases, and return typed data that your application can consume directly.