Real-time voice conversations with AI Agents using WebSockets
The Voice API enables real-time, bi-directional audio conversations with Datagrid AI Agents over WebSockets. Audio is streamed as base64-encoded PCM data, and the agent responds with synthesized speech in real time.
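Each audio frame is just a JSON envelope around a base64-encoded PCM chunk. A minimal sketch of the encode/decode round-trip (the helper names are illustrative, not part of the SDK):

```python
import base64
import json

def encode_audio_frame(chunk: bytes) -> str:
    """Wrap a raw PCM chunk in the JSON audio frame the socket expects."""
    return json.dumps({
        "type": "audio",
        "payload": {"data": base64.b64encode(chunk).decode()},
    })

def decode_audio_frame(frame: str) -> bytes:
    """Recover raw PCM bytes from an incoming audio frame."""
    msg = json.loads(frame)
    return base64.b64decode(msg["payload"]["data"])

# Round-trip: encoding then decoding returns the original PCM bytes.
pcm = b"\x00\x01\x02\x03" * 1024
assert decode_audio_frame(encode_audio_frame(pcm)) == pcm
```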
Call `POST /v1/voice` to validate your request and receive a WebSocket URL along with a ready-made start message. Then connect to the URL and send that message as the first frame.
```python
from datagrid import Datagrid
import asyncio
import websockets
import json
import base64

client = Datagrid()

# 1. Prepare the session via REST
session = client.voice.start_session(agent_id="agent_abc123")

# 2. Connect to the WebSocket URL
async def voice_session():
    async with websockets.connect(session.url) as ws:
        # 3. Send the pre-built start message
        await ws.send(json.dumps(session.start_message))

        # 4. Wait for ready
        while True:
            msg = json.loads(await ws.recv())
            print(f"← {msg['type']}")
            if msg["type"] == "started":
                print(f"   Session: {msg['payload']['session_id']}")
            if msg["type"] == "ready":
                break

        # 5. Stream audio
        with open("recording.pcm", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(json.dumps({
                    "type": "audio",
                    "payload": {"data": base64.b64encode(chunk).decode()}
                }))

        # 6. End session and get transcript
        await ws.send(json.dumps({"type": "stop"}))
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "audio":
                audio_bytes = base64.b64decode(msg["payload"]["data"])
                # Play or save audio_bytes...
            elif msg["type"] == "ended":
                print("Transcript:", msg["payload"]["transcript"])
                break

asyncio.run(voice_session())
```
Alternatively, connect directly with your API key and send a `start` message manually as the first frame:
```python
import asyncio
import websockets
import json
import base64
import os

API_KEY = os.environ["DATAGRID_API_KEY"]

async def voice_session():
    uri = f"wss://api.datagrid.com/ws/voice?token={API_KEY}"
    async with websockets.connect(uri) as ws:
        # 1. Start a session
        await ws.send(json.dumps({
            "type": "start",
            "payload": {
                "agent_id": "agent_abc123"
            }
        }))

        # 2. Wait for ready
        while True:
            msg = json.loads(await ws.recv())
            print(f"← {msg['type']}")
            if msg["type"] == "ready":
                break

        # 3. Stream audio and collect responses...
        # (same as Option A from step 5 onward)

asyncio.run(voice_session())
```
You must send a start message within 30 seconds of connecting. If the server doesn’t receive one in time, it closes the connection with code 4000 (idle timeout).
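If you want to surface this timeout to users, you can inspect the close code when the connection drops. A small sketch mapping the documented code to a readable reason (the helper is illustrative):

```python
# Close code documented for the idle timeout above.
IDLE_TIMEOUT_CODE = 4000

def describe_close(code: int) -> str:
    """Translate a WebSocket close code into a human-readable reason."""
    if code == IDLE_TIMEOUT_CODE:
        return "idle timeout: no start message received within 30 seconds"
    return f"connection closed with code {code}"

print(describe_close(4000))
```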
{ "type": "start", "payload": { "agent_id": "agent_abc123", "conversation_id": "conv_xyz789", "config": { "system_prompt": "You are a helpful travel assistant.", "custom_prompt": "Always respond in a friendly, conversational tone." }, "knowledge_ids": ["know_123"], "page_ids": ["page_456"], "file_ids": ["file_789"], "secret_ids": ["secret_012"], "user": { "first_name": "Jane", "last_name": "Doe", "email": "jane@example.com" }, "initial_context": "The user is looking at their latest sales report.", "ephemeral": false, "voice_config": { "voice_preset": "sage", "silence_commit_ms": 30000, "segment_max_duration_ms": 180000, "silence_discard_ratio": 0.9, "input_transcription": true, "output_transcription": true } }}
| Field | Type | Description |
| --- | --- | --- |
| `agent_id` | string \| null | Agent to use. If omitted, the default agent is used. |
| `conversation_id` | string \| null | Continue an existing conversation. If omitted, a new one is created. |
| `config` | object \| null | Prompt overrides: `system_prompt` and/or `custom_prompt`. Voice sessions always use Gemini Live, so LLM model, planning prompt, and tool overrides are not applicable. |
| `knowledge_ids` | string[] \| null | Knowledge sources to make available to the agent. |
| `page_ids` | string[] \| null | Pages (and their knowledge) to make available. |
| `file_ids` | string[] \| null | Files to attach to the conversation. |
| `secret_ids` | string[] \| null | Secrets to include in the context. |
| `user` | object \| null | Override user info (`first_name`, `last_name`, `email`). |
| `initial_context` | string \| null | Context text the agent will briefly address before listening. |
| `ephemeral` | boolean | When `true`, messages are not saved to conversation history. Default: `false`. |
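Since every field except the frame type is optional, it can be convenient to build the start message programmatically and drop unset fields. A sketch, with `build_start_message` as an illustrative helper rather than an SDK function:

```python
import json

def build_start_message(agent_id=None, conversation_id=None, **optional):
    """Assemble a start frame, omitting any optional field left as None."""
    fields = {"agent_id": agent_id, "conversation_id": conversation_id, **optional}
    return {
        "type": "start",
        "payload": {k: v for k, v in fields.items() if v is not None},
    }

msg = build_start_message(
    agent_id="agent_abc123",
    ephemeral=False,
    initial_context="The user is looking at their latest sales report.",
)
print(json.dumps(msg, indent=2))
```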
Sending `{"type": "stop"}` gracefully ends the session. The server responds with an `ended` message containing the session transcript and the credits consumed.
`stop` is optional. Closing the WebSocket connection also gracefully ends the session and commits all buffered content server-side. The only difference is that with `stop`, you receive the `ended` response containing the final transcript and credit usage before the connection closes.
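Extracting the transcript from the `ended` frame is a one-liner; a sketch (the payload's `transcript` key follows the description above):

```python
import json

def parse_ended(frame: str):
    """Return the final transcript if the frame is an ended message, else None."""
    msg = json.loads(frame)
    if msg["type"] == "ended":
        return msg["payload"]["transcript"]
    return None

frame = json.dumps({"type": "ended", "payload": {"transcript": "Hi there."}})
assert parse_ended(frame) == "Hi there."
```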
Send an interrupt message when the user starts speaking while the agent is responding. The agent stops its current response, and the server sends an `interrupted` message.
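A barge-in handler only needs to watch for the server's `interrupted` frame and stop local playback. A sketch; note that the client frame name `"interrupt"` is an assumption here, so check the message reference for the exact type:

```python
import json

# Assumption: the client signals barge-in with a frame of type "interrupt".
INTERRUPT_FRAME = json.dumps({"type": "interrupt"})

def handle_server_message(frame: str, on_interrupted) -> None:
    """Dispatch a server frame; invoke on_interrupted when the agent yields."""
    msg = json.loads(frame)
    if msg["type"] == "interrupted":
        # Stop local playback of any buffered agent audio here.
        on_interrupted()
```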
Sent in real time as transcription becomes available. The `role` field is either `"user"` or `"agent"`. Use this to display a live transcript as the conversation progresses.
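A live transcript view can simply accumulate these frames as they arrive. A sketch; the frame type `"transcript"` and the payload's `text` key are assumptions based on the description above:

```python
import json

def append_transcript(lines: list, frame: str) -> None:
    """Append a role-prefixed line for each incoming transcript frame.

    Assumes frames of type "transcript" carrying role/text in the payload;
    check the message reference for the exact shape.
    """
    msg = json.loads(frame)
    if msg["type"] == "transcript":
        role = msg["payload"]["role"]   # "user" or "agent"
        text = msg["payload"]["text"]
        lines.append(f"{role}: {text}")

lines = []
append_transcript(lines, json.dumps(
    {"type": "transcript", "payload": {"role": "user", "text": "Hello!"}}))
assert lines == ["user: Hello!"]
```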