The Voice API enables real-time, bi-directional audio conversations with Datagrid AI Agents over WebSockets. Audio is streamed as base64-encoded PCM data, and the agent responds with synthesized speech in real time.

How it works

  1. Start a session — Call the REST endpoint or connect directly via WebSocket
  2. Connect — Open a WebSocket connection to the returned URL
  3. Stream audio — Send microphone audio as base64 PCM chunks; receive audio responses the same way
  4. End the session — Send a stop message, or simply close the WebSocket

Getting Started

There are two ways to start a voice session. Choose the one that fits your stack.

Option A: REST + WebSocket

Call POST /v1/voice to validate your request and receive a WebSocket URL along with a ready-made start message. Then connect to the URL and send that message as the first frame:
from datagrid import Datagrid
import asyncio
import websockets
import json
import base64

client = Datagrid()

# 1. Prepare the session via REST
session = client.voice.start_session(agent_id="agent_abc123")

# 2. Connect to the WebSocket URL
async def voice_session():
    async with websockets.connect(session.url) as ws:
        # 3. Send the pre-built start message
        await ws.send(json.dumps(session.start_message))

        # 4. Wait for ready
        while True:
            msg = json.loads(await ws.recv())
            print(f"← {msg['type']}")
            if msg["type"] == "started":
                print(f"  Session: {msg['payload']['session_id']}")
            if msg["type"] == "ready":
                break

        # 5. Stream audio
        with open("recording.pcm", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(json.dumps({
                    "type": "audio",
                    "payload": {"data": base64.b64encode(chunk).decode()}
                }))

        # 6. End session and get transcript
        await ws.send(json.dumps({"type": "stop"}))
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "audio":
                audio_bytes = base64.b64decode(msg["payload"]["data"])
                # Play or save audio_bytes...
            elif msg["type"] == "ended":
                print("Transcript:", msg["payload"]["transcript"])
                break

asyncio.run(voice_session())

Option B: Direct WebSocket

If you prefer to skip the REST call, connect directly to the WebSocket endpoint with your API key:
wss://api.datagrid.com/ws/voice?token=YOUR_API_KEY
Then send a start message manually as the first frame:
import asyncio
import websockets
import json
import base64
import os

API_KEY = os.environ["DATAGRID_API_KEY"]

async def voice_session():
    uri = f"wss://api.datagrid.com/ws/voice?token={API_KEY}"

    async with websockets.connect(uri) as ws:
        # 1. Start a session
        await ws.send(json.dumps({
            "type": "start",
            "payload": {
                "agent_id": "agent_abc123"
            }
        }))

        # 2. Wait for ready
        while True:
            msg = json.loads(await ws.recv())
            print(f"← {msg['type']}")
            if msg["type"] == "ready":
                break

        # 3. Stream audio and collect responses...
        # (same as Option A from step 5 onward)

asyncio.run(voice_session())
You must send a start message within 30 seconds of connecting. If the server doesn’t receive one in time, it closes the connection with code 4000 (idle timeout).

Client → Server Messages

All messages are JSON objects with a type field and an optional payload.

start — Begin a voice session

{
  "type": "start",
  "payload": {
    "agent_id": "agent_abc123",
    "conversation_id": "conv_xyz789",
    "config": {
      "system_prompt": "You are a helpful travel assistant.",
      "custom_prompt": "Always respond in a friendly, conversational tone."
    },
    "knowledge_ids": ["know_123"],
    "page_ids": ["page_456"],
    "file_ids": ["file_789"],
    "secret_ids": ["secret_012"],
    "user": {
      "first_name": "Jane",
      "last_name": "Doe",
      "email": "jane@example.com"
    },
    "initial_context": "The user is looking at their latest sales report.",
    "ephemeral": false,
    "voice_config": {
      "voice_preset": "sage",
      "silence_commit_ms": 30000,
      "segment_max_duration_ms": 180000,
      "silence_discard_ratio": 0.9,
      "input_transcription": true,
      "output_transcription": true
    }
  }
}
| Field | Type | Description |
| --- | --- | --- |
| agent_id | string \| null | Agent to use. If omitted, the default agent is used. |
| conversation_id | string \| null | Continue an existing conversation. If omitted, a new one is created. |
| config | object \| null | Prompt overrides (system_prompt and/or custom_prompt). Voice sessions always use Gemini Live, so LLM model, planning prompt, and tool overrides are not applicable. |
| knowledge_ids | string[] \| null | Knowledge sources to make available to the agent. |
| page_ids | string[] \| null | Pages (and their knowledge) to make available. |
| file_ids | string[] \| null | Files to attach to the conversation. |
| secret_ids | string[] \| null | Secrets to include in the context. |
| user | object \| null | Override user info (first_name, last_name, email). |
| initial_context | string \| null | Context text the agent will briefly address before listening. |
| ephemeral | boolean | When true, messages are not saved to conversation history. Default: false. |
| voice_config | object \| null | Voice session configuration options. See Voice Configuration below. |

audio — Send an audio chunk

{
  "type": "audio",
  "payload": {
    "data": "<base64-encoded PCM audio>",
    "mime_type": "audio/pcm;rate=16000"
  }
}
Audio should be sent as 16-bit mono PCM at 16kHz, base64-encoded. Wait for the ready message before sending audio.
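The chunking loop from the examples above can be factored into a small helper. The `pcm_to_audio_messages` name and the `CHUNK_BYTES` value are our own choices; only the message shape and the 16 kHz, 16-bit mono PCM format come from this reference:

```python
import base64
import json

CHUNK_BYTES = 4096  # 2048 samples = ~128 ms of 16-bit mono audio at 16 kHz


def pcm_to_audio_messages(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split raw 16-bit/16 kHz mono PCM into serialized Voice API audio frames."""
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        yield json.dumps({
            "type": "audio",
            "payload": {
                "data": base64.b64encode(chunk).decode(),
                "mime_type": "audio/pcm;rate=16000",
            },
        })
```

In a streaming loop you would `await ws.send(frame)` for each yielded frame, ideally pacing sends at roughly real time rather than dumping the whole file at once.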

stop — End the session

{ "type": "stop" }
Gracefully ends the session. The server responds with an ended message containing the session transcript and credits consumed.
stop is optional. Closing the WebSocket connection also gracefully ends the session and commits all buffered content server-side. The only difference is that with stop, you receive the ended response containing the final transcript and credit usage before the connection closes.

interrupt — Interrupt the agent

{ "type": "interrupt" }
Send this when the user starts speaking while the agent is responding. The agent will stop its current response and the server sends an interrupted message.
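Deciding when the user has started speaking is up to the client. One minimal approach is an energy gate on outgoing microphone chunks; the `is_speech` helper and its threshold below are our own illustration, not part of the API:

```python
import struct

SPEECH_RMS_THRESHOLD = 500  # tune for your mic; 16-bit samples range over ±32767


def is_speech(chunk: bytes, threshold: int = SPEECH_RMS_THRESHOLD) -> bool:
    """Crude energy gate: treat a 16-bit mono PCM chunk as speech
    if its RMS amplitude exceeds the threshold."""
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    if not samples:
        return False
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms > threshold
```

In the send loop, if the agent is currently speaking and `is_speech(chunk)` is true, send `{"type": "interrupt"}` before resuming the audio stream. A real client would likely use a proper VAD library instead of a fixed RMS threshold.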

Voice Configuration

The voice_config option in the start message allows you to customize voice session behavior:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| voice_preset | string | Agent’s configured preset | Voice preset to use. See Available Voice Presets below. |
| silence_commit_ms | number | 30000 | Duration of silence (ms) before auto-committing a segment. |
| segment_max_duration_ms | number | 180000 | Maximum segment duration (ms) before force-commit (3 minutes). |
| silence_discard_ratio | number | 0.9 | Discard a segment if this fraction (0–1) of its audio is silence. |
| input_transcription | boolean | true | Enable transcription of user input audio. |
| output_transcription | boolean | true | Enable transcription of agent output audio. |

Available Voice Presets

| Preset | Description |
| --- | --- |
| spark | Bright, higher pitch |
| ember | Upbeat, middle pitch |
| sage | Informative, lower pitch (default) |
| nova | Firm, middle pitch |
| vale | Excitable, lower-middle pitch |
| drift | Youthful, higher pitch |
| crest | Firm, lower-middle pitch |
| orbit | Breezy, middle pitch |
| brook | Easy-going, middle pitch |
| gleam | Bright, middle pitch |
| dusk | Breathy, lower pitch |
| prism | Clear, lower-middle pitch |
| coast | Easy-going, lower-middle pitch |
| velvet | Smooth, lower pitch |
| silk | Smooth, middle pitch |
| crystal | Clear, middle pitch |
| ridge | Gravelly, lower pitch |
| atlas | Informative, middle pitch |
| bloom | Upbeat, higher pitch |
| whisper | Soft, higher pitch |
| steel | Firm, lower-middle pitch |
| steady | Even, lower-middle pitch |
| cedar | Mature, middle pitch |
| forge | Forward, middle pitch |
| haven | Friendly, lower-middle pitch |
| tide | Casual, lower-middle pitch |
| meadow | Gentle, middle pitch |
| rhythm | Lively, lower pitch |
| quill | Articulate, middle pitch |
| glow | Warm, lower-middle pitch |

Server → Client Messages

started — Session established

{
  "type": "started",
  "payload": {
    "session_id": "sess_abc123",
    "conversation_id": "conv_xyz789",
    "message_id": "msg_def456"
  }
}
Sent immediately after a start message is processed. Contains the IDs for the session, conversation, and initial message.

ready — Agent is ready to receive audio

{ "type": "ready" }
Wait for this message before sending audio chunks. The agent needs a moment to initialize after the session starts.

audio — Agent audio response

{
  "type": "audio",
  "payload": {
    "data": "<base64-encoded PCM audio>",
    "mime_type": "audio/pcm;rate=24000"
  }
}
Response audio is 16-bit mono PCM at 24kHz. Multiple audio messages are sent in sequence as the agent speaks.
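Because the payload is headerless raw PCM, it needs a WAV header before ordinary players can open it. A sketch using Python's stdlib `wave` module (the `save_agent_audio` helper is our own); it assumes you have already concatenated the base64-decoded chunks:

```python
import wave


def save_agent_audio(path: str, pcm: bytes) -> None:
    """Wrap raw agent output (16-bit mono PCM at 24 kHz) in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(24000)   # server output rate
        wav.writeframes(pcm)
```

In the receive loop, accumulate audio with `buf.extend(base64.b64decode(msg["payload"]["data"]))` on a `bytearray`, then call `save_agent_audio("reply.wav", bytes(buf))` once the turn is complete.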

tool_call — Agent is using a tool

{
  "type": "tool_call",
  "payload": {
    "tool_name": "search_knowledge",
    "status": "started"
  }
}
Status is either "started" or "completed". Use this to show loading indicators while the agent searches knowledge or uses other tools.

transcript — Real-time transcription

{
  "type": "transcript",
  "payload": {
    "role": "user",
    "text": "What were our Q4 sales?"
  }
}
Sent in real-time as transcription becomes available. The role field is either "user" or "agent". Use this to display a live transcript as the conversation progresses.
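A receive loop typically dispatches on the type field. This hypothetical handler keeps a running transcript and a tool-activity flag; the message and field names come from the reference above, everything else is illustrative:

```python
import json


def handle_message(raw: str, state: dict) -> None:
    """Dispatch one server frame into client-side UI state."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "transcript":
        # Append one line of the live transcript: "user: ..." or "agent: ..."
        state.setdefault("lines", []).append(
            f'{msg["payload"]["role"]}: {msg["payload"]["text"]}'
        )
    elif kind == "tool_call":
        # Show a loading indicator while status == "started", hide on "completed"
        state["tool_busy"] = msg["payload"]["status"] == "started"
```

The same pattern extends naturally to audio (decode and enqueue for playback) and citation (attach sources to the current transcript line).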

citation — Source citation

{
  "type": "citation",
  "payload": {
    "citations": [
      { "source": "Q4 Sales Report.pdf", "page": 3 }
    ],
    "timestamp_ms": 12500
  }
}
Sent when the agent references a knowledge source. The timestamp_ms is relative to the session start.

interrupted — Agent was interrupted

{ "type": "interrupted" }
Confirms that the agent’s response was interrupted after a client interrupt message.

error — An error occurred

{
  "type": "error",
  "payload": {
    "message": "Description of what went wrong"
  }
}
Errors do not necessarily close the session. Transient errors are recoverable — only fatal errors are followed by a WebSocket close.

ended — Session ended

{
  "type": "ended",
  "payload": {
    "credits_consumed": 5,
    "transcript": [
      { "role": "user", "text": "What were our Q4 sales?" },
      { "role": "agent", "text": "Based on your sales report, Q4 revenue was $2.3M..." }
    ]
  }
}
Sent when the session ends (either from a stop message, server-side timeout, or error). Contains the final transcript and credit usage.
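If all you need is the final text, the ended payload can be rendered directly. A small illustrative formatter (the helper name is our own; the payload fields are from the message above):

```python
def format_transcript(ended_payload: dict) -> str:
    """Render the `ended` payload as plain text, one line per turn."""
    lines = [f'{turn["role"]}: {turn["text"]}'
             for turn in ended_payload["transcript"]]
    lines.append(f'(credits consumed: {ended_payload["credits_consumed"]})')
    return "\n".join(lines)
```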

Session Lifecycle

Client                              Server
  │                                    │
  │── POST /v1/voice ────────────────→ │  (Optional: get URL + start_message)
  │← ─ { url, start_message } ──────  │
  │                                    │
  │─── WebSocket Connect (url) ──────→ │
  │                                    │
  │─── start_message ────────────────→ │
  │                                    │
  │← ─ { type: "started", ... } ──── │  (session_id, conversation_id, message_id)
  │← ─ { type: "ready" } ─────────── │
  │                                    │
  │─── { type: "audio", ... } ──────→ │  (stream mic audio)
  │─── { type: "audio", ... } ──────→ │
  │     ...                            │
  │                                    │
  │← ─ { type: "transcript", ... } ── │  (real-time transcription)
  │← ─ { type: "audio", ... } ──────  │  (agent speaks back)
  │← ─ { type: "audio", ... } ──────  │
  │← ─ { type: "citation", ... } ──── │  (source citations)
  │                                    │
  │─── { type: "interrupt" } ───────→ │  (user interrupts)
  │← ─ { type: "interrupted" } ─────  │
  │                                    │
  │─── { type: "stop" } ───────────→  │  (or just close the WebSocket)
  │← ─ { type: "ended", ... } ──────  │  (transcript + credits)
  │                                    │
  │─── WebSocket Close ──────────────→ │

WebSocket Close Codes

| Code | Meaning |
| --- | --- |
| 1000 | Normal closure |
| 1001 | Server shutting down |
| 1008 | Authentication failed or invalid API key |
| 1011 | Internal server error during connection |
| 4000 | Idle timeout: no start message received within 30 seconds |
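How to react to each code is up to the client; one reasonable policy is to retry only transient server-side closures. The retry classification below is our own suggestion, not part of the API; only the codes themselves come from the table above:

```python
# 1001/1011: server restart or transient server error -- worth retrying.
RETRYABLE_CODES = {1001, 1011}
# 1008/4000: bad credentials or a client that never sent start -- fix the client first.
FATAL_CODES = {1008, 4000}


def should_reconnect(close_code: int) -> bool:
    """Decide whether to reopen the WebSocket after it closed with close_code."""
    if close_code in FATAL_CODES:
        return False
    return close_code in RETRYABLE_CODES
```

With the websockets library, the close code is typically available as `exc.rcvd.code` on the `ConnectionClosed` exception (in recent versions of the library); pair `should_reconnect` with exponential backoff in production.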

Audio Format Reference

| Direction | Format | Sample Rate | Channels | Encoding |
| --- | --- | --- | --- | --- |
| Client → Server | PCM 16-bit | 16 kHz | Mono | Base64 |
| Server → Client | PCM 16-bit | 24 kHz | Mono | Base64 |

Platform Guides