# Voice Conversations

> Real-time voice conversations with AI Agents using WebSockets

The Voice API enables real-time, bi-directional audio conversations with Datagrid AI Agents over WebSockets. Audio is streamed as base64-encoded PCM data, and the agent responds with synthesized speech in real time.

## How it works

1. **Start a session** — Call the REST endpoint or connect directly via WebSocket
2. **Connect** — Open a WebSocket connection to the returned URL
3. **Stream audio** — Send microphone audio as base64 PCM chunks; receive audio responses the same way
4. **End the session** — Send a `stop` message, or simply close the WebSocket

## Getting Started

There are two ways to start a voice session. Choose the one that fits your stack.

### Option A: SDK / REST (recommended)

Call `POST /v1/voice` to validate your request and receive a WebSocket URL with a ready-made `start` message. Then connect to the URL and send the message as the first frame.

<Note>
  `POST /v1/voice` depends on Redis to issue a short-lived REST-to-WebSocket handoff token. During a Redis incident, clients that can construct their own `start` message may use the direct WebSocket flow below with a raw API key.
</Note>

<CodeGroup>
  ```python Python theme={null}
  from datagrid import Datagrid
  import asyncio
  import websockets
  import json
  import base64

  client = Datagrid()

  # 1. Prepare the session via REST
  session = client.voice.start_session(agent_id="agent_abc123")

  # 2. Connect to the WebSocket URL
  async def voice_session():
      async with websockets.connect(session.url) as ws:
          # 3. Send the pre-built start message
          await ws.send(json.dumps(session.start_message))

          # 4. Wait for ready
          while True:
              msg = json.loads(await ws.recv())
              print(f"← {msg['type']}")
              if msg["type"] == "started":
                  print(f"  Session: {msg['payload']['session_id']}")
              if msg["type"] == "ready":
                  break

          # 5. Stream audio
          with open("recording.pcm", "rb") as f:
              while chunk := f.read(4096):
                  await ws.send(json.dumps({
                      "type": "audio",
                      "payload": {"data": base64.b64encode(chunk).decode()}
                  }))

          # 6. End session and get transcript
          await ws.send(json.dumps({"type": "stop"}))
          while True:
              msg = json.loads(await ws.recv())
              if msg["type"] == "audio":
                  audio_bytes = base64.b64decode(msg["payload"]["data"])
                  # Play or save audio_bytes...
              elif msg["type"] == "ended":
                  print("Transcript:", msg["payload"]["transcript"])
                  break

  asyncio.run(voice_session())
  ```

  ```javascript JavaScript theme={null}
  import Datagrid from "datagrid-ai";
  import WebSocket from "ws";

  const client = new Datagrid();

  // 1. Prepare the session via REST
  const session = await client.voice.startSession({
    agent_id: "agent_abc123",
  });

  // 2. Connect to the WebSocket URL
  const ws = new WebSocket(session.url);

  ws.on("open", () => {
    // 3. Send the pre-built start message
    ws.send(JSON.stringify(session.start_message));
  });

  ws.on("message", (data) => {
    const msg = JSON.parse(data.toString());

    switch (msg.type) {
      case "started":
        console.log("Session:", msg.payload.session_id);
        break;
      case "ready":
        console.log("Ready — start sending audio");
        // Send audio chunks here...
        break;
      case "audio":
        // Decode and play: Buffer.from(msg.payload.data, "base64")
        break;
      case "transcript":
        console.log(`[${msg.payload.role}] ${msg.payload.text}`);
        break;
      case "ended":
        console.log("Transcript:", msg.payload.transcript);
        ws.close();
        break;
      case "error":
        console.error("Error:", msg.payload.message);
        break;
    }
  });
  ```

  ```bash cURL + wscat theme={null}
  # 1. Prepare the session
  curl -X POST https://api.datagrid.com/v1/voice \
    -H "Authorization: Bearer $DATAGRID_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"agent_id": "agent_abc123"}'

  # Response:
  # {
  #   "object": "voice.session",
  #   "url": "wss://api.datagrid.com/ws/voice?token=dg_live_...",
  #   "agent_id": "agent_abc123",
  #   "start_message": {"type":"start","payload":{"agent_id":"agent_abc123"}}
  # }

  # 2. Connect to the "url" value from the response, then send the start_message
  wscat -c "wss://api.datagrid.com/ws/voice?token=dg_live_..."
  > {"type":"start","payload":{"agent_id":"agent_abc123"}}
  ```
</CodeGroup>

### Option B: Direct WebSocket

If you prefer to skip the REST call, connect directly to the WebSocket endpoint with your API key:

```
wss://api.datagrid.com/ws/voice?token=YOUR_API_KEY
```

Then send a `start` message manually as the first frame:

<CodeGroup>
  ```python Python theme={null}
  import asyncio
  import websockets
  import json
  import base64
  import os

  API_KEY = os.environ["DATAGRID_API_KEY"]

  async def voice_session():
      uri = f"wss://api.datagrid.com/ws/voice?token={API_KEY}"

      async with websockets.connect(uri) as ws:
          # 1. Start a session
          await ws.send(json.dumps({
              "type": "start",
              "payload": {
                  "agent_id": "agent_abc123"
              }
          }))

          # 2. Wait for ready
          while True:
              msg = json.loads(await ws.recv())
              print(f"← {msg['type']}")
              if msg["type"] == "ready":
                  break

          # 3. Stream audio and collect responses...
          # (same as Option A from step 5 onward)

  asyncio.run(voice_session())
  ```

  ```javascript JavaScript theme={null}
  const API_KEY = process.env.DATAGRID_API_KEY;
  const WebSocket = require("ws");

  const ws = new WebSocket(
    `wss://api.datagrid.com/ws/voice?token=${API_KEY}`
  );

  ws.on("open", () => {
    ws.send(JSON.stringify({
      type: "start",
      payload: { agent_id: "agent_abc123" }
    }));
  });

  ws.on("message", (data) => {
    const msg = JSON.parse(data.toString());
    // Handle messages (same as Option A)
  });
  ```
</CodeGroup>

<Warning>
  You must send a `start` message within **30 seconds** of connecting. If the server doesn't receive one in time, it closes the connection with code `4000` (idle timeout).
</Warning>
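The Note under Option A says clients can fall back to the direct WebSocket flow when `POST /v1/voice` is unavailable. The sketch below shows one way to structure that fallback. `start_via_rest` is a hypothetical callable wrapping the SDK call, and treating every exception as a reason to fall back is a simplifying assumption rather than documented behavior.

```python
from typing import Any, Callable, Dict, Tuple

DIRECT_WS_URL = "wss://api.datagrid.com/ws/voice"  # endpoint shown in Option B


def resolve_voice_session(
    start_via_rest: Callable[[], Any],
    api_key: str,
    agent_id: str,
) -> Tuple[str, Dict[str, Any]]:
    """Return (websocket_url, start_message), preferring the REST handoff."""
    try:
        session = start_via_rest()
        return session.url, session.start_message
    except Exception:
        # Assumption: any REST failure triggers the direct flow with a raw API key.
        url = f"{DIRECT_WS_URL}?token={api_key}"
        start_message = {"type": "start", "payload": {"agent_id": agent_id}}
        return url, start_message
```

With the SDK, the callable could be `lambda: client.voice.start_session(agent_id="agent_abc123")`. A production client would likely narrow the exception types it falls back on.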

## Client → Server Messages

All messages are JSON objects with a `type` field and an optional `payload`.

### `start` — Begin a voice session

```json theme={null}
{
  "type": "start",
  "payload": {
    "agent_id": "agent_abc123",
    "conversation_id": "conv_xyz789",
    "config": {
      "system_prompt": "You are a helpful travel assistant.",
      "custom_prompt": "Always respond in a friendly, conversational tone."
    },
    "knowledge_ids": ["know_123"],
    "page_ids": ["page_456"],
    "file_ids": ["file_789"],
    "secret_ids": ["secret_012"],
    "user": {
      "first_name": "Jane",
      "last_name": "Doe",
      "email": "jane@example.com"
    },
    "initial_context": "The user is looking at their latest sales report.",
    "ephemeral": false,
    "voice_config": {
      "voice_preset": "sage",
      "silence_commit_ms": 30000,
      "segment_max_duration_ms": 180000,
      "silence_discard_ratio": 0.9,
      "input_transcription": true,
      "output_transcription": true
    }
  }
}
```

| Field             | Type              | Description                                                                                                                                                             |
| ----------------- | ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `agent_id`        | string \| null    | Agent to use. If omitted, the default agent is used.                                                                                                                    |
| `conversation_id` | string \| null    | Continue an existing conversation. If omitted, a new one is created.                                                                                                    |
| `config`          | object \| null    | Prompt overrides — `system_prompt` and/or `custom_prompt`. Voice sessions always use Gemini Live, so LLM model, planning prompt, and tool overrides are not applicable. |
| `knowledge_ids`   | string\[] \| null | Knowledge sources to make available to the agent.                                                                                                                       |
| `page_ids`        | string\[] \| null | Pages (and their knowledge) to make available.                                                                                                                          |
| `file_ids`        | string\[] \| null | Files to attach to the conversation.                                                                                                                                    |
| `secret_ids`      | string\[] \| null | Secrets to include in the context.                                                                                                                                      |
| `user`            | object \| null    | Override user info (`first_name`, `last_name`, `email`).                                                                                                                |
| `initial_context` | string \| null    | Context text the agent will briefly address before listening.                                                                                                           |
| `ephemeral`       | boolean           | When `true`, messages are not saved to conversation history. Default: `false`.                                                                                          |
| `voice_config`    | object \| null    | Voice session configuration options. See [Voice Configuration](#voice-configuration) below.                                                                             |

### `audio` — Send an audio chunk

```json theme={null}
{
  "type": "audio",
  "payload": {
    "data": "<base64-encoded PCM audio>",
    "mime_type": "audio/pcm;rate=16000"
  }
}
```

Audio should be sent as **16-bit mono PCM at 16 kHz**, base64-encoded. Wait for the `ready` message before sending audio.
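To illustrate the format, the sketch below splits raw PCM into `audio` frames of a fixed duration. The 100 ms chunk length is an assumption chosen for illustration; this page does not prescribe a chunk size. At 16 kHz and 2 bytes per sample, 100 ms is 3,200 bytes.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # client-to-server rate stated above
BYTES_PER_SAMPLE = 2     # 16-bit mono


def audio_messages(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON `audio` frames, each holding up to chunk_ms of audio."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        yield json.dumps({
            "type": "audio",
            "payload": {
                "data": base64.b64encode(chunk).decode("ascii"),
                "mime_type": "audio/pcm;rate=16000",
            },
        })
```

Each yielded string can be passed directly to `ws.send(...)` once `ready` has been received.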

### `stop` — End the session

```json theme={null}
{ "type": "stop" }
```

Gracefully ends the session. The server responds with an `ended` message containing the session transcript and credits consumed.

<Note>
  `stop` is optional. Closing the WebSocket connection also gracefully ends the session and commits all buffered content server-side. The only difference is that with `stop`, you receive the `ended` response containing the final transcript and credit usage before the connection closes.
</Note>
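Because `ended` is only delivered when `stop` is sent, a client that wants the final transcript has to keep reading after sending it. A minimal sketch, assuming `ws` is any object with async `send` and `recv` methods (such as a connection from the `websockets` library used in the examples above). The 10-second timeout is an arbitrary illustrative value.

```python
import asyncio
import json
from typing import Any, Optional


async def stop_and_wait_for_ended(ws: Any, timeout_s: float = 10.0) -> Optional[dict]:
    """Send `stop`, then read frames until `ended` arrives or the timeout passes."""
    await ws.send(json.dumps({"type": "stop"}))

    async def read_until_ended() -> dict:
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "ended":
                return msg["payload"]
            # Other frames (for example trailing audio) are skipped in this sketch.

    try:
        return await asyncio.wait_for(read_until_ended(), timeout_s)
    except asyncio.TimeoutError:
        return None
```

The function returns the `ended` payload, or `None` if nothing arrived in time. It does not handle the connection closing early; a real client would also catch the library's connection-closed exception.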

### `interrupt` — Interrupt the agent

```json theme={null}
{ "type": "interrupt" }
```

Send this when the user starts speaking while the agent is responding. The agent stops its current response, and the server sends an `interrupted` message.

## Voice Configuration

The `voice_config` option in the `start` message allows you to customize voice session behavior:

| Field                     | Type    | Default                   | Description                                                                   |
| ------------------------- | ------- | ------------------------- | ----------------------------------------------------------------------------- |
| `voice_preset`            | string  | Agent's configured preset | Voice preset to use. See [Available Presets](#available-voice-presets) below. |
| `silence_commit_ms`       | number  | 30000                     | Duration of silence (ms) before auto-committing a segment.                    |
| `segment_max_duration_ms` | number  | 180000                    | Maximum segment duration (ms) before force-commit (3 minutes).                |
| `silence_discard_ratio`   | number  | 0.9                       | Discard a segment if this fraction (0–1) of its audio is silence.             |
| `input_transcription`     | boolean | true                      | Enable transcription of user input audio.                                     |
| `output_transcription`    | boolean | true                      | Enable transcription of agent output audio.                                   |
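A client-side check can catch malformed options before the `start` message is sent. The sketch below is a hypothetical validator built only from the table above; the server's own validation rules are not described on this page and may differ.

```python
from typing import Any, Dict

# Field names and types mirror the table above.
_VOICE_CONFIG_TYPES = {
    "voice_preset": str,
    "silence_commit_ms": (int, float),
    "segment_max_duration_ms": (int, float),
    "silence_discard_ratio": (int, float),
    "input_transcription": bool,
    "output_transcription": bool,
}


def validate_voice_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Raise on unknown fields, wrong types, or an out-of-range ratio."""
    for name, value in config.items():
        expected = _VOICE_CONFIG_TYPES.get(name)
        if expected is None:
            raise ValueError(f"Unknown voice_config field: {name}")
        # bool is a subclass of int, so reject it explicitly for non-boolean fields.
        if expected is not bool and isinstance(value, bool):
            raise TypeError(f"{name} must not be a boolean")
        if not isinstance(value, expected):
            raise TypeError(f"{name} has the wrong type")
    ratio = config.get("silence_discard_ratio")
    if ratio is not None and not 0 <= ratio <= 1:
        raise ValueError("silence_discard_ratio must be between 0 and 1")
    return config
```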

### Available Voice Presets

| Preset    | Description                        |
| --------- | ---------------------------------- |
| `spark`   | Bright, higher pitch               |
| `ember`   | Upbeat, middle pitch               |
| `sage`    | Informative, lower pitch (default) |
| `nova`    | Firm, middle pitch                 |
| `vale`    | Excitable, lower-middle pitch      |
| `drift`   | Youthful, higher pitch             |
| `crest`   | Firm, lower-middle pitch           |
| `orbit`   | Breezy, middle pitch               |
| `brook`   | Easy-going, middle pitch           |
| `gleam`   | Bright, middle pitch               |
| `dusk`    | Breathy, lower pitch               |
| `prism`   | Clear, lower-middle pitch          |
| `coast`   | Easy-going, lower-middle pitch     |
| `velvet`  | Smooth, lower pitch                |
| `silk`    | Smooth, middle pitch               |
| `crystal` | Clear, middle pitch                |
| `ridge`   | Gravelly, lower pitch              |
| `atlas`   | Informative, middle pitch          |
| `bloom`   | Upbeat, higher pitch               |
| `whisper` | Soft, higher pitch                 |
| `steel`   | Firm, lower-middle pitch           |
| `steady`  | Even, lower-middle pitch           |
| `cedar`   | Mature, middle pitch               |
| `forge`   | Forward, middle pitch              |
| `haven`   | Friendly, lower-middle pitch       |
| `tide`    | Casual, lower-middle pitch         |
| `meadow`  | Gentle, middle pitch               |
| `rhythm`  | Lively, lower pitch                |
| `quill`   | Articulate, middle pitch           |
| `glow`    | Warm, lower-middle pitch           |

## Server → Client Messages

### `started` — Session established

```json theme={null}
{
  "type": "started",
  "payload": {
    "session_id": "sess_abc123",
    "conversation_id": "conv_xyz789",
    "message_id": "msg_def456"
  }
}
```

Sent immediately after a `start` message is processed. Contains the IDs for the session, conversation, and initial message.

### `ready` — Agent is ready to receive audio

```json theme={null}
{ "type": "ready" }
```

**Wait for this message before sending audio chunks.** The agent needs a moment to initialize after the session starts.

### `audio` — Agent audio response

```json theme={null}
{
  "type": "audio",
  "payload": {
    "data": "<base64-encoded PCM audio>",
    "mime_type": "audio/pcm;rate=24000"
  }
}
```

Response audio is **16-bit mono PCM at 24 kHz**. Multiple `audio` messages are sent in sequence as the agent speaks.
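One way to inspect the agent's speech is to collect the received `audio` payloads and wrap them in a WAV container. The sketch below uses only the standard library. It assumes the samples are little-endian, which is what WAV expects; this page does not state the byte order.

```python
import base64
import io
import wave
from typing import Iterable


def audio_payloads_to_wav(payloads: Iterable[dict]) -> bytes:
    """Join the `data` of received `audio` payloads into one WAV file."""
    pcm = b"".join(base64.b64decode(payload["data"]) for payload in payloads)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(24_000)  # server-to-client rate stated above
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Writing the returned bytes to a `.wav` file makes the response playable in any standard audio player.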

### `tool_call` — Agent is using a tool

```json theme={null}
{
  "type": "tool_call",
  "payload": {
    "tool_name": "search_knowledge",
    "status": "started"
  }
}
```

Status is either `"started"` or `"completed"`. Use this to show loading indicators while the agent searches knowledge or uses other tools.
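A loading indicator can be driven by counting `started` and `completed` events per tool. `ToolActivity` below is a hypothetical client-side helper, sketched under the assumption that each `started` is eventually matched by a `completed` for the same `tool_name`.

```python
from collections import Counter


class ToolActivity:
    """Track which tools are running, based on `tool_call` messages."""

    def __init__(self) -> None:
        self._running = Counter()

    def handle(self, msg: dict) -> None:
        if msg.get("type") != "tool_call":
            return
        name = msg["payload"]["tool_name"]
        status = msg["payload"]["status"]
        if status == "started":
            self._running[name] += 1
        elif status == "completed" and self._running[name] > 0:
            self._running[name] -= 1

    @property
    def busy(self) -> bool:
        """True while at least one tool call has started but not completed."""
        return any(count > 0 for count in self._running.values())
```

Call `handle` on every incoming message and show the indicator while `busy` is true.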

### `transcript` — Real-time transcription

```json theme={null}
{
  "type": "transcript",
  "payload": {
    "role": "user",
    "text": "What were our Q4 sales?"
  }
}
```

Sent in real-time as transcription becomes available. The `role` field is either `"user"` or `"agent"`. Use this to display a live transcript as the conversation progresses.

### `citation` — Source citation

```json theme={null}
{
  "type": "citation",
  "payload": {
    "citations": [
      { "source": "Q4 Sales Report.pdf", "page": 3 }
    ],
    "timestamp_ms": 12500
  }
}
```

Sent when the agent references a knowledge source. The `timestamp_ms` is relative to the session start.
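For display, the session-relative `timestamp_ms` can be turned into a minutes-and-seconds label. This sketch relies only on the `source` and `page` fields shown in the example above and treats `page` as optional, which is an assumption.

```python
from typing import List


def format_citation_message(msg: dict) -> List[str]:
    """Return one display line per citation, prefixed with the session time."""
    payload = msg["payload"]
    minutes, seconds = divmod(payload["timestamp_ms"] // 1000, 60)
    stamp = f"{minutes:02d}:{seconds:02d}"
    lines = []
    for citation in payload["citations"]:
        label = citation["source"]
        if citation.get("page") is not None:
            label += f", p. {citation['page']}"
        lines.append(f"[{stamp}] {label}")
    return lines
```

For the example message above, this returns `["[00:12] Q4 Sales Report.pdf, p. 3"]`.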

### `interrupted` — Agent was interrupted

```json theme={null}
{ "type": "interrupted" }
```

Confirms that the agent's response was interrupted after a client `interrupt` message.
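Clients that buffer agent audio before playback may want to drop whatever is still queued when this message arrives, so that stale speech is not played over the user. That is a client-side design choice rather than something this page requires. `PlaybackQueue` below is a hypothetical sketch of the idea.

```python
import base64
from collections import deque
from typing import Optional


class PlaybackQueue:
    """Buffer decoded agent audio and discard it when the agent is interrupted."""

    def __init__(self) -> None:
        self.pending = deque()

    def handle(self, msg: dict) -> None:
        if msg.get("type") == "audio":
            self.pending.append(base64.b64decode(msg["payload"]["data"]))
        elif msg.get("type") == "interrupted":
            # Drop audio that was received but not yet played.
            self.pending.clear()

    def next_chunk(self) -> Optional[bytes]:
        """Return the next PCM chunk to play, or None if the queue is empty."""
        return self.pending.popleft() if self.pending else None
```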

### `error` — An error occurred

```json theme={null}
{
  "type": "error",
  "payload": {
    "message": "Description of what went wrong"
  }
}
```

Errors do not necessarily close the session. Transient errors are recoverable — only fatal errors are followed by a WebSocket close.

### `ended` — Session ended

```json theme={null}
{
  "type": "ended",
  "payload": {
    "credits_consumed": 5,
    "transcript": [
      { "role": "user", "text": "What were our Q4 sales?" },
      { "role": "agent", "text": "Based on your sales report, Q4 revenue was $2.3M..." }
    ]
  }
}
```

Sent when the session ends, whether because of a `stop` message, a server-side timeout, or an error. Contains the final transcript and credit usage.
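As a small illustration of the payload shape, the helper below turns an `ended` message into plain text for logging. `summarize_ended` is a hypothetical name, and the output format is arbitrary.

```python
def summarize_ended(msg: dict) -> str:
    """Render the final transcript and credit usage as plain text."""
    payload = msg["payload"]
    lines = [f"{turn['role']}: {turn['text']}" for turn in payload["transcript"]]
    lines.append(f"Credits consumed: {payload['credits_consumed']}")
    return "\n".join(lines)
```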

## Session Lifecycle

```
Client                              Server
  │                                    │
  │── POST /v1/voice ────────────────→ │  (Optional: get URL + start_message)
  │← ─ { url, start_message } ──────  │
  │                                    │
  │─── WebSocket Connect (url) ──────→ │
  │                                    │
  │─── start_message ────────────────→ │
  │                                    │
  │← ─ { type: "started", ... } ──── │  (session_id, conversation_id, message_id)
  │← ─ { type: "ready" } ─────────── │
  │                                    │
  │─── { type: "audio", ... } ──────→ │  (stream mic audio)
  │─── { type: "audio", ... } ──────→ │
  │     ...                            │
  │                                    │
  │← ─ { type: "transcript", ... } ── │  (real-time transcription)
  │← ─ { type: "audio", ... } ──────  │  (agent speaks back)
  │← ─ { type: "audio", ... } ──────  │
  │← ─ { type: "citation", ... } ──── │  (source citations)
  │                                    │
  │─── { type: "interrupt" } ───────→ │  (user interrupts)
  │← ─ { type: "interrupted" } ─────  │
  │                                    │
  │─── { type: "stop" } ───────────→  │  (or just close the WebSocket)
  │← ─ { type: "ended", ... } ──────  │  (transcript + credits)
  │                                    │
  │─── WebSocket Close ──────────────→ │
```

## WebSocket Close Codes

| Code   | Meaning                                                      |
| ------ | ------------------------------------------------------------ |
| `1000` | Normal closure                                               |
| `1001` | Server shutting down                                         |
| `1008` | Authentication failed or invalid API key                     |
| `1011` | Internal server error during connection                      |
| `4000` | Idle timeout — no `start` message received within 30 seconds |
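A client can map these codes to log messages and a reconnect decision. The meanings below come from the table above. The set of retryable codes is an assumption about sensible client policy, not something this page specifies.

```python
from typing import Tuple

CLOSE_CODES = {
    1000: "Normal closure",
    1001: "Server shutting down",
    1008: "Authentication failed or invalid API key",
    1011: "Internal server error during connection",
    4000: "Idle timeout: no start message received within 30 seconds",
}

# Assumption: only server-side conditions are worth an automatic reconnect.
RETRYABLE_CODES = {1001, 1011}


def describe_close(code: int) -> Tuple[str, bool]:
    """Return (meaning, should_retry) for a WebSocket close code."""
    meaning = CLOSE_CODES.get(code, "Unknown close code")
    return meaning, code in RETRYABLE_CODES
```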

## Voice Orchestrator Tasks

Default orchestrator voice sessions can delegate longer-running work to
specialist agents. When delegated work continues after the voice-safe turn
budget, the server persists task status so clients can show a user-scoped task
inbox.

Use these REST endpoints to surface delegated task state:

| Endpoint                                                   | Purpose                                                                                                                          |
| ---------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `GET /v1/voice-orchestrator/tasks`                         | List non-expired active tasks plus unacknowledged terminal tasks for the authenticated user. List responses omit result content. |
| `GET /v1/voice-orchestrator/tasks/{task_id}`               | Retrieve one owned, non-expired task with result or error content when available.                                                |
| `PATCH /v1/voice-orchestrator/tasks/{task_id}/acknowledge` | Mark an owned terminal task as seen so it is not repeatedly surfaced.                                                            |

Supported task states are `queued`, `running`, `completed`, `failed`, and
`cancelled`. The `cancelled` state is reserved for terminal records produced by
future cancellation flows; this API does not currently expose a cancel
operation. Requests for missing, unowned, or expired tasks, attempts to
acknowledge a non-terminal task, and any request made while the feature is
disabled are all answered with not found.

The current backend persists task status and terminal results. In-flight
specialist execution still runs in the existing voice server process, so a
server crash or redeploy during execution can leave a task `running` until its
`expires_at` time.
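The sketch below builds requests for the three endpoints with the standard library. It assumes the same host and `Authorization: Bearer` header as the `POST /v1/voice` cURL example, and it treats `completed`, `failed`, and `cancelled` as the terminal states, which is inferred from the description above. Response shapes are not documented here, so the sketch stops at constructing the requests.

```python
import urllib.request

BASE_URL = "https://api.datagrid.com"  # host from the cURL example on this page
TERMINAL_STATES = {"completed", "failed", "cancelled"}  # inferred, see lead-in


def _request(method: str, path: str, api_key: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        method=method,
        headers={"Authorization": f"Bearer {api_key}"},
    )


def list_tasks_request(api_key: str) -> urllib.request.Request:
    return _request("GET", "/v1/voice-orchestrator/tasks", api_key)


def get_task_request(api_key: str, task_id: str) -> urllib.request.Request:
    return _request("GET", f"/v1/voice-orchestrator/tasks/{task_id}", api_key)


def acknowledge_task_request(api_key: str, task_id: str) -> urllib.request.Request:
    path = f"/v1/voice-orchestrator/tasks/{task_id}/acknowledge"
    return _request("PATCH", path, api_key)


def can_acknowledge(state: str) -> bool:
    """Only terminal tasks can be acknowledged; others are answered with not found."""
    return state in TERMINAL_STATES
```

Each request can be sent with `urllib.request.urlopen(request)`. Checking `can_acknowledge` first avoids a not-found response for tasks that are still `queued` or `running`.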

## Audio Format Reference

| Direction       | Format     | Sample Rate | Channels | Encoding |
| --------------- | ---------- | ----------- | -------- | -------- |
| Client → Server | PCM 16-bit | 16 kHz      | Mono     | Base64   |
| Server → Client | PCM 16-bit | 24 kHz      | Mono     | Base64   |

## Platform Guides

* [iOS / Swift Integration](/api-reference/voice/ios-integration) — Full walkthrough for building a voice assistant in a native iOS app
