Get Result

This endpoint enables real-time audio transcription over a WebSocket connection. It accepts streamed audio input and returns partial and final transcription results during the session.

Request

URL:
GET /ws (WebSocket)

Headers:
Authorization: Bearer <TOKEN>

WebSocket Connection Initialization

Once the WebSocket connection is established, the first message must be a JSON string to initialize the transcription session.

Initialization Message Parameters

  • uid (required): A unique identifier for the client/session.
  • sample_rate (required): Audio sample rate (e.g. 16000).
  • language (optional, default: "en"): The language of the transcription.
  • task (optional, default: "transcribe"): Task type, "transcribe" or "translate".
  • duration_minutes (required if token is provided): The intended duration of the session.
  • model (optional, default: "large-v3-turbo"): Model version to use.
  • initial_prompt (optional): Initial text prompt to guide the transcription.
  • use_vad (optional, default: true): Whether to use voice activity detection.
  • spoken_languages (optional): List of expected spoken languages.
  • vad_parameters (optional): Custom VAD settings (e.g., silence threshold).
  • max_delay (optional, default: 7): Max delay before producing partials.

Example Initialization Message

{
  "uid": "client-1234",
  "sample_rate": 16000,
  "language": "en",
  "task": "transcribe",
  "duration_minutes": 5,
  "model": "large-v3-turbo",
  "initial_prompt": "Welcome to Whisperly.",
  "use_vad": true,
  "spoken_languages": ["en"],
  "vad_parameters": {
    "min_silence_duration_ms": 800,
    "max_speech_duration_s": 7
  },
  "max_delay": 7
}
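As a sketch, the initialization message can be assembled and serialized before it is sent as the first WebSocket message. The helper name below is illustrative, not part of the API; the field names mirror the parameters above.

```python
import json

def build_init_message(uid, sample_rate, duration_minutes, **options):
    """Build the JSON initialization message for the /ws endpoint.

    Required fields are positional; optional fields (language, task,
    model, use_vad, max_delay, etc.) are passed as keyword arguments.
    """
    message = {
        "uid": uid,
        "sample_rate": sample_rate,
        "duration_minutes": duration_minutes,
    }
    message.update(options)
    return json.dumps(message)

# Example matching the message above:
init = build_init_message(
    "client-1234", 16000, 5,
    language="en", task="transcribe", model="large-v3-turbo",
    use_vad=True, max_delay=7,
)
```

The resulting string would then be sent as the first frame after the connection opens (e.g. `await ws.send(init)` with a WebSocket client library).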

Streaming Audio

After initialization, the client must send raw audio chunks as float32 arrays. To end the stream, send the literal bytes b"END_OF_AUDIO".

Audio Frame Format

  • Must be raw float32 PCM data.
  • Send data continuously to avoid timeouts or disconnects.
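Audio captured as 16-bit integers must be converted to float32 before sending. A minimal stdlib-only sketch of that conversion (the helper name is illustrative):

```python
import struct

def int16_to_float32_bytes(samples):
    """Convert a sequence of int16 PCM samples to raw little-endian
    float32 bytes in the range [-1.0, 1.0]."""
    floats = [s / 32768.0 for s in samples]
    return struct.pack(f"<{len(floats)}f", *floats)

# Each float32 sample occupies 4 bytes on the wire.
chunk = int16_to_float32_bytes([0, 16384, -32768])
```

Each resulting chunk would be sent as a binary WebSocket frame.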

Server Messages

The server responds with transcription results in real time.

Response Types

  • Partial Transcription:

{
  "uid": "client-1234",
  "segments": [
    {
      "start": "0.000",
      "end": "3.500",
      "text": "Hello everyone",
      "type": "partial"
    }
  ]
}
  • Final Transcription:

Sent when a complete sentence or silence is detected.

{
  "uid": "client-1234",
  "segments": [
    {
      "start": "0.000",
      "end": "3.500",
      "text": "Hello everyone",
      "type": "final"
    }
  ]
}

  • Server Ready:

{
  "uid": "client-1234",
  "message": "SERVER_READY",
  "backend": "faster_whisper"
}

  • Error Message:

{
  "uid": "client-1234",
  "status": "ERROR",
  "message": "duration_minutes is required"
}
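A client can dispatch on the message shapes above. A hedged sketch of such a handler (the function name and return values are illustrative, not part of the API):

```python
import json

def handle_server_message(raw):
    """Classify a server message as 'ready', 'error', 'final', or
    'partial', returning (kind, payload)."""
    msg = json.loads(raw)
    if msg.get("message") == "SERVER_READY":
        return "ready", msg.get("backend")
    if msg.get("status") == "ERROR":
        return "error", msg.get("message")
    segments = msg.get("segments", [])
    # Partial and final results carry a per-segment "type" field.
    if segments and all(s.get("type") == "final" for s in segments):
        return "final", segments
    return "partial", segments
```

In a receive loop, the client would typically buffer partial segments for display and commit final segments to the transcript.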

Closing the Session

The connection is automatically closed:

  • If the specified duration is exceeded.
  • If the client sends the b"END_OF_AUDIO" signal.
  • If an error occurs.

You can also call websocket.close() from the client to terminate the session manually.
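One client-side pattern for a clean shutdown is to append the end-of-audio sentinel after the last audio chunk, so the server closes the session itself. A minimal sketch (the generator name is illustrative):

```python
def stream_frames(chunks):
    """Yield audio chunks, then the end-of-stream sentinel the server
    expects as the final frame."""
    for chunk in chunks:
        yield chunk
    yield b"END_OF_AUDIO"

# Each yielded frame would be sent as a binary WebSocket message.
frames = list(stream_frames([b"\x00\x00\x00\x00"]))
```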