API Endpoints

Quick start

After running the Docker image, interactive Swagger API documentation is available at localhost:9000/docs
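
If the container is not running yet, a typical invocation looks like the sketch below. The image name and the ASR_MODEL variable are assumptions based on the project's published Docker image; adjust them to match your deployment.

# Start the service on port 9000; ASR_MODEL selects the Whisper model size.
docker run -d -p 9000:9000 -e ASR_MODEL=base onerahmet/openai-whisper-asr-webservice:latest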

There are 2 endpoints available:

Automatic speech recognition service /asr

  • Two task choices:
      • transcribe: (default) transcribes the uploaded file in its source language.
      • translate: produces an English transcript no matter which language was spoken.
  • Files are automatically converted with FFmpeg; see the FFmpeg documentation for the full list of supported audio and video formats.
  • Word-level timestamps can be enabled with the word_timestamps parameter (see the example after this list).
  • Voice activity detection (VAD), which filters out parts of the audio without speech, can be enabled with the vad_filter parameter (only with Faster Whisper for now).
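
For example, a request that translates a recording to English with VAD filtering and word-level timestamps enabled (assuming a Faster Whisper deployment) might look like:

# Translate speech to English, skip silent passages, include per-word timestamps.
curl -X POST -F "audio_file=@/path/to/file" "localhost:9000/asr?task=translate&vad_filter=true&word_timestamps=true&output=json"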

Request URL Query Params

Name              Values                                         Description
audio_file        file                                           Audio or video file to transcribe
output            text (default), json, vtt, srt, tsv           Output format
task              transcribe (default), translate                Transcribe in the source language or translate to English
language          language code, e.g. en (default: auto-detect)  Source language code (see supported languages)
word_timestamps   false (default), true                          Enable word-level timestamps (Faster Whisper only)
vad_filter        false (default), true                          Enable voice activity detection filtering (Faster Whisper only)
encode            true (default), false                          Encode audio through FFmpeg before processing
diarize           false (default), true                          Enable speaker diarization (WhisperX only)
min_speakers      null (default), integer                        Minimum number of speakers for diarization (WhisperX only)
max_speakers      null (default), integer                        Maximum number of speakers for diarization (WhisperX only)

Example request with cURL

curl -X POST -H "content-type: multipart/form-data" -F "audio_file=@/path/to/file" "localhost:9000/asr?output=json"

Response (JSON)

  • text: Contains the full transcript
  • segments: Contains an entry per segment. Each entry provides timestamps, the transcript, token IDs, word-level timestamps, and other metadata
  • language: Detected or provided language (as a language code)
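
As a quick way to inspect these fields on the command line, the JSON output can be piped through jq (assuming jq is installed):

# Pull the transcript, detected language, and segment count out of the response.
curl -s -X POST -F "audio_file=@/path/to/file" "localhost:9000/asr?output=json" | jq '{text, language, segment_count: (.segments | length)}'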

Response Formats

The API supports multiple output formats:

  • text: Plain text transcript (default)
  • json: Detailed JSON with segments, timestamps, and metadata
  • vtt: WebVTT subtitle format
  • srt: SubRip subtitle format
  • tsv: Tab-separated values with timestamps
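
For subtitle workflows, the response body can be written straight to a file; the file name below is illustrative:

# Request SubRip subtitles and save them next to the media file.
curl -s -X POST -F "audio_file=@/path/to/file" -o subtitles.srt "localhost:9000/asr?output=srt"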

Supported Languages

The service supports all languages supported by Whisper. Some common language codes:

  • Turkish (tr)
  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • And many more...

See the Whisper documentation for the full list of supported languages.
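
When the source language is known in advance, passing it explicitly skips auto-detection; for example, for Turkish audio:

# Force the source language to Turkish instead of relying on auto-detection.
curl -X POST -F "audio_file=@/path/to/file" "localhost:9000/asr?language=tr&output=text"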

Speaker Diarization

When using the WhisperX engine with diarization enabled (diarize=true), the output will include speaker labels for each segment. This requires:

  1. The WhisperX engine to be configured
  2. A valid Hugging Face token set in the HF_TOKEN environment variable
  3. Sufficient memory for the diarization models

You can optionally specify min_speakers and max_speakers if you know the expected number of speakers.
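
A request with diarization enabled might look like the sketch below; the speaker counts are illustrative values:

# Label speakers with WhisperX, hinting that 2 to 4 speakers are expected.
curl -X POST -F "audio_file=@/path/to/file" "localhost:9000/asr?diarize=true&min_speakers=2&max_speakers=4&output=json"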

Language detection service /detect-language

Detects the language spoken in the uploaded file. Only the first 30 seconds of audio are processed.
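
Example request with cURL:

# Detect the spoken language from the first 30 seconds of the file.
curl -X POST -F "audio_file=@/path/to/file" "localhost:9000/detect-language"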

Returns a JSON object with the following fields:

  • detected_language: Human readable language name (e.g. "english")
  • language_code: ISO language code (e.g. "en")
  • confidence: Confidence score between 0 and 1 indicating detection reliability

Example response:

{
    "detected_language": "english",
    "language_code": "en",
    "confidence": 0.98
}