API Endpoints
Quick start
After running the Docker image, the interactive Swagger API documentation is available at localhost:9000/docs.
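If you have not started the service yet, here is a minimal sketch; the image name and tag are assumptions, substitute the ones you actually use:

```bash
# Run the service in the background and expose it on port 9000.
# Image name/tag below are assumed; adjust for your setup.
docker run -d -p 9000:9000 onerahmet/openai-whisper-asr-webservice:latest
```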
There are 2 endpoints available:
- /asr (Automatic Speech Recognition)
- /detect-language
Automatic speech recognition service /asr
- 2 task choices:
  - transcribe: (default) transcribes the uploaded file.
  - translate: produces an English transcript no matter which language was spoken.
- Files are automatically converted with FFmpeg, so the full range of audio and video formats FFmpeg supports is accepted.
- You can enable word-level timestamps in the output with the `word_timestamps` parameter.
- You can enable voice activity detection (VAD) to filter out parts of the audio without speech with the `vad_filter` parameter (only with Faster Whisper for now). See the example request after this list.
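As a sketch, a request enabling both features (the parameter names are those listed in the table below):

```bash
curl -X POST -H "content-type: multipart/form-data" \
  -F "audio_file=@/path/to/file" \
  "localhost:9000/asr?output=json&word_timestamps=true&vad_filter=true"
```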
Request URL Query Params
Name | Values | Description |
---|---|---|
audio_file | File | Audio or video file to transcribe |
output | `text` (default), `json`, `vtt`, `srt`, `tsv` | Output format |
task | `transcribe` (default), `translate` | Task type: transcribe in the source language or translate to English |
language | Language code, e.g. `en` (default: auto recognition) | Source language code (see supported languages) |
word_timestamps | `false` (default) | Enable word-level timestamps (Faster Whisper only) |
vad_filter | `false` (default) | Enable voice activity detection filtering (Faster Whisper only) |
encode | `true` (default) | Encode audio through FFmpeg before processing |
diarize | `false` (default) | Enable speaker diarization (WhisperX only) |
min_speakers | `null` (default) | Minimum number of speakers for diarization (WhisperX only) |
max_speakers | `null` (default) | Maximum number of speakers for diarization (WhisperX only) |
Example request with cURL:

```bash
curl -X POST -H "content-type: multipart/form-data" -F "audio_file=@/path/to/file" "localhost:9000/asr?output=json"
```
Response (JSON)
- text: Contains the full transcript
- segments: Contains an entry per segment. Each entry provides timestamps, the transcript, token ids, word-level timestamps, and other metadata.
- language: Detected or provided language (as a language code)
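Since the JSON payload exposes text and segments, you can pull fields out on the command line; a minimal sketch using jq:

```bash
# Print only the full transcript from the JSON response.
curl -s -X POST -F "audio_file=@/path/to/file" "localhost:9000/asr?output=json" | jq -r .text

# Print each segment's text on its own line.
curl -s -X POST -F "audio_file=@/path/to/file" "localhost:9000/asr?output=json" | jq -r '.segments[].text'
```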
Response Formats
The API supports multiple output formats:
- text: Plain text transcript (default)
- json: Detailed JSON with segments, timestamps, and metadata
- vtt: WebVTT subtitle format
- srt: SubRip subtitle format
- tsv: Tab-separated values with timestamps
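For example, to get SubRip subtitles of an English translation written straight to a file, a sketch combining the parameters above:

```bash
curl -s -X POST -F "audio_file=@/path/to/file" \
  "localhost:9000/asr?task=translate&output=srt" -o subtitles.srt
```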
Supported Languages
The service supports all languages supported by Whisper. Some common language codes:
- Turkish (tr)
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- And many more...
See the Whisper documentation for the full list of supported languages.
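To skip auto-detection when the source language is already known, pass its code via the language parameter, e.g. for German:

```bash
curl -s -X POST -F "audio_file=@/path/to/file" "localhost:9000/asr?language=de&output=text"
```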
Speaker Diarization
When using the WhisperX engine with diarization enabled (`diarize=true`), the output will include speaker labels for each segment. This requires:
- the WhisperX engine to be configured
- a valid Hugging Face token set in HF_TOKEN
- sufficient memory for the diarization models
You can optionally specify `min_speakers` and `max_speakers` if you know the expected number of speakers; see the sketch below.
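A sketch of a diarization request, assuming WhisperX is the configured engine and HF_TOKEN is set on the server:

```bash
curl -s -X POST -F "audio_file=@/path/to/file" \
  "localhost:9000/asr?output=json&diarize=true&min_speakers=2&max_speakers=4"
```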
Language detection service /detect-language
Detects the language spoken in the uploaded file. Only the first 30 seconds of audio are processed.
Returns a JSON object with the following fields:
- detected_language: Human readable language name (e.g. "english")
- language_code: ISO language code (e.g. "en")
- confidence: Confidence score between 0 and 1 indicating detection reliability
Example response:

```json
{
  "detected_language": "english",
  "language_code": "en",
  "confidence": 0.98
}
```
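A matching request sketch; it assumes the endpoint takes the same multipart audio_file upload as /asr:

```bash
curl -s -X POST -H "content-type: multipart/form-data" \
  -F "audio_file=@/path/to/file" "localhost:9000/detect-language"
```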