
WhisperX Transcription Service

Speech-to-text (STT) and transcription service

Last updated: 4/16/2026


Overview

FastAPI-based microservice for speech-to-text transcription with speaker diarization. The service exposes two main endpoints: a simple STT endpoint that runs the Whisper model alone, and a full WhisperX pipeline that adds alignment, speaker diarization, and cross-chunk speaker alignment for multi-file audio processing.

Key Features:

  • Simple speech-to-text transcription (Whisper-only)
  • Full WhisperX pipeline with speaker diarization
  • Multi-file audio processing with speaker alignment
  • GPU-accelerated inference with CUDA

Workflow/Process Flow

Data Flow:

  • Simple STT: Audio File → Whisper Model → Transcribed Text
  • Full Pipeline: Audio File(s) → WhisperX Transcription → Alignment → Speaker Diarization → Speaker-Tagged Segments
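
The cross-chunk speaker alignment step in the full pipeline can be pictured as merging consecutive segments that share a speaker label into single speaker turns. The sketch below illustrates the idea; the segment shape (`start`, `end`, `speaker`, `text` keys) is an assumption for illustration, not the service's actual internal format:

```python
# Sketch of cross-chunk speaker alignment: collapse consecutive segments
# that share a speaker label so each output entry covers one speaker turn.
# The segment dict shape here is assumed for illustration only.

def merge_speaker_turns(segments):
    """Merge consecutive segments with the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            # Speaker changed: start a new turn (copy to avoid mutating input).
            turns.append(dict(seg))
    return turns
```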

Note: Generate or update workflow diagrams using Mermaid Live Editor. Describe your changes to an LLM, get Mermaid code, and paste it here.

Technology Stack

  • Runtime: Python 3.10+
  • API Framework: FastAPI
  • ML Models:
    • WhisperX (large-v3) for transcription
    • Whisper for simple STT
    • pyannote for speaker diarization
  • ML Framework: PyTorch with CUDA support
  • Dependency Management: UV
  • Containerization: Docker with NVIDIA CUDA
  • Container Registry: GitHub Container Registry (ghcr.io)

File Structure

├── src/                   # Source code
│   ├── server.py          # FastAPI application and routes
│   ├── models.py          # Pydantic request/response models
│   ├── pipelines.py       # WhisperX and STT processing logic
│   ├── config.py          # Configuration and environment variables
│   └── utils.py           # Audio processing utilities
├── tests/                 # Test suite (see tests/README.md)
├── pyproject.toml         # UV dependency management
├── Dockerfile             # Docker configuration
├── docker-compose.yml     # Docker Compose for deployment
├── .env.example           # Example environment variables
└── README.md

Setup

Prerequisites

GPU Required: This service requires an NVIDIA GPU with CUDA support.

HuggingFace Token: Required for speaker diarization models.

  1. Create a Hugging Face account at https://huggingface.co
  2. Accept model terms:
    • pyannote/speaker-diarization@2.1
    • pyannote/segmentation@3.0
  3. Generate access token: https://huggingface.co/settings/tokens (Read permissions)

Local Development

# Clone the repository
git clone <repo-url>
cd siraaj-whisperx-service

# Create virtual environment
uv venv
source .venv/bin/activate  # Linux/Mac/WSL

# Install dependencies
uv pip install -e .

# Set environment variables
cp .env.example .env
# Edit .env and add your HF_TOKEN

# Verify GPU is available
nvidia-smi

# Start the service
python -m uvicorn src.server:app --reload --host 0.0.0.0 --port 8000

The service listens on http://0.0.0.0:8000 and is reachable locally at http://localhost:8000
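
Once the server is up, the endpoints can be called from Python with the standard library alone. The sketch below builds a multipart/form-data body by hand; the field names `file` and `language` are assumptions inferred from the endpoint descriptions in this README, not a verified contract:

```python
import io
import uuid

def build_multipart(fields, files):
    """Build a multipart/form-data body from text fields and (name, filename, data) files.

    Returns (content_type_header, body_bytes), suitable for urllib.request.
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    for name, filename, data in files:
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'.encode()
        )
        buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
        buf.write(data + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", buf.getvalue()

# Usage against a running instance (field names are hypothetical):
# import urllib.request
# ctype, body = build_multipart(
#     {"language": "en"},
#     [("file", "audio.wav", open("audio.wav", "rb").read())],
# )
# req = urllib.request.Request(
#     "http://localhost:8000/stt/", data=body, headers={"Content-Type": ctype}
# )
# print(urllib.request.urlopen(req).read())
```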

Docker Deployment

Prerequisites:

Versioning: This project follows Semantic Versioning - MAJOR.MINOR.PATCH
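
As a sanity check before tagging an image, the version string can be validated against the MAJOR.MINOR.PATCH pattern. This is a convenience sketch, not part of the repo:

```python
import re

# Validate a MAJOR.MINOR.PATCH version string (SemVer core only,
# no pre-release or build suffixes) before using it as an image tag.
SEMVER_CORE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")

def is_valid_version(version: str) -> bool:
    """Return True if the string is a plain MAJOR.MINOR.PATCH version."""
    return SEMVER_CORE.match(version) is not None
```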

Build and Push:

# Set version
export VERSION=1.0.0

# Build the image
docker build -t ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:${VERSION} .

# Login to GitHub Container Registry
echo $GITHUB_TOKEN | docker login ghcr.io -u USERNAME --password-stdin

# Push to registry
docker push ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:${VERSION}

Pull and Run:

# Pull the image
docker pull ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

# Run with GPU support
docker run --gpus all \
  -e HF_TOKEN=your_hf_token_here \
  -p 8000:8000 \
  ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

# Run in background (detached mode)
docker run -d \
  --gpus all \
  --name whisperx-service \
  -e HF_TOKEN=your_hf_token_here \
  -p 8000:8000 \
  --restart unless-stopped \
  ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

Docker Compose (Recommended for Portainer):

Create a .env file with your HuggingFace token:

HF_TOKEN=your_hf_token_here

Deploy:

# Start the service
docker compose up -d

# View logs
docker compose logs -f

# Stop the service
docker compose down

For Portainer:

  1. Upload docker-compose.yml to your stack
  2. Set HF_TOKEN in environment variables
  3. Deploy the stack
  4. Service will be available on port 8000

The compose file includes GPU reservation, a health check, auto-restart, and port mapping.
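
The authoritative docker-compose.yml lives in this repo; a minimal sketch of the pieces listed above (GPU reservation, health check, restart policy, port mapping) might look like the following, with the image tag and health endpoint assumed from this README:

```yaml
services:
  whisperx:
    image: ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0
    ports:
      - "8000:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```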

Environment Variables

| Variable | Required | Description | Default | Example |
|---|---|---|---|---|
| HF_TOKEN | Yes | Hugging Face API token for speaker diarization | - | hf_xxxxxxxxxxxx |
| WHISPER_MODEL_SIZE | No | Whisper model size | large-v3 | base, medium, large-v3 |
| DEVICE | No | Compute device | cuda | cuda, cpu |
| COMPUTE_TYPE | No | Compute precision | float16 | float16, int8 |
| DEFAULT_BATCH_SIZE | No | Batch size for inference | 8 | 8, 16 |
| DEFAULT_LANGUAGE | No | Default language code | en | en, ar, fr |
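
A settings loader backing the table above might read these variables as sketched below. The defaults mirror the table; the real src/config.py may be structured differently:

```python
import os

def load_settings(env=os.environ):
    """Read service settings from the environment, using the documented defaults.

    HF_TOKEN has no default: it is required for speaker diarization.
    """
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required for speaker diarization")
    return {
        "hf_token": token,
        "whisper_model_size": env.get("WHISPER_MODEL_SIZE", "large-v3"),
        "device": env.get("DEVICE", "cuda"),
        "compute_type": env.get("COMPUTE_TYPE", "float16"),
        "default_batch_size": int(env.get("DEFAULT_BATCH_SIZE", "8")),
        "default_language": env.get("DEFAULT_LANGUAGE", "en"),
    }
```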

Testing

This service includes comprehensive unit and integration tests with mocked models (no GPU required).

For detailed testing documentation, see tests/README.md.

API Endpoints

GET /health

Health check endpoint with GPU status and memory info.

GET /

Service information and available endpoints.

POST /stt/

Simple speech-to-text using Whisper model only.

  • Input: Audio file (multipart), language (optional)
  • Output: Transcribed text, language, duration
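
Based on the output description above, a client might model the response like this. The field names are assumptions inferred from the description, not the service's actual Pydantic schema:

```python
from dataclasses import dataclass

# Hypothetical client-side model of the /stt/ response; field names are
# inferred from the README's output description, not the real schema.
@dataclass
class STTResponse:
    text: str
    language: str
    duration: float

    @classmethod
    def from_json(cls, payload: dict) -> "STTResponse":
        return cls(
            text=payload["text"],
            language=payload["language"],
            duration=float(payload["duration"]),
        )
```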

POST /upload-audio/

Full WhisperX pipeline with speaker diarization.

  • Input: Audio file(s) (multipart), num_speakers (optional), language (optional)
  • Output: Text file with speaker-tagged segments and timestamps
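
The speaker-tagged output can be pictured as one line per segment with timestamps. The line format below is illustrative only; the service's actual output layout may differ:

```python
def format_segment(start: float, end: float, speaker: str, text: str) -> str:
    """Render one speaker-tagged segment as an '[MM:SS - MM:SS] SPEAKER: text' line.

    Illustrative format; the real service's output layout may differ.
    """
    def ts(seconds: float) -> str:
        minutes, secs = divmod(int(seconds), 60)
        return f"{minutes:02d}:{secs:02d}"
    return f"[{ts(start)} - {ts(end)}] {speaker}: {text}"
```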

Development

Code Style

Format code with Black:

uv pip install black
uv run black .

Lint with Ruff:

uv pip install ruff
uv run ruff check .

# Auto-fix issues
uv run ruff check --fix .

General Notes

Project History:

  • The code was initially migrated from https://github.com/rihal-om/meeting-analyzer/tree/main/api_compute after it became a key service within the Siraaj product. During the move, the project structure was reorganized, the NVIDIA base image was updated, dependency management was switched from requirements.txt to uv, and a comprehensive test suite was added. All of these changes aim at better maintainability when adding new features.

Resource Requirements:

  • Minimum VRAM: 10 GB for large-v3 model
  • Recommended VRAM: 12 GB+ for optimal performance
  • CPU: 4+ cores recommended
  • Disk Space: ~5 GB for models and dependencies
  • Memory: 8 GB+ RAM