# WhisperX Transcription Service

Speech-to-Text (STT) and transcription service.

## Overview

FastAPI-based microservice for speech-to-text transcription with speaker diarization. The service exposes two main endpoints: a simple STT endpoint that uses the Whisper model alone, and a full WhisperX pipeline that processes one or more audio files with alignment and cross-chunk speaker diarization.
**Key Features:**
- Simple speech-to-text transcription (Whisper-only)
- Full WhisperX pipeline with speaker diarization
- Multi-file audio processing with speaker alignment
- GPU-accelerated inference with CUDA
## Workflow / Process Flow

**Data Flow:**
- Simple STT: Audio File → Whisper Model → Transcribed Text
- Full Pipeline: Audio File(s) → WhisperX Transcription → Alignment → Speaker Diarization → Speaker-Tagged Segments
Note: Generate or update workflow diagrams using Mermaid Live Editor. Describe your changes to an LLM, get Mermaid code, and paste it here.
## Technology Stack

- Runtime: Python 3.10+
- API Framework: FastAPI
- ML Models:
  - WhisperX (large-v3) for transcription
  - Whisper for simple STT
  - pyannote for speaker diarization
- ML Framework: PyTorch with CUDA support
- Dependency Management: UV
- Containerization: Docker with NVIDIA CUDA
- Container Registry: GitHub Container Registry (ghcr.io)
## File Structure

```
├── src/                  # Source code
│   ├── server.py         # FastAPI application and routes
│   ├── models.py         # Pydantic request/response models
│   ├── pipelines.py      # WhisperX and STT processing logic
│   ├── config.py         # Configuration and environment variables
│   └── utils.py          # Audio processing utilities
├── tests/                # Test suite (see tests/README.md)
├── pyproject.toml        # UV dependency management
├── Dockerfile            # Docker configuration
├── docker-compose.yml    # Docker Compose for deployment
├── .env.example          # Example environment variables
└── README.md
```
## Setup

### Prerequisites

**GPU Required:** This service requires an NVIDIA GPU with CUDA support.

**HuggingFace Token:** Required for the speaker diarization models.

- Create a Hugging Face account at https://huggingface.co
- Accept the model terms for:
  - pyannote/speaker-diarization@2.1
  - pyannote/segmentation@3.0
- Generate an access token (Read permission): https://huggingface.co/settings/tokens
### Local Development

```shell
# Clone the repository
git clone <repo-url>
cd siraaj-whisperx-service

# Create virtual environment
uv venv
source .venv/bin/activate  # Linux/Mac/WSL

# Install dependencies
uv pip install -e .

# Set environment variables
cp .env.example .env
# Edit .env and add your HF_TOKEN

# Verify GPU is available
nvidia-smi

# Start the service
python -m uvicorn src.server:app --reload --host 0.0.0.0 --port 8000
```
The service will start on http://0.0.0.0:8000
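Once running, the service can be smoke-tested from Python. A minimal sketch, assuming the /health payload carries `status` and `gpu_available` fields (assumed names, not a verified schema):

```python
"""Sketch: summarize a /health response from the WhisperX service."""


def summarize_health(payload: dict) -> str:
    """One-line summary of a /health payload (field names are assumed)."""
    status = payload.get("status", "unknown")
    gpu = "GPU available" if payload.get("gpu_available") else "no GPU"
    return f"{status} ({gpu})"


# Example against a live service (requires it running on port 8000):
#   import json, urllib.request
#   with urllib.request.urlopen("http://localhost:8000/health") as resp:
#       print(summarize_health(json.load(resp)))
```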
### Docker Deployment

**Prerequisites:**

- NVIDIA GPU with CUDA support
- NVIDIA Docker Runtime: https://github.com/NVIDIA/nvidia-docker
- Docker 20.10+
**Versioning:** This project follows Semantic Versioning (MAJOR.MINOR.PATCH).
**Build and Push:**

```shell
# Set version
export VERSION=1.0.0

# Build the image
docker build -t ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:${VERSION} .

# Login to GitHub Container Registry
echo $GITHUB_TOKEN | docker login ghcr.io -u USERNAME --password-stdin

# Push to registry
docker push ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:${VERSION}
```
**Pull and Run:**

```shell
# Pull the image
docker pull ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

# Run with GPU support
docker run --gpus all \
  -e HF_TOKEN=your_hf_token_here \
  -p 8000:8000 \
  ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

# Run in background (detached mode)
docker run -d \
  --gpus all \
  --name whisperx-service \
  -e HF_TOKEN=your_hf_token_here \
  -p 8000:8000 \
  --restart unless-stopped \
  ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0
```
**Docker Compose (Recommended for Portainer):**

Create a .env file with your HuggingFace token:

```shell
HF_TOKEN=your_hf_token_here
```

Deploy:

```shell
# Start the service
docker compose up -d

# View logs
docker compose logs -f

# Stop the service
docker compose down
```
**For Portainer:**

- Upload `docker-compose.yml` to your stack
- Set `HF_TOKEN` in environment variables
- Deploy the stack
- The service will be available on port 8000
The compose file includes GPU reservation, health checks, auto-restart, and port mapping.
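For reference, the GPU reservation and health check typically take this shape in Compose. This is an illustrative sketch only; the repository's docker-compose.yml is canonical:

```yaml
services:
  whisperx:
    image: ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0
    environment:
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8000:8000"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```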
## Environment Variables

| Variable | Required | Description | Default | Example |
|---|---|---|---|---|
| `HF_TOKEN` | Yes | Hugging Face API token for speaker diarization | - | `hf_xxxxxxxxxxxx` |
| `WHISPER_MODEL_SIZE` | No | Whisper model size | `large-v3` | `base`, `medium`, `large-v3` |
| `DEVICE` | No | Compute device | `cuda` | `cuda`, `cpu` |
| `COMPUTE_TYPE` | No | Compute precision | `float16` | `float16`, `int8` |
| `DEFAULT_BATCH_SIZE` | No | Batch size for inference | `8` | `8`, `16` |
| `DEFAULT_LANGUAGE` | No | Default language code | `en` | `en`, `ar`, `fr` |
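A sketch of how these variables might be consumed. The actual parsing lives in src/config.py; this loader is illustrative, with defaults copied from the table above:

```python
"""Illustrative settings loader mirroring the environment variable table."""
import os


def load_settings(env=None) -> dict:
    """Read service settings, falling back to the documented defaults.

    HF_TOKEN has no default: a missing token raises KeyError up front
    rather than failing later inside the diarization pipeline.
    """
    env = os.environ if env is None else env
    return {
        "hf_token": env["HF_TOKEN"],  # required
        "model_size": env.get("WHISPER_MODEL_SIZE", "large-v3"),
        "device": env.get("DEVICE", "cuda"),
        "compute_type": env.get("COMPUTE_TYPE", "float16"),
        "batch_size": int(env.get("DEFAULT_BATCH_SIZE", "8")),
        "language": env.get("DEFAULT_LANGUAGE", "en"),
    }
```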
## Testing
This service includes comprehensive unit and integration tests with mocked models (no GPU required).
For detailed testing documentation, see tests/README.md.
## API Endpoints

### GET /health

Health check endpoint with GPU status and memory info.

### GET /

Service information and available endpoints.

### POST /stt/

Simple speech-to-text using the Whisper model only.

- Input: audio file (multipart), `language` (optional)
- Output: transcribed text, language, duration

### POST /upload-audio/

Full WhisperX pipeline with speaker diarization.

- Input: audio file(s) (multipart), `num_speakers` (optional), `language` (optional)
- Output: text file with speaker-tagged segments and timestamps
## Development

### Code Style

Format code with Black:

```shell
uv pip install black
uv run black .
```

Lint with Ruff:

```shell
uv pip install ruff
uv run ruff check .

# Auto-fix issues
uv run ruff check --fix .
```
## General Notes

**Project History:**

The code was initially transferred from https://github.com/rihal-om/meeting-analyzer/tree/main/api_compute. Because it became a key service within the Siraaj product, it was restructured for better maintainability: the project layout was updated, the NVIDIA base image refreshed, dependency management moved from requirements.txt to uv, and a comprehensive test suite added.
**Resource Requirements:**
- Minimum VRAM: 10 GB for large-v3 model
- Recommended VRAM: 12 GB+ for optimal performance
- CPU: 4+ cores recommended
- Disk Space: ~5 GB for models and dependencies
- Memory: 8 GB+ RAM
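A small pre-flight sketch against these minimums. The threshold helper is pure arithmetic; the torch query is optional and guarded so the snippet degrades gracefully without a GPU:

```python
"""Pre-flight VRAM check against the documented 10 GB minimum."""


def meets_vram_minimum(total_bytes: int, minimum_gb: float = 10.0) -> bool:
    """True when total GPU memory covers the large-v3 minimum (10 GB)."""
    return total_bytes >= minimum_gb * 1024 ** 3


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the live query

        total = torch.cuda.get_device_properties(0).total_memory
        print(f"{total / 1024 ** 3:.1f} GB VRAM, "
              f"sufficient: {meets_vram_minimum(total)}")
    except Exception as exc:  # no GPU, or torch not installed
        print("Could not query GPU:", exc)
```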