
WhisperX Transcription Service

Speech-to-text (STT) and transcription service

Last updated: 4/16/2026


Overview

FastAPI-based microservice for speech-to-text transcription with speaker diarization. The service exposes two main endpoints: a simple STT endpoint that runs the Whisper model alone, and a full WhisperX pipeline that adds alignment, speaker diarization, and cross-chunk speaker alignment for multi-file audio processing.

Key Features:

  • Simple speech-to-text transcription (Whisper-only)
  • Full WhisperX pipeline with speaker diarization
  • Multi-file audio processing with speaker alignment
  • GPU-accelerated inference with CUDA

Workflow/Process Flow

Data Flow:

  • Simple STT: Audio File → Whisper Model → Transcribed Text
  • Full Pipeline: Audio File(s) → WhisperX Transcription → Alignment → Speaker Diarization → Speaker-Tagged Segments
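
The cross-chunk speaker alignment step in the full pipeline can be pictured as merging consecutive segments that share a speaker label into single speaker turns. The sketch below illustrates the idea; the segment shape (`start`, `end`, `speaker`, `text` keys) is an assumption for illustration, not the service's actual internal format:

```python
# Sketch of cross-chunk speaker alignment: collapse consecutive segments
# that share a speaker label so each output entry covers one speaker turn.
# The segment dict shape here is assumed for illustration only.

def merge_speaker_turns(segments):
    """Merge consecutive segments with the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            # Speaker changed: start a new turn (copy to avoid mutating input).
            turns.append(dict(seg))
    return turns
```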

Note: Generate or update workflow diagrams using Mermaid Live Editor. Describe your changes to an LLM, get Mermaid code, and paste it here.

Technology Stack

  • Runtime: Python 3.10+
  • API Framework: FastAPI
  • ML Models:
    • WhisperX (large-v3) for transcription
    • Whisper for simple STT
    • pyannote for speaker diarization
  • ML Framework: PyTorch with CUDA support
  • Dependency Management: UV
  • Containerization: Docker with NVIDIA CUDA
  • Container Registry: GitHub Container Registry (ghcr.io)

File Structure

├── src/                   # Source code
│   ├── server.py          # FastAPI application and routes
│   ├── models.py          # Pydantic request/response models
│   ├── pipelines.py       # WhisperX and STT processing logic
│   ├── config.py          # Configuration and environment variables
│   └── utils.py           # Audio processing utilities
├── tests/                 # Test suite (see tests/README.md)
├── pyproject.toml         # UV dependency management
├── Dockerfile             # Docker configuration
├── docker-compose.yml     # Docker Compose for deployment
├── .env.example           # Example environment variables
└── README.md

Setup

Prerequisites

GPU Required: This service requires an NVIDIA GPU with CUDA support.

HuggingFace Token: Required for speaker diarization models.

  1. Create a Hugging Face account at https://huggingface.co
  2. Accept model terms:
    • pyannote/speaker-diarization@2.1
    • pyannote/segmentation@3.0
  3. Generate access token: https://huggingface.co/settings/tokens (Read permissions)

Local Development

# Clone the repository
git clone <repo-url>
cd siraaj-whisperx-service

# Create virtual environment
uv venv
source .venv/bin/activate  # Linux/Mac/WSL

# Install dependencies
uv pip install -e .

# Set environment variables
cp .env.example .env
# Edit .env and add your HF_TOKEN

# Verify GPU is available
nvidia-smi

# Start the service
python -m uvicorn src.server:app --reload --host 0.0.0.0 --port 8000

The service listens on http://0.0.0.0:8000 and is reachable locally at http://localhost:8000
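
Once the server is up, the endpoints can be called from Python with the standard library alone. The sketch below builds a multipart/form-data body by hand; the field names `file` and `language` are assumptions inferred from the endpoint descriptions in this README, not a verified contract:

```python
import io
import uuid

def build_multipart(fields, files):
    """Build a multipart/form-data body from text fields and (name, filename, data) files.

    Returns (content_type_header, body_bytes), suitable for urllib.request.
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    for name, filename, data in files:
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'.encode()
        )
        buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
        buf.write(data + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", buf.getvalue()

# Usage against a running instance (field names are hypothetical):
# import urllib.request
# ctype, body = build_multipart(
#     {"language": "en"},
#     [("file", "audio.wav", open("audio.wav", "rb").read())],
# )
# req = urllib.request.Request(
#     "http://localhost:8000/stt/", data=body, headers={"Content-Type": ctype}
# )
# print(urllib.request.urlopen(req).read())
```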

Docker Deployment

Prerequisites:

Versioning: This project follows Semantic Versioning - MAJOR.MINOR.PATCH
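
As a sanity check before tagging an image, the version string can be validated against the MAJOR.MINOR.PATCH pattern. This is a convenience sketch, not part of the repo:

```python
import re

# Validate a MAJOR.MINOR.PATCH version string (SemVer core only,
# no pre-release or build suffixes) before using it as an image tag.
SEMVER_CORE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")

def is_valid_version(version: str) -> bool:
    """Return True if the string is a plain MAJOR.MINOR.PATCH version."""
    return SEMVER_CORE.match(version) is not None
```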

Build and Push:

# Set version
export VERSION=1.0.0

# Build the image
docker build -t ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:${VERSION} .

# Login to GitHub Container Registry
echo $GITHUB_TOKEN | docker login ghcr.io -u USERNAME --password-stdin

# Push to registry
docker push ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:${VERSION}

Pull and Run:

# Pull the image
docker pull ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

# Run with GPU support
docker run --gpus all \
  -e HF_TOKEN=your_hf_token_here \
  -p 8000:8000 \
  ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

# Run in background (detached mode)
docker run -d \
  --gpus all \
  --name whisperx-service \
  -e HF_TOKEN=your_hf_token_here \
  -p 8000:8000 \
  --restart unless-stopped \
  ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0

Docker Compose (Recommended for Portainer):

Create a .env file with your HuggingFace token:

HF_TOKEN=your_hf_token_here

Deploy:

# Start the service
docker compose up -d

# View logs
docker compose logs -f

# Stop the service
docker compose down

For Portainer:

  1. Upload docker-compose.yml to your stack
  2. Set HF_TOKEN in environment variables
  3. Deploy the stack
  4. Service will be available on port 8000

The compose file includes GPU reservation, a health check, auto-restart, and port mapping.
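
The authoritative docker-compose.yml lives in this repo; a minimal sketch of the pieces listed above (GPU reservation, health check, restart policy, port mapping) might look like the following, with the image tag and health endpoint assumed from this README:

```yaml
services:
  whisperx:
    image: ghcr.io/rihal-om/siraaj-whisperx-service/whisperx:1.0.0
    ports:
      - "8000:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```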

Environment Variables

| Variable | Required | Description | Default | Example |
|---|---|---|---|---|
| HF_TOKEN | Yes | Hugging Face API token for speaker diarization | - | hf_xxxxxxxxxxxx |
| WHISPER_MODEL_SIZE | No | Whisper model size | large-v3 | base, medium, large-v3 |
| DEVICE | No | Compute device | cuda | cuda, cpu |
| COMPUTE_TYPE | No | Compute precision | float16 | float16, int8 |
| DEFAULT_BATCH_SIZE | No | Batch size for inference | 8 | 8, 16 |
| DEFAULT_LANGUAGE | No | Default language code | en | en, ar, fr |
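
A settings loader backing the table above might read these variables as sketched below. The defaults mirror the table; the real src/config.py may be structured differently:

```python
import os

def load_settings(env=os.environ):
    """Read service settings from the environment, using the documented defaults.

    HF_TOKEN has no default: it is required for speaker diarization.
    """
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required for speaker diarization")
    return {
        "hf_token": token,
        "whisper_model_size": env.get("WHISPER_MODEL_SIZE", "large-v3"),
        "device": env.get("DEVICE", "cuda"),
        "compute_type": env.get("COMPUTE_TYPE", "float16"),
        "default_batch_size": int(env.get("DEFAULT_BATCH_SIZE", "8")),
        "default_language": env.get("DEFAULT_LANGUAGE", "en"),
    }
```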

Testing

This service includes comprehensive unit and integration tests with mocked models (no GPU required).

For detailed testing documentation, see tests/README.md.

API Endpoints

GET /health

Health check endpoint with GPU status and memory info.

GET /

Service information and available endpoints.

POST /stt/

Simple speech-to-text using Whisper model only.

  • Input: Audio file (multipart), language (optional)
  • Output: Transcribed text, language, duration
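
Based on the output description above, a client might model the response like this. The field names are assumptions inferred from the description, not the service's actual Pydantic schema:

```python
from dataclasses import dataclass

# Hypothetical client-side model of the /stt/ response; field names are
# inferred from the README's output description, not the real schema.
@dataclass
class STTResponse:
    text: str
    language: str
    duration: float

    @classmethod
    def from_json(cls, payload: dict) -> "STTResponse":
        return cls(
            text=payload["text"],
            language=payload["language"],
            duration=float(payload["duration"]),
        )
```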

POST /upload-audio/

Full WhisperX pipeline with speaker diarization.

  • Input: Audio file(s) (multipart), num_speakers (optional), language (optional)
  • Output: Text file with speaker-tagged segments and timestamps
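
The speaker-tagged output can be pictured as one line per segment with timestamps. The line format below is illustrative only; the service's actual output layout may differ:

```python
def format_segment(start: float, end: float, speaker: str, text: str) -> str:
    """Render one speaker-tagged segment as an '[MM:SS - MM:SS] SPEAKER: text' line.

    Illustrative format; the real service's output layout may differ.
    """
    def ts(seconds: float) -> str:
        minutes, secs = divmod(int(seconds), 60)
        return f"{minutes:02d}:{secs:02d}"
    return f"[{ts(start)} - {ts(end)}] {speaker}: {text}"
```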

Development

Code Style

Format code with Black:

uv pip install black
uv run black .

Lint with Ruff:

uv pip install ruff
uv run ruff check .

# Auto-fix issues
uv run ruff check --fix .

General Notes

Project History:

  • The code was initially migrated from https://github.com/rihal-om/meeting-analyzer/tree/main/api_compute after it became a key service within the Siraaj product. During the move, the project structure was reorganized, the NVIDIA base image was updated, dependency management was switched from requirements.txt to uv, and a comprehensive test suite was added. All of these changes aim at better maintainability when adding new features.

Resource Requirements:

  • Minimum VRAM: 10 GB for large-v3 model
  • Recommended VRAM: 12 GB+ for optimal performance
  • CPU: 4+ cores recommended
  • Disk Space: ~5 GB for models and dependencies
  • Memory: 8 GB+ RAM