siraaj-dot-ocr-service / README.md

OCR Service

Last updated: 4/16/2026GitHub

OCR Service

Azure Document Intelligence-compatible OCR service using Temporal workflows, Minio storage, and a multi-model fallback strategy with SigLIP2 language routing. English pages use LightOnOCR → DotsOCR → Qwen3-VL → PaddleOCR; general/Arabic pages use DotsOCR → Qwen3-VL → PaddleOCR. Supports single images and multi-page PDFs.

Documentation

See the docs index for a full guide to the documentation structure.

Architecture & Flows — processing flows, fallback strategies, and prompt documentation
Migration Complete — Triton migration checklist
OCR Model Comparison Report — benchmark across 20+ OCR models (English + Arabic)
Hard English Benchmark — English-only benchmark of 17 models on 9 difficult pages with ground truth scoring
Research — benchmark scripts, baselines, and raw results

Endpoints

Azure Document Intelligence API compatible endpoints:

POST /documentintelligence/documentModels/{modelId}:analyze
- Content-Type: image/jpeg, image/png, or application/pdf
- Query: api-version required (2024-11-30). Optional: outputContentFormat, stringIndexType, pages.
- pages: 1-based page selection for PDFs, e.g. "1-3,5,7-9". Omit to process all pages. Ignored for images.
- Response: 202 Accepted with Operation-Location header and Retry-After.
GET /documentintelligence/documentModels/{modelId}/analyzeResults/{resultId}
- Query: api-version required (2024-11-30).
- Response: AnalyzeOperation envelope with status and analyzeResult on success.
- For PDFs: analyzeResult.pages[] contains all pages sorted by pageNumber.

Configuration

Environment Variables

Temporal:

TEMPORAL_HOST - Temporal server address (default: localhost:7233)
TEMPORAL_NAMESPACE - Temporal namespace (default: default)
TEMPORAL_TASK_QUEUE - Task queue name (default: ocr-tasks)

Minio:

MINIO_ENDPOINT - Minio server address (default: localhost:9000)
MINIO_ACCESS_KEY - Access key (default: minioadmin)
MINIO_SECRET_KEY - Secret key (default: minioadmin)
MINIO_BUCKET - Bucket name (default: ocr-documents)

Models:

DOTS_VLLM_BASE_URL - DotsOCR server URL (default: http://localhost:8000)
DOTS_VLLM_MODEL - DotsOCR model name (default: model)
QWEN_VLLM_BASE_URL - Qwen server URL (default: http://localhost:8000)
QWEN_VLLM_MODEL - Qwen model name (default: model)
LIGHTONOCR_VLLM_BASE_URL - LightOnOCR-2 server URL (default: http://localhost:8003)
LIGHTONOCR_VLLM_MODEL - LightOnOCR-2 model name (default: lightonocr-2)
TRITON_GRPC_URL - Triton gRPC URL for PaddleOCR + language router (default: triton:8001)

Model Tuning:

VLLM_TIMEOUT - vLLM request timeout in seconds (default: 300)
VLLM_MAX_TOKENS - Max output tokens (default: 4096)
DOTS_TEMPERATURE - Temperature for DOTS model (default: 0.1)
DOTS_TOP_P - Top-p for DOTS model (default: 0.9)
LIGHTONOCR_TEMPERATURE - Temperature for LightOnOCR (default: 0.2)
LIGHTONOCR_TOP_P - Top-p for LightOnOCR (default: 0.9)
QWEN_TEMPERATURE - Temperature for Qwen (default: 0.7)
QWEN_TOP_P - Top-p for Qwen (default: 0.8)
QWEN_TOP_K - Top-k for Qwen (default: 20)
QWEN_PRESENCE_PENALTY - Presence penalty for Qwen (default: 1.5)
TRITON_TIMEOUT - Triton gRPC timeout in seconds (default: 30)

Processing:

MAX_IMAGES - Maximum images per page (default: 25)
MAX_UPLOAD_SIZE_MB - Maximum upload size in MB (default: 100)
WORKFLOW_WORKERS - Number of Temporal workers (default: 4)

PDF Processing:

MAX_PDF_PAGES - Maximum pages per PDF (default: 1000)
PDF_EXTRACT_TIMEOUT - Page extraction timeout in seconds (default: 900 = 15 min)
PDF_PAGE_CONCURRENCY - Parallel page processing limit (default: 1)
PDF_BATCH_SIZE - Pages per processing batch (default: 10)

Operations:

LOG_LEVEL - Logging level (default: info)
API_WORKERS - Uvicorn worker processes (default: 1)
MAX_NUM_SEQS - vLLM concurrent sequences (default: 4)
MAX_CONCURRENT_WORKFLOW_TASKS - Max concurrent workflow tasks per worker (default: 5)
MAX_CACHED_WORKFLOWS - Max cached workflow instances (default: 50)
OCR_SAVE_STATES - Save OCR workflow states (default: false)
MINIO_USE_SSL - Use SSL for Minio (default: false)
TEMPORAL_WORKFLOW_TIMEOUT - Workflow timeout in seconds (default: 3600)
TEMPORAL_ACTIVITY_TIMEOUT - Activity timeout in seconds (default: 600)
TEMPORAL_ACTIVITY_RETRY_MAX_ATTEMPTS - Max activity retries (default: 3)

Setup and Run

Docker Compose (Recommended)

# Create shared network (if not exists)
docker network create siraaj-shared-network

# Start all services
docker compose up --build

# View logs
docker compose logs -f

# Stop services
docker compose down

Local Development

Setup

uv sync

Start Infrastructure

# Start Minio
docker run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio:latest server /data --console-address ":9001"

# Start Temporal (with PostgreSQL persistence)
docker run -d --name temporal-postgres \
  -p 5432:5432 \
  -e POSTGRES_USER=temporal \
  -e POSTGRES_PASSWORD=temporal \
  -e POSTGRES_DB=temporal \
  postgres:15-alpine

docker run -d --name temporal \
  -p 7233:7233 \
  -e DB=postgres12 \
  -e DB_PORT=5432 \
  -e POSTGRES_USER=temporal \
  -e POSTGRES_PWD=temporal \
  -e POSTGRES_SEEDS=host.docker.internal \
  temporalio/auto-setup:latest

# Start Temporal UI
docker run -d --name temporal-ui \
  -p 8080:8080 \
  -e TEMPORAL_ADDRESS=host.docker.internal:7233 \
  temporalio/ui:latest

Start Temporal Workers

At least one worker is required to process OCR jobs:

# Start worker(s)
uv run python -m ocr_workflow.workers.temporal_worker

Start API Server

uv run uvicorn ocr_workflow.main:app --port 9191

# With logging
mkdir -p logs
uv run uvicorn ocr_workflow.main:app --port 9191 2>&1 | tee logs/$(date -u +%Y%m%d-%H%M%S).log

Sample Test

# Submit image for analysis
curl -i -X POST \
  -H "Content-Type: image/jpeg" \
  --data-binary @data/ar_table.jpg \
  "http://localhost:9191/documentintelligence/documentModels/myModel:analyze?api-version=2024-11-30"

# Submit PDF for analysis
curl -i -X POST \
  -H "Content-Type: application/pdf" \
  --data-binary @data/pdfs/sample.pdf \
  "http://localhost:9191/documentintelligence/documentModels/myModel:analyze?api-version=2024-11-30"

# Get result (use resultId from Operation-Location header)
curl -s \
  "http://localhost:9191/documentintelligence/documentModels/myModel/analyzeResults/{resultId}?api-version=2024-11-30" \
  | jq

Quick PDF Test (kfd.pdf)

Test with the kfd.pdf sample:

# Submit kfd.pdf for analysis
curl -i -X POST \
  -H "Content-Type: application/pdf" \
  --data-binary @data/pdfs/kfd.pdf \
  "http://localhost:9191/documentintelligence/documentModels/myModel:analyze?api-version=2024-11-30"

# Save the resultId from the Operation-Location header, then poll for results:
curl -s \
  "http://localhost:9191/documentintelligence/documentModels/myModel/analyzeResults/{resultId}?api-version=2024-11-30" \
  | jq '.'

PDF Error Handling

Failure	Handling
PDF encrypted/corrupted	Fail fast with clear error
Extraction fails mid-way	Cleanup uploaded pages, retry whole extraction
OCR fails for a page	Retry activity, then mark page failed, continue others
All pages fail OCR	Workflow fails with aggregated error
Partial success	Return succeeded pages + errors for failed (sorted by page number)

Temporal UI

Access Temporal UI at http://localhost:8080 to:

View workflow execution history
Inspect workflow state and activities
Debug failed workflows
Monitor worker health

Azure Integration Test

# Test PDF pages with custom worker count
MAX_OCR_WORKERS=16 uv run python tests/integration_tests/azure_pdf_test.py \
  --pdf data/pdfs/ien.pdf \
  --pages "1-10,15,20"

Qwen3-VL (vLLM Docker)

See the Qwen3-VL setup in the Docker Compose configuration. For manual setup:

# Pull vLLM image
docker pull vllm/vllm-openai:v0.11.0

# Create volume and download model
docker volume create qwen3vl

docker run -i --rm \
  --entrypoint python3 \
  -e HF_HOME=/root/.cache/huggingface \
  -v qwen3vl:/models \
  vllm/vllm-openai:v0.11.0 \
  - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
    "Qwen/Qwen3-VL-2B-Instruct-FP8",
    local_dir="/models/qwen3vl-2b-fp8",
    ignore_patterns=["*.md","*.png","*.jpg","*.jpeg","*.gif","*.webp"],
)
PY

# Run server
docker run -d --name qwen3vl-server \
  --gpus all -p 8001:8000 \
  -v qwen3vl:/models \
  vllm/vllm-openai:v0.11.0 \
  --model /models/qwen3vl-2b-fp8 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name qwen3-vl-2b-fp8