OCR Service
Azure Document Intelligence-compatible OCR service using Temporal workflows, Minio storage, and a multi-model fallback strategy with SigLIP2 language routing. English pages use LightOnOCR → DotsOCR → Qwen3-VL → PaddleOCR; general/Arabic pages use DotsOCR → Qwen3-VL → PaddleOCR. Supports single images and multi-page PDFs.
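The fallback order described above can be sketched roughly as follows. This is an illustration, not the service's actual code: the `FALLBACK_CHAINS` keys, the `run_with_fallback` name, the `engines` mapping, and the rule that an exception or empty output triggers the next model are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Model order per routed language, as listed in the description above.
# The dictionary keys ("english", "general") are hypothetical labels.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "english": ["LightOnOCR", "DotsOCR", "Qwen3-VL", "PaddleOCR"],
    "general": ["DotsOCR", "Qwen3-VL", "PaddleOCR"],
}


def run_with_fallback(
    page: bytes,
    language: str,
    engines: Dict[str, Callable[[bytes], str]],
) -> Tuple[str, str]:
    """Try each model in the chain and return (model_name, text) from the first success."""
    # Unknown languages fall back to the general/Arabic chain (assumption).
    chain = FALLBACK_CHAINS.get(language, FALLBACK_CHAINS["general"])
    errors = []
    for name in chain:
        try:
            text = engines[name](page)
        except Exception as exc:  # any engine failure moves on to the next model
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all OCR models failed: " + "; ".join(errors))
```

Each entry in `engines` is any callable that takes page bytes and returns text, so the same loop works whether a model is served by vLLM or Triton.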
Documentation
See the docs index for a full guide to the documentation structure.
- Architecture & Flows — processing flows, fallback strategies, and prompt documentation
- Migration Complete — Triton migration checklist
- OCR Model Comparison Report — benchmark across 20+ OCR models (English + Arabic)
- Hard English Benchmark — English-only benchmark of 17 models on 9 difficult pages with ground truth scoring
- Research — benchmark scripts, baselines, and raw results
Endpoints
Azure Document Intelligence API compatible endpoints:
- POST /documentintelligence/documentModels/{modelId}:analyze
  - Content-Type: image/jpeg, image/png, or application/pdf
  - Query: api-version required (2024-11-30). Optional: outputContentFormat, stringIndexType, pages.
    - pages: 1-based page selection for PDFs, e.g. "1-3,5,7-9". Omit to process all pages. Ignored for images.
  - Response: 202 Accepted with Operation-Location header and Retry-After.
- GET /documentintelligence/documentModels/{modelId}/analyzeResults/{resultId}
  - Query: api-version required (2024-11-30).
  - Response: AnalyzeOperation envelope with status and analyzeResult on success.
  - For PDFs: analyzeResult.pages[] contains all pages sorted by pageNumber.
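To illustrate the pages query syntax, here is a minimal sketch of how a value such as "1-3,5,7-9" might be expanded into page numbers. The `parse_pages` function is hypothetical, and rejecting out-of-range pages (rather than clamping them) is an assumption; the service's real parser may behave differently.

```python
from typing import List


def parse_pages(spec: str, total_pages: int) -> List[int]:
    """Expand a 1-based page selection like "1-3,5,7-9" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in pages spec: {spec!r}")
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on non-numeric input
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        if end > total_pages:
            # Assumption: out-of-range pages are rejected rather than clamped.
            raise ValueError(f"page {end} exceeds document length {total_pages}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("1-3,5,7-9", total_pages=10))  # [1, 2, 3, 5, 7, 8, 9]
```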
Configuration
Environment Variables
Temporal:
- TEMPORAL_HOST - Temporal server address (default: localhost:7233)
- TEMPORAL_NAMESPACE - Temporal namespace (default: default)
- TEMPORAL_TASK_QUEUE - Task queue name (default: ocr-tasks)
Minio:
- MINIO_ENDPOINT - Minio server address (default: localhost:9000)
- MINIO_ACCESS_KEY - Access key (default: minioadmin)
- MINIO_SECRET_KEY - Secret key (default: minioadmin)
- MINIO_BUCKET - Bucket name (default: ocr-documents)
Models:
- DOTS_VLLM_BASE_URL - DotsOCR server URL (default: http://localhost:8000)
- DOTS_VLLM_MODEL - DotsOCR model name (default: model)
- QWEN_VLLM_BASE_URL - Qwen server URL (default: http://localhost:8000)
- QWEN_VLLM_MODEL - Qwen model name (default: model)
- LIGHTONOCR_VLLM_BASE_URL - LightOnOCR-2 server URL (default: http://localhost:8003)
- LIGHTONOCR_VLLM_MODEL - LightOnOCR-2 model name (default: lightonocr-2)
- TRITON_GRPC_URL - Triton gRPC URL for PaddleOCR + language router (default: triton:8001)
Model Tuning:
- VLLM_TIMEOUT - vLLM request timeout in seconds (default: 300)
- VLLM_MAX_TOKENS - Max output tokens (default: 4096)
- DOTS_TEMPERATURE - Temperature for DotsOCR (default: 0.1)
- DOTS_TOP_P - Top-p for DotsOCR (default: 0.9)
- LIGHTONOCR_TEMPERATURE - Temperature for LightOnOCR (default: 0.2)
- LIGHTONOCR_TOP_P - Top-p for LightOnOCR (default: 0.9)
- QWEN_TEMPERATURE - Temperature for Qwen (default: 0.7)
- QWEN_TOP_P - Top-p for Qwen (default: 0.8)
- QWEN_TOP_K - Top-k for Qwen (default: 20)
- QWEN_PRESENCE_PENALTY - Presence penalty for Qwen (default: 1.5)
- TRITON_TIMEOUT - Triton gRPC timeout in seconds (default: 30)
Processing:
- MAX_IMAGES - Maximum images per page (default: 25)
- MAX_UPLOAD_SIZE_MB - Maximum upload size in MB (default: 100)
- WORKFLOW_WORKERS - Number of Temporal workers (default: 4)
PDF Processing:
- MAX_PDF_PAGES - Maximum pages per PDF (default: 1000)
- PDF_EXTRACT_TIMEOUT - Page extraction timeout in seconds (default: 900, i.e. 15 min)
- PDF_PAGE_CONCURRENCY - Parallel page processing limit (default: 1)
- PDF_BATCH_SIZE - Pages per processing batch (default: 10)
Operations:
- LOG_LEVEL - Logging level (default: info)
- API_WORKERS - Uvicorn worker processes (default: 1)
- MAX_NUM_SEQS - vLLM concurrent sequences (default: 4)
- MAX_CONCURRENT_WORKFLOW_TASKS - Max concurrent workflow tasks per worker (default: 5)
- MAX_CACHED_WORKFLOWS - Max cached workflow instances (default: 50)
- OCR_SAVE_STATES - Save OCR workflow states (default: false)
- MINIO_USE_SSL - Use SSL for Minio (default: false)
- TEMPORAL_WORKFLOW_TIMEOUT - Workflow timeout in seconds (default: 3600)
- TEMPORAL_ACTIVITY_TIMEOUT - Activity timeout in seconds (default: 600)
- TEMPORAL_ACTIVITY_RETRY_MAX_ATTEMPTS - Max activity retries (default: 3)
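As an illustration of how these variables and their defaults could be read, here is a minimal sketch covering only the Temporal settings. The `TemporalSettings` class and `load_temporal_settings` function are hypothetical names, not the service's actual configuration code; the variable names and defaults are the ones documented above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TemporalSettings:
    host: str
    namespace: str
    task_queue: str
    workflow_timeout_s: int
    activity_timeout_s: int
    activity_retry_max_attempts: int


def load_temporal_settings(env: Optional[Mapping[str, str]] = None) -> TemporalSettings:
    """Read Temporal settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    return TemporalSettings(
        host=env.get("TEMPORAL_HOST", "localhost:7233"),
        namespace=env.get("TEMPORAL_NAMESPACE", "default"),
        task_queue=env.get("TEMPORAL_TASK_QUEUE", "ocr-tasks"),
        workflow_timeout_s=int(env.get("TEMPORAL_WORKFLOW_TIMEOUT", "3600")),
        activity_timeout_s=int(env.get("TEMPORAL_ACTIVITY_TIMEOUT", "600")),
        activity_retry_max_attempts=int(env.get("TEMPORAL_ACTIVITY_RETRY_MAX_ATTEMPTS", "3")),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the loader easy to exercise in tests.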
Setup and Run
Docker Compose (Recommended)
# Create shared network (if not exists)
docker network create siraaj-shared-network
# Start all services
docker compose up --build
# View logs
docker compose logs -f
# Stop services
docker compose down
Local Development
Setup
uv sync
Start Infrastructure
# Start Minio
docker run -d --name minio \
-p 9000:9000 -p 9001:9001 \
-e MINIO_ROOT_USER=minioadmin \
-e MINIO_ROOT_PASSWORD=minioadmin \
minio/minio:latest server /data --console-address ":9001"
# Start Temporal (with PostgreSQL persistence)
docker run -d --name temporal-postgres \
-p 5432:5432 \
-e POSTGRES_USER=temporal \
-e POSTGRES_PASSWORD=temporal \
-e POSTGRES_DB=temporal \
postgres:15-alpine
docker run -d --name temporal \
-p 7233:7233 \
-e DB=postgres12 \
-e DB_PORT=5432 \
-e POSTGRES_USER=temporal \
-e POSTGRES_PWD=temporal \
-e POSTGRES_SEEDS=host.docker.internal \
temporalio/auto-setup:latest
# Start Temporal UI
docker run -d --name temporal-ui \
-p 8080:8080 \
-e TEMPORAL_ADDRESS=host.docker.internal:7233 \
temporalio/ui:latest
Start Temporal Workers
At least one worker is required to process OCR jobs:
# Start worker(s)
uv run python -m ocr_workflow.workers.temporal_worker
Start API Server
uv run uvicorn ocr_workflow.main:app --port 9191
# With logging
mkdir -p logs
uv run uvicorn ocr_workflow.main:app --port 9191 2>&1 | tee logs/$(date -u +%Y%m%d-%H%M%S).log
Sample Test
# Submit image for analysis
curl -i -X POST \
-H "Content-Type: image/jpeg" \
--data-binary @data/ar_table.jpg \
"http://localhost:9191/documentintelligence/documentModels/myModel:analyze?api-version=2024-11-30"
# Submit PDF for analysis
curl -i -X POST \
-H "Content-Type: application/pdf" \
--data-binary @data/pdfs/sample.pdf \
"http://localhost:9191/documentintelligence/documentModels/myModel:analyze?api-version=2024-11-30"
# Get result (use resultId from Operation-Location header)
curl -s \
"http://localhost:9191/documentintelligence/documentModels/myModel/analyzeResults/{resultId}?api-version=2024-11-30" \
| jq
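The same submit-then-poll flow can be scripted in Python with only the standard library. This is a sketch under several assumptions: the function names are hypothetical, the Operation-Location header is assumed to be an absolute URL that can be fetched as-is, and the terminal status values ("succeeded", "failed") are assumed to follow Azure Document Intelligence conventions.

```python
import json
import time
import urllib.request
from urllib.parse import urlsplit

BASE_URL = "http://localhost:9191"  # matches the uvicorn port used above
API_VERSION = "2024-11-30"


def result_id_from_operation_location(operation_location: str) -> str:
    """Return the last path segment of the Operation-Location URL (the resultId)."""
    return urlsplit(operation_location).path.rstrip("/").rsplit("/", 1)[-1]


def submit(path: str, content_type: str, model_id: str = "myModel") -> str:
    """POST a file for analysis and return the Operation-Location header value."""
    url = (
        f"{BASE_URL}/documentintelligence/documentModels/"
        f"{model_id}:analyze?api-version={API_VERSION}"
    )
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            url, data=fh.read(), headers={"Content-Type": content_type}, method="POST"
        )
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def wait_for_result(operation_location: str, interval_s: float = 2.0, timeout_s: float = 600.0) -> dict:
    """Poll the result URL until the operation reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(operation_location) as response:
            body = json.load(response)
        # Assumption: terminal statuses follow the Azure naming.
        if body.get("status") in ("succeeded", "failed"):
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"no terminal status within {timeout_s} seconds")
```

A typical call would be `wait_for_result(submit("data/ar_table.jpg", "image/jpeg"))`, with the API server and at least one worker running.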
Quick PDF Test (kfd.pdf)
Test with the kfd.pdf sample:
# Submit kfd.pdf for analysis
curl -i -X POST \
-H "Content-Type: application/pdf" \
--data-binary @data/pdfs/kfd.pdf \
"http://localhost:9191/documentintelligence/documentModels/myModel:analyze?api-version=2024-11-30"
# Save the resultId from the Operation-Location header, then poll for results:
curl -s \
"http://localhost:9191/documentintelligence/documentModels/myModel/analyzeResults/{resultId}?api-version=2024-11-30" \
| jq '.'
PDF Error Handling
| Failure | Handling |
|---|---|
| PDF encrypted/corrupted | Fail fast with clear error |
| Extraction fails mid-way | Cleanup uploaded pages, retry whole extraction |
| OCR fails for a page | Retry activity, then mark page failed, continue others |
| All pages fail OCR | Workflow fails with aggregated error |
| Partial success | Return succeeded pages + errors for failed (sorted by page number) |
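The last two rows of the table (all pages fail, partial success) can be sketched as a small merge step. The `PageOutcome` type, the `merge_page_outcomes` function, and the shape of the returned dictionary are hypothetical and only illustrate the described behavior; the real workflow's data structures are not shown in this README.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PageOutcome:
    page_number: int
    content: Optional[str] = None  # OCR text when the page succeeded
    error: Optional[str] = None    # error message when the page failed


def merge_page_outcomes(outcomes: List[PageOutcome]) -> Dict[str, list]:
    """Combine per-page outcomes: fail if nothing succeeded, else return pages plus errors."""
    ordered = sorted(outcomes, key=lambda o: o.page_number)
    succeeded = [o for o in ordered if o.error is None]
    failed = [o for o in ordered if o.error is not None]
    if not succeeded:
        # All pages failed: surface one aggregated error.
        details = "; ".join(f"page {o.page_number}: {o.error}" for o in failed)
        raise RuntimeError(f"all pages failed OCR: {details}")
    return {
        "pages": [{"pageNumber": o.page_number, "content": o.content} for o in succeeded],
        "errors": [{"pageNumber": o.page_number, "message": o.error} for o in failed],
    }
```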
Temporal UI
Access Temporal UI at http://localhost:8080 to:
- View workflow execution history
- Inspect workflow state and activities
- Debug failed workflows
- Monitor worker health
Azure Integration Test
# Test PDF pages with custom worker count
MAX_OCR_WORKERS=16 uv run python tests/integration_tests/azure_pdf_test.py \
--pdf data/pdfs/ien.pdf \
--pages "1-10,15,20"
Qwen3-VL (vLLM Docker)
See the Qwen3-VL setup in the Docker Compose configuration. For manual setup:
# Pull vLLM image
docker pull vllm/vllm-openai:v0.11.0
# Create volume and download model
docker volume create qwen3vl
docker run -i --rm \
--entrypoint python3 \
-e HF_HOME=/root/.cache/huggingface \
-v qwen3vl:/models \
vllm/vllm-openai:v0.11.0 \
- <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
"Qwen/Qwen3-VL-2B-Instruct-FP8",
local_dir="/models/qwen3vl-2b-fp8",
ignore_patterns=["*.md","*.png","*.jpg","*.jpeg","*.gif","*.webp"],
)
PY
# Run server
docker run -d --name qwen3vl-server \
--gpus all -p 8001:8000 \
-v qwen3vl:/models \
vllm/vllm-openai:v0.11.0 \
--model /models/qwen3vl-2b-fp8 \
--host 0.0.0.0 --port 8000 \
--served-model-name qwen3-vl-2b-fp8