# Architecture & Flows
## Architecture
The service uses Temporal for workflow orchestration and Minio for document storage:
Components:
- API Server (FastAPI/Uvicorn): Receives Azure Document Intelligence API requests, uploads documents to Minio, starts Temporal workflows, queries workflow status
- Minio: Blob storage for document images (24-hour lifecycle policy)
- Temporal Server: Workflow orchestration with PostgreSQL persistence
- Temporal UI: Debugging and visibility at port 8080
- Temporal Workers: Execute OCR workflows/activities with automatic retry and failure handling
- Triton Inference Server: Runs PaddleOCR and language router models, accessed directly via gRPC
- Router Init: Exports SigLIP2 router ONNX model to host filesystem for Triton's bind mount (mirrors qwen_init/lightonocr_init pattern)
Ports:
- 9191: FastAPI API Server
- 9000: Minio S3 API
- 9001: Minio Console UI
- 7233: Temporal Server
- 8080: Temporal UI
- 8000: DotsOCR (DOTS model)
- 8001: Qwen (QWEN model)
- 8003: LightOnOCR-2 (English OCR model)
- 8100: Triton HTTP API
- 8101: Triton gRPC API (used by TritonClient for PaddleOCR + language router)
## Flows
Single Image:
- Client POSTs image (JPEG/PNG) → API generates workflow ID → uploads to Minio → starts `ImageAnalyzeWorkflow` → returns 202 Accepted
- Worker downloads image → prepares (EXIF transpose + JPEG) → classifies language (ENGLISH/GENERAL) → selects model → runs LangGraph OCR pipeline → stores result
- Client polls for result
Multi-Page PDF:
- Client POSTs PDF → API generates workflow ID → uploads to Minio → starts `PDFAnalyzeWorkflow` → returns 202 Accepted
- Worker executes `extract_pages_activity`: downloads the PDF, validates it, extracts pages as JPEGs at 200 DPI (zoom=2.78) via PyMuPDF, and uploads them to `{workflow_id}/pages/001.jpg`, etc.
- Fan-out: Worker spawns `process_page_activity` for each page in parallel (configurable concurrency). Each page is classified by language and routed to the appropriate OCR model (no re-encoding needed — already JPEG from extraction).
- Fan-in: Collects all page results, sorts by page number, aggregates into the final result
- Client polls for result with all pages combined
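The fan-in step above can be sketched as a pure function. This is an illustrative sketch only — the function name, the per-page dict shape, and the status strings are assumptions, not the service's actual API (the source only documents that results are sorted by page number, aggregated, and that partially failed PDFs return a "partial" status):

```python
# Hypothetical fan-in helper; field names and status values are illustrative.
def aggregate_pages(results: list[dict]) -> dict:
    """Sort per-page results by page number and combine them."""
    ordered = sorted(results, key=lambda r: r["page"])
    failed = [r["page"] for r in ordered if r["status"] == "error"]
    if not failed:
        status = "succeeded"
    elif len(failed) < len(ordered):
        status = "partial"  # some pages failed, workflow still returns content
    else:
        status = "error"
    return {
        "status": status,
        "failed_pages": failed,
        "markdown": "\n\n".join(
            r["markdown"] for r in ordered if r["status"] == "ok"
        ),
    }
```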
## Language Routing
The router classifies each page as ENGLISH or GENERAL before OCR, determining which model chain to use.
PDF pages use a two-tier routing system:
- PyMuPDF text-layer (primary, <1ms): Extracts embedded text from the PDF and counts Latin (`A-Za-z`) vs Arabic (5 Unicode blocks) characters.
  - >95% Latin → ENGLISH
  - >50% Arabic → GENERAL
  - Otherwise → falls through to SigLIP2
- SigLIP2 visual router (fallback, ~30ms): Classifies the page image via Triton `router_pipeline`. Used for scanned PDFs, blank pages, and mixed-language pages where the text layer is ambiguous.

Single images: SigLIP2 only (no text layer available).

- Router failure: Defaults to GENERAL (safe, preserves current behavior)
- ENGLISH route: LightOnOCR as first model, then DOTS → QWEN → PADDLE
- GENERAL route: DOTS as first model (unchanged), then QWEN → PADDLE
## OCR Fallback Strategy
The service uses a multi-tiered fallback strategy:
### Client Initialization
The `prepare_input` entry node sets the initial client for the graph:
- If `client` is already set in the initial state (e.g., injected by a routing activity), it is preserved
- If no `client` is set, it defaults to DOTS_MODEL

This allows callers to override the starting model via `graph.invoke({"data_url": ..., "client": custom_client})`.
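A minimal sketch of that default logic, assuming the state is a plain dict (the real node operates on the graph's workflow state, and the constant value here is a placeholder):

```python
DOTS_MODEL = "DOTS_MODEL"  # placeholder constant for illustration

def prepare_input(state: dict) -> dict:
    # Preserve a client injected by the caller (e.g. the routing activity);
    # otherwise default to DOTS_MODEL.
    if not state.get("client"):
        state["client"] = DOTS_MODEL
    return state
```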
### Document Fallback Flow
English route (router classified page as ENGLISH):
- Attempt 1: LIGHTONOCR_MODEL + plain prompt (prompt_lightonocr) → fail
- Attempt 2: DOTS_MODEL + complex JSON prompt (prompt_dots) → fail
- Attempt 3: QWEN_MODEL + plain text prompt (prompt_markdown) → fail
- Attempt 4: PADDLE_MODEL (traditional OCR) → final attempt
- Result: status="error", markdown="" (if all 4 attempts fail)
General route (router classified page as GENERAL, or router failure):
- Attempt 1: DOTS_MODEL + complex JSON prompt (prompt_dots) → fail
- Attempt 2: QWEN_MODEL + plain text prompt (prompt_markdown) → fail
- Attempt 3: PADDLE_MODEL (traditional OCR) → final attempt
- Result: status="error", markdown="" (if all 3 attempts fail)
### Image (Picture Blocks) Fallback Flow
When DOTS_MODEL detects "Picture" blocks in the layout, each embedded image is processed with its own fallback:
- Attempt 1: QWEN_MODEL + plain text prompt (per image) → fail
- Attempt 2: PADDLE_MODEL (traditional OCR) → final attempt for that image
- Result: Failed images (after both attempts) are skipped and logged
### Image Count Guard
To prevent recursion limits and excessive processing time, there's a configurable maximum image count per page (MAX_IMAGES, default: 25):
- If DOTS_MODEL extracts ≤25 images: Process each image individually
- If DOTS_MODEL extracts >25 images: Treated as a failed attempt — triggers the fallback strategy which advances to the next model in the chain (e.g., QWEN for plain text extraction, bypassing per-image processing entirely)
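The guard decision and the auto-computed recursion limit (described under Other Resilience Behaviors as `10 + (4 * MAX_IMAGES)`) can be sketched together; the function names here are illustrative, not the service's actual API:

```python
MAX_IMAGES = 25  # configurable maximum image count per page

def recursion_limit(max_images: int = MAX_IMAGES) -> int:
    # langgraph_recursion_limit is derived from MAX_IMAGES at startup.
    return 10 + 4 * max_images

def should_process_images(image_count: int, max_images: int = MAX_IMAGES) -> bool:
    # Exceeding MAX_IMAGES counts as a failed attempt, so the fallback
    # strategy advances to the next model instead of per-image processing.
    return image_count <= max_images
```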
## Models
- DOTS_MODEL: DotsOCR via `DOTS_VLLM_BASE_URL` (OpenAI-compatible HTTP)
- QWEN_MODEL: Qwen3-VL via `QWEN_VLLM_BASE_URL` (OpenAI-compatible HTTP)
- PADDLE_MODEL: PaddleOCR via `TRITON_GRPC_URL` (direct Triton gRPC)
- LIGHTONOCR_MODEL: LightOnOCR-2 via `LIGHTONOCR_VLLM_BASE_URL` (OpenAI-compatible HTTP) — English OCR, routed via the SigLIP2 language classifier
## Failure Detection
- Token Limit Detection: If an LLM response uses ≥95% of `max_tokens`, it is treated as truncated → fallback. Applies to both document-level and image-level OCR. Not applicable to PADDLE (not an LLM).
- Crop Errors: If image cropping fails (invalid bbox), the image is skipped
- Model Errors: Network errors, timeouts, or inference failures trigger fallback
- Parse Errors: Invalid JSON or unparseable responses trigger fallback
- gRPC Errors (Triton): `UNAVAILABLE` → `ModelConnectionError`, `DEADLINE_EXCEEDED` → `ModelTimeoutError`, `InferenceServerException` → `ModelError` — all trigger fallback
The fallback strategy is attempt-number-based (transport-agnostic): it decides the next client based on the attempt count, not the error type.
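The truncation heuristic is a one-line check. A sketch under the assumption that the caller has both token counts (the signature is illustrative; the real check lives in model_service.py):

```python
def is_truncated(completion_tokens: int, max_tokens: int) -> bool:
    """Treat a response using >=95% of max_tokens as truncated -> fallback."""
    return completion_tokens >= 0.95 * max_tokens
```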
## Prompts
All prompts are centralized in `ocr_workflow/domain/prompts.py` via the `PROMPTS` dictionary.

| Key | Used By | When |
|---|---|---|
| `prompt_dots` | DOTS_MODEL | Main document OCR — extracts layout with bboxes, categories, and text in JSON format. |
| `prompt_markdown` | QWEN_MODEL, PADDLE_MODEL, raw_text | Document fallback, raw-text extraction, and image OCR — plain text extraction using Qwen3-VL's recommended OCR prompt. |
| `prompt_lightonocr` | LIGHTONOCR_MODEL | English pages — "Read all the text in the image." |
Note: PADDLE_MODEL receives the prompt for interface compatibility but ignores it — Triton OCR does not use text prompts.
## Image Preprocessing
- Images sent at full resolution (no resize)
- PDF pages rendered directly to JPEG by PyMuPDF at 200 DPI (zoom=2.78) — no re-encoding needed
- User images converted to JPEG via `prepare_image` (EXIF transpose + RGB conversion)
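The zoom=2.78 figure follows from PDF coordinates being defined at 72 points per inch, so PyMuPDF's render zoom for a target DPI is `dpi / 72`:

```python
def dpi_to_zoom(dpi: int) -> float:
    # PDF pages are laid out at 72 points/inch, so rendering at `dpi`
    # requires a zoom factor of dpi / 72 (200 DPI -> ~2.78).
    return round(dpi / 72, 2)
```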
## Timing Instrumentation
All key operations log elapsed time at INFO level using `time.perf_counter()`. Log format: `operation_name key=value elapsed=X.XXXs`.
| Operation | Location | Keys logged |
|---|---|---|
| blob_download | activities/ocr.py, activities/pdf.py | object_key, bytes, page (PDF) |
| image_preprocess | activities/ocr.py, activities/pdf.py | object_key or page |
| graph_invoke | activities/ocr.py, activities/pdf.py | object_key or page, chars, status on error |
| pdf_extract | activities/pdf.py | object_key, pages |
| model_call | services/model_service.py | client (DOTS/QWEN/PADDLE), tokens |
| router_classify | infrastructure/triton_client.py | language, margin |
| json2md | core/nodes/document.py | — |
| image_crop | core/nodes/images.py | index |
Error paths also log timing (e.g., model timeouts, failed graph invocations).
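A context-manager helper that reproduces the log format above could look like this. It is a sketch — the source describes inline timing calls, not a shared helper, so `timed` is a hypothetical name:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("ocr_workflow")

@contextmanager
def timed(operation: str, **keys):
    """Log `operation key=value ... elapsed=X.XXXs` at INFO, even on error."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        kv = " ".join(f"{k}={v}" for k, v in keys.items())
        logger.info("%s %s elapsed=%.3fs", operation, kv, elapsed)
```

Usage: `with timed("blob_download", object_key=key): ...` — the `finally` block ensures error paths are timed too.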
## Baseline Timing (L40S GPU, 2026-03-28)
Measured across 5 files (3 single images + 2 PDFs, 15 pages total):
| Operation | Min | Median | Max | Notes |
|---|---|---|---|---|
| blob_download | 3ms | 4ms | 7ms | Minio local, negligible |
| image_preprocess | 116ms | 163ms | 181ms | EXIF + JPEG + base64 (user images); base64 only (PDF pages) |
| router_classify | 25ms | 36ms | 50ms | SigLIP2 language classification |
| pdf_extract (6 pages) | 768ms | 833ms | 833ms | PyMuPDF render to JPEG at 200 DPI + upload |
| model_call LIGHTONOCR | 2.3s | 3.7s | 7.6s | English pages, 350–950 tokens |
| model_call DOTS | 1.5s | 13.2s | 19.1s | General pages, JSON layout |
| model_call QWEN | 0.4s | 2.7s | 12.6s | Fallback / embedded image OCR |
| image_crop | 3ms | 9ms | 9ms | Negligible |
| json2md | <1ms | <1ms | 1ms | Negligible |
| graph_invoke | 2.3s | 5.1s | 19.1s | Full pipeline per page |
| total per page | ~2.5s | ~5.4s | ~19.4s | Sum of download + preprocess + router + graph |
Model inference (LIGHTONOCR/DOTS + QWEN) accounts for >98% of total processing time. English pages via LIGHTONOCR are ~2x faster than DOTS (median 3.7s vs 13.2s).
## Performance: New Stack vs Old Stack
| Metric | Old (DOTS+QWEN) | New (Router+LIGHTONOCR+DOTS) |
|---|---|---|
| English page (median) | ~12.6s | ~3.9s (3.2x faster) |
| Arabic/General page (median) | ~12.6s | ~13.4s (same) |
| Total per page (median) | ~12.6s | ~5.4s |
| 6-page English PDF | ~60-75s | ~10s |
| English accuracy | 91.1/100 | 93.7/100 |
| Arabic numerals (kfd p4) | 6/6 | 6/6 |
| Router overhead | — | 36ms/page |
## Other Resilience Behaviors
- Temporal retry policies: Activities retry up to 3 times with exponential backoff before failing
- Partial PDF success: If some pages fail, the workflow returns a `"partial"` status with a list of failed pages rather than failing entirely
- Non-retryable error classification: `ValueError`/`TypeError`/`WorkflowError` are marked as non-retryable (immediate failure); other exceptions (including `StorageError`) are retried by Temporal
- Bucket creation race handling: Concurrent `create_bucket` conflicts (BucketAlreadyOwnedByYou/BucketAlreadyExists) are handled gracefully
- Graph recursion limit: LangGraph recursion limits are caught and surfaced as `WorkflowError` (non-retryable) in both `run_ocr_activity` and `process_page_activity`. The limit is auto-computed from `MAX_IMAGES`.
- Error mapping: OpenAI SDK errors (timeout, connection, HTTP status) and Triton gRPC errors are mapped to typed `ModelError` subclasses for consistent fallback handling
- Page cleanup on extraction failure: Uploaded page images are deleted from blob storage when page extraction fails
- Log levels for fallbacks: Non-critical fallbacks (image dimension failures in `ocr.py` and `pdf.py`) log at WARNING because they don't affect OCR output — dimensions are optional metadata. Health check failures log at ERROR because they indicate the service cannot function.
- Router model init: `router_init` uses an atomic copy (`.tmp` + `mv`) to prevent partial model files on interrupted exports. Idempotent — skips if all 3 files exist. Retries up to 3 times on failure (`restart: on-failure:3`).
- Model init version checking: All model init containers (`dots_ocr_init`, `qwen_init`, `lightonocr_init`) write a `.model_version` marker file after a successful download. On startup, if the marker is missing or doesn't match the expected HuggingFace repo, the stale model directory is removed and re-downloaded. This prevents Docker named volumes from silently serving old model versions after image rebuilds. Init containers retry up to 3 times on download failure (`restart: on-failure:3`).
- Azure status mapping: Unknown Temporal workflow statuses map to `running` in the Azure-compat API (azure_routes.py) and log at WARNING. Safe default — unknown statuses imply the workflow hasn't reached a terminal state.
- MIME type fallback: If file-extension MIME detection fails or returns a non-image type, `prepare_image_from_path` defaults to `image/jpeg` and logs at WARNING (ocr_service.py). Only affects local file loading (dev/test path).
- Blob storage content-type defaults: An unknown upload content-type maps to a `.bin` extension and logs at WARNING; a missing download content-type defaults to `application/octet-stream` and logs at WARNING (blob_storage.py).
- Router threshold env var: `ROUTER_THRESHOLD` (default `0.02`) is configurable in the Triton router pipeline model (paddle/triton_models/router_pipeline/1/model.py), not in `settings.py` — it runs inside Triton, not the Python app. Logged at INFO on model init.
- Upload size limit: Requests over `MAX_UPLOAD_SIZE_MB` (default 100) are rejected with 413 before any processing. If the `Content-Length` header is present, it is checked before reading the body (avoids memory allocation). If it is missing (chunked transfer), the body is streamed through a counting reader that aborts at the limit — never loading more than `max_bytes` into memory. A malformed `Content-Length` returns 400.
- Minio timeouts: The boto3 client is configured with a 10s connect / 60s read timeout. Stalled connections raise `ConnectTimeoutError`/`ReadTimeoutError`, retried by boto3 (3 attempts) and then by Temporal activity retries. Prevents indefinite hangs on storage operations.
- Failed image tracking: Images that fail OCR with all models are tracked in `WorkflowState.failed_images` and logged as a summary in `json2md`. The result is returned with empty content for those images — no failure raised.
- StorageError: Minio failures are wrapped in `StorageError` for structured logging (retryable by Temporal)
- Auto-computed recursion limit: `langgraph_recursion_limit` is derived from `MAX_IMAGES` as `10 + (4 * MAX_IMAGES)` and logged at startup. The `LANGGRAPH_RECURSION_LIMIT` env var is no longer read — if set, it is silently ignored (`extra="ignore"` in pydantic-settings).
- Empty OCR result detection: If OCR produces zero-length content (blank page, or a model failure returning empty), it is logged at WARNING level (`ocr_empty_result`). The result is returned as-is — no failure raised. Applies to both single images and PDF pages.
- Shutdown cleanup logging: If closing the Triton gRPC channel fails during shutdown, it is logged at ERROR with a traceback. Shutdown continues regardless.
- Unified log level: The `LOG_LEVEL` env var (default `INFO`) controls the `ocr_workflow` logger. Invalid values fall back to INFO with a WARNING log (no crash). Library loggers (uvicorn, boto3, temporalio) remain at WARNING.
- Document-level token limit detection: If an LLM response at the document level uses ≥95% of `max_tokens`, it is treated as truncated and triggers fallback to the next model in the chain. Prevents silently serving cut-off content. Not applicable to PADDLE (no token limit concept).
- Image count guard via fallback strategy: When DOTS_MODEL detects more than `MAX_IMAGES` images in a page, the `llm_node` consults the `DocumentFallbackStrategy` to advance to the next model (typically QWEN for plain text). This avoids per-image processing overhead and prevents infinite retry loops.
- Router classify failure logging: Triton router classification failures (`InferenceServerException`, gRPC errors, unexpected exceptions) log at WARNING with elapsed time before raising `ModelError` (triton_client.py).
- Router classify missing keys: If the router response is missing the `language` or `margin` keys, this is logged at WARNING before falling back to the `"?"`/`0.0` defaults (triton_client.py).
- Image fallback retry logging: When an image OCR attempt fails and retries with a different model, this is logged at WARNING (images.py).
- Token limit check failure: If `check_token_limit` raises an unexpected exception, it is logged at WARNING with a traceback and defaults to no-limit (safe, avoids false-positive truncation detection) (model_service.py).
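The counting reader from the upload size limit bullet can be sketched as follows. The names (`read_limited`, `UploadTooLarge`) are hypothetical; the real code raises a 413 through the web framework:

```python
class UploadTooLarge(Exception):
    """Illustrative stand-in for the 413 rejection."""

def read_limited(chunks, max_bytes: int) -> bytes:
    """Accumulate a chunk iterator, aborting before exceeding max_bytes."""
    buf = bytearray()
    for chunk in chunks:
        # Check before buffering so no more than max_bytes is ever held.
        if len(buf) + len(chunk) > max_bytes:
            raise UploadTooLarge(f"body exceeds {max_bytes} bytes")
        buf.extend(chunk)
    return bytes(buf)
```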