siraaj-dot-ocr-service / docs/architecture.md

Architecture & Flows

Last updated: 4/16/2026

Architecture

The service uses Temporal for workflow orchestration and Minio for document storage:

Components:

  • API Server (FastAPI/Uvicorn): Receives Azure Document Intelligence API requests, uploads documents to Minio, starts Temporal workflows, queries workflow status
  • Minio: Blob storage for document images (24-hour lifecycle policy)
  • Temporal Server: Workflow orchestration with PostgreSQL persistence
  • Temporal UI: Debugging and visibility at port 8080
  • Temporal Workers: Execute OCR workflows/activities with automatic retry and failure handling
  • Triton Inference Server: Runs PaddleOCR and language router models, accessed directly via gRPC
  • Router Init: Exports SigLIP2 router ONNX model to host filesystem for Triton's bind mount (mirrors qwen_init/lightonocr_init pattern)

Ports:

  • 9191: FastAPI API Server
  • 9000: Minio S3 API
  • 9001: Minio Console UI
  • 7233: Temporal Server
  • 8080: Temporal UI
  • 8000: DotsOCR (DOTS model)
  • 8001: Qwen (QWEN model)
  • 8003: LightOnOCR-2 (English OCR model)
  • 8100: Triton HTTP API
  • 8101: Triton gRPC API (used by TritonClient for PaddleOCR + language router)

Flows

Single Image:

  1. Client POSTs image (JPEG/PNG) → API generates workflow ID → uploads to Minio → starts ImageAnalyzeWorkflow → returns 202 Accepted
  2. Worker downloads image → prepares (EXIF transpose + JPEG) → classifies language (ENGLISH/GENERAL) → selects model → runs LangGraph OCR pipeline → stores result
  3. Client polls for result

Multi-Page PDF:

  1. Client POSTs PDF → API generates workflow ID → uploads to Minio → starts PDFAnalyzeWorkflow → returns 202 Accepted
  2. Worker executes extract_pages_activity: Downloads PDF, validates, extracts pages as JPEGs at 200 DPI (zoom=2.78) via PyMuPDF, uploads to {workflow_id}/pages/001.jpg, etc.
  3. Fan-out: Worker spawns process_page_activity for each page in parallel (configurable concurrency). Each page is classified by language and routed to the appropriate OCR model (no re-encoding needed — already JPEG from extraction).
  4. Fan-in: Collects all page results, sorts by page number, aggregates into final result
  5. Client polls for result with all pages combined
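The fan-out/fan-in step can be sketched with asyncio (a simplified model: `process_page` stands in for `process_page_activity`, and the real workers run under Temporal, not plain asyncio):

```python
import asyncio

async def process_page(page_no: int, sem: asyncio.Semaphore) -> dict:
    # Stand-in for process_page_activity: classify language, run the OCR chain.
    async with sem:
        await asyncio.sleep(0)  # placeholder for model inference
        return {"page": page_no, "markdown": f"page {page_no} text"}

async def run_pdf(num_pages: int, concurrency: int = 4) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)  # configurable concurrency cap
    # Fan-out: one task per page, bounded by the semaphore.
    tasks = [process_page(n, sem) for n in range(1, num_pages + 1)]
    results = await asyncio.gather(*tasks)
    # Fan-in: sort by page number and aggregate into the final result.
    return sorted(results, key=lambda r: r["page"])

pages = asyncio.run(run_pdf(3))
```

The semaphore is what makes the concurrency "configurable": all page tasks are created up front, but only `concurrency` of them run inference at once.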

Language Routing

The router classifies each page as ENGLISH or GENERAL before OCR, determining which model chain to use.

PDF pages use a two-tier routing system:

  1. PyMuPDF text-layer (primary, <1ms): Extracts embedded text from PDF, counts Latin (A-Za-z) vs Arabic (5 Unicode blocks) characters.
    • >95% Latin → ENGLISH
    • >50% Arabic → GENERAL
    • Otherwise → falls through to SigLIP2
  2. SigLIP2 visual router (fallback, ~30ms): Classifies page image via Triton router_pipeline. Used for scanned PDFs, blank pages, and mixed-language pages where the text-layer is ambiguous.

Single images: SigLIP2 only (no text layer available).

  • Router failure: Defaults to GENERAL (safe, preserves current behavior)
  • ENGLISH route: LightOnOCR as first model, then DOTS → QWEN → PADDLE
  • GENERAL route: DOTS as first model (unchanged), then QWEN → PADDLE

OCR Fallback Strategy

The service uses a multi-tiered fallback strategy:

Client Initialization

The prepare_input entry node sets the initial client for the graph:

  • If client is already set in the initial state (e.g., injected by a routing activity), it is preserved
  • If no client is set, defaults to DOTS_MODEL

This allows callers to override the starting model via graph.invoke({"data_url": ..., "client": custom_client}).
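A minimal sketch of that entry-node behavior (the state shape and the `DOTS_MODEL` sentinel value are illustrative):

```python
DOTS_MODEL = "dots"  # assumed sentinel value for illustration

def prepare_input(state: dict) -> dict:
    # Preserve a caller-injected client (e.g. set by the routing activity);
    # otherwise default to DOTS_MODEL.
    if not state.get("client"):
        state = {**state, "client": DOTS_MODEL}
    return state
```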

Document Fallback Flow

English route (router classified page as ENGLISH):

  • Attempt 1: LIGHTONOCR_MODEL + plain prompt (prompt_lightonocr) → fail
  • Attempt 2: DOTS_MODEL + complex JSON prompt (prompt_dots) → fail
  • Attempt 3: QWEN_MODEL + plain text prompt (prompt_markdown) → fail
  • Attempt 4: PADDLE_MODEL (traditional OCR) → final attempt
  • Result: status="error", markdown="" (if all 4 attempts fail)

General route (router classified page as GENERAL, or router failure):

  • Attempt 1: DOTS_MODEL + complex JSON prompt (prompt_dots) → fail
  • Attempt 2: QWEN_MODEL + plain text prompt (prompt_markdown) → fail
  • Attempt 3: PADDLE_MODEL (traditional OCR) → final attempt
  • Result: status="error", markdown="" (if all 3 attempts fail)

Image (Picture Blocks) Fallback Flow

When DOTS_MODEL detects "Picture" blocks in the layout, each embedded image is processed with its own fallback:

  • Attempt 1: QWEN_MODEL + plain text prompt (per image) → fail
  • Attempt 2: PADDLE_MODEL (traditional OCR) → final attempt for that image
  • Result: Failed images (after both attempts) are skipped and logged

Image Count Guard

To avoid hitting LangGraph's recursion limit and to bound processing time, there is a configurable maximum image count per page (MAX_IMAGES, default: 25):

  • If DOTS_MODEL extracts ≤25 images: Process each image individually
  • If DOTS_MODEL extracts >25 images: Treated as a failed attempt — triggers the fallback strategy which advances to the next model in the chain (e.g., QWEN for plain text extraction, bypassing per-image processing entirely)

Models

  • DOTS_MODEL: DotsOCR via DOTS_VLLM_BASE_URL (OpenAI-compatible HTTP)
  • QWEN_MODEL: Qwen3-VL via QWEN_VLLM_BASE_URL (OpenAI-compatible HTTP)
  • PADDLE_MODEL: PaddleOCR via TRITON_GRPC_URL (direct Triton gRPC)
  • LIGHTONOCR_MODEL: LightOnOCR-2 via LIGHTONOCR_VLLM_BASE_URL (OpenAI-compatible HTTP) — English OCR, routed via SigLIP2 language classifier

Failure Detection

  1. Token Limit Detection: If LLM response uses ≥95% of max_tokens, it's treated as truncated → fallback. Applies to both document-level and image-level OCR. Not applicable to PADDLE (not an LLM).
  2. Crop Errors: If image cropping fails (invalid bbox), the image is skipped
  3. Model Errors: Network errors, timeouts, or inference failures trigger fallback
  4. Parse Errors: Invalid JSON or unparseable responses trigger fallback
  5. gRPC Errors (Triton): UNAVAILABLE → ModelConnectionError, DEADLINE_EXCEEDED → ModelTimeoutError, InferenceServerException → ModelError — all trigger fallback

The fallback strategy is attempt-number-based (transport-agnostic): it decides the next client based on the attempt count, not the error type.
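The ≥95% truncation check from point 1 amounts to a one-line predicate (a sketch; the real check lives in model_service.py and the parameter names here are assumptions):

```python
def is_truncated(completion_tokens: int, max_tokens: int,
                 threshold: float = 0.95) -> bool:
    # A response consuming >=95% of max_tokens is treated as truncated and
    # triggers fallback to the next model. PADDLE is exempt (not an LLM).
    return completion_tokens >= threshold * max_tokens
```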

Prompts

All prompts are centralized in ocr_workflow/domain/prompts.py via the PROMPTS dictionary.

| Key | Used By | When |
| --- | --- | --- |
| prompt_dots | DOTS_MODEL | Main document OCR — extracts layout with bboxes, categories, and text in JSON format. |
| prompt_markdown | QWEN_MODEL, PADDLE_MODEL, raw_text | Document fallback, raw-text extraction, and image OCR — plain text extraction using Qwen3-VL's recommended OCR prompt. |
| prompt_lightonocr | LIGHTONOCR_MODEL | English pages — "Read all the text in the image." |

Note: PADDLE_MODEL receives the prompt for interface compatibility but ignores it — Triton OCR does not use text prompts.
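The table above can be sketched as a dictionary plus a client-to-key mapping (the placeholder strings are not the real prompts; only the LightOnOCR prompt text is quoted from this doc):

```python
PROMPTS = {
    # Keys mirror ocr_workflow/domain/prompts.py; bodies here are placeholders.
    "prompt_dots": "<layout-JSON extraction prompt>",
    "prompt_markdown": "<plain-text OCR prompt>",
    "prompt_lightonocr": "Read all the text in the image.",
}

def prompt_for(client: str) -> str:
    # raw_text extraction also uses prompt_markdown. PADDLE receives a prompt
    # only for interface compatibility and ignores it.
    return {
        "DOTS": PROMPTS["prompt_dots"],
        "QWEN": PROMPTS["prompt_markdown"],
        "PADDLE": PROMPTS["prompt_markdown"],
        "LIGHTONOCR": PROMPTS["prompt_lightonocr"],
    }[client]
```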

Image Preprocessing

  • Images sent at full resolution (no resize)
  • PDF pages rendered directly to JPEG by PyMuPDF at 200 DPI (zoom=2.78) — no re-encoding needed
  • User images converted to JPEG via prepare_image (EXIF transpose + RGB conversion)
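The 200 DPI / zoom=2.78 pairing follows from PyMuPDF's default 72 DPI render resolution (a sketch of the arithmetic only; in real code the factor would feed a `fitz.Matrix(zoom, zoom)`, not shown here):

```python
def zoom_for_dpi(dpi: int, base_dpi: int = 72) -> float:
    # PyMuPDF renders at 72 DPI by default; zoom scales that up.
    return round(dpi / base_dpi, 2)
```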

Timing Instrumentation

All key operations log elapsed time at INFO level using time.perf_counter(). Log format: operation_name key=value elapsed=X.XXXs.

| Operation | Location | Keys logged |
| --- | --- | --- |
| blob_download | activities/ocr.py, activities/pdf.py | object_key, bytes, page (PDF) |
| image_preprocess | activities/ocr.py, activities/pdf.py | object_key or page |
| graph_invoke | activities/ocr.py, activities/pdf.py | object_key or page, chars, status on error |
| pdf_extract | activities/pdf.py | object_key, pages |
| model_call | services/model_service.py | client (DOTS/QWEN/PADDLE), tokens |
| router_classify | infrastructure/triton_client.py | language, margin |
| json2md | core/nodes/document.py | — |
| image_crop | core/nodes/images.py | index |

Error paths also log timing (e.g., model timeouts, failed graph invocations).
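A context-manager sketch of this instrumentation (function name and structure are illustrative; the key point is `perf_counter` plus the `operation_name key=value elapsed=X.XXXs` format, emitted even when the body raises):

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("ocr_workflow")

@contextmanager
def timed(operation: str, **keys):
    # Logs "operation_name key=value elapsed=X.XXXs" at INFO. The finally
    # block ensures error paths (timeouts, failed invocations) log too.
    start = time.perf_counter()
    try:
        yield
    finally:
        kv = " ".join(f"{k}={v}" for k, v in keys.items())
        log.info("%s %s elapsed=%.3fs", operation, kv,
                 time.perf_counter() - start)
```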

Baseline Timing (L40S GPU, 2026-03-28)

Measured across 5 files (3 single images + 2 PDFs, 15 pages total):

| Operation | Min | Median | Max | Notes |
| --- | --- | --- | --- | --- |
| blob_download | 3ms | 4ms | 7ms | Minio local, negligible |
| image_preprocess | 116ms | 163ms | 181ms | EXIF + JPEG + base64 (user images); base64 only (PDF pages) |
| router_classify | 25ms | 36ms | 50ms | SigLIP2 language classification |
| pdf_extract (6 pages) | 768ms | 833ms | 833ms | PyMuPDF render to JPEG at 200 DPI + upload |
| model_call LIGHTONOCR | 2.3s | 3.7s | 7.6s | English pages, 350–950 tokens |
| model_call DOTS | 1.5s | 13.2s | 19.1s | General pages, JSON layout |
| model_call QWEN | 0.4s | 2.7s | 12.6s | Fallback / embedded image OCR |
| image_crop | 3ms | 9ms | 9ms | Negligible |
| json2md | <1ms | <1ms | 1ms | Negligible |
| graph_invoke | 2.3s | 5.1s | 19.1s | Full pipeline per page |
| total per page | ~2.5s | ~5.4s | ~19.4s | Sum of download + preprocess + router + graph |

Model inference (LIGHTONOCR/DOTS + QWEN) accounts for >98% of total processing time. English pages via LIGHTONOCR are ~3.5x faster than DOTS (median 3.7s vs 13.2s).

Performance: New Stack vs Old Stack

| Metric | Old (DOTS+QWEN) | New (Router+LIGHTONOCR+DOTS) |
| --- | --- | --- |
| English page (median) | ~12.6s | ~3.9s (3.2x faster) |
| Arabic/General page (median) | ~12.6s | ~13.4s (same) |
| Total per page (median) | ~12.6s | ~5.4s |
| 6-page English PDF | ~60-75s | ~10s |
| English accuracy | 91.1/100 | 93.7/100 |
| Arabic numerals (kfd p4) | 6/6 | 6/6 |
| Router overhead | — | 36ms/page |

Other Resilience Behaviors

  • Temporal retry policies: Activities retry up to 3 times with exponential backoff before failing
  • Partial PDF success: If some pages fail, the workflow returns "partial" status with a list of failed pages rather than failing entirely
  • Non-retryable error classification: ValueError/TypeError/WorkflowError are marked as non-retryable (immediate failure); other exceptions (including StorageError) are retried by Temporal
  • Bucket creation race handling: Concurrent create_bucket conflicts (BucketAlreadyOwnedByYou/BucketAlreadyExists) are handled gracefully
  • Graph recursion limit: LangGraph recursion limits are caught and surfaced as WorkflowError (non-retryable) in both run_ocr_activity and process_page_activity. Limit is auto-computed from MAX_IMAGES.
  • Error mapping: OpenAI SDK errors (timeout, connection, HTTP status) and Triton gRPC errors are mapped to typed ModelError subclasses for consistent fallback handling
  • Page cleanup on extraction failure: Uploaded page images are deleted from blob storage when page extraction fails
  • Log levels for fallbacks: Non-critical fallbacks (image dimension failures in ocr.py and pdf.py) log at WARNING because they don't affect OCR output — dimensions are optional metadata. Health check failures log at ERROR because they indicate the service cannot function.
  • Router model init: router_init uses atomic copy (.tmp + mv) to prevent partial model files on interrupted exports. Idempotent — skips if all 3 files exist. Retries up to 3 times on failure (restart: on-failure:3).
  • Model init version checking: All model init containers (dots_ocr_init, qwen_init, lightonocr_init) write a .model_version marker file after successful download. On startup, if the marker is missing or doesn't match the expected HuggingFace repo, the stale model directory is removed and re-downloaded. This prevents Docker named volumes from silently serving old model versions after image rebuilds. Init containers retry up to 3 times on download failure (restart: on-failure:3).
  • Azure status mapping: Unknown Temporal workflow statuses map to running in the Azure-compat API (azure_routes.py) and log at WARNING. Safe default — unknown statuses imply the workflow hasn't reached a terminal state.
  • MIME type fallback: If file extension MIME detection fails or returns a non-image type, prepare_image_from_path defaults to image/jpeg and logs at WARNING (ocr_service.py). Only affects local file loading (dev/test path).
  • Blob storage content-type defaults: Unknown upload content-type maps to .bin extension and logs at WARNING; missing download content-type defaults to application/octet-stream and logs at WARNING (blob_storage.py).
  • Router threshold env var: ROUTER_THRESHOLD (default 0.02) is configurable in the Triton router pipeline model (paddle/triton_models/router_pipeline/1/model.py), not in settings.py — it runs inside Triton, not the Python app. Logged at INFO on model init.
  • Upload size limit: Requests over MAX_UPLOAD_SIZE_MB (default 100) rejected with 413 before any processing. If Content-Length header is present, checked before reading body (avoids memory allocation). If missing (chunked transfer), body is streamed with a counting reader that aborts at the limit — never loads more than max_bytes into memory. Malformed Content-Length returns 400.
  • Minio timeouts: boto3 client configured with 10s connect / 60s read timeout. Stalled connections raise ConnectTimeoutError/ReadTimeoutError, retried by boto3 (3 attempts) then by Temporal activity retries. Prevents indefinite hangs on storage operations.
  • Failed image tracking: Images that fail OCR with all models are tracked in WorkflowState.failed_images and logged as a summary in json2md. The result is returned with empty content for those images — no failure raised.
  • StorageError: Minio failures wrapped in StorageError for structured logging (retryable by Temporal)
  • Auto-computed recursion limit: langgraph_recursion_limit is derived from MAX_IMAGES as 10 + (4 * MAX_IMAGES), logged at startup. The LANGGRAPH_RECURSION_LIMIT env var is no longer read — if set, it is silently ignored (extra="ignore" in pydantic-settings).
  • Empty OCR result detection: If OCR produces zero-length content (blank page, model failure returning empty), logged at WARNING level (ocr_empty_result). Result is returned as-is — no failure raised. Applies to both single images and PDF pages.
  • Shutdown cleanup logging: If Triton gRPC channel close fails during shutdown, logged at ERROR with traceback. Shutdown continues regardless.
  • Unified log level: LOG_LEVEL env var (default INFO) controls the ocr_workflow logger. Invalid values fall back to INFO with a WARNING log (no crash). Library loggers (uvicorn, boto3, temporalio) remain at WARNING.
  • Document-level token limit detection: If an LLM response at the document level uses ≥95% of max_tokens, treated as truncated and triggers fallback to the next model in the chain. Prevents silently serving cut-off content. Not applicable to PADDLE (no token limit concept).
  • Image count guard via fallback strategy: When DOTS_MODEL detects >MAX_IMAGES images in a page, the llm_node consults the DocumentFallbackStrategy to advance to the next model (typically QWEN for plain text). This avoids per-image processing overhead and prevents infinite retry loops.
  • Router classify failure logging: Triton router classification failures (InferenceServerException, gRPC errors, unexpected exceptions) log at WARNING with elapsed time before raising ModelError (triton_client.py).
  • Router classify missing keys: If the router response is missing language or margin keys, logged at WARNING before falling back to "?" / 0.0 defaults (triton_client.py).
  • Image fallback retry logging: When an image OCR attempt fails and retries with a different model, logged at WARNING (images.py).
  • Token limit check failure: If check_token_limit raises an unexpected exception, logged at WARNING with traceback; defaults to no-limit (safe, avoids false-positive truncation detection) (model_service.py).
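The chunked-upload guard described under "Upload size limit" can be sketched as a counting reader (class and exception names are illustrative, not the service's actual identifiers):

```python
class LimitExceeded(Exception):
    """Would be mapped to HTTP 413 at the API layer."""

class CountingReader:
    """Wraps a chunked request body and aborts once max_bytes is crossed,
    so no more than the limit is ever buffered in memory."""

    def __init__(self, chunks, max_bytes: int):
        self.chunks = iter(chunks)
        self.max_bytes = max_bytes
        self.seen = 0

    def read_all(self) -> bytes:
        buf = bytearray()
        for chunk in self.chunks:
            self.seen += len(chunk)
            if self.seen > self.max_bytes:
                raise LimitExceeded("upload exceeds limit")
            buf.extend(chunk)
        return bytes(buf)
```

When Content-Length is present the same limit is enforced before the body is read at all; this reader covers the chunked-transfer case where the size is unknown up front.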