siraaj-dot-ocr-service / docs/architecture.md

Architecture & Flows

Last updated: 4/16/2026

Architecture

The service uses Temporal for workflow orchestration and Minio for document storage:

Components:

  • API Server (FastAPI/Uvicorn): Receives Azure Document Intelligence API requests, uploads documents to Minio, starts Temporal workflows, queries workflow status
  • Minio: Blob storage for document images (24-hour lifecycle policy)
  • Temporal Server: Workflow orchestration with PostgreSQL persistence
  • Temporal UI: Debugging and visibility at port 8080
  • Temporal Workers: Execute OCR workflows/activities with automatic retry and failure handling
  • Triton Inference Server: Runs PaddleOCR and language router models, accessed directly via gRPC
  • Router Init: Exports SigLIP2 router ONNX model to host filesystem for Triton's bind mount (mirrors qwen_init/lightonocr_init pattern)

Ports:

  • 9191: FastAPI API Server
  • 9000: Minio S3 API
  • 9001: Minio Console UI
  • 7233: Temporal Server
  • 8080: Temporal UI
  • 8000: DotsOCR (DOTS model)
  • 8001: Qwen (QWEN model)
  • 8003: LightOnOCR-2 (English OCR model)
  • 8100: Triton HTTP API
  • 8101: Triton gRPC API (used by TritonClient for PaddleOCR + language router)

Flows

Single Image:

  1. Client POSTs image (JPEG/PNG) → API generates workflow ID → uploads to Minio → starts ImageAnalyzeWorkflow → returns 202 Accepted
  2. Worker downloads image → prepares (EXIF transpose + JPEG) → classifies language (ENGLISH/GENERAL) → selects model → runs LangGraph OCR pipeline → stores result
  3. Client polls for result

Multi-Page PDF:

  1. Client POSTs PDF → API generates workflow ID → uploads to Minio → starts PDFAnalyzeWorkflow → returns 202 Accepted
  2. Worker executes extract_pages_activity: Downloads PDF, validates, extracts pages as JPEGs at 200 DPI (zoom=2.78) via PyMuPDF, uploads to {workflow_id}/pages/001.jpg, etc.
  3. Fan-out: Worker spawns process_page_activity for each page in parallel (configurable concurrency). Each page is classified by language and routed to the appropriate OCR model (no re-encoding needed — already JPEG from extraction).
  4. Fan-in: Collects all page results, sorts by page number, aggregates into final result
  5. Client polls for result with all pages combined
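The fan-out/fan-in step can be sketched with asyncio (a simplified model: `process_page` stands in for `process_page_activity`, and the real workers run under Temporal, not plain asyncio):

```python
import asyncio

async def process_page(page_no: int, sem: asyncio.Semaphore) -> dict:
    # Stand-in for process_page_activity: classify language, run the OCR chain.
    async with sem:
        await asyncio.sleep(0)  # placeholder for model inference
        return {"page": page_no, "markdown": f"page {page_no} text"}

async def run_pdf(num_pages: int, concurrency: int = 4) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)  # configurable concurrency cap
    # Fan-out: one task per page, bounded by the semaphore.
    tasks = [process_page(n, sem) for n in range(1, num_pages + 1)]
    results = await asyncio.gather(*tasks)
    # Fan-in: sort by page number and aggregate into the final result.
    return sorted(results, key=lambda r: r["page"])

pages = asyncio.run(run_pdf(3))
```

The semaphore is what makes the concurrency "configurable": all page tasks are created up front, but only `concurrency` of them run inference at once.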

Language Routing

The router classifies each page as ENGLISH or GENERAL before OCR, determining which model chain to use.

PDF pages use a two-tier routing system:

  1. PyMuPDF text-layer (primary, <1ms): Extracts embedded text from PDF, counts Latin (A-Za-z) vs Arabic (5 Unicode blocks) characters.
    • >95% Latin → ENGLISH
    • >50% Arabic → GENERAL
    • Otherwise → falls through to SigLIP2
  2. SigLIP2 visual router (fallback, ~30ms): Classifies page image via Triton router_pipeline. Used for scanned PDFs, blank pages, and mixed-language pages where the text-layer is ambiguous.

Single images: SigLIP2 only (no text layer available).

  • Router failure: Defaults to GENERAL (safe, preserves current behavior)
  • ENGLISH route: LightOnOCR as first model, then DOTS → QWEN → PADDLE
  • GENERAL route: DOTS as first model (unchanged), then QWEN → PADDLE

OCR Fallback Strategy

The service uses a multi-tiered fallback strategy:

Client Initialization

The prepare_input entry node sets the initial client for the graph:

  • If client is already set in the initial state (e.g., injected by a routing activity), it is preserved
  • If no client is set, defaults to DOTS_MODEL

This allows callers to override the starting model via graph.invoke({"data_url": ..., "client": custom_client}).
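A minimal sketch of that entry-node behavior (the state shape and the `DOTS_MODEL` sentinel value are illustrative):

```python
DOTS_MODEL = "dots"  # assumed sentinel value for illustration

def prepare_input(state: dict) -> dict:
    # Preserve a caller-injected client (e.g. set by the routing activity);
    # otherwise default to DOTS_MODEL.
    if not state.get("client"):
        state = {**state, "client": DOTS_MODEL}
    return state
```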

Document Fallback Flow

English route (router classified page as ENGLISH):

  • Attempt 1: LIGHTONOCR_MODEL + plain prompt (prompt_lightonocr) → fail
  • Attempt 2: DOTS_MODEL + complex JSON prompt (prompt_dots) → fail
  • Attempt 3: QWEN_MODEL + plain text prompt (prompt_markdown) → fail
  • Attempt 4: PADDLE_MODEL (traditional OCR) → final attempt
  • Result: status="error", markdown="" (if all 4 attempts fail)

General route (router classified page as GENERAL, or router failure):

  • Attempt 1: DOTS_MODEL + complex JSON prompt (prompt_dots) → fail
  • Attempt 2: QWEN_MODEL + plain text prompt (prompt_markdown) → fail
  • Attempt 3: PADDLE_MODEL (traditional OCR) → final attempt
  • Result: status="error", markdown="" (if all 3 attempts fail)

Image (Picture Blocks) Fallback Flow

When DOTS_MODEL detects "Picture" blocks in the layout, each embedded image is processed with its own fallback:

  • Attempt 1: QWEN_MODEL + plain text prompt (per image) → fail
  • Attempt 2: PADDLE_MODEL (traditional OCR) → final attempt for that image
  • Result: Failed images (after both attempts) are skipped and logged

Image Count Guard

To avoid hitting LangGraph's recursion limit and to bound processing time, there is a configurable maximum image count per page (MAX_IMAGES, default: 25):

  • If DOTS_MODEL extracts ≤25 images: Process each image individually
  • If DOTS_MODEL extracts >25 images: Treated as a failed attempt — triggers the fallback strategy which advances to the next model in the chain (e.g., QWEN for plain text extraction, bypassing per-image processing entirely)

Models

  • DOTS_MODEL: DotsOCR via DOTS_VLLM_BASE_URL (OpenAI-compatible HTTP)
  • QWEN_MODEL: Qwen3-VL via QWEN_VLLM_BASE_URL (OpenAI-compatible HTTP)
  • PADDLE_MODEL: PaddleOCR via TRITON_GRPC_URL (direct Triton gRPC)
  • LIGHTONOCR_MODEL: LightOnOCR-2 via LIGHTONOCR_VLLM_BASE_URL (OpenAI-compatible HTTP) — English OCR, routed via SigLIP2 language classifier

Failure Detection

  1. Token Limit Detection: If LLM response uses ≥95% of max_tokens, it's treated as truncated → fallback. Applies to both document-level and image-level OCR. Not applicable to PADDLE (not an LLM).
  2. Crop Errors: If image cropping fails (invalid bbox), the image is skipped
  3. Model Errors: Network errors, timeouts, or inference failures trigger fallback
  4. Parse Errors: Invalid JSON or unparseable responses trigger fallback
  5. gRPC Errors (Triton): UNAVAILABLE → ModelConnectionError, DEADLINE_EXCEEDED → ModelTimeoutError, InferenceServerException → ModelError — all trigger fallback

The fallback strategy is attempt-number-based (transport-agnostic): it decides the next client based on the attempt count, not the error type.
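The ≥95% truncation check from point 1 amounts to a one-line predicate (a sketch; the real check lives in model_service.py and the parameter names here are assumptions):

```python
def is_truncated(completion_tokens: int, max_tokens: int,
                 threshold: float = 0.95) -> bool:
    # A response consuming >=95% of max_tokens is treated as truncated and
    # triggers fallback to the next model. PADDLE is exempt (not an LLM).
    return completion_tokens >= threshold * max_tokens
```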

Prompts

All prompts are centralized in ocr_workflow/domain/prompts.py via the PROMPTS dictionary.

| Key | Used By | When |
| --- | --- | --- |
| prompt_dots | DOTS_MODEL | Main document OCR — extracts layout with bboxes, categories, and text in JSON format. |
| prompt_markdown | QWEN_MODEL, PADDLE_MODEL, raw_text | Document fallback, raw-text extraction, and image OCR — plain text extraction using Qwen3-VL's recommended OCR prompt. |
| prompt_lightonocr | LIGHTONOCR_MODEL | English pages — "Read all the text in the image." |

Note: PADDLE_MODEL receives the prompt for interface compatibility but ignores it — Triton OCR does not use text prompts.
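The table above can be sketched as a dictionary plus a client-to-key mapping (the placeholder strings are not the real prompts; only the LightOnOCR prompt text is quoted from this doc):

```python
PROMPTS = {
    # Keys mirror ocr_workflow/domain/prompts.py; bodies here are placeholders.
    "prompt_dots": "<layout-JSON extraction prompt>",
    "prompt_markdown": "<plain-text OCR prompt>",
    "prompt_lightonocr": "Read all the text in the image.",
}

def prompt_for(client: str) -> str:
    # raw_text extraction also uses prompt_markdown. PADDLE receives a prompt
    # only for interface compatibility and ignores it.
    return {
        "DOTS": PROMPTS["prompt_dots"],
        "QWEN": PROMPTS["prompt_markdown"],
        "PADDLE": PROMPTS["prompt_markdown"],
        "LIGHTONOCR": PROMPTS["prompt_lightonocr"],
    }[client]
```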

Image Preprocessing

  • Images sent at full resolution (no resize)
  • PDF pages rendered directly to JPEG by PyMuPDF at 200 DPI (zoom=2.78) — no re-encoding needed
  • User images converted to JPEG via prepare_image (EXIF transpose + RGB conversion)
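The 200 DPI / zoom=2.78 pairing follows from PyMuPDF's default 72 DPI render resolution (a sketch of the arithmetic only; in real code the factor would feed a `fitz.Matrix(zoom, zoom)`, not shown here):

```python
def zoom_for_dpi(dpi: int, base_dpi: int = 72) -> float:
    # PyMuPDF renders at 72 DPI by default; zoom scales that up.
    return round(dpi / base_dpi, 2)
```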

Timing Instrumentation

All key operations log elapsed time at INFO level using time.perf_counter(). Log format: operation_name key=value elapsed=X.XXXs.

| Operation | Location | Keys logged |
| --- | --- | --- |
| blob_download | activities/ocr.py, activities/pdf.py | object_key, bytes, page (PDF) |
| image_preprocess | activities/ocr.py, activities/pdf.py | object_key or page |
| graph_invoke | activities/ocr.py, activities/pdf.py | object_key or page, chars, status on error |
| pdf_extract | activities/pdf.py | object_key, pages |
| model_call | services/model_service.py | client (DOTS/QWEN/PADDLE), tokens |
| router_classify | infrastructure/triton_client.py | language, margin |
| json2md | core/nodes/document.py | — |
| image_crop | core/nodes/images.py | index |

Error paths also log timing (e.g., model timeouts, failed graph invocations).
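A context-manager sketch of this instrumentation (function name and structure are illustrative; the key point is `perf_counter` plus the `operation_name key=value elapsed=X.XXXs` format, emitted even when the body raises):

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("ocr_workflow")

@contextmanager
def timed(operation: str, **keys):
    # Logs "operation_name key=value elapsed=X.XXXs" at INFO. The finally
    # block ensures error paths (timeouts, failed invocations) log too.
    start = time.perf_counter()
    try:
        yield
    finally:
        kv = " ".join(f"{k}={v}" for k, v in keys.items())
        log.info("%s %s elapsed=%.3fs", operation, kv,
                 time.perf_counter() - start)
```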

Baseline Timing (L40S GPU, 2026-03-28)

Measured across 5 files (3 single images + 2 PDFs, 15 pages total):

| Operation | Min | Median | Max | Notes |
| --- | --- | --- | --- | --- |
| blob_download | 3ms | 4ms | 7ms | Minio local, negligible |
| image_preprocess | 116ms | 163ms | 181ms | EXIF + JPEG + base64 (user images); base64 only (PDF pages) |
| router_classify | 25ms | 36ms | 50ms | SigLIP2 language classification |
| pdf_extract (6 pages) | 768ms | 833ms | 833ms | PyMuPDF render to JPEG at 200 DPI + upload |
| model_call LIGHTONOCR | 2.3s | 3.7s | 7.6s | English pages, 350–950 tokens |
| model_call DOTS | 1.5s | 13.2s | 19.1s | General pages, JSON layout |
| model_call QWEN | 0.4s | 2.7s | 12.6s | Fallback / embedded image OCR |
| image_crop | 3ms | 9ms | 9ms | Negligible |
| json2md | <1ms | <1ms | 1ms | Negligible |
| graph_invoke | 2.3s | 5.1s | 19.1s | Full pipeline per page |
| total per page | ~2.5s | ~5.4s | ~19.4s | Sum of download + preprocess + router + graph |

Model inference (LIGHTONOCR/DOTS + QWEN) accounts for >98% of total processing time. English pages via LIGHTONOCR are ~3.5x faster than DOTS (median 3.7s vs 13.2s).

Performance: New Stack vs Old Stack

| Metric | Old (DOTS+QWEN) | New (Router+LIGHTONOCR+DOTS) |
| --- | --- | --- |
| English page (median) | ~12.6s | ~3.9s (3.2x faster) |
| Arabic/General page (median) | ~12.6s | ~13.4s (same) |
| Total per page (median) | ~12.6s | ~5.4s |
| 6-page English PDF | ~60-75s | ~10s |
| English accuracy | 91.1/100 | 93.7/100 |
| Arabic numerals (kfd p4) | 6/6 | 6/6 |
| Router overhead | — | 36ms/page |

Other Resilience Behaviors

  • Temporal retry policies: Activities retry up to 3 times with exponential backoff before failing
  • Partial PDF success: If some pages fail, the workflow returns "partial" status with a list of failed pages rather than failing entirely
  • Non-retryable error classification: ValueError/TypeError/WorkflowError are marked as non-retryable (immediate failure); other exceptions (including StorageError) are retried by Temporal
  • Bucket creation race handling: Concurrent create_bucket conflicts (BucketAlreadyOwnedByYou/BucketAlreadyExists) are handled gracefully
  • Graph recursion limit: LangGraph recursion limits are caught and surfaced as WorkflowError (non-retryable) in both run_ocr_activity and process_page_activity. Limit is auto-computed from MAX_IMAGES.
  • Error mapping: OpenAI SDK errors (timeout, connection, HTTP status) and Triton gRPC errors are mapped to typed ModelError subclasses for consistent fallback handling
  • Page cleanup on extraction failure: Uploaded page images are deleted from blob storage when page extraction fails
  • Log levels for fallbacks: Non-critical fallbacks (image dimension failures in ocr.py and pdf.py) log at WARNING because they don't affect OCR output — dimensions are optional metadata. Health check failures log at ERROR because they indicate the service cannot function.
  • Router model init: router_init uses atomic copy (.tmp + mv) to prevent partial model files on interrupted exports. Idempotent — skips if all 3 files exist. Retries up to 3 times on failure (restart: on-failure:3).
  • Model init version checking: All model init containers (dots_ocr_init, qwen_init, lightonocr_init) write a .model_version marker file after successful download. On startup, if the marker is missing or doesn't match the expected HuggingFace repo, the stale model directory is removed and re-downloaded. This prevents Docker named volumes from silently serving old model versions after image rebuilds. Init containers retry up to 3 times on download failure (restart: on-failure:3).
  • Azure status mapping: Unknown Temporal workflow statuses map to running in the Azure-compat API (azure_routes.py) and log at WARNING. Safe default — unknown statuses imply the workflow hasn't reached a terminal state.
  • MIME type fallback: If file extension MIME detection fails or returns a non-image type, prepare_image_from_path defaults to image/jpeg and logs at WARNING (ocr_service.py). Only affects local file loading (dev/test path).
  • Blob storage content-type defaults: Unknown upload content-type maps to .bin extension and logs at WARNING; missing download content-type defaults to application/octet-stream and logs at WARNING (blob_storage.py).
  • Router threshold env var: ROUTER_THRESHOLD (default 0.02) is configurable in the Triton router pipeline model (paddle/triton_models/router_pipeline/1/model.py), not in settings.py — it runs inside Triton, not the Python app. Logged at INFO on model init.
  • Upload size limit: Requests over MAX_UPLOAD_SIZE_MB (default 100) rejected with 413 before any processing. If Content-Length header is present, checked before reading body (avoids memory allocation). If missing (chunked transfer), body is streamed with a counting reader that aborts at the limit — never loads more than max_bytes into memory. Malformed Content-Length returns 400.
  • Minio timeouts: boto3 client configured with 10s connect / 60s read timeout. Stalled connections raise ConnectTimeoutError/ReadTimeoutError, retried by boto3 (3 attempts) then by Temporal activity retries. Prevents indefinite hangs on storage operations.
  • Failed image tracking: Images that fail OCR with all models are tracked in WorkflowState.failed_images and logged as a summary in json2md. The result is returned with empty content for those images — no failure raised.
  • StorageError: Minio failures wrapped in StorageError for structured logging (retryable by Temporal)
  • Auto-computed recursion limit: langgraph_recursion_limit is derived from MAX_IMAGES as 10 + (4 * MAX_IMAGES), logged at startup. The LANGGRAPH_RECURSION_LIMIT env var is no longer read — if set, it is silently ignored (extra="ignore" in pydantic-settings).
  • Empty OCR result detection: If OCR produces zero-length content (blank page, model failure returning empty), logged at WARNING level (ocr_empty_result). Result is returned as-is — no failure raised. Applies to both single images and PDF pages.
  • Shutdown cleanup logging: If Triton gRPC channel close fails during shutdown, logged at ERROR with traceback. Shutdown continues regardless.
  • Unified log level: LOG_LEVEL env var (default INFO) controls the ocr_workflow logger. Invalid values fall back to INFO with a WARNING log (no crash). Library loggers (uvicorn, boto3, temporalio) remain at WARNING.
  • Document-level token limit detection: If an LLM response at the document level uses ≥95% of max_tokens, treated as truncated and triggers fallback to the next model in the chain. Prevents silently serving cut-off content. Not applicable to PADDLE (no token limit concept).
  • Image count guard via fallback strategy: When DOTS_MODEL detects >MAX_IMAGES images in a page, the llm_node consults the DocumentFallbackStrategy to advance to the next model (typically QWEN for plain text). This avoids per-image processing overhead and prevents infinite retry loops.
  • Router classify failure logging: Triton router classification failures (InferenceServerException, gRPC errors, unexpected exceptions) log at WARNING with elapsed time before raising ModelError (triton_client.py).
  • Router classify missing keys: If the router response is missing language or margin keys, logged at WARNING before falling back to "?" / 0.0 defaults (triton_client.py).
  • Image fallback retry logging: When an image OCR attempt fails and retries with a different model, logged at WARNING (images.py).
  • Token limit check failure: If check_token_limit raises an unexpected exception, logged at WARNING with traceback; defaults to no-limit (safe, avoids false-positive truncation detection) (model_service.py).
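The chunked-upload guard described under "Upload size limit" can be sketched as a counting reader (class and exception names are illustrative, not the service's actual identifiers):

```python
class LimitExceeded(Exception):
    """Would be mapped to HTTP 413 at the API layer."""

class CountingReader:
    """Wraps a chunked request body and aborts once max_bytes is crossed,
    so no more than the limit is ever buffered in memory."""

    def __init__(self, chunks, max_bytes: int):
        self.chunks = iter(chunks)
        self.max_bytes = max_bytes
        self.seen = 0

    def read_all(self) -> bytes:
        buf = bytearray()
        for chunk in self.chunks:
            self.seen += len(chunk)
            if self.seen > self.max_bytes:
                raise LimitExceeded("upload exceeds limit")
            buf.extend(chunk)
        return bytes(buf)
```

When Content-Length is present the same limit is enforced before the body is read at all; this reader covers the chunked-transfer case where the size is unknown up front.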