
Document Language Router Benchmark Report

Last updated: 2026-04-16


  • Date: 2026-03-18
  • GPU: Apple M2 (MPS), 8 GB unified memory
  • Test: 32 pages across 5 PDFs — English tables, Arabic text, mixed bilingual, blank/decorative, photos
  • Goal: Classify page images as "pure English" vs "everything else" for OCR model routing

Context

The OCR pipeline uses two models on a single NVIDIA L40S (46 GB VRAM):

  • LightOnOCR-2 (1B): Best English OCR (92.8/100, 3.0s/page, ~2 GB) — English only
  • DotsOCR v1.5 (14B): Best Arabic OCR (6/6 Arabic numerals, ~14 GB) — handles English too (84.4/100)

A fast router is needed to classify each page image and send it to the right model:

  • Pure English -> LightOnOCR-2 (faster, higher quality on English)
  • Everything else (Arabic, mixed, blank, unknown) -> DotsOCR v1.5 (safe default)

Key constraint: false positives (Arabic page sent to English-only model) are dangerous — LightOnOCR-2 garbles Arabic. False negatives (English page sent to DotsOCR) are merely suboptimal.

Approach

Models Tested

Zero-shot CLIP-family models that classify images against text prompts, requiring no training data:

| Model | Params | Release | Pretrained on |
| --- | --- | --- | --- |
| SigLIP2 ViT-B-32-256 | 86M | Feb 2025 | WebLI (Google's multilingual web data) |
| SigLIP2 ViT-B-16-256 | 86M | Feb 2025 | WebLI |
| MobileCLIP2-S0 | 11M+63M | Aug 2025 | DataComp (natural images) |
| MobileCLIP-S1 | 21M+63M | CVPR 2024 | DataComp |

Key Finding: Binary Prompts Required

Initial tests used 3 prompts: ["English text", "Arabic text", "mixed text"]. All models scored ~33% English / ~67% Arabic on every page — no discrimination. The 3-prompt softmax creates a structural bias: "Arabic + mixed" always sums to 2/3.

Binary prompts ["English text", "Arabic text"] fixed this completely. The raw cosine similarity (logit) margin between the two prompts clearly separates English and Arabic pages.
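The structural bias is easy to verify numerically. A minimal sketch — the three-way logits are illustrative values, while the binary pair is taken from the adobe p1 row of the per-page breakdown:

```python
import math

def softmax(logits):
    """Normalize a list of logits into probabilities."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three prompts with near-identical similarity scores (illustrative values):
# "Arabic" and "mixed" together always absorb ~2/3 of the probability mass,
# so "English" can never win a majority vote even on a clean English page.
p_en, p_ar, p_mixed = softmax([0.08, 0.08, 0.08])
print(round(p_en, 3), round(p_ar + p_mixed, 3))  # -> 0.333 0.667

# With binary prompts, the raw logit margin decides directly:
en_logit, ar_logit = 0.0863, 0.0558  # adobe p1 from the per-page breakdown
print(en_logit - ar_logit > 0.02)    # -> True (routed to the English model)
```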

Confidence Threshold

Instead of a simple argmax, a margin threshold is used:

  • margin = en_logit - ar_logit
  • margin > 0.02 -> English (LightOnOCR-2)
  • margin <= 0.02 -> DotsOCR (safe default)

This eliminates all false positives by sending borderline cases to DotsOCR.
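A minimal sketch of this decision rule (the function and return labels are illustrative, not names from the service code):

```python
def route(en_logit: float, ar_logit: float, threshold: float = 0.02) -> str:
    """Route a page to an OCR model by raw logit margin.

    Borderline and negative margins fall back to DotsOCR, the safe default.
    """
    margin = en_logit - ar_logit
    return "lightonocr2" if margin > threshold else "dotsocr"

# Logit values taken from the per-page breakdown:
print(route(0.0863, 0.0558))  # adobe p1 -> lightonocr2 (clear English)
print(route(0.0769, 0.0736))  # oman p10 -> dotsocr (sparse English, margin +0.003)
print(route(0.0854, 0.1408))  # oman p5  -> dotsocr (Bismillah calligraphy)
```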

Threshold Analysis

| Threshold | Accuracy | AR->EN (dangerous) | EN->AR (suboptimal) |
| --- | --- | --- | --- |
| 0.000 | 87.5% | 4 | 0 |
| 0.010 | 87.5% | 2 | 2 |
| 0.020 | 93.8% | 0 | 2 |
| 0.025 | 93.8% | 0 | 2 |
| 0.030 | 93.8% | 0 | 2 |
| 0.035 | 78.1% | 0 | 7 |
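The numbers above come from sweeping the threshold over recorded (truth, margin) pairs. A self-contained sketch of that sweep on a hypothetical five-page mini-corpus (margins loosely modeled on the per-page breakdown, not the full 32-page data):

```python
def sweep(pages, thresholds):
    """Count accuracy, dangerous FPs, and suboptimal FNs per margin threshold.

    `pages` is a list of (truth, margin) pairs, truth in {"en", "ar"}.
    """
    rows = []
    for t in thresholds:
        fp = sum(1 for truth, m in pages if truth == "ar" and m > t)   # AR -> EN
        fn = sum(1 for truth, m in pages if truth == "en" and m <= t)  # EN -> AR
        acc = (len(pages) - fp - fn) / len(pages)
        rows.append((t, acc, fp, fn))
    return rows

# Hypothetical mini-corpus: two English pages (one sparse), three Arabic pages.
pages = [("en", 0.031), ("en", 0.003), ("ar", -0.029), ("ar", 0.017), ("ar", 0.003)]
for t, acc, fp, fn in sweep(pages, [0.0, 0.02]):
    print(f"{t:.3f}  acc={acc:.0%}  FP={fp}  FN={fn}")
```

Raising the threshold trades dangerous false positives for harmless false negatives, which is the asymmetry the report exploits.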

Test Corpus

32 pages across 5 PDFs, ground truth verified by visual inspection of each page:

| PDF | Pages | Content |
| --- | --- | --- |
| adobe-6-page.pdf | p1-p6 (6 EN) | Pure English — cloud vendor tables, compliance diagrams |
| oman-2040-en.pdf | p1-6, p10, p16, p21, p31, p46 (4 EN, 7 AR) | Mixed — EN content, AR cover/Bismillah, blank decorative, photos, mixed EN+AR |
| kfd.pdf | p1-p6 (3 EN, 3 AR) | Bilingual — p1-3 English insurance tables, p4-6 Arabic translation |
| ar-novel.pdf | p1, p2, p5, p10 (4 AR) | Arabic academic paper — mixed title, Arabic body |
| hasini-asma.pdf | p1-3, p5, p14 (5 AR) | Arabic thesis — cover, blank, decorative, body text |

Total: 13 English, 19 Arabic/other

Results

Model Comparison (binary prompts, threshold=0.02)

| Model | Accuracy | EN Acc | AR Acc | FP (AR->EN) | FN (EN->AR) | Avg Time |
| --- | --- | --- | --- | --- | --- | --- |
| SigLIP2 ViT-B-32-256 | 93.8% | 84.6% | 100% | 0 | 2 | 21ms |
| SigLIP2 ViT-B-16-256 | 93.8% | 84.6% | 100% | 0 | 2 | 32ms |
| MobileCLIP2-S0 | 53.3% | — | — | — | — | 33ms |
| MobileCLIP-S1 | 33.3% | — | — | — | — | 34ms |

MobileCLIP models failed completely — they predict the same class for every page regardless of content. Trained on natural image-caption pairs (DataComp), they never learned to recognize document scripts. SigLIP2's WebLI training data (multilingual web content) is the differentiator.

SigLIP2 ViT-B-32-256 Per-Page Breakdown (threshold=0.02)

| Page | Truth | Pred | EN logit | AR logit | Margin | Time | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| adobe p1 | en | en | 0.0863 | 0.0558 | +0.031 | 16ms | warmup |
| adobe p2 | en | en | 0.0984 | 0.0597 | +0.039 | 18ms | |
| adobe p3 | en | en | 0.0863 | 0.0530 | +0.033 | 14ms | |
| adobe p4 | en | en | 0.1016 | 0.0656 | +0.036 | 19ms | |
| adobe p5 | en | en | 0.0807 | 0.0465 | +0.034 | 25ms | |
| adobe p6 | en | en | 0.0969 | 0.0668 | +0.030 | 22ms | |
| oman p1 | ar | ar | 0.1093 | 0.0926 | +0.017 | 23ms | mixed cover, below threshold |
| oman p2 | ar | ar | 0.0702 | 0.0718 | -0.002 | 15ms | blank decorative |
| oman p3 | ar | ar | 0.0693 | 0.0662 | +0.003 | 22ms | blank decorative, below threshold |
| oman p4 | ar | ar | 0.0769 | 0.0740 | +0.003 | 21ms | blank decorative, below threshold |
| oman p5 | ar | ar | 0.0854 | 0.1408 | -0.055 | 13ms | Bismillah calligraphy |
| oman p6 | ar | ar | 0.0566 | 0.0858 | -0.029 | 16ms | photo (Sultan) |
| oman p10 | en | ar | 0.0769 | 0.0736 | +0.003 | 25ms | EN infographic, sparse text |
| oman p16 | ar | ar | 0.0912 | 0.0723 | +0.019 | 16ms | mixed EN+AR, below threshold |
| oman p21 | en | en | 0.0783 | 0.0407 | +0.038 | 24ms | EN table |
| oman p31 | en | en | 0.0758 | 0.0438 | +0.032 | 17ms | EN table |
| oman p46 | en | en | 0.0911 | 0.0525 | +0.039 | 24ms | EN text |
| kfd p1 | en | ar | 0.0455 | 0.0429 | +0.003 | 21ms | EN table with large car photo |
| kfd p2 | en | en | 0.0832 | 0.0432 | +0.040 | 11ms | EN table, dense text |
| kfd p3 | en | en | 0.0969 | 0.0430 | +0.054 | 14ms | EN text page |
| kfd p4 | ar | ar | 0.0760 | 0.1046 | -0.029 | 17ms | AR table |
| kfd p5 | ar | ar | 0.0883 | 0.1067 | -0.018 | 17ms | AR text |
| kfd p6 | ar | ar | 0.0996 | 0.1208 | -0.021 | 11ms | AR text |
| ar-novel p1 | ar | ar | 0.0947 | 0.1156 | -0.021 | 13ms | mixed title page |
| ar-novel p2 | ar | ar | 0.0989 | 0.1092 | -0.010 | 21ms | mixed abstract |
| ar-novel p5 | ar | ar | 0.0996 | 0.1228 | -0.023 | 22ms | AR body text |
| ar-novel p10 | ar | ar | 0.1025 | 0.1201 | -0.018 | 25ms | AR body text |
| hasini p1 | ar | ar | 0.0960 | 0.1176 | -0.022 | 15ms | AR cover page |
| hasini p2 | ar | ar | 0.0832 | 0.0886 | -0.006 | 16ms | blank page |
| hasini p3 | ar | ar | 0.0948 | 0.1163 | -0.022 | 20ms | AR cover page |
| hasini p5 | ar | ar | 0.0888 | 0.1347 | -0.046 | 21ms | AR decorative + calligraphy |
| hasini p14 | ar | ar | 0.1031 | 0.1351 | -0.032 | 14ms | AR body text |

The 2 False Negatives (English -> DotsOCR)

Both are English pages with weak text signal (margin +0.003, below 0.02 threshold):

  • oman p10: infographic with sparse text, mostly visual chart
  • kfd p1: table with large car photo occupying most of the page

Impact is minimal: DotsOCR handles English at 84.4/100 (vs LightOnOCR-2's 92.8/100).

Margin Distribution

Clear separation between English and Arabic text pages:

  • English text pages: +0.030 to +0.054 (well above 0.02 threshold)
  • Arabic text pages: -0.006 to -0.055 (well below threshold)
  • Mixed/blank pages: -0.002 to +0.019 (correctly below threshold)
  • Weak English: +0.003 (sparse text / photo-heavy — below threshold, safe fallback)

Also Tested: PyMuPDF Text Extraction

For PDF inputs (not scanned images), PyMuPDF can extract the embedded text layer and detect language via Unicode character ranges:

```python
text = page.get_text()  # PyMuPDF: embedded text layer ("" for scanned pages)
# U+0600-U+06FF is the main Arabic block; the "..." stands for additional
# Arabic ranges (e.g. Presentation Forms) not shown here
arabic = sum(1 for c in text if '\u0600' <= c <= '\u06FF' or ...)
latin = sum(1 for c in text if 'A' <= c <= 'Z' or 'a' <= c <= 'z')
```
| Metric | PyMuPDF + Unicode | SigLIP2 B32 |
| --- | --- | --- |
| Accuracy | ~95% (on PDFs with text layer) | 93.8% (on any image) |
| Speed | <1ms/page | 21ms/page |
| VRAM | 0 | ~350 MB |
| Works on scans | No (returns empty) | Yes |
| Works on images | No (needs PDF) | Yes |

PyMuPDF is faster and more accurate when a text layer exists, but fails on scanned documents. It can be used as a first-pass optimization, falling back to SigLIP2 when no text layer is present.
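A hedged sketch of that first-pass heuristic, operating on an already-extracted text string so it needs no PDF. The function name is illustrative, and only the main Arabic Unicode block is checked (real code may also include Arabic Supplement and Presentation Forms ranges); the 95% Latin cutoff follows the routing logic proposed later in this report:

```python
def route_by_text_layer(text: str, latin_ratio: float = 0.95):
    """First-pass router on an extracted PDF text layer.

    Returns "lightonocr2", "dotsocr", or None when there is no usable text
    (e.g. a scanned page) and the caller should fall through to SigLIP2.
    """
    arabic = sum(1 for c in text if "\u0600" <= c <= "\u06FF")  # main Arabic block
    latin = sum(1 for c in text if c.isascii() and c.isalpha())
    scripted = arabic + latin
    if scripted == 0:
        return None  # empty or non-script text layer -> visual classification
    return "lightonocr2" if latin / scripted > latin_ratio else "dotsocr"

print(route_by_text_layer("Annual report 2026"))  # lightonocr2
print(route_by_text_layer("تقرير سنوي 2026"))      # dotsocr
print(route_by_text_layer(""))                     # None -> SigLIP2 fallback
```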

Triton Deployment Validation

The SigLIP2 vision encoder was exported to ONNX and deployed inside Triton Inference Server with a BLS (Business Logic Scripting) pipeline. Full end-to-end validation on the L40S GPU confirmed identical results to the PyTorch benchmark.

ONNX Export

  • Vision encoder only (text embeddings pre-computed as static .npy file)
  • ONNX opset 17, native IR version (no downgrade needed on Triton 26.02+)
  • Model size: 1.4 MB ONNX + 361 MB weights
  • Validation: 0.000000 max difference in features and logits vs PyTorch

Triton Architecture

```
triton-server (GPU)
  └── router_onnx (ONNX Runtime, GPU) — vision encoder inference

TritonClient (direct gRPC)
  └── classify(data_url) → gRPC → router_pipeline (BLS, CPU)
        └── image decode, PIL BICUBIC resize, normalize, call router_onnx,
            cosine similarity with text embeddings, margin threshold → route decision
```

Triton Results (L40S, 10 runs per page)

| Metric | PyTorch (M2 MPS) | Triton (L40S GPU) |
| --- | --- | --- |
| Accuracy | 93.8% (30/32) | 93.8% (30/32) |
| English accuracy | 84.6% (11/13) | 84.6% (11/13) |
| Arabic accuracy | 100% (19/19) | 100% (19/19) |
| FP (AR→EN, dangerous) | 0 | 0 |
| FN (EN→AR, suboptimal) | 2 | 2 |
| Avg time | 21ms (model only) | 55ms (end-to-end HTTP) |

All 32 page logits match the PyTorch benchmark to 4 decimal places. The same 2 false negatives (kfd p1, oman p10 — sparse text with large photos) fall below the 0.02 threshold.

The 55ms end-to-end time includes HTTP request parsing, base64 decode, PIL preprocessing, gRPC to Triton, GPU inference, cosine similarity, and JSON response serialization.

Key Preprocessing Detail

PIL BICUBIC resize must be used (not OpenCV INTER_CUBIC) to match open_clip's preprocessing pipeline. Using OpenCV produces slightly different pixel values that shift margins enough to affect threshold-sensitive pages.
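A sketch of the matching preprocessing step. The mean/std of 0.5 is the standard SigLIP normalization and is an assumption here; verify it against the model's actual preprocess_cfg before relying on it:

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, size: int = 256) -> np.ndarray:
    """Resize with PIL BICUBIC and normalize to [-1, 1] (SigLIP-style mean/std 0.5).

    PIL's BICUBIC kernel differs slightly from OpenCV's INTER_CUBIC, which is
    why the pixel values (and margins) only match open_clip when PIL is used.
    """
    img = img.convert("RGB").resize((size, size), Image.BICUBIC)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = (arr - 0.5) / 0.5          # [0, 1] -> [-1, 1]
    return arr.transpose(2, 0, 1)    # HWC -> CHW for the ONNX encoder

x = preprocess(Image.new("RGB", (612, 792), "white"))
print(x.shape)  # (3, 256, 256); a white page maps to all 1.0
```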

Concurrency

The router handles ~32 RPS on the L40S with the original instance counts (1 GPU instance for router_onnx, 2 CPU instances for router_pipeline). Increasing the instance counts caused GPU contention and degraded performance. See Triton Concurrency Tuning for the full analysis.

Recommendation

Primary router: SigLIP2 ViT-B-32-256 (google/siglip2-base-patch32-256)

| Metric | Value |
| --- | --- |
| Accuracy | 93.8% (30/32) |
| False positives (AR->EN) | 0 |
| False negatives (EN->AR) | 2 (safe fallback) |
| Inference time | ~21ms/page (MPS), ~55ms end-to-end via Triton on L40S |
| VRAM | ~350 MB |
| Model size | 86M params |
| Threshold | margin > 0.02 -> English |
| Prompts | ["English text", "Arabic text"] |

Proposed routing logic:

```
page_image -> SigLIP2 encode_image -> cosine similarity with pre-computed text embeddings
  if (en_logit - ar_logit) > 0.02:
    route to LightOnOCR-2  (pure English, fast)
  else:
    route to DotsOCR v1.5  (Arabic/mixed/unknown, safe)
```

Optional optimization for PDF inputs:

```
if pdf has text layer:
  extract text -> count Arabic vs Latin Unicode chars
  if >95% Latin: route to LightOnOCR-2
  else: route to DotsOCR v1.5
else:
  fall through to SigLIP2 visual classification
```

LightOnOCR-2 (~2 GB), DotsOCR (~14 GB), and SigLIP2 (~350 MB) all fit on a single L40S with ~30 GB to spare.
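A quick sanity check on the VRAM budget, using the approximate model footprints quoted above:

```python
l40s_vram_gb = 46.0  # single NVIDIA L40S
resident_gb = {"LightOnOCR-2": 2.0, "DotsOCR v1.5": 14.0, "SigLIP2 router": 0.35}
spare = l40s_vram_gb - sum(resident_gb.values())
print(f"~{spare:.0f} GB to spare")  # ~30 GB to spare
```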

How to Reproduce

PyTorch Benchmark

```shell
pip install torch open_clip_torch transformers pymupdf pillow

# Run the benchmark (default: SigLIP2 B32 on MPS)
python docs/research/benchmark_router/benchmark_router.py

# Run on CUDA
python docs/research/benchmark_router/benchmark_router.py --device cuda --runs 10
```

ONNX Export & Triton Deployment

The router model is automatically exported and baked into the Triton Docker image via a multi-stage build. No manual setup needed:

```shell
# Build Triton image (exports SigLIP2 to ONNX in a builder stage)
docker compose build triton

# Start Triton
docker compose up -d triton

# Validate against the 32-page test corpus
python scripts/validate_router_triton.py --runs 10
```

For standalone export without Docker (e.g., local development):

```shell
pip install torch open_clip_torch transformers onnx onnxscript onnxruntime
python scripts/export_router_onnx.py
```

Test PDFs must be in data/pdfs/ relative to the repo root.