Document Language Router Benchmark Report
Date: 2026-03-18
GPU: Apple M2 (MPS), 8 GB unified memory
Test: 32 pages across 5 PDFs — English tables, Arabic text, mixed bilingual, blank/decorative, photos
Goal: Classify page images as "pure English" vs "everything else" for OCR model routing
Context
The OCR pipeline uses two models on a single NVIDIA L40S (46 GB VRAM):
- LightOnOCR-2 (1B): Best English OCR (92.8/100, 3.0s/page, ~2 GB) — English only
- DotsOCR v1.5 (14B): Best Arabic OCR (6/6 Arabic numerals, ~14 GB) — handles English too (84.4/100)
A fast router is needed to classify each page image and send it to the right model:
- Pure English -> LightOnOCR-2 (faster, higher quality on English)
- Everything else (Arabic, mixed, blank, unknown) -> DotsOCR v1.5 (safe default)
Key constraint: false positives (Arabic page sent to English-only model) are dangerous — LightOnOCR-2 garbles Arabic. False negatives (English page sent to DotsOCR) are merely suboptimal.
Approach
Models Tested
Zero-shot CLIP-family models that classify images against text prompts, requiring no training data:
| Model | Params | Release | Pretrained on |
|---|---|---|---|
| SigLIP2 ViT-B-32-256 | 86M | Feb 2025 | WebLI (Google's multilingual web data) |
| SigLIP2 ViT-B-16-256 | 86M | Feb 2025 | WebLI |
| MobileCLIP2-S0 | 11M+63M | Aug 2025 | DataComp (natural images) |
| MobileCLIP-S1 | 21M+63M | CVPR 2024 | DataComp |
Key Finding: Binary Prompts Required
Initial tests used 3 prompts: ["English text", "Arabic text", "mixed text"]. Every model assigned roughly 33% English / 67% Arabic-or-mixed probability to every page — no discrimination at all. The 3-prompt softmax creates a structural bias: "Arabic" and "mixed" are near-synonymous for these page images, so their combined probability always sums to about 2/3.
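The bias is easy to demonstrate numerically. A minimal sketch (the logit values are illustrative, not taken from the benchmark):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# A page the model cannot discriminate: all three prompts get
# about the same raw similarity.
logits = np.array([0.09, 0.09, 0.09])   # ["English", "Arabic", "mixed"]
p_en, p_ar, p_mixed = softmax(logits)
# p_en is ~1/3 while p_ar + p_mixed is ~2/3: the non-English side
# wins by construction, regardless of page content.
```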
Binary prompts ["English text", "Arabic text"] fixed this completely. The raw cosine similarity (logit) margin between the two prompts clearly separates English and Arabic pages.
Confidence Threshold
Instead of a simple argmax, a margin threshold is used:
margin = en_logit - ar_logit
margin > 0.02  -> English (LightOnOCR-2)
margin <= 0.02 -> DotsOCR (safe default)
This eliminates all false positives by sending borderline cases to DotsOCR.
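A minimal sketch of the routing rule (function and route names are illustrative, not from the codebase):

```python
THRESHOLD = 0.02

def route(en_logit: float, ar_logit: float, threshold: float = THRESHOLD) -> str:
    """Return which OCR model should handle the page.

    Anything at or below the threshold (blank, mixed, photo-heavy pages)
    falls back to DotsOCR, because a misrouted Arabic page (garbled output)
    costs far more than a slowly-processed English one.
    """
    margin = en_logit - ar_logit
    return "lightonocr" if margin > threshold else "dotsocr"
```

For example, `route(0.0969, 0.0430)` (the kfd p3 logits) routes to LightOnOCR-2, while `route(0.0769, 0.0736)` (oman p10, margin +0.003) takes the safe fallback.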
Threshold Analysis
| Threshold | Accuracy | AR->EN (dangerous) | EN->AR (suboptimal) |
|---|---|---|---|
| 0.000 | 87.5% | 4 | 0 |
| 0.010 | 87.5% | 2 | 2 |
| 0.020 | 93.8% | 0 | 2 |
| 0.025 | 93.8% | 0 | 2 |
| 0.030 | 93.8% | 0 | 2 |
| 0.035 | 78.1% | 0 | 7 |
Test Corpus
32 pages across 5 PDFs, ground truth verified by visual inspection of each page:
| File | Pages | Content |
|---|---|---|
| adobe-6-page.pdf | p1-p6 (6 EN) | Pure English — cloud vendor tables, compliance diagrams |
| oman-2040-en.pdf | p1-6, p10, p16, p21, p31, p46 (4 EN, 7 AR) | Mixed — EN content, AR cover/Bismillah, blank decorative, photos, mixed EN+AR |
| kfd.pdf | p1-p6 (3 EN, 3 AR) | Bilingual — p1-3 English insurance tables, p4-6 Arabic translation |
| ar-novel.pdf | p1, p2, p5, p10 (4 AR) | Arabic academic paper — mixed title, Arabic body |
| hasini-asma.pdf | p1-3, p5, p14 (5 AR) | Arabic thesis — cover, blank, decorative, body text |
Total: 13 English, 19 Arabic/other
Results
Model Comparison (binary prompts, threshold=0.02)
| Model | Accuracy | EN Acc | AR Acc | FP (AR->EN) | FN (EN->AR) | Avg Time |
|---|---|---|---|---|---|---|
| SigLIP2 ViT-B-32-256 | 93.8% | 84.6% | 100% | 0 | 2 | 21ms |
| SigLIP2 ViT-B-16-256 | 93.8% | 84.6% | 100% | 0 | 2 | 32ms |
| MobileCLIP2-S0 | 53.3% | — | — | — | — | 33ms |
| MobileCLIP-S1 | 33.3% | — | — | — | — | 34ms |
MobileCLIP models failed completely — they predict the same class for every page regardless of content. Trained on natural image-caption pairs (DataComp), they never learned to recognize document scripts. SigLIP2's WebLI training data (multilingual web content) is the differentiator.
SigLIP2 ViT-B-32-256 Per-Page Breakdown (threshold=0.02)
| Page | Truth | Pred | EN logit | AR logit | Margin | Time | Note |
|---|---|---|---|---|---|---|---|
| adobe p1 | en | en | 0.0863 | 0.0558 | +0.031 | 116ms | warmup |
| adobe p2 | en | en | 0.0984 | 0.0597 | +0.039 | 18ms | |
| adobe p3 | en | en | 0.0863 | 0.0530 | +0.033 | 14ms | |
| adobe p4 | en | en | 0.1016 | 0.0656 | +0.036 | 19ms | |
| adobe p5 | en | en | 0.0807 | 0.0465 | +0.034 | 25ms | |
| adobe p6 | en | en | 0.0969 | 0.0668 | +0.030 | 22ms | |
| oman p1 | ar | ar | 0.1093 | 0.0926 | +0.017 | 23ms | mixed cover, below threshold |
| oman p2 | ar | ar | 0.0702 | 0.0718 | -0.002 | 15ms | blank decorative |
| oman p3 | ar | ar | 0.0693 | 0.0662 | +0.003 | 22ms | blank decorative, below threshold |
| oman p4 | ar | ar | 0.0769 | 0.0740 | +0.003 | 21ms | blank decorative, below threshold |
| oman p5 | ar | ar | 0.0854 | 0.1408 | -0.055 | 13ms | Bismillah calligraphy |
| oman p6 | ar | ar | 0.0566 | 0.0858 | -0.029 | 16ms | photo (Sultan) |
| oman p10 | en | ar | 0.0769 | 0.0736 | +0.003 | 25ms | EN infographic, sparse text |
| oman p16 | ar | ar | 0.0912 | 0.0723 | +0.019 | 16ms | mixed EN+AR, below threshold |
| oman p21 | en | en | 0.0783 | 0.0407 | +0.038 | 24ms | EN table |
| oman p31 | en | en | 0.0758 | 0.0438 | +0.032 | 17ms | EN table |
| oman p46 | en | en | 0.0911 | 0.0525 | +0.039 | 24ms | EN text |
| kfd p1 | en | ar | 0.0455 | 0.0429 | +0.003 | 21ms | EN table with large car photo |
| kfd p2 | en | en | 0.0832 | 0.0432 | +0.040 | 11ms | EN table, dense text |
| kfd p3 | en | en | 0.0969 | 0.0430 | +0.054 | 14ms | EN text page |
| kfd p4 | ar | ar | 0.0760 | 0.1046 | -0.029 | 17ms | AR table |
| kfd p5 | ar | ar | 0.0883 | 0.1067 | -0.018 | 17ms | AR text |
| kfd p6 | ar | ar | 0.0996 | 0.1208 | -0.021 | 11ms | AR text |
| ar-novel p1 | ar | ar | 0.0947 | 0.1156 | -0.021 | 13ms | mixed title page |
| ar-novel p2 | ar | ar | 0.0989 | 0.1092 | -0.010 | 21ms | mixed abstract |
| ar-novel p5 | ar | ar | 0.0996 | 0.1228 | -0.023 | 22ms | AR body text |
| ar-novel p10 | ar | ar | 0.1025 | 0.1201 | -0.018 | 25ms | AR body text |
| hasini p1 | ar | ar | 0.0960 | 0.1176 | -0.022 | 15ms | AR cover page |
| hasini p2 | ar | ar | 0.0832 | 0.0886 | -0.006 | 16ms | blank page |
| hasini p3 | ar | ar | 0.0948 | 0.1163 | -0.022 | 20ms | AR cover page |
| hasini p5 | ar | ar | 0.0888 | 0.1347 | -0.046 | 21ms | AR decorative + calligraphy |
| hasini p14 | ar | ar | 0.1031 | 0.1351 | -0.032 | 14ms | AR body text |
The 2 False Negatives (English -> DotsOCR)
Both are English pages with weak text signal (margin +0.003, below 0.02 threshold):
- oman p10: infographic with sparse text, mostly visual chart
- kfd p1: table with large car photo occupying most of the page
Impact is minimal: DotsOCR handles English at 84.4/100 (vs LightOnOCR-2's 92.8/100).
Margin Distribution
Clear separation between English and Arabic text pages:
- English text pages: +0.030 to +0.054 (well above 0.02 threshold)
- Arabic text pages: -0.006 to -0.055 (well below threshold)
- Mixed/blank pages: -0.002 to +0.019 (correctly below threshold)
- Weak English: +0.003 (sparse text / photo-heavy — below threshold, safe fallback)
Also Tested: PyMuPDF Text Extraction
For PDF inputs (not scanned images), PyMuPDF can extract the embedded text layer and detect language via Unicode character ranges:
text = page.get_text()
arabic = sum(1 for c in text if '\u0600' <= c <= '\u06FF' or ...)
latin = sum(1 for c in text if 'A' <= c <= 'Z' or 'a' <= c <= 'z')
| Metric | PyMuPDF + Unicode | SigLIP2 B32 |
|---|---|---|
| Accuracy | ~95% (on PDFs with text layer) | 93.8% (on any image) |
| Speed | <1ms/page | 21ms/page |
| VRAM | 0 | ~350 MB |
| Works on scans | No (returns empty) | Yes |
| Works on images | No (needs PDF) | Yes |
PyMuPDF is faster and more accurate when a text layer exists, but fails on scanned documents. It can serve as a first-pass optimization, falling back to SigLIP2 when no usable text layer is found.
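The text-layer heuristic can be sketched as a pure function over the string returned by `page.get_text()`. The helper name, the extra Arabic Unicode blocks, and the 95% Latin cutoff (matching the routing logic proposed later in this report) are our assumptions:

```python
# Hypothetical helper; pass it the string from PyMuPDF's page.get_text().
ARABIC_RANGES = (
    (0x0600, 0x06FF),   # Arabic
    (0x0750, 0x077F),   # Arabic Supplement
    (0xFB50, 0xFDFF),   # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),   # Arabic Presentation Forms-B
)

def classify_text_layer(text: str, latin_cutoff: float = 0.95) -> str:
    """Return "en", "other", or "unknown" (no usable text layer)."""
    if not text.strip():
        return "unknown"   # scanned page: fall through to SigLIP2
    arabic = sum(1 for c in text
                 if any(lo <= ord(c) <= hi for lo, hi in ARABIC_RANGES))
    latin = sum(1 for c in text if c.isascii() and c.isalpha())
    total = arabic + latin
    if total == 0:
        return "unknown"   # digits/punctuation only: no script signal
    return "en" if latin / total > latin_cutoff else "other"
```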
Triton Deployment Validation
The SigLIP2 vision encoder was exported to ONNX and deployed inside Triton Inference Server with a BLS (Business Logic Scripting) pipeline. Full end-to-end validation on the L40S GPU confirmed identical results to the PyTorch benchmark.
ONNX Export
- Vision encoder only (text embeddings pre-computed as a static .npy file)
- ONNX opset 17, native IR version (no downgrade needed on Triton 26.02+)
- Model size: 1.4 MB ONNX + 361 MB weights
- Validation: 0.000000 max difference in features and logits vs PyTorch
Triton Architecture
triton-server (GPU)
└── router_onnx (ONNX Runtime, GPU) — vision encoder inference
TritonClient (direct gRPC)
└── classify(data_url) → gRPC → router_pipeline (BLS, CPU)
└── image decode, PIL BICUBIC resize, normalize, call router_onnx,
cosine similarity with text embeddings, margin threshold → route decision
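The cosine-similarity step of the BLS pipeline can be sketched in NumPy. Array names and shapes here are illustrative, not the actual BLS variable names:

```python
import numpy as np

def margin_from_features(img_feat: np.ndarray, txt_emb: np.ndarray) -> float:
    """en - ar logit margin for one image feature vector.

    txt_emb holds the two pre-computed text embeddings (loaded once from
    the static .npy file): row 0 = "English text", row 1 = "Arabic text".
    """
    img = img_feat / np.linalg.norm(img_feat)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    en_logit, ar_logit = txt @ img          # cosine similarities
    return float(en_logit - ar_logit)
```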
Triton Results (L40S, 10 runs per page)
| Metric | PyTorch (M2 MPS) | Triton (L40S GPU) |
|---|---|---|
| Accuracy | 93.8% (30/32) | 93.8% (30/32) |
| English accuracy | 84.6% (11/13) | 84.6% (11/13) |
| Arabic accuracy | 100% (19/19) | 100% (19/19) |
| FP (AR→EN, dangerous) | 0 | 0 |
| FN (EN→AR, suboptimal) | 2 | 2 |
| Avg time | 21ms (model only) | 55ms (end-to-end HTTP) |
All 32 page logits match the PyTorch benchmark to 4 decimal places. The same 2 false negatives (kfd p1, oman p10 — sparse text with large photos) fall below the 0.02 threshold.
The 55ms end-to-end time includes HTTP request parsing, base64 decode, PIL preprocessing, gRPC to Triton, GPU inference, cosine similarity, and JSON response serialization.
Key Preprocessing Detail
PIL BICUBIC resize must be used (not OpenCV INTER_CUBIC) to match open_clip's preprocessing pipeline. Using OpenCV produces slightly different pixel values that shift margins enough to affect threshold-sensitive pages.
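A sketch of the matching preprocessing, assuming the SigLIP convention of mean = std = 0.5 (verify against the transform that open_clip's `create_model_and_transforms` actually returns):

```python
import numpy as np
from PIL import Image

def preprocess_page(img: Image.Image, size: int = 256) -> np.ndarray:
    """Resize and normalize a page image for the ONNX vision encoder."""
    img = img.convert("RGB").resize((size, size), Image.BICUBIC)  # PIL, not cv2
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = (arr - 0.5) / 0.5                  # map [0, 1] to [-1, 1]
    return arr.transpose(2, 0, 1)[None]      # HWC -> NCHW, batch of 1
```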
Concurrency
The router handles ~32 RPS on L40S with original instance counts (1 GPU, 2 CPU). Increasing instances caused GPU contention and degraded performance. See Triton Concurrency Tuning for full analysis.
Recommendation
Primary router: SigLIP2 ViT-B-32-256 (google/siglip2-base-patch32-256)
| Metric | Value |
|---|---|
| Accuracy | 93.8% (30/32) |
| False positives (AR->EN) | 0 |
| False negatives (EN->AR) | 2 (safe fallback) |
| Inference time | ~21ms/page (MPS), ~55ms end-to-end via Triton on L40S |
| VRAM | ~350 MB |
| Model size | 86M params |
| Threshold | margin > 0.02 -> English |
| Prompts | ["English text", "Arabic text"] |
Proposed routing logic:
page_image -> SigLIP2 encode_image -> cosine similarity with pre-computed text embeddings
if (en_logit - ar_logit) > 0.02:
route to LightOnOCR-2 (pure English, fast)
else:
route to DotsOCR v1.5 (Arabic/mixed/unknown, safe)
Optional optimization for PDF inputs:
if pdf has text layer:
extract text -> count Arabic vs Latin Unicode chars
if >95% Latin: route to LightOnOCR-2
else: route to DotsOCR v1.5
else:
fall through to SigLIP2 visual classification
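The two-stage logic above as a hedged Python sketch. The SigLIP2 margin is passed in rather than computed, and the function and route names are ours, not the service's:

```python
from typing import Optional

def route_page(text_layer: Optional[str], siglip_margin: float,
               threshold: float = 0.02) -> str:
    """Hybrid router: cheap Unicode check first, visual classifier as fallback.

    text_layer is the extracted PDF text layer (None or empty for scans
    and images); siglip_margin is en_logit - ar_logit from SigLIP2.
    """
    if text_layer and text_layer.strip():
        latin = sum(1 for c in text_layer if c.isascii() and c.isalpha())
        arabic = sum(1 for c in text_layer if 0x0600 <= ord(c) <= 0x06FF)
        total = latin + arabic
        if total:
            return "lightonocr" if latin / total > 0.95 else "dotsocr"
    # No usable text layer: fall through to SigLIP2 visual classification.
    return "lightonocr" if siglip_margin > threshold else "dotsocr"
```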
LightOnOCR-2 (~2 GB), DotsOCR v1.5 (~14 GB), and SigLIP2 (~350 MB) together fit on a single L40S with ~30 GB to spare.
How to Reproduce
PyTorch Benchmark
pip install torch open_clip_torch transformers pymupdf pillow
# Run the benchmark (default: SigLIP2 B32 on MPS)
python docs/research/benchmark_router/benchmark_router.py
# Run on CUDA
python docs/research/benchmark_router/benchmark_router.py --device cuda --runs 10
ONNX Export & Triton Deployment
The router model is automatically exported and baked into the Triton Docker image via a multi-stage build. No manual setup needed:
# Build Triton image (exports SigLIP2 to ONNX in a builder stage)
docker compose build triton
# Start Triton
docker compose up -d triton
# Validate against the 32-page test corpus
python scripts/validate_router_triton.py --runs 10
For standalone export without Docker (e.g., local development):
pip install torch open_clip_torch transformers onnx onnxscript onnxruntime
python scripts/export_router_onnx.py
Test PDFs must be in data/pdfs/ relative to the repo root.