Document Language Router Benchmark Report
Date: 2026-03-18
GPU: Apple M2 (MPS), 8 GB unified memory
Test: 32 pages across 5 PDFs — English tables, Arabic text, mixed bilingual, blank/decorative, photos
Goal: Classify page images as "pure English" vs "everything else" for OCR model routing
Context
The OCR pipeline uses two models on a single NVIDIA L40S (46 GB VRAM):
- LightOnOCR-2 (1B): Best English OCR (92.8/100, 3.0s/page, ~2 GB) — English only
- DotsOCR v1.5 (14B): Best Arabic OCR (6/6 Arabic numerals, ~14 GB) — handles English too (84.4/100)
A fast router is needed to classify each page image and send it to the right model:
- Pure English -> LightOnOCR-2 (faster, higher quality on English)
- Everything else (Arabic, mixed, blank, unknown) -> DotsOCR v1.5 (safe default)
Key constraint: false positives (Arabic page sent to English-only model) are dangerous — LightOnOCR-2 garbles Arabic. False negatives (English page sent to DotsOCR) are merely suboptimal.
Approach
Models Tested
Zero-shot CLIP-family models that classify images against text prompts, requiring no training data:
| Model | Params | Release | Pretrained on |
|---|---|---|---|
| SigLIP2 ViT-B-32-256 | 86M | Feb 2025 | WebLI (Google's multilingual web data) |
| SigLIP2 ViT-B-16-256 | 86M | Feb 2025 | WebLI |
| MobileCLIP2-S0 | 11M+63M | Aug 2025 | DataComp (natural images) |
| MobileCLIP-S1 | 21M+63M | CVPR 2024 | DataComp |
Key Finding: Binary Prompts Required
Initial tests used 3 prompts: ["English text", "Arabic text", "mixed text"]. Every model assigned roughly 33% English / 67% Arabic-or-mixed probability to every page — no discrimination at all. The 3-prompt softmax creates a structural bias: "Arabic" and "mixed" are near-synonymous for these page images, so their combined probability always sums to about 2/3.
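The bias is easy to demonstrate numerically. A minimal sketch (the logit values are illustrative, not taken from the benchmark):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# A page the model cannot discriminate: all three prompts get
# about the same raw similarity.
logits = np.array([0.09, 0.09, 0.09])   # ["English", "Arabic", "mixed"]
p_en, p_ar, p_mixed = softmax(logits)
# p_en is ~1/3 while p_ar + p_mixed is ~2/3: the non-English side
# wins by construction, regardless of page content.
```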
Binary prompts ["English text", "Arabic text"] fixed this completely. The raw cosine similarity (logit) margin between the two prompts clearly separates English and Arabic pages.
Confidence Threshold
Instead of a simple argmax, a margin threshold is used:
margin = en_logit - ar_logit
margin > 0.02  -> English (LightOnOCR-2)
margin <= 0.02 -> DotsOCR (safe default)
This eliminates all false positives by sending borderline cases to DotsOCR.
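A minimal sketch of the routing rule (function and route names are illustrative, not from the codebase):

```python
THRESHOLD = 0.02

def route(en_logit: float, ar_logit: float, threshold: float = THRESHOLD) -> str:
    """Return which OCR model should handle the page.

    Anything at or below the threshold (blank, mixed, photo-heavy pages)
    falls back to DotsOCR, because a misrouted Arabic page (garbled output)
    costs far more than a slowly-processed English one.
    """
    margin = en_logit - ar_logit
    return "lightonocr" if margin > threshold else "dotsocr"
```

For example, `route(0.0969, 0.0430)` (the kfd p3 logits) routes to LightOnOCR-2, while `route(0.0769, 0.0736)` (oman p10, margin +0.003) takes the safe fallback.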
Threshold Analysis
| Threshold | Accuracy | AR->EN (dangerous) | EN->AR (suboptimal) |
|---|---|---|---|
| 0.000 | 87.5% | 4 | 0 |
| 0.010 | 87.5% | 2 | 2 |
| 0.020 | 93.8% | 0 | 2 |
| 0.025 | 93.8% | 0 | 2 |
| 0.030 | 93.8% | 0 | 2 |
| 0.035 | 78.1% | 0 | 7 |
Test Corpus
32 pages across 5 PDFs, ground truth verified by visual inspection of each page:
| File | Pages | Content |
|---|---|---|
| adobe-6-page.pdf | p1-p6 (6 EN) | Pure English — cloud vendor tables, compliance diagrams |
| oman-2040-en.pdf | p1-6, p10, p16, p21, p31, p46 (4 EN, 7 AR) | Mixed — EN content, AR cover/Bismillah, blank decorative, photos, mixed EN+AR |
| kfd.pdf | p1-p6 (3 EN, 3 AR) | Bilingual — p1-3 English insurance tables, p4-6 Arabic translation |
| ar-novel.pdf | p1, p2, p5, p10 (4 AR) | Arabic academic paper — mixed title, Arabic body |
| hasini-asma.pdf | p1-3, p5, p14 (5 AR) | Arabic thesis — cover, blank, decorative, body text |
Total: 13 English, 19 Arabic/other
Results
Model Comparison (binary prompts, threshold=0.02)
| Model | Accuracy | EN Acc | AR Acc | FP (AR->EN) | FN (EN->AR) | Avg Time |
|---|---|---|---|---|---|---|
| SigLIP2 ViT-B-32-256 | 93.8% | 84.6% | 100% | 0 | 2 | 21ms |
| SigLIP2 ViT-B-16-256 | 93.8% | 84.6% | 100% | 0 | 2 | 32ms |
| MobileCLIP2-S0 | 53.3% | — | — | — | — | 33ms |
| MobileCLIP-S1 | 33.3% | — | — | — | — | 34ms |
MobileCLIP models failed completely — they predict the same class for every page regardless of content. Trained on natural image-caption pairs (DataComp), they never learned to recognize document scripts. SigLIP2's WebLI training data (multilingual web content) is the differentiator.
SigLIP2 ViT-B-32-256 Per-Page Breakdown (threshold=0.02)
| Page | Truth | Pred | EN logit | AR logit | Margin | Time | Note |
|---|---|---|---|---|---|---|---|
| adobe p1 | en | en | 0.0863 | 0.0558 | +0.031 | 116ms | warmup |
| adobe p2 | en | en | 0.0984 | 0.0597 | +0.039 | 18ms | |
| adobe p3 | en | en | 0.0863 | 0.0530 | +0.033 | 14ms | |
| adobe p4 | en | en | 0.1016 | 0.0656 | +0.036 | 19ms | |
| adobe p5 | en | en | 0.0807 | 0.0465 | +0.034 | 25ms | |
| adobe p6 | en | en | 0.0969 | 0.0668 | +0.030 | 22ms | |
| oman p1 | ar | ar | 0.1093 | 0.0926 | +0.017 | 23ms | mixed cover, below threshold |
| oman p2 | ar | ar | 0.0702 | 0.0718 | -0.002 | 15ms | blank decorative |
| oman p3 | ar | ar | 0.0693 | 0.0662 | +0.003 | 22ms | blank decorative, below threshold |
| oman p4 | ar | ar | 0.0769 | 0.0740 | +0.003 | 21ms | blank decorative, below threshold |
| oman p5 | ar | ar | 0.0854 | 0.1408 | -0.055 | 13ms | Bismillah calligraphy |
| oman p6 | ar | ar | 0.0566 | 0.0858 | -0.029 | 16ms | photo (Sultan) |
| oman p10 | en | ar | 0.0769 | 0.0736 | +0.003 | 25ms | EN infographic, sparse text |
| oman p16 | ar | ar | 0.0912 | 0.0723 | +0.019 | 16ms | mixed EN+AR, below threshold |
| oman p21 | en | en | 0.0783 | 0.0407 | +0.038 | 24ms | EN table |
| oman p31 | en | en | 0.0758 | 0.0438 | +0.032 | 17ms | EN table |
| oman p46 | en | en | 0.0911 | 0.0525 | +0.039 | 24ms | EN text |
| kfd p1 | en | ar | 0.0455 | 0.0429 | +0.003 | 21ms | EN table with large car photo |
| kfd p2 | en | en | 0.0832 | 0.0432 | +0.040 | 11ms | EN table, dense text |
| kfd p3 | en | en | 0.0969 | 0.0430 | +0.054 | 14ms | EN text page |
| kfd p4 | ar | ar | 0.0760 | 0.1046 | -0.029 | 17ms | AR table |
| kfd p5 | ar | ar | 0.0883 | 0.1067 | -0.018 | 17ms | AR text |
| kfd p6 | ar | ar | 0.0996 | 0.1208 | -0.021 | 11ms | AR text |
| ar-novel p1 | ar | ar | 0.0947 | 0.1156 | -0.021 | 13ms | mixed title page |
| ar-novel p2 | ar | ar | 0.0989 | 0.1092 | -0.010 | 21ms | mixed abstract |
| ar-novel p5 | ar | ar | 0.0996 | 0.1228 | -0.023 | 22ms | AR body text |
| ar-novel p10 | ar | ar | 0.1025 | 0.1201 | -0.018 | 25ms | AR body text |
| hasini p1 | ar | ar | 0.0960 | 0.1176 | -0.022 | 15ms | AR cover page |
| hasini p2 | ar | ar | 0.0832 | 0.0886 | -0.006 | 16ms | blank page |
| hasini p3 | ar | ar | 0.0948 | 0.1163 | -0.022 | 20ms | AR cover page |
| hasini p5 | ar | ar | 0.0888 | 0.1347 | -0.046 | 21ms | AR decorative + calligraphy |
| hasini p14 | ar | ar | 0.1031 | 0.1351 | -0.032 | 14ms | AR body text |
The 2 False Negatives (English -> DotsOCR)
Both are English pages with weak text signal (margin +0.003, below 0.02 threshold):
- oman p10: infographic with sparse text, mostly visual chart
- kfd p1: table with large car photo occupying most of the page
Impact is minimal: DotsOCR handles English at 84.4/100 (vs LightOnOCR-2's 92.8/100).
Margin Distribution
Clear separation between English and Arabic text pages:
- English text pages: +0.030 to +0.054 (well above 0.02 threshold)
- Arabic text pages: -0.006 to -0.055 (well below threshold)
- Mixed/blank pages: -0.002 to +0.019 (correctly below threshold)
- Weak English: +0.003 (sparse text / photo-heavy — below threshold, safe fallback)
Also Tested: PyMuPDF Text Extraction
For PDF inputs (not scanned images), PyMuPDF can extract the embedded text layer and detect language via Unicode character ranges:
text = page.get_text()
arabic = sum(1 for c in text if '\u0600' <= c <= '\u06FF' or ...)
latin = sum(1 for c in text if 'A' <= c <= 'Z' or 'a' <= c <= 'z')
| Metric | PyMuPDF + Unicode | SigLIP2 B32 |
|---|---|---|
| Accuracy | ~95% (on PDFs with text layer) | 93.8% (on any image) |
| Speed | <1ms/page | 21ms/page |
| VRAM | 0 | ~350 MB |
| Works on scans | No (returns empty) | Yes |
| Works on images | No (needs PDF) | Yes |
PyMuPDF is faster and more accurate when a text layer exists, but fails on scanned documents. It can serve as a first-pass optimization, falling back to SigLIP2 when no usable text layer is found.
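The text-layer heuristic can be sketched as a pure function over the string returned by `page.get_text()`. The helper name, the extra Arabic Unicode blocks, and the 95% Latin cutoff (matching the routing logic proposed later in this report) are our assumptions:

```python
# Hypothetical helper; pass it the string from PyMuPDF's page.get_text().
ARABIC_RANGES = (
    (0x0600, 0x06FF),   # Arabic
    (0x0750, 0x077F),   # Arabic Supplement
    (0xFB50, 0xFDFF),   # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),   # Arabic Presentation Forms-B
)

def classify_text_layer(text: str, latin_cutoff: float = 0.95) -> str:
    """Return "en", "other", or "unknown" (no usable text layer)."""
    if not text.strip():
        return "unknown"   # scanned page: fall through to SigLIP2
    arabic = sum(1 for c in text
                 if any(lo <= ord(c) <= hi for lo, hi in ARABIC_RANGES))
    latin = sum(1 for c in text if c.isascii() and c.isalpha())
    total = arabic + latin
    if total == 0:
        return "unknown"   # digits/punctuation only: no script signal
    return "en" if latin / total > latin_cutoff else "other"
```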
Triton Deployment Validation
The SigLIP2 vision encoder was exported to ONNX and deployed inside Triton Inference Server with a BLS (Business Logic Scripting) pipeline. Full end-to-end validation on the L40S GPU confirmed identical results to the PyTorch benchmark.
ONNX Export
- Vision encoder only (text embeddings pre-computed as a static .npy file)
- ONNX opset 17, native IR version (no downgrade needed on Triton 26.02+)
- Model size: 1.4 MB ONNX + 361 MB weights
- Validation: 0.000000 max difference in features and logits vs PyTorch
Triton Architecture
triton-server (GPU)
└── router_onnx (ONNX Runtime, GPU) — vision encoder inference
TritonClient (direct gRPC)
└── classify(data_url) → gRPC → router_pipeline (BLS, CPU)
└── image decode, PIL BICUBIC resize, normalize, call router_onnx,
cosine similarity with text embeddings, margin threshold → route decision
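The cosine-similarity step of the BLS pipeline can be sketched in NumPy. Array names and shapes here are illustrative, not the actual BLS variable names:

```python
import numpy as np

def margin_from_features(img_feat: np.ndarray, txt_emb: np.ndarray) -> float:
    """en - ar logit margin for one image feature vector.

    txt_emb holds the two pre-computed text embeddings (loaded once from
    the static .npy file): row 0 = "English text", row 1 = "Arabic text".
    """
    img = img_feat / np.linalg.norm(img_feat)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    en_logit, ar_logit = txt @ img          # cosine similarities
    return float(en_logit - ar_logit)
```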
Triton Results (L40S, 10 runs per page)
| Metric | PyTorch (M2 MPS) | Triton (L40S GPU) |
|---|---|---|
| Accuracy | 93.8% (30/32) | 93.8% (30/32) |
| English accuracy | 84.6% (11/13) | 84.6% (11/13) |
| Arabic accuracy | 100% (19/19) | 100% (19/19) |
| FP (AR→EN, dangerous) | 0 | 0 |
| FN (EN→AR, suboptimal) | 2 | 2 |
| Avg time | 21ms (model only) | 55ms (end-to-end HTTP) |
All 32 page logits match the PyTorch benchmark to 4 decimal places. The same 2 false negatives (kfd p1, oman p10 — sparse text with large photos) fall below the 0.02 threshold.
The 55ms end-to-end time includes HTTP request parsing, base64 decode, PIL preprocessing, gRPC to Triton, GPU inference, cosine similarity, and JSON response serialization.
Key Preprocessing Detail
PIL BICUBIC resize must be used (not OpenCV INTER_CUBIC) to match open_clip's preprocessing pipeline. Using OpenCV produces slightly different pixel values that shift margins enough to affect threshold-sensitive pages.
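A sketch of the matching preprocessing, assuming the SigLIP convention of mean = std = 0.5 (verify against the transform that open_clip's `create_model_and_transforms` actually returns):

```python
import numpy as np
from PIL import Image

def preprocess_page(img: Image.Image, size: int = 256) -> np.ndarray:
    """Resize and normalize a page image for the ONNX vision encoder."""
    img = img.convert("RGB").resize((size, size), Image.BICUBIC)  # PIL, not cv2
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = (arr - 0.5) / 0.5                  # map [0, 1] to [-1, 1]
    return arr.transpose(2, 0, 1)[None]      # HWC -> NCHW, batch of 1
```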
Concurrency
The router handles ~32 RPS on L40S with original instance counts (1 GPU, 2 CPU). Increasing instances caused GPU contention and degraded performance. See Triton Concurrency Tuning for full analysis.
Recommendation
Primary router: SigLIP2 ViT-B-32-256 (google/siglip2-base-patch32-256)
| Metric | Value |
|---|---|
| Accuracy | 93.8% (30/32) |
| False positives (AR->EN) | 0 |
| False negatives (EN->AR) | 2 (safe fallback) |
| Inference time | ~21ms/page (MPS), ~55ms end-to-end via Triton on L40S |
| VRAM | ~350 MB |
| Model size | 86M params |
| Threshold | margin > 0.02 -> English |
| Prompts | ["English text", "Arabic text"] |
Proposed routing logic:
page_image -> SigLIP2 encode_image -> cosine similarity with pre-computed text embeddings
if (en_logit - ar_logit) > 0.02:
route to LightOnOCR-2 (pure English, fast)
else:
route to DotsOCR v1.5 (Arabic/mixed/unknown, safe)
Optional optimization for PDF inputs:
if pdf has text layer:
extract text -> count Arabic vs Latin Unicode chars
if >95% Latin: route to LightOnOCR-2
else: route to DotsOCR v1.5
else:
fall through to SigLIP2 visual classification
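The two-stage logic above as a hedged Python sketch. The SigLIP2 margin is passed in rather than computed, and the function and route names are ours, not the service's:

```python
from typing import Optional

def route_page(text_layer: Optional[str], siglip_margin: float,
               threshold: float = 0.02) -> str:
    """Hybrid router: cheap Unicode check first, visual classifier as fallback.

    text_layer is the extracted PDF text layer (None or empty for scans
    and images); siglip_margin is en_logit - ar_logit from SigLIP2.
    """
    if text_layer and text_layer.strip():
        latin = sum(1 for c in text_layer if c.isascii() and c.isalpha())
        arabic = sum(1 for c in text_layer if 0x0600 <= ord(c) <= 0x06FF)
        total = latin + arabic
        if total:
            return "lightonocr" if latin / total > 0.95 else "dotsocr"
    # No usable text layer: fall through to SigLIP2 visual classification.
    return "lightonocr" if siglip_margin > threshold else "dotsocr"
```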
LightOnOCR-2 (~2 GB), DotsOCR v1.5 (~14 GB), and SigLIP2 (~350 MB) together fit on a single L40S with ~30 GB to spare.
How to Reproduce
PyTorch Benchmark
pip install torch open_clip_torch transformers pymupdf pillow
# Run the benchmark (default: SigLIP2 B32 on MPS)
python docs/research/benchmark_router/benchmark_router.py
# Run on CUDA
python docs/research/benchmark_router/benchmark_router.py --device cuda --runs 10
ONNX Export & Triton Deployment
The router model is automatically exported and baked into the Triton Docker image via a multi-stage build. No manual setup needed:
# Build Triton image (exports SigLIP2 to ONNX in a builder stage)
docker compose build triton
# Start Triton
docker compose up -d triton
# Validate against the 32-page test corpus
python scripts/validate_router_triton.py --runs 10
For standalone export without Docker (e.g., local development):
pip install torch open_clip_torch transformers onnx onnxscript onnxruntime
python scripts/export_router_onnx.py
Test PDFs must be in data/pdfs/ relative to the repo root.