siraaj-dot-ocr-service / docs/research/benchmark_router/pymupdf-text-layer-routing.md
PyMuPDF Text-Layer Routing: Research Findings
Date: 2026-04-05
GPU: NVIDIA L40S (EC2 instance)
Test: 78 pages (OIA English report) + original 32-page benchmark corpus + additional PDFs (411 total pages)
Goal: Fix the 65% misrouting of English pages to DOTS in OIA_Annual_report_English_2024.pdf
Problem
OIA_Annual_report_English_2024.pdf is a 78-page, 100% English digital PDF. The SigLIP2 visual router (threshold=0.02) misclassifies ~65% of its pages as Arabic, sending them to DotsOCR instead of LightOnOCR-2.
Router log analysis (17 sampled pages):
| Route | Count | Margins |
|---|---|---|
| Correctly routed to LIGHTONOCR | 6/17 (35%) | +0.027 to +0.037 |
| Misrouted to DOTS | 11/17 (65%) | -0.021 to +0.017 |
Pages with sparse text (cover pages, section dividers) and pages with charts/infographics consistently fall below the 0.02 threshold despite being English.
Root Cause
The SigLIP2 text embeddings for ["English text", "Arabic text"] have cosine similarity of 0.887 — too close together. This produces tiny classification margins where English and Arabic pages overlap:
- Arabic pages (ar-novel.pdf): margins from -0.023 to -0.010
- English pages misclassified (OIA): margins from -0.021 to +0.017
- English pages correctly classified (OIA): margins from +0.027 to +0.037
The overlap zone (-0.021 to -0.010) means no single threshold can perfectly separate the two.
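To make the failure mode concrete, here is a minimal sketch of the router's decision rule. The names (`cosine`, `route`) and the toy embeddings are illustrative, not the actual SigLIP2 integration: the page embedding is compared against both text-prompt embeddings, and the difference of the two similarities is the margin.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(page_emb, en_emb, ar_emb, threshold=0.02):
    """Return (model, margin): LIGHTONOCR if the English-vs-Arabic
    similarity margin clears the threshold, DOTS otherwise."""
    margin = cosine(page_emb, en_emb) - cosine(page_emb, ar_emb)
    return ("LIGHTONOCR" if margin > threshold else "DOTS"), margin
```

Because the two prompt embeddings are nearly parallel (similarity 0.887), real margins cluster near zero: a sparse English cover page landing at +0.017 falls below the 0.02 threshold and is sent to DOTS, which is exactly the failure in the table above.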
Approaches Investigated
Approach 1: Better SigLIP2 Text Prompts
Tested 10 prompt pairs including the SigLIP2 official template ("this is a photo of {label}.") on 27 English + 6 Arabic pages:
| Prompt Pair | Separation | Optimal Threshold | Notes |
|---|---|---|---|
| "A scanned document page in {lang}" | +0.0098 | -0.028 | Best separation |
| "this is a photo of a document in {lang}." | +0.0093 | -0.027 | SigLIP2 official template variant |
| "A photo of a page with {lang} text" | +0.0076 | -0.023 | |
| "this is a photo of english/arabic text." | -0.0010 | — | Official template, but overlaps |
| "English text" / "Arabic text" (current) | -0.0106 | — | Current prompts, overlaps |
The best prompt pair achieves perfect separation with a gap of +0.0098, but the gap is narrow (lowest English margin -0.023 vs. highest Arabic margin -0.033). Verified end-to-end:
- OIA English PDF: 17/17 → LIGHTONOCR
- ar-novel.pdf Arabic: 5/5 → DOTS
- kfd.pdf bilingual: EN pages → LIGHTONOCR, AR page → DOTS
Risk: Narrow gap (+0.0098) leaves little margin of safety for unseen documents. Also requires negative threshold (-0.028), which inverts the current safety bias — borderline pages would default to English instead of Arabic.
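The "Separation" and "Optimal Threshold" columns can be reproduced from per-page margins. A sketch (the helper names are mine; the margin values are the two boundary figures quoted above):

```python
def separation(english_margins, arabic_margins):
    """Gap between the lowest English margin and the highest Arabic margin.
    Positive means some threshold perfectly separates the two classes."""
    return min(english_margins) - max(arabic_margins)

def optimal_threshold(english_margins, arabic_margins):
    """One natural choice: the midpoint of the separating gap."""
    return (min(english_margins) + max(arabic_margins)) / 2

gap = separation([-0.023], [-0.033])            # +0.010, matching ~+0.0098 reported
thresh = optimal_threshold([-0.023], [-0.033])  # -0.028, matching the table
```

The negative midpoint is why this approach forces a negative threshold, inverting the current "default to Arabic" safety bias.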
Approach 2: PyMuPDF Text-Layer Extraction (Recommended)
The original benchmark report already proposed PyMuPDF text extraction as a first-pass optimization. This approach extracts the embedded text layer from PDFs and classifies by Unicode character ranges:
```
text = page.get_text()
latin  = count of characters in [A-Za-z]
arabic = count of characters in [\u0600-\u06FF]
if latin / (latin + arabic) > 0.95   → LIGHTONOCR
elif arabic / (latin + arabic) > 0.50 → DOTS
else                                  → fall through to SigLIP2 visual router
```
PyMuPDF Results
OIA_Annual_report_English_2024.pdf (78 pages)
76/78 pages → LIGHTONOCR via text-layer, 2 → SigLIP2 fallback (97.4% direct)
Most pages have Latin text layers with 97-100% Latin characters. Two bilingual pages with Arabic headers fall through to SigLIP2 (page 12: 543 Latin, 52 Arabic = 91.3% Latin; page 62: 313 Latin, 23 Arabic = 93.2% Latin). Both are below the 95% threshold because they contain Arabic Presentation Forms in their headers. This is correct behavior — bilingual pages are deferred to the visual classifier.
Original 32-Page Benchmark Corpus
PyMuPDF resolves both known SigLIP2 false negatives:
| Page | SigLIP2 Result | SigLIP2 Margin | PyMuPDF Result | Text Layer |
|---|---|---|---|---|
| oman p10 | AR (wrong) | +0.003 | EN (correct) | 469 Latin, 0 Arabic = 100% |
| kfd p1 | AR (wrong) | +0.003 | EN (correct) | 649 Latin, 0 Arabic = 100% |
Benchmark accuracy: 93.8% (SigLIP2 only) → 100% (PyMuPDF + SigLIP2 fallback)
Full Test Corpus (411 pages across 9 PDFs)
| PDF | Pages | LIGHTONOCR | DOTS | SigLIP2 Fallback |
|---|---|---|---|---|
| OIA_Annual_report_English_2024.pdf | 78 | 76 | 0 | 2 |
| adobe-6-page.pdf | 6 | 6 | 0 | 0 |
| oman-2040-en.pdf | 52 | 41 | 1 | 10 |
| oman-2040-en-min.pdf | 17 | 15 | 0 | 2 |
| ar-novel.pdf | 28 | 0 | 27 | 1 |
| hasini-asma.pdf | 68 | 4 | 59 | 5 |
| kfd.pdf | 6 | 3 | 3 | 0 |
| ien.pdf | 150 | 17 | 129 | 4 |
| tickets.pdf | 6 | 0 | 0 | 6 |
| Total | 411 | 162 | 219 | 30 |
The 30 SigLIP2 fallback pages break down as:
- 6 scanned ticket images (no text layer) → SigLIP2 routes as before
- 12 blank/decorative pages (no text at all) → SigLIP2 routes to DOTS (safe)
- 12 mixed-language pages (Arabic headers + English body, or vice versa) → SigLIP2 classifies visually
Zero false positives (no Arabic page incorrectly sent to LIGHTONOCR).
Manually verified edge cases:
- hasini-asma.pdf English pages → LIGHTONOCR: genuinely English sections (abstract, references) of an Arabic thesis
- ar-novel.pdf p1 → SigLIP2 fallback: mixed Arabic/English title page (36.8% Arabic, below 50% threshold)
Proposed Routing Logic
```python
# In activities/pdf.py process_page_activity, BEFORE classify_language():

def classify_by_text_layer(page_text: str) -> LanguageRoute | None:
    """Classify language from PDF text layer (Unicode character ranges).

    Returns LanguageRoute.ENGLISH or LanguageRoute.GENERAL if confident,
    or None to fall through to SigLIP2 visual classification.
    """
    latin = sum(1 for c in page_text if 'A' <= c <= 'Z' or 'a' <= c <= 'z')
    arabic = sum(1 for c in page_text if '\u0600' <= c <= '\u06FF')
    total = latin + arabic
    if total == 0:
        return None  # No alphabetic text → fall through to SigLIP2
    if latin / total > 0.95:
        return LanguageRoute.ENGLISH
    if arabic / total > 0.50:
        return LanguageRoute.GENERAL
    return None  # Mixed content → fall through to SigLIP2
```
Implementation Plan
Files to Change
- `ocr_workflow/core/routing.py`: Add `classify_by_text_layer()` function
- `ocr_workflow/activities/pdf.py`: Extract text from the page during `extract_pages_activity` (PyMuPDF already has the page open), pass it through to `process_page_activity`, and call `classify_by_text_layer()` before `classify_language()`
What Does NOT Change
- SigLIP2 router remains as fallback (unchanged threshold, unchanged prompts)
- Fallback chain (LIGHTONOCR → DOTS → QWEN → PADDLE) unchanged
- Single-image OCR activity (`run_ocr_activity`) unchanged (no text layer for images)
- All existing tests remain valid
Key Design Decisions
- 95% Latin threshold (not 100%): accounts for occasional Arabic characters in English documents (e.g., OIA page 12 has 14 Arabic chars in 557 alphabetic)
- 50% Arabic threshold (not lower): avoids false positives on mixed pages
- None return for ambiguous cases: preserves the SigLIP2 safety net rather than guessing
Comparison of Approaches
| Metric | SigLIP2 Only (Current) | Better Prompts | PyMuPDF + SigLIP2 |
|---|---|---|---|
| Benchmark accuracy | 93.8% (30/32) | 100% (32/32) | 100% (32/32) |
| OIA accuracy | ~35% (6/17 sampled) | 100% (17/17) | 97.4% text-layer (76/78) + 2 SigLIP2 fallback |
| False positive risk | 0 | Low (narrow gap) | 0 |
| Latency per page | ~55ms | ~55ms | <1ms (SigLIP2 skipped entirely when the text layer decides) |
| Works on scans | Yes | Yes | Falls through to SigLIP2 |
| Code changes | None | Export script + threshold | routing.py + pdf.py |
| Risk | Known misrouting | Threshold sensitivity | None (additive layer) |
Recommendation: PyMuPDF text-layer routing as primary fix. Optionally, better SigLIP2 prompts as a supplementary improvement for scanned documents.
Implementation (2026-04-05)
Changes Made
`ocr_workflow/core/routing.py`:
- Added `classify_by_text_layer(page_text: str) -> LanguageRoute | None`
- Counts Latin (`A-Za-z`) and Arabic (U+0600-U+06FF, U+0750-U+077F, U+08A0-U+08FF, U+FB50-U+FDFF, U+FE70-U+FEFF) characters
- Returns ENGLISH if >95% Latin, GENERAL if >50% Arabic, None otherwise

`ocr_workflow/activities/pdf.py`:
- `extract_pages_activity` now extracts the text layer via `page.get_text()[:1000]` during page rendering
- Returns `List[Dict[str, str]]` with `{"key": ..., "page_text": ...}` instead of `List[str]`
- `process_page_activity` accepts an optional `page_text` parameter
- Calls `classify_by_text_layer()` first; falls through to SigLIP2 if None

`ocr_workflow/workflows/pdf.py`:
- Updated to handle the new dict format from `extract_pages_activity`
- Passes `page_text` through to `process_page_activity`

`tests/test_routing.py`:
- 11 new unit tests for `classify_by_text_layer()` covering: pure English, pure Arabic, mixed, empty, whitespace-only, numbers-only, Arabic Presentation Forms, borderline thresholds
Bugs Found and Fixed
- Arabic Presentation Forms (initial): PDF generators often use U+FB50-U+FDFF and U+FE70-U+FEFF instead of basic Arabic U+0600-U+06FF. The initial implementation missed these ranges, causing ar-novel.pdf pages to be misclassified. Fixed by including all five Arabic Unicode blocks.
- Private Use Area pollution (2026-04-06): The initial fix used `unicodedata.bidirectional(c) == "L"` to count Latin characters, but PDF generators also map Arabic glyphs to Private Use Area codepoints (U+F000-F8FF), which have default bidi class "L". This inflated the "Latin" count by ~10x on Arabic PDFs (e.g., ar-novel.pdf page 2: 48 actual Latin vs 532 bidi "L"). Result: all 28 ar-novel.pdf pages fell through to SigLIP2 instead of 27 being classified as GENERAL. Fixed by using explicit character ranges (`A-Za-z` for Latin, five Arabic Unicode blocks) instead of bidi categories.
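Both bugs are easy to reproduce with the standard-library `unicodedata` module; the specific characters below are illustrative examples, not taken from the test PDFs:

```python
import unicodedata

# Bug 1: Arabic Presentation Forms fall outside the basic Arabic block,
# so a single-range check misses them even though they are Arabic letters.
alef_wasla = "\uFB50"  # ARABIC LETTER ALEF WASLA ISOLATED FORM
in_basic_block = "\u0600" <= alef_wasla <= "\u06FF"        # False
bidi_class = unicodedata.bidirectional(alef_wasla)         # "AL" (Arabic Letter)

# Bug 2: Private Use Area codepoints default to bidi class "L",
# so bidi-based "Latin" counting miscounts PUA-mapped Arabic glyphs.
pua_glyph = "\uE000"
pua_bidi = unicodedata.bidirectional(pua_glyph)            # "L"
is_ascii_letter = ("A" <= pua_glyph <= "Z"
                   or "a" <= pua_glyph <= "z")             # False
```

The explicit-range fix works precisely because it rejects PUA glyphs (not in `A-Za-z`) while accepting Presentation Forms (included in the five Arabic blocks).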
Integration Test Results
Logs show `source=text_layer` or `source=siglip2` for each routing decision:
- OIA English report (17 sampled pages): 17/17 → `language=english model=LIGHTONOCR source=text_layer`
- ar-novel.pdf (5 pages): 4/5 → `language=general model=DOTS source=text_layer`; 1/5 → `language=general model=DOTS source=siglip2` (the mixed title page fell through to SigLIP2 and was handled correctly)
Routing Fallback Chain (PDF pages)
1. PyMuPDF text-layer (new, <1ms):
   - >95% Latin alphabetic chars → ENGLISH → LightOnOCR-2
   - >50% Arabic alphabetic chars → GENERAL → DotsOCR
   - Otherwise → fall through
2. SigLIP2 visual router (existing, ~55ms):
   - margin > 0.02 → ENGLISH → LightOnOCR-2
   - margin ≤ 0.02 → GENERAL → DotsOCR
3. SigLIP2 failure (existing):
   - → GENERAL → DotsOCR (safe default)
Single-image OCR: SigLIP2 only (unchanged, no text layer available).
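The three-step chain above can be sketched as straight-line glue logic. This is a hypothetical simplification: `route_pdf_page` and the `siglip2_margin` parameter are illustrative names, and only the basic Arabic block is counted here for brevity (the real implementation counts five blocks):

```python
def route_pdf_page(page_text, siglip2_margin=None, threshold=0.02):
    """Return (model, source) following the fallback chain.
    siglip2_margin is None when the visual router itself fails (step 3)."""
    # Step 1: text layer (<1ms)
    latin = sum(1 for c in page_text if "A" <= c <= "Z" or "a" <= c <= "z")
    arabic = sum(1 for c in page_text if "\u0600" <= c <= "\u06FF")
    total = latin + arabic
    if total > 0:
        if latin / total > 0.95:
            return ("LIGHTONOCR", "text_layer")
        if arabic / total > 0.50:
            return ("DOTS", "text_layer")
    # Step 2: SigLIP2 visual router (~55ms)
    if siglip2_margin is not None:
        return ("LIGHTONOCR" if siglip2_margin > threshold else "DOTS", "siglip2")
    # Step 3: safe default on SigLIP2 failure
    return ("DOTS", "default")
```

Note the ordering guarantee: SigLIP2 only runs (and only costs its ~55ms) when the text layer abstains.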