
PyMuPDF Text-Layer Routing: Research Findings

Last updated: 4/16/2026


Date: 2026-04-05
GPU: NVIDIA L40S (EC2 instance)
Test: 78 pages (OIA English report) + original 32-page benchmark corpus + 3 additional PDFs (388 total pages)
Goal: Fix 65% misrouting of English pages to DOTS in OIA_Annual_report_English_2024.pdf

Problem

OIA_Annual_report_English_2024.pdf is a 78-page, 100% English digital PDF. The SigLIP2 visual router (threshold=0.02) misclassifies ~65% of its pages as Arabic, sending them to DotsOCR instead of LightOnOCR-2.

Router log analysis (17 sampled pages):

Route	Count	Margins
Correctly routed to LIGHTONOCR	6/17 (35%)	+0.027 to +0.037
Misrouted to DOTS	11/17 (65%)	-0.021 to +0.017

Pages with sparse text (cover pages, section dividers) and pages with charts/infographics consistently fall below the 0.02 threshold despite being English.

Root Cause

The SigLIP2 text embeddings for ["English text", "Arabic text"] have cosine similarity of 0.887 — too close together. This produces tiny classification margins where English and Arabic pages overlap:

Arabic pages (ar-novel.pdf): margins from -0.023 to -0.010
English pages misclassified (OIA): margins from -0.021 to +0.017
English pages correctly classified (OIA): margins from +0.027 to +0.037

The overlap zone (-0.021 to -0.010) means no single threshold can perfectly separate the two.
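The overlap can be checked directly from the three margin ranges listed above. The following is a minimal sketch; the helper name `interval_overlap` is illustrative and not part of the codebase.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]


def interval_overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the shared part of two closed intervals, or None if they are disjoint."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Margin ranges copied from the router log analysis above.
arabic_margins = (-0.023, -0.010)      # ar-novel.pdf
english_misrouted = (-0.021, 0.017)    # OIA pages sent to DOTS
english_correct = (0.027, 0.037)       # OIA pages sent to LIGHTONOCR

overlap = interval_overlap(arabic_margins, english_misrouted)
print(overlap)                                            # (-0.021, -0.01)
print(interval_overlap(arabic_margins, english_correct))  # None
```

Any threshold placed inside the overlapping range misroutes pages on at least one side, which is why tuning the threshold alone cannot fix the problem.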

Approaches Investigated

Approach 1: Better SigLIP2 Text Prompts

Tested 10 prompt pairs including the SigLIP2 official template ("this is a photo of {label}.") on 27 English + 6 Arabic pages:

Prompt Pair	Separation	Optimal Threshold	Notes
`"A scanned document page in {lang}"`	+0.0098	-0.028	Best separation
`"this is a photo of a document in {lang}."`	+0.0093	-0.027	SigLIP2 official template variant
`"A photo of a page with {lang} text"`	+0.0076	-0.023
`"this is a photo of english/arabic text."`	-0.0010	—	Official template, but overlaps
`"English text" / "Arabic text"` (current)	-0.0106	—	Current prompts, overlaps

The best prompt pair achieves perfect separation with gap +0.0098, but the gap is narrow (worst English margin -0.023, best Arabic margin -0.033). Verified end-to-end:

OIA English PDF: 17/17 → LIGHTONOCR
ar-novel.pdf Arabic: 5/5 → DOTS
kfd.pdf bilingual: EN pages → LIGHTONOCR, AR page → DOTS

Risk: Narrow gap (+0.0098) leaves little margin of safety for unseen documents. Also requires negative threshold (-0.028), which inverts the current safety bias — borderline pages would default to English instead of Arabic.
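As a worked check of the numbers for the best prompt pair, the sketch below assumes that "Separation" is the worst English margin minus the best Arabic margin, and that the optimal threshold is the midpoint between the two. Both definitions are assumptions that happen to agree with the table; the function name `separation` is illustrative.

```python
from typing import Sequence, Tuple


def separation(english: Sequence[float], arabic: Sequence[float]) -> Tuple[float, float]:
    """Return (gap, midpoint threshold) for two sets of margins.

    A positive gap means every English margin lies above every Arabic margin.
    """
    worst_english = min(english)
    best_arabic = max(arabic)
    gap = worst_english - best_arabic
    threshold = (worst_english + best_arabic) / 2
    return gap, threshold


# Extremes quoted above for "A scanned document page in {lang}".
gap, threshold = separation(english=[-0.023], arabic=[-0.033])
print(round(gap, 4), round(threshold, 4))  # 0.01 -0.028
```

The rounded extremes give a gap of 0.010 rather than the reported +0.0098, presumably because the report computed the gap from unrounded margins. Either way, one unseen English page with a margin about 0.005 below the worst observed one would already fall under the -0.028 threshold.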

Approach 2: PyMuPDF Text-Layer Extraction (Recommended)

The original benchmark report already proposed PyMuPDF text extraction as a first-pass optimization. This approach extracts the embedded text layer from PDFs and classifies by Unicode character ranges:

text = page.get_text()
latin = count of characters in [A-Za-z]
arabic = count of characters in [\u0600-\u06FF]

if latin + arabic == 0 → fall through to SigLIP2 visual router (no alphabetic text)
elif latin / (latin + arabic) > 0.95 → LIGHTONOCR
elif arabic / (latin + arabic) > 0.50 → DOTS
else → fall through to SigLIP2 visual router

PyMuPDF Results

OIA_Annual_report_English_2024.pdf (78 pages)

76/78 pages → LIGHTONOCR via text-layer, 2 → SigLIP2 fallback (97.4% direct)

Most pages have Latin text layers with 97-100% Latin characters. Two bilingual pages with Arabic headers fall through to SigLIP2 (page 12: 543 Latin, 52 Arabic = 91.3% Latin; page 62: 313 Latin, 23 Arabic = 93.2% Latin). Both are below the 95% threshold because they contain Arabic Presentation Forms in their headers. This is correct behavior — bilingual pages are deferred to the visual classifier.

Original 32-Page Benchmark Corpus

PyMuPDF resolves both known SigLIP2 false negatives:

Page	SigLIP2 Result	SigLIP2 Margin	PyMuPDF Result	Text Layer
oman p10	AR (wrong)	+0.003	EN (correct)	469 Latin, 0 Arabic = 100%
kfd p1	AR (wrong)	+0.003	EN (correct)	649 Latin, 0 Arabic = 100%

Benchmark accuracy: 93.8% (SigLIP2 only) → 100% (PyMuPDF + SigLIP2 fallback)

Full Test Corpus (411 pages across 9 PDFs)

PDF	Pages	LIGHTONOCR	DOTS	SigLIP2 Fallback
OIA_Annual_report_English_2024.pdf	78	76	0	2
adobe-6-page.pdf	6	6	0	0
oman-2040-en.pdf	52	41	1	10
oman-2040-en-min.pdf	17	15	0	2
ar-novel.pdf	28	0	27	1
hasini-asma.pdf	68	4	59	5
kfd.pdf	6	3	3	0
ien.pdf	150	17	129	4
tickets.pdf	6	0	0	6
Total	411	162	219	30

The 30 SigLIP2 fallback pages break down as:

6 scanned ticket images (no text layer) → SigLIP2 routes as before
12 blank/decorative pages (no text at all) → SigLIP2 routes to DOTS (safe)
12 mixed-language pages (Arabic headers + English body, or vice versa) → SigLIP2 classifies visually

Zero false positives (no Arabic page incorrectly sent to LIGHTONOCR).
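The table totals can be re-derived with a few lines of Python. The sketch below copies the per-PDF counts from the table above. The 92.7% share of pages that skip SigLIP2 is computed here and is not a figure reported elsewhere in this document.

```python
# (pages, LIGHTONOCR, DOTS, SigLIP2 fallback) per PDF, copied from the table above.
corpus = {
    "OIA_Annual_report_English_2024.pdf": (78, 76, 0, 2),
    "adobe-6-page.pdf": (6, 6, 0, 0),
    "oman-2040-en.pdf": (52, 41, 1, 10),
    "oman-2040-en-min.pdf": (17, 15, 0, 2),
    "ar-novel.pdf": (28, 0, 27, 1),
    "hasini-asma.pdf": (68, 4, 59, 5),
    "kfd.pdf": (6, 3, 3, 0),
    "ien.pdf": (150, 17, 129, 4),
    "tickets.pdf": (6, 0, 0, 6),
}

# Every row must account for all of its pages.
for name, (pages, lighton, dots, fallback) in corpus.items():
    assert lighton + dots + fallback == pages, name

total_pages = sum(row[0] for row in corpus.values())
total_fallback = sum(row[3] for row in corpus.values())
direct = total_pages - total_fallback  # pages decided by the text layer alone

print(total_pages, direct, total_fallback)  # 411 381 30
print(f"{direct / total_pages:.1%}")        # 92.7%
```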

Manually verified edge cases:

hasini-asma.pdf English pages → LIGHTONOCR: genuinely English sections (abstract, references) of an Arabic thesis
ar-novel.pdf p1 → SigLIP2 fallback: mixed Arabic/English title page (36.8% Arabic, below 50% threshold)

Proposed Routing Logic

# In activities/pdf.py process_page_activity, BEFORE classify_language():

def classify_by_text_layer(page_text: str) -> LanguageRoute | None:
    """Classify language from PDF text layer (Unicode character ranges).
    
    Returns LanguageRoute.ENGLISH or LanguageRoute.GENERAL if confident,
    or None to fall through to SigLIP2 visual classification.
    """
    latin = sum(1 for c in page_text if 'A' <= c <= 'Z' or 'a' <= c <= 'z')
    arabic = sum(1 for c in page_text if '\u0600' <= c <= '\u06FF')
    total = latin + arabic
    
    if total == 0:
        return None  # No alphabetic text → fall through to SigLIP2
    
    if latin / total > 0.95:
        return LanguageRoute.ENGLISH
    if arabic / total > 0.50:
        return LanguageRoute.GENERAL
    
    return None  # Mixed content → fall through to SigLIP2

Implementation Plan

Files to Change

ocr_workflow/core/routing.py: Add classify_by_text_layer() function
ocr_workflow/activities/pdf.py: Extract the text layer during extract_pages_activity (PyMuPDF already has the page open), pass it through to process_page_activity, and call classify_by_text_layer() there before classify_language()

What Does NOT Change

SigLIP2 router remains as fallback (unchanged threshold, unchanged prompts)
Fallback chain (LIGHTONOCR → DOTS → QWEN → PADDLE) unchanged
Single-image OCR activity (run_ocr_activity) unchanged (no text layer for images)
All existing tests remain valid

Key Design Decisions

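95% Latin threshold (not 100%): accounts for occasional Arabic characters in English documents (e.g., OIA page 12 has 14 Arabic chars in 557 alphabetic when only U+0600-U+06FF is counted, which appears to be the pre-fix count; with all five Arabic blocks the page has 52 Arabic in 595 and falls through to SigLIP2)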
50% Arabic threshold (not lower): avoids false positives on mixed pages
None return for ambiguous cases: preserves the SigLIP2 safety net rather than guessing

Comparison of Approaches

Metric	SigLIP2 Only (Current)	Better Prompts	PyMuPDF + SigLIP2
Benchmark accuracy	93.8% (30/32)	100% (32/32)	100% (32/32)
OIA accuracy	~35% (6/17 sampled)	100% (17/17)	97.4% text-layer (76/78) + 2 SigLIP2 fallback
False positive risk	0	Low (narrow gap)	0
Latency per page	~55ms	~55ms	<1ms (SigLIP2's ~55ms is skipped when the text layer decides)
Works on scans	Yes	Yes	Falls through to SigLIP2
Code changes	None	Export script + threshold	routing.py + pdf.py
Risk	Known misrouting	Threshold sensitivity	None (additive layer)

Recommendation: PyMuPDF text-layer routing as primary fix. Optionally, better SigLIP2 prompts as a supplementary improvement for scanned documents.

Implementation (2026-04-05)

Changes Made

ocr_workflow/core/routing.py:

Added classify_by_text_layer(page_text: str) -> LanguageRoute | None
Counts Latin (A-Za-z) and Arabic (U+0600-U+06FF, U+0750-U+077F, U+08A0-U+08FF, U+FB50-U+FDFF, U+FE70-U+FEFF) characters
Returns ENGLISH if >95% Latin, GENERAL if >50% Arabic, None otherwise
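A minimal, self-contained sketch of the classifier with all five Arabic blocks is shown below. The `LanguageRoute` enum here is a stand-in whose shape and values are assumed, and the helper names `_is_latin` and `_is_arabic` are illustrative; the real definitions live in the project and may differ.

```python
from enum import Enum
from typing import Optional


class LanguageRoute(Enum):
    """Stand-in for the project's LanguageRoute (assumed shape)."""
    ENGLISH = "english"
    GENERAL = "general"


# The five Arabic Unicode blocks listed above.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def _is_latin(c: str) -> bool:
    return "A" <= c <= "Z" or "a" <= c <= "z"


def _is_arabic(c: str) -> bool:
    cp = ord(c)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def classify_by_text_layer(page_text: str) -> Optional[LanguageRoute]:
    """Return a route when the text layer is decisive, else None (SigLIP2 decides)."""
    latin = sum(1 for c in page_text if _is_latin(c))
    arabic = sum(1 for c in page_text if _is_arabic(c))
    total = latin + arabic
    if total == 0:
        return None  # no alphabetic text (scan, blank page, digits only)
    if latin / total > 0.95:
        return LanguageRoute.ENGLISH
    if arabic / total > 0.50:
        return LanguageRoute.GENERAL
    return None  # mixed content


print(classify_by_text_layer("Annual Report 2024"))        # LanguageRoute.ENGLISH
print(classify_by_text_layer("A" * 543 + "\ufe8d" * 52))   # None (91.3% Latin, like OIA page 12)
print(classify_by_text_layer("\u0627\u0644\u0639\u0631\u0628\u064a\u0629"))  # LanguageRoute.GENERAL
```

The second call mirrors OIA page 12: the 52 Arabic characters are presentation forms, so a counter limited to U+0600-U+06FF would not see them and would route the page as English.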

ocr_workflow/activities/pdf.py:

extract_pages_activity now extracts text layer via page.get_text()[:1000] during page rendering
Returns List[Dict[str, str]] with {"key": ..., "page_text": ...} instead of List[str]
process_page_activity accepts optional page_text parameter
Calls classify_by_text_layer() first; falls through to SigLIP2 if None
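The extraction step might look roughly like the sketch below. It assumes PyMuPDF is importable as `fitz`. The function names, the example keys, and the way keys are produced are hypothetical; only `page.get_text()`, the 1000-character cap, and the `{"key": ..., "page_text": ...}` record shape come from the description above.

```python
from typing import Dict, List

TEXT_LIMIT = 1000  # matches the [:1000] slice described above


def build_page_records(keys: List[str], texts: List[str],
                       limit: int = TEXT_LIMIT) -> List[Dict[str, str]]:
    """Pair each rendered-page key with its (truncated) text layer."""
    return [{"key": key, "page_text": text[:limit]} for key, text in zip(keys, texts)]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the embedded text layer of every page; scanned pages yield ''."""
    import fitz  # PyMuPDF, imported lazily so build_page_records works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Hypothetical keys; in the real activity they identify the rendered page images.
records = build_page_records(["page-0001", "page-0002"], ["x" * 1500, ""])
print(len(records[0]["page_text"]), repr(records[1]["page_text"]))  # 1000 ''
```

Capping the text keeps the activity payload small. The classifier only needs character proportions, so a prefix of the page is usually enough, although a page whose first 1000 characters are unrepresentative could in principle be routed differently than the full text would suggest.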

ocr_workflow/workflows/pdf.py:

Updated to handle new dict format from extract_pages_activity
Passes page_text through to process_page_activity

tests/test_routing.py:

11 new unit tests for classify_by_text_layer() covering: pure English, pure Arabic, mixed, empty, whitespace-only, numbers-only, Arabic Presentation Forms, borderline thresholds

Bugs Found and Fixed

Arabic Presentation Forms (initial): PDF generators often use U+FB50-U+FDFF and U+FE70-U+FEFF instead of basic Arabic U+0600-U+06FF. Initial implementation missed these ranges, causing ar-novel.pdf pages to be misclassified. Fixed by including all five Arabic Unicode blocks.
Private Use Area pollution (2026-04-06): The first fix counted Latin characters with unicodedata.bidirectional(c) == "L". PDF generators also map Arabic glyphs to Private Use Area codepoints (U+F000-U+F8FF), which have default bidi class "L", so the "Latin" count was inflated by roughly 10x on Arabic PDFs (e.g., ar-novel.pdf page 2: 48 actual Latin vs 532 bidi "L"). Result: all 28 ar-novel.pdf pages fell through to SigLIP2 instead of 27 being classified as GENERAL. Fixed by using explicit character ranges (A-Za-z for Latin, five Arabic Unicode blocks) instead of bidi categories.
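The Private Use Area behavior is easy to reproduce with the standard library. The sample string below is synthetic (two Latin letters plus three Private Use codepoints standing in for remapped Arabic glyphs), and the two counter functions are illustrative names, not the project's code.

```python
import unicodedata

# Latin, Arabic, and Private Use characters report different bidi classes,
# but Private Use shares class "L" with Latin.
for label, ch in {"Latin A": "A", "Arabic beh": "\u0628", "Private Use": "\uf041"}.items():
    print(label, unicodedata.category(ch), unicodedata.bidirectional(ch))
# Latin A Lu L
# Arabic beh Lo AL
# Private Use Co L


def count_latin_by_bidi(text: str) -> int:
    """The approach that failed: anything with bidi class 'L' counts as Latin."""
    return sum(1 for c in text if unicodedata.bidirectional(c) == "L")


def count_latin_by_range(text: str) -> int:
    """The approach that was adopted: explicit A-Z / a-z ranges only."""
    return sum(1 for c in text if "A" <= c <= "Z" or "a" <= c <= "z")


sample = "AB " + "\uf041\uf042\uf043"  # synthetic: 2 Latin letters, 3 Private Use glyphs
bidi_count = count_latin_by_bidi(sample)
range_count = count_latin_by_range(sample)
print(bidi_count, range_count)  # 5 2
```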

Integration Test Results

Logs show source=text_layer or source=siglip2 for each routing decision:

OIA English report (17 sampled pages): 17/17 → language=english model=LIGHTONOCR source=text_layer
ar-novel.pdf (5 pages): 4/5 → language=general model=DOTS source=text_layer, 1/5 → language=general model=DOTS source=siglip2 (mixed title page fell through to SigLIP2, correctly handled)

Routing Fallback Chain (PDF pages)

1. PyMuPDF text-layer (new, <1ms):
   >95% Latin alphabetic chars → ENGLISH → LightOnOCR-2
   >50% Arabic alphabetic chars → GENERAL → DotsOCR
   Otherwise → fall through

2. SigLIP2 visual router (existing, ~55ms):
   margin > 0.02 → ENGLISH → LightOnOCR-2
   margin ≤ 0.02 → GENERAL → DotsOCR

3. SigLIP2 failure (existing):
   → GENERAL → DotsOCR (safe default)

Single-image OCR: SigLIP2 only (unchanged, no text layer available).
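The chain above can be sketched as a single function. This is an illustration under stated assumptions, not the project's code: `route_page`, its callable argument, and the `siglip2_failure` source label are hypothetical. The `text_layer` and `siglip2` labels and the 0.02 threshold come from this document.

```python
from typing import Callable, Optional, Tuple

SIGLIP2_THRESHOLD = 0.02
MODEL_FOR = {"english": "LIGHTONOCR", "general": "DOTS"}


def route_page(text_layer_route: Optional[str],
               siglip2_margin: Callable[[], float],
               threshold: float = SIGLIP2_THRESHOLD) -> Tuple[str, str, str]:
    """Return (language, model, source) for one PDF page.

    text_layer_route is 'english', 'general', or None (text layer undecided).
    siglip2_margin is only called when the text layer gives no answer.
    """
    if text_layer_route is not None:
        return text_layer_route, MODEL_FOR[text_layer_route], "text_layer"
    try:
        margin = siglip2_margin()
    except Exception:
        # Safe default: a failed router sends the page to the general model.
        return "general", MODEL_FOR["general"], "siglip2_failure"
    language = "english" if margin > threshold else "general"
    return language, MODEL_FOR[language], "siglip2"


def failing_router() -> float:
    raise RuntimeError("router unavailable")


print(route_page("english", lambda: 0.0))  # ('english', 'LIGHTONOCR', 'text_layer')
print(route_page(None, lambda: 0.003))     # ('general', 'DOTS', 'siglip2')
print(route_page(None, lambda: 0.027))     # ('english', 'LIGHTONOCR', 'siglip2')
print(route_page(None, failing_router))    # ('general', 'DOTS', 'siglip2_failure')
```

Passing the SigLIP2 call as a callable keeps the ~55ms inference off the path for pages that the text layer already decides.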