PyMuPDF Text-Layer Routing: Research Findings

Last updated: 4/16/2026

Date: 2026-04-05
GPU: NVIDIA L40S (EC2 instance)
Test: 78 pages (OIA English report) + original 32-page benchmark corpus + 3 additional PDFs (388 total pages)
Goal: Fix 65% misrouting of English pages to DOTS in OIA_Annual_report_English_2024.pdf

Problem

OIA_Annual_report_English_2024.pdf is a 78-page, 100% English digital PDF. The SigLIP2 visual router (threshold=0.02) misclassifies ~65% of its pages as Arabic, sending them to DotsOCR instead of LightOnOCR-2.

Router log analysis (17 sampled pages):

| Route | Count | Margins |
|---|---|---|
| Correctly routed to LIGHTONOCR | 6/17 (35%) | +0.027 to +0.037 |
| Misrouted to DOTS | 11/17 (65%) | -0.021 to +0.017 |

Pages with sparse text (cover pages, section dividers) and pages with charts/infographics consistently fall below the 0.02 threshold despite being English.

Root Cause

The SigLIP2 text embeddings for ["English text", "Arabic text"] have cosine similarity of 0.887 — too close together. This produces tiny classification margins where English and Arabic pages overlap:

  • Arabic pages (ar-novel.pdf): margins from -0.023 to -0.010
  • English pages misclassified (OIA): margins from -0.021 to +0.017
  • English pages correctly classified (OIA): margins from +0.027 to +0.037

The overlap zone (-0.021 to -0.010) means no single threshold can perfectly separate the two.
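
The claim that no single threshold works can be checked with a small sketch. It uses only the range endpoints quoted above as sample points (not the full per-page data) and assumes the routing rule stated elsewhere in this document: a margin above the threshold means English. The function and variable names are illustrative.

```python
# Endpoints of the margin ranges reported above (not the full per-page data).
arabic_margins = [-0.023, -0.010]
english_margins = [-0.021, 0.017, 0.027, 0.037]


def misroutes(threshold: float) -> int:
    """Count samples sent to the wrong model when margin > threshold means English."""
    arabic_as_english = sum(1 for m in arabic_margins if m > threshold)
    english_as_arabic = sum(1 for m in english_margins if m <= threshold)
    return arabic_as_english + english_as_arabic


# Any threshold between two observed margins behaves like the lower one,
# so testing each observed margin plus one value below all of them is exhaustive.
candidates = sorted(arabic_margins + english_margins)
candidates.insert(0, candidates[0] - 0.001)
best = min(candidates, key=misroutes)
print(f"best threshold {best:+.3f} still misroutes {misroutes(best)} sample(s)")
```

Every candidate leaves at least one sample on the wrong side, which is the overlap described above.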

Approaches Investigated

Approach 1: Better SigLIP2 Text Prompts

Tested 10 prompt pairs including the SigLIP2 official template ("this is a photo of {label}.") on 27 English + 6 Arabic pages:

| Prompt Pair | Separation | Optimal Threshold | Notes |
|---|---|---|---|
| "A scanned document page in {lang}" | +0.0098 | -0.028 | Best separation |
| "this is a photo of a document in {lang}." | +0.0093 | -0.027 | SigLIP2 official template variant |
| "A photo of a page with {lang} text" | +0.0076 | -0.023 | |
| "this is a photo of english/arabic text." | -0.0010 | | Official template, but overlaps |
| "English text" / "Arabic text" (current) | -0.0106 | | Current prompts, overlaps |

The best prompt pair achieves perfect separation with gap +0.0098, but the gap is narrow (worst English margin -0.023, best Arabic margin -0.033). Verified end-to-end:

  • OIA English PDF: 17/17 → LIGHTONOCR
  • ar-novel.pdf Arabic: 5/5 → DOTS
  • kfd.pdf bilingual: EN pages → LIGHTONOCR, AR page → DOTS

Risk: The narrow gap (+0.0098) leaves little margin of safety for unseen documents. It also requires a negative threshold (-0.028), which inverts the current safety bias: borderline pages would default to English instead of Arabic.
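
The separation and threshold figures can be reproduced approximately with the sketch below. It assumes that "separation" means the lowest English margin minus the highest Arabic margin, and that the optimal threshold is the midpoint between those two values; the document does not state either definition, but both agree with the reported numbers to rounding. The function names are illustrative.

```python
def separation(english_margins, arabic_margins):
    """Lowest English margin minus highest Arabic margin.

    Positive: one threshold can split the two groups. Negative: they overlap.
    """
    return min(english_margins) - max(arabic_margins)


def midpoint_threshold(english_margins, arabic_margins):
    """Threshold halfway between the two closest margins (meaningful only if separation > 0)."""
    return (min(english_margins) + max(arabic_margins)) / 2


# Rounded extremes quoted in the text for the best prompt pair.
best_pair_gap = separation([-0.023], [-0.033])            # about +0.010
best_pair_cut = midpoint_threshold([-0.023], [-0.033])    # about -0.028

# Range endpoints quoted under Root Cause for the current prompts.
current_gap = separation([-0.021, 0.037], [-0.023, -0.010])  # about -0.011 (overlap)

print(f"{best_pair_gap:+.3f} {best_pair_cut:+.3f} {current_gap:+.3f}")
```

Because the inputs here are the rounded values from the text, the results match the table only to about three decimal places.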

Approach 2: PyMuPDF Text-Layer Classification

The original benchmark report already proposed PyMuPDF text extraction as a first-pass optimization. This approach extracts the embedded text layer from each PDF page and classifies the page by Unicode character ranges:

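import re

text = page.get_text()  # page is an open PyMuPDF page
latin = len(re.findall(r"[A-Za-z]", text))
arabic = len(re.findall(r"[\u0600-\u06FF]", text))
total = latin + arabic

if total == 0:
    route = None           # no alphabetic text: fall through to SigLIP2 visual router
elif latin / total > 0.95:
    route = "LIGHTONOCR"
elif arabic / total > 0.50:
    route = "DOTS"
else:
    route = None           # mixed content: fall through to SigLIP2 visual router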

PyMuPDF Results

OIA_Annual_report_English_2024.pdf (78 pages)

76/78 pages → LIGHTONOCR via text-layer, 2 → SigLIP2 fallback (97.4% direct)

Most pages have Latin text layers with 97-100% Latin characters. Two bilingual pages with Arabic headers fall through to SigLIP2 (page 12: 543 Latin, 52 Arabic = 91.3% Latin; page 62: 313 Latin, 23 Arabic = 93.2% Latin). Both are below the 95% threshold because they contain Arabic Presentation Forms in their headers. This is correct behavior — bilingual pages are deferred to the visual classifier.
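
As a worked check of the two fall-through pages, the sketch below recomputes the Latin share from the character counts quoted above and compares it with the 95% threshold. Only the Latin branch of the rule is modeled, which is enough here because both pages are mostly Latin; the helper name is illustrative.

```python
LATIN_THRESHOLD = 0.95  # from the routing rule above


def latin_share(latin: int, arabic: int) -> float:
    """Fraction of alphabetic characters (Latin + Arabic) that are Latin."""
    return latin / (latin + arabic)


for page, latin, arabic in [(12, 543, 52), (62, 313, 23)]:
    share = latin_share(latin, arabic)
    decision = "LIGHTONOCR" if share > LATIN_THRESHOLD else "fall through to SigLIP2"
    print(f"page {page}: {share:.1%} Latin -> {decision}")
# page 12: 91.3% Latin -> fall through to SigLIP2
# page 62: 93.2% Latin -> fall through to SigLIP2
```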

Original 32-Page Benchmark Corpus

PyMuPDF resolves both known SigLIP2 false negatives:

| Page | SigLIP2 Result | SigLIP2 Margin | PyMuPDF Result | Text Layer |
|---|---|---|---|---|
| oman p10 | AR (wrong) | +0.003 | EN (correct) | 469 Latin, 0 Arabic = 100% |
| kfd p1 | AR (wrong) | +0.003 | EN (correct) | 649 Latin, 0 Arabic = 100% |

Benchmark accuracy: 93.8% (SigLIP2 only) → 100% (PyMuPDF + SigLIP2 fallback)

Full Test Corpus (411 pages across 9 PDFs)

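| PDF | Pages | LIGHTONOCR | DOTS | SigLIP2 Fallback |
|---|---|---|---|---|
| OIA_Annual_report_English_2024.pdf | 78 | 76 | 0 | 2 |
| adobe-6-page.pdf | 6 | 6 | 0 | 0 |
| oman-2040-en.pdf | 52 | 41 | 1 | 10 |
| oman-2040-en-min.pdf | 17 | 15 | 0 | 2 |
| ar-novel.pdf | 28 | 0 | 27 | 1 |
| hasini-asma.pdf | 68 | 4 | 59 | 5 |
| kfd.pdf | 6 | 3 | 3 | 0 |
| ien.pdf | 150 | 17 | 129 | 4 |
| tickets.pdf | 6 | 0 | 0 | 6 |
| Total | 411 | 162 | 219 | 30 |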

The 30 SigLIP2 fallback pages break down as:

  • 6 scanned ticket images (no text layer) → SigLIP2 routes as before
  • 12 blank/decorative pages (no text at all) → SigLIP2 routes to DOTS (safe)
  • 12 mixed-language pages (Arabic headers + English body, or vice versa) → SigLIP2 classifies visually

Zero false positives (no Arabic page incorrectly sent to LIGHTONOCR).

Manually verified edge cases:

  • hasini-asma.pdf English pages → LIGHTONOCR: genuinely English sections (abstract, references) of an Arabic thesis
  • ar-novel.pdf p1 → SigLIP2 fallback: mixed Arabic/English title page (36.8% Arabic, below 50% threshold)

Proposed Routing Logic

# In activities/pdf.py process_page_activity, BEFORE classify_language():

def classify_by_text_layer(page_text: str) -> LanguageRoute | None:
    """Classify language from PDF text layer (Unicode character ranges).
    
    Returns LanguageRoute.ENGLISH or LanguageRoute.GENERAL if confident,
    or None to fall through to SigLIP2 visual classification.
    """
    latin = sum(1 for c in page_text if 'A' <= c <= 'Z' or 'a' <= c <= 'z')
    arabic = sum(1 for c in page_text if '\u0600' <= c <= '\u06FF')
    total = latin + arabic
    
    if total == 0:
        return None  # No alphabetic text → fall through to SigLIP2
    
    if latin / total > 0.95:
        return LanguageRoute.ENGLISH
    if arabic / total > 0.50:
        return LanguageRoute.GENERAL
    
    return None  # Mixed content → fall through to SigLIP2

Implementation Plan

Files to Change

  1. ocr_workflow/core/routing.py: Add classify_by_text_layer() function
  2. ocr_workflow/activities/pdf.py: Extract the page text during extract_pages_activity (PyMuPDF already has the page open), pass it through to process_page_activity, and call classify_by_text_layer() there before classify_language()

What Does NOT Change

  • SigLIP2 router remains as fallback (unchanged threshold, unchanged prompts)
  • Fallback chain (LIGHTONOCR → DOTS → QWEN → PADDLE) unchanged
  • Single-image OCR activity (run_ocr_activity) unchanged (no text layer for images)
  • All existing tests remain valid

Key Design Decisions

  • 95% Latin threshold (not 100%): accounts for occasional Arabic characters in English documents (e.g., OIA page 12 has 14 Arabic chars in 557 alphabetic)
  • 50% Arabic threshold (not lower): avoids false positives on mixed pages
  • None return for ambiguous cases: preserves the SigLIP2 safety net rather than guessing

Comparison of Approaches

| Metric | SigLIP2 Only (Current) | Better Prompts | PyMuPDF + SigLIP2 |
|---|---|---|---|
| Benchmark accuracy | 93.8% (30/32) | 100% (32/32) | 100% (32/32) |
| OIA accuracy | ~35% (6/17 sampled) | 100% (17/17) | 97.4% text-layer (76/78) + 2 SigLIP2 fallback |
| False positive risk | 0 | Low (narrow gap) | 0 |
| Latency per page | ~55ms | ~55ms | <1ms (0ms for SigLIP2 skip) |
| Works on scans | Yes | Yes | Falls through to SigLIP2 |
| Code changes | None | Export script + threshold | routing.py + pdf.py |
| Risk | Known misrouting | Threshold sensitivity | None (additive layer) |

Recommendation: PyMuPDF text-layer routing as primary fix. Optionally, better SigLIP2 prompts as a supplementary improvement for scanned documents.

Implementation (2026-04-05)

Changes Made

ocr_workflow/core/routing.py:

  • Added classify_by_text_layer(page_text: str) -> LanguageRoute | None
  • Counts Latin (A-Za-z) and Arabic (U+0600-U+06FF, U+0750-U+077F, U+08A0-U+08FF, U+FB50-U+FDFF, U+FE70-U+FEFF) characters
  • Returns ENGLISH if >95% Latin, GENERAL if >50% Arabic, None otherwise
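
A self-contained sketch of the classifier as described in the three bullets above may help. It is not the repository's code: the LanguageRoute enum here is a hypothetical stand-in for the project's own type, and the helper names are illustrative. The five ranges and both thresholds are taken from this document.

```python
from enum import Enum
from typing import Optional


class LanguageRoute(Enum):  # hypothetical stand-in for the project's enum
    ENGLISH = "english"
    GENERAL = "general"


# The five Arabic blocks listed above, as inclusive code point ranges.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def is_latin(ch: str) -> bool:
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


def classify_by_text_layer(page_text: str) -> Optional[LanguageRoute]:
    """Return a route when the text layer is decisive, else None (fall through)."""
    latin = sum(1 for ch in page_text if is_latin(ch))
    arabic = sum(1 for ch in page_text if is_arabic(ch))
    total = latin + arabic
    if total == 0:
        return None  # no alphabetic text at all
    if latin / total > 0.95:
        return LanguageRoute.ENGLISH
    if arabic / total > 0.50:
        return LanguageRoute.GENERAL
    return None  # mixed content
```

For example, a string made only of Presentation Forms-B characters such as "\uFE8D\uFEDF\uFECC" yields GENERAL, while digits and whitespace alone yield None.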

ocr_workflow/activities/pdf.py:

  • extract_pages_activity now extracts text layer via page.get_text()[:1000] during page rendering
  • Returns List[Dict[str, str]] with {"key": ..., "page_text": ...} instead of List[str]
  • process_page_activity accepts optional page_text parameter
  • Calls classify_by_text_layer() first; falls through to SigLIP2 if None
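
The new return shape of extract_pages_activity can be illustrated with a minimal sketch. It assumes only that each page object exposes get_text() returning a string, as PyMuPDF pages do; the function name, the key values, and the FakePage class are hypothetical and exist so the example runs without a PDF.

```python
from typing import Dict, Iterable, List, Protocol

MAX_TEXT_CHARS = 1000  # truncation length noted above


class HasText(Protocol):
    def get_text(self) -> str: ...


def build_page_records(keys: Iterable[str], pages: Iterable[HasText]) -> List[Dict[str, str]]:
    """Pair each rendered page's key with the first MAX_TEXT_CHARS of its text layer."""
    return [
        {"key": key, "page_text": page.get_text()[:MAX_TEXT_CHARS]}
        for key, page in zip(keys, pages)
    ]


class FakePage:
    """Stand-in for a PyMuPDF page, used only to exercise the sketch."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_text(self) -> str:
        return self._text


records = build_page_records(
    ["page-1.png", "page-2.png"],
    [FakePage("Hello"), FakePage("x" * 5000)],
)
print(records[0], len(records[1]["page_text"]))
```

Truncating the text keeps the payload passed between activities small while still giving the character-ratio check a sizeable sample.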

ocr_workflow/workflows/pdf.py:

  • Updated to handle new dict format from extract_pages_activity
  • Passes page_text through to process_page_activity

tests/test_routing.py:

  • 11 new unit tests for classify_by_text_layer() covering: pure English, pure Arabic, mixed, empty, whitespace-only, numbers-only, Arabic Presentation Forms, borderline thresholds

Bugs Found and Fixed

  1. Arabic Presentation Forms (initial): PDF generators often use U+FB50-U+FDFF and U+FE70-U+FEFF instead of basic Arabic U+0600-U+06FF. Initial implementation missed these ranges, causing ar-novel.pdf pages to be misclassified. Fixed by including all five Arabic Unicode blocks.

  2. Private Use Area pollution (2026-04-06): Initial fix used unicodedata.bidirectional(c) == "L" to count Latin characters, but PDF generators also map Arabic glyphs to Private Use Area codepoints (U+F000-F8FF) which have default bidi class "L". This inflated the "Latin" count by ~10x on Arabic PDFs (e.g., ar-novel.pdf page 2: 48 actual Latin vs 532 bidi "L"). Result: all 28 ar-novel.pdf pages fell through to SigLIP2 instead of 27 being classified as GENERAL. Fixed by using explicit character ranges (A-Za-z for Latin, five Arabic Unicode blocks) instead of bidi categories.
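
The second bug can be reproduced in isolation with the sketch below, which compares the bidi-class count with the explicit-range count on a string that mixes ASCII letters with Private Use Area code points. The sample string and function names are made up for illustration; the exact bidi count depends on the Unicode data shipped with the Python build, so no expected output is shown.

```python
import unicodedata


def count_latin_bidi(text: str) -> int:
    """Earlier approach: treat every character with bidi class 'L' as Latin."""
    return sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")


def count_latin_explicit(text: str) -> int:
    """Current approach: count only A-Z and a-z."""
    return sum(1 for ch in text if "A" <= ch <= "Z" or "a" <= ch <= "z")


# Three real Latin letters followed by four Private Use Area code points,
# standing in for Arabic glyphs remapped by a PDF generator.
sample = "abc" + "\uf020\uf021\uf022\uf023"
print("bidi count:", count_latin_bidi(sample))
print("explicit count:", count_latin_explicit(sample))
```

The explicit count ignores the Private Use Area characters entirely, which is why it is not inflated on Arabic PDFs.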

Integration Test Results

Logs show source=text_layer or source=siglip2 for each routing decision:

OIA English report (17 sampled pages): 17/17 → language=english model=LIGHTONOCR source=text_layer

ar-novel.pdf (5 pages): 4/5 → language=general model=DOTS source=text_layer, 1/5 → language=general model=DOTS source=siglip2 (mixed title page fell through to SigLIP2, correctly handled)

Routing Fallback Chain (PDF pages)

1. PyMuPDF text-layer (new, <1ms):
   >95% Latin alphabetic chars → ENGLISH → LightOnOCR-2
   >50% Arabic alphabetic chars → GENERAL → DotsOCR
   Otherwise → fall through

2. SigLIP2 visual router (existing, ~55ms):
   margin > 0.02 → ENGLISH → LightOnOCR-2
   margin ≤ 0.02 → GENERAL → DotsOCR

3. SigLIP2 failure (existing):
   → GENERAL → DotsOCR (safe default)

Single-image OCR: SigLIP2 only (unchanged, no text layer available).
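
The three-step chain for PDF pages can be sketched as one function. This is an illustration under assumptions, not the project's code: the function name, the string route values, and the "default" source label are hypothetical (the logs quoted earlier show only text_layer and siglip2), and the SigLIP2 call is represented by a caller-supplied function that returns the margin.

```python
from typing import Callable, Optional, Tuple

SIGLIP2_THRESHOLD = 0.02  # existing visual-router threshold


def route_pdf_page(
    text_layer_route: Optional[str],
    siglip2_margin: Callable[[], float],
) -> Tuple[str, str]:
    """Return (language, source) following the chain above."""
    # Step 1: a decisive text-layer result wins and SigLIP2 is never called.
    if text_layer_route is not None:
        return text_layer_route, "text_layer"
    # Step 2: otherwise ask the visual router for a margin.
    try:
        margin = siglip2_margin()
    except Exception:
        # Step 3: any SigLIP2 failure falls back to the safe default.
        return "general", "default"
    language = "english" if margin > SIGLIP2_THRESHOLD else "general"
    return language, "siglip2"


print(route_pdf_page("english", lambda: 0.0))
print(route_pdf_page(None, lambda: 0.03))
print(route_pdf_page(None, lambda: 0.02))
```

Passing the SigLIP2 step as a callable means the roughly 55ms inference is skipped whenever the text layer is decisive.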