siraaj-dot-ocr-service / docs/research/benchmark_ocr_models/hard-english-benchmark-report.md
Hard English OCR Benchmark Report
Hard English OCR Benchmark Report
Date: 2026-03-15 (updated with GLM-OCR, FireRed-OCR, Nanonets-OCR2, Granite Vision, DeepSeek-OCR-2, MinerU2.5, Dolphin-v2, MonkeyOCR-pro-3B, Qwen3.5-4B-AWQ) GPU: NVIDIA L40S (46 GB VRAM) Test: 9 difficult pages — tables with merged headers, radial infographics, dense KPI tables, complex diagrams, strategic visuals
See also: General Model Comparison — broader benchmark covering English + Arabic across 20+ models with E2E pipeline comparison.
Approach
Ground Truth Creation
Ground truth baselines were created by Claude (AI) through direct visual inspection of each PDF page rendered at 150 DPI. For each page, Claude identified:
- All key numbers and numeric values visible in the document
- Important labels, headers, and terms
- Expected table structure (row/column counts)
- Minimum character count for completeness
Baselines are stored as markdown files in baselines/ directory (9 files, one per test page).
Benchmark Scripts
benchmark_hard_english.py— Main benchmark script. Sends each page as a JPEG image to an OpenAI-compatible API (vLLM), captures the response, and scores it against ground truth. Supports configurable prompts, temperature, top-p/top-k, and repetition penalty per model. Used for all single-call models (LightOnOCR-2, FireRed-OCR, Nanonets-OCR2, GLM-OCR, DotsOCR, Nemotron Parse, DeepSeek-OCR-2, Granite Vision, PaddleOCR-VL).benchmark_got_ocr_hard.py— Standalone script for GOT-OCR 2.0, which doesn't support vLLM. Uses HuggingFace transformers directly with the same ground truth and scoring functions.run_hard_english_benchmark.sh— Shell runner that starts each vLLM model sequentially, waits for server readiness, runs the benchmark, and kills the server before the next model.- Custom two-stage pipelines — MinerU2.5 and Dolphin-v2 require multi-call pipelines, not single prompts. Tested via inline scripts that use vLLM as backend:
- MinerU2.5: Uses
MinerUClientlibrary (pip install mineru-vl-utils[vllm]) withbackend='http-client'. Callstwo_step_extract(image)which internally runs layout detection ("Layout Detection:") then per-region content extraction ("Text Recognition:","Table Recognition:","Formula Recognition:"). Requires--logits-processors mineru_vl_utils:MinerULogitsProcessorwhen starting vLLM. - Dolphin-v2: Manual two-stage via vLLM chat API. Stage 1: full image +
"Parse the reading order of this document."→ returns element bboxes and types. Stage 2: for each element, crop the region from the original image and send with type-specific prompt ("Parse the table in the image.","Read text in the image.", etc.). Output combined in reading order.
- MinerU2.5: Uses
Model Search Methodology
Exhaustive search to find all candidate models ≤1B parameters for document OCR:
HuggingFace API searches:
- By pipeline tag (
image-text-to-text,image-to-text,text-generation) sorted by newest and by downloads - By actual safetensors file size < 2.5GB (filtering 500+ models with
files_metadata=True) - By keyword search (
document OCR VLM,page OCR model,document parsing VLM,OCR 256M,OCR 500M) - Checked 4,886 recent
image-text-to-textmodels for OCR-related tags + file size
Google/web searches (30+ queries):
- Broad:
"OCR LLM","OCR VLM","OCR model 2026","best OCR model","document OCR open source" - Size-specific:
"sub 1B VLM document OCR","small vision language model OCR","tiny VLM document understanding" - Model-specific:
"moondream OCR","Vary-toy OCR","pix2struct OCR","donut document OCR","nougat OCR","kosmos document OCR","gemma 3 1B vision OCR" - Benchmark-focused:
"OmniDocBench leaderboard","OCRBench small model under 1B","DocVQA small model" - Community:
"reddit small VLM OCR document 2025 2026"
Models found and tested (17 benchmarked): All competitive sub-1B/1B document OCR models on HuggingFace as of March 2026 were identified and benchmarked. Models over 1.5B (RolmOCR 7B, Penguin-VL 2B, CogAgent 9B, Moondream 1.9B, etc.) were excluded from the fast tier evaluation. Models requiring custom pipelines not compatible with vLLM (MonkeyOCR-pro-3B) or failing to load (H2OVL-Mississippi-800M) were noted but not scored.
Scoring Methodology
Composite score (0-100) per page:
- Numeric accuracy (40%): exact match of ground-truth numbers in output
- Label accuracy (30%): case-insensitive match of key terms/labels
- Structure (20%): table detection, correct row/column counts
- Completeness (10%): minimum character threshold met
For pages with no expected numbers, weights redistribute to labels (55%), structure (30%), completeness (15%).
Results Summary
| Rank | Model | Avg Score | Avg Time/Page | VRAM | Inference |
|---|---|---|---|---|---|
| 1 | LightOnOCR-2 (1B) | 92.8 | 3.0s | ~2 GB | Single call |
| 2 | Qwen3.5-4B-AWQ | 92.5 | 16.5s | ~4 GB | Single call*** |
| 3 | Qwen3.5-2B-AWQ | 90.1 | 9.4s**** | ~2.4 GB | Single call*** |
| 4 | Nanonets-OCR2-3B | 89.7 | 8.8s | ~8 GB | Single call |
| 5 | FireRed-OCR-2B | 89.4 | 5.5s | ~5 GB | Single call |
| 6 | HunyuanOCR (1B) | 87.3 | 1.4s | ~2 GB***** | Single call |
| 7 | GLM-OCR (0.9B) | 86.9 | 1.9s | ~2 GB | Single call* |
| 8 | MinerU2.5-1.2B | 85.4 | 2.4s | ~2 GB | Two-stage pipeline |
| 9 | DotsOCR v1.5 (14B) | 84.4 | 6.8s | ~14 GB | Single call |
| 10 | Nemotron Parse v1.1 (7B) | 79.8 | 3.0s | ~15 GB | Single call |
| 11 | Qwen3.5-0.8B | 77.3 | 9.8s | ~1.7 GB | Single call*** |
| 12 | DeepSeek-OCR-2 (3B MoE) | 67.3 | 1.9s | ~6.5 GB | Single call** |
| 13 | Dolphin-v2 (3B) | 72.9 | 11.4s | ~7 GB | Two-stage pipeline |
| 14 | Granite Vision 3.3 2B | 71.7 | 22.5s | ~4 GB | Single call |
| 15 | InternVL2.5-1B | 73.0 | 3.5s | ~1.8 GB | Single call |
| 16 | PaddleOCR-VL-1.5 (0.9B) | 58.3 | 8.2s | ~2 GB | Single call |
| 17 | GOT-OCR 2.0 (0.6B) | 40.5 | 160.4s | ~4.6 GB | Transformers |
*GLM-OCR requires a separate venv with transformers 5.x.
**DeepSeek-OCR-2 re-tested with recommended <|grounding|>Convert the document to markdown. prompt. Score dropped from 75.9 to 67.3 because grounding mode outputs bounding box coordinates that waste tokens and produce fragmented table structure.
***Qwen3.5 models require vLLM with transformers 5.x and --default-chat-template-kwargs '{"enable_thinking": false}'. Tested with recommended prompt "qwenvl markdown" and params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5).
****Qwen3.5-2B avg time inflated by adobe p3 (65s, 9 tok/s — slow prefill on complex diagram). Other 8 pages average 2.4s.
*****HunyuanOCR requires vLLM 0.17.1 with transformers from git main (TransformersMultiModal fallback). Tested with recommended Chinese doc parsing prompt ("请将图片中的内容按照文档解析的规范转成markdown格式输出"). Native vLLM support would reduce VRAM to ~2 GB; current fallback uses ~42 GB.
Not benchmarked — requires own pipeline: MonkeyOCR-pro-3B (echo840/MonkeyOCR-pro-3B) is a three-component system: Structure (YOLO layout detection via PaddlePaddle) + Relation + Recognition (Qwen2.5-VL fine-tune). Cannot be tested via vLLM alone — requires its own conda env with PaddlePaddle and the monkeyocr library. Claims +8.6% over MinerU on OmniDocBench tables.
Two-stage models (MinerU2.5, Dolphin-v2) use vLLM as backend but require multiple API calls per page: layout detection → per-region content extraction. MinerU2.5 uses MinerUClient library; Dolphin-v2 uses manual crop + type-specific prompts.
Sub-1B general VLMs not suitable for OCR: SmolVLM2-500M (HuggingFaceTB/SmolVLM2-500M-Video-Instruct) and SmolVLM-256M were tested but produce image descriptions ("this table provides information about...") instead of actual text extraction. These are visual Q&A/captioning models, not OCR models. Florence-2 is not supported by vLLM. H2OVL-Mississippi-800M (h2oai/h2ovl-mississippi-800m, 0.8B, OCR-specialized) failed to load on vLLM due to custom config serialization errors. The sub-1B models that actually do document OCR are GLM-OCR (0.9B, 86.9/100), Qwen3.5-0.8B (77.3/100, inconsistent), and PaddleOCR-VL-1.5 (0.9B, 58.3/100).
Note: Chandra-OCR (9B) was planned but the original model has been removed from HuggingFace.
Per-Page Breakdown
adobe-6-page p3 — CCF Framework Diagram + Compliance Infographic
Complex visual with framework diagram, compliance wheel, and infographic with embedded numbers (18 standards, 1000 CRs, 200 controls, 12 domains).
| Model | Score | Time | Numbers | Labels | Structure |
|---|---|---|---|---|---|
| Nanonets-OCR2 | 87.0 | 8.8s | 100% | 57% | 100% |
| Nemotron Parse | 78.3 | 1.6s | 75% | 61% | 100% |
| LightOnOCR-2 | 77.0 | 4.8s | 75% | 56% | 100% |
| PaddleOCR-VL | 77.0 | 4.6s | 75% | 56% | 100% |
| FireRed-OCR | 77.0 | 6.4s | 75% | 57% | 100% |
| MinerU2.5 | 75.7 | 4.4s | 75% | 52% | 100% |
| DeepSeek-OCR-2 | 77.0 | 1.6s | 75% | 57% | 100% |
| DotsOCR v1.5 | 56.5 | 6.6s | 50% | 22% | 100% |
| GOT-OCR 2.0 | 36.5 | 163.4s | 0% | 22% | 100% |
| GLM-OCR | 33.2 | 2.9s | 0% | 17% | 100% |
| Granite Vision | 31.2 | 4.1s | 0% | 17% | 80% |
| Dolphin-v2 | 30.3 | 4.2s | 0% | 17% | 100% |
All models struggle with the dense embedded labels (HIPAA, FISMA, NIST etc.). DotsOCR and GLM-OCR perform worst — DotsOCR skips picture content, GLM-OCR's table-only prompt misses diagram text.
adobe-6-page p4 — Cloud Vendor Comparison Table (Merged Headers)
6 vendors x 4 services with merged column headers.
| Model | Score | Time | Labels | Structure |
|---|---|---|---|---|
| DeepSeek-OCR-2 | 90.0 | 2.2s | 100% | 67% |
| LightOnOCR-2 | 86.1 | 3.0s | 93% | 67% |
| DotsOCR v1.5 | 86.1 | 8.0s | 93% | 67% |
| GLM-OCR | 86.1 | 2.2s | 93% | 67% |
| Nanonets-OCR2 | 86.1 | 10.7s | 93% | 67% |
| MinerU2.5 | 86.1 | 1.7s | 93% | 0% |
| Dolphin-v2 | 70.4 | 12.7s | 64% | 67% |
| Nemotron Parse | 66.1 | 1.9s | 93% | 0% |
| FireRed-OCR | 58.6 | 6.7s | 43% | 67% |
| GOT-OCR 2.0 | 38.6 | 159.6s | 43% | 0% |
| PaddleOCR-VL | 30.7 | 16.0s | 29% | 0% |
| Granite Vision | 15.0 | 78.0s | 0% | 0% |
Granite Vision degenerated (78s, 8192 tokens, 93K chars of repeated content). PaddleOCR also degenerates on this page.
adobe-6-page p5 — Cloud Vendor Continuation (Multi-line Cells)
Continuation table with 7 services and compliance text.
| Model | Score | Time | Labels | Structure |
|---|---|---|---|---|
| LightOnOCR-2 | 100.0 | 2.4s | 100% | 100% |
| DotsOCR v1.5 | 100.0 | 6.3s | 100% | 100% |
| GLM-OCR | 100.0 | 1.5s | 100% | 100% |
| Nanonets-OCR2 | 100.0 | 8.5s | 100% | 100% |
| FireRed-OCR | 100.0 | 4.1s | 100% | 100% |
| DeepSeek-OCR-2 | 100.0 | 1.5s | 100% | 100% |
| MinerU2.5 | 96.6 | 2.0s | 94% | 100% |
| Dolphin-v2 | 96.6 | 10.1s | 94% | 100% |
| Granite Vision | 72.5 | 4.3s | 50% | 100% |
| Nemotron Parse | 63.1 | 1.4s | 88% | 0% |
| GOT-OCR 2.0 | 42.5 | 162.0s | 50% | 0% |
| PaddleOCR-VL | 42.5 | 16.0s | 50% | 0% |
oman-2040-en p10 — Radial Infographic with KPI Targets
Circular chart with 9 numeric targets and 8 KPI labels embedded in visual elements.
| Model | Score | Time | Numbers | Labels | Structure |
|---|---|---|---|---|---|
| LightOnOCR-2 | 91.1 | 2.7s | 78% | 100% | 100% |
| Nemotron Parse | 91.1 | 0.8s | 78% | 100% | 100% |
| FireRed-OCR | 91.1 | 1.8s | 78% | 100% | 100% |
| Nanonets-OCR2 | 86.7 | 3.1s | 67% | 100% | 100% |
| GLM-OCR | 75.9 | 1.0s | 78% | 62% | 80% |
| Granite Vision | 67.7 | 4.3s | 67% | 50% | 80% |
| DeepSeek-OCR-2 | 67.2 | 12.3s | 56% | 50% | 100% |
| PaddleOCR-VL | 37.5 | 16.0s | 0% | 25% | 100% |
| DotsOCR v1.5 | 30.0 | 1.5s | 0% | 0% | 100% |
| GOT-OCR 2.0 | 30.0 | 161.0s | 0% | 0% | 100% |
| MinerU2.5 | 23.1 | 0.7s | 0% | 0% | 100% |
| Dolphin-v2 | 23.0 | 1.9s | 0% | 0% | 100% |
DotsOCR completely fails on infographics — its layout prompt categorizes this as "Picture" and omits text.
oman-2040-en p21 — KPI Performance Table (Vector Graphics)
Table rendered as vector graphics, not standard HTML/PDF table.
| Model | Score | Time | Numbers | Labels | Structure |
|---|---|---|---|---|---|
| GLM-OCR | 100.0 | 2.1s | 100% | 100% | 100% |
| MinerU2.5 | 100.0 | 3.8s | 100% | 100% | 100% |
| LightOnOCR-2 | 93.3 | 2.8s | 100% | 100% | 67% |
| DotsOCR v1.5 | 93.3 | 9.5s | 100% | 100% | 67% |
| FireRed-OCR | 93.3 | 6.8s | 100% | 100% | 67% |
| Nanonets-OCR2 | 86.7 | 6.1s | 83% | 100% | 67% |
| Nemotron Parse | 80.0 | 15.7s | 100% | 100% | 0% |
| DeepSeek-OCR-2 | 80.0 | 1.5s | 100% | 100% | 0% |
| Granite Vision | 76.2 | 9.1s | 100% | 43% | 67% |
| Dolphin-v2 | 75.7 | 19.6s | 100% | 86% | 0% |
| PaddleOCR-VL | 58.6 | 1.0s | 100% | 29% | 0% |
| GOT-OCR 2.0 | 10.0 | 159.1s | 0% | 0% | 0% |
oman-2040-en p31 — Dense Economic KPI Table
21 numeric values across economic indicators with percentages and fractions.
| Model | Score | Time | Numbers | Labels | Structure |
|---|---|---|---|---|---|
| LightOnOCR-2 | 100.0 | 2.9s | 100% | 100% | 100% |
| DotsOCR v1.5 | 100.0 | 7.1s | 100% | 100% | 100% |
| GLM-OCR | 100.0 | 2.2s | 100% | 100% | 100% |
| FireRed-OCR | 100.0 | 6.2s | 100% | 100% | 100% |
| MinerU2.5 | 100.0 | 2.8s | 100% | 100% | 100% |
| Granite Vision | 93.3 | 9.3s | 100% | 100% | 67% |
| Nanonets-OCR2 | 86.7 | 12.9s | 100% | 100% | 33% |
| Nemotron Parse | 80.0 | 1.7s | 100% | 100% | 0% |
| Dolphin-v2 | 67.1 | 21.9s | 76% | 89% | 0% |
| PaddleOCR-VL | 63.3 | 2.2s | 100% | 44% | 0% |
| DeepSeek-OCR-2 | 51.0 | 12.3s | 52% | 67% | 0% |
| GOT-OCR 2.0 | 10.0 | 159.1s | 0% | 0% | 0% |
oman-2040-en p46 — Strategic Directions Visual (1549 Drawings)
Text-heavy strategic visual with 12 direction items.
| Model | Score | Time | Labels | Structure |
|---|---|---|---|---|
| Most models | 94-100 | 0.7-10.4s | 100% | 80-100% |
| GOT-OCR 2.0 | 100.0 | 160.1s | 100% | 100% |
| DeepSeek-OCR-2 | 45.0 | 12.3s | 0% | 100% |
All models handle this text-heavy page well except DeepSeek-OCR-2 (hit 4096 token limit, 0% labels). MinerU2.5 (94.0) and Dolphin-v2 (100.0) both perform well here.
kfd p1 — Insurance Product Comparison (Merged Headers)
3-column table with AED amounts (3,500,000, 6,770, 5,000, etc.) and merged headers.
| Model | Score | Time | Numbers | Labels | Structure |
|---|---|---|---|---|---|
| LightOnOCR-2 | 93.3 | 2.9s | 100% | 100% | 67% |
| DotsOCR v1.5 | 93.3 | 5.5s | 100% | 100% | 67% |
| GLM-OCR | 93.3 | 2.1s | 100% | 100% | 67% |
| Nanonets-OCR2 | 93.3 | 10.6s | 100% | 100% | 67% |
| DeepSeek-OCR-2 | 93.3 | 1.4s | 100% | 100% | 67% |
| MinerU2.5 | 93.3 | 1.9s | 100% | 100% | 67% |
| Dolphin-v2 | 93.3 | 8.0s | 100% | 100% | 67% |
| FireRed-OCR | 90.6 | 6.8s | 100% | 91% | 67% |
| Granite Vision | 90.6 | 7.1s | 100% | 91% | 67% |
| Nemotron Parse | 80.0 | 1.2s | 100% | 100% | 0% |
| PaddleOCR-VL | 37.8 | 16.0s | 29% | 54% | 0% |
| GOT-OCR 2.0 | 10.0 | 159.8s | 0% | 0% | 0% |
kfd p2 — Insurance Benefits Continuation
Simpler table with fewer numbers and clear labels.
| Model | Score | Time | Numbers | Labels | Structure |
|---|---|---|---|---|---|
| LightOnOCR-2 | 100.0 | 3.2s | 100% | 100% | 100% |
| DotsOCR v1.5 | 100.0 | 9.4s | 100% | 100% | 100% |
| GLM-OCR | 100.0 | 1.7s | 100% | 100% | 100% |
| Nanonets-OCR2 | 100.0 | 10.5s | 100% | 100% | 100% |
| FireRed-OCR | 100.0 | 5.3s | 100% | 100% | 100% |
| MinerU2.5 | 100.0 | 2.9s | 100% | 100% | 100% |
| Dolphin-v2 | 100.0 | 13.5s | 100% | 100% | 100% |
| Granite Vision | 93.3 | 5.2s | 100% | 100% | 67% |
| GOT-OCR 2.0 | 86.7 | 159.8s | 100% | 100% | 33% |
| Nemotron Parse | 80.0 | 1.8s | 100% | 100% | 0% |
| DeepSeek-OCR-2 | 80.0 | 1.7s | 100% | 100% | 0% |
| PaddleOCR-VL | 77.3 | 1.2s | 100% | 91% | 0% |
Key Findings
1. LightOnOCR-2 is the best English OCR model
- Highest quality: 92.8/100 average across all 9 pages — #1 out of 13 models tested
- Fastest tier: 3.0s/page average
- Smallest: ~2 GB VRAM — can run alongside DotsOCR v1.5 on a single GPU
- Reads everything: tables, infographics, charts, diagrams, text — unlike DotsOCR which skips picture content
- Perfect scores on 3/9 pages (100.0), never below 77.0
- Limitation: English-only — garbles Arabic numerals
1b. Qwen3.5 models nearly tie on quality but slower
- Qwen3.5-4B-AWQ: 92.5/100, 16.5s/page, ~4 GB — only 0.3 points behind LightOnOCR-2 but 5.5x slower
- Qwen3.5-2B-AWQ: 90.1/100, ~2.4 GB — most pages 1.4-3.3s (competitive with LightOnOCR-2!) but adobe p3 spikes to 65s (slow prefill on complex diagram, 9 tok/s)
- Qwen3.5-0.8B: 77.3/100, ~1.7 GB — great on simple tables (97.3 on kfd p1) but degenerates on complex pages (21.8 on adobe p3, 51.5 on oman p46 — hits 8192 token limit). Too inconsistent for production
- Both require vLLM with transformers 5.x and
--default-chat-template-kwargs '{"enable_thinking": false}' - Tested with recommended prompt
"qwenvl markdown"and params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5) - Note: Qwen3.5-2B output on adobe p3 is actually more accurate than ground truth expectations — it reads the real text (SOC 2, ISO 27001, PCI DSS) from the diagram rather than label abbreviations (HIPAA, FISMA, NIST) that appear in a different visual element. The 77.0 score understates its quality.
- Production currently uses Qwen3-VL (not 3.5), which doesn't have the thinking mode issue
2. FireRed-OCR-2B and Nanonets-OCR2-3B are strong runners-up
- FireRed-OCR: 89.4/100, 5.5s/page, ~5 GB — good all-rounder, based on Qwen3-VL-2B
- Nanonets-OCR2: 89.7/100, 8.8s/page, ~8 GB — slightly better score but slower and heavier
- Both use recommended prompts and produce HTML tables
- Neither beats LightOnOCR-2 on the quality/speed/VRAM combination
2b. HunyuanOCR is fastest but needs vLLM nightly setup
- 87.3/100, 1.4s/page (fastest of all models), ~2 GB native VRAM (1B params)
- Three perfect 100.0 scores (adobe p5, oman p46, kfd p2)
- Weak on diagrams (35.4 on adobe p3 — same weakness as GLM-OCR)
- Tested with recommended Chinese doc parsing prompt (
"请将图片中的内容按照文档解析的规范转成markdown格式输出") - Requires vLLM 0.17.1 + transformers from git main (TransformersMultiModal fallback, uses ~42 GB VRAM unoptimized). Native vLLM
hunyuan_vlsupport not yet in stable releases - Custom license (not Apache 2.0)
3. GLM-OCR is fastest but table-only
- 86.9/100, 1.9s/page (fastest), ~2 GB — excellent on pure tables (4 perfect scores)
- Fails on diagrams/infographics (33.2 on adobe p3) because "Table Recognition:" prompt skips non-table content
- Requires separate venv with transformers 5.x — operational complexity
4. DotsOCR v1.5 is strong on tables but blind to infographics
- 84.4/100 average, but drops to 30.0 on infographics (oman p10) and 56.5 on diagrams (adobe p3)
- Its JSON layout prompt classifies charts as "Picture" and omits text extraction
- Excellent on pure table pages (93-100)
- Remains the only model with correct Arabic numeral extraction (6/6)
5. Nemotron Parse v1.1 has perfect numbers but no table structure
- 79.8/100 — all numbers correct on every page, but outputs LaTeX instead of markdown/HTML tables
- Structure score is 0% on 6/9 pages because LaTeX tables aren't detected as tables
- Extremely fast (3.0s avg) but the LaTeX output requires post-processing
5b. MinerU2.5-1.2B is the best two-stage pipeline model
- 85.4/100, 2.4s/page, ~2 GB — compact and fast
- Uses
MinerUClientlibrary for proper two-stage extraction (layout detection → per-region recognition) - Outputs HTML tables via
Table Recognition:prompt, text viaText Recognition: - Weak on infographics (23.1 on oman p10 — can't read chart data)
- Requires
--logits-processors mineru_vl_utils:MinerULogitsProcessorandpip install mineru-vl-utils[vllm]
6. DeepSeek-OCR-2 limited by 8192 context and grounding mode
- 67.3/100 with recommended
<|grounding|>prompt (was 75.9 with generic prompt — grounding mode wastes tokens on bounding box coordinates and produces fragmented table HTML) - Fast (1.9s/page) but 8192 context limit restricts output
- MoE architecture (3B total, ~570M active) — 6.5 GB VRAM
7. Dolphin-v2 slow but decent with proper pipeline
- 72.9/100, 11.4s/page — two-stage pipeline (reading order → per-element crop + type-specific prompts)
- Good on tables (93-100 on kfd) but slow due to multiple API calls per element
- Can't read infographics (23.0 on oman p10)
- ByteDance model, Qwen2.5-VL-3B based, ~7 GB VRAM
8. Granite Vision 3.3 2B unreliable
- 71.7/100 (re-tested with recommended temp=0.2) — degenerated on adobe p3 and p4 (78-79s each, hit 8192 tokens)
- Good on simple tables (85-94) but inconsistent on complex pages
9. MonkeyOCR-pro-3B — not benchmarked (requires own pipeline)
- Three-component system: Structure (YOLO via PaddlePaddle) + Relation + Recognition (Qwen2.5-VL)
- Cannot run via vLLM alone — requires its own conda env with PaddlePaddle and
monkeyocrlibrary - Claims +8.6% over MinerU on OmniDocBench tables, but could not be fairly tested in our setup
10. PaddleOCR-VL-1.5 and GOT-OCR 2.0 not competitive
- PaddleOCR: 58.3/100 — degenerates on complex pages (4/9 pages hit token limit)
- GOT-OCR: 40.5/100 at 160s/page — worst quality and slowest by far
Recommendation
English fast tier: LightOnOCR-2 (1B)
- Best quality (92.8), fast (3.0s), smallest (2 GB)
- Handles all content types: tables, charts, infographics, text
- Runner-up: FireRed-OCR-2B (89.4, 5.5s, 5 GB) if LightOnOCR-2 has issues
Arabic tier: DotsOCR v1.5 (14B)
- Only model with 6/6 Arabic numeral extraction
- Strong on tables (93-100), weak on infographics (30)
Production Stack vs LightOnOCR-2 (10-run stability test)
Both systems tested 10 times on all 9 benchmark pages to measure consistency.
| Metric | Prod Stack (DotsOCR + Qwen + Paddle) | LightOnOCR-2 (single model) |
|---|---|---|
| Mean score | 89.7/100 | 92.8/100 |
| Std deviation | 0.7 | 0.0 |
| Min / Max | 88.5 / 91.1 | 92.8 / 92.8 |
| Mean time/page | 14.6s | 2.7s |
| Std dev time | 0.5s | 0.1s |
| Models | 3 (DotsOCR 14B + Qwen 4B + PaddleOCR) | 1 (LightOnOCR-2 1B) |
| VRAM | ~22 GB | ~2 GB |
Production stack variance comes from Qwen secondary (temp=0.7) on crops — pages like adobe p3 (std=3.8) and oman p10 (std=3.7) fluctuate between runs. LightOnOCR-2 is perfectly deterministic (temp=0.0) with zero variance on all pages.
The production stack scores well (89.7) because DotsOCR handles tables and Qwen handles Picture crops. But LightOnOCR-2 handles everything in a single call — tables, charts, infographics, text — with higher quality, 5.4x faster, and zero variance.
Recommendation
English fast tier: LightOnOCR-2 (1B)
- Best quality (92.8), fastest (2.7s), smallest (2 GB), deterministic
- Handles all content types: tables, charts, infographics, text
- Runner-up: FireRed-OCR-2B (89.4, 5.5s, 5 GB) if LightOnOCR-2 has issues
Arabic tier: DotsOCR v1.5 (14B)
- Only model with 6/6 Arabic numeral extraction
- Strong on tables (93-100), weak on infographics (30)
Proposed routing: Detect document language → English pages to LightOnOCR-2, Arabic pages to DotsOCR v1.5. Both models fit on a single L40S GPU (~16 GB combined). See Router Benchmark for the language detection model evaluation.