siraaj-dot-ocr-service / docs/research/benchmark_ocr_models/hard-english-benchmark-report.md

Hard English OCR Benchmark Report

Last updated: 4/16/2026GitHub

Hard English OCR Benchmark Report

Date: 2026-03-15 (updated with GLM-OCR, FireRed-OCR, Nanonets-OCR2, Granite Vision, DeepSeek-OCR-2, MinerU2.5, Dolphin-v2, MonkeyOCR-pro-3B, Qwen3.5-4B-AWQ) GPU: NVIDIA L40S (46 GB VRAM) Test: 9 difficult pages — tables with merged headers, radial infographics, dense KPI tables, complex diagrams, strategic visuals

See also: General Model Comparison — broader benchmark covering English + Arabic across 20+ models with E2E pipeline comparison.

Approach

Ground Truth Creation

Ground truth baselines were created by Claude (AI) through direct visual inspection of each PDF page rendered at 150 DPI. For each page, Claude identified:

All key numbers and numeric values visible in the document
Important labels, headers, and terms
Expected table structure (row/column counts)
Minimum character count for completeness

Baselines are stored as markdown files in baselines/ directory (9 files, one per test page).

Benchmark Scripts

benchmark_hard_english.py — Main benchmark script. Sends each page as a JPEG image to an OpenAI-compatible API (vLLM), captures the response, and scores it against ground truth. Supports configurable prompts, temperature, top-p/top-k, and repetition penalty per model. Used for all single-call models (LightOnOCR-2, FireRed-OCR, Nanonets-OCR2, GLM-OCR, DotsOCR, Nemotron Parse, DeepSeek-OCR-2, Granite Vision, PaddleOCR-VL).
benchmark_got_ocr_hard.py — Standalone script for GOT-OCR 2.0, which doesn't support vLLM. Uses HuggingFace transformers directly with the same ground truth and scoring functions.
run_hard_english_benchmark.sh — Shell runner that starts each vLLM model sequentially, waits for server readiness, runs the benchmark, and kills the server before the next model.
Custom two-stage pipelines — MinerU2.5 and Dolphin-v2 require multi-call pipelines, not single prompts. Tested via inline scripts that use vLLM as backend:
- MinerU2.5: Uses MinerUClient library (pip install mineru-vl-utils[vllm]) with backend='http-client'. Calls two_step_extract(image) which internally runs layout detection ("Layout Detection:") then per-region content extraction ("Text Recognition:", "Table Recognition:", "Formula Recognition:"). Requires --logits-processors mineru_vl_utils:MinerULogitsProcessor when starting vLLM.
- Dolphin-v2: Manual two-stage via vLLM chat API. Stage 1: full image + "Parse the reading order of this document." → returns element bboxes and types. Stage 2: for each element, crop the region from the original image and send with type-specific prompt ("Parse the table in the image.", "Read text in the image.", etc.). Output combined in reading order.

Model Search Methodology

Exhaustive search to find all candidate models ≤1B parameters for document OCR:

HuggingFace API searches:

By pipeline tag (image-text-to-text, image-to-text, text-generation) sorted by newest and by downloads
By actual safetensors file size < 2.5GB (filtering 500+ models with files_metadata=True)
By keyword search (document OCR VLM, page OCR model, document parsing VLM, OCR 256M, OCR 500M)
Checked 4,886 recent image-text-to-text models for OCR-related tags + file size

Google/web searches (30+ queries):

Broad: "OCR LLM", "OCR VLM", "OCR model 2026", "best OCR model", "document OCR open source"
Size-specific: "sub 1B VLM document OCR", "small vision language model OCR", "tiny VLM document understanding"
Model-specific: "moondream OCR", "Vary-toy OCR", "pix2struct OCR", "donut document OCR", "nougat OCR", "kosmos document OCR", "gemma 3 1B vision OCR"
Benchmark-focused: "OmniDocBench leaderboard", "OCRBench small model under 1B", "DocVQA small model"
Community: "reddit small VLM OCR document 2025 2026"

Models found and tested (17 benchmarked): All competitive sub-1B/1B document OCR models on HuggingFace as of March 2026 were identified and benchmarked. Models over 1.5B (RolmOCR 7B, Penguin-VL 2B, CogAgent 9B, Moondream 1.9B, etc.) were excluded from the fast tier evaluation. Models requiring custom pipelines not compatible with vLLM (MonkeyOCR-pro-3B) or failing to load (H2OVL-Mississippi-800M) were noted but not scored.

Scoring Methodology

Composite score (0-100) per page:

Numeric accuracy (40%): exact match of ground-truth numbers in output
Label accuracy (30%): case-insensitive match of key terms/labels
Structure (20%): table detection, correct row/column counts
Completeness (10%): minimum character threshold met

For pages with no expected numbers, weights redistribute to labels (55%), structure (30%), completeness (15%).

Results Summary

Rank	Model	Avg Score	Avg Time/Page	VRAM	Inference
1	LightOnOCR-2 (1B)	92.8	3.0s	~2 GB	Single call
2	Qwen3.5-4B-AWQ	92.5	16.5s	~4 GB	Single call***
3	Qwen3.5-2B-AWQ	90.1	9.4s****	~2.4 GB	Single call***
4	Nanonets-OCR2-3B	89.7	8.8s	~8 GB	Single call
5	FireRed-OCR-2B	89.4	5.5s	~5 GB	Single call
6	HunyuanOCR (1B)	87.3	1.4s	~2 GB*****	Single call
7	GLM-OCR (0.9B)	86.9	1.9s	~2 GB	Single call*
8	MinerU2.5-1.2B	85.4	2.4s	~2 GB	Two-stage pipeline
9	DotsOCR v1.5 (14B)	84.4	6.8s	~14 GB	Single call
10	Nemotron Parse v1.1 (7B)	79.8	3.0s	~15 GB	Single call
11	Qwen3.5-0.8B	77.3	9.8s	~1.7 GB	Single call***
12	DeepSeek-OCR-2 (3B MoE)	67.3	1.9s	~6.5 GB	Single call**
13	Dolphin-v2 (3B)	72.9	11.4s	~7 GB	Two-stage pipeline
14	Granite Vision 3.3 2B	71.7	22.5s	~4 GB	Single call
15	InternVL2.5-1B	73.0	3.5s	~1.8 GB	Single call
16	PaddleOCR-VL-1.5 (0.9B)	58.3	8.2s	~2 GB	Single call
17	GOT-OCR 2.0 (0.6B)	40.5	160.4s	~4.6 GB	Transformers

*GLM-OCR requires a separate venv with transformers 5.x. **DeepSeek-OCR-2 re-tested with recommended <|grounding|>Convert the document to markdown. prompt. Score dropped from 75.9 to 67.3 because grounding mode outputs bounding box coordinates that waste tokens and produce fragmented table structure. ***Qwen3.5 models require vLLM with transformers 5.x and --default-chat-template-kwargs '{"enable_thinking": false}'. Tested with recommended prompt "qwenvl markdown" and params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5). ****Qwen3.5-2B avg time inflated by adobe p3 (65s, 9 tok/s — slow prefill on complex diagram). Other 8 pages average 2.4s. *****HunyuanOCR requires vLLM 0.17.1 with transformers from git main (TransformersMultiModal fallback). Tested with recommended Chinese doc parsing prompt ("请将图片中的内容按照文档解析的规范转成markdown格式输出"). Native vLLM support would reduce VRAM to ~2 GB; current fallback uses ~42 GB.

Not benchmarked — requires own pipeline: MonkeyOCR-pro-3B (echo840/MonkeyOCR-pro-3B) is a three-component system: Structure (YOLO layout detection via PaddlePaddle) + Relation + Recognition (Qwen2.5-VL fine-tune). Cannot be tested via vLLM alone — requires its own conda env with PaddlePaddle and the monkeyocr library. Claims +8.6% over MinerU on OmniDocBench tables.

Two-stage models (MinerU2.5, Dolphin-v2) use vLLM as backend but require multiple API calls per page: layout detection → per-region content extraction. MinerU2.5 uses MinerUClient library; Dolphin-v2 uses manual crop + type-specific prompts.

Sub-1B general VLMs not suitable for OCR: SmolVLM2-500M (HuggingFaceTB/SmolVLM2-500M-Video-Instruct) and SmolVLM-256M were tested but produce image descriptions ("this table provides information about...") instead of actual text extraction. These are visual Q&A/captioning models, not OCR models. Florence-2 is not supported by vLLM. H2OVL-Mississippi-800M (h2oai/h2ovl-mississippi-800m, 0.8B, OCR-specialized) failed to load on vLLM due to custom config serialization errors. The sub-1B models that actually do document OCR are GLM-OCR (0.9B, 86.9/100), Qwen3.5-0.8B (77.3/100, inconsistent), and PaddleOCR-VL-1.5 (0.9B, 58.3/100).

Note: Chandra-OCR (9B) was planned but the original model has been removed from HuggingFace.

Per-Page Breakdown

adobe-6-page p3 — CCF Framework Diagram + Compliance Infographic

Complex visual with framework diagram, compliance wheel, and infographic with embedded numbers (18 standards, 1000 CRs, 200 controls, 12 domains).

Model	Score	Time	Numbers	Labels	Structure
Nanonets-OCR2	87.0	8.8s	100%	57%	100%
Nemotron Parse	78.3	1.6s	75%	61%	100%
LightOnOCR-2	77.0	4.8s	75%	56%	100%
PaddleOCR-VL	77.0	4.6s	75%	56%	100%
FireRed-OCR	77.0	6.4s	75%	57%	100%
MinerU2.5	75.7	4.4s	75%	52%	100%
DeepSeek-OCR-2	77.0	1.6s	75%	57%	100%
DotsOCR v1.5	56.5	6.6s	50%	22%	100%
GOT-OCR 2.0	36.5	163.4s	0%	22%	100%
GLM-OCR	33.2	2.9s	0%	17%	100%
Granite Vision	31.2	4.1s	0%	17%	80%
Dolphin-v2	30.3	4.2s	0%	17%	100%

All models struggle with the dense embedded labels (HIPAA, FISMA, NIST etc.). DotsOCR and GLM-OCR perform worst — DotsOCR skips picture content, GLM-OCR's table-only prompt misses diagram text.

adobe-6-page p4 — Cloud Vendor Comparison Table (Merged Headers)

6 vendors x 4 services with merged column headers.

Model	Score	Time	Labels	Structure
DeepSeek-OCR-2	90.0	2.2s	100%	67%
LightOnOCR-2	86.1	3.0s	93%	67%
DotsOCR v1.5	86.1	8.0s	93%	67%
GLM-OCR	86.1	2.2s	93%	67%
Nanonets-OCR2	86.1	10.7s	93%	67%
MinerU2.5	86.1	1.7s	93%	0%
Dolphin-v2	70.4	12.7s	64%	67%
Nemotron Parse	66.1	1.9s	93%	0%
FireRed-OCR	58.6	6.7s	43%	67%
GOT-OCR 2.0	38.6	159.6s	43%	0%
PaddleOCR-VL	30.7	16.0s	29%	0%
Granite Vision	15.0	78.0s	0%	0%

Granite Vision degenerated (78s, 8192 tokens, 93K chars of repeated content). PaddleOCR also degenerates on this page.

adobe-6-page p5 — Cloud Vendor Continuation (Multi-line Cells)

Continuation table with 7 services and compliance text.

Model	Score	Time	Labels	Structure
LightOnOCR-2	100.0	2.4s	100%	100%
DotsOCR v1.5	100.0	6.3s	100%	100%
GLM-OCR	100.0	1.5s	100%	100%
Nanonets-OCR2	100.0	8.5s	100%	100%
FireRed-OCR	100.0	4.1s	100%	100%
DeepSeek-OCR-2	100.0	1.5s	100%	100%
MinerU2.5	96.6	2.0s	94%	100%
Dolphin-v2	96.6	10.1s	94%	100%
Granite Vision	72.5	4.3s	50%	100%
Nemotron Parse	63.1	1.4s	88%	0%
GOT-OCR 2.0	42.5	162.0s	50%	0%
PaddleOCR-VL	42.5	16.0s	50%	0%

oman-2040-en p10 — Radial Infographic with KPI Targets

Circular chart with 9 numeric targets and 8 KPI labels embedded in visual elements.

Model	Score	Time	Numbers	Labels	Structure
LightOnOCR-2	91.1	2.7s	78%	100%	100%
Nemotron Parse	91.1	0.8s	78%	100%	100%
FireRed-OCR	91.1	1.8s	78%	100%	100%
Nanonets-OCR2	86.7	3.1s	67%	100%	100%
GLM-OCR	75.9	1.0s	78%	62%	80%
Granite Vision	67.7	4.3s	67%	50%	80%
DeepSeek-OCR-2	67.2	12.3s	56%	50%	100%
PaddleOCR-VL	37.5	16.0s	0%	25%	100%
DotsOCR v1.5	30.0	1.5s	0%	0%	100%
GOT-OCR 2.0	30.0	161.0s	0%	0%	100%
MinerU2.5	23.1	0.7s	0%	0%	100%
Dolphin-v2	23.0	1.9s	0%	0%	100%

DotsOCR completely fails on infographics — its layout prompt categorizes this as "Picture" and omits text.

oman-2040-en p21 — KPI Performance Table (Vector Graphics)

Table rendered as vector graphics, not standard HTML/PDF table.

Model	Score	Time	Numbers	Labels	Structure
GLM-OCR	100.0	2.1s	100%	100%	100%
MinerU2.5	100.0	3.8s	100%	100%	100%
LightOnOCR-2	93.3	2.8s	100%	100%	67%
DotsOCR v1.5	93.3	9.5s	100%	100%	67%
FireRed-OCR	93.3	6.8s	100%	100%	67%
Nanonets-OCR2	86.7	6.1s	83%	100%	67%
Nemotron Parse	80.0	15.7s	100%	100%	0%
DeepSeek-OCR-2	80.0	1.5s	100%	100%	0%
Granite Vision	76.2	9.1s	100%	43%	67%
Dolphin-v2	75.7	19.6s	100%	86%	0%
PaddleOCR-VL	58.6	1.0s	100%	29%	0%
GOT-OCR 2.0	10.0	159.1s	0%	0%	0%

oman-2040-en p31 — Dense Economic KPI Table

21 numeric values across economic indicators with percentages and fractions.

Model	Score	Time	Numbers	Labels	Structure
LightOnOCR-2	100.0	2.9s	100%	100%	100%
DotsOCR v1.5	100.0	7.1s	100%	100%	100%
GLM-OCR	100.0	2.2s	100%	100%	100%
FireRed-OCR	100.0	6.2s	100%	100%	100%
MinerU2.5	100.0	2.8s	100%	100%	100%
Granite Vision	93.3	9.3s	100%	100%	67%
Nanonets-OCR2	86.7	12.9s	100%	100%	33%
Nemotron Parse	80.0	1.7s	100%	100%	0%
Dolphin-v2	67.1	21.9s	76%	89%	0%
PaddleOCR-VL	63.3	2.2s	100%	44%	0%
DeepSeek-OCR-2	51.0	12.3s	52%	67%	0%
GOT-OCR 2.0	10.0	159.1s	0%	0%	0%

oman-2040-en p46 — Strategic Directions Visual (1549 Drawings)

Text-heavy strategic visual with 12 direction items.

Model	Score	Time	Labels	Structure
Most models	94-100	0.7-10.4s	100%	80-100%
GOT-OCR 2.0	100.0	160.1s	100%	100%
DeepSeek-OCR-2	45.0	12.3s	0%	100%

All models handle this text-heavy page well except DeepSeek-OCR-2 (hit 4096 token limit, 0% labels). MinerU2.5 (94.0) and Dolphin-v2 (100.0) both perform well here.

kfd p1 — Insurance Product Comparison (Merged Headers)

3-column table with AED amounts (3,500,000, 6,770, 5,000, etc.) and merged headers.

Model	Score	Time	Numbers	Labels	Structure
LightOnOCR-2	93.3	2.9s	100%	100%	67%
DotsOCR v1.5	93.3	5.5s	100%	100%	67%
GLM-OCR	93.3	2.1s	100%	100%	67%
Nanonets-OCR2	93.3	10.6s	100%	100%	67%
DeepSeek-OCR-2	93.3	1.4s	100%	100%	67%
MinerU2.5	93.3	1.9s	100%	100%	67%
Dolphin-v2	93.3	8.0s	100%	100%	67%
FireRed-OCR	90.6	6.8s	100%	91%	67%
Granite Vision	90.6	7.1s	100%	91%	67%
Nemotron Parse	80.0	1.2s	100%	100%	0%
PaddleOCR-VL	37.8	16.0s	29%	54%	0%
GOT-OCR 2.0	10.0	159.8s	0%	0%	0%

kfd p2 — Insurance Benefits Continuation

Simpler table with fewer numbers and clear labels.

Model	Score	Time	Numbers	Labels	Structure
LightOnOCR-2	100.0	3.2s	100%	100%	100%
DotsOCR v1.5	100.0	9.4s	100%	100%	100%
GLM-OCR	100.0	1.7s	100%	100%	100%
Nanonets-OCR2	100.0	10.5s	100%	100%	100%
FireRed-OCR	100.0	5.3s	100%	100%	100%
MinerU2.5	100.0	2.9s	100%	100%	100%
Dolphin-v2	100.0	13.5s	100%	100%	100%
Granite Vision	93.3	5.2s	100%	100%	67%
GOT-OCR 2.0	86.7	159.8s	100%	100%	33%
Nemotron Parse	80.0	1.8s	100%	100%	0%
DeepSeek-OCR-2	80.0	1.7s	100%	100%	0%
PaddleOCR-VL	77.3	1.2s	100%	91%	0%

Key Findings

1. LightOnOCR-2 is the best English OCR model

Highest quality: 92.8/100 average across all 9 pages — #1 out of 13 models tested
Fastest tier: 3.0s/page average
Smallest: ~2 GB VRAM — can run alongside DotsOCR v1.5 on a single GPU
Reads everything: tables, infographics, charts, diagrams, text — unlike DotsOCR which skips picture content
Perfect scores on 3/9 pages (100.0), never below 77.0
Limitation: English-only — garbles Arabic numerals

1b. Qwen3.5 models nearly tie on quality but slower

Qwen3.5-4B-AWQ: 92.5/100, 16.5s/page, ~4 GB — only 0.3 points behind LightOnOCR-2 but 5.5x slower
Qwen3.5-2B-AWQ: 90.1/100, ~2.4 GB — most pages 1.4-3.3s (competitive with LightOnOCR-2!) but adobe p3 spikes to 65s (slow prefill on complex diagram, 9 tok/s)
Qwen3.5-0.8B: 77.3/100, ~1.7 GB — great on simple tables (97.3 on kfd p1) but degenerates on complex pages (21.8 on adobe p3, 51.5 on oman p46 — hits 8192 token limit). Too inconsistent for production
Both require vLLM with transformers 5.x and --default-chat-template-kwargs '{"enable_thinking": false}'
Tested with recommended prompt "qwenvl markdown" and params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5)
Note: Qwen3.5-2B output on adobe p3 is actually more accurate than ground truth expectations — it reads the real text (SOC 2, ISO 27001, PCI DSS) from the diagram rather than label abbreviations (HIPAA, FISMA, NIST) that appear in a different visual element. The 77.0 score understates its quality.
Production currently uses Qwen3-VL (not 3.5), which doesn't have the thinking mode issue

2. FireRed-OCR-2B and Nanonets-OCR2-3B are strong runners-up

FireRed-OCR: 89.4/100, 5.5s/page, ~5 GB — good all-rounder, based on Qwen3-VL-2B
Nanonets-OCR2: 89.7/100, 8.8s/page, ~8 GB — slightly better score but slower and heavier
Both use recommended prompts and produce HTML tables
Neither beats LightOnOCR-2 on the quality/speed/VRAM combination

2b. HunyuanOCR is fastest but needs vLLM nightly setup

87.3/100, 1.4s/page (fastest of all models), ~2 GB native VRAM (1B params)
Three perfect 100.0 scores (adobe p5, oman p46, kfd p2)
Weak on diagrams (35.4 on adobe p3 — same weakness as GLM-OCR)
Tested with recommended Chinese doc parsing prompt ("请将图片中的内容按照文档解析的规范转成markdown格式输出")
Requires vLLM 0.17.1 + transformers from git main (TransformersMultiModal fallback, uses ~42 GB VRAM unoptimized). Native vLLM hunyuan_vl support not yet in stable releases
Custom license (not Apache 2.0)

3. GLM-OCR is fastest but table-only

86.9/100, 1.9s/page (fastest), ~2 GB — excellent on pure tables (4 perfect scores)
Fails on diagrams/infographics (33.2 on adobe p3) because "Table Recognition:" prompt skips non-table content
Requires separate venv with transformers 5.x — operational complexity

4. DotsOCR v1.5 is strong on tables but blind to infographics

84.4/100 average, but drops to 30.0 on infographics (oman p10) and 56.5 on diagrams (adobe p3)
Its JSON layout prompt classifies charts as "Picture" and omits text extraction
Excellent on pure table pages (93-100)
Remains the only model with correct Arabic numeral extraction (6/6)

5. Nemotron Parse v1.1 has perfect numbers but no table structure

79.8/100 — all numbers correct on every page, but outputs LaTeX instead of markdown/HTML tables
Structure score is 0% on 6/9 pages because LaTeX tables aren't detected as tables
Extremely fast (3.0s avg) but the LaTeX output requires post-processing

5b. MinerU2.5-1.2B is the best two-stage pipeline model

85.4/100, 2.4s/page, ~2 GB — compact and fast
Uses MinerUClient library for proper two-stage extraction (layout detection → per-region recognition)
Outputs HTML tables via Table Recognition: prompt, text via Text Recognition:
Weak on infographics (23.1 on oman p10 — can't read chart data)
Requires --logits-processors mineru_vl_utils:MinerULogitsProcessor and pip install mineru-vl-utils[vllm]

6. DeepSeek-OCR-2 limited by 8192 context and grounding mode

67.3/100 with recommended <|grounding|> prompt (was 75.9 with generic prompt — grounding mode wastes tokens on bounding box coordinates and produces fragmented table HTML)
Fast (1.9s/page) but 8192 context limit restricts output
MoE architecture (3B total, ~570M active) — 6.5 GB VRAM

7. Dolphin-v2 slow but decent with proper pipeline

72.9/100, 11.4s/page — two-stage pipeline (reading order → per-element crop + type-specific prompts)
Good on tables (93-100 on kfd) but slow due to multiple API calls per element
Can't read infographics (23.0 on oman p10)
ByteDance model, Qwen2.5-VL-3B based, ~7 GB VRAM

8. Granite Vision 3.3 2B unreliable

71.7/100 (re-tested with recommended temp=0.2) — degenerated on adobe p3 and p4 (78-79s each, hit 8192 tokens)
Good on simple tables (85-94) but inconsistent on complex pages

9. MonkeyOCR-pro-3B — not benchmarked (requires own pipeline)

Three-component system: Structure (YOLO via PaddlePaddle) + Relation + Recognition (Qwen2.5-VL)
Cannot run via vLLM alone — requires its own conda env with PaddlePaddle and monkeyocr library
Claims +8.6% over MinerU on OmniDocBench tables, but could not be fairly tested in our setup

10. PaddleOCR-VL-1.5 and GOT-OCR 2.0 not competitive

PaddleOCR: 58.3/100 — degenerates on complex pages (4/9 pages hit token limit)
GOT-OCR: 40.5/100 at 160s/page — worst quality and slowest by far

Recommendation

English fast tier: LightOnOCR-2 (1B)

Best quality (92.8), fast (3.0s), smallest (2 GB)
Handles all content types: tables, charts, infographics, text
Runner-up: FireRed-OCR-2B (89.4, 5.5s, 5 GB) if LightOnOCR-2 has issues

Arabic tier: DotsOCR v1.5 (14B)

Only model with 6/6 Arabic numeral extraction
Strong on tables (93-100), weak on infographics (30)

Production Stack vs LightOnOCR-2 (10-run stability test)

Both systems tested 10 times on all 9 benchmark pages to measure consistency.

Metric	Prod Stack (DotsOCR + Qwen + Paddle)	LightOnOCR-2 (single model)
Mean score	89.7/100	92.8/100
Std deviation	0.7	0.0
Min / Max	88.5 / 91.1	92.8 / 92.8
Mean time/page	14.6s	2.7s
Std dev time	0.5s	0.1s
Models	3 (DotsOCR 14B + Qwen 4B + PaddleOCR)	1 (LightOnOCR-2 1B)
VRAM	~22 GB	~2 GB

Production stack variance comes from Qwen secondary (temp=0.7) on crops — pages like adobe p3 (std=3.8) and oman p10 (std=3.7) fluctuate between runs. LightOnOCR-2 is perfectly deterministic (temp=0.0) with zero variance on all pages.

The production stack scores well (89.7) because DotsOCR handles tables and Qwen handles Picture crops. But LightOnOCR-2 handles everything in a single call — tables, charts, infographics, text — with higher quality, 5.4x faster, and zero variance.

Recommendation

English fast tier: LightOnOCR-2 (1B)

Best quality (92.8), fastest (2.7s), smallest (2 GB), deterministic
Handles all content types: tables, charts, infographics, text
Runner-up: FireRed-OCR-2B (89.4, 5.5s, 5 GB) if LightOnOCR-2 has issues

Arabic tier: DotsOCR v1.5 (14B)

Only model with 6/6 Arabic numeral extraction
Strong on tables (93-100), weak on infographics (30)

Proposed routing: Detect document language → English pages to LightOnOCR-2, Arabic pages to DotsOCR v1.5. Both models fit on a single L40S GPU (~16 GB combined). See Router Benchmark for the language detection model evaluation.