# OCR Model Comparison Report
- Date: 2026-03-11 (updated with AIN-7B, Nemotron Parse v1.1, GOT-OCR 2.0, Nemotron Nano VL, Baseer, MiniCPM-V 4.5, DeepSeek-VL2-Tiny, ERNIE-4.5-VL-28B, Arabic-Nougat, OCRFlux-3B, Chandra-OCR, OlmOCR-2, Granite Vision 3.3, NuMarkdown-8B, Surya, DIMI-Arabic-OCR-V2, Arabic-Legal-OCR, HunyuanOCR, GLM-OCR, FireRed-OCR-2B, InternVL3.5-4B benchmarks, QARI-OCR Arabic table test, Qwen table comparison)
- GPU: NVIDIA L40S (46 GB VRAM)
- Test documents: kfd.pdf (English/Arabic motor insurance), oman-2040-en.pdf (52-page government vision document), adobe-6-page.pdf (compliance white paper with diagrams/infographics)
See also: Hard English Benchmark — focused English-only evaluation of 17 models with ground truth scoring on 9 difficult pages.
## Models Tested
| Model | Params | VRAM (weights) | Serving | License |
|---|---|---|---|---|
| DotsOCR v1.0 (baseline) | 3B | 14.2 GB (FP8) | vLLM v0.8.5 | MIT |
| DotsOCR v1.5 | 3B | 5.72 GB (BF16) | vLLM v0.11.0 | MIT |
| DotsOCR v1.5-SVG | 3B | 5.72 GB (BF16) | vLLM v0.11.0 | MIT |
| Qwen3-VL-2B-Instruct-FP8 (current secondary) | 2B | 2.93 GB (FP8) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-4B-Instruct-AWQ-4bit | 4B | ~4 GB (INT4) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-4B-Instruct | 4B | ~8 GB (BF16) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-8B-Instruct-AWQ-4bit | 8B | ~7 GB (INT4) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-8B-Instruct | 8B | ~17 GB (BF16) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-30B-A3B-Instruct-FP8 | 30B MoE (3B active) | ~31 GB (FP8) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3.5-2B | 2B | 4.3 GB (BF16) | vLLM nightly | Apache-2.0 |
| Qwen3.5-4B | 4B | 8.8 GB (BF16) | vLLM nightly | Apache-2.0 |
| Qwen3.5-4B-AWQ-4bit | 4B | 3.8 GB (INT4) | vLLM nightly | Apache-2.0 |
| Qwen3.5-9B | 9B | 19 GB (BF16) | vLLM nightly | Apache-2.0 |
| Qwen3.5-9B-AWQ-4bit | 9B | 8.5 GB (INT4) | vLLM nightly | Apache-2.0 |
| Qwen3.5-35B-A3B-GPTQ-Int4 | 35B MoE (3B active) | 23 GB (INT4) | vLLM nightly | Apache-2.0 |
| DeepSeek-OCR-2 | 3B MoE (~570M active) | 6.46 GB (BF16) | vLLM nightly | Apache-2.0 |
| Granite-Docling-258M (IBM) | 258M | 0.52 GB (BF16) | vLLM v0.11.0 | Apache-2.0 |
| PaddleOCR-VL-1.5 (Baidu) | 0.9B | ~1.8 GB (BF16) | vLLM nightly | Apache-2.0 |
| Nanonets-OCR2-3B | 3B (Qwen2.5-VL) | ~8 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| LightOnOCR-2-1B | 1B | ~2 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| QARI-OCR v0.3 (NAMAA-Space) | 2B (Qwen2-VL) | ~4 GB (BF16) | vLLM v0.17.0 | — |
| FireRed-OCR-2B (FireRedTeam) | 2B (Qwen3-VL-2B) | ~5 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| InternVL3.5-4B (OpenGVLab) | 4.7B (0.3B vision + 4.4B LLM) | ~9.5 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| GLM-OCR (Zhipu/zai-org) | 0.9B | ~2 GB (BF16) | vLLM v0.17.0 + transformers 5.x | MIT |
| HunyuanOCR (Tencent) | 1B (0.4B ViT + 0.5B LLM) | ~2 GB (BF16) | vLLM nightly | Other |
| Chandra-OCR (ChandraAI) | 9B (Qwen2.5-VL-7B) | ~16 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| OlmOCR-2-7B-FP8 (Allen AI) | 7B (Qwen2.5-VL-7B FP8) | ~8 GB (FP8) | vLLM v0.17.0 | Apache-2.0 |
| Granite Vision 3.3 2B (IBM) | 2B | ~4 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| Arabic-Legal-Documents-OCR (Moha) | 4B (Gemma-3-4B) | ~9 GB (BF16) | vLLM v0.17.0 | — |
| DIMI-Arabic-OCR-V2 (AhmedZaky1) | 7B (Qwen2.5-VL-7B + LoRA) | ~16 GB (BF16) | vLLM v0.17.0 | — |
| NuMarkdown-8B-Thinking (NuMind) | 8B (Qwen2.5-VL-7B) | ~16 GB (BF16) | vLLM v0.17.0 | MIT |
| DeepSeek-VL2-Tiny | 3.4B MoE (1B active) | ~7 GB (BF16) | vLLM v0.17.0 | MIT |
| ERNIE-4.5-VL-28B-A3B-AWQ-4bit (Baidu) | 28B MoE (3B active) | ~15 GB (INT4) | vLLM v0.17.0 | Apache-2.0 |
| Arabic-Nougat-Large (MohamedRashad) | ~400M | ~0.8 GB (BF16) | Standalone (transformers) | MIT |
| OCRFlux-3B (ChatDOC) | 3B (Qwen2.5-VL-3B) | ~6 GB (BF16) | vLLM v0.17.0 | — |
| Surya (Datalab) | ~500M (multiple models) | ~10 GB (detection+OCR+table) | Standalone (pip) | GPL-3.0 |
| Baseer (Misraj/Baseer__Nakba) | 4B (Qwen2.5-VL-3B) | ~7 GB (BF16) | vLLM v0.17.0 | — |
| MiniCPM-V 4.5 AWQ (OpenBMB) | 8.7B | ~5 GB (INT4) | vLLM v0.17.0 | Apache-2.0 |
| Nemotron Nano VL 8B (NVIDIA) | 8B | ~16 GB (BF16) | vLLM v0.17.0 | Llama 3.1 Community |
| GOT-OCR 2.0 (StepFun) | 580M | ~1.2 GB (BF16) | Standalone (transformers) | Apache-2.0 |
| Nemotron Parse v1.1 (NVIDIA) | 885M | ~1.8 GB (BF16) | vLLM v0.17.0 | — |
| AIN-7B (MBZUAI) | 8B (Qwen2-VL-7B) | ~16 GB (BF16) | vLLM v0.17.0 | — |
## Speed Comparison — Oman 2040 (Full Pages)
### OCR / Document Parsing Mode
| Page | Content Type | DotsOCR v1.0 | DotsOCR v1.5 | DeepSeek-OCR-2 | Granite-Docling (vLLM) | PaddleOCR-VL (OCR) | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|---|---|---|---|---|
| 7 | Quote/directive | 7.2s | 0.7s | 2.4s | 0.5s | 0.2s | 3.8s | 2.9s |
| 10 | Vision chart | 3.6s | 1.5s | 11.0s | 0.6s | 0.6s | 3.0s | 7.4s |
| 16 | Mixed text/diagram | 5.5s | 3.2s | 7.9s | 1.4s | 0.7s | 3.6s | 1.3s |
| 21 | Performance table | 7.3s | 5.8s | 10.7s | 1.0s | 7.2s* | 7.2s | 3.0s |
| 23 | Performance table | 5.2s | 2.9s | 6.6s | 0.7s | 7.2s* | 4.2s | 1.5s |
| 25 | Performance table | 5.4s | 3.2s | 6.1s | 1.3s | 7.2s* | 5.1s | 1.7s |
| Avg (6 pages) | — | 4.9s | 2.9s | 6.5s | 0.9s | — | 4.5s | 3.0s |
*PaddleOCR-VL OCR mode degenerates on table pages (repeats one word to max tokens). Use Table Recognition mode instead.
### Table Recognition Mode (PaddleOCR-VL only)
| Page | DotsOCR v1.0 | DotsOCR v1.5 | PaddleOCR-VL (Table) | Speedup vs v1.0 |
|---|---|---|---|---|
| 21 (full page) | 7.3s | 5.8s | 1.0s | 7.3× |
| 23 (full page) | 5.2s | 2.9s | 0.3s | 17× |
| 25 (full page) | 5.4s | 3.2s | 0.5s | 11× |
| 21 (crop) | — | 5.6s | 1.0s | — |
| 25 (crop) | — | 1.7s | 0.5s | — |
## Table Extraction Quality — Oman Page 21 (Performance Indicators, 8 rows × 4 columns)
### DotsOCR v1.0 — BEST OVERALL (BASELINE)
- Output: Structured HTML `<table>` with `<thead>`, `<tbody>`, `<tr>`, `<td>`, `<th>`
- Accuracy: All 8 indicator names correct, all baseline values correct, all targets correct
- OCR: "Average", "Omani", "Quacquarelli Symonds" — all correct
- Structure: Proper column alignment, multiline cell values preserved
### DotsOCR v1.5 — MATCHES V1.0 QUALITY, 1.3× FASTER
- Output: Same structured HTML `<table>` format as v1.0
- Accuracy: All 8 indicator names correct, all values correct, all targets correct
- OCR: "Quacquarelli Symonds", "Omani" — all correct
- Structure: Proper 4-column alignment with `<thead>`/`<tbody>`
- Speed: 5.8s (1.3× faster than v1.0's 7.3s)
- Crop test: Also works on cropped tables (5.6s for crop P21)
### PaddleOCR-VL-1.5 (Table Recognition) — EXCELLENT
- Output: Structured cell format using `<fcel>`/`<lcel>`/`<ucel>`/`<ecel>`/`<nl>` tags
- Accuracy: All 8 indicator names correct, all baseline values correct, all targets correct
- OCR: "Symonds", "Omani" — all correct, matches DotsOCR quality
- Structure: Proper 4-column alignment, works on both full pages and crops
- Speed: 1.0s (7.3× faster than DotsOCR v1.0)
- Note: Cell format is easily convertible to HTML/markdown in post-processing
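Since the cell format is noted above as easily convertible, here is a minimal post-processing sketch. It assumes OTSL-style tag semantics (`<fcel>` = filled cell, `<ecel>` = empty cell, `<lcel>`/`<ucel>` = left/up merge continuation, `<nl>` = row break) — these semantics are an assumption for illustration, not defined in this report:

```python
import re

# Assumed OTSL-style semantics (not confirmed by this report):
# <fcel>text -> filled cell, <ecel> -> empty cell,
# <lcel>/<ucel> -> continuation of a left/up merge, <nl> -> end of row.
def cells_to_markdown(raw: str) -> str:
    rows, row = [], []
    for tag, text in re.findall(r"<(fcel|ecel|lcel|ucel|nl)>([^<]*)", raw):
        if tag == "nl":
            rows.append(row)
            row = []
        elif tag == "fcel":
            row.append(text.strip())
        else:  # ecel / lcel / ucel: render empty or merged cells as blank
            row.append("")
    if row:
        rows.append(row)
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(rows[0]) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)
```

The first row is treated as the header; merged cells are flattened to blanks, which loses the span information but is enough for markdown output.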
### DeepSeek-OCR-2 (Full Page) — FRAGMENTED
- Output: ~30 separate `<|ref|>text<|/ref|>` regions with bounding boxes
- Accuracy: Most text correct, some errors ("Rverage", "Omni")
- Structure: No table structure — each cell is an independent text block
- Note: Cannot reconstruct table from individual regions without spatial reasoning
### DeepSeek-OCR-2 (Cropped Table) — STILL FRAGMENTED
- Output: Same fragmented text regions even on cropped table images
- Conclusion: DeepSeek only produces HTML tables for visually bordered/gridded tables (worked on kfd.pdf which has clear grid lines), NOT for styled/colored table layouts
### Granite-Docling (vLLM) — JUMBLED
- Output: Single text blob with `<loc_x><loc_y>text` coordinate format
- Accuracy: Wrong numbers, missing values ("Value: (2018) Rank: 69/127")
- Structure: No table structure — vLLM serving outputs raw location tokens, not proper DocTags (`<doctag>`, `<text>`, `<table>`)
- Stability: With `repetition_penalty=1.1`, zero degenerations across 10 tests (0.4–2.0s each). Without it, critical degeneration loops (451s, 7668 tokens)
- Note: The `docling` library's `DocTagsDocument` converter cannot parse vLLM's `<loc_x>` format — it only works with proper DocTags from raw transformers inference
### Granite-Docling (Raw Transformers + Docling) — GOOD BUT SLOW
- Output: Proper markdown table via DocTags → Docling library conversion
- Accuracy: All values correct, proper column alignment
- Speed: 27.8s (5.7× slower than DotsOCR)
- Note: Only way to get proper DocTags output; vLLM serving mode loses the format entirely
## Table Extraction Quality — kfd.pdf (English Bordered Tables)
### DotsOCR v1.0
- Speed: ~3.5s per page
- Quality: Correct HTML tables, accurate numbers, proper structure
### DotsOCR v1.5
- Speed: 5.5s per page
- Quality: Correct HTML tables, all AED amounts correct (AED 3,500,000, AED 6,770, AED 5,000, etc.)
- Structure: Same `<table>`/`<thead>`/`<tbody>` format as v1.0, all 3 product columns detected
### PaddleOCR-VL-1.5 (Table Recognition)
- Speed: 0.9s per page (3.9× faster than v1.0)
- Quality: Correct cell structure with `<fcel>`/`<lcel>`/`<nl>` tags, all AED amounts correct (AED 3,500,000, AED 6,770, AED 5,000, etc.)
- Structure: All 3 product columns (Motor Value, Motor Smart, Motor Executive) correctly detected
### DeepSeek-OCR-2 (Cropped Tables)
- Speed: ~5.5s per crop via vLLM
- Quality: Proper HTML tables with `<tr>`, `<td>`, `colspan` — works well on bordered tables
- Note: Only works on visually clear gridded tables
## Arabic Table Extraction — kfd.pdf (Page 4)
### DotsOCR v1.0 — BEST FOR ARABIC
- Numbers: Correct (e.g., ٣,٥٠٠,٠٠٠ AED)
- Structure: Rows sometimes merged
- Text: Accurate Arabic text
### DotsOCR v1.5 — MATCHES V1.0 ARABIC QUALITY
- Speed: 10.7s (full page), 8.7s (full crop), 4.3s (top crop)
- Numbers: ALL correct — ٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠, ٣,٥٠٠, ٢٠,٠٠٠, ٧,٥٠٠, ١,٠٠٠, ٢٠٠,٠٠٠, ٣٠٠, ٥٠٠ درهم إماراتي
- Structure: Proper HTML table, all rows and columns correct
- Text: Accurate Arabic text — التغطيات التأمينية, مسؤولية الغير, etc.
- Extra rows: Extracted the additional rows that v1.0 also captured (emergency repairs, new vehicle, taxi, personal accident ٢٠٠,٠٠٠)
- Minor crop issue: Top-half crop garbled ambulance cost (٧٧,٧٧٧,٠٠٠ instead of ٦,٧٧٠), but full page was correct
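The "X/6 numerals" scores used throughout this section can be reproduced mechanically against the six reference AED values from this table; a minimal sketch (illustrative, not the exact scoring harness used for these tests):

```python
# The six reference AED values scored as "X/6" in this section
# (from the kfd.pdf page-4 table); the scoring logic is illustrative.
REFERENCE_NUMERALS = ["٣,٥٠٠,٠٠٠", "٦,٧٧٠", "٥,٠٠٠", "٣,٥٠٠", "٢٠,٠٠٠", "٧,٥٠٠"]

def score_numerals(model_output: str) -> str:
    text = model_output
    found = 0
    # Check longer values first and strip each match, so that e.g.
    # ٣,٥٠٠ is not also counted inside ٣,٥٠٠,٠٠٠.
    for n in sorted(REFERENCE_NUMERALS, key=len, reverse=True):
        if n in text:
            found += 1
            text = text.replace(n, "", 1)
    return f"{found}/{len(REFERENCE_NUMERALS)}"
```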
### PaddleOCR-VL-1.5 — Good Structure, Bad Numbers
Tested in 4 configurations:
| Test | Mode | Time | Table Structure | AED Numbers | Stability |
|---|---|---|---|---|---|
| Full page | Table Recognition | 1.4s | Good (3 columns) | Missing — "مقاراتي نعم" instead of values | Stable |
| Full table crop | Table Recognition | 7.2s | Good (4 columns) | Partially — "7,000.00 إماراتي" then degenerated | Degenerated |
| Top half crop | Table Recognition | 0.6s | Good | Missing — "درهم إماراتي" without values | Stable |
| Full table crop | OCR | 0.9s | No (flat text) | Missing entirely | Stable |
- Arabic text labels: Mostly correct — "التغطيات التأمينية الرئيسية", "مسؤولية الغير", etc.
- Critical weakness: AED numbers consistently missing or replaced with garbage ("مقاراتي" = nonsense)
- Full table crop degenerated: Output "0.000000..." repeated to 4096 tokens
### DeepSeek-OCR-2
- Numbers: Garbled — missing digits (٣,,٠٠٠ instead of ٣,٥٠٠,٠٠٠)
- Structure: Better row separation than DotsOCR
- Text: Decent Arabic text but with errors
### Granite-Docling
- Arabic: Not tested (documented as "experimental" Arabic support)
### QARI-OCR v0.3 (Arabic-Specialized) — DEGENERATED
Tested with recommended params (temperature=0.7, top_p=0.9) and recommended prompt ("Below is the image of one page of a document...Just return the plain text representation...Do not hallucinate."). Also tested with temperature=0.0 and a generic OCR prompt. Both configurations degenerated.
- Numbers: ALL zero — `.....,0,0,0 درهم إماراتي` repeated to the token limit. No actual AED values extracted
- Structure: No table structure — flat HTML `<h4>` tags with repeated placeholder text
- Arabic text labels: Partially correct (التغطيات التأمينية found) but most labels missing
- English page (kfd p1): Also degenerated — `<h1></h1><br>` repeated to 2000 tokens, zero content
- Verdict: Despite being purpose-built for Arabic OCR, QARI-OCR cannot handle tabular documents with numbers via vLLM. Significantly worse than DotsOCR v1.5
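QARI's failure mode — one short pattern repeated to the token limit — recurs with several models below (PaddleOCR-VL crops, OCRFlux, GOT-OCR). A simple heuristic can flag such outputs automatically; the thresholds here are arbitrary choices, not values used in these tests:

```python
def is_degenerate(text: str, min_repeats: int = 20, max_pattern_len: int = 20) -> bool:
    """Flag outputs that repeat one short pattern to the token limit,
    e.g. '<h1></h1><br>' * 2000 or a '٣٠٣٠٣٠...' digit loop (heuristic)."""
    for plen in range(1, max_pattern_len + 1):
        pattern = text[-plen:]
        if pattern and text.endswith(pattern * min_repeats):
            return True
    return False
```

Running this on each page's output before accepting it would catch these loops without manual inspection.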
### Qwen3.5-4B-AWQ-4bit — BEST NON-DOTSOCR ARABIC
Tested with recommended Qwen params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5) and recommended prompt "qwenvl markdown" (production secondary/image prompt). Served via vLLM v0.17.0 (glmocr-venv with transformers 5.3.0).
- Speed: 73.3s (very slow — 6.9× slower than DotsOCR v1.5)
- Numbers: 3/6 Arabic numerals correct — ٣,٥٠٠,٠٠٠, ٥,٠٠٠, ٣,٥٠٠. Missing: ٦,٧٧٠ (rendered as ٦,٧٧,٠٠٠), ٢٠,٠٠٠ (rendered as ٢,٠٠٠,٠٠٠), ٧,٥٠٠ (rendered as ٧,٥٠,٠٠٠). Numbers have wrong comma grouping — extra zeros added
- Structure: Markdown table with proper columns and section headers
- Arabic text: Reversed (LTR instead of RTL) — individual words correct but reading direction wrong. Labels like ةينيمأتلا تايطغتلا (reversed التغطيات التأمينية) present
- English page (kfd p1): 3.2s, all values correct, clean markdown tables
- Oman p21: 4.5s, all 6 key values correct (32.8, 0.938, 71.6, 43.93, Quacquarelli, Omani) but rendered as bullet lists instead of table
- Also tested with generic prompt ("Convert the content of the image to Markdown format."): Got 5/6 numerals correct — ٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠ (missing ٢٠,٠٠٠ → ٢,٠٠٠). Better than the recommended prompt but not the official configuration
- Verdict: Good Arabic numeral extraction (3–5/6 numerals depending on prompt). But very slow on Arabic (73s) and text is reversed. Not a DotsOCR replacement
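The reversed-text failure seen here (and again with ERNIE-4.5-VL below) is easy to detect automatically, since LTR-rendered RTL text is just the codepoint-reversed string; a minimal check:

```python
def looks_reversed(output: str, label: str) -> bool:
    """True if a reference label appears only codepoint-reversed
    (LTR rendering of RTL text), as with 'ةينيمأتلا تايطغتلا'."""
    return label not in output and label[::-1] in output

# Usage: compare model output against a known label from the source page.
label = "التغطيات التأمينية"
reversed_form = label[::-1]  # what Qwen3.5 emitted for this label
```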
### HunyuanOCR (Tencent) — BEST ARABIC TEXT, TRUNCATED NUMBERS
Tested with recommended params (temperature=0.0) and the recommended Chinese document parsing prompt "提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。" (roughly: "Extract all body text from the document image as markdown; ignore headers and footers; express tables in HTML and formulas in LaTeX; parse in reading order." — all recommended prompts on the model card are in Chinese). Served via vLLM nightly (glmocr-venv). 1B params (0.4B ViT + 0.5B LLM), ~2 GB VRAM. Supports 100+ languages including Arabic. Requires `--no-enable-prefix-caching --mm-processor-cache-gb 0`.
- Speed: 5.1s (2× faster than DotsOCR v1.5's 10.7s on Arabic)
- Numbers: 4/6 Arabic numerals correct — ٣,٥٠٠,٠٠٠, ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠. Missing: ٦,٧٧٠ (rendered as ٦,٧٧), ٢٠,٠٠٠ (rendered as ٢,٠٠٠). Last digit still dropped on some numbers
- Arabic text: BEST of any non-DotsOCR model — proper RTL direction, all 3 reference labels matched (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي). Fully readable Arabic with correct word order
- Structure: HTML `<table>` with `<td>`, `colspan` — proper table structure with recommended prompt
- Arabic chars: 1,397 (comparable to DotsOCR's ~1,500)
- English page (kfd p1): 2.1s, all values correct (3,500,000, 6,770, 5,000), HTML table structure
- Oman p21: 2.1s, all 6 key values correct (32.8, 0.938, 71.6, 43.93, Quacquarelli, Omani), HTML table
- Oman p23/p25: 0.8-1.0s, correct values, HTML tables
- Also tested with generic English prompt ("Extract the text in the image"): Only 2/6 Arabic numerals correct, flat text output with no table structure. The Chinese prompt is significantly better
- Also tested with Chinese table prompt ("把图中的表格解析为HTML。" — "Parse the table in the image into HTML."): 2/6 Arabic numerals — same as the generic English prompt
- Verdict: HunyuanOCR has the best Arabic text quality of any non-DotsOCR model — proper RTL, correct labels, readable output, HTML tables. 4/6 Arabic numerals correct with the recommended doc parsing prompt (best non-DotsOCR result). But it still truncates some numbers (٦,٧٧ instead of ٦,٧٧٠). Fast (0.8–5.1s). Not a full DotsOCR replacement but the closest competitor for Arabic
### Chandra-OCR (9B) — SECOND-BEST ARABIC NUMERALS (5/6)
Tested with recommended prompt "Convert the following image to markdown format." and temperature=0.0. Served via vLLM v0.17.0 (glmocr-venv). 9B model based on Qwen2.5-VL-7B, ~16 GB VRAM.
- Speed: 52.3s (very slow — 4.9× slower than DotsOCR v1.5)
- Numbers: 5/6 Arabic numerals correct — ٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠. Missing: ٢٠,٠٠٠ (rendered as ٢,٠٠٠)
- Structure: HTML `<table>` with `<thead>`, `<tbody>`, `<th>`, `<td>`, `colspan` — best HTML output of all tested models, proper RTL with `style="text-align: center/right"`
- Arabic text: Excellent — proper RTL, all 3 reference labels matched (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي). Clean readable Arabic
- English page (kfd p1): All values correct, clean HTML tables
- Verdict: Second-best Arabic numeral accuracy after DotsOCR v1.5 (5/6 vs 6/6). Best HTML output quality. But extremely slow (52s) and requires ~16 GB VRAM — too large to run alongside other models
### OlmOCR-2-7B-FP8 (Allen AI) — 3/6 ARABIC NUMERALS
Tested with recommended prompt (OlmOCR's built-in document parsing prompt) and temperature=0.0. Served via vLLM v0.17.0. 7B model, ~8 GB VRAM (FP8).
- Speed: 25.3s
- Numbers: 3/6 Arabic numerals correct — ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠. Missing: ٣,٥٠٠,٠٠٠ (values merged across columns), ٦,٧٧٠ (rendered as ٦,٧٧), ٢٠,٠٠٠ (rendered as ٢٠,٠٠)
- Structure: Markdown tables with proper columns
- Arabic text: All 3 reference labels found. Some column values merged together
- English page (kfd p1): All values correct
- Verdict: Decent Arabic support (3/6 numerals) but column merging and truncated numbers. Slower than DotsOCR v1.5
### Granite Vision 3.3 2B (IBM) — 0/6 ARABIC NUMERALS
Tested with temperature=0.0 and generic OCR prompt. Served via vLLM v0.17.0. 2B model, ~4 GB VRAM.
- Speed: 18.8s
- Numbers: 0/6 Arabic numerals — all numbers westernized (3,500,000 instead of ٣,٥٠٠,٠٠٠)
- Structure: Markdown tables
- Arabic text: Westernized — Arabic labels present but numerals converted to Western
- English page (kfd p1): All values correct
- Verdict: Not viable for Arabic — converts all Arabic numerals to Western digits
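If a model westernizes digits but otherwise preserves labels and structure, the numerals can be mapped back in post-processing; a sketch — note this is only sound when the source document is known to use Arabic-Indic digits, which the model output alone cannot confirm:

```python
# Map Western digits back to Arabic-Indic (U+0660-U+0669).
# Only a valid post-fix when the source page used Arabic-Indic digits;
# that fact is not recoverable from the model output itself.
WEST_TO_ARABIC = str.maketrans("0123456789", "٠١٢٣٤٥٦٧٨٩")

def easternize(text: str) -> str:
    return text.translate(WEST_TO_ARABIC)
```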
### Arabic-Legal-Documents-OCR (Moha) — 0/6 ARABIC NUMERALS
Tested with recommended prompt from model card. Served via vLLM v0.17.0. 4B model based on Gemma-3-4B, ~9 GB VRAM.
- Speed: 29.5s
- Numbers: 0/6 Arabic numerals — no numbers extracted at all
- Structure: No table structure — flat text output
- Arabic text: Labels partially found but garbled. Outputs mostly unstructured text
- English page (kfd p1): Partial values only
- Verdict: Despite being trained on Arabic legal documents, cannot handle tables with numbers
### DIMI-Arabic-OCR-V2 (AhmedZaky1) — 0/6 ARABIC NUMERALS
Tested with the recommended prompt. Served via vLLM v0.17.0 with Qwen2.5-VL-7B base + LoRA adapter, `--gpu-memory-utilization 0.6`. ~16 GB VRAM.
- Speed: 95.4s (extremely slow — hit 4096 token limit)
- Numbers: 0/6 Arabic numerals — no Arabic numerals found
- Structure: No table structure
- Arabic text: All 3 reference labels found (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي) but no table structure or numerals
- English page (kfd p1): Partial extraction only
- Verdict: Very slow, no numerals, no table structure. LoRA adapter on Qwen2.5-VL-7B did not help with Arabic table extraction
### NuMarkdown-8B-Thinking (NuMind) — 0/6 ARABIC NUMERALS
Tested with recommended params (temperature=0.7, image-only prompt — model ignores text instructions). Served via vLLM v0.17.0. 8B model based on Qwen2.5-VL-7B, ~16 GB VRAM. MIT license.
- Speed: 41.0s
- Numbers: 0/6 Arabic numerals — all numbers garbled (e.g., "5,5,5,3 درهم إماراتي" instead of ٣,٥٠٠,٠٠٠). Numbers appear to be jumbled individual digits
- Structure: Markdown table with proper columns (best markdown structure of non-HTML models)
- Arabic text: All 3 reference labels found (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي). Good Arabic text quality
- Chain-of-thought: Model generates `<think>...</think>` reasoning tokens analyzing document layout before producing `<answer>`. Reasoning is accurate (correctly identifies RTL table, merged cells) but the final output still garbles numbers
- English page (kfd p1): 20.9s, all values correct (3,500,000, 6,770, 5,000)
- Verdict: Strong English table extraction with reasoning-based layout analysis. Arabic labels good but numbers completely garbled. No documented Arabic support
### DeepSeek-VL2-Tiny — 0/6 ARABIC NUMERALS, HALLUCINATED
Tested with a generic OCR prompt and temperature=0.0. Served via vLLM v0.17.0 with `--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`. 3.4B MoE (1B active), ~7 GB VRAM. Max context 4096 tokens.
- Speed: 8.4s (hit 2048 token limit)
- Numbers: 0/6 Arabic numerals — no actual numbers extracted. Table cells filled with "الاسم" (="name") repeated. Completely hallucinated structure
- Structure: HTML `<table>` with CSS styling but entirely fabricated content — correct number of columns but wrong headers and empty data cells
- Arabic text: 0/3 reference labels found. Hallucinated generic column headers instead of real content
- English page (kfd p1): 3.2s, all values correct (3,500,000, 6,770, 5,000) but generated unnecessary HTML boilerplate (DOCTYPE, stylesheet links)
- Verdict: General VLM, not OCR-specialized. Hallucinates Arabic table content entirely. Arabic not officially supported — DeepSeek-OCR-2 (already tested) is the OCR-specific derivative
### ERNIE-4.5-VL-28B-A3B-AWQ-4bit (Baidu) — 0/6 ARABIC NUMERALS, REVERSED TEXT
Tested with a generic OCR prompt and temperature=0.0. Community AWQ-4bit quantization from `cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit`. 28B MoE (3B active), ~15 GB VRAM (INT4). Requires `--trust-remote-code` and the `decord` package.
- Speed: 35.3s (hit 4096 token limit, includes thinking tokens)
- Numbers: 0/6 Arabic numerals — all numbers garbled ("٤,٠,...,٥ يتارامإ" instead of ٣,٥٠٠,٠٠٠ درهم إماراتي). Truncated and dots instead of digits
- Structure: HTML `<table>` with extensive CSS styling, `rowspan`/`colspan`. Good structural attempt but wrong content
- Arabic text: 0/3 reference labels found. All Arabic text is reversed (LTR rendering of RTL) — "ةينيمأتلا تايطغتلا" instead of "التغطيات التأمينية", "يتارامإ مهرد" instead of "درهم إماراتي"
- Chain-of-thought: Thinking variant outputs reasoning before answer. Reasoning correctly identifies document as Arabic insurance but final output reverses all text
- English page (kfd p1): 10.7s, all values correct (3,500,000, 6,770, 5,000), clean HTML table
- Verdict: Strong English OCR but Arabic is reversed and numbers garbled. No official Arabic support. Even at 28B params (3B active), cannot handle Arabic documents. Too large for our VRAM budget anyway
### Arabic-Nougat-Large (MohamedRashad) — 0/6 ARABIC NUMERALS, HALLUCINATED
Tested as a standalone Nougat-based encoder-decoder model (NOT vLLM — uses transformers directly). ~400M params, ~0.8 GB VRAM. Purpose-built for converting Arabic book pages to Markdown. Max 8192 tokens. Params: `repetition_penalty=1.5`, `max_new_tokens=8192`.
- Speed: 3.2s (very fast)
- Numbers: 0/6 Arabic numerals — no real numbers extracted. Numbers in output are hallucinated ("٣٦٤", "٢٥٠٠٠")
- Structure: Markdown table separator `---|---|---` present but content is entirely fabricated
- Arabic text: 0/3 reference labels found. Title hallucinated as "تأكيد رحلة نموذج التأمين" (completely wrong — should be "وثيقة السمات الرئيسية للتأمين على المركبات")
- Content: Generates plausible-looking Arabic text that is entirely fabricated — hallucinates academic/historical content ("الولايات", "القرن الرابع الميلادي", "Museum-Homewforum") instead of insurance table data
- English page (kfd p1): 2.5s, 0/3 values found. Output is garbled pseudo-English/French — "Égypte", "Múarne Krb", "Cairo Economy Series". Completely unusable
- Verdict: Trained on Arabic academic books, not documents. Hallucinates wildly on unfamiliar content (insurance tables). Not viable for any OCR use case outside its training domain
### OCRFlux-3B (ChatDOC) — 0/6 ARABIC NUMERALS, DEGENERATED
Tested with the recommended prompt from the OCRFlux toolkit source (`build_page_to_markdown_prompt`: "Below is the image of one page of a document. Just return the plain text representation...ALL tables should be presented in HTML format...Do not hallucinate.") and temperature=0.0, max_tokens=8192. Served via vLLM v0.17.0. 3B model based on Qwen2.5-VL-3B, ~6 GB VRAM.
- Speed: 44.1s (hit 4096 token limit — degenerated)
- Numbers: 0/6 Arabic numerals — degenerated on Arabic numbers. Output: `٣٠٠٠٠٠٠٠٠٠٣٠٠٠٠٠٠٠٠٠` repeated to the token limit (same "٣٠" pattern as QARI-OCR). No actual AED values extracted
- Structure: HTML `<table>` tags present but table content degenerated after the first few rows
- Arabic text: Labels partially found (2/3 — التغطيات التأمينية, مسؤولية الغير). Arabic text before the table was correct, but numbers triggered a degeneration loop
- Output format: JSON with `natural_text` field containing markdown + HTML + custom `<table>` tags (needs post-processing via `table_matrix2html`)
- English page (kfd p1): 7.9s, all values correct (3,500,000, 6,770, 5,000). Good HTML table output
- Verdict: Good English OCR but critically degenerates on Arabic numbers. Same degeneration pattern as QARI-OCR (both Qwen2.5-VL-based). Not viable for Arabic documents
### Surya (Datalab) — 0/6 ARABIC NUMERALS (Standalone OCR Toolkit)
Tested as a standalone OCR toolkit (not vLLM). Surya is a suite of specialized models for text detection, recognition, layout analysis, and table recognition. ~500M params total across models, ~10 GB VRAM for table recognition at default batch size.
- OCR Speed: ~2s for detection + recognition (very fast)
- Numbers: 0/6 Arabic numerals — outputs Persian/Urdu numerals instead of Arabic (e.g., ۵۰۰ instead of ٣,٥٠٠,٠٠٠, ،۱٫۷۷ instead of ٦,٧٧٠)
- Table Recognition: No tables detected on either kfd page (p1 English or p4 Arabic)
- Arabic text: Labels found (3/3 — التغطيات التأمينية, مسؤولية الغير, درهم إماراتي) but text has errors (وثبقة instead of وثيقة, حدول instead of جدول)
- Output format: Flat text (OCR) or JSON cell structures (table rec) — no HTML/Markdown. Needs Marker library for formatted output
- English page (kfd p1): Good text extraction but as flat text, not table-structured
- Verdict: Lightweight and fast OCR toolkit but cannot handle Arabic numerals (wrong numeral system), table detection failed, and output is unstructured. Not a viable candidate for our use case
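The "wrong numeral system" failure is a codepoint issue: Arabic-Indic digits live at U+0660–U+0669, while the Persian/Urdu forms Surya emits are the Extended Arabic-Indic block at U+06F0–U+06F9. A small classifier makes the distinction testable:

```python
def digit_script(ch: str) -> str:
    """Classify a numeral codepoint: Arabic-Indic (U+0660-0669) vs
    Extended Arabic-Indic, the Persian/Urdu forms (U+06F0-06F9), vs Western."""
    cp = ord(ch)
    if 0x0660 <= cp <= 0x0669:
        return "arabic-indic"
    if 0x06F0 <= cp <= 0x06F9:
        return "extended-arabic-indic"
    if ch.isascii() and ch.isdigit():
        return "western"
    return "other"
```

The glyphs are near-identical for several digits, which is why this failure is easy to miss in visual inspection but trivial to catch programmatically.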
### Baseer (Misraj/Baseer__Nakba) — NON-FUNCTIONAL VIA VLLM
Tested with the recommended prompt from the model card ("Extract the text from the above document.") and temperature=0.0, max_tokens=4096. Served via vLLM v0.17.0 with `--max-model-len 8192`. 4B model based on Qwen2.5-VL-3B, ~7 GB VRAM.
- Speed: 3.4s (but only generated 3 tokens)
- Numbers: 0/6 Arabic numerals — model outputs only 3-10 tokens total (e.g., "ليرة" or "ليرا Motor Insurance KFD")
- Structure: No table structure — output too short
- Arabic text: Single word output ("ليرة" = "lira" — wrong currency, should be "درهم إماراتي")
- English page (kfd p1): 10 tokens only — "ليرا Motor Insurance KFD"
- Verdict: Baseer appears to require its specific transformers pipeline (custom `processor` and `generate()` params like `repetition_penalty=1.1`). Via the vLLM chat completions API, it generates only a few tokens and stops. Not viable through vLLM
### MiniCPM-V 4.5 AWQ (OpenBMB) — 0/6 ARABIC NUMERALS, HALLUCINATED + CHINESE MIXING
Tested with a generic OCR prompt and temperature=0.0. Served via vLLM v0.17.0. AWQ-4bit quantization (`openbmb/MiniCPM-V-4_5-AWQ`), ~5 GB VRAM. Apache-2.0 license. Requires `--max-model-len 16384 --max-num-batched-tokens 16384`.
- Speed: 38.3s (hit 4096 token limit — degenerated)
- Numbers: 0/6 Arabic numerals — no real numbers extracted. Table cells filled with repeated "العلاقة的基本组成部分" (Arabic + Chinese mixed hallucination)
- Structure: HTML `<table>` tags present but content is entirely fabricated — repeating the same Chinese+Arabic phrase in every cell
- Arabic text: 0/3 reference labels found. Title hallucinated as "العلاقة بين أجزاء جملة مركبة" (completely wrong — "the relationship between parts of a compound sentence")
- English page (kfd p1): 8.4s, all values correct (3,500,000, 6,770, 5,000). Good HTML table structure
- Verdict: Strong English OCR but completely hallucinates on Arabic — generates Chinese characters mixed with Arabic in a degenerate loop. Despite 8.7B params, Arabic is not supported. Only useful for English documents
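The Chinese-mixing failure mode can be flagged with a simple script-range check; a sketch (Unicode ranges simplified to the basic Arabic and CJK Unified Ideographs blocks):

```python
def mixed_chinese_arabic(text: str) -> bool:
    """Flag the MiniCPM-style failure where CJK characters appear
    inside Arabic output (e.g. 'العلاقة的基本组成部分')."""
    has_arabic = any("\u0600" <= c <= "\u06FF" for c in text)
    has_cjk = any("\u4E00" <= c <= "\u9FFF" for c in text)
    return has_arabic and has_cjk
```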
### Nemotron Nano VL 8B (NVIDIA) — 0/6 ARABIC NUMERALS, HALLUCINATED ARABIC TEXT
Tested with a generic OCR prompt and temperature=0.0. Served via vLLM v0.17.0 with `--trust-remote-code`. 8B model (C-RADIOv2-H vision + Llama-3.1-8B LLM), ~16 GB VRAM (BF16). English-only model (per model card). Requires `timm`, `open-clip-torch`, `einops`.
- Speed: 21.7s
- Numbers: 0/6 Arabic numerals, 0/3 Western numerals — no real numbers extracted at all. Table cells filled with hallucinated Arabic names ("نجم", "إيمان إبراهيم") instead of actual values
- Structure: LaTeX `\begin{tabular}` format with `\multicolumn` — unusual output format, not HTML or Markdown
- Arabic text: 0/3 reference labels found. All Arabic text is hallucinated — "الصفات الشخصية" ("personal attributes"), "موتور سمارت" (transliterated "Motor Smart") instead of actual Arabic insurance terms. Recognizes the document is Arabic but fabricates content
- English page (kfd p1): 8.3s, all values correct (3,500,000, 6,770, 5,000). Good LaTeX table with correct structure and values
- Verdict: English-only model as documented. Recognizes Arabic script but hallucinates content entirely. LaTeX output format is unusual. Not viable for Arabic documents. Strong English OCR but 8B/16GB is too large for a secondary model
### GOT-OCR 2.0 (StepFun) — 0/6 ARABIC NUMERALS, DEGENERATED
Tested as a standalone model via the HF-native `stepfun-ai/GOT-OCR-2.0-hf` with `AutoModelForImageTextToText`. 580M params, ~1.2 GB VRAM. Used `format=True` (formatted OCR mode) and `do_sample=False`, `max_new_tokens=4096`. Not vLLM compatible (custom `GOTQwenForCausalLM` architecture).
- Speed: 4.0s Arabic, 1.2s English (very fast — smallest model tested)
- Numbers: 0/6 Arabic numerals, 0/3 Western numerals — no numbers extracted from either page
- Structure: LaTeX `\title{}` wrapper only, no table structure at all
- Arabic text: 0/3 reference labels found. Degenerated — repeats "انتشار" (= "spread") 27 times, then outputs the Chinese "汉语" ("Chinese language"). Single-word repetition loop
- English page (kfd p1): Only extracted title "Motor Insurance KFD" and "Table of Benefits:" — stopped after 50 chars. No table content, no values
- Verdict: Fast and lightweight but essentially non-functional on full-page document images. Designed for scene text and cropped regions, not full-page document parsing. Cannot handle complex layouts with tables. Not viable for any document OCR use case
### Qwen3.5-9B-AWQ-4bit — 0/6 ARABIC NUMERALS
Tested with recommended Qwen params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5) and the recommended "qwenvl markdown" prompt. Served via vLLM v0.17.0. 9B model, ~8.5 GB VRAM (INT4).
- Speed: 103.2s (extremely slow — hit 4096 token limit with chain-of-thought reasoning)
- Numbers: 0/6 Arabic numerals — all numbers westernized (6,770 and 5,000 found as Western, but no Arabic-script numerals). Model generates extensive reasoning tokens analyzing table structure before outputting content
- Structure: Markdown table attempted but truncated by token limit
- Arabic text: 1/3 reference labels found (مسؤولية الغير). Text partially correct but mostly reasoning in English about the Arabic content
- English page (kfd p1): 26.8s, all values correct (3,500,000, 6,770, 5,000). Chain-of-thought reasoning adds significant overhead
- Verdict: 9B model wastes most tokens on reasoning instead of extraction. Westernizes Arabic numerals. Much slower than 4B variant with worse Arabic results
### Qwen3.5-35B-A3B-GPTQ-Int4 — 2/6 ARABIC NUMERALS
Tested with recommended Qwen params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5) and the recommended "qwenvl markdown" prompt. Served via vLLM v0.17.0 with `--gpu-memory-utilization 0.85`. 35B MoE (3B active), ~23 GB VRAM (INT4).
- Speed: 51.5s (hit 4096 token limit with chain-of-thought reasoning)
- Numbers: 2/6 Arabic numerals — ٣,٥٠٠,٠٠٠ and ٣,٥٠٠ found. Missing: ٦,٧٧٠, ٥,٠٠٠, ٢٠,٠٠٠, ٧,٥٠٠. Also found 3,500,000 as Western numeral
- Structure: Markdown table with proper columns
- Arabic text: 1/3 reference labels found (التغطيات التأمينية). Extensive reasoning about table structure in English
- English page (kfd p1): 21.4s, all values correct (3,500,000, 6,770, 5,000)
- Verdict: 35B MoE produces some Arabic numerals (2/6) but wastes most tokens on reasoning. Too large (23 GB) and slow (51s) for our use case. Worse than Qwen3.5-4B-AWQ (3/6) despite being 8× larger
Nemotron Parse v1.1 (NVIDIA) — 0/6 ARABIC NUMERALS, GARBLED ARABIC
Tested with recommended prompt `</s><s><predict_bbox><predict_classes><output_markdown>` and recommended params (temperature=0, top_k=1, repetition_penalty=1.1). Served via vLLM v0.17.0. 885M params, ~1.8 GB VRAM. Supports English, German, French, Spanish, Chinese, Japanese — but NOT Arabic. Max context 9000 tokens.
- Speed: 3.9s Arabic, 1.2s English (very fast)
- Numbers: 0/6 Arabic numerals, 0/3 Western numerals — no real numbers extracted. Outputs Persian/Urdu characters instead (چارقة, ۈۋۉۋۋ)
- Structure: LaTeX \begin{tabular} with bounding box coordinates (<x_0.5352><y_0.1797>) — correct table shape but wrong content
- Arabic text: 0/3 reference labels found. All Arabic text garbled — "تارقة الملاد عبادي" (nonsense) repeated throughout. Mixes Arabic with Persian/Urdu script characters
- English page (kfd p1): 1.2s, all values correct (3,500,000, 6,770, 5,000). Excellent LaTeX table with bounding boxes. Very fast and accurate for English
- Verdict: Outstanding English document parser — fast (1.2s), accurate, with spatial coordinates. But Arabic is not supported and output is garbled nonsense. Only viable for English documents
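Nemotron's coordinate tags can be mapped back to pixel positions with a small parser — a minimal sketch, assuming (as the sample above suggests) that the `<x_…><y_…>` values are fractions of page width/height:

```python
import re

def parse_coord_tags(s: str, page_w: int, page_h: int):
    """Convert Nemotron-style <x_0.5352><y_0.1797> tags to pixel (x, y) pairs."""
    pairs = re.findall(r"<x_([\d.]+)><y_([\d.]+)>", s)
    return [(round(float(x) * page_w), round(float(y) * page_h)) for x, y in pairs]

print(parse_coord_tags("<x_0.5352><y_0.1797>", 1000, 2000))  # → [(535, 359)]
```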
AIN-7B (MBZUAI) — 0/6 ARABIC NUMERALS, EMPTY TABLE CELLS
Tested with three prompts: English extraction, Arabic extraction (استخرج جميع النصوص من هذه الوثيقة, "Extract all the text from this document."), and an explicit markdown table prompt. Served via vLLM v0.17.0. 8B params (Qwen2-VL-7B base), ~16 GB VRAM BF16. Arabic-first bilingual model (MSA/English) trained on 3.6M Arabic-English multimodal samples.
- Speed: 3.6s–84.2s Arabic (varies wildly by prompt), 5.0–8.5s English
- Numbers: 0/6 Arabic numerals across all prompts. English extraction prompt: only 13 tokens output (document title only). Arabic prompt: 18 tokens. Table prompt: 4096 tokens but every table cell empty (| | | | |)
- Structure: Detected the table grid correctly with the markdown table prompt but filled every cell with whitespace — no content extracted at all
- Arabic text: 0/3 reference labels. First prompt returned only the page title. Arabic prompt returned even less. Table prompt returned empty cells
- English page (kfd p1): 3/3 western numerals found with English and table prompts. Good markdown table output. 5.0s, 656 chars
- Verdict: Despite being Arabic-specialized (claims to outperform GPT-4o on Arabic OCR benchmarks), AIN completely fails on Arabic table extraction. Detects table structure but cannot read any cell content. Likely trained on prose OCR, not tabular documents
Table Extraction Quality — Qwen3-VL-4B-AWQ (Secondary Model on Table Pages)
Tested Qwen3-VL-4B-AWQ on the same table-heavy pages used for DotsOCR benchmarks, using production parameters and prompt from settings.py and prompts.py:
- Params: `temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5`
- Prompt: "qwenvl markdown" (same as the production secondary/image prompt)
- Serving: vLLM v0.17.0, `cyankiwi/Qwen3-VL-4B-Instruct-AWQ-4bit`
Speed Comparison
| Page | DotsOCR v1.5 | Qwen3-VL-4B-AWQ |
|---|---|---|
| oman p21 (performance table) | 5.8s | 6.6s |
| oman p23 (health table) | 2.9s | 2.1s |
| oman p25 (security table) | 3.2s | 2.4s |
| kfd p1 (EN benefits table) | 5.5s | 2.8s |
| kfd p4 (AR benefits table) | 10.7s | 8.3s |
Output Format
- DotsOCR: structured HTML (`<table>`/`<thead>`/`<tbody>`/`<tr>`/`<td>`/`<th>`) — machine-parseable, one unified table per page
- Qwen: Markdown (`| | |`) tables — human-readable, but splits multi-indicator pages into separate small tables
English Table Quality
Oman p21 (Performance Indicators, 8 rows × 4 columns):
- Qwen produces a proper 4-column markdown table with all 8 indicators
- All indicator names correct ("Quacquarelli Symonds", "Global Innovation Index", etc.)
- Values correct (32.8, 69/127, 0.938, 43.93, etc.)
- Table is unified (all rows in one table), matching DotsOCR structure
kfd p1 (English Motor Insurance Benefits, 3 product columns):
- All 3 product columns correct (Motor Value, Motor Smart, Motor Executive)
- All AED amounts exact: 3,500,000, 6,770, 5,000, 7,500, 3,500, 20,000, 5,000, 1,000, 300, 500
- Quality matches DotsOCR v1.5 for English numbers
Oman p23/p25 (2-row indicator tables):
- Correct indicator names and values (79.03, 65.6, 94.6, 51.2)
- Tables split per indicator (separate small tables vs DotsOCR's unified table)
Arabic Table Quality (kfd p4)
- 1,327 Arabic chars extracted (comparable to DotsOCR's ~1,500)
- Markdown table structure present with proper columns
- Arabic text is reversed (LTR instead of RTL) — labels garbled but recognizable
- Numbers corrupted: ٣,٥٠٠٠٠ instead of ٣,٥٠٠,٠٠٠ and ٦,٧٧. instead of ٦,٧٧٠
- DotsOCR v1.5 has ALL Arabic numbers correct — the clear winner for Arabic
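The "Arabic numerals N/6" scores used throughout this report reduce to a verbatim substring check against the reference values. A minimal sketch of such a checker (the reference list here is an illustrative subset, not the full set of six):

```python
# Count how many reference Arabic-Indic numerals appear verbatim in OCR output.
REFERENCE_NUMERALS = ["٣,٥٠٠,٠٠٠", "٦,٧٧٠", "٥,٠٠٠"]  # illustrative subset of the 6 used

def score_arabic_numerals(ocr_text: str, refs=REFERENCE_NUMERALS) -> int:
    return sum(1 for ref in refs if ref in ocr_text)

# Qwen's corrupted output (٣,٥٠٠٠٠) does not match the reference ٣,٥٠٠,٠٠٠:
print(score_arabic_numerals("الحد الأقصى ٣,٥٠٠٠٠ درهم"))    # → 0
print(score_arabic_numerals("الحد الأقصى ٣,٥٠٠,٠٠٠ درهم"))  # → 1
```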
Verdict: Qwen is NOT a DotsOCR Replacement for Tables
| Capability | DotsOCR v1.5 | Qwen3-VL-4B-AWQ |
|---|---|---|
| English table values | All correct | All correct |
| Arabic numbers | ٣,٥٠٠,٠٠٠ (exact) | ٣,٥٠٠٠٠ (corrupted) |
| Output format | HTML <table> (structured) | Markdown pipe tables (flat) |
| Table structure | Unified, machine-parseable | Split per indicator |
| Avg speed (tables) | 5.6s | 4.4s |
Qwen matches DotsOCR on English table content but fails on Arabic numerals and produces less structured output (markdown vs HTML). DotsOCR remains the correct choice for the PRIMARY role (full pages with tables). Qwen's strength is Picture crop text extraction, not table parsing.
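The "machine-parseable" distinction is concrete: DotsOCR's HTML tables can be walked with a standard parser, while pipe-markdown needs ad-hoc splitting. A minimal sketch using Python's stdlib `html.parser` (the sample table string is illustrative):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect <table> cell text into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr":
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

p = TableRows()
p.feed("<table><tr><th>Cover</th><th>AED</th></tr>"
       "<tr><td>TPL</td><td>3,500,000</td></tr></table>")
print(p.rows)  # → [['Cover', 'AED'], ['TPL', '3,500,000']]
```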
Visual Region Extraction — dots.ocr-1.5-SVG vs Qwen3-VL-2B
In the production pipeline, dots.ocr handles layout detection + text/table extraction in one pass. Regions classified as "Picture" (diagrams, charts, infographics) are cropped and sent to the secondary VLM (Qwen) for text extraction. We tested whether dots.ocr-1.5-svg could replace Qwen for this role.
dots.ocr-1.5-SVG Model
A separate 3B model fine-tuned for converting images to SVG code. Available on ModelScope (rednote-hilab/dots.ocr-1.5-svg); removed from HuggingFace. Same VRAM as base model (5.72 GB BF16).
Official prompt format: "Please generate the SVG code based on the image." with viewBox="0 0 {width} {height}"
Hyperparameters: temperature=0.6, top_p=0.9, repetition_penalty=1.15 (recommended by DotsOCR model card). Critical for preventing SVG path degeneration — without these params, the model wastes all tokens on SVG paths and produces 0 text elements on 5/10 tests.
Test Setup: Actual Pipeline Picture Crops
Used dots.ocr v1.0 with prompt_dots to detect layout regions on test pages, then extracted the actual "Picture" crops that Qwen would receive:
- Adobe page 3: 3 Picture regions — logo (113×174), CCF diagram (1207×710), compliance infographic (722×557)
- Oman page 10: 1 Picture — donut chart (943×643)
- Oman page 16: 2 Pictures — vision partners diagram (471×432), society diagram (379×357)
Head-to-Head: Qwen vs SVG Model on Actual Picture Crops
| Crop | Qwen3-VL-2B | dots.ocr-1.5-svg | Winner |
|---|---|---|---|
| CCF diagram (1207×710) | 4.0s, 519 tok — all text (SOC 2, ISO, PCI, etc.) | 17.2s, 2701 tok — 37 texts in SVG | Qwen (4× faster, clean text) |
| Compliance infographic (722×557) | 31.9s — got keywords then degenerated | 38.4s, 6000 tok — 9 texts extracted | SVG (Qwen degenerated) |
| Donut chart (943×643) | 32.0s — got values then repeated | 38.5s, 6000 tok — 64 text snippets | SVG (Qwen degenerated) |
| Vision partners (471×432) | 0.3s, 29 tok — clean text | 16.8s, 2649 tok — 7 texts | Qwen (56× faster) |
| Society diagram (379×357) | 31.6s — degenerated | 4.7s, 747 tok — 5 texts | SVG (Qwen degenerated) |
SVG Model on Full Pages (for reference)
When given full pages instead of crops, the SVG model extracts text labels embedded in SVG <text> elements:
| Page | Time | Tokens | Finish | Texts | Result |
|---|---|---|---|---|---|
| Adobe page 3 (1653×2339) | 21.1s | 3204 | length | 1 | Minimal text — most tokens on SVG structure |
| Oman page 10 (1191×1684) | 36.3s | 5580 | length | 10 | Chart labels and values extracted |
| Oman page 16 (1191×1684) | 24.7s | 3825 | stop | 14 | Diagram text extracted, completed naturally |
| kfd page 1 English (1224×1584) | 33.4s | 5148 | stop | 55 | Table content extracted as SVG text |
| kfd page 4 Arabic (1224×1584) | 36.7s | 5652 | length | 100 | Arabic text extracted but garbled ("مottor") |
Key Findings
- SVG model works well with recommended params: with `repetition_penalty=1.15`, the model stops wasting tokens on SVG paths and produces `<text>` elements — 0/10 degenerations vs 5/10 at the default `temperature=0.0`.
- SVG model beats Qwen 2B on hard cases: for compliance infographics, donut charts, and society diagrams where Qwen 2B degenerates, the SVG model extracts text successfully (9, 64, and 5 texts respectively). However, Qwen3-VL-4B-AWQ does not degenerate on these crops and produces more complete extractions than SVG on all 5 crops (full indicator names + values vs partial text snippets).
- Qwen 4B-AWQ is faster and higher quality: 0.3–4.0s vs 4.7–38.5s, with more complete text on every crop. The SVG model consistently misses labels and truncates content.
- Arabic is broken in the SVG model: produces garbled mixed-language output ("مottor", "مسHAOYI").
- SVG format is token-expensive: even with recommended params, SVG paths consume significant tokens. The 8192-token context limits how much text can be extracted from complex pages.
- SVG model requires post-processing: text must be extracted from SVG `<text>` elements, adding pipeline complexity vs Qwen's clean JSON output.
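The post-processing step for SVG output is small but real — a minimal sketch of pulling labels out of the model's SVG with stdlib `xml.etree` (element names per standard SVG; the sample string is illustrative):

```python
import xml.etree.ElementTree as ET

def svg_text_contents(svg: str):
    """Pull the text labels out of SVG <text> elements, ignoring paths/shapes."""
    root = ET.fromstring(svg)
    # Model output may or may not carry the SVG namespace; strip it if present.
    return [(el.text or "").strip()
            for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "text"]

svg = '<svg viewBox="0 0 100 100"><path d="M0 0"/><text x="5" y="10">SOC 2</text></svg>'
print(svg_text_contents(svg))  # → ['SOC 2']
```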
Qwen3-VL Model Size Comparison
Tested all available Qwen3-VL variants as potential upgrades to the current Qwen3-VL-2B-FP8 secondary model. Each variant was tested on 5 full pages + 5 actual Picture crops from dots.ocr v1.0 layout detection (10 tests total).
Hyperparameters: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5 (recommended by Qwen model cards for VL tasks).
Models Tested
| Model | Quantization | Weights (disk) | VRAM (weights) |
|---|---|---|---|
| Qwen3-VL-2B-Instruct-FP8 (current) | FP8 | 3.3 GB | ~3 GB |
| Qwen3-VL-4B-Instruct-AWQ-4bit | INT4 (compressed-tensors) | 4.2 GB | ~4 GB |
| Qwen3-VL-4B-Instruct | BF16 | 8.3 GB | ~8 GB |
| Qwen3-VL-8B-Instruct-AWQ-4bit | INT4 (compressed-tensors) | 7.1 GB | ~7 GB |
| Qwen3-VL-8B-Instruct | BF16 | 17 GB | ~17 GB |
| Qwen3-VL-30B-A3B-Instruct-FP8 | FP8 (MoE, 3B active) | 31 GB | ~31 GB |
AWQ-4bit models from HuggingFace community (cyankiwi), using compressed-tensors quantization format (auto-detected by vLLM v0.11.0, no --quantization flag needed).
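Serving one of these compressed-tensors checkpoints is a plain `vllm serve` call — a sketch of the implied invocation; no `--quantization` flag is passed because vLLM auto-detects the format, and the memory/context flags here are assumptions for this model, not values from the test runs:

```shell
# Quantization format is auto-detected from the model's config.json
# (compressed-tensors), so no --quantization flag is needed.
vllm serve cyankiwi/Qwen3-VL-4B-Instruct-AWQ-4bit \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384
```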
Full Pages Results
| Page | 2B-FP8 | 4B-AWQ | 4B-BF16 | 8B-AWQ | 8B-BF16 | 30B-A3B-FP8 |
|---|---|---|---|---|---|---|
| adobe_p3 | 7.2s, 725t | 5.0s, 556t | 12.2s, 772t | 7.9s, 759t | 22.8s, 895t | 7.4s, 561t |
| oman_p10 | 2.2s, 194t | 2.1s, 207t | 3.6s, 209t | 2.9s, 241t | 6.3s, 233t | 3.1s, 213t |
| oman_p16 | 3.4s, 340t | 3.1s, 334t | 5.4s, 334t | 3.7s, 341t | 11.1s, 434t | 4.4s, 325t |
| kfd_p1 (EN) | 5.4s, 546t | 3.1s, 334t | 5.4s, 335t | 3.9s, 358t | 9.0s, 348t | 4.7s, 345t |
| kfd_p4 (AR) | DEGEN | 11.4s, 1300t | 18.3s, 1178t | 42.1s, 4096t† | 23.6s, 935t | DEGEN |
Picture Crops Results
| Crop | 2B-FP8 | 4B-AWQ | 4B-BF16 | 8B-AWQ | 8B-BF16 | 30B-A3B-FP8 |
|---|---|---|---|---|---|---|
| CCF diagram | 3.1s, 322t | 2.6s, 307t | 4.8s, 310t | 3.2s, 319t | 13.1s, 528t | 4.2s, 326t |
| Compliance | 1.3s, 132t | 0.4s, 44t | 0.7s, 44t | 0.5s, 48t | 1.1s, 44t | 0.7s, 48t |
| Donut chart | 1.7s, 175t | 1.4s, 167t | 2.6s, 167t | 2.0s, 202t | 4.2s, 169t | 2.6s, 200t |
| Vision partners | 0.6s, 58t | 0.3s, 25t | 0.4s, 25t | 0.3s, 29t | 0.7s, 25t | 0.5s, 29t |
| Society diagram | DEGEN | 0.9s, 103t | 0.4s, 25t | 0.4s, 34t | 0.5s, 19t | 0.5s, 36t |
†8B-AWQ tested at temperature=0.0 (no recommended-params re-run; already stable).
Summary
| Model | Avg Time (ok) | Degenerations | Arabic (kfd_p4) | Weights |
|---|---|---|---|---|
| 2B-FP8 (current) | 3.1s | 2/10 | DEGEN | 3.3 GB |
| 4B-AWQ-4bit | 3.0s | 0/10 | ok (1300 tok) | 4.2 GB |
| 4B-BF16 | 5.4s | 0/10 | ok (1178 tok) | 8.3 GB |
| 8B-AWQ-4bit | 6.7s | 0/10 | ok (4096 tok) | 7.1 GB |
| 8B-BF16 | 9.2s | 0/10 | ok (935 tok, best) | 17 GB |
| 30B-A3B-FP8 | 3.1s | 1/10 | DEGEN | 31 GB |
Key Findings
- 4B-AWQ is the best upgrade — fastest average (3.0s), zero degenerations, handles Arabic cleanly (1300 tok), only ~1 GB more than current 2B-FP8
- 4B-BF16 also handles Arabic — zero degenerations with recommended params, but 1.8× slower and 2× heavier than AWQ
- 8B-BF16 has the cleanest Arabic output (935 tok) but 3× slower and 4× larger than 4B-AWQ
- Current 2B-FP8 is the weakest — degenerates on Arabic (kfd_p4) and the society diagram with recommended params, and also on the compliance crop with prod params (`temp=0.1`, 32.7s wasted). 4B-AWQ is stable with both prod and recommended params
- 30B-A3B MoE doesn't justify its 31 GB — still degenerates on Arabic, same speed as 4B-AWQ
- AWQ (INT4) models are faster than BF16 counterparts with similar output quality
Qwen3.5 Model Comparison (Native Multimodal)
Qwen3.5 (released Mar 2, 2026) is Alibaba's latest model family with native vision capabilities built-in via early fusion training. Unlike Qwen3-VL (dedicated vision-language models), Qwen3.5 models are natively multimodal — no separate "-VL" variant needed.
Important: Qwen3.5 requires vLLM nightly (not in stable v0.11.0) and has a "thinking mode" enabled by default. All tests below were run with enable_thinking: false via chat_template_kwargs.
Hyperparameters: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5 (recommended by Qwen model cards). BF16 models (4B, 9B) were tested at temperature=0.0 only.
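Disabling thinking mode goes through the OpenAI-compatible request body — a sketch of the payload used for these runs, with the sampling values quoted above; the model id and image URL are placeholders:

```python
# Request body for a Qwen3.5 run with thinking mode disabled.
# Model id and image URL are placeholders; sampling params are the
# recommended values from the Qwen model cards.
payload = {
    "model": "Qwen/Qwen3.5-4B",  # placeholder model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///tmp/page.png"}},
            {"type": "text", "text": "qwenvl markdown"},
        ],
    }],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "presence_penalty": 1.5,
    "chat_template_kwargs": {"enable_thinking": False},
}
print(payload["chat_template_kwargs"])  # → {'enable_thinking': False}
```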
Models Tested
| Model | Quantization | Weights (disk) | VRAM (total) |
|---|---|---|---|
| Qwen3.5-2B | BF16 | 4.3 GB | ~7 GB |
| Qwen3.5-4B | BF16 | 8.8 GB | ~41 GB |
| Qwen3.5-4B-AWQ-4bit | INT4 (compressed-tensors) | 3.8 GB | ~41 GB |
| Qwen3.5-9B | BF16 | 19 GB | ~41 GB |
| Qwen3.5-9B-AWQ-4bit | INT4 (compressed-tensors) | 8.5 GB | ~41 GB |
| Qwen3.5-35B-A3B-GPTQ-Int4 | GPTQ INT4 (MoE, 3B active) | 23 GB | ~42 GB |
| Qwen3.5-35B-A3B | BF16 (MoE, 3B active) | 67 GB | OOM (>46 GB) |
AWQ-4bit models from HuggingFace community (cyankiwi). GPTQ-Int4 from official Qwen repo.
Full Pages Results
| Page | 3.5-2B | 3.5-4B | 3.5-4B-AWQ | 3.5-9B | 3.5-9B-AWQ | 3.5-35B-A3B-GPTQ |
|---|---|---|---|---|---|---|
| adobe_p3 | 7.6s, 1207t | 10.2s, 771t | 5.5s, 869t | 12.7s, 535t | 5.7s, 544t | 4.8s, 540t |
| oman_p10 | 3.3s, 453t | 4.9s, 335t | 2.9s, 384t | 6.1s, 238t | 2.9s, 232t | 2.5s, 229t |
| oman_p16 | 25.6s, 4096t | 6.7s, 499t | 3.2s, 474t | 8.0s, 332t | 3.7s, 341t | 3.2s, 343t |
| kfd_p1 (EN) | 4.3s, 665t | 7.3s, 545t | 6.9s, 1086t | 9.7s, 405t | 4.2s, 394t | 3.3s, 352t |
| kfd_p4 (AR) | 6.2s, 990t | DEGEN† | 8.9s, 1422t | DEGEN† | 6.6s, 1061t | 9.0s, 1118t |
Picture Crops Results
| Crop | 3.5-2B | 3.5-4B | 3.5-4B-AWQ | 3.5-9B | 3.5-9B-AWQ | 3.5-35B-A3B-GPTQ |
|---|---|---|---|---|---|---|
| CCF diagram | 3.2s, 500t | 8.0s, 622t | 3.6s, 582t | 9.9s, 427t | 3.1s, 306t | 2.9s, 323t |
| Compliance | 1.5s, 232t | 2.2s, 167t | 1.5s, 237t | 1.9s, 81t | 0.6s, 51t | 0.5s, 52t |
| Donut chart | 1.5s, 222t | 4.0s, 310t | 2.6s, 412t | 8.2s, 355t | 2.1s, 205t | 1.7s, 186t |
| Vision partners | 1.0s, 159t | 1.9s, 146t | 1.2s, 184t | 1.7s, 73t | 0.4s, 35t | 0.3s, 33t |
| Society diagram | 0.7s, 101t | 1.9s, 144t | 1.0s, 152t | 1.8s, 77t | 0.5s, 46t | 0.4s, 35t |
†BF16 models (3.5-4B, 3.5-9B) tested at temperature=0.0 only — not re-tested with recommended params.
Summary
| Model | Avg Time (ok) | Degenerations | Arabic (kfd_p4) | Weights | vLLM |
|---|---|---|---|---|---|
| 3.5-2B | 5.0s | 0/10 | ok (990 tok) | 4.3 GB | nightly |
| 3.5-4B† | 5.3s | 1/10 | DEGEN | 8.8 GB | nightly |
| 3.5-4B-AWQ | 3.5s | 0/10 | ok (1422 tok) | 3.8 GB | nightly |
| 3.5-9B† | 6.7s | 1/10 | DEGEN | 19 GB | nightly |
| 3.5-9B-AWQ | 2.7s | 0/10 | ok (1061 tok) | 8.5 GB | nightly |
| 3.5-35B-A3B-GPTQ | 2.2s | 0/10 | ok (1118 tok) | 23 GB | nightly |
| 3.5-35B-A3B (BF16) | — | — | — | 67 GB | OOM |
Key Findings
- Most Qwen3.5 variants handle Arabic with recommended hyperparameters — 2B, 4B-AWQ, 9B-AWQ, and 35B-A3B-GPTQ all produce clean Arabic output
- BF16 models (4B, 9B) still degenerate — only tested at `temperature=0.0`; likely fixable with recommended params but not re-tested
- Qwen3.5 is more verbose — outputs richer descriptions with visual element analysis, using more tokens
- All Qwen3.5 models require vLLM nightly — the `qwen3_5` architecture is not supported in stable v0.11.0
- MoE 35B-A3B-GPTQ is fast (2.2s) but uses 42 GB VRAM
- 35B-A3B BF16 (67 GB) doesn't fit on our 46 GB L40S — OOM
- Qwen3-VL-4B-AWQ (3.0s, stable vLLM) still beats Qwen3.5-4B-AWQ (3.5s, nightly) for our use case
Qwen3-VL vs Qwen3.5 — Best Variants Head-to-Head
| | Qwen3-VL-4B-AWQ | Qwen3.5-4B-AWQ |
|---|---|---|
| Avg time (ok) | 3.0s | 3.5s |
| Degenerations | 0/10 | 0/10 |
| Arabic | ok | ok |
| Weights | 4.2 GB | 3.8 GB |
| vLLM version | v0.11.0 (stable) | nightly only |
| Output style | Concise JSON | Richer JSON with descriptions |
Winner: Qwen3-VL-4B-AWQ — faster, works on stable vLLM, production-ready.
Nanonets-OCR2-3B & LightOnOCR-2-1B (Oct 2025 OCR Models)
Tested two dedicated OCR models released in October 2025 as potential DotsOCR replacements. Both served via vLLM v0.17.0 on L40S. Tested on 9 full pages + 5 Picture crops, using each model's recommended hyperparameters.
Models Tested
| Model | Architecture | Params | VRAM (weights) | Hyperparams | DPI |
|---|---|---|---|---|---|
| Nanonets-OCR2-3B | Qwen2.5-VL fine-tune | 3B | ~8 GB (BF16) | temperature=0.0 (model card) | 150 |
| LightOnOCR-2-1B | Custom (RLVR-trained) | 1B | ~2 GB (BF16) | temperature=0.2, top_p=0.9 (model card) | 200 (model card) |
Full Pages — Speed Comparison
| Page | DotsOCR v1.5 | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| oman p7 | 0.7s | 3.8s | 2.9s |
| oman p10 | 1.5s | 3.0s | 7.4s |
| oman p16 | 3.2s | 3.6s | 1.3s |
| oman p21 (table) | 5.8s | 7.2s | 3.0s |
| oman p23 (table) | 2.9s | 4.2s | 1.5s |
| oman p25 (table) | 3.2s | 5.1s | 1.7s |
| kfd p1 (EN table) | 5.5s | 10.3s | 3.2s |
| kfd p4 (AR table) | 10.7s | 18.4s | 6.6s |
| adobe p3 | — | 6.4s | 2.7s |
| Avg | 2.9s | 6.9s | 3.4s |
Full Pages — Table Quality
Both models produce well-structured HTML tables with <table>/<thead>/<tbody> tags.
| | DotsOCR v1.5 | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| AED 3,500,000 (EN) | Correct | Correct | Correct |
| Oman p21 values | All correct | All correct | All correct + full indicator names |
| "Quacquarelli Symonds" | Correct | Correct | Correct |
Full Pages — Arabic (kfd p4)
| | DotsOCR v1.5 | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| Arabic chars | ~1500 | 1451 | 1266 |
| AED numerals | ٣,٥٠٠,٠٠٠ (correct) | 3,000,000 (westernized) | ٣٠,٥٠,٠٠٠ (garbled) |
| Arabic text quality | Best | Good | Good |
Picture Crops — vs Qwen3-VL-4B-AWQ
Neither OCR model is suitable for the secondary VLM role (Picture crop text extraction).
| Crop | Qwen3-VL-4B-AWQ | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| CCF diagram | 2.6s, text | 4.1s, text | 1.2s, text |
| Compliance infographic | 0.4s, text | 1.8s, image description | 2.5s, CSS styling |
| Donut chart | 1.4s, values | 5.2s, image description | 0.9s, labels only |
| Vision partners | 0.3s, text | 0.6s, text | 0.2s, text |
| Society diagram | 0.9s, text | 1.0s, image description | 0.1s, garbled |
- Nanonets describes images instead of extracting text on 3/5 crops ("A blue background graphic with...")
- LightOnOCR garbles small crops (society diagram: "dual / tion / pility") and outputs CSS/HTML styling on infographics
- Qwen 4B-AWQ extracts actual text on all 5 crops — indicator names, values, labels
Key Findings
- LightOnOCR-2 (1B) matches DotsOCR v1.5 speed on full pages (3.4s vs 2.9s avg) at 1/3 the params and 1/3 the VRAM — impressive efficiency, 280 tok/s
- Nanonets-OCR2 (3B) is 2.4× slower than DotsOCR v1.5 despite same parameter count — no speed advantage
- Neither matches DotsOCR on Arabic numerals — DotsOCR v1.5 is the only model that correctly extracts ٣,٥٠٠,٠٠٠
- Neither works for Picture crops — these are document-page OCR models, not general VLMs. Qwen 4B-AWQ remains the clear winner for the secondary role
- LightOnOCR outputs Markdown only — the output format is baked into the weights and text prompts are ignored, so it cannot be steered to emit JSON layout like DotsOCR
FireRed-OCR-2B (FireRedTeam) — Table Extraction Benchmark
Tested as a potential DotsOCR alternative for table extraction. Based on Qwen3-VL-2B-Instruct, fine-tuned with Format-Constrained GRPO for table structural integrity. 92.94% on OmniDocBench v1.5. Served via vLLM v0.17.0. Recommended params from vLLM inference script: temperature=0.0, max_tokens=8192. Recommended prompt: detailed Markdown conversion instructions with explicit "Convert tables to HTML format" directive.
Also tested with generation_config.json params (temperature=0.7, top_p=0.8, top_k=20) — CSS styling bloat caused p23 to hit 4096 token limit (27.1s). Recommended temperature=0.0 eliminates this issue.
English Table — kfd.pdf p1 (3-column bordered table)
| Metric | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| Speed | 5.5s | 9.4s |
| Format | HTML <table> | HTML <table> with CSS styling (background-color, colspan) |
| AED 3,500,000 | Correct | Correct |
| AED 6,770 | Correct | Correct |
| AED 5,000 / 7,500 | Correct | Correct |
| AED 3,500 / 6,000 | Correct | Correct |
| AED 20,000 | Correct | Correct |
| Column headers | Correct | Correct, with header row styling |
| Section headers | Plain text | colspan="4" with bold styling — preserves "Main Covers" / "Enhanced Motor Protection" grouping |
kfd p1 verdict: FireRed-OCR matches DotsOCR on accuracy and produces richer HTML — CSS color styling, colspan section headers, proper <thead>/<tbody>. Slightly slower (9.4s vs 5.5s).
English Table — Oman p21 (8-row styled table, no grid lines)
| Metric | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| Speed | 5.8s | 6.9s |
| Format | HTML <table> with <thead>/<tbody> | HTML <table> with <thead>/<tbody> |
| "Global Innovation Index" 32.8 | Correct | Correct |
| "Education for All" 0.938 | Correct | Correct |
| "Skills" 71.6, Rank 36/140 | Correct | Correct |
| "Global Talent" 43.93 | Correct | Correct |
| "Quacquarelli Symonds" | Correct | Correct |
| "Omani" | Correct | Correct |
| 2030/2040 targets | All correct | All correct |
| Table structure | Good | Excellent — colspan, multiline <br> values |
Oman p21 verdict: Both models produce correct HTML tables. FireRed-OCR is only 1.2× slower and produces clean, well-structured output with proper "Omani" spelling and all values correct.
English Tables — Oman p23 & p25
| Page | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| p23 format | HTML table | HTML table with CSS |
| p23 values | All correct | All correct |
| p23 speed | 2.9s | 3.9s |
| p25 format | HTML table | HTML table with <thead>/<tbody> |
| p25 values | All correct | All correct |
| p25 speed | 3.2s | 3.5s |
p23/p25 verdict: FireRed-OCR produces HTML tables on ALL styled pages (unlike InternVL3.5-4B which fell back to bullet lists). Very close speed. All values correct.
Arabic Table — kfd.pdf p4
| Metric | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| Speed | 10.7s | 13.4s |
| Arabic numerals (٣,٥٠٠,٠٠٠ etc.) | All 6 correct | Zero correct — numbers garbled ("3,0,...,") |
| Arabic labels (التغطيات التأمينية) | All correct | Zero matched — text reversed/garbled |
| Arabic chars | ~1500 | 1225 |
| Table structure | Proper HTML | HTML table structure present but text scrambled |
Arabic verdict: Arabic text is garbled — characters appear reversed and mixed with wrong-script characters. Not hallucinated (unlike InternVL) but not readable either. Numbers rendered as "3,0,...," placeholders. DotsOCR v1.5 remains the only model with correct Arabic numeral extraction.
Summary — FireRed-OCR-2B vs DotsOCR on Tables
| Table Type | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| English bordered (kfd p1) | Correct HTML | Correct HTML with CSS styling |
| English styled (oman p21) | Correct HTML | Correct HTML — all values match |
| English styled (oman p23) | Correct HTML | Correct HTML |
| English styled (oman p25) | Correct HTML | Correct HTML |
| Arabic (kfd p4) | All correct | Garbled — zero correct numerals or labels |
| Avg speed (table pages) | 5.6s | 7.4s (1.3× slower) |
| VRAM | 5.7 GB (BF16) | ~5 GB (BF16) |
Verdict: FireRed-OCR-2B is the strongest English table extractor tested — it produces HTML tables on all 5 table pages (the only model besides DotsOCR to achieve this), with CSS styling, proper <thead>/<tbody>, colspan, and 100% value accuracy. Only 1.3× slower than DotsOCR and uses less VRAM. However, Arabic is garbled (like most non-DotsOCR models). For English-only table extraction, FireRed-OCR-2B is the closest competitor to DotsOCR v1.5.
InternVL3.5-4B (OpenGVLab) — Table Extraction Benchmark
Tested as a potential DotsOCR alternative for table extraction. Served via vLLM v0.17.0 with --trust-remote-code --max-model-len 16384. Tested with both default (temperature=0.1, top_p=0.9) and recommended (temperature=0.0, top_p=0.95) params — results identical. Prompt: "Read all the text in the image. Return tables in HTML format."
English Table — kfd.pdf p1 (3-column bordered table)
| Metric | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| Speed | 5.5s | 6.4s |
| Format | HTML <table> | Markdown table |
| AED 3,500,000 | Correct | Correct |
| AED 6,770 | Correct | Correct |
| AED 5,000 | Correct | Correct |
| AED 3,500 | Correct | Correct |
| AED 20,000 | Correct | Correct |
| AED 7,500 | Correct | Correct |
| Column headers | Motor Value / Smart / Executive | Correct |
| Row labels | All correct | All correct |
kfd p1 verdict: InternVL3.5-4B matches DotsOCR on English bordered table accuracy — all AED amounts correct, all row/column labels correct. Output is markdown instead of HTML, and slightly slower.
English Table — Oman p21 (8-row styled table, no grid lines)
| Metric | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| Speed | 5.8s | 12.3s |
| Format | HTML <table> with <thead>/<tbody> | HTML <table> (with full <!DOCTYPE> boilerplate) |
| "Global Innovation Index" 32.8 | Correct | Correct |
| "Education for All" 0.938 | Correct | Correct |
| "Skills" 71.6 | Correct | Correct |
| "Global Talent" 43.93 | Correct | Correct |
| "Quacquarelli Symonds" | Correct | Correct (but "Omni" not "Omani") |
| 2030/2040 targets | All correct | All correct |
Oman p21 verdict: Values correct but 2.1× slower, outputs full HTML document boilerplate wasting ~200 tokens, minor OCR error ("Omni" instead of "Omani"), no <thead>/<tbody> structure.
English Tables — Oman p23 & p25 (smaller styled tables)
| Page | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| p23 format | HTML table | Markdown bullet lists (no table) |
| p23 values | All correct | Values correct, no tabular structure |
| p23 speed | 2.9s | 3.7s |
| p25 format | HTML table | Markdown bullet lists (no table) |
| p25 values | All correct | Values correct, no tabular structure |
| p25 speed | 3.2s | 5.1s |
p23/p25 verdict: InternVL3.5-4B extracted correct values but failed to produce table format on styled (non-bordered) tables — output was nested markdown lists. DotsOCR produced proper HTML tables.
Arabic Table — kfd.pdf p4
| Metric | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| Speed | 10.7s | 54.9s (hit 4096 token limit) |
| Arabic numerals (٣,٥٠٠,٠٠٠ etc.) | All 6 correct | Zero correct |
| Arabic labels (التغطيات التأمينية) | All correct | Zero matched |
| Content fidelity | Actual document content | Hallucinated — "البيانات" repeated in every cell, "الجداول المUNITED" (mixed Arabic/English) |
Arabic verdict: Complete failure. InternVL3.5-4B hallucinated a generic template table instead of reading the actual document. Zero real content extracted.
Summary — InternVL3.5-4B vs DotsOCR on Tables
| Table Type | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| English bordered (kfd p1) | Correct HTML | Correct markdown |
| English styled (oman p21) | Correct HTML | Correct HTML (2.1× slower, boilerplate) |
| English styled (oman p23/25) | Correct HTML | Values ok, no table structure |
| Arabic (kfd p4) | All correct | Hallucinated — zero real content |
| Avg speed (table pages) | 5.6s | 16.5s (3× slower) |
| VRAM | 5.7 GB (BF16) | 9.5 GB (BF16) |
Verdict: InternVL3.5-4B is competitive on simple English bordered tables (kfd p1) but falls behind DotsOCR on styled tables (loses table structure on p23/p25), is 3× slower overall, uses 1.7× more VRAM, and completely fails on Arabic tables. Not a viable DotsOCR replacement.
GLM-OCR (Zhipu/zai-org) — Table Extraction Benchmark
Tested as a potential DotsOCR alternative for table extraction. #1 on OmniDocBench v1.5 (94.62), 0.9B params, MIT license. Served via vLLM v0.17.0 in a separate venv: GLM-OCR requires transformers 5.x, which conflicts with vLLM's transformers<5 constraint, so transformers 5.3.0 was force-installed with --no-deps. Recommended prompt from the model card: "Table Recognition:" (the model supports only 3 fixed prompts: Text/Table/Formula Recognition). Params: temperature=0.0, max_tokens=8192 (16384 context window).
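A sketch of the separate-venv setup described above — the venv name is arbitrary, and pinning vLLM to the exact tested build is left out:

```shell
# GLM-OCR needs transformers 5.x, but vLLM pins transformers<5.
# Install vLLM first, then force transformers past the pin with --no-deps
# so vLLM's other dependencies are left untouched.
python -m venv glm-ocr-venv
source glm-ocr-venv/bin/activate
pip install vllm
pip install --no-deps transformers==5.3.0
```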
English Table — kfd.pdf p1 (3-column bordered table)
| Metric | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| Speed | 5.5s | 1.8s |
| Tokens | ~700 | 705 |
| Format | HTML <table> | HTML <table class="table table-bordered"> with <thead>/<tbody> |
| AED 3,500,000 | Correct | Correct |
| AED 6,770 | Correct | Correct |
| AED 5,000 / 7,500 | Correct | Correct |
| AED 3,500 / 6,000 | Correct | Correct |
| AED 20,000 | Correct | Correct |
| Column headers | Correct | Correct (Motor Value, Motor Smart, Motor Executive) |
| Section headers | Plain text | colspan="4" — preserves "Main Covers" / "Enhanced Motor Protection" grouping |
kfd p1 verdict: GLM-OCR produces perfect HTML — all 6 AED amounts correct, proper <thead>/<tbody>, colspan section headers, Bootstrap-style class names. 3× faster than DotsOCR v1.5.
English Table — Oman p21 (8-row styled table, no grid lines)
| Metric | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| Speed | 5.8s | 2.2s |
| Tokens | ~700 | 696 |
| Format | HTML <table> with <thead>/<tbody> | HTML <table> with <thead>/<tbody> |
| "Global Innovation Index" 32.8 | Correct | Correct |
| "Education for All" 0.938 | Correct | Correct |
| "Skills" 71.6, Rank 36/140 | Correct | Correct |
| "Global Talent" 43.93 | Correct | Correct |
| "Quacquarelli Symonds" | Correct | Correct |
| "Omani" | Correct | Correct |
| 2030/2040 targets | All correct | All correct |
Oman p21 verdict: Perfect match — all 8 indicator names, all values, all targets correct. 2.6× faster than DotsOCR.
English Tables — Oman p23 & p25
| Page | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| p23 format | HTML table | HTML table with <thead>/<tbody> |
| p23 values | All correct | All correct |
| p23 speed | 2.9s | 0.9s |
| p23 tokens | — | 216 |
| p25 format | HTML table | HTML table with <thead>/<tbody> |
| p25 values | All correct | All correct |
| p25 speed | 3.2s | 1.0s |
| p25 tokens | — | 268 |
p23/p25 verdict: GLM-OCR produces correct HTML tables on all styled pages. 3× faster than DotsOCR on these pages.
Arabic Table — kfd.pdf p4
| Metric | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| Speed | 10.7s | 13.0s (hit 4096 token limit) |
| Tokens | ~1500 | 4096 (max) |
| Arabic numerals (٣,٥٠٠,٠٠٠ etc.) | All 6 correct | Zero correct |
| Arabic labels (التغطيات التأمينية) | All correct | Zero matched |
| Content | Actual document | Hallucinated — HTML table structure with "النشاطات عالية" repeated in every cell, zero real content |
Arabic verdict: Complete hallucination. GLM-OCR produced an HTML table structure but filled every cell with generic Arabic words ("النشاطات عالية" = "high activities") instead of actual document content. Zero real numerals or labels extracted. With the non-recommended generic prompt, it degenerated differently (single phrase repeated to 8192 tokens) — both modes produce zero useful Arabic output.
Summary — GLM-OCR vs DotsOCR on Tables
| Table Type | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| English bordered (kfd p1) | Correct HTML | Correct HTML — 3× faster |
| English styled (oman p21) | Correct HTML | Correct HTML — 2.6× faster |
| English styled (oman p23) | Correct HTML | Correct HTML — 3.2× faster |
| English styled (oman p25) | Correct HTML | Correct HTML — 3.2× faster |
| Arabic (kfd p4) | All correct | Degenerated — zero content |
| Avg speed (EN tables) | 4.4s | 1.5s (2.9× faster) |
| VRAM | 5.7 GB (BF16) | ~2 GB (BF16) |
| Params | 3B | 0.9B |
Verdict: GLM-OCR is the fastest and most efficient English table extractor tested — 0.9B params, ~2 GB VRAM, sub-2s on most pages, perfect HTML output with Bootstrap-style class names and proper <thead>/<tbody>. However, it requires a separate venv (transformers 5.x) and completely degenerates on Arabic. For English-only table extraction, GLM-OCR is the strongest candidate. DotsOCR v1.5 remains irreplaceable for Arabic.
Stability Issues
| Model | Issue | Severity |
|---|---|---|
| DotsOCR v1.0 | None observed | Stable |
| DotsOCR v1.5 | Minor: Arabic crop (top half) garbled one number; full page correct | Low |
| DotsOCR v1.5-SVG | With recommended params (repetition_penalty=1.15): zero degenerations (0/10). Without: 5/10 degenerate on SVG paths. Arabic garbled | Stable (with recommended params) |
| DeepSeek-OCR-2 | None observed via vLLM | Stable |
| Granite-Docling | With repetition_penalty=1.1: zero degenerations (0/10). Without: critical degeneration loops (451s, 7668 tokens) | Stable (with repetition_penalty) |
| Granite-Docling | vLLM outputs <loc_x> format, not proper DocTags — Docling converter cannot parse | Limitation |
| PaddleOCR-VL-1.5 | OCR mode degenerates on table/diagram pages ("Direct" or "نعم" repeated to 4096 tokens) | High |
| PaddleOCR-VL-1.5 | Arabic full crop degenerated to "0.000..." repeated to 4096 tokens | High |
| PaddleOCR-VL-1.5 | Table Recognition mode stable on English; mixed on Arabic | Moderate |
| Qwen3-VL-2B-FP8 | Degenerates on Arabic (kfd_p4) and society_diagram crop even with recommended params. Also degenerates on compliance crop with prod params (temp=0.1) — 32.7s, 3500 tokens of repeated newlines. E2E pipeline: 9/10 degenerations on adobe-6-page.pdf (avg 153s vs 30s normal) | High |
| Qwen3-VL-4B-AWQ-4bit | E2E pipeline: 1/20 degenerations on adobe-6-page.pdf (avg 26s). Single degeneration was newline loop on one image crop, caught by token limit and handled by Paddle fallback | Stable |
| Qwen3-VL-4B-BF16 | Zero degenerations with recommended params; Arabic ok (1178 tok) | Stable |
| Qwen3-VL-8B-AWQ-4bit | Zero degenerations; Arabic hit token limit but clean | Stable |
| Qwen3-VL-8B-BF16 | Zero degenerations; best Arabic output (935 tok, clean) | Stable |
| Qwen3-VL-30B-A3B-FP8 | Degenerates on Arabic (kfd_p4) even with recommended params | High |
| Qwen3.5-2B | Zero degenerations with recommended params; Arabic ok (990 tok) | Stable |
| Qwen3.5-4B | Degenerates on Arabic (kfd_p4); tested at temp=0.0 only | Moderate |
| Qwen3.5-4B-AWQ-4bit | Zero degenerations across 10 tests including Arabic | Stable |
| Qwen3.5-9B | Degenerates on Arabic (kfd_p4); tested at temp=0.0 only | Moderate |
| Qwen3.5-9B-AWQ-4bit | Zero degenerations with recommended params; Arabic ok (1061 tok) | Stable |
| Qwen3.5-35B-A3B-GPTQ-Int4 | Zero degenerations with recommended params; Arabic ok (1118 tok); 42 GB VRAM | Stable |
| Qwen3.5-35B-A3B (BF16) | OOM — 67 GB doesn't fit on 46 GB L40S | N/A |
| Nanonets-OCR2-3B | Zero degenerations across 9 pages + 5 crops; stable with temperature=0.0 | Stable |
| LightOnOCR-2-1B | Zero degenerations on full pages; garbled output on small crops (society diagram) | Stable (pages), Moderate (crops) |
| QARI-OCR v0.3 | Critical degeneration on table pages: ".....,0,0,0 درهم إماراتي" repeated (Arabic), "<h1></h1>" repeated (English). Both recommended and default params. Zero actual numbers extracted | Critical |
| FireRed-OCR-2B | Zero degenerations on English pages with recommended temperature=0.0. With temperature=0.7 (generation_config.json), CSS bloat caused p23 to hit 4096 token limit (27.1s). Arabic text garbled but no degeneration loops | Stable (English, temp=0.0) |
| InternVL3.5-4B | 5/9 pages flagged low_diversity. Arabic page completely hallucinated (generic template, no real content). 5/5 crops had quality issues. Generates full HTML boilerplate wasting tokens | High |
| GLM-OCR | Zero degenerations on English pages — excellent quality. Arabic page hallucinated with recommended "Table Recognition:" prompt: "النشاطات عالية" repeated in every cell to 4096 tokens. With generic prompt: different phrase repeated to 8192 tokens. Zero real Arabic content with either prompt. Requires separate venv (transformers 5.x) | Critical (Arabic), Stable (English) |
| HunyuanOCR | Zero degenerations on both English and Arabic. Arabic text quality excellent (proper RTL, all labels correct). With recommended Chinese doc parsing prompt: 4/6 Arabic numerals correct, HTML tables on English and Arabic. Some numbers still truncated (٦,٧٧ instead of ٦,٧٧٠). Prompt language matters — Chinese prompts produce significantly better output than English | Stable |
| Chandra-OCR | Zero degenerations. Best HTML output quality. 5/6 Arabic numerals correct (second-best after DotsOCR). But extremely slow (52s) | Stable |
| OlmOCR-2-7B-FP8 | Zero degenerations. 3/6 Arabic numerals, some column values merged | Stable |
| Granite Vision 3.3 2B | Zero degenerations. Westernizes all Arabic numerals (0/6) | Stable |
| Arabic-Legal-OCR | Zero degenerations but no table structure or numbers extracted | Stable (but useless for tables) |
| DIMI-Arabic-OCR-V2 | Hit 4096 token limit, extremely slow (95s). No numerals or table structure | Low quality |
| NuMarkdown-8B-Thinking | Zero degenerations. Arabic numbers garbled (0/6) but labels correct. Chain-of-thought reasoning accurate but doesn't fix output | Stable |
| DeepSeek-VL2-Tiny | Arabic table content completely hallucinated ("الاسم" repeated). English stable with correct values | Critical (Arabic) |
| ERNIE-4.5-VL-28B-AWQ-4bit | Arabic text reversed (LTR), numbers garbled. English stable. Thinking tokens consume context | Critical (Arabic) |
| Arabic-Nougat-Large | Hallucinates entirely on non-book content. Both Arabic and English outputs are fabricated | Critical (all) |
| OCRFlux-3B | Arabic numbers trigger degeneration: ٣٠٠٠٠٠٠٠٠٠ repeated to 4096 tokens. English pages stable. Same pattern as QARI-OCR (both Qwen2.5-VL-based) | Critical (Arabic) |
| Surya | N/A (standalone toolkit). Table detection failed on both pages. Arabic uses wrong numeral system (Persian/Urdu) | Limitation |
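Most failure modes in the table above are repetition loops: one phrase or line repeated until the token budget is exhausted. A minimal sketch of how such output can be flagged before it reaches downstream parsing — the thresholds and the function itself are illustrative assumptions, not the pipeline's actual detector:

```python
from collections import Counter


def looks_degenerate(text: str, max_repeat_ratio: float = 0.5, min_length: int = 200) -> bool:
    """Heuristic degeneration check: flag output dominated by one repeated line.

    max_repeat_ratio and min_length are illustrative thresholds, not tuned values.
    Short outputs are never flagged (too little signal to judge).
    """
    if len(text) < min_length:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        # Output is only whitespace/newlines — a common loop mode (e.g. Qwen 2B).
        return True
    _, count = Counter(lines).most_common(1)[0]
    return count / len(lines) > max_repeat_ratio
```

For example, `looks_degenerate("النشاطات عالية\n" * 500)` flags the GLM-OCR Arabic failure mode, while a page of distinct lines passes.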
vLLM Compatibility
| Model | vLLM Version | Status |
|---|---|---|
| DotsOCR v1.0 | v0.8.5 | Production (current) |
| DotsOCR v1.5 | v0.11.0 (native support) | Fully integrated, no custom code needed. Broken on v0.17.0 (outputs garbage/degeneration) |
| DotsOCR v1.5-SVG | v0.11.0+ (native, needs --chat-template-content-format string) | Works, same as base |
| Qwen3-VL (all sizes) | v0.11.0+ (native support) | Production-ready |
| DeepSeek-OCR-2 | nightly only (PR #33165, Feb 2026) | Not in stable release yet |
| Granite-Docling | v0.11.0+ | Works but output quality degrades vs raw transformers |
| PaddleOCR-VL-1.5 | nightly (fails on v0.11.0 — mlp_AR module not found) | Works on nightly |
| Qwen3.5 (all sizes) | nightly only (qwen3_5 arch not in v0.11.0) | Requires chat_template_kwargs: {enable_thinking: false} |
| Nanonets-OCR2-3B | v0.17.0 (Qwen2.5-VL architecture) | Works out of the box, --limit-mm-per-prompt '{"image": 1}' |
| LightOnOCR-2-1B | v0.17.0 (v0.11.1+ for v1) | Requires --mm-processor-cache-gb 0 --no-enable-prefix-caching |
| QARI-OCR v0.3 | v0.17.0 (Qwen2-VL architecture) | Loads and serves, but degenerates on table documents. Needs --max-model-len 16384 for full pages |
| FireRed-OCR-2B | v0.17.0 (Qwen3-VL architecture) | Works out of the box. Use --max-model-len 32768 and temperature=0.0 (recommended vLLM params) |
| InternVL3.5-4B | v0.17.0 (InternVLChat architecture) | Requires --trust-remote-code. Loads and serves, but quality is poor (hallucinations, degeneration) |
| GLM-OCR | v0.17.0 (glm_ocr architecture) | Requires transformers 5.x — separate venv with transformers==5.3.0 force-installed via --no-deps. Excellent English, degenerates on Arabic |
| HunyuanOCR | nightly (hunyuan_vl architecture) | Works with --no-enable-prefix-caching --mm-processor-cache-gb 0. Stable, fast, but English output lacks table structure and Arabic numbers truncated |
| Chandra-OCR | v0.17.0 (Qwen2.5-VL architecture) | Works out of the box. Very slow (52s on Arabic) |
| OlmOCR-2-7B-FP8 | v0.17.0 (Qwen2.5-VL architecture, FP8) | Works out of the box |
| Granite Vision 3.3 2B | v0.17.0 | Works out of the box |
| Arabic-Legal-OCR | v0.17.0 (Gemma-3 architecture) | Works but poor quality on tables |
| DIMI-Arabic-OCR-V2 | v0.17.0 (Qwen2.5-VL + LoRA) | Requires --gpu-memory-utilization 0.6 for LoRA loading. Very slow |
| NuMarkdown-8B-Thinking | v0.17.0 (Qwen2.5-VL architecture) | Works with --trust-remote-code --limit-mm-per-prompt '{"image": 1}' |
| DeepSeek-VL2-Tiny | v0.17.0 | Requires --hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' --limit-mm-per-prompt '{"image": 1}', timm package. Max 4096 context. Hallucinates Arabic |
| ERNIE-4.5-VL-28B-AWQ-4bit | v0.17.0 | Community AWQ quant. Requires --trust-remote-code, decord package. ~15 GB VRAM. Arabic reversed |
| Arabic-Nougat-Large | N/A (standalone) | Not vLLM-compatible. Uses VisionEncoderDecoderModel + NougatProcessor from transformers. pip install transformers torch pillow |
| OCRFlux-3B | v0.17.0 (Qwen2.5-VL architecture) | Works with --trust-remote-code. Good English, degenerates on Arabic |
| Surya | N/A (standalone) | Not a vLLM model. Install via pip install surya-ocr. Requires transformers <5 |
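The `chat_template_kwargs` requirement for Qwen3.5 in the table above maps to an extra field in the OpenAI-compatible request body that vLLM serves. A minimal sketch of a payload builder, assuming the standard chat-completions schema (the model names, prompt, and the `startswith` routing are placeholders for illustration):

```python
def build_ocr_request(model: str, image_url: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload for a vLLM server.

    chat_template_kwargs is only needed for Qwen3.5 (disables thinking tokens);
    other models in the table should omit it.
    """
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 4096,
    }
    if model.startswith("Qwen3.5"):
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body
```

The same body shape works for all vLLM-served models in the table; only the extra key differs.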
VRAM Budget Analysis (L40S: 46 GB)
| Configuration | VRAM Used | Free |
|---|---|---|
| Current (Triton + Qwen2.5-VL-7B + DotsOCR v1.0) | ~34.7 GB | ~11.3 GB |
| Current with Qwen3-VL-2B-FP8 (production) | ~28.4 GB | ~17.6 GB |
| Upgrade to DotsOCR v1.5 + Qwen3-VL-2B | ~19.9 GB | ~26.1 GB |
| Upgrade to DotsOCR v1.5 + Qwen3-VL-4B-AWQ | ~21.0 GB | ~25.0 GB |
| DotsOCR v1.5 + SVG model (replace Qwen) | ~22.7 GB | ~23.3 GB |
| Replace DotsOCR with PaddleOCR-VL-1.5 | ~22.3 GB | ~23.7 GB |
| Replace DotsOCR with DeepSeek-OCR-2 | ~27.0 GB | ~19.0 GB |
| Replace DotsOCR with Granite-Docling | ~21.0 GB | ~25.0 GB |
| Replace DotsOCR with Nanonets-OCR2 | ~28.5 GB | ~17.5 GB |
| Replace DotsOCR with LightOnOCR-2 | ~22.5 GB | ~23.5 GB |
Other Models Researched (Not Tested)
| Model | Params | Why Not Tested |
|---|---|---|
| Qwen2.5-VL-7B | 7B (~17GB) | Already use Qwen as secondary; larger than DotsOCR |
| NVIDIA Nemotron Parse v1.1 | 885M (~1.8GB) | Now tested — 0/6 Arabic (garbled), excellent English (1.2s, 3/3). No Arabic support. See Arabic Table Extraction section |
| NVIDIA Nemotron Nano VL 8B | 8B (~16GB) | Now tested — 0/6 Arabic numerals, English-only model, hallucinates Arabic content. Good English OCR. See Arabic Table Extraction section |
| InternVL3-8B | 8B (~16-20GB) | General-purpose VLM, not document-specific. 4B variant tested — hallucinates on Arabic, degenerates on crops |
| MiniCPM-V 4.5 | 8B (~5GB AWQ) | Now tested (AWQ-4bit) — 0/6 Arabic numerals, hallucinates Chinese+Arabic mixed text. Good English OCR. Apache-2.0. See Arabic Table Extraction section |
| OCRFlux-3B (ChatDOC) | 3B | Now tested — 0/6 Arabic numerals, degenerates on Arabic (same pattern as QARI-OCR). Good English OCR. See Arabic Table Extraction section |
| GOT-OCR 2.0 | 580M (~1.2GB) | Now tested (standalone transformers) — 0/6 Arabic, degenerated. English also broken (50 chars only). Not vLLM compatible. See Arabic Table Extraction section |
| GLM-OCR (Zhipu/zai-org) | 0.9B | Now tested — see GLM-OCR benchmark section above. #1 on OmniDocBench v1.5 (94.62), MIT license. Fastest English table extractor (0.9-2.2s). Arabic degenerates. Requires transformers 5.x (separate venv) |
| HunyuanOCR (Tencent) | 1B | Now tested — see Arabic Table Extraction section above. Best Arabic text quality (proper RTL, all labels matched) but numbers truncated (last digit dropped). English fast (0.6-1.5s) but no table structure. Custom license |
| OlmOCR-2-7B-FP8 (Allen AI) | 7B | Now tested — 3/6 Arabic numerals, some column merging. See Arabic Table Extraction section |
| Chandra-OCR (ChandraAI) | 9B | Now tested — 5/6 Arabic numerals (second-best), best HTML output, but 52s (very slow). See Arabic Table Extraction section |
| Granite Vision 3.3 2B (IBM) | 2B | Now tested — westernizes all Arabic numerals (0/6). See Arabic Table Extraction section |
| Arabic-Legal-Documents-OCR | 4B | Now tested — 0/6 Arabic numerals, no table structure. See Arabic Table Extraction section |
| DIMI-Arabic-OCR-V2 | 7B | Now tested — 0/6 Arabic numerals, 95s, no table structure. See Arabic Table Extraction section |
| NuMarkdown-8B-Thinking | 8B | Now tested — 0/6 Arabic numerals (garbled), good English tables with chain-of-thought reasoning. See Arabic Table Extraction section |
| Surya (Datalab) | ~500M | Now tested — standalone OCR toolkit (not VLM). 0/6 Arabic numerals (wrong numeral system), table detection failed. GPL license. See Arabic Table Extraction section |
| LightOnOCR-1B-1025 (v1) | 1B | Strictly inferior to LightOnOCR-2-1B (v2) which was already tested. v2 scores higher on all benchmarks (83.2% vs 76.1% OlmOCR-Bench). Same Arabic limitations (garbled numerals). No reason to test |
| AtlasOCR | — | Requires Unsloth framework, not compatible with vLLM |
| Baseer (Misraj/Baseer__Nakba) | 4B (Qwen2.5-VL-3B) | Now tested — non-functional via vLLM (outputs only 3-10 tokens). Needs custom transformers pipeline. See Arabic Table Extraction section |
| DeepSeek-VL2-Tiny | 3.4B MoE | Now tested — 0/6 Arabic numerals, hallucinates Arabic table content entirely. English OK. See Arabic Table Extraction section |
| ERNIE-4.5-VL-28B-A3B (Baidu) | 28B MoE / 3B active | Now tested (AWQ-4bit) — 0/6 Arabic numerals, Arabic text reversed. English OK. ~15 GB VRAM. See Arabic Table Extraction section |
| Arabic-Nougat (MohamedRashad) | ~400M | Now tested — 0/6 Arabic numerals, hallucinates on non-book content. Not vLLM-compatible. See Arabic Table Extraction section |
| Mistral OCR | Unknown | Commercial/API model, not self-hostable via vLLM |
Conclusions
1. DotsOCR v1.5 is the clear upgrade path — and the ONLY model with correct Arabic numerals
- 1.7× faster than v1.0 on average (2.9s vs 4.9s on Oman pages)
- Same quality: Perfect HTML tables, all values correct for both English and Arabic
- Arabic numbers perfect: ALL AED amounts extracted correctly (٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠) — matches v1.0. 25+ models tested, none match DotsOCR v1.5 on Arabic numerals. Ranked results:
- Chandra-OCR (9B): 5/6 — second-best, ٢٠,٠٠٠→٢,٠٠٠. Best HTML output but 52s (very slow), 16 GB VRAM
- Qwen3.5-4B-AWQ (generic prompt): 5/6 — ٢٠,٠٠٠→٢,٠٠٠. But 73s and reversed text. Recommended prompt only gets 3/6
- HunyuanOCR (1B): 4/6 — truncates last digit (٦,٧٧ instead of ٦,٧٧٠). Best Arabic text quality, fast (5.1s). Custom license
- OlmOCR-2-7B-FP8: 3/6 — column values merge together
- All others 0/6: DeepSeek-VL2-Tiny hallucinates, ERNIE-4.5-VL reverses Arabic text, Arabic-Nougat hallucinates on non-book content, OCRFlux-3B degenerates (same as QARI-OCR — both Qwen2.5-VL-based), Granite Vision westernizes, Surya uses Persian numerals, NuMarkdown garbles digit order, Arabic-Legal-OCR/DIMI no numbers extracted, GLM-OCR degenerates, QARI-OCR outputs all zeros, InternVL hallucinates, FireRed garbles, AIN-7B outputs empty table cells
- Native vLLM 0.11.0+ support: No custom code, out-of-tree registration, or nightly builds needed
- Same architecture: Drop-in replacement for v1.0 (same 3B params, same serving config)
- Stability: No degeneration observed, only minor crop artifacts
2. Qwen3-VL-4B-AWQ is the best secondary VLM upgrade
- Tested 12 Qwen variants across Qwen3-VL (6 models) and Qwen3.5 (7 models, 1 OOM)
- Qwen3-VL-4B-AWQ-4bit wins: fastest (3.0s avg), zero degenerations, handles Arabic (1300 tok), works on stable vLLM v0.11.0
- Qwen3.5-4B-AWQ is comparable but slower (3.5s) and requires vLLM nightly
- MoE models (30B/35B) don't justify their VRAM — 2B-FP8 and 30B still degenerate on Arabic
- 8B-BF16 has the cleanest Arabic output (935 tok) but is 3× slower and 4× larger
3. dots.ocr-1.5-svg does NOT replace Qwen
- With recommended params (`repetition_penalty=1.15`), the SVG model is stable (0/10 degenerations) and extracts text from diagrams where Qwen 2B degenerates
- However, Qwen3-VL-4B-AWQ does not degenerate on these same crops and produces more complete extractions on all 5 Picture crops (full indicator names + target values vs partial text snippets)
- SVG is slower (4.7–38.5s vs 0.3–4.0s), heavier (5.72 GB vs ~4 GB), Arabic broken (garbled output), and requires SVG post-processing
- Conclusion: Upgrade to Qwen3-VL-4B-AWQ-4bit for the secondary VLM role — no need for the SVG model
4. PaddleOCR-VL-1.5 remains strongest for English-only speed
- 7-17× faster than DotsOCR v1.0 on English tables (0.3-1.0s vs 5-7s)
- 1/8th the VRAM (1.8 GB vs 14.2 GB)
- Excellent English table quality — matches DotsOCR accuracy
- Weakness: Arabic numeral accuracy is poor (numbers missing or garbled)
- Weakness: OCR mode degenerates on complex pages; must use task-specific prompts
- Weakness: Requires vLLM nightly (not in stable release)
5. DeepSeek-OCR-2 is viable for bordered tables only
- Works well on cropped images with visible grid lines (kfd.pdf tables)
- Fails on styled/colored table layouts (Oman performance tables)
- 6.46 GB VRAM (saves 7.7 GB vs DotsOCR v1.0)
- Requires vLLM nightly
- Arabic numeral accuracy is poor
6. Granite-Docling is not ready for this use case
- Incredible VRAM efficiency (0.52 GB) and speed (0.9s avg)
- With `repetition_penalty=1.1`, degeneration is fully resolved (0/10 degenerations, all tests complete in 0.4–2.0s)
- Raw transformers + Docling lib can produce excellent tables
- But vLLM serving outputs `<loc_x><loc_y>` coordinate format instead of proper DocTags — the `docling` library cannot convert this format to markdown/HTML
- Would require a custom parser for vLLM's output format, or raw transformers inference (27.8s, too slow)
7. Nanonets-OCR2 and LightOnOCR-2 do not replace DotsOCR or Qwen
- Tested two dedicated OCR models from Oct 2025 as potential DotsOCR replacements
- LightOnOCR-2 (1B) is impressively fast (3.4s avg, 280 tok/s) and VRAM-efficient (~2 GB), matching DotsOCR v1.5 speed on English pages — but Arabic numerals are garbled and it cannot handle Picture crops
- Nanonets-OCR2 (3B) is 2.4× slower than DotsOCR v1.5 with no quality advantage. Describes images instead of extracting text on crops
- Neither model works for the secondary VLM role — they are document-page OCR models, not general VLMs. Qwen 4B-AWQ remains the best crop text extractor
- LightOnOCR-2 could serve as an English-only fast tier alongside PaddleOCR-VL-1.5, but DotsOCR v1.5 remains the only model with correct Arabic numeral extraction
Recommendation
Immediate action: Upgrade DotsOCR v1.0 → v1.5. Same quality, 1.7× faster, native vLLM 0.11.0 support. Drop-in replacement.
Upgrade Qwen3-VL-2B-FP8 → Qwen3-VL-4B-AWQ-4bit as the secondary VLM for Picture regions. Tested 12 Qwen variants across Qwen3-VL and Qwen3.5 families — Qwen3-VL-4B-AWQ is the winner: zero degenerations (vs 2/10 for 2B-FP8), handles Arabic (1300 tok), works on stable vLLM v0.11.0, and only adds ~1 GB disk (4.2 GB vs 3.3 GB). Community AWQ model from cyankiwi/Qwen3-VL-4B-Instruct-AWQ-4bit on HuggingFace. Qwen3.5 models are not recommended — they require vLLM nightly (v0.17.0+ not yet released) and offer no advantage for our use case.
Pipeline comparison on adobe page 3 (DotsOCR v1.5 layout + 3 Picture crops):
- DotsOCR v1.5 + Qwen 2B (prod params): 41.5s — compliance crop degenerated (32.7s wasted)
- DotsOCR v1.5 + Qwen 2B (recommended params): 7.2s — stable, but requires `presence_penalty=1.5`
- DotsOCR v1.5 + Qwen 4B-AWQ (prod params): 6.3s — stable without any param changes
- DotsOCR v1.5 + Qwen 4B-AWQ (recommended params): 6.2s — stable
Degeneration stability test (adobe-6-page.pdf, via full Temporal workflow: 10 runs for the old stack, 20 for the new):
| Stack | Degenerations | Avg Time | Avg Content |
|---|---|---|---|
| Old (v1.0 + Qwen 2B, prod params) | 9/10 (90%) | 153s | ~17,400c |
| New (v1.5 + Qwen 4B-AWQ, recommended params) | 1/20 (5%) | 26s | ~17,580c |
Old stack degenerates on almost every run — Qwen 2B produces 4096 tokens of repeated newlines on the compliance crop, adding ~140s per document. New stack had 1 degeneration across 20 runs, caught by token limit and handled by Paddle fallback with minimal content loss.
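The fallback path described above (the token cap bounds a runaway generation, then a secondary engine re-processes the crop) can be sketched as follows — the callables, the loop heuristic, and the limit value are hypothetical placeholders, not the workflow's actual interface:

```python
MAX_TOKENS = 4096  # generation cap; a degeneration loop fills this entire budget


def extract_with_fallback(primary_ocr, fallback_ocr, crop) -> str:
    """Run the primary VLM; if output hits the token cap or looks like a
    repetition loop, discard it and use the fallback engine instead.

    primary_ocr returns (text, tokens_generated); fallback_ocr returns text.
    Both are injected callables (hypothetical interface).
    """
    text, n_tokens = primary_ocr(crop)
    hit_cap = n_tokens >= MAX_TOKENS
    # Crude loop check: long output made of at most two distinct words.
    looped = len(set(text.split())) <= 2 and len(text) > 500
    if hit_cap or looped:
        return fallback_ocr(crop)
    return text
```

This keeps the cost of a degeneration bounded at one capped generation plus one fallback pass, matching the "minimal content loss" behaviour observed in the stability test.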
Full PDF end-to-end comparison (Temporal workflow, all PDFs):
| PDF | Pages | Main (v1.0+2B) | New (v1.5+4B) | Speedup | Content |
|---|---|---|---|---|---|
| adobe-6-page.pdf | 6 | 64s / 9,950c | 27s / 17,627c | 2.4× | +77% |
| kfd.pdf (EN/AR tables) | 6 | 42s / 8,448c | 21s / 10,731c | 2.0× | +27% |
| tickets.pdf (AR receipts) | 6 | 45s / 777c | 21s / 4,686c | 2.1× | +503% |
| oman-2040-en-min.pdf | 17 | 88s / 12,593c | 67s / 19,373c | 1.3× | +54% |
| ar-novel.pdf (Arabic) | 28 | 191s / 42,245c | 203s / 45,069c | 0.9× | +7% |
| Total | 63 | 430s / 74,013c | 339s / 97,486c | 1.3× | +32% |
Accuracy analysis per PDF:
adobe-6-page.pdf (English compliance white paper with diagrams):
- Main has 16 OCR errors, New fixes all 16: `٨ ةh0e` (garbled header), `Seryices` → `Services`, `Compianceoveniew` → `Compliance Overview`, `ayailability` → `availability`, `comptiance` → `compliance`, `reauirements` → `requirements`, `27oo1:2o13` → `27001:2013`, `FedRAMp` → `FedRAMP`, `FERpA` → `FERPA`, `5ocI٨ا` → `SOCIAL`, `SE5MENTs` → `SEGMENTS`, `obigations` → `obligations`, `priyacy` → `privacy`, `Asget yanagement` → `Asset Management`, missing ® symbol, incomplete conclusion
- New issue: one secondary model output contains a JSON code block from the compliance icons crop (minor)
kfd.pdf (English/Arabic motor insurance tables):
- English: Both extract all AED amounts correctly (3,500,000 / 6,770 / 5,000 / 7,500 / 20,000 / 200,000). Main uses `GcC` instead of `GCC`
- Arabic: Main outputs reversed/unstructured Arabic text — not readable as natural Arabic. New outputs proper right-to-left Arabic with markdown tables: `التغطيات التأمينية` ("insurance coverages"), `درهم إماراتي` ("UAE dirham"), all AED amounts in Arabic numerals (٣,٥٠٠,٠٠٠ / ٦,٧٧٠ / ٥,٠٠٠)
- New has proper markdown table structure for both English and Arabic benefit tables
- New issue: Arabic table has minor number errors — `٦,٥٠٠` for emergency medical (should be `٦,٠٠٠`) and `٦,٧٧٠` for personal injury cover (should be `٢٠,٠٠٠`)
tickets.pdf (Arabic government receipts + English travel docs):
- Main extracts only 777 chars — reversed Arabic (`نامع ةنطلس` instead of `سلطنة عمان`, "Sultanate of Oman"), garbled English (`NVOIGE` instead of `INVOICE`, `MNVDICe Cwnoamip`, `PECH-HZ`), and 3 completely blank pages
- New extracts 4,686 chars — readable Arabic receipt (`سلطنة عمان` "Sultanate of Oman", `سند صرف` "payment voucher", financial table with amounts), full English embassy check (HSBC, $1,598.00), Air Canada flight itinerary (4 flight segments with times, airports, passenger names), and invoice details
- New issue: secondary model hallucinates on the blurry handwriting page ("I'm not sure if I'm going to be able to do this") and describes image quality instead of extracting text on low-quality scans
oman-2040-en-min.pdf (17-page English government vision document):
- Main has 9+ systematic OCR errors: `2o4o` → `2040`, `oriorities` → `priorities`, `Ciyilization` → `Civilization`, `sOcial` → `social`, `SMuscol` → `Muscat`, `GR code` → `QR code`, `DocUumenf` → `Document`, `0mm2040` → `oman2040`, `Iargest` → `largest`
- Both extract the core content (vision priorities, national programs, indicators)
ar-novel.pdf (28-page Arabic academic paper):
- Both produce comparable Arabic text (~30K Arabic chars each, +7% in new)
- Main outputs reversed Arabic text throughout — functional but harder to process downstream
- New outputs proper Arabic with markdown headings (`# الْبُعد العجائبي في الرواية العربية`, "The Fantastical Dimension in the Arabic Novel")
- Similar speed (191s vs 203s) — pure text pages with no Picture crops, so the secondary model is not involved
- New is slightly slower likely due to higher max_tokens (4096 vs 512) allowing more complete extraction
Formatting: Main produces zero markdown formatting across all PDFs (no headings, no tables). New produces structured markdown throughout with ## headings, | | tables, and * bullet lists
Use recommended hyperparameters per model:
- Qwen models: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `presence_penalty=1.5` (critical for preventing Arabic text degeneration)
- DotsOCR SVG: `temperature=0.6`, `top_p=0.9`, `repetition_penalty=1.15` (critical for preventing SVG path degeneration)
- Granite-Docling: `repetition_penalty=1.1` (fixes degeneration loops)
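The per-model hyperparameters above can be kept in one lookup so the serving layer never mixes them up; a sketch, where the substring matching and the greedy-decoding default are assumptions rather than the production routing:

```python
# Recommended sampling parameters from this report's stability tests.
QWEN_PARAMS = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5}
DOTS_SVG_PARAMS = {"temperature": 0.6, "top_p": 0.9, "repetition_penalty": 1.15}
GRANITE_DOCLING_PARAMS = {"repetition_penalty": 1.1}


def sampling_params_for(model_name: str) -> dict:
    """Map a served model name to its recommended sampling params.

    The name matching and the temperature=0.0 fallback are illustrative
    assumptions, not part of the production pipeline.
    """
    name = model_name.lower()
    if "qwen" in name:
        return dict(QWEN_PARAMS)  # critical for Arabic stability
    if "svg" in name:
        return dict(DOTS_SVG_PARAMS)  # critical for SVG path stability
    if "docling" in name:
        return dict(GRANITE_DOCLING_PARAMS)
    return {"temperature": 0.0}  # assumed default for dedicated OCR models
```

Each dict can be passed straight through as sampling fields of the chat-completions request body.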
For English-heavy workloads: Consider PaddleOCR-VL-1.5 as a fast preprocessing tier (0.3-1.0s) with DotsOCR v1.5 fallback for Arabic content or complex layouts.
Adobe Page 3 Test — dots.ocr Behavior on Complex Pages
Tested dots.ocr v1.0 and v1.5 on adobe-6-page.pdf page 3 (CCF diagram + compliance infographic + body text):
- Both v1.0 and v1.5 produce identical output — body text only, diagrams marked as Picture
- Neither responds to prompt variation — tested 5 different prompts (OCR, extract all, describe, scene text), all produced the same output
- This confirms dots.ocr is a document parser, not a general VLM — it extracts structured text/tables and marks visual elements as Picture for secondary VLM processing
- dots.ocr v1.0 layout detection correctly identifies 3 Picture regions on this page, which are then routed to Qwen
HuggingFace Availability Note
The official rednote-hilab/dots.ocr-1.5 and dots.ocr-1.5-svg repos were removed from HuggingFace (tracked in GitHub issue #272). Available via:
- dots.ocr-1.5: HuggingFace mirror `kristaller486/dots.ocr-1.5` (MIT license, 40K+ downloads)
- dots.ocr-1.5-svg: ModelScope only, `modelscope.cn/models/rednote-hilab/dots.ocr-1.5-svg` (MIT license)