# OCR Model Comparison Report
- Date: 2026-03-11 (updated with AIN-7B, Nemotron Parse v1.1, GOT-OCR 2.0, Nemotron Nano VL, Baseer, MiniCPM-V 4.5, DeepSeek-VL2-Tiny, ERNIE-4.5-VL-28B, Arabic-Nougat, OCRFlux-3B, Chandra-OCR, OlmOCR-2, Granite Vision 3.3, NuMarkdown-8B, Surya, DIMI-Arabic-OCR-V2, Arabic-Legal-OCR, HunyuanOCR, GLM-OCR, FireRed-OCR-2B, InternVL3.5-4B benchmarks, QARI-OCR Arabic table test, Qwen table comparison)
- GPU: NVIDIA L40S (46 GB VRAM)
- Test documents: kfd.pdf (English/Arabic motor insurance), oman-2040-en.pdf (52-page government vision document), adobe-6-page.pdf (compliance white paper with diagrams/infographics)
See also: Hard English Benchmark — focused English-only evaluation of 17 models with ground truth scoring on 9 difficult pages.
## Models Tested
| Model | Params | VRAM (weights) | Serving | License |
|---|---|---|---|---|
| DotsOCR v1.0 (baseline) | 3B | 14.2 GB (FP8) | vLLM v0.8.5 | MIT |
| DotsOCR v1.5 | 3B | 5.72 GB (BF16) | vLLM v0.11.0 | MIT |
| DotsOCR v1.5-SVG | 3B | 5.72 GB (BF16) | vLLM v0.11.0 | MIT |
| Qwen3-VL-2B-Instruct-FP8 (current secondary) | 2B | 2.93 GB (FP8) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-4B-Instruct-AWQ-4bit | 4B | ~4 GB (INT4) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-4B-Instruct | 4B | ~8 GB (BF16) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-8B-Instruct-AWQ-4bit | 8B | ~7 GB (INT4) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-8B-Instruct | 8B | ~17 GB (BF16) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3-VL-30B-A3B-Instruct-FP8 | 30B MoE (3B active) | ~31 GB (FP8) | vLLM v0.11.0 | Apache-2.0 |
| Qwen3.5-2B | 2B | 4.3 GB (BF16) | vLLM nightly | Apache-2.0 |
| Qwen3.5-4B | 4B | 8.8 GB (BF16) | vLLM nightly | Apache-2.0 |
| Qwen3.5-4B-AWQ-4bit | 4B | 3.8 GB (INT4) | vLLM nightly | Apache-2.0 |
| Qwen3.5-9B | 9B | 19 GB (BF16) | vLLM nightly | Apache-2.0 |
| Qwen3.5-9B-AWQ-4bit | 9B | 8.5 GB (INT4) | vLLM nightly | Apache-2.0 |
| Qwen3.5-35B-A3B-GPTQ-Int4 | 35B MoE (3B active) | 23 GB (INT4) | vLLM nightly | Apache-2.0 |
| DeepSeek-OCR-2 | 3B MoE (~570M active) | 6.46 GB (BF16) | vLLM nightly | Apache-2.0 |
| Granite-Docling-258M (IBM) | 258M | 0.52 GB (BF16) | vLLM v0.11.0 | Apache-2.0 |
| PaddleOCR-VL-1.5 (Baidu) | 0.9B | ~1.8 GB (BF16) | vLLM nightly | Apache-2.0 |
| Nanonets-OCR2-3B | 3B (Qwen2.5-VL) | ~8 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| LightOnOCR-2-1B | 1B | ~2 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| QARI-OCR v0.3 (NAMAA-Space) | 2B (Qwen2-VL) | ~4 GB (BF16) | vLLM v0.17.0 | — |
| FireRed-OCR-2B (FireRedTeam) | 2B (Qwen3-VL-2B) | ~5 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| InternVL3.5-4B (OpenGVLab) | 4.7B (0.3B vision + 4.4B LLM) | ~9.5 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| GLM-OCR (Zhipu/zai-org) | 0.9B | ~2 GB (BF16) | vLLM v0.17.0 + transformers 5.x | MIT |
| HunyuanOCR (Tencent) | 1B (0.4B ViT + 0.5B LLM) | ~2 GB (BF16) | vLLM nightly | Other |
| Chandra-OCR (ChandraAI) | 9B (Qwen2.5-VL-7B) | ~16 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| OlmOCR-2-7B-FP8 (Allen AI) | 7B (Qwen2.5-VL-7B FP8) | ~8 GB (FP8) | vLLM v0.17.0 | Apache-2.0 |
| Granite Vision 3.3 2B (IBM) | 2B | ~4 GB (BF16) | vLLM v0.17.0 | Apache-2.0 |
| Arabic-Legal-Documents-OCR (Moha) | 4B (Gemma-3-4B) | ~9 GB (BF16) | vLLM v0.17.0 | — |
| DIMI-Arabic-OCR-V2 (AhmedZaky1) | 7B (Qwen2.5-VL-7B + LoRA) | ~16 GB (BF16) | vLLM v0.17.0 | — |
| NuMarkdown-8B-Thinking (NuMind) | 8B (Qwen2.5-VL-7B) | ~16 GB (BF16) | vLLM v0.17.0 | MIT |
| DeepSeek-VL2-Tiny | 3.4B MoE (1B active) | ~7 GB (BF16) | vLLM v0.17.0 | MIT |
| ERNIE-4.5-VL-28B-A3B-AWQ-4bit (Baidu) | 28B MoE (3B active) | ~15 GB (INT4) | vLLM v0.17.0 | Apache-2.0 |
| Arabic-Nougat-Large (MohamedRashad) | ~400M | ~0.8 GB (BF16) | Standalone (transformers) | MIT |
| OCRFlux-3B (ChatDOC) | 3B (Qwen2.5-VL-3B) | ~6 GB (BF16) | vLLM v0.17.0 | — |
| Surya (Datalab) | ~500M (multiple models) | ~10 GB (detection+OCR+table) | Standalone (pip) | GPL-3.0 |
| Baseer (Misraj/Baseer__Nakba) | 4B (Qwen2.5-VL-3B) | ~7 GB (BF16) | vLLM v0.17.0 | — |
| MiniCPM-V 4.5 AWQ (OpenBMB) | 8.7B | ~5 GB (INT4) | vLLM v0.17.0 | Apache-2.0 |
| Nemotron Nano VL 8B (NVIDIA) | 8B | ~16 GB (BF16) | vLLM v0.17.0 | Llama 3.1 Community |
| GOT-OCR 2.0 (StepFun) | 580M | ~1.2 GB (BF16) | Standalone (transformers) | Apache-2.0 |
| Nemotron Parse v1.1 (NVIDIA) | 885M | ~1.8 GB (BF16) | vLLM v0.17.0 | — |
| AIN-7B (MBZUAI) | 8B (Qwen2-VL-7B) | ~16 GB (BF16) | vLLM v0.17.0 | — |
## Speed Comparison — Oman 2040 (Full Pages)
### OCR / Document Parsing Mode
| Page | Content Type | DotsOCR v1.0 | DotsOCR v1.5 | DeepSeek-OCR-2 | Granite-Docling (vLLM) | PaddleOCR-VL (OCR) | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|---|---|---|---|---|
| 7 | Quote/directive | 7.2s | 0.7s | 2.4s | 0.5s | 0.2s | 3.8s | 2.9s |
| 10 | Vision chart | 3.6s | 1.5s | 11.0s | 0.6s | 0.6s | 3.0s | 7.4s |
| 16 | Mixed text/diagram | 5.5s | 3.2s | 7.9s | 1.4s | 0.7s | 3.6s | 1.3s |
| 21 | Performance table | 7.3s | 5.8s | 10.7s | 1.0s | 7.2s* | 7.2s | 3.0s |
| 23 | Performance table | 5.2s | 2.9s | 6.6s | 0.7s | 7.2s* | 4.2s | 1.5s |
| 25 | Performance table | 5.4s | 3.2s | 6.1s | 1.3s | 7.2s* | 5.1s | 1.7s |
| Avg (6 pages) | — | 4.9s | 2.9s | 6.5s | 0.9s | — | 4.5s | 3.0s |
*PaddleOCR-VL OCR mode degenerates on table pages (repeats one word to max tokens). Use Table Recognition mode instead.
### Table Recognition Mode (PaddleOCR-VL only)
| Page | DotsOCR v1.0 | DotsOCR v1.5 | PaddleOCR-VL (Table) | Speedup vs v1.0 |
|---|---|---|---|---|
| 21 (full page) | 7.3s | 5.8s | 1.0s | 7.3× |
| 23 (full page) | 5.2s | 2.9s | 0.3s | 17× |
| 25 (full page) | 5.4s | 3.2s | 0.5s | 11× |
| 21 (crop) | — | 5.6s | 1.0s | — |
| 25 (crop) | — | 1.7s | 0.5s | — |
## Table Extraction Quality — Oman Page 21 (Performance Indicators, 8 rows × 4 columns)
### DotsOCR v1.0 — BEST OVERALL (BASELINE)
- Output: Structured HTML `<table>` with `<thead>`, `<tbody>`, `<tr>`, `<td>`, `<th>`
- Accuracy: All 8 indicator names correct, all baseline values correct, all targets correct
- OCR: "Average", "Omani", "Quacquarelli Symonds" — all correct
- Structure: Proper column alignment, multiline cell values preserved
### DotsOCR v1.5 — MATCHES V1.0 QUALITY, 1.3× FASTER
- Output: Same structured HTML `<table>` format as v1.0
- Accuracy: All 8 indicator names correct, all values correct, all targets correct
- OCR: "Quacquarelli Symonds", "Omani" — all correct
- Structure: Proper 4-column alignment with `<thead>`/`<tbody>`
- Speed: 5.8s (1.3× faster than v1.0's 7.3s)
- Crop test: Also works on cropped tables (5.6s for crop P21)
### PaddleOCR-VL-1.5 (Table Recognition) — EXCELLENT
- Output: Structured cell format using `<fcel>`/`<lcel>`/`<ucel>`/`<ecel>`/`<nl>` tags
- Accuracy: All 8 indicator names correct, all baseline values correct, all targets correct
- OCR: "Symonds", "Omani" — all correct, matches DotsOCR quality
- Structure: Proper 4-column alignment, works on both full pages and crops
- Speed: 1.0s (7.3× faster than DotsOCR v1.0)
- Note: Cell format is easily convertible to HTML/markdown in post-processing
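Since the cell format is noted above as easily convertible, here is a minimal post-processing sketch. It assumes OTSL-style tag semantics (`<fcel>` = filled cell, `<ecel>` = empty cell, `<lcel>`/`<ucel>` = left/up merge continuation, `<nl>` = row break) — these semantics are an assumption for illustration, not defined in this report:

```python
import re

# Assumed OTSL-style semantics (not confirmed by this report):
# <fcel>text -> filled cell, <ecel> -> empty cell,
# <lcel>/<ucel> -> continuation of a left/up merge, <nl> -> end of row.
def cells_to_markdown(raw: str) -> str:
    rows, row = [], []
    for tag, text in re.findall(r"<(fcel|ecel|lcel|ucel|nl)>([^<]*)", raw):
        if tag == "nl":
            rows.append(row)
            row = []
        elif tag == "fcel":
            row.append(text.strip())
        else:  # ecel / lcel / ucel: render empty or merged cells as blank
            row.append("")
    if row:
        rows.append(row)
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(rows[0]) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)
```

The first row is treated as the header; merged cells are flattened to blanks, which loses the span information but is enough for markdown output.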
### DeepSeek-OCR-2 (Full Page) — FRAGMENTED
- Output: ~30 separate `<|ref|>text<|/ref|>` regions with bounding boxes
- Accuracy: Most text correct, some errors ("Rverage", "Omni")
- Structure: No table structure — each cell is an independent text block
- Note: Cannot reconstruct table from individual regions without spatial reasoning
### DeepSeek-OCR-2 (Cropped Table) — STILL FRAGMENTED
- Output: Same fragmented text regions even on cropped table images
- Conclusion: DeepSeek only produces HTML tables for visually bordered/gridded tables (worked on kfd.pdf which has clear grid lines), NOT for styled/colored table layouts
### Granite-Docling (vLLM) — JUMBLED
- Output: Single text blob with `<loc_x><loc_y>text` coordinate format
- Accuracy: Wrong numbers, missing values ("Value: (2018) Rank: 69/127")
- Structure: No table structure — vLLM serving outputs raw location tokens, not proper DocTags (`<doctag>`, `<text>`, `<table>`)
- Stability: With `repetition_penalty=1.1`, zero degenerations across 10 tests (0.4–2.0s each). Without it, critical degeneration loops (451s, 7668 tokens)
- Note: The `docling` library's `DocTagsDocument` converter cannot parse vLLM's `<loc_x>` format — it only works with proper DocTags from raw transformers inference
### Granite-Docling (Raw Transformers + Docling) — GOOD BUT SLOW
- Output: Proper markdown table via DocTags → Docling library conversion
- Accuracy: All values correct, proper column alignment
- Speed: 27.8s (5.7× slower than DotsOCR)
- Note: Only way to get proper DocTags output; vLLM serving mode loses the format entirely
## Table Extraction Quality — kfd.pdf (English Bordered Tables)
### DotsOCR v1.0
- Speed: ~3.5s per page
- Quality: Correct HTML tables, accurate numbers, proper structure
### DotsOCR v1.5
- Speed: 5.5s per page
- Quality: Correct HTML tables, all AED amounts correct (AED 3,500,000, AED 6,770, AED 5,000, etc.)
- Structure: Same `<table>`/`<thead>`/`<tbody>` format as v1.0, all 3 product columns detected
### PaddleOCR-VL-1.5 (Table Recognition)
- Speed: 0.9s per page (3.9× faster than v1.0)
- Quality: Correct cell structure with `<fcel>`/`<lcel>`/`<nl>` tags, all AED amounts correct (AED 3,500,000, AED 6,770, AED 5,000, etc.)
- Structure: All 3 product columns (Motor Value, Motor Smart, Motor Executive) correctly detected
### DeepSeek-OCR-2 (Cropped Tables)
- Speed: ~5.5s per crop via vLLM
- Quality: Proper HTML tables with `<tr>`, `<td>`, `colspan` — works well on bordered tables
- Note: Only works on visually clear gridded tables
## Arabic Table Extraction — kfd.pdf (Page 4)
### DotsOCR v1.0 — BEST FOR ARABIC
- Numbers: Correct (e.g., ٣,٥٠٠,٠٠٠ AED)
- Structure: Rows sometimes merged
- Text: Accurate Arabic text
### DotsOCR v1.5 — MATCHES V1.0 ARABIC QUALITY
- Speed: 10.7s (full page), 8.7s (full crop), 4.3s (top crop)
- Numbers: ALL correct — ٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠, ٣,٥٠٠, ٢٠,٠٠٠, ٧,٥٠٠, ١,٠٠٠, ٢٠٠,٠٠٠, ٣٠٠, ٥٠٠ درهم إماراتي
- Structure: Proper HTML table, all rows and columns correct
- Text: Accurate Arabic text — التغطيات التأمينية, مسؤولية الغير, etc.
- Extra rows: Extracted the additional rows that v1.0 also captured (emergency repairs, new vehicle, taxi, personal accident ٢٠٠,٠٠٠)
- Minor crop issue: Top-half crop garbled ambulance cost (٧٧,٧٧٧,٠٠٠ instead of ٦,٧٧٠), but full page was correct
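The "X/6 numerals" scores used throughout this section can be reproduced mechanically against the six reference AED values from this table; a minimal sketch (illustrative, not the exact scoring harness used for these tests):

```python
# The six reference AED values scored as "X/6" in this section
# (from the kfd.pdf page-4 table); the scoring logic is illustrative.
REFERENCE_NUMERALS = ["٣,٥٠٠,٠٠٠", "٦,٧٧٠", "٥,٠٠٠", "٣,٥٠٠", "٢٠,٠٠٠", "٧,٥٠٠"]

def score_numerals(model_output: str) -> str:
    text = model_output
    found = 0
    # Check longer values first and strip each match, so that e.g.
    # ٣,٥٠٠ is not also counted inside ٣,٥٠٠,٠٠٠.
    for n in sorted(REFERENCE_NUMERALS, key=len, reverse=True):
        if n in text:
            found += 1
            text = text.replace(n, "", 1)
    return f"{found}/{len(REFERENCE_NUMERALS)}"
```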
### PaddleOCR-VL-1.5 — Good Structure, Bad Numbers
Tested in 4 configurations:
| Test | Mode | Time | Table Structure | AED Numbers | Stability |
|---|---|---|---|---|---|
| Full page | Table Recognition | 1.4s | Good (3 columns) | Missing — "مقاراتي نعم" instead of values | Stable |
| Full table crop | Table Recognition | 7.2s | Good (4 columns) | Partially — "7,000.00 إماراتي" then degenerated | Degenerated |
| Top half crop | Table Recognition | 0.6s | Good | Missing — "درهم إماراتي" without values | Stable |
| Full table crop | OCR | 0.9s | No (flat text) | Missing entirely | Stable |
- Arabic text labels: Mostly correct — "التغطيات التأمينية الرئيسية", "مسؤولية الغير", etc.
- Critical weakness: AED numbers consistently missing or replaced with garbage ("مقاراتي" = nonsense)
- Full table crop degenerated: Output "0.000000..." repeated to 4096 tokens
### DeepSeek-OCR-2
- Numbers: Garbled — missing digits (٣,,٠٠٠ instead of ٣,٥٠٠,٠٠٠)
- Structure: Better row separation than DotsOCR
- Text: Decent Arabic text but with errors
### Granite-Docling
- Arabic: Not tested (documented as "experimental" Arabic support)
### QARI-OCR v0.3 (Arabic-Specialized) — DEGENERATED
Tested with recommended params (temperature=0.7, top_p=0.9) and recommended prompt ("Below is the image of one page of a document...Just return the plain text representation...Do not hallucinate."). Also tested with temperature=0.0 and a generic OCR prompt. Both configurations degenerated.
- Numbers: ALL zero — `.....,0,0,0 درهم إماراتي` repeated to the token limit. No actual AED values extracted
- Structure: No table structure — flat HTML `<h4>` tags with repeated placeholder text
- Arabic text labels: Partially correct (التغطيات التأمينية found) but most labels missing
- English page (kfd p1): Also degenerated — `<h1></h1><br>` repeated to 2000 tokens, zero content
- Verdict: Despite being purpose-built for Arabic OCR, QARI-OCR cannot handle tabular documents with numbers via vLLM. Significantly worse than DotsOCR v1.5
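QARI's failure mode — one short pattern repeated to the token limit — recurs with several models below (PaddleOCR-VL crops, OCRFlux, GOT-OCR). A simple heuristic can flag such outputs automatically; the thresholds here are arbitrary choices, not values used in these tests:

```python
def is_degenerate(text: str, min_repeats: int = 20, max_pattern_len: int = 20) -> bool:
    """Flag outputs that repeat one short pattern to the token limit,
    e.g. '<h1></h1><br>' * 2000 or a '٣٠٣٠٣٠...' digit loop (heuristic)."""
    for plen in range(1, max_pattern_len + 1):
        pattern = text[-plen:]
        if pattern and text.endswith(pattern * min_repeats):
            return True
    return False
```

Running this on each page's output before accepting it would catch these loops without manual inspection.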
### Qwen3.5-4B-AWQ-4bit — BEST NON-DOTSOCR ARABIC
Tested with recommended Qwen params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5) and recommended prompt "qwenvl markdown" (production secondary/image prompt). Served via vLLM v0.17.0 (glmocr-venv with transformers 5.3.0).
- Speed: 73.3s (very slow — 6.9× slower than DotsOCR v1.5)
- Numbers: 3/6 Arabic numerals correct — ٣,٥٠٠,٠٠٠, ٥,٠٠٠, ٣,٥٠٠. Missing: ٦,٧٧٠ (rendered as ٦,٧٧,٠٠٠), ٢٠,٠٠٠ (rendered as ٢,٠٠٠,٠٠٠), ٧,٥٠٠ (rendered as ٧,٥٠,٠٠٠). Numbers have wrong comma grouping — extra zeros added
- Structure: Markdown table with proper columns and section headers
- Arabic text: Reversed (LTR instead of RTL) — individual words correct but reading direction wrong. Labels like ةينيمأتلا تايطغتلا (reversed التغطيات التأمينية) present
- English page (kfd p1): 3.2s, all values correct, clean markdown tables
- Oman p21: 4.5s, all 6 key values correct (32.8, 0.938, 71.6, 43.93, Quacquarelli, Omani) but rendered as bullet lists instead of table
- Also tested with generic prompt ("Convert the content of the image to Markdown format."): Got 5/6 numerals correct — ٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠ (missing ٢٠,٠٠٠ → ٢,٠٠٠). Better than the recommended prompt but not the official configuration
- Verdict: Good Arabic numeral extraction (3–5/6 numerals depending on prompt). But very slow on Arabic (73s) and text is reversed. Not a DotsOCR replacement
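The reversed-text failure seen here (and again with ERNIE-4.5-VL below) is easy to detect automatically, since LTR-rendered RTL text is just the codepoint-reversed string; a minimal check:

```python
def looks_reversed(output: str, label: str) -> bool:
    """True if a reference label appears only codepoint-reversed
    (LTR rendering of RTL text), as with 'ةينيمأتلا تايطغتلا'."""
    return label not in output and label[::-1] in output

# Usage: compare model output against a known label from the source page.
label = "التغطيات التأمينية"
reversed_form = label[::-1]  # what Qwen3.5 emitted for this label
```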
### HunyuanOCR (Tencent) — BEST ARABIC TEXT, TRUNCATED NUMBERS
Tested with recommended params (temperature=0.0) and the recommended Chinese document parsing prompt "提取文档图片中正文的所有信息用markdown格式表示,其中页眉、页脚部分忽略,表格用html格式表达,文档中公式用latex格式表示,按照阅读顺序组织进行解析。" (roughly: "Extract all body text from the document image as markdown; ignore headers and footers; express tables in HTML and formulas in LaTeX; parse in reading order." — all recommended prompts on the model card are in Chinese). Served via vLLM nightly (glmocr-venv). 1B params (0.4B ViT + 0.5B LLM), ~2 GB VRAM. Supports 100+ languages including Arabic. Requires `--no-enable-prefix-caching --mm-processor-cache-gb 0`.
- Speed: 5.1s (2× faster than DotsOCR v1.5's 10.7s on Arabic)
- Numbers: 4/6 Arabic numerals correct — ٣,٥٠٠,٠٠٠, ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠. Missing: ٦,٧٧٠ (rendered as ٦,٧٧), ٢٠,٠٠٠ (rendered as ٢,٠٠٠). Last digit still dropped on some numbers
- Arabic text: BEST of any non-DotsOCR model — proper RTL direction, all 3 reference labels matched (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي). Fully readable Arabic with correct word order
- Structure: HTML `<table>` with `<td>`, `colspan` — proper table structure with recommended prompt
- Arabic chars: 1,397 (comparable to DotsOCR's ~1,500)
- English page (kfd p1): 2.1s, all values correct (3,500,000, 6,770, 5,000), HTML table structure
- Oman p21: 2.1s, all 6 key values correct (32.8, 0.938, 71.6, 43.93, Quacquarelli, Omani), HTML table
- Oman p23/p25: 0.8-1.0s, correct values, HTML tables
- Also tested with generic English prompt ("Extract the text in the image"): Only 2/6 Arabic numerals correct, flat text output with no table structure. The Chinese prompt is significantly better
- Also tested with Chinese table prompt ("把图中的表格解析为HTML。" — "Parse the table in the image into HTML."): 2/6 Arabic numerals — same as the generic English prompt
- Verdict: HunyuanOCR has the best Arabic text quality of any non-DotsOCR model — proper RTL, correct labels, readable output, HTML tables. 4/6 Arabic numerals correct with the recommended doc parsing prompt (best non-DotsOCR result). But it still truncates some numbers (٦,٧٧ instead of ٦,٧٧٠). Fast (0.8–5.1s). Not a full DotsOCR replacement but the closest competitor for Arabic
### Chandra-OCR (9B) — SECOND-BEST ARABIC NUMERALS (5/6)
Tested with recommended prompt "Convert the following image to markdown format." and temperature=0.0. Served via vLLM v0.17.0 (glmocr-venv). 9B model based on Qwen2.5-VL-7B, ~16 GB VRAM.
- Speed: 52.3s (very slow — 4.9× slower than DotsOCR v1.5)
- Numbers: 5/6 Arabic numerals correct — ٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠. Missing: ٢٠,٠٠٠ (rendered as ٢,٠٠٠)
- Structure: HTML `<table>` with `<thead>`, `<tbody>`, `<th>`, `<td>`, `colspan` — best HTML output of all tested models, proper RTL with `style="text-align: center/right"`
- Arabic text: Excellent — proper RTL, all 3 reference labels matched (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي). Clean readable Arabic
- English page (kfd p1): All values correct, clean HTML tables
- Verdict: Second-best Arabic numeral accuracy after DotsOCR v1.5 (5/6 vs 6/6). Best HTML output quality. But extremely slow (52s) and requires ~16 GB VRAM — too large to run alongside other models
### OlmOCR-2-7B-FP8 (Allen AI) — 3/6 ARABIC NUMERALS
Tested with recommended prompt (OlmOCR's built-in document parsing prompt) and temperature=0.0. Served via vLLM v0.17.0. 7B model, ~8 GB VRAM (FP8).
- Speed: 25.3s
- Numbers: 3/6 Arabic numerals correct — ٥,٠٠٠, ٣,٥٠٠, ٧,٥٠٠. Missing: ٣,٥٠٠,٠٠٠ (values merged across columns), ٦,٧٧٠ (rendered as ٦,٧٧), ٢٠,٠٠٠ (rendered as ٢٠,٠٠)
- Structure: Markdown tables with proper columns
- Arabic text: All 3 reference labels found. Some column values merged together
- English page (kfd p1): All values correct
- Verdict: Decent Arabic support (3/6 numerals) but column merging and truncated numbers. Slower than DotsOCR v1.5
### Granite Vision 3.3 2B (IBM) — 0/6 ARABIC NUMERALS
Tested with temperature=0.0 and generic OCR prompt. Served via vLLM v0.17.0. 2B model, ~4 GB VRAM.
- Speed: 18.8s
- Numbers: 0/6 Arabic numerals — all numbers westernized (3,500,000 instead of ٣,٥٠٠,٠٠٠)
- Structure: Markdown tables
- Arabic text: Westernized — Arabic labels present but numerals converted to Western
- English page (kfd p1): All values correct
- Verdict: Not viable for Arabic — converts all Arabic numerals to Western digits
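If a model westernizes digits but otherwise preserves labels and structure, the numerals can be mapped back in post-processing; a sketch — note this is only sound when the source document is known to use Arabic-Indic digits, which the model output alone cannot confirm:

```python
# Map Western digits back to Arabic-Indic (U+0660-U+0669).
# Only a valid post-fix when the source page used Arabic-Indic digits;
# that fact is not recoverable from the model output itself.
WEST_TO_ARABIC = str.maketrans("0123456789", "٠١٢٣٤٥٦٧٨٩")

def easternize(text: str) -> str:
    return text.translate(WEST_TO_ARABIC)
```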
### Arabic-Legal-Documents-OCR (Moha) — 0/6 ARABIC NUMERALS
Tested with recommended prompt from model card. Served via vLLM v0.17.0. 4B model based on Gemma-3-4B, ~9 GB VRAM.
- Speed: 29.5s
- Numbers: 0/6 Arabic numerals — no numbers extracted at all
- Structure: No table structure — flat text output
- Arabic text: Labels partially found but garbled. Outputs mostly unstructured text
- English page (kfd p1): Partial values only
- Verdict: Despite being trained on Arabic legal documents, cannot handle tables with numbers
### DIMI-Arabic-OCR-V2 (AhmedZaky1) — 0/6 ARABIC NUMERALS
Tested with the recommended prompt. Served via vLLM v0.17.0 with Qwen2.5-VL-7B base + LoRA adapter, `--gpu-memory-utilization 0.6`. ~16 GB VRAM.
- Speed: 95.4s (extremely slow — hit 4096 token limit)
- Numbers: 0/6 Arabic numerals — no Arabic numerals found
- Structure: No table structure
- Arabic text: All 3 reference labels found (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي) but no table structure or numerals
- English page (kfd p1): Partial extraction only
- Verdict: Very slow, no numerals, no table structure. LoRA adapter on Qwen2.5-VL-7B did not help with Arabic table extraction
### NuMarkdown-8B-Thinking (NuMind) — 0/6 ARABIC NUMERALS
Tested with recommended params (temperature=0.7, image-only prompt — model ignores text instructions). Served via vLLM v0.17.0. 8B model based on Qwen2.5-VL-7B, ~16 GB VRAM. MIT license.
- Speed: 41.0s
- Numbers: 0/6 Arabic numerals — all numbers garbled (e.g., "5,5,5,3 درهم إماراتي" instead of ٣,٥٠٠,٠٠٠). Numbers appear to be jumbled individual digits
- Structure: Markdown table with proper columns (best markdown structure of non-HTML models)
- Arabic text: All 3 reference labels found (التغطيات التأمينية, مسؤولية الغير, درهم إماراتي). Good Arabic text quality
- Chain-of-thought: Model generates `<think>...</think>` reasoning tokens analyzing document layout before producing `<answer>`. Reasoning is accurate (correctly identifies RTL table, merged cells) but the final output still garbles numbers
- English page (kfd p1): 20.9s, all values correct (3,500,000, 6,770, 5,000)
- Verdict: Strong English table extraction with reasoning-based layout analysis. Arabic labels good but numbers completely garbled. No documented Arabic support
### DeepSeek-VL2-Tiny — 0/6 ARABIC NUMERALS, HALLUCINATED
Tested with a generic OCR prompt and temperature=0.0. Served via vLLM v0.17.0 with `--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`. 3.4B MoE (1B active), ~7 GB VRAM. Max context 4096 tokens.
- Speed: 8.4s (hit 2048 token limit)
- Numbers: 0/6 Arabic numerals — no actual numbers extracted. Table cells filled with "الاسم" (="name") repeated. Completely hallucinated structure
- Structure: HTML `<table>` with CSS styling but entirely fabricated content — correct number of columns but wrong headers and empty data cells
- Arabic text: 0/3 reference labels found. Hallucinated generic column headers instead of real content
- English page (kfd p1): 3.2s, all values correct (3,500,000, 6,770, 5,000) but generated unnecessary HTML boilerplate (DOCTYPE, stylesheet links)
- Verdict: General VLM, not OCR-specialized. Hallucinates Arabic table content entirely. Arabic not officially supported — DeepSeek-OCR-2 (already tested) is the OCR-specific derivative
### ERNIE-4.5-VL-28B-A3B-AWQ-4bit (Baidu) — 0/6 ARABIC NUMERALS, REVERSED TEXT
Tested with a generic OCR prompt and temperature=0.0. Community AWQ-4bit quantization from `cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit`. 28B MoE (3B active), ~15 GB VRAM (INT4). Requires `--trust-remote-code` and the `decord` package.
- Speed: 35.3s (hit 4096 token limit, includes thinking tokens)
- Numbers: 0/6 Arabic numerals — all numbers garbled ("٤,٠,...,٥ يتارامإ" instead of ٣,٥٠٠,٠٠٠ درهم إماراتي). Truncated and dots instead of digits
- Structure: HTML `<table>` with extensive CSS styling, `rowspan`/`colspan`. Good structural attempt but wrong content
- Arabic text: 0/3 reference labels found. All Arabic text is reversed (LTR rendering of RTL) — "ةينيمأتلا تايطغتلا" instead of "التغطيات التأمينية", "يتارامإ مهرد" instead of "درهم إماراتي"
- Chain-of-thought: Thinking variant outputs reasoning before answer. Reasoning correctly identifies document as Arabic insurance but final output reverses all text
- English page (kfd p1): 10.7s, all values correct (3,500,000, 6,770, 5,000), clean HTML table
- Verdict: Strong English OCR but Arabic is reversed and numbers garbled. No official Arabic support. Even at 28B params (3B active), cannot handle Arabic documents. Too large for our VRAM budget anyway
### Arabic-Nougat-Large (MohamedRashad) — 0/6 ARABIC NUMERALS, HALLUCINATED
Tested as a standalone Nougat-based encoder-decoder model (NOT vLLM — uses transformers directly). ~400M params, ~0.8 GB VRAM. Purpose-built for converting Arabic book pages to Markdown. Max 8192 tokens. Params: `repetition_penalty=1.5`, `max_new_tokens=8192`.
- Speed: 3.2s (very fast)
- Numbers: 0/6 Arabic numerals — no real numbers extracted. Numbers in output are hallucinated ("٣٦٤", "٢٥٠٠٠")
- Structure: Markdown table separator `---|---|---` present but content is entirely fabricated
- Arabic text: 0/3 reference labels found. Title hallucinated as "تأكيد رحلة نموذج التأمين" (completely wrong — should be "وثيقة السمات الرئيسية للتأمين على المركبات")
- Content: Generates plausible-looking Arabic text that is entirely fabricated — hallucinates academic/historical content ("الولايات", "القرن الرابع الميلادي", "Museum-Homewforum") instead of insurance table data
- English page (kfd p1): 2.5s, 0/3 values found. Output is garbled pseudo-English/French — "Égypte", "Múarne Krb", "Cairo Economy Series". Completely unusable
- Verdict: Trained on Arabic academic books, not documents. Hallucinates wildly on unfamiliar content (insurance tables). Not viable for any OCR use case outside its training domain
### OCRFlux-3B (ChatDOC) — 0/6 ARABIC NUMERALS, DEGENERATED
Tested with the recommended prompt from the OCRFlux toolkit source (`build_page_to_markdown_prompt`: "Below is the image of one page of a document. Just return the plain text representation...ALL tables should be presented in HTML format...Do not hallucinate.") and temperature=0.0, max_tokens=8192. Served via vLLM v0.17.0. 3B model based on Qwen2.5-VL-3B, ~6 GB VRAM.
- Speed: 44.1s (hit 4096 token limit — degenerated)
- Numbers: 0/6 Arabic numerals — degenerated on Arabic numbers. Output: `٣٠٠٠٠٠٠٠٠٠٣٠٠٠٠٠٠٠٠٠` repeated to the token limit (same "٣٠" pattern as QARI-OCR). No actual AED values extracted
- Structure: HTML `<table>` tags present but table content degenerated after the first few rows
- Arabic text: Labels partially found (2/3 — التغطيات التأمينية, مسؤولية الغير). Arabic text before the table was correct, but numbers triggered a degeneration loop
- Output format: JSON with `natural_text` field containing markdown + HTML + custom `<table>` tags (needs post-processing via `table_matrix2html`)
- English page (kfd p1): 7.9s, all values correct (3,500,000, 6,770, 5,000). Good HTML table output
- Verdict: Good English OCR but critically degenerates on Arabic numbers. Same degeneration pattern as QARI-OCR (both Qwen2.5-VL-based). Not viable for Arabic documents
### Surya (Datalab) — 0/6 ARABIC NUMERALS (Standalone OCR Toolkit)
Tested as a standalone OCR toolkit (not vLLM). Surya is a suite of specialized models for text detection, recognition, layout analysis, and table recognition. ~500M params total across models, ~10 GB VRAM for table recognition at default batch size.
- OCR Speed: ~2s for detection + recognition (very fast)
- Numbers: 0/6 Arabic numerals — outputs Persian/Urdu numerals instead of Arabic (e.g., ۵۰۰ instead of ٣,٥٠٠,٠٠٠, ،۱٫۷۷ instead of ٦,٧٧٠)
- Table Recognition: No tables detected on either kfd page (p1 English or p4 Arabic)
- Arabic text: Labels found (3/3 — التغطيات التأمينية, مسؤولية الغير, درهم إماراتي) but text has errors (وثبقة instead of وثيقة, حدول instead of جدول)
- Output format: Flat text (OCR) or JSON cell structures (table rec) — no HTML/Markdown. Needs Marker library for formatted output
- English page (kfd p1): Good text extraction but as flat text, not table-structured
- Verdict: Lightweight and fast OCR toolkit but cannot handle Arabic numerals (wrong numeral system), table detection failed, and output is unstructured. Not a viable candidate for our use case
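The "wrong numeral system" failure is a codepoint issue: Arabic-Indic digits live at U+0660–U+0669, while the Persian/Urdu forms Surya emits are the Extended Arabic-Indic block at U+06F0–U+06F9. A small classifier makes the distinction testable:

```python
def digit_script(ch: str) -> str:
    """Classify a numeral codepoint: Arabic-Indic (U+0660-0669) vs
    Extended Arabic-Indic, the Persian/Urdu forms (U+06F0-06F9), vs Western."""
    cp = ord(ch)
    if 0x0660 <= cp <= 0x0669:
        return "arabic-indic"
    if 0x06F0 <= cp <= 0x06F9:
        return "extended-arabic-indic"
    if ch.isascii() and ch.isdigit():
        return "western"
    return "other"
```

The glyphs are near-identical for several digits, which is why this failure is easy to miss in visual inspection but trivial to catch programmatically.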
### Baseer (Misraj/Baseer__Nakba) — NON-FUNCTIONAL VIA VLLM
Tested with the recommended prompt from the model card ("Extract the text from the above document.") and temperature=0.0, max_tokens=4096. Served via vLLM v0.17.0 with `--max-model-len 8192`. 4B model based on Qwen2.5-VL-3B, ~7 GB VRAM.
- Speed: 3.4s (but only generated 3 tokens)
- Numbers: 0/6 Arabic numerals — model outputs only 3-10 tokens total (e.g., "ليرة" or "ليرا Motor Insurance KFD")
- Structure: No table structure — output too short
- Arabic text: Single word output ("ليرة" = "lira" — wrong currency, should be "درهم إماراتي")
- English page (kfd p1): 10 tokens only — "ليرا Motor Insurance KFD"
- Verdict: Baseer appears to require its specific transformers pipeline (custom `processor` and `generate()` params like `repetition_penalty=1.1`). Via the vLLM chat completions API, it generates only a few tokens and stops. Not viable through vLLM
### MiniCPM-V 4.5 AWQ (OpenBMB) — 0/6 ARABIC NUMERALS, HALLUCINATED + CHINESE MIXING
Tested with a generic OCR prompt and temperature=0.0. Served via vLLM v0.17.0. AWQ-4bit quantization (`openbmb/MiniCPM-V-4_5-AWQ`), ~5 GB VRAM. Apache-2.0 license. Requires `--max-model-len 16384 --max-num-batched-tokens 16384`.
- Speed: 38.3s (hit 4096 token limit — degenerated)
- Numbers: 0/6 Arabic numerals — no real numbers extracted. Table cells filled with repeated "العلاقة的基本组成部分" (Arabic + Chinese mixed hallucination)
- Structure: HTML `<table>` tags present but content is entirely fabricated — repeating the same Chinese+Arabic phrase in every cell
- Arabic text: 0/3 reference labels found. Title hallucinated as "العلاقة بين أجزاء جملة مركبة" (completely wrong — "the relationship between parts of a compound sentence")
- English page (kfd p1): 8.4s, all values correct (3,500,000, 6,770, 5,000). Good HTML table structure
- Verdict: Strong English OCR but completely hallucinates on Arabic — generates Chinese characters mixed with Arabic in a degenerate loop. Despite 8.7B params, Arabic is not supported. Only useful for English documents
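The Chinese-mixing failure mode can be flagged with a simple script-range check; a sketch (Unicode ranges simplified to the basic Arabic and CJK Unified Ideographs blocks):

```python
def mixed_chinese_arabic(text: str) -> bool:
    """Flag the MiniCPM-style failure where CJK characters appear
    inside Arabic output (e.g. 'العلاقة的基本组成部分')."""
    has_arabic = any("\u0600" <= c <= "\u06FF" for c in text)
    has_cjk = any("\u4E00" <= c <= "\u9FFF" for c in text)
    return has_arabic and has_cjk
```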
### Nemotron Nano VL 8B (NVIDIA) — 0/6 ARABIC NUMERALS, HALLUCINATED ARABIC TEXT
Tested with a generic OCR prompt and temperature=0.0. Served via vLLM v0.17.0 with `--trust-remote-code`. 8B model (C-RADIOv2-H vision + Llama-3.1-8B LLM), ~16 GB VRAM (BF16). English-only model (per model card). Requires `timm`, `open-clip-torch`, `einops`.
- Speed: 21.7s
- Numbers: 0/6 Arabic numerals, 0/3 Western numerals — no real numbers extracted at all. Table cells filled with hallucinated Arabic names ("نجم", "إيمان إبراهيم") instead of actual values
- Structure: LaTeX `\begin{tabular}` format with `\multicolumn` — unusual output format, not HTML or Markdown
- Arabic text: 0/3 reference labels found. All Arabic text is hallucinated — "الصفات الشخصية" ("personal attributes"), "موتور سمارت" (transliterated "Motor Smart") instead of actual Arabic insurance terms. Recognizes the document is Arabic but fabricates content
- English page (kfd p1): 8.3s, all values correct (3,500,000, 6,770, 5,000). Good LaTeX table with correct structure and values
- Verdict: English-only model as documented. Recognizes Arabic script but hallucinates content entirely. LaTeX output format is unusual. Not viable for Arabic documents. Strong English OCR but 8B/16GB is too large for a secondary model
### GOT-OCR 2.0 (StepFun) — 0/6 ARABIC NUMERALS, DEGENERATED
Tested as a standalone model via the HF-native `stepfun-ai/GOT-OCR-2.0-hf` with `AutoModelForImageTextToText`. 580M params, ~1.2 GB VRAM. Used `format=True` (formatted OCR mode) and `do_sample=False`, `max_new_tokens=4096`. Not vLLM compatible (custom `GOTQwenForCausalLM` architecture).
- Speed: 4.0s Arabic, 1.2s English (very fast — smallest model tested)
- Numbers: 0/6 Arabic numerals, 0/3 Western numerals — no numbers extracted from either page
- Structure: LaTeX `\title{}` wrapper only, no table structure at all
- Arabic text: 0/3 reference labels found. Degenerated — repeats "انتشار" (= "spread") 27 times, then outputs the Chinese "汉语" ("Chinese language"). Single-word repetition loop
- English page (kfd p1): Only extracted title "Motor Insurance KFD" and "Table of Benefits:" — stopped after 50 chars. No table content, no values
- Verdict: Fast and lightweight but essentially non-functional on full-page document images. Designed for scene text and cropped regions, not full-page document parsing. Cannot handle complex layouts with tables. Not viable for any document OCR use case
### Qwen3.5-9B-AWQ-4bit — 0/6 ARABIC NUMERALS
Tested with recommended Qwen params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5) and the recommended "qwenvl markdown" prompt. Served via vLLM v0.17.0. 9B model, ~8.5 GB VRAM (INT4).
- Speed: 103.2s (extremely slow — hit 4096 token limit with chain-of-thought reasoning)
- Numbers: 0/6 Arabic numerals — all numbers westernized (6,770 and 5,000 found as Western, but no Arabic-script numerals). Model generates extensive reasoning tokens analyzing table structure before outputting content
- Structure: Markdown table attempted but truncated by token limit
- Arabic text: 1/3 reference labels found (مسؤولية الغير). Text partially correct but mostly reasoning in English about the Arabic content
- English page (kfd p1): 26.8s, all values correct (3,500,000, 6,770, 5,000). Chain-of-thought reasoning adds significant overhead
- Verdict: 9B model wastes most tokens on reasoning instead of extraction. Westernizes Arabic numerals. Much slower than 4B variant with worse Arabic results
### Qwen3.5-35B-A3B-GPTQ-Int4 — 2/6 ARABIC NUMERALS
Tested with recommended Qwen params (temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5) and the recommended "qwenvl markdown" prompt. Served via vLLM v0.17.0 with `--gpu-memory-utilization 0.85`. 35B MoE (3B active), ~23 GB VRAM (INT4).
- Speed: 51.5s (hit 4096 token limit with chain-of-thought reasoning)
- Numbers: 2/6 Arabic numerals — ٣,٥٠٠,٠٠٠ and ٣,٥٠٠ found. Missing: ٦,٧٧٠, ٥,٠٠٠, ٢٠,٠٠٠, ٧,٥٠٠. Also found 3,500,000 as Western numeral
- Structure: Markdown table with proper columns
- Arabic text: 1/3 reference labels found (التغطيات التأمينية). Extensive reasoning about table structure in English
- English page (kfd p1): 21.4s, all values correct (3,500,000, 6,770, 5,000)
- Verdict: 35B MoE produces some Arabic numerals (2/6) but wastes most tokens on reasoning. Too large (23 GB) and slow (51s) for our use case. Worse than Qwen3.5-4B-AWQ (3/6) despite being 8× larger
Nemotron Parse v1.1 (NVIDIA) — 0/6 ARABIC NUMERALS, GARBLED ARABIC
Tested with recommended prompt `</s><s><predict_bbox><predict_classes><output_markdown>` and recommended params (temperature=0, top_k=1, repetition_penalty=1.1). Served via vLLM v0.17.0. 885M params, ~1.8 GB VRAM. Supports English, German, French, Spanish, Chinese, Japanese — but NOT Arabic. Max context 9000 tokens.
- Speed: 3.9s Arabic, 1.2s English (very fast)
- Numbers: 0/6 Arabic numerals, 0/3 Western numerals — no real numbers extracted. Outputs Persian/Urdu characters instead (چارقة, ۈۋۉۋۋ)
- Structure: LaTeX \begin{tabular} with bounding box coordinates (<x_0.5352><y_0.1797>) — correct table shape but wrong content
- Arabic text: 0/3 reference labels found. All Arabic text garbled — "تارقة الملاد عبادي" (nonsense) repeated throughout. Mixes Arabic with Persian/Urdu script characters
- English page (kfd p1): 1.2s, all values correct (3,500,000, 6,770, 5,000). Excellent LaTeX table with bounding boxes. Very fast and accurate for English
- Verdict: Outstanding English document parser — fast (1.2s), accurate, with spatial coordinates. But Arabic is not supported and output is garbled nonsense. Only viable for English documents
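Nemotron's coordinate tags can be mapped back to pixel positions with a small parser — a minimal sketch, assuming (as the sample above suggests) that the `<x_…><y_…>` values are fractions of page width/height:

```python
import re

def parse_coord_tags(s: str, page_w: int, page_h: int):
    """Convert Nemotron-style <x_0.5352><y_0.1797> tags to pixel (x, y) pairs."""
    pairs = re.findall(r"<x_([\d.]+)><y_([\d.]+)>", s)
    return [(round(float(x) * page_w), round(float(y) * page_h)) for x, y in pairs]

print(parse_coord_tags("<x_0.5352><y_0.1797>", 1000, 2000))  # → [(535, 359)]
```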
AIN-7B (MBZUAI) — 0/6 ARABIC NUMERALS, EMPTY TABLE CELLS
Tested with three prompts: English extraction, Arabic extraction (استخرج جميع النصوص من هذه الوثيقة, "Extract all the text from this document."), and an explicit markdown table prompt. Served via vLLM v0.17.0. 8B params (Qwen2-VL-7B base), ~16 GB VRAM BF16. Arabic-first bilingual model (MSA/English) trained on 3.6M Arabic-English multimodal samples.
- Speed: 3.6s–84.2s Arabic (varies wildly by prompt), 5.0–8.5s English
- Numbers: 0/6 Arabic numerals across all prompts. English extraction prompt: only 13 tokens output (document title only). Arabic prompt: 18 tokens. Table prompt: 4096 tokens but every table cell empty (| | | | |)
- Structure: Detected the table grid correctly with the markdown table prompt but filled every cell with whitespace — no content extracted at all
- Arabic text: 0/3 reference labels. First prompt returned only the page title. Arabic prompt returned even less. Table prompt returned empty cells
- English page (kfd p1): 3/3 western numerals found with English and table prompts. Good markdown table output. 5.0s, 656 chars
- Verdict: Despite being Arabic-specialized (claims to outperform GPT-4o on Arabic OCR benchmarks), AIN completely fails on Arabic table extraction. Detects table structure but cannot read any cell content. Likely trained on prose OCR, not tabular documents
Table Extraction Quality — Qwen3-VL-4B-AWQ (Secondary Model on Table Pages)
Tested Qwen3-VL-4B-AWQ on the same table-heavy pages used for DotsOCR benchmarks, using production parameters and prompt from settings.py and prompts.py:
- Params: `temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5`
- Prompt: "qwenvl markdown" (same as the production secondary/image prompt)
- Serving: vLLM v0.17.0, `cyankiwi/Qwen3-VL-4B-Instruct-AWQ-4bit`
Speed Comparison
| Page | DotsOCR v1.5 | Qwen3-VL-4B-AWQ |
|---|---|---|
| oman p21 (performance table) | 5.8s | 6.6s |
| oman p23 (health table) | 2.9s | 2.1s |
| oman p25 (security table) | 3.2s | 2.4s |
| kfd p1 (EN benefits table) | 5.5s | 2.8s |
| kfd p4 (AR benefits table) | 10.7s | 8.3s |
Output Format
- DotsOCR: structured HTML (`<table>`/`<thead>`/`<tbody>`/`<tr>`/`<td>`/`<th>`) — machine-parseable, one unified table per page
- Qwen: Markdown (`| | |`) tables — human-readable, but splits multi-indicator pages into separate small tables
English Table Quality
Oman p21 (Performance Indicators, 8 rows × 4 columns):
- Qwen produces a proper 4-column markdown table with all 8 indicators
- All indicator names correct ("Quacquarelli Symonds", "Global Innovation Index", etc.)
- Values correct (32.8, 69/127, 0.938, 43.93, etc.)
- Table is unified (all rows in one table), matching DotsOCR structure
kfd p1 (English Motor Insurance Benefits, 3 product columns):
- All 3 product columns correct (Motor Value, Motor Smart, Motor Executive)
- All AED amounts exact: 3,500,000, 6,770, 5,000, 7,500, 3,500, 20,000, 5,000, 1,000, 300, 500
- Quality matches DotsOCR v1.5 for English numbers
Oman p23/p25 (2-row indicator tables):
- Correct indicator names and values (79.03, 65.6, 94.6, 51.2)
- Tables split per indicator (separate small tables vs DotsOCR's unified table)
Arabic Table Quality (kfd p4)
- 1,327 Arabic chars extracted (comparable to DotsOCR's ~1,500)
- Markdown table structure present with proper columns
- Arabic text is reversed (LTR instead of RTL) — labels garbled but recognizable
- Numbers corrupted: ٣,٥٠٠٠٠ instead of ٣,٥٠٠,٠٠٠ and ٦,٧٧. instead of ٦,٧٧٠
- DotsOCR v1.5 has ALL Arabic numbers correct — the clear winner for Arabic
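The "Arabic numerals N/6" scores used throughout this report reduce to a verbatim substring check against the reference values. A minimal sketch of such a checker (the reference list here is an illustrative subset, not the full set of six):

```python
# Count how many reference Arabic-Indic numerals appear verbatim in OCR output.
REFERENCE_NUMERALS = ["٣,٥٠٠,٠٠٠", "٦,٧٧٠", "٥,٠٠٠"]  # illustrative subset of the 6 used

def score_arabic_numerals(ocr_text: str, refs=REFERENCE_NUMERALS) -> int:
    return sum(1 for ref in refs if ref in ocr_text)

# Qwen's corrupted output (٣,٥٠٠٠٠) does not match the reference ٣,٥٠٠,٠٠٠:
print(score_arabic_numerals("الحد الأقصى ٣,٥٠٠٠٠ درهم"))    # → 0
print(score_arabic_numerals("الحد الأقصى ٣,٥٠٠,٠٠٠ درهم"))  # → 1
```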
Verdict: Qwen is NOT a DotsOCR Replacement for Tables
| Capability | DotsOCR v1.5 | Qwen3-VL-4B-AWQ |
|---|---|---|
| English table values | All correct | All correct |
| Arabic numbers | ٣,٥٠٠,٠٠٠ (exact) | ٣,٥٠٠٠٠ (corrupted) |
| Output format | HTML <table> (structured) | Markdown pipe tables (flat) |
| Table structure | Unified, machine-parseable | Split per indicator |
| Avg speed (tables) | 5.6s | 4.4s |
Qwen matches DotsOCR on English table content but fails on Arabic numerals and produces less structured output (markdown vs HTML). DotsOCR remains the correct choice for the PRIMARY role (full pages with tables). Qwen's strength is Picture crop text extraction, not table parsing.
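The "machine-parseable" distinction is concrete: DotsOCR's HTML tables can be walked with a standard parser, while pipe-markdown needs ad-hoc splitting. A minimal sketch using Python's stdlib `html.parser` (the sample table string is illustrative):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect <table> cell text into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr":
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

p = TableRows()
p.feed("<table><tr><th>Cover</th><th>AED</th></tr>"
       "<tr><td>TPL</td><td>3,500,000</td></tr></table>")
print(p.rows)  # → [['Cover', 'AED'], ['TPL', '3,500,000']]
```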
Visual Region Extraction — dots.ocr-1.5-SVG vs Qwen3-VL-2B
In the production pipeline, dots.ocr handles layout detection + text/table extraction in one pass. Regions classified as "Picture" (diagrams, charts, infographics) are cropped and sent to the secondary VLM (Qwen) for text extraction. We tested whether dots.ocr-1.5-svg could replace Qwen for this role.
dots.ocr-1.5-SVG Model
A separate 3B model fine-tuned for converting images to SVG code. Available on ModelScope (rednote-hilab/dots.ocr-1.5-svg); removed from HuggingFace. Same VRAM as base model (5.72 GB BF16).
Official prompt format: "Please generate the SVG code based on the image." with viewBox="0 0 {width} {height}"
Hyperparameters: temperature=0.6, top_p=0.9, repetition_penalty=1.15 (recommended by DotsOCR model card). Critical for preventing SVG path degeneration — without these params, the model wastes all tokens on SVG paths and produces 0 text elements on 5/10 tests.
Test Setup: Actual Pipeline Picture Crops
Used dots.ocr v1.0 with prompt_dots to detect layout regions on test pages, then extracted the actual "Picture" crops that Qwen would receive:
- Adobe page 3: 3 Picture regions — logo (113×174), CCF diagram (1207×710), compliance infographic (722×557)
- Oman page 10: 1 Picture — donut chart (943×643)
- Oman page 16: 2 Pictures — vision partners diagram (471×432), society diagram (379×357)
Head-to-Head: Qwen vs SVG Model on Actual Picture Crops
| Crop | Qwen3-VL-2B | dots.ocr-1.5-svg | Winner |
|---|---|---|---|
| CCF diagram (1207×710) | 4.0s, 519 tok — all text (SOC 2, ISO, PCI, etc.) | 17.2s, 2701 tok — 37 texts in SVG | Qwen (4× faster, clean text) |
| Compliance infographic (722×557) | 31.9s — got keywords then degenerated | 38.4s, 6000 tok — 9 texts extracted | SVG (Qwen degenerated) |
| Donut chart (943×643) | 32.0s — got values then repeated | 38.5s, 6000 tok — 64 text snippets | SVG (Qwen degenerated) |
| Vision partners (471×432) | 0.3s, 29 tok — clean text | 16.8s, 2649 tok — 7 texts | Qwen (56× faster) |
| Society diagram (379×357) | 31.6s — degenerated | 4.7s, 747 tok — 5 texts | SVG (Qwen degenerated) |
SVG Model on Full Pages (for reference)
When given full pages instead of crops, the SVG model extracts text labels embedded in SVG <text> elements:
| Page | Time | Tokens | Finish | Texts | Result |
|---|---|---|---|---|---|
| Adobe page 3 (1653×2339) | 21.1s | 3204 | length | 1 | Minimal text — most tokens on SVG structure |
| Oman page 10 (1191×1684) | 36.3s | 5580 | length | 10 | Chart labels and values extracted |
| Oman page 16 (1191×1684) | 24.7s | 3825 | stop | 14 | Diagram text extracted, completed naturally |
| kfd page 1 English (1224×1584) | 33.4s | 5148 | stop | 55 | Table content extracted as SVG text |
| kfd page 4 Arabic (1224×1584) | 36.7s | 5652 | length | 100 | Arabic text extracted but garbled ("مottor") |
Key Findings
- SVG model works well with recommended params: with `repetition_penalty=1.15`, the model stops wasting tokens on SVG paths and produces `<text>` elements — 0/10 degenerations vs 5/10 at the default `temperature=0.0`.
- SVG model beats Qwen 2B on hard cases: for compliance infographics, donut charts, and society diagrams where Qwen 2B degenerates, the SVG model extracts text successfully (9, 64, and 5 texts respectively). However, Qwen3-VL-4B-AWQ does not degenerate on these crops and produces more complete extractions than SVG on all 5 crops (full indicator names + values vs partial text snippets).
- Qwen 4B-AWQ is faster and higher quality: 0.3–4.0s vs 4.7–38.5s, with more complete text on every crop. The SVG model consistently misses labels and truncates content.
- Arabic is broken in the SVG model: produces garbled mixed-language output ("مottor", "مسHAOYI").
- SVG format is token-expensive: even with recommended params, SVG paths consume significant tokens. The 8192-token context limits how much text can be extracted from complex pages.
- SVG model requires post-processing: text must be extracted from SVG `<text>` elements, adding pipeline complexity vs Qwen's clean JSON output.
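The post-processing step for SVG output is small but real — a minimal sketch of pulling labels out of the model's SVG with stdlib `xml.etree` (element names per standard SVG; the sample string is illustrative):

```python
import xml.etree.ElementTree as ET

def svg_text_contents(svg: str):
    """Pull the text labels out of SVG <text> elements, ignoring paths/shapes."""
    root = ET.fromstring(svg)
    # Model output may or may not carry the SVG namespace; strip it if present.
    return [(el.text or "").strip()
            for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "text"]

svg = '<svg viewBox="0 0 100 100"><path d="M0 0"/><text x="5" y="10">SOC 2</text></svg>'
print(svg_text_contents(svg))  # → ['SOC 2']
```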
Qwen3-VL Model Size Comparison
Tested all available Qwen3-VL variants as potential upgrades to the current Qwen3-VL-2B-FP8 secondary model. Each variant was tested on 5 full pages + 5 actual Picture crops from dots.ocr v1.0 layout detection (10 tests total).
Hyperparameters: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5 (recommended by Qwen model cards for VL tasks).
Models Tested
| Model | Quantization | Weights (disk) | VRAM (weights) |
|---|---|---|---|
| Qwen3-VL-2B-Instruct-FP8 (current) | FP8 | 3.3 GB | ~3 GB |
| Qwen3-VL-4B-Instruct-AWQ-4bit | INT4 (compressed-tensors) | 4.2 GB | ~4 GB |
| Qwen3-VL-4B-Instruct | BF16 | 8.3 GB | ~8 GB |
| Qwen3-VL-8B-Instruct-AWQ-4bit | INT4 (compressed-tensors) | 7.1 GB | ~7 GB |
| Qwen3-VL-8B-Instruct | BF16 | 17 GB | ~17 GB |
| Qwen3-VL-30B-A3B-Instruct-FP8 | FP8 (MoE, 3B active) | 31 GB | ~31 GB |
AWQ-4bit models from HuggingFace community (cyankiwi), using compressed-tensors quantization format (auto-detected by vLLM v0.11.0, no --quantization flag needed).
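Serving one of these compressed-tensors checkpoints is a plain `vllm serve` call — a sketch of the implied invocation; no `--quantization` flag is passed because vLLM auto-detects the format, and the memory/context flags here are assumptions for this model, not values from the test runs:

```shell
# Quantization format is auto-detected from the model's config.json
# (compressed-tensors), so no --quantization flag is needed.
vllm serve cyankiwi/Qwen3-VL-4B-Instruct-AWQ-4bit \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384
```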
Full Pages Results
| Page | 2B-FP8 | 4B-AWQ | 4B-BF16 | 8B-AWQ | 8B-BF16 | 30B-A3B-FP8 |
|---|---|---|---|---|---|---|
| adobe_p3 | 7.2s, 725t | 5.0s, 556t | 12.2s, 772t | 7.9s, 759t | 22.8s, 895t | 7.4s, 561t |
| oman_p10 | 2.2s, 194t | 2.1s, 207t | 3.6s, 209t | 2.9s, 241t | 6.3s, 233t | 3.1s, 213t |
| oman_p16 | 3.4s, 340t | 3.1s, 334t | 5.4s, 334t | 3.7s, 341t | 11.1s, 434t | 4.4s, 325t |
| kfd_p1 (EN) | 5.4s, 546t | 3.1s, 334t | 5.4s, 335t | 3.9s, 358t | 9.0s, 348t | 4.7s, 345t |
| kfd_p4 (AR) | DEGEN | 11.4s, 1300t | 18.3s, 1178t | 42.1s, 4096t† | 23.6s, 935t | DEGEN |
Picture Crops Results
| Crop | 2B-FP8 | 4B-AWQ | 4B-BF16 | 8B-AWQ | 8B-BF16 | 30B-A3B-FP8 |
|---|---|---|---|---|---|---|
| CCF diagram | 3.1s, 322t | 2.6s, 307t | 4.8s, 310t | 3.2s, 319t | 13.1s, 528t | 4.2s, 326t |
| Compliance | 1.3s, 132t | 0.4s, 44t | 0.7s, 44t | 0.5s, 48t | 1.1s, 44t | 0.7s, 48t |
| Donut chart | 1.7s, 175t | 1.4s, 167t | 2.6s, 167t | 2.0s, 202t | 4.2s, 169t | 2.6s, 200t |
| Vision partners | 0.6s, 58t | 0.3s, 25t | 0.4s, 25t | 0.3s, 29t | 0.7s, 25t | 0.5s, 29t |
| Society diagram | DEGEN | 0.9s, 103t | 0.4s, 25t | 0.4s, 34t | 0.5s, 19t | 0.5s, 36t |
†8B-AWQ tested at temperature=0.0 (no recommended-params re-run; already stable).
Summary
| Model | Avg Time (ok) | Degenerations | Arabic (kfd_p4) | Weights |
|---|---|---|---|---|
| 2B-FP8 (current) | 3.1s | 2/10 | DEGEN | 3.3 GB |
| 4B-AWQ-4bit | 3.0s | 0/10 | ok (1300 tok) | 4.2 GB |
| 4B-BF16 | 5.4s | 0/10 | ok (1178 tok) | 8.3 GB |
| 8B-AWQ-4bit | 6.7s | 0/10 | ok (4096 tok) | 7.1 GB |
| 8B-BF16 | 9.2s | 0/10 | ok (935 tok, best) | 17 GB |
| 30B-A3B-FP8 | 3.1s | 1/10 | DEGEN | 31 GB |
Key Findings
- 4B-AWQ is the best upgrade — fastest average (3.0s), zero degenerations, handles Arabic cleanly (1300 tok), only ~1 GB more than current 2B-FP8
- 4B-BF16 also handles Arabic — zero degenerations with recommended params, but 1.8× slower and 2× heavier than AWQ
- 8B-BF16 has the cleanest Arabic output (935 tok) but 3× slower and 4× larger than 4B-AWQ
- Current 2B-FP8 is the weakest — degenerates on Arabic (kfd_p4) and the society diagram with recommended params, and also on the compliance crop with prod params (`temp=0.1`, 32.7s wasted). 4B-AWQ is stable with both prod and recommended params
- 30B-A3B MoE doesn't justify its 31 GB — still degenerates on Arabic, same speed as 4B-AWQ
- AWQ (INT4) models are faster than BF16 counterparts with similar output quality
Qwen3.5 Model Comparison (Native Multimodal)
Qwen3.5 (released Mar 2, 2026) is Alibaba's latest model family with native vision capabilities built-in via early fusion training. Unlike Qwen3-VL (dedicated vision-language models), Qwen3.5 models are natively multimodal — no separate "-VL" variant needed.
Important: Qwen3.5 requires vLLM nightly (not in stable v0.11.0) and has a "thinking mode" enabled by default. All tests below were run with enable_thinking: false via chat_template_kwargs.
Hyperparameters: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5 (recommended by Qwen model cards). BF16 models (4B, 9B) were tested at temperature=0.0 only.
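Disabling thinking mode goes through the OpenAI-compatible request body — a sketch of the payload used for these runs, with the sampling values quoted above; the model id and image URL are placeholders:

```python
# Request body for a Qwen3.5 run with thinking mode disabled.
# Model id and image URL are placeholders; sampling params are the
# recommended values from the Qwen model cards.
payload = {
    "model": "Qwen/Qwen3.5-4B",  # placeholder model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///tmp/page.png"}},
            {"type": "text", "text": "qwenvl markdown"},
        ],
    }],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "presence_penalty": 1.5,
    "chat_template_kwargs": {"enable_thinking": False},
}
print(payload["chat_template_kwargs"])  # → {'enable_thinking': False}
```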
Models Tested
| Model | Quantization | Weights (disk) | VRAM (total) |
|---|---|---|---|
| Qwen3.5-2B | BF16 | 4.3 GB | ~7 GB |
| Qwen3.5-4B | BF16 | 8.8 GB | ~41 GB |
| Qwen3.5-4B-AWQ-4bit | INT4 (compressed-tensors) | 3.8 GB | ~41 GB |
| Qwen3.5-9B | BF16 | 19 GB | ~41 GB |
| Qwen3.5-9B-AWQ-4bit | INT4 (compressed-tensors) | 8.5 GB | ~41 GB |
| Qwen3.5-35B-A3B-GPTQ-Int4 | GPTQ INT4 (MoE, 3B active) | 23 GB | ~42 GB |
| Qwen3.5-35B-A3B | BF16 (MoE, 3B active) | 67 GB | OOM (>46 GB) |
AWQ-4bit models from HuggingFace community (cyankiwi). GPTQ-Int4 from official Qwen repo.
Full Pages Results
| Page | 3.5-2B | 3.5-4B | 3.5-4B-AWQ | 3.5-9B | 3.5-9B-AWQ | 3.5-35B-A3B-GPTQ |
|---|---|---|---|---|---|---|
| adobe_p3 | 7.6s, 1207t | 10.2s, 771t | 5.5s, 869t | 12.7s, 535t | 5.7s, 544t | 4.8s, 540t |
| oman_p10 | 3.3s, 453t | 4.9s, 335t | 2.9s, 384t | 6.1s, 238t | 2.9s, 232t | 2.5s, 229t |
| oman_p16 | 25.6s, 4096t | 6.7s, 499t | 3.2s, 474t | 8.0s, 332t | 3.7s, 341t | 3.2s, 343t |
| kfd_p1 (EN) | 4.3s, 665t | 7.3s, 545t | 6.9s, 1086t | 9.7s, 405t | 4.2s, 394t | 3.3s, 352t |
| kfd_p4 (AR) | 6.2s, 990t | DEGEN† | 8.9s, 1422t | DEGEN† | 6.6s, 1061t | 9.0s, 1118t |
Picture Crops Results
| Crop | 3.5-2B | 3.5-4B | 3.5-4B-AWQ | 3.5-9B | 3.5-9B-AWQ | 3.5-35B-A3B-GPTQ |
|---|---|---|---|---|---|---|
| CCF diagram | 3.2s, 500t | 8.0s, 622t | 3.6s, 582t | 9.9s, 427t | 3.1s, 306t | 2.9s, 323t |
| Compliance | 1.5s, 232t | 2.2s, 167t | 1.5s, 237t | 1.9s, 81t | 0.6s, 51t | 0.5s, 52t |
| Donut chart | 1.5s, 222t | 4.0s, 310t | 2.6s, 412t | 8.2s, 355t | 2.1s, 205t | 1.7s, 186t |
| Vision partners | 1.0s, 159t | 1.9s, 146t | 1.2s, 184t | 1.7s, 73t | 0.4s, 35t | 0.3s, 33t |
| Society diagram | 0.7s, 101t | 1.9s, 144t | 1.0s, 152t | 1.8s, 77t | 0.5s, 46t | 0.4s, 35t |
†BF16 models (3.5-4B, 3.5-9B) tested at temperature=0.0 only — not re-tested with recommended params.
Summary
| Model | Avg Time (ok) | Degenerations | Arabic (kfd_p4) | Weights | vLLM |
|---|---|---|---|---|---|
| 3.5-2B | 5.0s | 0/10 | ok (990 tok) | 4.3 GB | nightly |
| 3.5-4B† | 5.3s | 1/10 | DEGEN | 8.8 GB | nightly |
| 3.5-4B-AWQ | 3.5s | 0/10 | ok (1422 tok) | 3.8 GB | nightly |
| 3.5-9B† | 6.7s | 1/10 | DEGEN | 19 GB | nightly |
| 3.5-9B-AWQ | 2.7s | 0/10 | ok (1061 tok) | 8.5 GB | nightly |
| 3.5-35B-A3B-GPTQ | 2.2s | 0/10 | ok (1118 tok) | 23 GB | nightly |
| 3.5-35B-A3B (BF16) | — | — | — | 67 GB | OOM |
Key Findings
- Most Qwen3.5 variants handle Arabic with recommended hyperparameters — 2B, 4B-AWQ, 9B-AWQ, and 35B-A3B-GPTQ all produce clean Arabic output
- BF16 models (4B, 9B) still degenerate — only tested at `temperature=0.0`; likely fixable with recommended params but not re-tested
- Qwen3.5 is more verbose — outputs richer descriptions with visual element analysis, using more tokens
- All Qwen3.5 models require vLLM nightly — the `qwen3_5` architecture is not supported in stable v0.11.0
- MoE 35B-A3B-GPTQ is fast (2.2s) but uses 42 GB VRAM
- 35B-A3B BF16 (67 GB) doesn't fit on our 46 GB L40S — OOM
- Qwen3-VL-4B-AWQ (3.0s, stable vLLM) still beats Qwen3.5-4B-AWQ (3.5s, nightly) for our use case
Qwen3-VL vs Qwen3.5 — Best Variants Head-to-Head
| | Qwen3-VL-4B-AWQ | Qwen3.5-4B-AWQ |
|---|---|---|
| Avg time (ok) | 3.0s | 3.5s |
| Degenerations | 0/10 | 0/10 |
| Arabic | ok | ok |
| Weights | 4.2 GB | 3.8 GB |
| vLLM version | v0.11.0 (stable) | nightly only |
| Output style | Concise JSON | Richer JSON with descriptions |
Winner: Qwen3-VL-4B-AWQ — faster, works on stable vLLM, production-ready.
Nanonets-OCR2-3B & LightOnOCR-2-1B (Oct 2025 OCR Models)
Tested two dedicated OCR models released in October 2025 as potential DotsOCR replacements. Both served via vLLM v0.17.0 on L40S. Tested on 9 full pages + 5 Picture crops, using each model's recommended hyperparameters.
Models Tested
| Model | Architecture | Params | VRAM (weights) | Hyperparams | DPI |
|---|---|---|---|---|---|
| Nanonets-OCR2-3B | Qwen2.5-VL fine-tune | 3B | ~8 GB (BF16) | temperature=0.0 (model card) | 150 |
| LightOnOCR-2-1B | Custom (RLVR-trained) | 1B | ~2 GB (BF16) | temperature=0.2, top_p=0.9 (model card) | 200 (model card) |
Full Pages — Speed Comparison
| Page | DotsOCR v1.5 | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| oman p7 | 0.7s | 3.8s | 2.9s |
| oman p10 | 1.5s | 3.0s | 7.4s |
| oman p16 | 3.2s | 3.6s | 1.3s |
| oman p21 (table) | 5.8s | 7.2s | 3.0s |
| oman p23 (table) | 2.9s | 4.2s | 1.5s |
| oman p25 (table) | 3.2s | 5.1s | 1.7s |
| kfd p1 (EN table) | 5.5s | 10.3s | 3.2s |
| kfd p4 (AR table) | 10.7s | 18.4s | 6.6s |
| adobe p3 | — | 6.4s | 2.7s |
| Avg | 2.9s | 6.9s | 3.4s |
Full Pages — Table Quality
Both models produce well-structured HTML tables with <table>/<thead>/<tbody> tags.
| | DotsOCR v1.5 | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| AED 3,500,000 (EN) | Correct | Correct | Correct |
| Oman p21 values | All correct | All correct | All correct + full indicator names |
| "Quacquarelli Symonds" | Correct | Correct | Correct |
Full Pages — Arabic (kfd p4)
| | DotsOCR v1.5 | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| Arabic chars | ~1500 | 1451 | 1266 |
| AED numerals | ٣,٥٠٠,٠٠٠ (correct) | 3,000,000 (westernized) | ٣٠,٥٠,٠٠٠ (garbled) |
| Arabic text quality | Best | Good | Good |
Picture Crops — vs Qwen3-VL-4B-AWQ
Neither OCR model is suitable for the secondary VLM role (Picture crop text extraction).
| Crop | Qwen3-VL-4B-AWQ | Nanonets-OCR2 | LightOnOCR-2 |
|---|---|---|---|
| CCF diagram | 2.6s, text | 4.1s, text | 1.2s, text |
| Compliance infographic | 0.4s, text | 1.8s, image description | 2.5s, CSS styling |
| Donut chart | 1.4s, values | 5.2s, image description | 0.9s, labels only |
| Vision partners | 0.3s, text | 0.6s, text | 0.2s, text |
| Society diagram | 0.9s, text | 1.0s, image description | 0.1s, garbled |
- Nanonets describes images instead of extracting text on 3/5 crops ("A blue background graphic with...")
- LightOnOCR garbles small crops (society diagram: "dual / tion / pility") and outputs CSS/HTML styling on infographics
- Qwen 4B-AWQ extracts actual text on all 5 crops — indicator names, values, labels
Key Findings
- LightOnOCR-2 (1B) matches DotsOCR v1.5 speed on full pages (3.4s vs 2.9s avg) at 1/3 the params and 1/3 the VRAM — impressive efficiency, 280 tok/s
- Nanonets-OCR2 (3B) is 2.4× slower than DotsOCR v1.5 despite same parameter count — no speed advantage
- Neither matches DotsOCR on Arabic numerals — DotsOCR v1.5 is the only model that correctly extracts ٣,٥٠٠,٠٠٠
- Neither works for Picture crops — these are document-page OCR models, not general VLMs. Qwen 4B-AWQ remains the clear winner for the secondary role
- LightOnOCR outputs Markdown only — the output format is baked into the weights and text prompts are ignored, so it cannot be steered to emit JSON layout like DotsOCR
FireRed-OCR-2B (FireRedTeam) — Table Extraction Benchmark
Tested as a potential DotsOCR alternative for table extraction. Based on Qwen3-VL-2B-Instruct, fine-tuned with Format-Constrained GRPO for table structural integrity. 92.94% on OmniDocBench v1.5. Served via vLLM v0.17.0. Recommended params from vLLM inference script: temperature=0.0, max_tokens=8192. Recommended prompt: detailed Markdown conversion instructions with explicit "Convert tables to HTML format" directive.
Also tested with generation_config.json params (temperature=0.7, top_p=0.8, top_k=20) — CSS styling bloat caused p23 to hit 4096 token limit (27.1s). Recommended temperature=0.0 eliminates this issue.
English Table — kfd.pdf p1 (3-column bordered table)
| Metric | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| Speed | 5.5s | 9.4s |
| Format | HTML <table> | HTML <table> with CSS styling (background-color, colspan) |
| AED 3,500,000 | Correct | Correct |
| AED 6,770 | Correct | Correct |
| AED 5,000 / 7,500 | Correct | Correct |
| AED 3,500 / 6,000 | Correct | Correct |
| AED 20,000 | Correct | Correct |
| Column headers | Correct | Correct, with header row styling |
| Section headers | Plain text | colspan="4" with bold styling — preserves "Main Covers" / "Enhanced Motor Protection" grouping |
kfd p1 verdict: FireRed-OCR matches DotsOCR on accuracy and produces richer HTML — CSS color styling, colspan section headers, proper <thead>/<tbody>. Slightly slower (9.4s vs 5.5s).
English Table — Oman p21 (8-row styled table, no grid lines)
| Metric | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| Speed | 5.8s | 6.9s |
| Format | HTML <table> with <thead>/<tbody> | HTML <table> with <thead>/<tbody> |
| "Global Innovation Index" 32.8 | Correct | Correct |
| "Education for All" 0.938 | Correct | Correct |
| "Skills" 71.6, Rank 36/140 | Correct | Correct |
| "Global Talent" 43.93 | Correct | Correct |
| "Quacquarelli Symonds" | Correct | Correct |
| "Omani" | Correct | Correct |
| 2030/2040 targets | All correct | All correct |
| Table structure | Good | Excellent — colspan, multiline <br> values |
Oman p21 verdict: Both models produce correct HTML tables. FireRed-OCR is only 1.2× slower and produces clean, well-structured output with proper "Omani" spelling and all values correct.
English Tables — Oman p23 & p25
| Page | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| p23 format | HTML table | HTML table with CSS |
| p23 values | All correct | All correct |
| p23 speed | 2.9s | 3.9s |
| p25 format | HTML table | HTML table with <thead>/<tbody> |
| p25 values | All correct | All correct |
| p25 speed | 3.2s | 3.5s |
p23/p25 verdict: FireRed-OCR produces HTML tables on ALL styled pages (unlike InternVL3.5-4B which fell back to bullet lists). Very close speed. All values correct.
Arabic Table — kfd.pdf p4
| Metric | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| Speed | 10.7s | 13.4s |
| Arabic numerals (٣,٥٠٠,٠٠٠ etc.) | All 6 correct | Zero correct — numbers garbled ("3,0,...,") |
| Arabic labels (التغطيات التأمينية) | All correct | Zero matched — text reversed/garbled |
| Arabic chars | ~1500 | 1225 |
| Table structure | Proper HTML | HTML table structure present but text scrambled |
Arabic verdict: Arabic text is garbled — characters appear reversed and mixed with wrong-script characters. Not hallucinated (unlike InternVL) but not readable either. Numbers rendered as "3,0,...," placeholders. DotsOCR v1.5 remains the only model with correct Arabic numeral extraction.
Summary — FireRed-OCR-2B vs DotsOCR on Tables
| Table Type | DotsOCR v1.5 | FireRed-OCR-2B |
|---|---|---|
| English bordered (kfd p1) | Correct HTML | Correct HTML with CSS styling |
| English styled (oman p21) | Correct HTML | Correct HTML — all values match |
| English styled (oman p23) | Correct HTML | Correct HTML |
| English styled (oman p25) | Correct HTML | Correct HTML |
| Arabic (kfd p4) | All correct | Garbled — zero correct numerals or labels |
| Avg speed (table pages) | 5.6s | 7.4s (1.3× slower) |
| VRAM | 5.7 GB (BF16) | ~5 GB (BF16) |
Verdict: FireRed-OCR-2B is the strongest English table extractor tested — it produces HTML tables on all 5 table pages (the only model besides DotsOCR to achieve this), with CSS styling, proper <thead>/<tbody>, colspan, and 100% value accuracy. Only 1.3× slower than DotsOCR and uses less VRAM. However, Arabic is garbled (like most non-DotsOCR models). For English-only table extraction, FireRed-OCR-2B is the closest competitor to DotsOCR v1.5.
InternVL3.5-4B (OpenGVLab) — Table Extraction Benchmark
Tested as a potential DotsOCR alternative for table extraction. Served via vLLM v0.17.0 with --trust-remote-code --max-model-len 16384. Tested with both default (temperature=0.1, top_p=0.9) and recommended (temperature=0.0, top_p=0.95) params — results identical. Prompt: "Read all the text in the image. Return tables in HTML format."
English Table — kfd.pdf p1 (3-column bordered table)
| Metric | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| Speed | 5.5s | 6.4s |
| Format | HTML <table> | Markdown table |
| AED 3,500,000 | Correct | Correct |
| AED 6,770 | Correct | Correct |
| AED 5,000 | Correct | Correct |
| AED 3,500 | Correct | Correct |
| AED 20,000 | Correct | Correct |
| AED 7,500 | Correct | Correct |
| Column headers | Motor Value / Smart / Executive | Correct |
| Row labels | All correct | All correct |
kfd p1 verdict: InternVL3.5-4B matches DotsOCR on English bordered table accuracy — all AED amounts correct, all row/column labels correct. Output is markdown instead of HTML, and slightly slower.
English Table — Oman p21 (8-row styled table, no grid lines)
| Metric | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| Speed | 5.8s | 12.3s |
| Format | HTML <table> with <thead>/<tbody> | HTML <table> (with full <!DOCTYPE> boilerplate) |
| "Global Innovation Index" 32.8 | Correct | Correct |
| "Education for All" 0.938 | Correct | Correct |
| "Skills" 71.6 | Correct | Correct |
| "Global Talent" 43.93 | Correct | Correct |
| "Quacquarelli Symonds" | Correct | Correct (but "Omni" not "Omani") |
| 2030/2040 targets | All correct | All correct |
Oman p21 verdict: Values correct but 2.1× slower, outputs full HTML document boilerplate wasting ~200 tokens, minor OCR error ("Omni" instead of "Omani"), no <thead>/<tbody> structure.
English Tables — Oman p23 & p25 (smaller styled tables)
| Page | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| p23 format | HTML table | Markdown bullet lists (no table) |
| p23 values | All correct | Values correct, no tabular structure |
| p23 speed | 2.9s | 3.7s |
| p25 format | HTML table | Markdown bullet lists (no table) |
| p25 values | All correct | Values correct, no tabular structure |
| p25 speed | 3.2s | 5.1s |
p23/p25 verdict: InternVL3.5-4B extracted correct values but failed to produce table format on styled (non-bordered) tables — output was nested markdown lists. DotsOCR produced proper HTML tables.
Arabic Table — kfd.pdf p4
| Metric | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| Speed | 10.7s | 54.9s (hit 4096 token limit) |
| Arabic numerals (٣,٥٠٠,٠٠٠ etc.) | All 6 correct | Zero correct |
| Arabic labels (التغطيات التأمينية) | All correct | Zero matched |
| Content fidelity | Actual document content | Hallucinated — "البيانات" repeated in every cell, "الجداول المUNITED" (mixed Arabic/English) |
Arabic verdict: Complete failure. InternVL3.5-4B hallucinated a generic template table instead of reading the actual document. Zero real content extracted.
Summary — InternVL3.5-4B vs DotsOCR on Tables
| Table Type | DotsOCR v1.5 | InternVL3.5-4B |
|---|---|---|
| English bordered (kfd p1) | Correct HTML | Correct markdown |
| English styled (oman p21) | Correct HTML | Correct HTML (2.1× slower, boilerplate) |
| English styled (oman p23/25) | Correct HTML | Values ok, no table structure |
| Arabic (kfd p4) | All correct | Hallucinated — zero real content |
| Avg speed (table pages) | 5.6s | 16.5s (3× slower) |
| VRAM | 5.7 GB (BF16) | 9.5 GB (BF16) |
Verdict: InternVL3.5-4B is competitive on simple English bordered tables (kfd p1) but falls behind DotsOCR on styled tables (loses table structure on p23/p25), is 3× slower overall, uses 1.7× more VRAM, and completely fails on Arabic tables. Not a viable DotsOCR replacement.
GLM-OCR (Zhipu/zai-org) — Table Extraction Benchmark
Tested as a potential DotsOCR alternative for table extraction. #1 on OmniDocBench v1.5 (94.62), 0.9B params, MIT license. Served via vLLM v0.17.0 in a separate venv: GLM-OCR requires transformers 5.x, which conflicts with vLLM's transformers<5 constraint, so transformers 5.3.0 was force-installed with --no-deps. Recommended prompt from the model card: "Table Recognition:" (the model supports only 3 fixed prompts: Text/Table/Formula Recognition). Params: temperature=0.0, max_tokens=8192 (16384 context window).
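A sketch of the separate-venv setup described above — the venv name is arbitrary, and pinning vLLM to the exact tested build is left out:

```shell
# GLM-OCR needs transformers 5.x, but vLLM pins transformers<5.
# Install vLLM first, then force transformers past the pin with --no-deps
# so vLLM's other dependencies are left untouched.
python -m venv glm-ocr-venv
source glm-ocr-venv/bin/activate
pip install vllm
pip install --no-deps transformers==5.3.0
```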
English Table — kfd.pdf p1 (3-column bordered table)
| Metric | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| Speed | 5.5s | 1.8s |
| Tokens | ~700 | 705 |
| Format | HTML <table> | HTML <table class="table table-bordered"> with <thead>/<tbody> |
| AED 3,500,000 | Correct | Correct |
| AED 6,770 | Correct | Correct |
| AED 5,000 / 7,500 | Correct | Correct |
| AED 3,500 / 6,000 | Correct | Correct |
| AED 20,000 | Correct | Correct |
| Column headers | Correct | Correct (Motor Value, Motor Smart, Motor Executive) |
| Section headers | Plain text | colspan="4" — preserves "Main Covers" / "Enhanced Motor Protection" grouping |
kfd p1 verdict: GLM-OCR produces perfect HTML — all 6 AED amounts correct, proper <thead>/<tbody>, colspan section headers, Bootstrap-style class names. 3× faster than DotsOCR v1.5.
English Table — Oman p21 (8-row styled table, no grid lines)
| Metric | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| Speed | 5.8s | 2.2s |
| Tokens | ~700 | 696 |
| Format | HTML <table> with <thead>/<tbody> | HTML <table> with <thead>/<tbody> |
| "Global Innovation Index" 32.8 | Correct | Correct |
| "Education for All" 0.938 | Correct | Correct |
| "Skills" 71.6, Rank 36/140 | Correct | Correct |
| "Global Talent" 43.93 | Correct | Correct |
| "Quacquarelli Symonds" | Correct | Correct |
| "Omani" | Correct | Correct |
| 2030/2040 targets | All correct | All correct |
Oman p21 verdict: Perfect match — all 8 indicator names, all values, all targets correct. 2.6× faster than DotsOCR.
English Tables — Oman p23 & p25
| Page | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| p23 format | HTML table | HTML table with <thead>/<tbody> |
| p23 values | All correct | All correct |
| p23 speed | 2.9s | 0.9s |
| p23 tokens | — | 216 |
| p25 format | HTML table | HTML table with <thead>/<tbody> |
| p25 values | All correct | All correct |
| p25 speed | 3.2s | 1.0s |
| p25 tokens | — | 268 |
p23/p25 verdict: GLM-OCR produces correct HTML tables on all styled pages. 3× faster than DotsOCR on these pages.
Arabic Table — kfd.pdf p4
| Metric | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| Speed | 10.7s | 13.0s (hit 4096 token limit) |
| Tokens | ~1500 | 4096 (max) |
| Arabic numerals (٣,٥٠٠,٠٠٠ etc.) | All 6 correct | Zero correct |
| Arabic labels (التغطيات التأمينية) | All correct | Zero matched |
| Content | Actual document | Hallucinated — HTML table structure with "النشاطات عالية" repeated in every cell, zero real content |
Arabic verdict: Complete hallucination. GLM-OCR produced an HTML table structure but filled every cell with generic Arabic words ("النشاطات عالية" = "high activities") instead of actual document content. Zero real numerals or labels extracted. With the non-recommended generic prompt, it degenerated differently (single phrase repeated to 8192 tokens) — both modes produce zero useful Arabic output.
Summary — GLM-OCR vs DotsOCR on Tables
| Table Type | DotsOCR v1.5 | GLM-OCR |
|---|---|---|
| English bordered (kfd p1) | Correct HTML | Correct HTML — 3× faster |
| English styled (oman p21) | Correct HTML | Correct HTML — 2.6× faster |
| English styled (oman p23) | Correct HTML | Correct HTML — 3.2× faster |
| English styled (oman p25) | Correct HTML | Correct HTML — 3.2× faster |
| Arabic (kfd p4) | All correct | Degenerated — zero content |
| Avg speed (EN tables) | 4.4s | 1.5s (2.9× faster) |
| VRAM | 5.7 GB (BF16) | ~2 GB (BF16) |
| Params | 3B | 0.9B |
Verdict: GLM-OCR is the fastest and most efficient English table extractor tested — 0.9B params, ~2 GB VRAM, sub-2s on most pages, perfect HTML output with Bootstrap-style class names and proper <thead>/<tbody>. However, it requires a separate venv (transformers 5.x) and completely degenerates on Arabic. For English-only table extraction, GLM-OCR is the strongest candidate. DotsOCR v1.5 remains irreplaceable for Arabic.
Stability Issues
| Model | Issue | Severity |
|---|---|---|
| DotsOCR v1.0 | None observed | Stable |
| DotsOCR v1.5 | Minor: Arabic crop (top half) garbled one number; full page correct | Low |
| DotsOCR v1.5-SVG | With recommended params (repetition_penalty=1.15): zero degenerations (0/10). Without: 5/10 degenerate on SVG paths. Arabic garbled | Stable (with recommended params) |
| DeepSeek-OCR-2 | None observed via vLLM | Stable |
| Granite-Docling | With repetition_penalty=1.1: zero degenerations (0/10). Without: critical degeneration loops (451s, 7668 tokens) | Stable (with repetition_penalty) |
| Granite-Docling | vLLM outputs <loc_x> format, not proper DocTags — Docling converter cannot parse | Limitation |
| PaddleOCR-VL-1.5 | OCR mode degenerates on table/diagram pages ("Direct" or "نعم" repeated to 4096 tokens) | High |
| PaddleOCR-VL-1.5 | Arabic full crop degenerated to "0.000..." repeated to 4096 tokens | High |
| PaddleOCR-VL-1.5 | Table Recognition mode stable on English; mixed on Arabic | Moderate |
| Qwen3-VL-2B-FP8 | Degenerates on Arabic (kfd_p4) and society_diagram crop even with recommended params. Also degenerates on compliance crop with prod params (temp=0.1) — 32.7s, 3500 tokens of repeated newlines. E2E pipeline: 9/10 degenerations on adobe-6-page.pdf (avg 153s vs 30s normal) | High |
| Qwen3-VL-4B-AWQ-4bit | E2E pipeline: 1/20 degenerations on adobe-6-page.pdf (avg 26s). Single degeneration was newline loop on one image crop, caught by token limit and handled by Paddle fallback | Stable |
| Qwen3-VL-4B-BF16 | Zero degenerations with recommended params; Arabic ok (1178 tok) | Stable |
| Qwen3-VL-8B-AWQ-4bit | Zero degenerations; Arabic hit token limit but clean | Stable |
| Qwen3-VL-8B-BF16 | Zero degenerations; best Arabic output (935 tok, clean) | Stable |
| Qwen3-VL-30B-A3B-FP8 | Degenerates on Arabic (kfd_p4) even with recommended params | High |
| Qwen3.5-2B | Zero degenerations with recommended params; Arabic ok (990 tok) | Stable |
| Qwen3.5-4B | Degenerates on Arabic (kfd_p4); tested at temp=0.0 only | Moderate |
| Qwen3.5-4B-AWQ-4bit | Zero degenerations across 10 tests including Arabic | Stable |
| Qwen3.5-9B | Degenerates on Arabic (kfd_p4); tested at temp=0.0 only | Moderate |
| Qwen3.5-9B-AWQ-4bit | Zero degenerations with recommended params; Arabic ok (1061 tok) | Stable |
| Qwen3.5-35B-A3B-GPTQ-Int4 | Zero degenerations with recommended params; Arabic ok (1118 tok); 42 GB VRAM | Stable |
| Qwen3.5-35B-A3B (BF16) | OOM — 67 GB doesn't fit on 46 GB L40S | N/A |
| Nanonets-OCR2-3B | Zero degenerations across 9 pages + 5 crops; stable with temperature=0.0 | Stable |
| LightOnOCR-2-1B | Zero degenerations on full pages; garbled output on small crops (society diagram) | Stable (pages), Moderate (crops) |
| QARI-OCR v0.3 | Critical degeneration on table pages: ".....,0,0,0 درهم إماراتي" repeated (Arabic), "<h1></h1>" repeated (English). Both recommended and default params. Zero actual numbers extracted | Critical |
| FireRed-OCR-2B | Zero degenerations on English pages with recommended temperature=0.0. With temperature=0.7 (generation_config.json), CSS bloat caused p23 to hit 4096 token limit (27.1s). Arabic text garbled but no degeneration loops | Stable (English, temp=0.0) |
| InternVL3.5-4B | 5/9 pages flagged low_diversity. Arabic page completely hallucinated (generic template, no real content). 5/5 crops had quality issues. Generates full HTML boilerplate wasting tokens | High |
| GLM-OCR | Zero degenerations on English pages — excellent quality. Arabic page hallucinated with recommended "Table Recognition:" prompt: "النشاطات عالية" repeated in every cell to 4096 tokens. With generic prompt: different phrase repeated to 8192 tokens. Zero real Arabic content with either prompt. Requires separate venv (transformers 5.x) | Critical (Arabic), Stable (English) |
| HunyuanOCR | Zero degenerations on both English and Arabic. Arabic text quality excellent (proper RTL, all labels correct). With recommended Chinese doc parsing prompt: 4/6 Arabic numerals correct, HTML tables on English and Arabic. Some numbers still truncated (٦,٧٧ instead of ٦,٧٧٠). Prompt language matters — Chinese prompts produce significantly better output than English | Stable |
| Chandra-OCR | Zero degenerations. Best HTML output quality. 5/6 Arabic numerals correct (second-best after DotsOCR). But extremely slow (52s) | Stable |
| OlmOCR-2-7B-FP8 | Zero degenerations. 3/6 Arabic numerals, some column values merged | Stable |
| Granite Vision 3.3 2B | Zero degenerations. Westernizes all Arabic numerals (0/6) | Stable |
| Arabic-Legal-OCR | Zero degenerations but no table structure or numbers extracted | Stable (but useless for tables) |
| DIMI-Arabic-OCR-V2 | Hit 4096 token limit, extremely slow (95s). No numerals or table structure | Low quality |
| NuMarkdown-8B-Thinking | Zero degenerations. Arabic numbers garbled (0/6) but labels correct. Chain-of-thought reasoning accurate but doesn't fix output | Stable |
| DeepSeek-VL2-Tiny | Arabic table content completely hallucinated ("الاسم" repeated). English stable with correct values | Critical (Arabic) |
| ERNIE-4.5-VL-28B-AWQ-4bit | Arabic text reversed (LTR), numbers garbled. English stable. Thinking tokens consume context | Critical (Arabic) |
| Arabic-Nougat-Large | Hallucinates entirely on non-book content. Both Arabic and English outputs are fabricated | Critical (all) |
| OCRFlux-3B | Arabic numbers trigger degeneration: ٣٠٠٠٠٠٠٠٠٠ repeated to 4096 tokens. English pages stable. Same pattern as QARI-OCR (both Qwen2.5-VL-based) | Critical (Arabic) |
| Surya | N/A (standalone toolkit). Table detection failed on both pages. Arabic uses wrong numeral system (Persian/Urdu) | Limitation |
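Most failure modes in the table above are repetition loops: one phrase or line repeated until the token budget is exhausted. A minimal sketch of how such output can be flagged before it reaches downstream parsing — the thresholds and the function itself are illustrative assumptions, not the pipeline's actual detector:

```python
from collections import Counter


def looks_degenerate(text: str, max_repeat_ratio: float = 0.5, min_length: int = 200) -> bool:
    """Heuristic degeneration check: flag output dominated by one repeated line.

    max_repeat_ratio and min_length are illustrative thresholds, not tuned values.
    Short outputs are never flagged (too little signal to judge).
    """
    if len(text) < min_length:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        # Output is only whitespace/newlines — a common loop mode (e.g. Qwen 2B).
        return True
    _, count = Counter(lines).most_common(1)[0]
    return count / len(lines) > max_repeat_ratio
```

For example, `looks_degenerate("النشاطات عالية\n" * 500)` flags the GLM-OCR Arabic failure mode, while a page of distinct lines passes.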
vLLM Compatibility
| Model | vLLM Version | Status |
|---|---|---|
| DotsOCR v1.0 | v0.8.5 | Production (current) |
| DotsOCR v1.5 | v0.11.0 (native support) | Fully integrated, no custom code needed. Broken on v0.17.0 (outputs garbage/degeneration) |
| DotsOCR v1.5-SVG | v0.11.0+ (native, needs --chat-template-content-format string) | Works, same as base |
| Qwen3-VL (all sizes) | v0.11.0+ (native support) | Production-ready |
| DeepSeek-OCR-2 | nightly only (PR #33165, Feb 2026) | Not in stable release yet |
| Granite-Docling | v0.11.0+ | Works but output quality degrades vs raw transformers |
| PaddleOCR-VL-1.5 | nightly (fails on v0.11.0 — mlp_AR module not found) | Works on nightly |
| Qwen3.5 (all sizes) | nightly only (qwen3_5 arch not in v0.11.0) | Requires chat_template_kwargs: {enable_thinking: false} |
| Nanonets-OCR2-3B | v0.17.0 (Qwen2.5-VL architecture) | Works out of the box, --limit-mm-per-prompt '{"image": 1}' |
| LightOnOCR-2-1B | v0.17.0 (v0.11.1+ for v1) | Requires --mm-processor-cache-gb 0 --no-enable-prefix-caching |
| QARI-OCR v0.3 | v0.17.0 (Qwen2-VL architecture) | Loads and serves, but degenerates on table documents. Needs --max-model-len 16384 for full pages |
| FireRed-OCR-2B | v0.17.0 (Qwen3-VL architecture) | Works out of the box. Use --max-model-len 32768 and temperature=0.0 (recommended vLLM params) |
| InternVL3.5-4B | v0.17.0 (InternVLChat architecture) | Requires --trust-remote-code. Loads and serves, but quality is poor (hallucinations, degeneration) |
| GLM-OCR | v0.17.0 (glm_ocr architecture) | Requires transformers 5.x — separate venv with transformers==5.3.0 force-installed via --no-deps. Excellent English, degenerates on Arabic |
| HunyuanOCR | nightly (hunyuan_vl architecture) | Works with --no-enable-prefix-caching --mm-processor-cache-gb 0. Stable, fast, but English output lacks table structure and Arabic numbers truncated |
| Chandra-OCR | v0.17.0 (Qwen2.5-VL architecture) | Works out of the box. Very slow (52s on Arabic) |
| OlmOCR-2-7B-FP8 | v0.17.0 (Qwen2.5-VL architecture, FP8) | Works out of the box |
| Granite Vision 3.3 2B | v0.17.0 | Works out of the box |
| Arabic-Legal-OCR | v0.17.0 (Gemma-3 architecture) | Works but poor quality on tables |
| DIMI-Arabic-OCR-V2 | v0.17.0 (Qwen2.5-VL + LoRA) | Requires --gpu-memory-utilization 0.6 for LoRA loading. Very slow |
| NuMarkdown-8B-Thinking | v0.17.0 (Qwen2.5-VL architecture) | Works with --trust-remote-code --limit-mm-per-prompt '{"image": 1}' |
| DeepSeek-VL2-Tiny | v0.17.0 | Requires --hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' --limit-mm-per-prompt '{"image": 1}', timm package. Max 4096 context. Hallucinates Arabic |
| ERNIE-4.5-VL-28B-AWQ-4bit | v0.17.0 | Community AWQ quant. Requires --trust-remote-code, decord package. ~15 GB VRAM. Arabic reversed |
| Arabic-Nougat-Large | N/A (standalone) | Not vLLM-compatible. Uses VisionEncoderDecoderModel + NougatProcessor from transformers. pip install transformers torch pillow |
| OCRFlux-3B | v0.17.0 (Qwen2.5-VL architecture) | Works with --trust-remote-code. Good English, degenerates on Arabic |
| Surya | N/A (standalone) | Not a vLLM model. Install via pip install surya-ocr. Requires transformers <5 |
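The `chat_template_kwargs` requirement for Qwen3.5 in the table above maps to an extra field in the OpenAI-compatible request body that vLLM serves. A minimal sketch of a payload builder, assuming the standard chat-completions schema (the model names, prompt, and the `startswith` routing are placeholders for illustration):

```python
def build_ocr_request(model: str, image_url: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload for a vLLM server.

    chat_template_kwargs is only needed for Qwen3.5 (disables thinking tokens);
    other models in the table should omit it.
    """
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 4096,
    }
    if model.startswith("Qwen3.5"):
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body
```

The same body shape works for all vLLM-served models in the table; only the extra key differs.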
VRAM Budget Analysis (L40S: 46 GB)
| Configuration | VRAM Used | Free |
|---|---|---|
| Current (Triton + Qwen2.5-VL-7B + DotsOCR v1.0) | ~34.7 GB | ~11.3 GB |
| Current with Qwen3-VL-2B-FP8 (production) | ~28.4 GB | ~17.6 GB |
| Upgrade to DotsOCR v1.5 + Qwen3-VL-2B | ~19.9 GB | ~26.1 GB |
| Upgrade to DotsOCR v1.5 + Qwen3-VL-4B-AWQ | ~21.0 GB | ~25.0 GB |
| DotsOCR v1.5 + SVG model (replace Qwen) | ~22.7 GB | ~23.3 GB |
| Replace DotsOCR with PaddleOCR-VL-1.5 | ~22.3 GB | ~23.7 GB |
| Replace DotsOCR with DeepSeek-OCR-2 | ~27.0 GB | ~19.0 GB |
| Replace DotsOCR with Granite-Docling | ~21.0 GB | ~25.0 GB |
| Replace DotsOCR with Nanonets-OCR2 | ~28.5 GB | ~17.5 GB |
| Replace DotsOCR with LightOnOCR-2 | ~22.5 GB | ~23.5 GB |
Other Models Researched (Not Tested)
| Model | Params | Why Not Tested |
|---|---|---|
| Qwen2.5-VL-7B | 7B (~17GB) | Already use Qwen as secondary; larger than DotsOCR |
| NVIDIA Nemotron Parse v1.1 | 885M (~1.8GB) | Now tested — 0/6 Arabic (garbled), excellent English (1.2s, 3/3). No Arabic support. See Arabic Table Extraction section |
| NVIDIA Nemotron Nano VL 8B | 8B (~16GB) | Now tested — 0/6 Arabic numerals, English-only model, hallucinates Arabic content. Good English OCR. See Arabic Table Extraction section |
| InternVL3-8B | 8B (~16-20GB) | General-purpose VLM, not document-specific. 4B variant tested — hallucinates on Arabic, degenerates on crops |
| MiniCPM-V 4.5 | 8B (~5GB AWQ) | Now tested (AWQ-4bit) — 0/6 Arabic numerals, hallucinates Chinese+Arabic mixed text. Good English OCR. Apache-2.0. See Arabic Table Extraction section |
| OCRFlux-3B (ChatDOC) | 3B | Now tested — 0/6 Arabic numerals, degenerates on Arabic (same pattern as QARI-OCR). Good English OCR. See Arabic Table Extraction section |
| GOT-OCR 2.0 | 580M (~1.2GB) | Now tested (standalone transformers) — 0/6 Arabic, degenerated. English also broken (50 chars only). Not vLLM compatible. See Arabic Table Extraction section |
| GLM-OCR (Zhipu/zai-org) | 0.9B | Now tested — see GLM-OCR benchmark section above. #1 on OmniDocBench v1.5 (94.62), MIT license. Fastest English table extractor (0.9-2.2s). Arabic degenerates. Requires transformers 5.x (separate venv) |
| HunyuanOCR (Tencent) | 1B | Now tested — see Arabic Table Extraction section above. Best Arabic text quality (proper RTL, all labels matched) but numbers truncated (last digit dropped). English fast (0.6-1.5s) but no table structure. Custom license |
| OlmOCR-2-7B-FP8 (Allen AI) | 7B | Now tested — 3/6 Arabic numerals, some column merging. See Arabic Table Extraction section |
| Chandra-OCR (ChandraAI) | 9B | Now tested — 5/6 Arabic numerals (second-best), best HTML output, but 52s (very slow). See Arabic Table Extraction section |
| Granite Vision 3.3 2B (IBM) | 2B | Now tested — westernizes all Arabic numerals (0/6). See Arabic Table Extraction section |
| Arabic-Legal-Documents-OCR | 4B | Now tested — 0/6 Arabic numerals, no table structure. See Arabic Table Extraction section |
| DIMI-Arabic-OCR-V2 | 7B | Now tested — 0/6 Arabic numerals, 95s, no table structure. See Arabic Table Extraction section |
| NuMarkdown-8B-Thinking | 8B | Now tested — 0/6 Arabic numerals (garbled), good English tables with chain-of-thought reasoning. See Arabic Table Extraction section |
| Surya (Datalab) | ~500M | Now tested — standalone OCR toolkit (not VLM). 0/6 Arabic numerals (wrong numeral system), table detection failed. GPL license. See Arabic Table Extraction section |
| LightOnOCR-1B-1025 (v1) | 1B | Strictly inferior to LightOnOCR-2-1B (v2) which was already tested. v2 scores higher on all benchmarks (83.2% vs 76.1% OlmOCR-Bench). Same Arabic limitations (garbled numerals). No reason to test |
| AtlasOCR | — | Requires Unsloth framework, not compatible with vLLM |
| Baseer (Misraj/Baseer__Nakba) | 4B (Qwen2.5-VL-3B) | Now tested — non-functional via vLLM (outputs only 3-10 tokens). Needs custom transformers pipeline. See Arabic Table Extraction section |
| DeepSeek-VL2-Tiny | 3.4B MoE | Now tested — 0/6 Arabic numerals, hallucinates Arabic table content entirely. English OK. See Arabic Table Extraction section |
| ERNIE-4.5-VL-28B-A3B (Baidu) | 28B MoE / 3B active | Now tested (AWQ-4bit) — 0/6 Arabic numerals, Arabic text reversed. English OK. ~15 GB VRAM. See Arabic Table Extraction section |
| Arabic-Nougat (MohamedRashad) | ~400M | Now tested — 0/6 Arabic numerals, hallucinates on non-book content. Not vLLM-compatible. See Arabic Table Extraction section |
| Mistral OCR | Unknown | Commercial/API model, not self-hostable via vLLM |
Conclusions
1. DotsOCR v1.5 is the clear upgrade path — and the ONLY model with correct Arabic numerals
- 1.7× faster than v1.0 on average (2.9s vs 4.9s on Oman pages)
- Same quality: Perfect HTML tables, all values correct for both English and Arabic
- Arabic numbers perfect: ALL AED amounts extracted correctly (٣,٥٠٠,٠٠٠, ٦,٧٧٠, ٥,٠٠٠) — matches v1.0. 25+ models tested, none match DotsOCR v1.5 on Arabic numerals. Ranked results:
- Chandra-OCR (9B): 5/6 — second-best, ٢٠,٠٠٠→٢,٠٠٠. Best HTML output but 52s (very slow), 16 GB VRAM
- Qwen3.5-4B-AWQ (generic prompt): 5/6 — ٢٠,٠٠٠→٢,٠٠٠. But 73s and reversed text. Recommended prompt only gets 3/6
- HunyuanOCR (1B): 4/6 — truncates last digit (٦,٧٧ instead of ٦,٧٧٠). Best Arabic text quality, fast (5.1s). Custom license
- OlmOCR-2-7B-FP8: 3/6 — column values merge together
- All others 0/6: DeepSeek-VL2-Tiny hallucinates, ERNIE-4.5-VL reverses Arabic text, Arabic-Nougat hallucinates on non-book content, OCRFlux-3B degenerates (same as QARI-OCR — both Qwen2.5-VL-based), Granite Vision westernizes, Surya uses Persian numerals, NuMarkdown garbles digit order, Arabic-Legal-OCR/DIMI no numbers extracted, GLM-OCR degenerates, QARI-OCR outputs all zeros, InternVL hallucinates, FireRed garbles, AIN-7B outputs empty table cells
- Native vLLM 0.11.0+ support: No custom code, out-of-tree registration, or nightly builds needed
- Same architecture: Drop-in replacement for v1.0 (same 3B params, same serving config)
- Stability: No degeneration observed, only minor crop artifacts
2. Qwen3-VL-4B-AWQ is the best secondary VLM upgrade
- Tested 12 Qwen variants across Qwen3-VL (6 models) and Qwen3.5 (7 models, 1 OOM)
- Qwen3-VL-4B-AWQ-4bit wins: fastest (3.0s avg), zero degenerations, handles Arabic (1300 tok), works on stable vLLM v0.11.0
- Qwen3.5-4B-AWQ is comparable but slower (3.5s) and requires vLLM nightly
- MoE models (30B/35B) don't justify their VRAM — 2B-FP8 and 30B still degenerate on Arabic
- 8B-BF16 has the cleanest Arabic output (935 tok) but is 3× slower and 4× larger
3. dots.ocr-1.5-svg does NOT replace Qwen
- With recommended params (`repetition_penalty=1.15`), the SVG model is stable (0/10 degenerations) and extracts text from diagrams where Qwen 2B degenerates
- However, Qwen3-VL-4B-AWQ does not degenerate on these same crops and produces more complete extractions on all 5 Picture crops (full indicator names + target values vs partial text snippets)
- SVG is slower (4.7–38.5s vs 0.3–4.0s), heavier (5.72 GB vs ~4 GB), Arabic broken (garbled output), and requires SVG post-processing
- Conclusion: Upgrade to Qwen3-VL-4B-AWQ-4bit for the secondary VLM role — no need for the SVG model
4. PaddleOCR-VL-1.5 remains strongest for English-only speed
- 7-17× faster than DotsOCR v1.0 on English tables (0.3-1.0s vs 5-7s)
- 1/8th the VRAM (1.8 GB vs 14.2 GB)
- Excellent English table quality — matches DotsOCR accuracy
- Weakness: Arabic numeral accuracy is poor (numbers missing or garbled)
- Weakness: OCR mode degenerates on complex pages; must use task-specific prompts
- Weakness: Requires vLLM nightly (not in stable release)
5. DeepSeek-OCR-2 is viable for bordered tables only
- Works well on cropped images with visible grid lines (kfd.pdf tables)
- Fails on styled/colored table layouts (Oman performance tables)
- 6.46 GB VRAM (saves 7.7 GB vs DotsOCR v1.0)
- Requires vLLM nightly
- Arabic numeral accuracy is poor
6. Granite-Docling is not ready for this use case
- Incredible VRAM efficiency (0.52 GB) and speed (0.9s avg)
- With `repetition_penalty=1.1`, degeneration is fully resolved (0/10 degenerations, all tests complete in 0.4–2.0s)
- Raw transformers + Docling lib can produce excellent tables
- But vLLM serving outputs `<loc_x><loc_y>` coordinate format instead of proper DocTags — the `docling` library cannot convert this format to markdown/HTML
- Would require a custom parser for vLLM's output format, or raw transformers inference (27.8s, too slow)
7. Nanonets-OCR2 and LightOnOCR-2 do not replace DotsOCR or Qwen
- Tested two dedicated OCR models from Oct 2025 as potential DotsOCR replacements
- LightOnOCR-2 (1B) is impressively fast (3.4s avg, 280 tok/s) and VRAM-efficient (~2 GB), matching DotsOCR v1.5 speed on English pages — but Arabic numerals are garbled and it cannot handle Picture crops
- Nanonets-OCR2 (3B) is 2.4× slower than DotsOCR v1.5 with no quality advantage. Describes images instead of extracting text on crops
- Neither model works for the secondary VLM role — they are document-page OCR models, not general VLMs. Qwen 4B-AWQ remains the best crop text extractor
- LightOnOCR-2 could serve as an English-only fast tier alongside PaddleOCR-VL-1.5, but DotsOCR v1.5 remains the only model with correct Arabic numeral extraction
Recommendation
Immediate action: Upgrade DotsOCR v1.0 → v1.5. Same quality, 1.7× faster, native vLLM 0.11.0 support. Drop-in replacement.
Upgrade Qwen3-VL-2B-FP8 → Qwen3-VL-4B-AWQ-4bit as the secondary VLM for Picture regions. Tested 12 Qwen variants across Qwen3-VL and Qwen3.5 families — Qwen3-VL-4B-AWQ is the winner: zero degenerations (vs 2/10 for 2B-FP8), handles Arabic (1300 tok), works on stable vLLM v0.11.0, and only adds ~1 GB disk (4.2 GB vs 3.3 GB). Community AWQ model from cyankiwi/Qwen3-VL-4B-Instruct-AWQ-4bit on HuggingFace. Qwen3.5 models are not recommended — they require vLLM nightly (v0.17.0+ not yet released) and offer no advantage for our use case.
Pipeline comparison on adobe page 3 (DotsOCR v1.5 layout + 3 Picture crops):
- DotsOCR v1.5 + Qwen 2B (prod params): 41.5s — compliance crop degenerated (32.7s wasted)
- DotsOCR v1.5 + Qwen 2B (recommended params): 7.2s — stable, but requires `presence_penalty=1.5`
- DotsOCR v1.5 + Qwen 4B-AWQ (prod params): 6.3s — stable without any param changes
- DotsOCR v1.5 + Qwen 4B-AWQ (recommended params): 6.2s — stable
Degeneration stability test (adobe-6-page.pdf, via full Temporal workflow: 10 runs for the old stack, 20 for the new):
| Stack | Degenerations | Avg Time | Avg Content |
|---|---|---|---|
| Old (v1.0 + Qwen 2B, prod params) | 9/10 (90%) | 153s | ~17,400c |
| New (v1.5 + Qwen 4B-AWQ, recommended params) | 1/20 (5%) | 26s | ~17,580c |
Old stack degenerates on almost every run — Qwen 2B produces 4096 tokens of repeated newlines on the compliance crop, adding ~140s per document. New stack had 1 degeneration across 20 runs, caught by token limit and handled by Paddle fallback with minimal content loss.
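The fallback path described above (the token cap bounds a runaway generation, then a secondary engine re-processes the crop) can be sketched as follows — the callables, the loop heuristic, and the limit value are hypothetical placeholders, not the workflow's actual interface:

```python
MAX_TOKENS = 4096  # generation cap; a degeneration loop fills this entire budget


def extract_with_fallback(primary_ocr, fallback_ocr, crop) -> str:
    """Run the primary VLM; if output hits the token cap or looks like a
    repetition loop, discard it and use the fallback engine instead.

    primary_ocr returns (text, tokens_generated); fallback_ocr returns text.
    Both are injected callables (hypothetical interface).
    """
    text, n_tokens = primary_ocr(crop)
    hit_cap = n_tokens >= MAX_TOKENS
    # Crude loop check: long output made of at most two distinct words.
    looped = len(set(text.split())) <= 2 and len(text) > 500
    if hit_cap or looped:
        return fallback_ocr(crop)
    return text
```

This keeps the cost of a degeneration bounded at one capped generation plus one fallback pass, matching the "minimal content loss" behaviour observed in the stability test.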
Full PDF end-to-end comparison (Temporal workflow, all PDFs):
| PDF | Pages | Main (v1.0+2B) | New (v1.5+4B) | Speedup | Content |
|---|---|---|---|---|---|
| adobe-6-page.pdf | 6 | 64s / 9,950c | 27s / 17,627c | 2.4× | +77% |
| kfd.pdf (EN/AR tables) | 6 | 42s / 8,448c | 21s / 10,731c | 2.0× | +27% |
| tickets.pdf (AR receipts) | 6 | 45s / 777c | 21s / 4,686c | 2.1× | +503% |
| oman-2040-en-min.pdf | 17 | 88s / 12,593c | 67s / 19,373c | 1.3× | +54% |
| ar-novel.pdf (Arabic) | 28 | 191s / 42,245c | 203s / 45,069c | 0.9× | +7% |
| Total | 63 | 430s / 74,013c | 339s / 97,486c | 1.3× | +32% |
Accuracy analysis per PDF:
adobe-6-page.pdf (English compliance white paper with diagrams):
- Main has 16 OCR errors, New fixes all 16: `٨ ةh0e` (garbled header), `Seryices` → `Services`, `Compianceoveniew` → `Compliance Overview`, `ayailability` → `availability`, `comptiance` → `compliance`, `reauirements` → `requirements`, `27oo1:2o13` → `27001:2013`, `FedRAMp` → `FedRAMP`, `FERpA` → `FERPA`, `5ocI٨ا` → `SOCIAL`, `SE5MENTs` → `SEGMENTS`, `obigations` → `obligations`, `priyacy` → `privacy`, `Asget yanagement` → `Asset Management`, missing ® symbol, incomplete conclusion
- New issue: one secondary model output contains a JSON code block from the compliance icons crop (minor)
kfd.pdf (English/Arabic motor insurance tables):
- English: Both extract all AED amounts correctly (3,500,000 / 6,770 / 5,000 / 7,500 / 20,000 / 200,000). Main uses `GcC` instead of `GCC`
- Arabic: Main outputs reversed/unstructured Arabic text — not readable as natural Arabic. New outputs proper right-to-left Arabic with markdown tables: `التغطيات التأمينية` ("insurance coverages"), `درهم إماراتي` ("UAE dirham"), all AED amounts in Arabic numerals (٣,٥٠٠,٠٠٠ / ٦,٧٧٠ / ٥,٠٠٠)
- New has proper markdown table structure for both English and Arabic benefit tables
- New issue: Arabic table has minor number errors — `٦,٥٠٠` for emergency medical (should be `٦,٠٠٠`) and `٦,٧٧٠` for personal injury cover (should be `٢٠,٠٠٠`)
tickets.pdf (Arabic government receipts + English travel docs):
- Main extracts only 777 chars — reversed Arabic (`نامع ةنطلس` instead of `سلطنة عمان`, "Sultanate of Oman"), garbled English (`NVOIGE` instead of `INVOICE`, `MNVDICe Cwnoamip`, `PECH-HZ`), and 3 completely blank pages
- New extracts 4,686 chars — readable Arabic receipt (`سلطنة عمان` "Sultanate of Oman", `سند صرف` "payment voucher", financial table with amounts), full English embassy check (HSBC, $1,598.00), Air Canada flight itinerary (4 flight segments with times, airports, passenger names), and invoice details
- New issue: secondary model hallucinates on the blurry handwriting page ("I'm not sure if I'm going to be able to do this") and describes image quality instead of extracting text on low-quality scans
oman-2040-en-min.pdf (17-page English government vision document):
- Main has 9+ systematic OCR errors: `2o4o` → `2040`, `oriorities` → `priorities`, `Ciyilization` → `Civilization`, `sOcial` → `social`, `SMuscol` → `Muscat`, `GR code` → `QR code`, `DocUumenf` → `Document`, `0mm2040` → `oman2040`, `Iargest` → `largest`
- Both extract the core content (vision priorities, national programs, indicators)
ar-novel.pdf (28-page Arabic academic paper):
- Both produce comparable Arabic text (~30K Arabic chars each, +7% in new)
- Main outputs reversed Arabic text throughout — functional but harder to process downstream
- New outputs proper Arabic with markdown headings (`# الْبُعد العجائبي في الرواية العربية`, "The Fantastical Dimension in the Arabic Novel")
- Similar speed (191s vs 203s) — pure text pages with no Picture crops, so the secondary model is not involved
- New is slightly slower likely due to higher max_tokens (4096 vs 512) allowing more complete extraction
Formatting: Main produces zero markdown formatting across all PDFs (no headings, no tables). New produces structured markdown throughout with ## headings, | | tables, and * bullet lists
Use recommended hyperparameters per model:
- Qwen models: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `presence_penalty=1.5` (critical for preventing Arabic text degeneration)
- DotsOCR SVG: `temperature=0.6`, `top_p=0.9`, `repetition_penalty=1.15` (critical for preventing SVG path degeneration)
- Granite-Docling: `repetition_penalty=1.1` (fixes degeneration loops)
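The per-model hyperparameters above can be kept in one lookup so the serving layer never mixes them up; a sketch, where the substring matching and the greedy-decoding default are assumptions rather than the production routing:

```python
# Recommended sampling parameters from this report's stability tests.
QWEN_PARAMS = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5}
DOTS_SVG_PARAMS = {"temperature": 0.6, "top_p": 0.9, "repetition_penalty": 1.15}
GRANITE_DOCLING_PARAMS = {"repetition_penalty": 1.1}


def sampling_params_for(model_name: str) -> dict:
    """Map a served model name to its recommended sampling params.

    The name matching and the temperature=0.0 fallback are illustrative
    assumptions, not part of the production pipeline.
    """
    name = model_name.lower()
    if "qwen" in name:
        return dict(QWEN_PARAMS)  # critical for Arabic stability
    if "svg" in name:
        return dict(DOTS_SVG_PARAMS)  # critical for SVG path stability
    if "docling" in name:
        return dict(GRANITE_DOCLING_PARAMS)
    return {"temperature": 0.0}  # assumed default for dedicated OCR models
```

Each dict can be passed straight through as sampling fields of the chat-completions request body.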
For English-heavy workloads: Consider PaddleOCR-VL-1.5 as a fast preprocessing tier (0.3-1.0s) with DotsOCR v1.5 fallback for Arabic content or complex layouts.
Adobe Page 3 Test — dots.ocr Behavior on Complex Pages
Tested dots.ocr v1.0 and v1.5 on adobe-6-page.pdf page 3 (CCF diagram + compliance infographic + body text):
- Both v1.0 and v1.5 produce identical output — body text only, diagrams marked as Picture
- Neither responds to prompt variation — tested 5 different prompts (OCR, extract all, describe, scene text), all produced the same output
- This confirms dots.ocr is a document parser, not a general VLM — it extracts structured text/tables and marks visual elements as Picture for secondary VLM processing
- dots.ocr v1.0 layout detection correctly identifies 3 Picture regions on this page, which are then routed to Qwen
HuggingFace Availability Note
The official rednote-hilab/dots.ocr-1.5 and dots.ocr-1.5-svg repos were removed from HuggingFace (tracked in GitHub issue #272). Available via:
- dots.ocr-1.5: HuggingFace mirror `kristaller486/dots.ocr-1.5` (MIT license, 40K+ downloads)
- dots.ocr-1.5-svg: ModelScope only, `modelscope.cn/models/rednote-hilab/dots.ocr-1.5-svg` (MIT license)