
# Triton Concurrency Tuning Report

Last updated: 4/16/2026

- **Date:** 2026-03-25
- **GPU:** NVIDIA L40S (46 GB VRAM)
- **Triton:** 26.02 with ONNX Runtime backend + Python BLS backend
- **Protocol:** direct gRPC via `tritonclient.grpc` (no HTTP proxy)
- **Goal:** determine optimal Triton instance counts and preprocessing strategy for high-throughput PDF batch processing

## Models Under Test

| Model | Backend | Role | Instances | GPU/CPU |
|---|---|---|---|---|
| router_onnx | ONNX Runtime | SigLIP2 vision encoder | 1 | GPU |
| router_pipeline | Python BLS | Image decode + preprocess + cosine similarity | 2 | CPU |
| det_onnx | ONNX Runtime | PaddleOCR text detection | 2 | GPU |
| cls_onnx | ONNX Runtime | PaddleOCR text classification | 1 | GPU |
| rec_onnx | ONNX Runtime | PaddleOCR text recognition | 2 | GPU |
| ocr_pipeline | Python BLS | Detection → classification → recognition orchestrator | 4 | CPU |

## Router Profiling

Profiled per-request time breakdown inside the router pipeline:

### Single-request breakdown (BLS preprocessing)

| Step | Time (ms) | Notes |
|---|---|---|
| PNG decode (PIL) | 15 | Full-size page image (~460 KB) |
| BICUBIC resize to 256x256 | 14 | Most expensive preprocessing step |
| Normalize + transpose | 2 | Negligible |
| **Total preprocessing** | **32** | |

### Triton model stats (router_onnx)

| Metric | Per-request avg | Notes |
|---|---|---|
| Queue wait | 182 ms | Requests waiting for the single GPU instance |
| GPU inference | 18 ms | ONNX Runtime on L40S (not 5 ms as initially estimated) |
| Compute I/O | 0.4 ms | Negligible |

**Key finding:** GPU inference takes 18 ms, not 5 ms. With one GPU instance, the theoretical maximum is 1000/18 ≈ 56 RPS; after BLS and gRPC overhead, the practical ceiling is ~32 RPS.
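As a sanity check, the ceiling arithmetic can be reproduced from the figures above:

```python
# Back-of-envelope throughput ceiling for one router_onnx GPU instance,
# using the 18 ms inference time measured above.
gpu_inference_ms = 18
theoretical_rps = 1000 / gpu_inference_ms  # one instance, perfectly pipelined

print(f"theoretical ceiling: {theoretical_rps:.1f} RPS")  # ~55.6 RPS
```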

## Optimization: Client-Side Pre-Resize

We moved the image resize from the BLS (server side) into `TritonClient.classify()` (client side): each page is resized to 256x256 and re-encoded as JPEG (quality=95) before being sent to Triton.

| Image format | Size | BLS decode time |
|---|---|---|
| Full PNG (original) | 460 KB | 32 ms (decode + resize) |
| 256x256 JPEG (pre-resized) | 20 KB | 0.4 ms (decode only, no resize) |

### Results: Pre-Resized JPEG vs Full PNG

| Workers | RPS (PNG) | RPS (JPEG) | p50 PNG (ms) | p50 JPEG (ms) | p50 improvement |
|---|---|---|---|---|---|
| 4 | 28.0 | 30.8 | 123 | 41 | 3x faster |
| 16 | 27.8 | 31.7 | 331 | 85 | 4x faster |
| 64 | 30.2 | 32.7 | 1067 | 418 | 2.5x faster |

- **Throughput:** modest ~10% improvement; GPU inference (18 ms) is the hard ceiling, not preprocessing.
- **Latency:** 3-4x improvement at typical concurrency. This is the real win for the Temporal worker use case.

## Accuracy Validation

Tested 27 pages across 5 PDFs (English + Arabic). Zero classification flips between PNG and pre-resized JPEG paths. Max margin difference: 0.0067 (threshold is 0.02).
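The check itself is simple; a hedged sketch of its shape (data layout and function name are illustrative, not the actual `test_router_accuracy.py`):

```python
def compare_routing(pages_png, pages_jpeg):
    """Compare router outputs for the same pages classified via both paths.

    Each element is a (label, margin) pair; returns the number of label
    flips and the largest absolute margin difference.
    """
    flips = sum(1 for (la, _), (lb, _) in zip(pages_png, pages_jpeg) if la != lb)
    max_diff = max(abs(ma - mb) for (_, ma), (_, mb) in zip(pages_png, pages_jpeg))
    return flips, max_diff
```

The run above corresponds to `flips == 0` with `max_diff == 0.0067`, comfortably under the 0.02 decision threshold.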

## Experiment: Increased Instance Counts

### Router Pipeline (CPU instances: 2 → 4/8/16)

No throughput improvement at any count. The BLS per-request time (~10 ms with pre-resize) is not the bottleneck; the single GPU instance is.

| CPU instances | RPS @4w | RPS @64w |
|---|---|---|
| 2 | 30.8 | 32.7 |
| 4 | 31.3 | 31.1 |
| 8 | 28.1 | 30.1 |
| 16 | 31.3 | 32.9 |

Additional instances only increase the cold-start penalty without improving throughput.

### Router ONNX (GPU instances: 1 → 2)

| Workers | RPS (1 GPU) | RPS (2 GPU) |
|---|---|---|
| 4 | 30.8 | 23.7 |
| 64 | 32.7 | 33.2 |
| 128 | 29.1 | 33.8 |

Marginal gain at high concurrency, worse at low concurrency due to GPU contention.

### All Models Scaled (4x GPU, 8-16x CPU)

| Config change | Before | After |
|---|---|---|
| router_onnx GPU instances | 1 | 4 |
| router_pipeline CPU instances | 2 | 8 |
| det_onnx GPU instances | 2 | 4 |
| cls_onnx GPU instances | 1 | 4 |
| rec_onnx GPU instances | 2 | 4 |
| ocr_pipeline CPU instances | 4 | 16 |
| Total GPU VRAM used | ~13 GB | ~31 GB |

**Result:** performance degraded, with up to a 96% RPS drop at low concurrency due to CUDA context switching and VRAM pressure.

## Preprocessing Alternatives

| Approach | RPS @64w | p50 @4w (ms) | Notes |
|---|---|---|---|
| PIL BICUBIC (original) | 30.2 | 123 | Baseline |
| cv2 INTER_CUBIC | 25.4 | 119 | Slower: cv2 PNG decode is heavier |
| PIL BILINEAR | 24.3 | | No improvement, cold start worse |
| Client pre-resize JPEG | 32.7 | 41 | Winner: eliminates BLS preprocessing |

## OCR Pipeline

| Workers | RPS | p50 (ms) | p95 (ms) | Max (ms) | Errors |
|---|---|---|---|---|---|
| 1 | 0.2 | 5859 | 5859 | 5859 | 0 |
| 4 | 0.3 | 10040 | 12136 | 12136 | 0 |
| 16 | 0.4 | 29032 | 44079 | 44079 | 0 |
| 64 | 1.0 | 35156 | 59347 | 65112 | 0 |

A single OCR request takes ~6 seconds (det → cls → rec chain). Instance tuning has no effect here: the sequential GPU chain cannot be parallelized within a single request.
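The reason extra instances don't help: within one request the stages are strictly sequential, so per-request latency is the sum of the stages no matter how many instances exist. A toy model (stage times here are illustrative placeholders, not measurements):

```python
# Toy latency model: instances add concurrency ACROSS requests, but never
# inside one request's det -> cls -> rec chain. Stage times are illustrative.
det_ms, cls_ms, rec_ms = 800, 20, 90
regions_per_page = 40

per_request_ms = det_ms + regions_per_page * (cls_ms + rec_ms)
print(per_request_ms)  # 5200 -- the same order as the measured ~6 s
```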

## Conclusions

### Optimal config for single L40S

| Model | Instances | Why |
|---|---|---|
| router_onnx | 1 GPU | GPU inference is 18 ms; a single instance handles 32 RPS, and adding more causes contention |
| router_pipeline | 2 CPU | With client pre-resize, BLS work is ~0.4 ms; 2 instances is more than enough |
| det_onnx | 2 GPU | Detection runs once per page; 2 instances allow overlap |
| cls_onnx | 1 GPU | Classification runs per text region, but is fast per call |
| rec_onnx | 2 GPU | Recognition runs per text region; 2 instances allow overlap |
| ocr_pipeline | 4 CPU | Orchestrator with CPU preprocessing between GPU calls |
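For reference, these counts are set per model via `instance_group` in each model's `config.pbtxt`. An illustrative fragment for the recommended `det_onnx` setting (the rest of the file is omitted):

```
# models/det_onnx/config.pbtxt (fragment)
instance_group [
  {
    count: 2        # two GPU instances, per the table above
    kind: KIND_GPU
    gpus: [ 0 ]     # the single L40S
  }
]
```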

### Router: 32 RPS is GPU-bound

The throughput ceiling is set by router_onnx at 18 ms/inference. Client pre-resize eliminates CPU preprocessing as a factor, cutting p50 latency 3x. At 4 Temporal workers, each page classifies in 41 ms, so a 100-page document routes in ~1 second.
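The 100-page figure follows directly from those numbers:

```python
pages, workers, p50_ms = 100, 4, 41    # figures from this report
total_ms = (pages / workers) * p50_ms  # 25 sequential waves of 4 parallel pages

print(total_ms)  # 1025.0 -- roughly 1 second per 100-page document
```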

### OCR: 6s/page is model-bound

To improve OCR throughput, the path is model-level optimization (FP16, TensorRT, or replacing PaddleOCR entirely), not instance tuning.

### Cold start

The first request to any model takes ~1.3 seconds due to ONNX CUDA kernel compilation and Python BLS stub initialization. This is not an instance-count issue.

## How to Reproduce

```bash
# Benchmark (direct gRPC, pre-resized JPEG for router)
python scripts/benchmark_triton_concurrency.py --endpoint both --workers 1 4 16 64 --url localhost:8101

# Accuracy test (compares PNG vs pre-resized JPEG classification)
python scripts/test_router_accuracy.py
```