
# Triton Concurrency Tuning Report

Last updated: 4/16/2026

- **Date:** 2026-03-25
- **GPU:** NVIDIA L40S (46 GB VRAM)
- **Triton:** 26.02 with ONNX Runtime backend + Python BLS backend
- **Protocol:** direct gRPC via `tritonclient.grpc` (no HTTP proxy)
- **Goal:** determine optimal Triton instance counts and preprocessing strategy for high-throughput PDF batch processing

## Models Under Test

| Model | Backend | Role | Instances | GPU/CPU |
|---|---|---|---|---|
| router_onnx | ONNX Runtime | SigLIP2 vision encoder | 1 | GPU |
| router_pipeline | Python BLS | Image decode + preprocess + cosine similarity | 2 | CPU |
| det_onnx | ONNX Runtime | PaddleOCR text detection | 2 | GPU |
| cls_onnx | ONNX Runtime | PaddleOCR text classification | 1 | GPU |
| rec_onnx | ONNX Runtime | PaddleOCR text recognition | 2 | GPU |
| ocr_pipeline | Python BLS | Detection → classification → recognition orchestrator | 4 | CPU |

## Router Profiling

Profiled per-request time breakdown inside the router pipeline:

### Single-request breakdown (BLS preprocessing)

| Step | Time (ms) | Notes |
|---|---|---|
| PNG decode (PIL) | 15 | Full-size page image (~460 KB) |
| BICUBIC resize to 256x256 | 14 | Most expensive preprocessing step |
| Normalize + transpose | 2 | Negligible |
| **Total preprocessing** | **32** | |

### Triton model stats (router_onnx)

| Metric | Per-request avg | Notes |
|---|---|---|
| Queue wait | 182 ms | Requests waiting for the single GPU instance |
| GPU inference | 18 ms | ONNX Runtime on L40S (not 5 ms as initially estimated) |
| Compute I/O | 0.4 ms | Negligible |

**Key finding:** GPU inference takes 18 ms, not 5 ms. With one GPU instance, the theoretical maximum is 1000/18 ≈ 56 RPS; after BLS and gRPC overhead, the practical ceiling is ~32 RPS.
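As a sanity check, the ceiling arithmetic can be reproduced from the figures above:

```python
# Back-of-envelope throughput ceiling for one router_onnx GPU instance,
# using the 18 ms inference time measured above.
gpu_inference_ms = 18
theoretical_rps = 1000 / gpu_inference_ms  # one instance, perfectly pipelined

print(f"theoretical ceiling: {theoretical_rps:.1f} RPS")  # ~55.6 RPS
```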

## Optimization: Client-Side Pre-Resize

We moved the image resize from the BLS (server side) into `TritonClient.classify()` (client side): each page is resized to 256x256 and re-encoded as JPEG (quality=95) before being sent to Triton.

| Image format | Size | BLS decode time |
|---|---|---|
| Full PNG (original) | 460 KB | 32 ms (decode + resize) |
| 256x256 JPEG (pre-resized) | 20 KB | 0.4 ms (decode only, no resize) |

### Results: Pre-Resized JPEG vs Full PNG

| Workers | RPS (PNG) | RPS (JPEG) | p50 PNG (ms) | p50 JPEG (ms) | p50 improvement |
|---|---|---|---|---|---|
| 4 | 28.0 | 30.8 | 123 | 41 | 3x faster |
| 16 | 27.8 | 31.7 | 331 | 85 | 4x faster |
| 64 | 30.2 | 32.7 | 1067 | 418 | 2.5x faster |

- **Throughput:** modest ~10% improvement; GPU inference (18 ms) is the hard ceiling, not preprocessing.
- **Latency:** 3-4x improvement at typical concurrency. This is the real win for the Temporal worker use case.

## Accuracy Validation

Tested 27 pages across 5 PDFs (English + Arabic). Zero classification flips between PNG and pre-resized JPEG paths. Max margin difference: 0.0067 (threshold is 0.02).
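The check itself is simple; a hedged sketch of its shape (data layout and function name are illustrative, not the actual `test_router_accuracy.py`):

```python
def compare_routing(pages_png, pages_jpeg):
    """Compare router outputs for the same pages classified via both paths.

    Each element is a (label, margin) pair; returns the number of label
    flips and the largest absolute margin difference.
    """
    flips = sum(1 for (la, _), (lb, _) in zip(pages_png, pages_jpeg) if la != lb)
    max_diff = max(abs(ma - mb) for (_, ma), (_, mb) in zip(pages_png, pages_jpeg))
    return flips, max_diff
```

The run above corresponds to `flips == 0` with `max_diff == 0.0067`, comfortably under the 0.02 decision threshold.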

## Experiment: Increased Instance Counts

### Router Pipeline (CPU instances: 2 → 4/8/16)

No throughput improvement at any count. The BLS per-request time (~10 ms with pre-resize) is not the bottleneck; the single GPU instance is.

| CPU instances | RPS @4w | RPS @64w |
|---|---|---|
| 2 | 30.8 | 32.7 |
| 4 | 31.3 | 31.1 |
| 8 | 28.1 | 30.1 |
| 16 | 31.3 | 32.9 |

Additional instances only increase the cold-start penalty without improving throughput.

### Router ONNX (GPU instances: 1 → 2)

| Workers | RPS (1 GPU) | RPS (2 GPU) |
|---|---|---|
| 4 | 30.8 | 23.7 |
| 64 | 32.7 | 33.2 |
| 128 | 29.1 | 33.8 |

Marginal gain at high concurrency, worse at low concurrency due to GPU contention.

### All Models Scaled (4x GPU, 8-16x CPU)

| Config change | Before | After |
|---|---|---|
| router_onnx GPU instances | 1 | 4 |
| router_pipeline CPU instances | 2 | 8 |
| det_onnx GPU instances | 2 | 4 |
| cls_onnx GPU instances | 1 | 4 |
| rec_onnx GPU instances | 2 | 4 |
| ocr_pipeline CPU instances | 4 | 16 |
| Total GPU VRAM used | ~13 GB | ~31 GB |

**Result:** performance degraded, with up to a 96% RPS drop at low concurrency due to CUDA context switching and VRAM pressure.

## Preprocessing Alternatives

| Approach | RPS @64w | p50 @4w (ms) | Notes |
|---|---|---|---|
| PIL BICUBIC (original) | 30.2 | 123 | Baseline |
| cv2 INTER_CUBIC | 25.4 | 119 | Slower: cv2 PNG decode is heavier |
| PIL BILINEAR | 24.3 | | No improvement, cold start worse |
| Client pre-resize JPEG | 32.7 | 41 | Winner: eliminates BLS preprocessing |

## OCR Pipeline

| Workers | RPS | p50 (ms) | p95 (ms) | Max (ms) | Errors |
|---|---|---|---|---|---|
| 1 | 0.2 | 5859 | 5859 | 5859 | 0 |
| 4 | 0.3 | 10040 | 12136 | 12136 | 0 |
| 16 | 0.4 | 29032 | 44079 | 44079 | 0 |
| 64 | 1.0 | 35156 | 59347 | 65112 | 0 |

A single OCR request takes ~6 seconds (det → cls → rec chain). Instance tuning has no effect here: the sequential GPU chain cannot be parallelized within a single request.
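The reason extra instances don't help: within one request the stages are strictly sequential, so per-request latency is the sum of the stages no matter how many instances exist. A toy model (stage times here are illustrative placeholders, not measurements):

```python
# Toy latency model: instances add concurrency ACROSS requests, but never
# inside one request's det -> cls -> rec chain. Stage times are illustrative.
det_ms, cls_ms, rec_ms = 800, 20, 90
regions_per_page = 40

per_request_ms = det_ms + regions_per_page * (cls_ms + rec_ms)
print(per_request_ms)  # 5200 -- the same order as the measured ~6 s
```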

## Conclusions

### Optimal config for single L40S

| Model | Instances | Why |
|---|---|---|
| router_onnx | 1 GPU | GPU inference is 18 ms; a single instance handles 32 RPS, and adding more causes contention |
| router_pipeline | 2 CPU | With client pre-resize, BLS work is ~0.4 ms; 2 instances is more than enough |
| det_onnx | 2 GPU | Detection runs once per page; 2 instances allow overlap |
| cls_onnx | 1 GPU | Classification runs per text region, but is fast per call |
| rec_onnx | 2 GPU | Recognition runs per text region; 2 instances allow overlap |
| ocr_pipeline | 4 CPU | Orchestrator with CPU preprocessing between GPU calls |
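For reference, these counts are set per model via `instance_group` in each model's `config.pbtxt`. An illustrative fragment for the recommended `det_onnx` setting (the rest of the file is omitted):

```
# models/det_onnx/config.pbtxt (fragment)
instance_group [
  {
    count: 2        # two GPU instances, per the table above
    kind: KIND_GPU
    gpus: [ 0 ]     # the single L40S
  }
]
```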

### Router: 32 RPS is GPU-bound

The throughput ceiling is set by router_onnx at 18 ms/inference. Client pre-resize eliminates CPU preprocessing as a factor, cutting p50 latency 3x. At 4 Temporal workers, each page classifies in 41 ms, so a 100-page document routes in ~1 second.
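The 100-page figure follows directly from those numbers:

```python
pages, workers, p50_ms = 100, 4, 41    # figures from this report
total_ms = (pages / workers) * p50_ms  # 25 sequential waves of 4 parallel pages

print(total_ms)  # 1025.0 -- roughly 1 second per 100-page document
```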

### OCR: 6s/page is model-bound

To improve OCR throughput, the path is model-level optimization (FP16, TensorRT, or replacing PaddleOCR entirely), not instance tuning.

### Cold start

The first request to any model takes ~1.3 seconds due to ONNX CUDA kernel compilation and Python BLS stub initialization. This is not an instance-count issue.

## How to Reproduce

```bash
# Benchmark (direct gRPC, pre-resized JPEG for router)
python scripts/benchmark_triton_concurrency.py --endpoint both --workers 1 4 16 64 --url localhost:8101

# Accuracy test (compares PNG vs pre-resized JPEG classification)
python scripts/test_router_accuracy.py
```