siraaj-dot-ocr-service / docs/research/triton-concurrency-tuning.md
# Triton Concurrency Tuning Report
- **Date:** 2026-03-25
- **GPU:** NVIDIA L40S (46 GB VRAM)
- **Triton:** 26.02 with ONNX Runtime backend + Python BLS backend
- **Protocol:** Direct gRPC via `tritonclient.grpc` (no HTTP proxy)
- **Goal:** Determine optimal Triton instance counts and preprocessing strategy for high-throughput PDF batch processing
## Models Under Test
| Model | Backend | Role | Instances | GPU/CPU |
|---|---|---|---|---|
| router_onnx | ONNX Runtime | SigLIP2 vision encoder | 1 | GPU |
| router_pipeline | Python BLS | Image decode + preprocess + cosine similarity | 2 | CPU |
| det_onnx | ONNX Runtime | PaddleOCR text detection | 2 | GPU |
| cls_onnx | ONNX Runtime | PaddleOCR text classification | 1 | GPU |
| rec_onnx | ONNX Runtime | PaddleOCR text recognition | 2 | GPU |
| ocr_pipeline | Python BLS | Detection → classification → recognition orchestrator | 4 | CPU |
## Router Profiling
Profiled per-request time breakdown inside the router pipeline:
### Single-request breakdown (BLS preprocessing)
| Step | Time (ms) | Notes |
|---|---|---|
| PNG decode (PIL) | 15 | Full-size page image (~460KB) |
| BICUBIC resize to 256x256 | 14 | Most expensive preprocessing step |
| Normalize + transpose | 2 | Negligible |
| Total preprocessing | 32 | |
### Triton model stats (router_onnx)
| Metric | Per-request avg | Notes |
|---|---|---|
| Queue wait | 182ms | Requests waiting for single GPU instance |
| GPU inference | 18ms | ONNX Runtime on L40S (not 5ms as initially estimated) |
| Compute I/O | 0.4ms | Negligible |
**Key finding:** GPU inference is 18ms, not 5ms. With 1 GPU instance, the theoretical max is 1000/18 ≈ 56 RPS. After BLS + gRPC overhead, the practical ceiling is ~32 RPS.
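The ceiling arithmetic can be made explicit:

```python
# A single GPU instance serializes inference, so the ceiling is 1000ms / service time.
GPU_INFERENCE_MS = 18
theoretical_rps = 1000 / GPU_INFERENCE_MS
print(f"{theoretical_rps:.0f} RPS")  # 56 RPS; practice lands near 32 after BLS + gRPC overhead
```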
## Optimization: Client-Side Pre-Resize
Moved the image resize from the BLS (server side) to `TritonClient.classify()` (client side): resize to 256x256 JPEG (quality=95) before sending to Triton.
| Image format | Size | BLS decode time |
|---|---|---|
| Full PNG (original) | 460KB | 32ms (decode + resize) |
| 256x256 JPEG (pre-resized) | 20KB | 0.4ms (decode only, no resize) |
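A minimal client-side sketch of this pre-resize, assuming PIL is available. The function name is illustrative; the actual implementation lives in `TritonClient.classify()`:

```python
from io import BytesIO

from PIL import Image


def pre_resize_for_router(png_bytes: bytes, size: int = 256, quality: int = 95) -> bytes:
    """Resize a page image to the router's input size on the client,
    so the server-side BLS only has to decode a tiny JPEG."""
    img = Image.open(BytesIO(png_bytes)).convert("RGB")
    img = img.resize((size, size), Image.BICUBIC)
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

The returned bytes replace the full-size PNG as the raw image input sent to router_pipeline.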
### Results: Pre-Resized JPEG vs Full PNG
| Workers | RPS (PNG) | RPS (JPEG) | p50 PNG (ms) | p50 JPEG (ms) | p50 improvement |
|---|---|---|---|---|---|
| 4 | 28.0 | 30.8 | 123 | 41 | 3x faster |
| 16 | 27.8 | 31.7 | 331 | 85 | 4x faster |
| 64 | 30.2 | 32.7 | 1067 | 418 | 2.5x faster |
**Throughput:** a modest ~10% improvement. GPU inference (18ms) is the hard ceiling, not preprocessing.
**Latency:** 3-4x improvement at typical concurrency. This is the real win for the Temporal worker use case.
## Accuracy Validation
Tested 27 pages across 5 PDFs (English + Arabic). Zero classification flips between PNG and pre-resized JPEG paths. Max margin difference: 0.0067 (threshold is 0.02).
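The flip/margin comparison can be sketched as below, assuming the router exposes per-class cosine similarities (helper names are hypothetical):

```python
import numpy as np


def classify_margin(similarities: np.ndarray) -> tuple[int, float]:
    """Return (top-1 class index, margin between top-1 and top-2 cosine scores)."""
    order = np.argsort(similarities)[::-1]
    margin = float(similarities[order[0]] - similarities[order[1]])
    return int(order[0]), margin


def paths_agree(sims_png, sims_jpeg, margin_tol: float = 0.02) -> bool:
    """True if both input paths pick the same class and their margins differ
    by no more than margin_tol (the report's 0.02 threshold)."""
    cls_png, m_png = classify_margin(np.asarray(sims_png))
    cls_jpeg, m_jpeg = classify_margin(np.asarray(sims_jpeg))
    return cls_png == cls_jpeg and abs(m_png - m_jpeg) <= margin_tol
```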
## Experiment: Increased Instance Counts
### Router Pipeline (CPU instances: 2 → 4/8/16)
No throughput improvement at any count. The BLS per-request time (~10ms with pre-resize) is not the bottleneck — the single GPU instance is.
| CPU Instances | RPS @4w | RPS @64w |
|---|---|---|
| 2 | 30.8 | 32.7 |
| 4 | 31.3 | 31.1 |
| 8 | 28.1 | 30.1 |
| 16 | 31.3 | 32.9 |
More instances increase the cold-start penalty without improving throughput.
### Router ONNX (GPU instances: 1 → 2)
| Workers | RPS (1 GPU) | RPS (2 GPU) |
|---|---|---|
| 4 | 30.8 | 23.7 |
| 64 | 32.7 | 33.2 |
| 128 | 29.1 | 33.8 |
Marginal gain at high concurrency, worse at low concurrency due to GPU contention.
### All Models Scaled (4x GPU, 8-16x CPU)
| Config Change | Before | After |
|---|---|---|
| router_onnx GPU instances | 1 | 4 |
| router_pipeline CPU instances | 2 | 8 |
| det_onnx GPU instances | 2 | 4 |
| cls_onnx GPU instances | 1 | 4 |
| rec_onnx GPU instances | 2 | 4 |
| ocr_pipeline CPU instances | 4 | 16 |
| Total GPU VRAM used | ~13 GB | ~31 GB |
**Result:** performance degraded, with up to a 96% RPS drop at low concurrency due to CUDA context switching and VRAM pressure.
## Preprocessing Alternatives
| Approach | RPS @64w | p50 @4w (ms) | Notes |
|---|---|---|---|
| PIL BICUBIC (original) | 30.2 | 123 | Baseline |
| cv2 INTER_CUBIC | 25.4 | 119 | Slower — cv2 PNG decode is heavier |
| PIL BILINEAR | 24.3 | — | No improvement, cold start worse |
| Client pre-resize JPEG | 32.7 | 41 | Winner — eliminates BLS preprocessing |
## OCR Pipeline
| Workers | RPS | p50 (ms) | p95 (ms) | Max (ms) | Errors |
|---|---|---|---|---|---|
| 1 | 0.2 | 5859 | 5859 | 5859 | 0 |
| 4 | 0.3 | 10040 | 12136 | 12136 | 0 |
| 16 | 0.4 | 29032 | 44079 | 44079 | 0 |
| 64 | 1.0 | 35156 | 59347 | 65112 | 0 |
A single OCR request takes ~6 seconds (det → cls → rec chain). Instance tuning has no effect: the sequential GPU chain cannot be parallelized within a single request.
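Why instance counts can't help here: a request's latency is the sum of its serial stages. A minimal sketch (the per-stage numbers below are illustrative assumptions, not measured values):

```python
def chain_latency_ms(det_ms: float, cls_ms: float, rec_ms: float, n_regions: int) -> float:
    """Latency of one OCR request whose stages run strictly in sequence.
    Extra Triton instances parallelize across requests; they never shrink this sum."""
    return det_ms + n_regions * (cls_ms + rec_ms)


# Illustrative numbers only: ~50 text regions at these per-call costs sum to ~6s.
print(chain_latency_ms(det_ms=400, cls_ms=20, rec_ms=90, n_regions=50))  # 5900.0
```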
## Conclusions
### Optimal config for single L40S
| Model | Instances | Why |
|---|---|---|
| router_onnx | 1 GPU | GPU inference is 18ms. A single instance handles 32 RPS. Adding more causes contention. |
| router_pipeline | 2 CPU | With client pre-resize, BLS work is ~0.4ms. 2 instances are more than enough. |
| det_onnx | 2 GPU | Detection runs once per page. 2 instances allow overlap. |
| cls_onnx | 1 GPU | Classification runs per text region, but is fast per call. |
| rec_onnx | 2 GPU | Recognition runs per text region. 2 instances allow overlap. |
| ocr_pipeline | 4 CPU | Orchestrator with CPU preprocessing between GPU calls. |
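These counts are expressed per model via the `instance_group` stanza in each `config.pbtxt`; a sketch for rec_onnx, assuming a single GPU at device index 0:

```protobuf
# config.pbtxt for rec_onnx (sketch): 2 GPU instances pinned to device 0
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```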
### Router: 32 RPS is GPU-bound
The throughput ceiling is set by router_onnx at 18ms/inference. Client pre-resize eliminates CPU preprocessing as a factor, cutting p50 latency 3x. At 4 Temporal workers, each page classifies in 41ms, so a 100-page document routes in ~1 second.
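The ~1 second figure follows from each of the 4 workers sending its share of pages sequentially at the 41ms p50:

```python
# Back-of-envelope: pages split evenly across workers, requests sent serially per worker.
pages, workers, p50_ms = 100, 4, 41
pages_per_worker = pages / workers            # 25 sequential requests per worker
wall_clock_s = pages_per_worker * p50_ms / 1000
print(f"{wall_clock_s:.2f} s")                # 1.03 s
```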
### OCR: 6s/page is model-bound
To improve OCR throughput, the path is model-level optimization (FP16, TensorRT, or replacing PaddleOCR entirely), not instance tuning.
### Cold start
First request to any model takes ~1.3 seconds due to ONNX CUDA kernel compilation and Python BLS stub initialization. Not an instance count issue.
## How to Reproduce
```bash
# Benchmark (direct gRPC, pre-resized JPEG for router)
python scripts/benchmark_triton_concurrency.py --endpoint both --workers 1 4 16 64 --url localhost:8101

# Accuracy test (compares PNG vs pre-resized JPEG classification)
python scripts/test_router_accuracy.py
```