GPU-Efficient, Scalable, PCIe-Friendly Multistream Licence Plate Recognition Pipeline

How we rebuilt an industrial licence plate recognition pipeline from a per-camera DeepStream architecture into a GPU-efficient, PCIe-friendly multistream design with batched OCR.

Overview

This article documents a production journey: taking a licence plate recognition (LPR) system that worked on a handful of cameras and rebuilding it for industrial scale — without pinning a single datacenter GPU at 94% utilization.

The original stack was not broken. On two cameras it was stable, accurate, and easy to operate. The problems surfaced only at scale, and they were architectural — not model-quality issues. The fix was not swapping OCR models; it was redesigning how video, detection, cropping, and recognition share GPU memory and PCIe bandwidth.

Part 1 covers the initial pipeline. Part 2 documents the rebuild and what each optimization contributed.


1. The Initial Pipeline

The production system was a DeepStream-based Docker service started with a single command:

python3 plate_supervisor.py -f plate_cams.txt

Behind that line hid a simple rule: one independent processing stack per camera.

Architecture

Component               Role
─────────               ────
plate_supervisor.py     Reads plate_cams.txt and spawns one worker process per camera
plate_ws_service.py     Per-camera pipeline: RTSP → YOLO TRT → crop → OCR → WebSocket
plate_cams.txt          Camera registry, up to 20 streams in production
manual_ocr_gateway.py   Gateway for on-demand manual OCR across cameras
ocr-gpu-service         Separate container: PaddleOCR GPU over HTTP (POST /ocr)

Each RTSP source got its own plate_ws_service.py OS process with a dedicated HTTP/WebSocket endpoint. DeepStream ran on a daemon thread inside that process; FastAPI/Uvicorn handled the API layer on the main asyncio loop.

plate_cams.txt (20 cameras)
plate_supervisor.py
        ├── plate_ws_service.py  (cam 1)  ──► DeepStream + YOLO  ──► HTTP OCR
        ├── plate_ws_service.py  (cam 2)  ──► DeepStream + YOLO  ──► HTTP OCR
        ├── ...
        └── plate_ws_service.py  (cam 20) ──► DeepStream + YOLO  ──► HTTP OCR
                                          ocr-gpu-service
                                          (shared PaddleOCR API)

The system was multistream only in name. There was no shared nvstreammux across cameras — each stream owned its own OS process, its own DeepStream pipeline, its own YOLO TensorRT engine, and its own GPU context.

How a Frame Moved Through the System

NVIDIA DeepStream chains GStreamer elements to keep video on the GPU: NVDEC decode → nvstreammux → nvinfer (TensorRT) → Python probe. In our setup, nvstreammux was meant to batch multiple streams into one tensor, and nvinfer ran YOLO plate detection and attached bbox metadata (NvDsObjectMeta) to each frame. OCR lived outside DeepStream in a separate PaddleOCR service.

The per-camera path looked like this:

RTSP → NVDEC decode (GPU)
     → nvstreammux (batch-size = 1)
     → YOLO TensorRT — full-frame detection (GPU)
     → probe: full frame copied to CPU → crop sliced on CPU
     → JPEG encode → HTTP POST → PaddleOCR (det → cls → rec)
     → WebSocket result

YOLO always ran on the full decoded frame. The probe pulled the entire image out of GPU memory and only then cut the plate region on the CPU:

n_frame = pyds.get_nvds_buf_surface(hash(gst_buffer), frame_meta.batch_id)
frame_rgba = np.asarray(n_frame).copy()          # GPU → CPU (full 1920×1080)
frame_bgr = cv2.cvtColor(frame_rgba, cv2.COLOR_RGBA2BGR)
crop = frame_bgr[py:py2, px:px2]                # bbox slice on CPU, not GPU

Old pipeline PCIe path: GPU NVDEC and YOLO on full frame, entire 1920×1080 frame copied to CPU, crop sliced on CPU, then OCR over HTTP

GPU–CPU–OCR HTTP request, high PCIe traffic — detection stays on the GPU; the bus carries the full frame even though only a small crop reaches OCR.

Gst-nvinfer plugin architecture

Diagram source: NVIDIA DeepStream — Gst-nvinfer plugin

Plate Detection: DeepStream Integration and the Batch-Size = 1 Ceiling

Standard DeepStream nvinfer cannot run Ultralytics YOLO out of the box. We used marcoslucianops/DeepStream-Yolo — a custom libnvdsinfer_custom_impl_Yolo.so that provides NvDsInferYoloCudaEngineGet (ONNX → TensorRT) and NvDsInferParseYolo (raw output → bounding boxes).

Thanks to Marcos Luciano for DeepStream-Yolo.

The YOLO engine was exported and configured at batch-size=1:

# nvinfer (PGIE) config file
[property]
onnx-file=tr_plateV1.onnx
model-engine-file=plate_yolo.engine
batch-size=1
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=.../libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet

# Python pipeline setup
streammux.set_property("batch-size", 1)
pgie.set_property("config-file-path", PGIE_CONFIG_PATH)

With 20 cameras that meant 20 separate TensorRT engines in VRAM, each processing one frame per kernel launch. The GPU stayed busy (~94% utilization) firing small single-frame inferences across parallel pipelines — hard work, not smart work.

When we later tried true multistream, DeepStream logged Backend has maxBatchSize 1 whereas 2 has been requested. The ONNX had been exported at fixed batch=1 ([1, 3, 640, 640]), so the engine built from it could never accept more than one frame per call. Raising the ceiling required re-exporting the model at the target batch size (yolo export batch=N dynamic=False).

Old per-stream YOLO pipeline: N camera streams each with dedicated TensorRT thread batch-size 1

Each stream owns its own TensorRT thread (B1), its own engine in VRAM, and its own full-frame path. Frames from different cameras never enter the model as one batch.

Note — detector choice is not locked to YOLO. This deployment used YOLO for plate object detection, but DeepStream's nvinfer path is model-agnostic at the integration layer: any PyTorch or TensorFlow detector exported to ONNX with a fixed or dynamic batch dimension compatible with TensorRT can replace it, provided a matching custom bbox parser (or standard detector output format) is wired into config_plate_pgie.txt. The multistream architecture — shared nvstreammux, batched TRT engine, lazy ROI probe — does not depend on YOLO specifically.

RF-DETR is a notable alternative. Core models (Nano through Large) ship under Apache 2.0, which simplifies commercial and redistribution use compared with some YOLO licensing paths. RF-DETR exports to ONNX (model.export(format="onnx")) and integrates with DeepStream via community parsers such as deepstream-rfdetr — the same nvinfer + custom parse-bbox-func-name pattern as DeepStream-Yolo. For greenfield LPR projects where licence clarity and transformer-based accuracy matter, RF-DETR is worth evaluating alongside YOLO before committing to a TRT export.

OCR: Paddle GPU, Fast per Image, Slow at Fleet Scale

OCR ran in ocr-gpu-service (paddlecloud/paddleocr:2.6-gpu), a separate container exposing POST /ocr:

PaddleOCR(
    use_angle_cls=True,
    use_gpu=True,
    det_model_dir="/models/en_PP-OCRv3_det_infer",
    rec_model_dir="/models/en_PP-OCRv4_rec_infer",
    cls_model_dir="/models/ch_ppocr_mobile_v2.0_cls_infer",
)

For a single cropped plate, performance was excellent — 27.ZE 822 at ~0.88 confidence, inferenceMs in the 40–55 ms range:

{ "mergedText": "27.ZE 822", "inferenceMs": 53.9 }

Two or three cameras sending occasional crops was no problem. The pain started when crop volume exceeded what a single-image service could drain: one HTTP request per crop, no batch endpoint, no dynamic batching — each call a standalone PaddleOCR.ocr(image, cls=True) processed largely one after another.

Fleet size            What happened
──────────            ─────────────
2–3 cameras           Sporadic crops, sub-second end-to-end
20 cameras            Concurrent HTTP bursts, OCR queue forms
30 cameras × 25 FPS   Far beyond single-image throughput

inferenceMs stayed in the millisecond band, but queued requests waited seconds. And YOLO had already found the plate — yet PaddleOCR still ran full det → cls → rec on every crop.
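
A back-of-the-envelope check makes the gap concrete. The sketch below is an illustration, not a measurement: it assumes the service handles requests strictly one at a time at about 50 ms each (the upper end of the inferenceMs band above) and, as a worst case, that every frame on every camera yields one crop.

```python
# Capacity check for a single-image OCR service.
# Assumptions (not measurements): strictly sequential service, ~50 ms per
# crop, and a worst case where every frame produces one crop.

def service_capacity_per_s(inference_ms: float) -> float:
    """Crops per second a strictly sequential service can drain."""
    return 1000.0 / inference_ms


def worst_case_demand_per_s(cameras: int, fps: float) -> float:
    """Crops per second if every frame on every camera yields a plate."""
    return cameras * fps


def burst_drain_ms(burst_size: int, inference_ms: float) -> float:
    """Time until the last crop of a simultaneous burst has been processed."""
    return burst_size * inference_ms


capacity = service_capacity_per_s(50.0)        # 20 crops/s
demand = worst_case_demand_per_s(30, 25.0)     # 750 crops/s
print(f"capacity {capacity:.0f} crops/s vs worst-case demand {demand:.0f} crops/s")
print(f"20-camera burst drains in {burst_drain_ms(20, 50.0):.0f} ms")
```

Real traffic is far below the worst case because most frames contain no plate, but the burst figure shows why per-request latency can stay in milliseconds while end-to-end waits stretch toward a second.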

What Worked, What Didn't

On a small fleet, the design had genuine strengths: per-camera fault isolation, simple onboarding (one line in plate_cams.txt), accurate OCR on single crops, and the right idea of cropping before OCR:

OCR_URL = "http://ocr-service.internal/ocr"
session.post(OCR_URL, files={"image": ("plate.jpg", jpeg_bytes, "image/jpeg")}, data={"lang": "en"})

At 20 cameras on one datacenter GPU, four structural limits collided:

Limit                What it meant
─────                ─────────────
Isolated pipelines   20 processes × batch-1 YOLO engines in VRAM, no cross-camera batching
PCIe round-trip      Full frame GPU→CPU every probe tick, crop back to GPU for OCR
Monolithic OCR       PaddleOCR.ocr() det+cls+rec, no GPU batching
HTTP hot path        One crop per request into one PaddleOCR instance sharing GPU with 20 YOLO pipelines

The service kept running. GPU utilization hit ~94%. But there was no headroom for burst traffic, new models, or fleet expansion.

Where We Headed Next

The rebuild would keep what worked — YOLO plate detection, crop-then-OCR, Docker deployment — and fix what blocked scale:

  1. Replace per-camera processes with true multistream batching (nvstreammux).
  2. Keep tensors on the GPU through crop and into OCR.
  3. Upgrade OCR to Paddle 3.x with real GPU batching.
  4. Replace monolithic PaddleOCR.ocr() with a modular det→rec pipeline that still localizes text inside each YOLO crop.

2. The Rebuild

Four changes addressed the four limits from Part 1. Each maps to a concrete code path.

Part 1 limit               Fix                                             Primary code
────────────               ───                                             ────────────
Isolated pipelines         Single multistream DeepStream service           plate_multistream_service.py
PCIe round-trip            Lazy ROI crop on unified GPU buffers            plate_ws_service.py (shared helpers)
Monolithic OCR, no batch   Paddle 3.x modular det→rec + GPU batch          ocr-gpu-v3 (ocr_service.py)
HTTP hot path              Client-side batch collector + POST /ocr/batch   plate_multistream_service.py + OCR service

The entrypoint changed from a process supervisor to a single multistream daemon:

# Before
python3 plate_supervisor.py -f plate_cams.txt    # spawns N × plate_ws_service.py

# After
python3 plate_multistream_service.py -f plate_cams.txt

plate_supervisor.py still exists for per-camera deployments, but production scale runs one OS process with one shared DeepStream pipeline.

One Pipeline, N Cameras

plate_multistream_service.py replaces the fleet of isolated workers. All enabled cameras connect to one nvstreammux whose batch-size equals the active camera count. A single nvinfer instance loads one TensorRT engine and processes every stream in one batched kernel call.

plate_cams.txt (N cameras)
plate_multistream_service.py  (single OS process)
        ├── nvurisrcbin × N  ──►  nvstreammux (batch-size = N)
        │                              │
        │                              ▼
        │                    YOLO TensorRT (one engine, batch = N)
        │                              │
        │                              ▼
        │                    shared probe → OCR queue
        ├── WebSocket + /manual-ocr  (cam 1, port from registry)
        ├── WebSocket + /manual-ocr  (cam 2)
        └── ...
              ocr-gpu-v3  (batched Paddle GPU)

At startup, _ensure_pgie_batch_size() rewrites the PGIE config to match the live camera count and points nvinfer at the correct engine file:

batch_size = len(self.pipeline_cameras)
streammux.set_property("batch-size", batch_size)
self.pgie_config_path = _ensure_pgie_batch_size(PGIE_CONFIG_PATH, batch_size)
# → model_b{batch_size}_gpu0_fp16.engine
pgie.set_property("config-file-path", self.pgie_config_path)

For multistream batching to work, the YOLO ONNX must be exported at the target batch dimension (yolo export batch=N dynamic=False). The old batch-size=1 engine cannot be reused — TensorRT compiles maxBatchSize at build time.
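
The article does not show the body of _ensure_pgie_batch_size(), so the following is only a minimal sketch of what such a helper could look like. It assumes the config holds exactly one batch-size key and one model-engine-file key, and that the engine name follows the model_b{N}_gpu0_fp16.engine pattern from the comment above. The function names and the per-batch output file name are hypothetical.

```python
import re
from pathlib import Path


def engine_name_for(batch_size: int) -> str:
    # Naming pattern taken from the comment in the article's snippet.
    return f"model_b{batch_size}_gpu0_fp16.engine"


def rewrite_pgie_config(text: str, batch_size: int) -> str:
    """Return config text with batch-size and model-engine-file updated."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    text, n_batch = re.subn(
        r"(?m)^batch-size\s*=.*$", f"batch-size={batch_size}", text
    )
    text, n_engine = re.subn(
        r"(?m)^model-engine-file\s*=.*$",
        f"model-engine-file={engine_name_for(batch_size)}",
        text,
    )
    if n_batch != 1 or n_engine != 1:
        raise ValueError("expected one batch-size and one model-engine-file key")
    return text


def ensure_pgie_batch_size(config_path: str, batch_size: int) -> str:
    """Write a per-batch-size copy of the config and return its path.

    Hypothetical behaviour: the original file is left untouched and a
    sibling file such as config_b20.txt is written next to it.
    """
    src = Path(config_path)
    dst = src.with_name(f"{src.stem}_b{batch_size}{src.suffix}")
    dst.write_text(rewrite_pgie_config(src.read_text(), batch_size))
    return str(dst)
```

Writing a copy rather than editing in place keeps the batch-1 config available for the per-camera deployment mode, which is one reasonable design choice among several.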

What this buys: instead of 20 OS processes, 20 CUDA contexts, and 20 engine copies in VRAM, the fleet shares one inference graph. Kernel launch overhead is paid once per batch interval, not once per camera per frame. GPU utilization drops because the hardware processes meaningful batch work instead of twenty parallel batch=1 threads.
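
To put a rough number on the launch overhead, the sketch below counts engine invocations per second. It assumes 25 FPS per camera (the frame rate used in the fleet table in Part 1) and treats one nvinfer call as one invocation regardless of how many CUDA kernels it launches internally, which is a simplification.

```python
def engine_invocations_per_s(cameras: int, fps: float, batch_size: int) -> float:
    """Inference calls per second when frames are grouped batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return cameras * fps / batch_size


per_camera = engine_invocations_per_s(cameras=20, fps=25.0, batch_size=1)
shared = engine_invocations_per_s(cameras=20, fps=25.0, batch_size=20)
print(f"per-camera engines: {per_camera:.0f} calls/s, shared engine: {shared:.0f} calls/s")
```

Under those assumptions the per-camera design issues 500 single-frame calls per second across 20 engines, while the shared engine issues 25 calls of 20 frames each.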

Per-camera WebSocket and /manual-ocr endpoints remain on their registry ports — the manual OCR gateway needs no changes. Each camera gets a CameraRuntime object with isolated state, but they all feed the same GStreamer pipeline.

RTSP sources use nvurisrcbin with automatic reconnect. On first deploy, DEFER_RTSP_UNTIL_PGIE can delay RTSP attachment until the TensorRT engine finishes building — avoiding a long stall where nvstreammux waits for streams that never arrive during engine compilation.

Lazy ROI: PCIe-Friendly Cropping

The old probe copied the full 1920×1080 frame to CPU on every tick. The rebuild moved cropping logic into shared helpers in plate_ws_service.py and made three design choices:

1. CUDA unified memory. nvstreammux, nvvideoconvert, and the decoder use NVBUF_MEM_CUDA_UNIFIED so Python can map GPU buffers without a full PCIe copy:

def map_frame_rgba(gst_buffer, batch_id: int) -> Optional[np.ndarray]:
    n_frame = pyds.get_nvds_buf_surface(hash(gst_buffer), batch_id)
    return np.asarray(n_frame)          # view, not .copy()

2. Metadata-first detection scan. scan_plate_objects() walks NvDsObjectMeta from the YOLO output — no GPU buffer access needed to know where plates are.

3. Lazy frame access. The probe only maps the GPU buffer when a frame is actually needed:

def frame_access_needed(*, first_frame_pending, manual_requested, ocr_candidates) -> bool:
    return first_frame_pending or manual_requested or bool(ocr_candidates)

When a plate is detected, only the ROI is converted — not the full frame:

roi = frame_rgba[py:py2, px:px2]
crop = cv2.cvtColor(np.ascontiguousarray(roi), cv2.COLOR_RGBA2BGR)
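
The slice bounds px, py, px2, py2 come from the detector's bbox, which is reported as floating-point left/top/width/height and can extend past the frame edge. A small helper can turn that into safe integer bounds before slicing. This is a sketch under that assumption; the function name and the pad parameter are illustrative and not taken from the production code.

```python
import math
from typing import Optional, Tuple


def roi_slice_bounds(
    left: float, top: float, width: float, height: float,
    frame_w: int, frame_h: int, pad: int = 0,
) -> Optional[Tuple[int, int, int, int]]:
    """Convert a float bbox to integer bounds clamped to the frame.

    Returns (px, py, px2, py2) for use as frame[py:py2, px:px2],
    or None when the clamped box is empty.
    """
    px = max(0, math.floor(left) - pad)
    py = max(0, math.floor(top) - pad)
    px2 = min(frame_w, math.ceil(left + width) + pad)
    py2 = min(frame_h, math.ceil(top + height) + pad)
    if px2 <= px or py2 <= py:
        return None
    return px, py, px2, py2


print(roi_slice_bounds(100.4, 50.6, 200.0, 60.0, 1920, 1080))
print(roi_slice_bounds(1900, 1060, 200, 60, 1920, 1080, pad=4))
```

Clamping matters because a negative start index in a NumPy slice counts from the end of the array, so an unclamped bbox near the frame edge would silently produce the wrong region instead of raising an error.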

What this buys: on frames with no plate detection, the probe never maps the GPU buffer. On detection frames, PCIe traffic scales with crop size (~200×60 pixels) instead of full-frame resolution (~2 megapixels). By pixel count a 200×60 crop is about 170× smaller than a full frame; with bbox padding and larger plates the saving shrinks, so roughly 30–50× fewer bytes moved per detection event is a conservative figure.
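
The byte counts behind that comparison are easy to reproduce. The sketch below assumes 4-byte RGBA pixels, as in the probe code, and treats the crop sizes as illustrative: 200×60 is the typical crop mentioned above, while 400×120 is a hypothetical padded crop chosen to show how quickly the ratio falls as crops grow.

```python
BYTES_PER_RGBA_PIXEL = 4


def rgba_bytes(width: int, height: int) -> int:
    """Size in bytes of an uncompressed RGBA image."""
    return width * height * BYTES_PER_RGBA_PIXEL


def reduction_factor(frame: tuple, crop: tuple) -> float:
    """How many times fewer bytes the crop needs than the full frame."""
    return rgba_bytes(*frame) / rgba_bytes(*crop)


full_frame = (1920, 1080)
print(rgba_bytes(*full_frame))                      # 8294400 bytes per frame
print(rgba_bytes(200, 60))                          # 48000 bytes per crop
print(reduction_factor(full_frame, (200, 60)))      # about 172.8
print(reduction_factor(full_frame, (400, 120)))     # about 43.2
```

A tight 200×60 crop is roughly 170× smaller than the frame, while a crop around 400×120 lands in the 30–50× range.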

Optional full-frame snapshots for audit (SAVE_FULL_FRAME_ON_AUTO) still convert the whole image, but only when explicitly enabled and only at OCR enqueue time — not on every frame.

OCR Upgrade: Paddle 2.6 → 3.3 (CUDA 13)

OCR was the second half of the GPU contention problem — and it required more than a config change.

The old ocr-gpu-service ran on paddlecloud/paddleocr:2.6-gpu with monolithic PaddleOCR:

PaddleOCR(use_gpu=True, use_angle_cls=True, ...)
ocr.ocr(tmp.name, cls=True)   # det → cls → rec, one image at a time

That API had two hard limits:

Limit                  Impact
─────                  ──────
No real GPU batching   ocr.ocr() processes one image per call: no /ocr/batch, no dynamic batch queue
Monolithic pipeline    det, angle classification, and recognition bundled together; stages cannot be batched independently

There is no ready-made “PaddleOCR 3.x + batch + GPU” image on Docker Hub. The official paddlepaddle/paddle images ship the framework only; the OCR service layer is custom. We moved to:

paddlepaddle/paddle:3.3.1-gpu-cuda13.0-cudnn9.13

Paddle 3.x exposes separate GPU modules with native batch_size support:

TextDetection(model_name="PP-OCRv4_mobile_det", device="gpu:0")
TextRecognition(model_name="en_PP-OCRv4_mobile_rec", device="gpu:0")

Model inference runs on GPU (device="gpu:0"). JPEG decode and CTC string decode stay on CPU — that is normal and lightweight compared to the neural network forward pass.

Why Not Rec-Only?

The first ocr_service.py prototype sent YOLO crops directly to TextRecognition — skipping detection inside OCR entirely. The logic seemed sound: YOLO already found the plate, so why run det again?

Production testing showed otherwise. YOLO bbox and OCR text box are not the same thing:

DeepStream / YOLO  →  plate bounding box on full frame
TextDetection      →  text line box inside the crop

Test image                                Old ocr-gpu-service (det+cls+rec)   v3 rec-only
──────────                                ─────────────────────────────────   ───────────
plate_full_frame.jpg (full plate frame)   27.ZE 822 (conf 0.88)               2Z 2 (conf 0.34)
plate_tight_crop.jpg (tight crop)         good                                L27 ZE 822J (artifacts at edges)

Rec-only was fast (~3 ms/image in batch tests) but fragile. YOLO crops often still contain TR strip, frame text, bolt holes, or padding — the recognition model reads the entire bitmap as one line and hallucinates characters.

Sending crops straight to recognition also made us dependent on perfect manual-tight crops. That is risky in a live RTSP pipeline where bbox padding, motion blur, and angle vary per camera.

The production decision: keep TextDetection inside OCR to locate the text region within each YOLO crop, then run TextRecognition on the refined sub-crop. Angle classification (cls) — redundant once the text box is isolated — was dropped. The pipeline became:

YOLO crop → TextDetection batch → largest text poly → sub-crop → TextRecognition batch → plate text

Engine tag in responses: "engine": "modular-det-rec-batch".
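
The "largest text poly → sub-crop" step can be sketched in plain Python. The version below assumes the detector returns each text region as a list of (x, y) corner points, as text detectors commonly do, and picks the polygon with the greatest area. The function names are illustrative, not the names used in ocr_service.py.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(poly: Sequence[Point]) -> float:
    """Absolute polygon area via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(poly):
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def largest_text_box(
    polys: Sequence[Sequence[Point]],
) -> Optional[Tuple[int, int, int, int]]:
    """Axis-aligned (x1, y1, x2, y2) box around the largest polygon, or None."""
    if not polys:
        return None
    best = max(polys, key=polygon_area)
    xs = [p[0] for p in best]
    ys = [p[1] for p in best]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))


# Hypothetical detections inside one crop: a small strip and the plate text.
polys = [
    [(10, 5), (60, 5), (60, 15), (10, 15)],
    [(20, 20), (180, 20), (180, 55), (20, 55)],
]
print(largest_text_box(polys))
```

What to do when the detector finds no text at all (skip the crop, or fall back to recognising the whole crop) is a policy decision the article does not specify, so the sketch simply returns None.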

Production accuracy matched the old service. Timings below are warmed single requests — the first call after container restart can spike to ~80–90 ms while models load into GPU memory:

Image                  New v3 (modular-det-rec-batch)   Old ocr-gpu-service
─────                  ──────────────────────────────   ───────────────────
plate_sample_2.jpg     80 ACV730, 25 ms                 80 ACV730, 50 ms
plate_full_frame.jpg   27ZE822, ~24 ms                  27.ZE 822, 46 ms
plate_tight_crop.jpg   L27 ZE 822J, ~23 ms

On warmed requests v3 was roughly 2× faster than the old service on single crops (~23–25 ms vs ~50 ms). The larger win is fleet throughput when batching kicks in.

Batched OCR: Client and Server

The modular det→rec pipeline runs as two separate GPU batch kernels — not one fused tensor, but both stages batch across N images:

def _run_modular_det_rec_batch(images, langs):
    n = len(images)  # one GPU batch spanning every crop in the request
    det_out = _get_det().predict(input=images, batch_size=n)
    # per image: pick largest detection poly → sub-crop (yields `crops`)
    rec_out = _get_rec().predict(input=crops, batch_size=n)

Two endpoints serve the pipeline:

Endpoint          Role
────────          ────
POST /ocr         Single crop, auto-batched by server-side OcrBatcher
POST /ocr/batch   Explicit multi-crop batch from the DeepStream client

The server-side OcrBatcher collects concurrent requests within OCR_BATCH_WAIT_MS (default 30 ms) and flushes them as one GPU batch. On the DeepStream side, _ocr_batch_collector gathers crops from the shared OCR queue within OCR_BATCH_COLLECT_MS (default 15 ms) and posts via call_ocr_service_batch():

def _flush_ocr_batch(self, items):
    crops = [pws._preprocess_for_ocr(it["image"], it.get("source")) for it in items]
    ocrs, err = pws.call_ocr_service_batch(crops)

Manual OCR requests bypass the collector and go straight to POST /ocr — interactive latency stays low.
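
Both collectors follow the same time-window pattern: block until the first item arrives, then keep gathering until either the window closes or the batch is full. The sketch below shows that pattern with the standard library only. It is not the production OcrBatcher or _ocr_batch_collector; the function name and the max_items cap are assumptions.

```python
import queue
import time
from typing import Any, List


def collect_batch(q: "queue.Queue[Any]", max_items: int, wait_ms: float) -> List[Any]:
    """Block for the first item, then gather more until the window closes.

    Returns between 1 and max_items items. The window starts when the first
    item arrives, so a lone crop waits at most wait_ms before being flushed.
    """
    items = [q.get()]  # block until there is work
    deadline = time.monotonic() + wait_ms / 1000.0
    while len(items) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            items.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return items


# Five crops arrive at once; the collector flushes them in two batches.
q: "queue.Queue[int]" = queue.Queue()
for crop_id in range(5):
    q.put(crop_id)
first = collect_batch(q, max_items=3, wait_ms=15)
second = collect_batch(q, max_items=10, wait_ms=15)
print(first, second)
```

The window length is a latency/throughput trade-off: a longer wait yields fuller batches but adds up to wait_ms to every result, which is why interactive manual requests are better served by skipping the collector.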

Batch benchmarks on plate_tight_crop.jpg (N=30):

Mode               Total time   Per image   Speedup
────               ──────────   ─────────   ───────
30× batch_size=1   222 ms       7.4 ms      baseline
batch_size=30      94 ms        3.1 ms      2.4×

When twenty cameras fire crops in the same frame interval, the old path queued twenty sequential ~50 ms inferences (~1 s wall time). The new path groups them into one or two GPU batches — amortizing kernel launch across the fleet.

What changed end-to-end:

Metric                 Old (ocr-gpu-service)           New (ocr-gpu-v3)
──────                 ─────────────────────           ────────────────
Base image             paddlecloud/paddleocr:2.6-gpu   paddlepaddle/paddle:3.3.1-gpu-cuda13.0-cudnn9.13
Inference API          Monolithic PaddleOCR.ocr()      Modular TextDetection + TextRecognition
Per-request pipeline   det + cls + rec (sequential)    det + rec (two batched GPU stages)
GPU batching           None                            predict(batch_size=N) + OcrBatcher + /ocr/batch
Upload path            Tempfile on disk                In-memory decode
Response fields        inferenceMs                     inferenceMs, batchSize, batchInferenceMs, engine
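
Because the new response fields are additions rather than replacements, a consumer can read both generations of response with one parser. The sketch below uses the field names listed in this article (mergedText, inferenceMs, lines, batchSize, batchInferenceMs, engine); the OcrResult class and its attribute names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class OcrResult:
    merged_text: str
    inference_ms: float
    lines: List[Any]
    batch_size: Optional[int] = None            # present in v3 responses only
    batch_inference_ms: Optional[float] = None  # present in v3 responses only
    engine: Optional[str] = None                # present in v3 responses only


def parse_ocr_response(body: str) -> OcrResult:
    """Parse an OCR service response, tolerating the absence of v3 fields."""
    data = json.loads(body)
    return OcrResult(
        merged_text=data["mergedText"],
        inference_ms=float(data["inferenceMs"]),
        lines=list(data.get("lines", [])),
        batch_size=data.get("batchSize"),
        batch_inference_ms=data.get("batchInferenceMs"),
        engine=data.get("engine"),
    )


# The sample response from Part 1 parses without any of the v3 fields.
legacy = parse_ocr_response('{ "mergedText": "27.ZE 822", "inferenceMs": 53.9 }')
print(legacy.merged_text, legacy.inference_ms, legacy.engine)
```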

Operational Compatibility

The rebuild preserved the interfaces operators already relied on:

  • Camera registry: same plate_cams.txt format (id|name|rtsp|port), with optional per-camera disable flag.
  • Per-camera ports: WebSocket overlay and /manual-ocr stay on registry ports; gateway routing unchanged.
  • OCR response shape: mergedText, inferenceMs, lines[], so downstream consumers parse results the same way.
  • Shared module: plate_multistream_service.py imports plate_ws_service as pws for plate normalization, OCR calls, persistence, and crop helpers. Bug fixes in the shared module benefit both deployment modes.
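
For readers who want to see the registry format in action, here is a minimal parser sketch for id|name|rtsp|port lines. The article does not say how the disable flag is written, so the sketch assumes an optional fifth field with the value "disabled", and also assumes that blank lines and lines starting with # are ignored. Both are assumptions, as are the class and function names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Camera:
    cam_id: str
    name: str
    rtsp: str
    port: int
    enabled: bool = True


def parse_registry(text: str) -> List[Camera]:
    """Parse id|name|rtsp|port[|flag] lines into Camera records."""
    cameras: List[Camera] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are skipped
        parts = [p.strip() for p in line.split("|")]
        if len(parts) not in (4, 5):
            raise ValueError(f"line {lineno}: expected id|name|rtsp|port[|flag]")
        cam_id, name, rtsp, port = parts[:4]
        # Assumption: a fifth field equal to "disabled" turns the camera off.
        enabled = not (len(parts) == 5 and parts[4].lower() == "disabled")
        cameras.append(Camera(cam_id, name, rtsp, int(port), enabled))
    return cameras


# Hypothetical registry content, for illustration only.
sample = """# gates
1|North gate|rtsp://10.0.0.11/stream1|9001
2|South gate|rtsp://10.0.0.12/stream1|9002|disabled
"""
active = [c for c in parse_registry(sample) if c.enabled]
print(len(active), active[0].name)
```

The count of enabled cameras is what would feed the nvstreammux batch-size in the multistream service.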

Summary: Old vs New

                        Old (20 cameras)              New (20 cameras)
                        ────────────────              ───────────────
Processes               20 × plate_ws_service         1 × plate_multistream
DeepStream pipelines    20                            1
YOLO TRT engines        20 × batch=1                  1 × batch=20
nvstreammux             batch=1 per process           batch=N shared
GPU frame copy          Full 1920×1080 every frame    ROI only, lazy access
OCR service             ocr-gpu-service (Paddle 2.6)  ocr-gpu-v3 (Paddle 3.3, det→rec batch)
OCR requests            1 crop → 1 HTTP               N crops → 1 batch HTTP
GPU utilization         ~94%, no headroom             Headroom for burst + growth

The detection architecture from Part 1's target diagram is now what runs in production:

Batched YOLO pipeline: N camera streams feed a single dynamic batch-sized TensorRT engine

All N streams share one TensorRT engine. Crops leave the GPU path as small ROIs. OCR drains them in GPU batches instead of one HTTP call per plate.


3. Conclusion

The rebuild delivered what the initial architecture could not provide together at fleet scale: easy operations, real scalability, and efficient GPU use across a multistream LPR fleet.

Before this work, the site also relied on traditional CPU-based OCR elsewhere in the stack — slower inference, weaker accuracy on plate crops, and no path to batch twenty concurrent camera events. That approach was adequate for ad-hoc single-image reads; it was not a foundation for industrial video analytics. The new system replaces that model with a GPU-native pipeline: batched YOLO detection in DeepStream, PCIe-friendly ROI cropping, and Paddle 3.x OCR with modular det→rec batching.

The operational gains are concrete:

  • One process, one pipeline — camera onboarding stays a single line in plate_cams.txt; no per-camera process sprawl.
  • Headroom on the GPU — utilization drops from ~94% pinned to a level that tolerates burst traffic and fleet growth.
  • Throughput at scale — OCR crops batch across cameras instead of queueing as isolated HTTP posts.
  • Accuracy retained — TextDetection inside OCR compensates for imperfect YOLO crops; results match or exceed the legacy GPU service on warmed requests.

Management is simpler, the architecture scales with camera count, and GPU memory and PCIe bandwidth are spent on work that matters — not on twenty redundant TensorRT engines and full-frame CPU copies.

Licence plate recognition pipeline dashboard

4. Future Improvements

The current design fits the LPR hot path: JPEG crops over HTTP into a dedicated Paddle GPU service with client- and server-side batching. If the same OCR stack must also serve other workloads — document scans, ad-hoc photo uploads, batch archive reprocessing — a further step is worth considering: NVIDIA Triton Inference Server with dynamic batching and gRPC tensor transport.

Today, even with ROI cropping on the DeepStream side, crops still leave the video pipeline as encoded images and re-enter OCR through HTTP multipart upload. That means JPEG encode on the client, decode on the server, and CPU memory in between. Triton opens a different path:

DeepStream probe → GPU tensor (crop batch)
gRPC → Triton (ppocr_det + ppocr_rec, dynamic batch)
Text results — no full photo round-trip through CPU RAM

Potential benefits:

Area               HTTP OCR (current)             Triton gRPC (future)
────               ──────────────────             ────────────────────
Transport          JPEG bytes per crop            Raw or NVMM tensor batch
Batching           Application-level collectors   Server-side dynamic batching (preferred batch sizes, queue delay)
Multi-tenant OCR   Shared Paddle container        Model repository per stage (det / rec), independent scaling
PCIe / CPU         Encode → decode per crop       Tensors stay closer to GPU memory

This is not required for the LPR fleet that motivated this article — the multistream DeepStream + Paddle 3.x batch service already solved the production bottleneck. Triton becomes relevant when OCR must become a shared inference platform across LPR and other image pipelines, with maximum throughput and minimum CPU involvement in the data plane.

A practical migration path would keep the current HTTP service for backward compatibility while introducing a Triton backend for high-volume batch clients — det and rec as separate ONNX or TensorRT models, dynamic batching on both, and gRPC clients that send preprocessed tensor batches instead of photographs.