How to Scale and Distribute Multiple High-Resolution Camera Streams Without CPU Bottlenecks
A practical, reproducible guide to building a 1-to-many GPU-accelerated video pipeline on Windows using GStreamer, CUDA-backed scaling, and MediaMTX.
High-resolution video pipelines are now a core part of modern computer-vision systems. The challenge is not only inference accuracy, but also efficient video handling across multiple consumers.
A single camera stream often needs to power several workloads at once:
- full-resolution archival recording and precision image-processing workflows (for example, camera-based dimensional measurement),
- lower-resolution real-time dashboards,
- model-friendly inputs for analytics pipelines.
If each resized stream is generated in software on the CPU, performance degrades quickly. Frame copies increase memory pressure, PCIe transfers become expensive, and latency rises. The result is unstable throughput at exactly the point where systems should scale.
This article explains a practical architecture for 1-to-many video distribution using GPU-accelerated scaling and GStreamer on Windows, including protocol choices, build prerequisites, and deployment patterns.
1) The Core Problem: One Input, Many Outputs
Consider a high-resolution H.265 RTSP source stream. Typical downstream requirements include:
- Original resolution for storage, audit trails, and precision measurement tasks,
- 1080p for web viewers,
- Square model input (such as 1024x1024) for AI processing.
A naive implementation launches separate decode/scale pipelines on the CPU for each output. That multiplies cost and usually introduces one or more bottlenecks:
- repeated frame transfer through system memory,
- CPU-bound scaling operations,
- inconsistent frame pacing under load.
A better design decodes and scales on the GPU while keeping frames in GPU memory end-to-end whenever possible.
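The decode-once, fan-out-many idea can be sketched as a small helper that assembles a GStreamer description around a tee element. This is an illustrative sketch, not the article's production pipeline: the helper name and the branch sink strings are hypothetical, and the element names (nvh265dec, cudascale) assume the NVIDIA nvcodec plugin discussed later.

```python
# Sketch: one decode, many GPU-scaled branches via tee.
# Each consumer gets its own queue + cudascale branch, so frames are
# decoded once and stay in CUDA memory until a branch needs them elsewhere.

def build_fanout_pipeline(source_uri: str, outputs: list[tuple[int, int, str]]) -> str:
    """outputs: list of (width, height, sink_description) per consumer."""
    head = (
        f'rtspsrc location="{source_uri}" protocols=tcp ! '
        "rtph265depay ! h265parse ! nvh265dec ! tee name=t"
    )
    branches = []
    for width, height, sink in outputs:
        branches.append(
            "t. ! queue ! cudascale ! "
            f"video/x-raw(memory:CUDAMemory),width={width},height={height} ! {sink}"
        )
    return " ".join([head] + branches)

pipeline = build_fanout_pipeline(
    "rtsp://USER:PASS@10.x.x.x:554/live",
    [
        (1920, 1080, "nvh264enc ! h264parse ! fakesink"),  # viewer branch
        (1024, 1024, "cudadownload ! fakesink"),           # model-input branch
    ],
)
```

The queue after each tee branch matters: without it, one slow consumer can stall the shared decode.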
2) Distribution Strategy: Protocols by Role
Before implementation details, define protocol responsibilities clearly.
SRT (Secure Reliable Transport)
Use SRT for long-distance, lossy-network contribution links. It combines low-latency UDP transport with packet recovery and encryption, making it suitable for facility-to-datacenter ingest of raw, high-resolution feeds.
HLS (HTTP Live Streaming)
Developed by Apple, HLS offers broad playback compatibility and firewall-friendly delivery over standard HTTP infrastructure. It is reliable and widely supported across mobile and desktop, though chunked segmentation adds latency.
MPEG-DASH
An open-standard alternative to HLS. Use DASH where open-source adaptive streaming is preferred and the client base (such as Android devices or Smart TVs) supports it. It is flexible, but browser/device support should be validated per deployment.
WebRTC for ultra-low latency
For operator-facing live dashboards, WebRTC is often the best choice when sub-500ms latency is required. HLS can remain a fallback for broader device support.
3) Why GPU Memory Residency Matters
In GStreamer, the major optimization is to avoid unnecessary CPU-side frame movement. CUDA-enabled elements can pass frames as GPU-backed buffers (using the memory:CUDAMemory flag) through decode, scale, and encode stages.
That reduces:
- CPU utilization (often to near-zero for the video path),
- PCIe transfer overhead,
- frame jitter under concurrent loads.
When this path is fully enabled, scaling becomes a GPU-native operation rather than a CPU bottleneck, leaving maximum computational headroom for background AI tasks.
cudascale at a glance (official behavior)
According to the official GStreamer docs, cudascale is:
- a CUDA-based video resize element,
- part of the nvcodec plugin,
- shipped under GStreamer Bad Plug-ins.
Its sink and source pads both operate on video/x-raw(memory:CUDAMemory), which is exactly why it is central to zero-copy GPU pipelines.
Official minimal example:
gst-launch-1.0 videotestsrc ! video/x-raw,width=640,height=480 ! cudaupload ! cudascale ! cudadownload ! video/x-raw,width=1280,height=720 ! fakesink
This demonstrates the memory flow clearly: upload frame to CUDA memory, scale in CUDA, optionally download back to system memory only when needed.
4) Windows Build Prerequisites for GPU Scaling
Depending on your environment, required CUDA-backed plugins (like cudascale) may not be available by default in the official binaries and must be built from source.
This point is critical: in many Windows installations, cudascale is not available out of the box even if GStreamer itself is installed correctly. In practice, teams often discover this only after running:
gst-inspect-1.0 cudascale
and seeing that the element is missing.
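A lightweight preflight check at service start makes this failure mode explicit instead of a mid-pipeline crash. A minimal sketch, assuming only that gst-inspect-1.0 is on PATH (the helper name and the inspect_bin parameter are illustrative):

```python
# Sketch: verify a GStreamer element exists before starting any pipeline.
# gst-inspect-1.0 exits non-zero when the element is unknown, so the
# return code alone is enough for a pass/fail preflight.

import shutil
import subprocess

def element_available(element: str, inspect_bin: str = "gst-inspect-1.0") -> bool:
    path = shutil.which(inspect_bin)
    if path is None:
        return False  # the inspector itself is missing from PATH
    result = subprocess.run(
        [path, element],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0  # non-zero means the element was not found
```

Calling element_available("cudascale") at startup lets a service fail fast with a clear log line rather than a cryptic not-linked error later.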
Why? Because cudascale is tied to the nvcodec plugin stack, which lives in GStreamer Bad Plug-ins (gst-plugins-bad), not in the always-shipped core set.
What is GStreamer Bad Plug-ins?
gst-plugins-bad is an official GStreamer module that contains plugins that are not yet considered at the same maturity level as the "good/ugly/base" sets. A plugin can be in this module for several reasons: limited review, incomplete tests, missing documentation, narrower production usage, or limited maintainer bandwidth.
Important clarification: "bad" does not mean unusable. Many production systems rely on elements from this module. It simply indicates maturity/support status, which is exactly why some binaries do not expose every feature by default and why custom builds are sometimes required.
To build the custom gstnvcodec.dll, you need the following environment on your Windows Server:
- Visual Studio Build Tools: Ensure the "Desktop development with C++" workload is installed.
- Python 3.x: Installed and added to your system PATH.
- Meson and Ninja: Build systems. Install via command prompt: pip install meson ninja
- WinFlexBison: Download the Windows port, extract to C:\winflexbison, and add to your PATH.
- GStreamer Runtime and Development: Install both the runtime and devel .msi packages for MSVC x64 (for example, version 1.28.1). Default path is C:\Program Files\gstreamer.
- GStreamer Bad Plugins Source: Download the matching source code tarball (for example, gst-plugins-bad-1.28.1) and extract it to a working directory like C:\build\gst-plugins-bad.
Version alignment is critical: keep runtime, devel, and source on the exact same release train to avoid ABI mismatches.
5) Compiling gstnvcodec.dll on Windows
The goal is to build GStreamer "bad" plugins with nvcodec enabled so GPU scaling and codec elements are available to your pipelines.
Step A: Avoid path-space toolchain issues
Linux-based build tools like pkg-config behave poorly with spaces in paths (like Program Files). Create a directory junction to bypass this:
mklink /J C:\gstreamer "C:\Program Files\gstreamer"
set PATH=C:\gstreamer\1.0\msvc_x86_64\bin;C:\winflexbison;%PATH%
set PKG_CONFIG=C:\gstreamer\1.0\msvc_x86_64\bin\pkg-config.exe
Step B: Fix script interpreter paths
In standard binary distributions, helper scripts often reference invalid build-time Python locations (like a developer's cerbero.git folder). Open glib-mkenums and glib-genmarshal (located in the \bin directory) and ensure the top shebang line points to your local Python installation:
#!C:\Users\Administrator\AppData\Local\Programs\Python\Python311\python.exe
Step C: Configure build with focused options
Open the x64 Native Tools Command Prompt for VS. Navigate to your source folder and run Meson. Enable nvcodec and disable optional modules (docs/tests/webrtcdsp) to reduce dependency friction and compile strictly what is needed:
cd C:\build\gst-plugins-bad
meson setup builddir -Dnvcodec=enabled -Dwebrtcdsp=disabled -Dtests=disabled -Dexamples=disabled -Dintrospection=disabled -Ddoc=disabled
Step D: Build and deploy
Compile with Ninja, copy the resulting DLL into the local GStreamer plugin directory, clear the registry cache, and verify plugin visibility:
ninja -C builddir
copy builddir\sys\nvcodec\gstnvcodec.dll "C:\gstreamer\1.0\msvc_x86_64\lib\gstreamer-1.0\" /Y
del "%LOCALAPPDATA%\gstreamer-1.0\registry.x86_64.bin" /Q
"C:\gstreamer\1.0\msvc_x86_64\bin\gst-inspect-1.0.exe" cudascale
If the scaler element appears in the inspection output, the GPU scaling path is active.
6) Runtime Architecture: 1-to-Many Pipeline
A practical production layout is dual-track: a media server serves human viewers, while a Python engine feeds machine consumers.
Track 1: Distribution Hub (MediaMTX)
A media server process ingests the RTSP feed, selectively demuxes the video to avoid audio/metadata crashes, decodes, scales to 1080p in VRAM, re-encodes, and republishes for WebRTC browser delivery.
Example mediamtx.yml path configuration:
paths:
camera_web_stream:
runOnInit: >
gst-launch-1.0.exe rtspsrc location="rtsp://USER:PASS@10.x.x.x:554/live" protocols=tcp latency=2000 name=demux
demux. ! application/x-rtp, media=video, encoding-name=H265 !
rtph265depay ! h265parse ! nvh265dec !
cudascale ! video/x-raw(memory:CUDAMemory), width=1920, height=1080 !
nvh264enc bitrate=5000 rc-mode=cbr bframes=0 gop-size=50 !
h264parse config-interval=-1 ! flvmux ! rtmp2sink location=rtmp://localhost:1935/camera_web_stream
runOnInitRestart: yes
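Once the path is configured, it is useful to confirm programmatically that MediaMTX has the republished stream live. The sketch below assumes MediaMTX's HTTP API is enabled (api: yes in mediamtx.yml); the port and the exact response shape of /v3/paths/list are assumptions to verify against your MediaMTX version's documentation.

```python
# Sketch: query the MediaMTX API and extract which paths report ready=true.
# The JSON parsing is separated from the HTTP call so it can be tested
# without a running server.

import json
from urllib.request import urlopen

def ready_paths(paths_json: str) -> set[str]:
    """Extract names of paths MediaMTX reports as ready from /v3/paths/list JSON."""
    data = json.loads(paths_json)
    return {item["name"] for item in data.get("items", []) if item.get("ready")}

def check_stream(api_url: str = "http://localhost:9997/v3/paths/list") -> set[str]:
    with urlopen(api_url) as resp:  # hypothetical local deployment URL
        return ready_paths(resp.read().decode("utf-8"))
```

A monitoring loop can then alert when "camera_web_stream" drops out of the ready set.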
Track 2: Analytics Path (Python and OpenCV)
A parallel pipeline consumes the same source stream, scales it strictly to model input dimensions (for example, 1024x1024) on the GPU, and feeds AI services. This preserves CPU resources for business logic and post-processing.
In addition, you can consume the HLS output in your Python service, compute analytics results (detections, measurements, quality flags), and publish those results over WebSocket in real time.
Example OpenCV VideoCapture pipeline:
import cv2

gst_pipeline = (
    'rtspsrc location="rtsp://USER:PASS@10.x.x.x:554/live" protocols=tcp latency=200 ! '
    'rtph265depay ! h265parse ! nvh265dec ! '
    'cudascale ! video/x-raw(memory:CUDAMemory), width=1024, height=1024 ! '
    'cudaconvert ! cudadownload ! video/x-raw, format=BGR ! '
    'appsink drop=true max-buffers=1 sync=false'
)
cap = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)
This pattern gives you a strong operational advantage: video remains easy to distribute (HLS for broad compatibility), while the decision layer stays live and interactive (WebSocket for low-latency events). In practice, this enables remote supervision, centralized tuning of thresholds/rules, and faster rollout of pipeline updates without interrupting stream delivery.
7) GStreamer Pipeline Design Notes
When building resilient pipelines, pay attention to these details:
- Select the video stream explicitly from RTSP sources (using name=demux demux. ! ...) to avoid hidden non-video tracks causing not-linked pipeline crashes.
- Tune latency and buffering based on network conditions and use case (operator UI vs archival).
- Choose encoder settings for your transport target (bitrate control, GOP sizing, B-frames).
- Keep formats explicit between stages (caps negotiation) to prevent implicit software conversions.
- Restart policies are essential for long-running services in industrial and edge environments.
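The restart-policy point above can be sketched as a small supervisor with capped exponential backoff. This is an illustrative sketch only: the supervise helper and its parameters are hypothetical, and in production it would wrap the gst-launch or Python analytics process.

```python
# Sketch: restart a long-running pipeline process with capped exponential
# backoff, so transient source interruptions do not become tight crash loops.

import itertools
import subprocess
import time

def backoff_schedule(base: float = 1.0, cap: float = 30.0):
    """Yield delays 1, 2, 4, ... seconds, capped so restarts never stall long."""
    for attempt in itertools.count():
        yield min(base * (2 ** attempt), cap)

def supervise(cmd: list[str], max_restarts: int = 5) -> None:
    delays = backoff_schedule()
    for _ in range(max_restarts):
        if subprocess.run(cmd).returncode == 0:
            return  # clean exit, no restart needed
        time.sleep(next(delays))
```

In industrial deployments, pairing this with the runOnInitRestart option shown earlier covers both the media-server and analytics tracks.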
8) Operational Checklist
Before calling the deployment production-ready, validate:
- CPU usage remains stable near 0-5% during peak stream fan-out,
- dropped-frame rate under sustained load,
- end-to-end latency per output protocol,
- plugin availability after reboot/updates,
- watchdog/restart behavior after source interruptions.
Also document exact tool versions and environment variables used at build time. This avoids "works-on-one-machine" failures later.
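Recording versions can itself be automated. A minimal sketch that captures the toolchain into a manifest; the tool list and the idea of persisting it next to the built DLL are illustrative choices, not a fixed convention:

```python
# Sketch: snapshot the exact toolchain used for the custom build, so later
# failures can be traced to version drift instead of guesswork.

import shutil
import subprocess

TOOLS = ["python", "meson", "ninja", "gst-launch-1.0"]

def tool_version(tool: str) -> str:
    path = shutil.which(tool)
    if path is None:
        return "not found"
    out = subprocess.run([path, "--version"], capture_output=True, text=True)
    lines = out.stdout.strip().splitlines()
    return lines[0] if lines else "unknown"

def build_manifest(tools: list[str]) -> dict:
    """Map each tool name to its reported version string."""
    return {tool: tool_version(tool) for tool in tools}

# Persist alongside gstnvcodec.dll, e.g. json.dump(build_manifest(TOOLS), fp)
```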
Conclusion
Scaling high-resolution video for multiple consumers does not have to overload CPU resources.
A GPU-first GStreamer architecture, combined with role-based protocol design and careful Windows build configuration, enables reliable 1-to-many distribution with lower latency and better throughput.
The key principle is simple: decode, scale, and encode where the data already is. When frames stay in GPU memory, the entire pipeline becomes more efficient, more stable, and easier to scale.