Computer Vision · Edge AI · TensorRT · CUDA

Optimizing YOLOv8 Inference on Edge Devices: 60 FPS under 15W

Deploying state-of-the-art computer vision models like YOLOv8 on edge devices presents a unique set of challenges. Unlike cloud deployments, where compute is virtually unbounded, edge AI requires strict adherence to SWaP (Size, Weight, and Power) constraints. In a recent project, my objective was to achieve >60 Frames Per Second (FPS) inference with multi-object tracking, all while remaining within the strict 15-watt power budget of an NVIDIA Jetson Orin Nano.

This case study outlines the architectural decisions and compilation strategies required to unlock this level of performance.

The Bottleneck: FP32 Arithmetic

Out of the box, PyTorch models generally run in 32-bit floating-point (FP32) arithmetic. While mathematically precise, this is disastrous for edge environments: the memory bandwidth requirements alone saturate the GPU, leading to high latency and severe thermal throttling.

Our first step was moving away from standard ONNX runtimes. We exported the Ultralytics YOLOv8 model to an ONNX graph, then compiled it into a dedicated engine with NVIDIA's TensorRT.
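The export itself is a one-liner with the Ultralytics API. A minimal sketch (the yolov8n.pt checkpoint name and 640-pixel input size are illustrative, not our production values):

# ONNX export via the Ultralytics API (checkpoint and input size illustrative)
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # load a pretrained YOLOv8 checkpoint
model.export(
    format="onnx",           # serialize the graph to ONNX
    imgsz=640,               # fix the input resolution for the TensorRT build
    opset=12,                # ONNX opset that TensorRT parses cleanly
    simplify=True,           # constant-fold and clean up the exported graph
)

Pinning the input resolution at export time matters: TensorRT can pick much more aggressive kernels when shapes are static.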

INT8 Post-Training Quantization (PTQ)

To achieve the 60 FPS target, we had to rely on INT8 quantization. We supplied TensorRT with a representative dataset of 5,000 images from our target domain so the calibrator could measure real activation distributions.

# TensorRT INT8 calibration sketch (TensorRT 8.x API)
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Parse the exported ONNX graph into the TensorRT network
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("yolov8n.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()

# Enable INT8 precision
config.set_flag(trt.BuilderFlag.INT8)

# StandardEntropyCalibrator is our subclass of trt.IInt8EntropyCalibrator2;
# it feeds batches from the representative dataset and caches the scales
calibrator = StandardEntropyCalibrator('calibration_cache.bin', representative_dataset)
config.int8_calibrator = calibrator

# Build the highly optimized, serialized engine
serialized_engine = builder.build_serialized_network(network, config)

By mapping the continuous FP32 weight and activation distributions to discrete 8-bit integers, we saw a ~4x reduction in memory footprint and a correspondingly large jump in theoretical throughput, since the Orin's Tensor Cores execute INT8 math far faster than FP32.
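For intuition, TensorRT's INT8 mode uses symmetric, scale-only quantization (no zero point); the per-tensor scale is exactly what entropy calibration chooses. A minimal NumPy sketch of the mapping:

# Symmetric INT8 quantization as used by TensorRT (sketch)
import numpy as np

def quantize_int8(x, scale):
    # q = clip(round(x / scale), -128, 127)
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    # approximate reconstruction; the gap is the rounding error
    return q.astype(np.float32) * scale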

Note: Quantization inherently introduces rounding error. Entropy calibration picks the scale for each tensor that minimizes the KL divergence between the FP32 and INT8 activation distributions, which kept our Mean Average Precision (mAP) drop to a negligible 0.6% and tracking accuracy at 99.8%.

TensorRT Graph Optimization

Beyond data types, TensorRT automatically performs layer fusion: it collapses sequences like Convolution -> Batch Normalization -> ReLU into single, highly optimized GPU kernels, eliminating the overhead of round-tripping intermediate results through GPU memory between operations.
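The Conv + BatchNorm half of that fusion is pure algebra: at inference time the BN affine transform folds directly into the convolution's weights and bias. A minimal sketch, assuming an (out_ch, in_ch, kH, kW) weight layout:

# Folding BatchNorm into the preceding convolution (inference-time identity)
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(conv(x, w) + b) = gamma * (conv(x, w) + b - mean) / sqrt(var + eps) + beta
    #                    = conv(x, w_folded) + b_folded
    scale = gamma / np.sqrt(var + eps)           # one factor per output channel
    w_folded = w * scale[:, None, None, None]    # w: (out_ch, in_ch, kH, kW)
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded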

Multi-threading and Asynchronous Execution

Even with a blazing-fast inference engine, the surrounding application logic (video decoding, preprocessing, NMS post-processing, tracking) can become the limiting factor.

We engineered a specialized C++ multi-threaded pipeline, sketched in Python after the list:

  1. Thread 1 (Decoding/Capture): Uses hardware-accelerated decoding (NVDEC) to grab frames from the MIPI-CSI camera.
  2. Thread 2 (Preprocessing): Resizes images and normalizes tensors directly in CUDA memory, completely bypassing CPU involvement.
  3. Thread 3 (Inference): Streams tensors asynchronously into the TensorRT engine via non-blocking CUDA streams.
  4. Thread 4 (Post/Tracking): Employs an optimized ByteTrack algorithm operating strictly on the CPU to handle object persistence across frames.
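
A minimal Python sketch of the same producer/consumer structure (the production pipeline is C++; the stage bodies below are hypothetical placeholders for the camera, CUDA, and tracker code):

# Producer/consumer pipeline sketch; stage bodies are placeholders
import queue
import threading

SENTINEL = object()  # poison pill used to shut the pipeline down

def run_stage(fn, inbox, outbox):
    # Pull items from inbox, apply fn, and push results downstream
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))

# Placeholder stage bodies (the real code uses NVDEC, CUDA streams, ByteTrack)
def preprocess(frame):   return frame   # resize + normalize in CUDA memory
def infer(tensor):       return tensor  # async TensorRT execution
def postprocess(dets):   return dets    # NMS + ByteTrack on the CPU

# Bounded queues apply backpressure so no stage races ahead of the others;
# the capture thread (Thread 1) feeds q_frames
q_frames, q_tensors, q_dets, q_tracks = (queue.Queue(maxsize=4) for _ in range(4))

for fn, inbox, outbox in [(preprocess, q_frames, q_tensors),
                          (infer, q_tensors, q_dets),
                          (postprocess, q_dets, q_tracks)]:
    threading.Thread(target=run_stage, args=(fn, inbox, outbox), daemon=True).start()

In Python the GIL would serialize the CPU-bound stages, which is one reason the production pipeline is C++: there, each stage genuinely runs in parallel on its own core.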

Results

The final compiled binary achieved exactly what we set out to do. On the Jetson Orin Nano, the model stabilized at 62 FPS with latency under 14 ms while drawing barely 12 W of sustained power. This unlocks enterprise-scale computer vision at the very edge of the network, with no massive cloud infrastructure required.
