Embedl SAM3 (Quantized)

SAM3 Benchmark: Baseline vs Embedl Deploy

Optimized version of facebook/sam3 for edge deployment.

Mixed-precision INT8/FP16 quantization with hardware-aware optimizations, ready for NVIDIA Jetson AGX Orin and other TensorRT-capable platforms.

Highlights

  • Format: ONNX with external weights (embedl_sam3_quant.onnx + .onnx.data)
  • Precision: INT8 with sensitive layers kept in FP16
  • Runtime: TensorRT (FP16 + INT8 mode)
  • Target hardware: NVIDIA Jetson AGX Orin, desktop/server GPUs with TensorRT

Quick Start

1. Download the model

hf download embedl/sam3 embedl_sam3_quant.onnx embedl_sam3_quant.onnx.data infer_trt.py --local-dir .

2. Build the TensorRT engine

WARNING: Validated with TensorRT 10.1 and 10.3 only. Latest versions of TensorRT produce incorrect segmentation masks for this model.

/usr/src/tensorrt/bin/trtexec --onnx=embedl_sam3_quant.onnx \
        --fp16 --int8 \
        --builderOptimizationLevel=5 \
        --memPoolSize=workspace:4294967296 \
        --timingCacheFile=embedl_sam3_timing_cache.bin \
        --saveEngine=embedl_sam3_quant.engine

3. Run inference

See infer_trt.py for a complete example that runs text-prompted video segmentation, measures latency, and saves an output video with mask overlays.

python3 -m venv venv --system-site-packages # Use system TensorRT
source venv/bin/activate
pip install opencv-python transformers av
python infer_trt.py
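Before running the full video pipeline, it can be useful to confirm the engine deserializes correctly and to inspect its I/O tensors. A minimal sketch using the TensorRT 10.x Python API (the engine filename is taken from the trtexec command above; this requires a machine with TensorRT and the built engine, and infer_trt.py remains the authoritative example):

```python
import tensorrt as trt

# Deserialize the engine produced by trtexec.
logger = trt.Logger(trt.Logger.WARNING)
with open("embedl_sam3_quant.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Print each I/O tensor's name, direction, shape, and dtype,
# which tells you what buffers inference code must allocate.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name,
          engine.get_tensor_mode(name),   # INPUT or OUTPUT
          engine.get_tensor_shape(name),
          engine.get_tensor_dtype(name))
```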

Files

| File | Description |
|---|---|
| embedl_sam3_quant.onnx | ONNX model graph |
| embedl_sam3_quant.onnx.data | External weights (~3.1 GB) |
| infer_trt.py | TensorRT inference example |

Performance

The input resolution is reduced from the model's default to 924×924, which enables TensorRT layer fusions that are not possible at the original size. All benchmarks below use this resolution.
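For reference, producing a 924×924 network input from an arbitrary frame can be sketched in plain NumPy. Note that the interpolation mode (nearest neighbour here) and the [0, 1] float16 normalization are assumptions for illustration; the exact preprocessing is defined in infer_trt.py:

```python
import numpy as np

def preprocess(frame: np.ndarray, size: int = 924) -> np.ndarray:
    """Resize an HxWx3 uint8 frame to size x size (nearest neighbour)
    and return a 1x3xHxW float16 tensor scaled to [0, 1].
    Normalization constants are illustrative, not authoritative."""
    h, w = frame.shape[:2]
    ys = (np.arange(size) * h // size).clip(0, h - 1)
    xs = (np.arange(size) * w // size).clip(0, w - 1)
    resized = frame[ys][:, xs]                      # nearest-neighbour resize
    chw = resized.astype(np.float16) / np.float16(255.0)
    return chw.transpose(2, 0, 1)[None]             # HWC -> 1x3xHxW
```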

NVIDIA L4 GPU

Environment: NVIDIA L4, Driver 570.211.01, CUDA 12.8, TensorRT 10.3

Text-prompted video segmentation:

| Configuration | Latency | Speedup |
|---|---|---|
| torch.compile (FP16) | 137 ms | 1.0x |
| Embedl Deploy (this model) | 104 ms | 1.32x |

NVIDIA Jetson AGX Orin

| Configuration | Latency | Throughput | Speedup |
|---|---|---|---|
| Baseline (FP16, resized to 924) | 763 ms | 1.31 qps | 1.0x |
| Embedl Deploy (this model) | 462 ms | 2.17 qps | 1.65x |
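The speedup columns above are simply the ratio of baseline latency to optimized latency:

```python
# Speedup = baseline latency / optimized latency (figures from the tables above).
l4_speedup = 137 / 104     # torch.compile FP16 vs Embedl Deploy on L4
orin_speedup = 763 / 462   # FP16 baseline vs Embedl Deploy on AGX Orin

print(round(l4_speedup, 2))    # 1.32
print(round(orin_speedup, 2))  # 1.65
```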

Accuracy (SA-Co/Gold)

Evaluated on the SA-Co/Gold instance segmentation benchmark (Table 30 in the SAM3 paper). The quantized model retains nearly all of the FP32 accuracy, staying within about 1.8 cgF1 points on average.

Average across all subsets:

| Model | cgF1 | IL_MCC | pos_µF1 |
|---|---|---|---|
| SAM3 (paper, Table 30) | 54.1 | 0.82 | 66.1 |
| SAM3 ONNX FP32 (ours) | 55.56 | 0.823 | 67.45 |
| Embedl SAM3 INT8 (this model) | 53.77 | 0.809 | 66.36 |

Per-subset breakdown:

| Subset | cgF1 (FP32) | cgF1 (INT8) | pos_µF1 (FP32) | pos_µF1 (INT8) |
|---|---|---|---|---|
| Metaclip | 47.92 | 47.07 | 59.24 | 58.54 |
| SA-1B | 53.44 | 52.33 | 61.70 | 61.31 |
| Crowded | 60.28 | 59.09 | 67.54 | 67.25 |
| FG Food | 58.76 | 56.28 | 72.01 | 70.02 |
| Sports Equipment | 67.85 | 65.61 | 75.15 | 73.91 |
| Attributes | 55.11 | 54.12 | 73.08 | 72.57 |
| WikiCommon | 45.57 | 41.85 | 63.46 | 60.88 |
| Average | 55.56 | 53.77 | 67.45 | 66.36 |

Creating Your Own Optimized Models

Deployment-ready models can be created from any supported base model using embedl-deploy, available on PyPI. Detailed tutorials will follow.
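Assuming the PyPI package name matches the project name:

```shell
pip install embedl-deploy
```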

License

This model is a derivative of facebook/sam3.

| Component | License |
|---|---|
| Upstream (Meta SAM3) | SAM License |
| Optimized components | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities.
