Argus

Argus is a multi-task perception system built on a single frozen vision backbone. One forward pass through the encoder produces classification labels, semantic segmentation masks, metric depth maps, object detections, and dense keypoint correspondences. Roughly 103M parameters total, with the 86M backbone frozen and about 17.3M learnable across five task heads. Named after Argus Panoptes, the many-eyed giant of Greek mythology tasked with watching over everything at once.

The backbone is EUPE-ViT-B, introduced in Efficient Universal Perception Encoder (Zhu et al., Meta FAIR, arXiv:2603.22387, March 2026). EUPE distills a small vision encoder from a collection of larger specialist teachers, producing features that transfer well to image understanding, dense prediction, and vision-language tasks simultaneously. Argus leaves those weights frozen and attaches five lightweight heads.

Architecture

```
Image → EUPE-ViT-B (frozen, 86M) → shared features

  ├── Classification   trained linear softmax, 1000 ImageNet classes
  ├── Segmentation     BN + 1×1 Conv, 150 ADE20K classes
  ├── Depth            DPT multi-scale decoder, metric depth (meters), NYU Depth V2
  ├── Detection        split-tower with cofiber decomposition, 80 COCO classes
  └── Correspondence   training-free dense feature matching
```
| Head | Params | Description |
|---|---|---|
| Classification | 769K | Linear(768, 1000) softmax on the L2-normalized CLS token |
| Segmentation | 117K | BatchNorm2d(768) → Conv2d(768, 150, 1×1) at stride 16, bilinear-upsampled to input resolution |
| Depth | 13.45M | DPT fusing backbone blocks [2, 5, 8, 11], 256 depth bins over 0.001 to 10 m |
| Detection | 2.98M | 5 prediction levels at strides [8, 16, 32, 64, 128], cosine similarity against CLIP ViT-L/14 text embeddings |
| Correspondence | 0 | cosine-max on backbone spatial features |
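
The two linear heads are small enough to sketch directly. The following is an illustrative PyTorch reconstruction from the table above; class names and input shapes are assumptions, and the shipped implementations live in the repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearClassifier(nn.Module):
    """Linear softmax head on the L2-normalized CLS token (769K params)."""
    def __init__(self, dim=768, n_classes=1000):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, cls_token):                       # [B, 768]
        return self.fc(F.normalize(cls_token, dim=-1))  # [B, 1000] logits

class LinearSegHead(nn.Module):
    """BatchNorm2d + 1x1 conv at stride 16, upsampled to input size."""
    def __init__(self, dim=768, n_classes=150):
        super().__init__()
        self.bn = nn.BatchNorm2d(dim)
        self.conv = nn.Conv2d(dim, n_classes, kernel_size=1)

    def forward(self, feats, out_hw):                   # feats: [B, 768, H/16, W/16]
        logits = self.conv(self.bn(feats))
        return F.interpolate(logits, size=out_hw, mode="bilinear",
                             align_corners=False)

cls_head = LinearClassifier()
seg_head = LinearSegHead()
logits = cls_head(torch.randn(2, 768))
masks = seg_head(torch.randn(2, 768, 32, 32), (512, 512)).argmax(1)
```

The parameter counts fall out of the shapes: 768 × 1000 + 1000 = 769,000 for the classifier, and 2 × 768 (BN) + 768 × 150 + 150 (conv) = 116,886 for segmentation, matching the table.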

Benchmarks

EUPE paper reproduction

All four reported benchmarks were reproduced as part of building Argus.

| Task | Dataset | Metric | Paper | Argus | Delta |
|---|---|---|---|---|---|
| Classification | ImageNet-1k | kNN k=10 top-1 | 84.1 | 84.07 | −0.03 |
| Segmentation | ADE20K | mean IoU | 52.4 | 52.72 | +0.32 |
| Depth | NYU Depth V2 | RMSE (lower is better) | 0.391 | 0.3914 | +0.0004 |
| Correspondence | SPair-71k | PCK@0.1 | 51.3 | 54.35 | +3.05 |

Shipped task metrics

| Task | Dataset | Metric | Value |
|---|---|---|---|
| Classification | ImageNet-1k val | top-1 / top-5 | 85.53 / 97.69 |
| Segmentation | ADE20K val | mIoU | 52.72 |
| Depth | NYU Depth V2 test | RMSE / abs_rel / a1 | 0.480 / 0.219 / 0.872 |
| Detection | COCO val2017 | mAP @[.5:.95] | 42.64 (42.71 soft NMS) |
| Correspondence | SPair-71k | PCK@0.1 | 54.35 |

The shipped classifier is a trained linear softmax layer (85.53% top-1) that superseded the kNN protocol used during paper reproduction. The shipped depth head is a DPT decoder that improves RMSE by 8% and abs_rel by 28% over a linear probe on the same backbone (0.480 vs 0.520 RMSE). The same DPT architecture applied to segmentation does not beat the linear probe (52.28 vs 52.72 mIoU on ADE20K), so the shipping segmentation head stays a 1×1 convolution.

Detection detail (COCO val2017)

| Metric | Value |
|---|---|
| mAP@[0.5:0.95] | 42.64 |
| mAP@0.50 | 65.70 |
| mAP@0.75 | 45.10 |
| mAP (small / medium / large) | 22.31 / 48.33 / 62.90 |

At 2.98M learnable parameters the detection head surpasses the 16.14M-parameter FCOS simple-feature-pyramid baseline (41.0 mAP) by +1.64 mAP while using 18.4% of its head parameter budget. Small-object mAP is 22.3 against FCOS's 19.4 (+2.9). The backbone was never exposed to detection data; these are the same frozen features used for every other task.

Evaluation protocol: per-class hard NMS (IoU 0.5), score threshold 0.05, top-100 detections per image, pycocotools on COCO val2017.
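
The post-processing half of that protocol can be sketched in pure PyTorch (the actual evaluation uses pycocotools; the helper below is an illustrative reimplementation of per-class hard NMS with the stated thresholds):

```python
import torch

def nms(boxes, scores, iou_thresh):
    """Hard NMS on [N, 4] xyxy boxes; returns kept indices, best first."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = boxes[order[1:]]
        # IoU of the current top box against the remaining candidates
        lt = torch.maximum(boxes[i, :2], rest[:, :2])
        rb = torch.minimum(boxes[i, 2:], rest[:, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        iou = inter / (area(boxes[i]) + area(rest) - inter)
        order = order[1:][iou <= iou_thresh]
    return torch.tensor(keep)

def postprocess(boxes, scores, labels, iou_thresh=0.5,
                score_thresh=0.05, max_dets=100):
    """Per-class hard NMS matching the stated eval protocol."""
    m = scores > score_thresh
    boxes, scores, labels = boxes[m], scores[m], labels[m]
    kept = []
    for c in labels.unique():          # suppress within each class only
        idx = (labels == c).nonzero(as_tuple=True)[0]
        kept.append(idx[nms(boxes[idx], scores[idx], iou_thresh)])
    kept = torch.cat(kept)
    kept = kept[scores[kept].argsort(descending=True)][:max_dets]
    return boxes[kept], scores[kept], labels[kept]
```

Because suppression runs per class, two heavily overlapping boxes survive if they carry different labels.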

The full scaling curve of head designs, from 105-parameter minimal circuits through the shipped 2.98M head, lives in the sibling phanerozoic/cofiber-detection repository. The architecture-search context and FCOS reference baseline are in phanerozoic/detection-heads.

Cross-Dataset Detection Transfer

To test whether the detection head's features generalize beyond COCO, the shipping 2.98M detection head (trained on COCO 2017 at 768px) and the 16.14M FCOS baseline (trained on COCO 2017 at 640px) were each evaluated zero-shot against the 20 RF100-VL validation domains. Both heads saw only COCO during training; RF100-VL was never exposed to either. Evaluation is class-agnostic AR@100 (all detections relabeled to a single "object" class, all ground-truth boxes relabeled likewise) so that localization transfer can be measured even on domains whose label space does not overlap COCO-80.
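
The relabeling step can be expressed as a small transform over COCO-format annotation dicts (an illustrative sketch; the actual evaluation harness may differ in details):

```python
def to_class_agnostic(coco_dict):
    """Collapse every category to a single 'object' class so AR@100
    measures localization only. Returns a modified shallow copy of a
    COCO-format dict; the input is left untouched."""
    out = dict(coco_dict)
    out["categories"] = [{"id": 1, "name": "object"}]
    out["annotations"] = [
        {**ann, "category_id": 1} for ann in coco_dict["annotations"]
    ]
    return out
```

Applying the same transform to both the ground truth and the detections makes AR@100 a pure localization metric, independent of each domain's label space.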

| Domain | FCOS (16.1M) | Ours (3.0M) | Δ |
|---|---|---|---|
| actions | 37.5 | 39.6 | +2.1 |
| aerial-airport | 16.1 | 17.3 | +1.1 |
| all-elements | 2.3 | 7.9 | +5.6 |
| aquarium-combined | 47.5 | 58.2 | +10.6 |
| defect-detection | 0.1 | 0.3 | +0.2 |
| dentalai | 0.2 | 0.9 | +0.7 |
| flir-camera-objects | 53.1 | 54.3 | +1.2 |
| gwhd2021 | 1.7 | 1.5 | -0.3 |
| lacrosse-object-detection | 57.9 | 66.6 | +8.7 |
| new-defects-in-wood | 5.6 | 14.6 | +9.0 |
| orionproducts | 17.1 | 25.5 | +8.5 |
| paper-parts | 19.3 | 22.2 | +2.8 |
| recode-waste | 11.4 | 11.8 | +0.4 |
| soda-bottles | 29.6 | 35.8 | +6.3 |
| the-dreidel-project | 57.7 | 65.2 | +7.4 |
| trail-camera | 60.1 | 69.6 | +9.5 |
| water-meter | 0.7 | 0.0 | -0.6 |
| wb-prova | 83.6 | 86.2 | +2.6 |
| wildfire-smoke | 0.3 | 0.5 | +0.2 |
| x-ray-id | 0.0 | 0.0 | 0.0 |
| RF100-VL AR@100 mean | 25.1 | 28.9 | +3.8 |
| Domain wins | 3 | 17 | |

The detection head wins 17 of 20 domains and concedes 3 (two narrow losses plus the x-ray-id tie, which the win count credits to FCOS), with mean AR@100 +3.8 over the 5× larger FCOS baseline. The largest gaps are on domains far from COCO's distribution: aquarium-combined (+10.6), trail-camera (+9.5), new-defects-in-wood (+9.0), lacrosse-object-detection (+8.7), orionproducts (+8.5), the-dreidel-project (+7.4), soda-bottles (+6.3), all-elements (+5.6). The conceded domains show small gaps (≤0.6 AR) and very low absolute AR for both heads (gwhd2021 wheat-head crops, water-meter digit reads, x-ray-id anatomical landmarks). The interpretation is that the backbone's multi-teacher distilled features are general enough that a detection head one-fifth the FCOS size, on a frozen backbone, transfers across wildly different visual domains at the same level or better.

Cross-Dataset Segmentation Transfer

A separate BN + 1×1 linear probe was trained on the frozen backbone with the same recipe as the ADE20K head, this time on Cityscapes. The backbone was never exposed to driving scenes during EUPE distillation or Argus head training.

| Dataset | Classes | Train images | mIoU |
|---|---|---|---|
| ADE20K (shipped head) | 150 | 20,210 | 52.72 |
| Cityscapes (transfer probe) | 19 | 2,975 | 63.76 |

The Cityscapes probe scores road 96.4, car 87.9, sky 88.8, building 86.7, vegetation 85.6. The weaker categories are thin vertical structures (pole 17.8, traffic light 36.4, traffic sign 48.3), which is an inherent resolution limitation of the stride-16 patch grid rather than a deficiency in the learned representation.

Comparison with standard baselines

As a sanity check, Argus was compared against several well-known models on the same 200-image COCO subset. The classification comparison uses a keyword cross-reference between each model's top-k ImageNet predictions and the COCO ground-truth detection labels on those images, which provides a consistent yardstick across differently-trained models despite the label-space mismatch. These hit rates measure agreement with COCO detection labels via keyword matching on the 200-image subset; they are not raw ImageNet accuracy. For reference, all three classifiers exceed 80% top-1 on the full ImageNet validation set.

Classification (hit rate against COCO detection labels, 200 images):

| Model | Parameters | Top-1 hit | Top-5 hit | Latency | Peak VRAM |
|---|---|---|---|---|---|
| Argus (EUPE-ViT-B) | 86 M | 42.2% | 66.8% | 13.1 ms | 0.34 GB |
| ConvNeXt-Base | 89 M | 40.2% | 71.4% | 10.4 ms | 0.35 GB |
| ResNet50 | 26 M | 36.2% | 61.8% | 8.4 ms | 0.12 GB |

Segmentation:

| Model | Parameters | Classes | Latency | Peak VRAM |
|---|---|---|---|---|
| Argus (EUPE + linear head) | 86 M | 150 | 11.8 ms | 0.41 GB |
| DeepLabV3-ResNet50 | 42 M | 21 | 15.9 ms | 0.33 GB |

Depth:

| Model | Parameters | Latency | Peak VRAM |
|---|---|---|---|
| Argus (EUPE + linear head) | 86 M | 13.3 ms | 0.35 GB |
| Depth-Anything-V2-Base | 98 M | 18.8 ms | 0.68 GB |

Argus posts the best top-1 hit rate of the three classifiers, with ConvNeXt-Base ahead on top-5. The Argus classification row above was measured with the kNN method during the original head-to-head comparison; the current shipped classifier (trained linear softmax) would likely narrow ConvNeXt's top-5 margin. Argus is faster than DeepLabV3 while predicting a much richer label space, and faster than Depth-Anything-V2 while using roughly half the VRAM. Although these baselines and Argus were trained for different objectives on different datasets, the comparison is useful for understanding what the model delivers in practice.

Multi-Task Throughput

The per-task comparisons above measure each head against its single-task counterpart in isolation. A separate question is what happens when a user needs all of the tasks at once, which is the typical situation in dataset annotation, model evaluation, and any pipeline where images pass through multiple analysis stages in sequence. The alternative to Argus in that situation is to load and run four separate single-task models of comparable quality, each carrying its own backbone, its own preprocessing, and its own forward pass. The total cost is the sum of the four individual inference times, plus the memory overhead of holding four independent models on the device simultaneously.

The models chosen for this comparison were selected to match the quality tier of the EUPE-ViT-B backbone rather than to minimize size or maximize speed. ConvNeXt-Base (88.6M parameters) is a widely-used ImageNet-1k classifier at the same parameter scale as EUPE-ViT-B. SegFormer-B3 (47.3M) is a transformer-based ADE20K semantic segmenter that is the standard mid-range alternative to a linear probe on a frozen backbone. Depth-Anything-V2-Base (97.5M) is the current standard for single-image monocular depth estimation at base scale. YOLO26l (26.3M) is the large variant of the January 2026 YOLO release from Ultralytics, representing the state of the art in efficient real-time detection. All measurements were taken on an NVIDIA RTX 6000 Ada across the same nine example images, with five timed runs after a three-image warmup pass to eliminate cold-start effects.

| Pipeline | Parameters | Latency per image | Tasks |
|---|---|---|---|
| Argus unified | 103 M | 56 ms | 5 (classify, segment, depth, detect, correspond) |
| Four separate models | 260 M | 68 ms | 4 (classify, segment, depth, detect) |

The per-model breakdown for the separate pipeline is ConvNeXt-Base at 6 ms, SegFormer-B3 at 19 ms, Depth-Anything-V2-Base at 31 ms, and YOLO26l at 12 ms, summing to 68 ms when the tasks are run sequentially on the same image. Argus completes five tasks (the same four plus keypoint correspondence, which the separate pipeline does not attempt) in 56 ms from a single model load. The total parameter count for the separate pipeline is 260M across four independent weight sets, while Argus carries 103M in a single file.

The throughput advantage comes from the shared backbone. Each of the four separate models pays the cost of encoding the image through its own network before producing task-specific output. Argus encodes the image once through EUPE-ViT-B and then routes the resulting features to five lightweight heads, each of which adds only a few milliseconds on top of the shared representation. The backbone forward pass is the dominant cost in both pipelines, and running it once rather than four times is where the 1.2x throughput improvement and 2.5x parameter reduction originate. The practical consequence for deployment is that Argus requires a single model download, a single checkpoint load, and a single Python import, where the equivalent separate-model pipeline requires four downloads totaling over a gigabyte, four independent weight sets held concurrently, and four separate dependency trees to manage.

Usage

```python
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True)
image = Image.open("your_image.jpg").convert("RGB")

top5  = model.classify(image, top_k=5)
seg   = model.segment(image)              # [H, W] class indices
depth = model.depth(image)                # [H, W] metric depth in meters
dets  = model.detect(image, score_thresh=0.3)
# dets: list of {"box": [x1, y1, x2, y2], "score", "label", "class_name"}

# Three tasks at once (shared backbone forward inside perceive)
result = model.perceive(image)
# result["classification"], result["segmentation"], result["depth"], result["timings_ms"]

# Keypoint correspondence between two images
target = Image.open("other_image.jpg").convert("RGB")
predicted = model.correspond(image, target, [[100, 100], [200, 200]])
```

Every single-image method also accepts a list of PIL images and returns a list of per-image results in the same shape a single call would produce.

Confidence outputs

```python
seg_map, seg_conf   = model.segment(image, return_confidence=True)
# seg_conf is per-pixel max softmax probability in [0, 1]

depth_map, depth_std = model.depth(image, return_confidence=True)
# depth_std is per-pixel standard deviation of the 256-bin distribution

result = model.perceive(image, return_confidence=True)
# result["segmentation_confidence"], result["depth_uncertainty"]
```

Classification always carries a margin field (top-1 minus top-2 score) on the first entry.

ONNX export

```python
paths = model.export_onnx("/path/to/out_dir", backbone_resolution=640, verify=True)
# backbone, classifier, seg_head, depth_head, detection_head (five graphs)
```

The segmentation graph folds the bilinear upsample to input resolution into the graph, so consumers can argmax directly. The classifier graph is self-contained (softmax weights captured as buffers). The depth head accepts four intermediate ViT-block activations as separate positional tensor inputs. The detection head returns pre-NMS per-location boxes and scores by default, or, with include_nms=True, bakes ONNX NonMaxSuppression (opset ≥ 10) into the detection graph for single-shot TensorRT or mobile inference. Correspondence has no learned parameters and needs no graph.

Tolerance for verify=True can be a float or a dict keyed by verification output name. When a float is passed, detection box coordinates get a resolution-scaled tolerance because exp() in the regression path amplifies FP kernel-dispatch differences to pixel scale.

INT8 quantization

```python
model = AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True)
model = model.cuda().eval().quantize_int8()  # requires: pip install torchao
```

Weight-only INT8 quantization via torchao. Linear weights go to INT8; activations stay in BF16. Classification agreement with FP32 is 100%, depth drift averages 0.013 m. Reduces weight VRAM substantially. Latency behaviour depends on whether the target GPU has an INT8 tensor-core path torchao can dispatch to.

Precision variants

Two safetensors with identical inference behaviour but different on-disk precision.

| File | Load |
|---|---|
| model.safetensors | AutoModel.from_pretrained("phanerozoic/argus", trust_remote_code=True) |
| model.bf16_backbone.safetensors | add variant="bf16_backbone" |

Both load into the same FP32 model in memory; PyTorch upcasts the stored bfloat16 weights at construction. The smaller variant saves download bandwidth only.

Training

The backbone is frozen for every task. Only the task heads are trained; the kNN class prototypes used during paper reproduction were extracted from cached features, not trained.

| Component | Source | Method |
|---|---|---|
| Segmentation | ADE20K (20,210 train) | Linear probe, CE loss, AdamW lr 1e-3, 512×512, 40,000 iterations |
| Depth | NYU Depth V2 (24,231 train) | DPT decoder, SILog loss, AdamW lr 1e-4, 416×416, 38,400 iterations |
| Linear softmax classifier | ImageNet-1k (1.28M train) | Cached CLS features, SGD momentum 0.9, cosine LR, 100 epochs |
| Detection | COCO 2017 (117,266 train) | Split-tower with cofiber decomposition, ATSS, focal + GIoU + BCE, AdamW lr 5e-4, 768×768, 16 ep + 3 ep partial calibration |
| Correspondence | none | training-free cosine similarity |

Backbone simplification

The upstream EUPE-ViT-B release ships a LinearKMaskedBias wrapper around each block's QKV projection. In the released weights both the bias_mask and the bias are filled with zeros across all twelve blocks, so the masked bias is identically zero at every forward pass. The Argus backbone drops the 24 redundant tensors entirely (12 × qkv.bias + 12 × qkv.bias_mask, 55,296 values total), and the attention blocks are constructed with qkv_bias=False, mask_k_bias=False. FP32 forward is bitwise-equivalent for classification, segmentation, detection, and correspondence. The DPT depth decoder shows sub-centimeter drift under BF16 autocast; that drift is more than an order of magnitude smaller than the head's own 48-centimeter NYU Depth V2 RMSE and causes no visible change in depth maps. To load the upstream EUPE-ViT-B release directly into this backbone class, pass strict=False to load_state_dict so the extra keys in the upstream checkpoint are silently ignored.
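
The strict=False mechanism can be demonstrated with a toy module; `Block` below is a stand-in, not the real Argus backbone class. The point is that the upstream checkpoint's zero-valued qkv bias lands in `unexpected_keys` and the forward pass is unchanged:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, qkv_bias):
        super().__init__()
        self.qkv = nn.Linear(8, 24, bias=qkv_bias)
    def forward(self, x):
        return self.qkv(x)

upstream = Block(qkv_bias=True)
nn.init.zeros_(upstream.qkv.bias)        # upstream ships all-zero biases
simplified = Block(qkv_bias=False)       # Argus-style: bias tensor dropped

result = simplified.load_state_dict(upstream.state_dict(), strict=False)
# result.unexpected_keys == ['qkv.bias']: the extra tensor is ignored

x = torch.randn(2, 8)
same = torch.allclose(simplified(x), upstream(x))  # adding zero changes nothing
```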

Head details

Segmentation. BatchNorm2d(768) → Conv2d(768, 150, 1×1), 116,886 parameters. Trained at 512×512 with cross-entropy loss, AdamW (lr 1e-3, weight decay 1e-3), WarmupOneCycleLR with 1500-step warmup, batch 16.

Depth (DPT). Hooks into backbone blocks [2, 5, 8, 11] via PyTorch forward hooks, capturing intermediate representations without modifying the backbone. A reassemble stage projects each block's output from 768 to 256 channels via LayerNorm + Linear, reshapes to spatial grids, and rescales to strides [4, 8, 16, 32]. A bottom-up fusion path combines the four scales through residual conv blocks with skip connections. A final conv head produces 256 depth-bin logits; metric depth is the bin-weighted sum. 13,450,000 parameters. Trained at 416×416 with SILog loss, AdamW (lr 1e-4, weight decay 1e-3), cosine schedule with 3% warmup, batch 16, 38,400 iterations.
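
The final bin-to-metric step can be sketched as a softmax over the 256 depth-bin logits followed by a bin-weighted sum. Linear bin spacing over 0.001 to 10 m is an assumption here; the shipped head may space its bins differently:

```python
import torch

def bins_to_depth(logits, d_min=0.001, d_max=10.0):
    """logits: [B, 256, H, W] -> (metric depth [B, H, W], per-pixel std)."""
    centers = torch.linspace(d_min, d_max, logits.shape[1])   # bin centers (m)
    probs = logits.softmax(dim=1)                             # per-pixel distribution
    mean = torch.einsum("bchw,c->bhw", probs, centers)        # bin-weighted sum
    var = torch.einsum("bchw,c->bhw", probs, centers ** 2) - mean ** 2
    return mean, var.clamp(min=0).sqrt()

depth, std = bins_to_depth(torch.zeros(1, 256, 4, 4))
# uniform logits -> predicted depth is the midpoint of the bin range
```

The standard deviation of the same distribution is exactly the per-pixel uncertainty that `model.depth(image, return_confidence=True)` exposes.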

Linear softmax classifier. A single Linear(768, 1000) layer with bias applied to the L2-normalized CLS token, 769,000 parameters. Trained as a two-pass job: first the frozen backbone runs over ImageNet-1k train to cache a per-image CLS feature tensor (1,281,167 × 768), then the linear layer trains on the cached features alone with SGD (momentum 0.9, weight decay 0), batch 4096, cosine schedule, 100 epochs, no augmentation. A small LR sweep over {0.5, 1.0, 3.0, 10.0, 30.0} selected lr=30.0: L2-normalized features plus zero-initialized weights require an unusually large learning rate to grow the weight scale to the point where the softmax distribution sharpens. The best run reached 85.53% top-1 and 97.69% top-5 on ImageNet-1k val.
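
A minimal sketch of that second pass, with synthetic features standing in for the 1.28M cached ImageNet vectors. The zero init and large lr=30.0 mirror the recipe described above; everything else is simplified (full-batch steps instead of batch 4096):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for cached, L2-normalized CLS features and their labels
feats = F.normalize(torch.randn(4096, 768), dim=-1)
labels = torch.randint(0, 1000, (4096,))

probe = nn.Linear(768, 1000)
nn.init.zeros_(probe.weight)
nn.init.zeros_(probe.bias)
opt = torch.optim.SGD(probe.parameters(), lr=30.0, momentum=0.9,
                      weight_decay=0.0)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for epoch in range(100):          # full-batch for the sketch
    loss = F.cross_entropy(probe(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```

With zero-initialized weights the first loss is exactly ln(1000) ≈ 6.91; the large learning rate is what lets the weight norm, and hence the softmax sharpness, grow quickly from that flat start.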

Detection (split-tower with cofiber decomposition). Anchor-free, operating on a cofiber decomposition of the frozen backbone spatial features rather than a learned FPN. Cofiber decomposition is a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map, producing frequency-separated scale bands. Five prediction levels: one stride-8 level synthesized by a single transposed convolution from the stride-16 band, plus four cofiber bands at strides 16, 32, 64, 128. Separate classification and regression towers of depth nine (five 3×3 ConvGN blocks followed by four depthwise residual blocks at 160 hidden channels) process each level with weights shared across levels. Top-down lateral connections pass information from coarser to finer bands before the towers run. Classification is cosine similarity between a Linear(160, 768) projection and CLIP ViT-L/14 multi-prompt text embeddings of the 80 COCO class names, with a learned scalar temperature and per-class bias. Regression uses exponentiated LTRB distances with a learned per-level scale. Centerness is a single 1×1 convolution. 2,975,067 parameters.
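
The cofiber decomposition itself fits in a few lines. This is an illustrative sketch: the downsample/upsample choices (average pooling, bilinear interpolation) are assumptions, not necessarily the shipped operators:

```python
import torch
import torch.nn.functional as F

def cofiber_bands(x, n_bands=4):
    """x: [B, C, H, W] at stride 16 -> bands at strides 16, 32, 64, 128.
    Each band is the residual between a feature map and its
    downsample-then-upsample reconstruction; zero learned parameters."""
    bands = []
    for _ in range(n_bands - 1):
        coarse = F.avg_pool2d(x, kernel_size=2)
        recon = F.interpolate(coarse, scale_factor=2, mode="bilinear",
                              align_corners=False)
        bands.append(x - recon)      # high-frequency residual at this scale
        x = coarse
    bands.append(x)                  # coarsest low-frequency band
    return bands

bands = cofiber_bands(torch.randn(1, 768, 48, 48))
# spatial sizes: 48x48, 24x24, 12x12, 6x6
```

By construction the bands telescope: upsampling the coarsest band and adding the residuals back reproduces the input exactly, so no information is lost in the split.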

Trained at 768×768 with letterbox padding, ATSS target assignment (Zhang et al. 2020), horizontal-flip augmentation, focal loss (α=0.25, γ=2.0) for classification, GIoU for boxes, BCE for centerness, AdamW (lr 5e-4, weight decay 1e-4), cosine schedule with 3% warmup, batch 16, 16 epochs. Step 104,000 of 117,264 was selected by late-training checkpoint sweep as the base. A 3-epoch partial fine-tune at lr 1e-4 then updates only cls_project, cls_bias, and logit_scale (the classification calibration layers), leaving the towers and cofiber path frozen. The partial fine-tune adds +0.15 aggregate mAP and +1.1 small-object mAP. The shipped weights are the final state of that fine-tune.

Correspondence. No learned parameters. At inference, dense patch features are extracted from both images, upsampled to 512×512 pixel resolution, and matched by cosine similarity per source keypoint.
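
The cosine-max matching reduces to one matrix product per image pair. A sketch, with random feature grids standing in for the upsampled backbone features:

```python
import torch
import torch.nn.functional as F

def match_keypoints(feat_src, feat_tgt, keypoints):
    """feat_*: [C, H, W] dense features; keypoints: [(x, y), ...] in
    source pixels. Returns the cosine-max target pixel per keypoint."""
    C, H, W = feat_tgt.shape
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)      # [C, H*W]
    out = []
    for x, y in keypoints:
        q = F.normalize(feat_src[:, y, x], dim=0)          # source descriptor
        idx = (q @ tgt).argmax().item()                    # cosine-max match
        out.append((idx % W, idx // W))
    return out
```

As noted under limitations, the argmax always fires: every source keypoint gets a target pixel even when no confident match exists.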

Compute

| Task | Iterations | Notes |
|---|---|---|
| Segmentation (ADE20K) | 40,000 | linear probe, batch 16, 512px, CE loss, frozen backbone |
| Linear classifier (ImageNet-1k) | 100 epochs × 313 steps | SGD momentum 0.9, batch 4096, cosine schedule on cached CLS features; extraction is a single full-train pass through the frozen backbone |
| DPT depth decoder (NYU Depth V2) | 38,400 | batch 16, 416px, SILog loss, frozen backbone |
| Detection (COCO 2017) | 16 epochs × 7,329 batches at 768px + 3-epoch partial fine-tune of classification calibration layers | bf16 mixed-precision forward + fp32 master params + fp32 AdamW moments, CUDA graph capture, frozen backbone |
| Correspondence (SPair-71k) | training-free | |

Why minimal heads

The segmentation and classification heads follow the EUPE paper's evaluation principle: a minimal decoder isolates the backbone's contribution from the head's capacity. A Mask2Former-style segmentation head would produce higher mIoU, but those numbers would reflect the decoder as much as the features. The depth and detection heads are heavier because their tasks require multi-scale reasoning. The cofiber construction costs no trained parameters, so the detection head budget stays small (2.98M) while covering five pyramid levels from stride 8 to stride 128.

Notes and limitations

  • The segmentation head was trained on ADE20K's 150-class indoor-and-urban label space.
  • The depth head was trained on NYU Depth V2 (indoor). Outdoor metric depth should be treated as approximate.
  • The detection head was trained on COCO 2017's 80-class label space at 768-pixel input. Small-object mAP (22.3) is the weakest axis because the stride-8 P3 level can only resolve objects roughly 10 pixels and larger at that resolution.
  • Correspondence has no confidence signal; it returns a target pixel for every source keypoint regardless of match ambiguity.

License

The EUPE-ViT-B backbone weights inside this checkpoint were released by Meta FAIR under the FAIR Research License, which restricts use to non-commercial research and education. The task heads and class prototypes in this checkpoint were trained independently by the author of this repository and would on their own be releasable under a permissive license. However, because they are inseparably bundled with the backbone weights in a single file, the unified checkpoint inherits the more restrictive license of its most restricted component. In practical terms, both model.safetensors and model.bf16_backbone.safetensors should be treated as released under the FAIR Research License. See LICENSE for the full text.

Citation

```bibtex
@misc{zhu2026eupe,
  title={Efficient Universal Perception Encoder},
  author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
  year={2026},
  eprint={2603.22387},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

Acknowledgements

The EUPE backbone was trained and released by Meta FAIR. The dataset loading utilities are from the DINOv3 repository. The Argus task heads, benchmarks, and packaging were done by phanerozoic.
