---
license: apache-2.0
tags:
- sam2
- segment-anything
- onnx
- webgpu
- computer-vision
- image-segmentation
library_name: onnxruntime
---

# SAM2-HIERA-TINY - ONNX Format for WebGPU

**Powered by [Segment Anything 2 (SAM2)](https://github.com/facebookresearch/segment-anything-2) from Meta Research**

This repository contains ONNX-converted models from [facebook/sam2-hiera-tiny](https://huggingface.co/facebook/sam2-hiera-tiny), optimized for WebGPU deployment in browsers.

## Model Information

- **Original Model**: [facebook/sam2-hiera-tiny](https://huggingface.co/facebook/sam2-hiera-tiny)
- **Version**: SAM 2.0
- **Size**: 38.9M parameters
- **Description**: Tiny variant - the fastest to run, well suited to real-time applications
- **Format**: ONNX (encoder + decoder)
- **Optimization**: Encoder optimized to `.ort` format for WebGPU

## Files

- `encoder.onnx` - Image encoder (ONNX format)
- `encoder.with_runtime_opt.ort` - Image encoder (optimized for WebGPU)
- `decoder.onnx` - Mask decoder (ONNX format)
- `config.json` - Model configuration

## Usage

### In Browser with ONNX Runtime Web

```javascript
import * as ort from 'onnxruntime-web/webgpu';

// Load the encoder (use the pre-optimized .ort version for WebGPU)
const encoderURL = 'https://huggingface.co/SharpAI/sam2-hiera-tiny-onnx/resolve/main/encoder.with_runtime_opt.ort';
const encoderSession = await ort.InferenceSession.create(encoderURL, {
  executionProviders: ['webgpu'],
  graphOptimizationLevel: 'disabled' // the .ort file is already optimized offline
});

// Load the decoder
const decoderURL = 'https://huggingface.co/SharpAI/sam2-hiera-tiny-onnx/resolve/main/decoder.onnx';
const decoderSession = await ort.InferenceSession.create(decoderURL, {
  executionProviders: ['webgpu']
});

// Run the encoder
const imageData = preprocessImage(image); // your preprocessing; must yield Float32 [1, 3, 1024, 1024]
const encoderOutputs = await encoderSession.run({ image: imageData });

// Run the decoder with one foreground point (x, y) plus one padding point
const point_coords = new ort.Tensor('float32', [x, y, 0, 0], [1, 2, 2]);
const point_labels = new ort.Tensor('float32', [1, -1], [1, 2]);
const mask_input = new ort.Tensor('float32', new Float32Array(256 * 256).fill(0), [1, 1, 256, 256]);
const has_mask_input = new ort.Tensor('float32', [0], [1]);

const decoderOutputs = await decoderSession.run({
  image_embed: encoderOutputs.image_embed,
  high_res_feats_0: encoderOutputs.high_res_feats_0,
  high_res_feats_1: encoderOutputs.high_res_feats_1,
  point_coords: point_coords,
  point_labels: point_labels,
  mask_input: mask_input,
  has_mask_input: has_mask_input
});

// Get the masks
const masks = decoderOutputs.masks; // Shape: [1, 3, 256, 256]
```

### In Python with ONNX Runtime

```python
import onnxruntime as ort
import numpy as np

# Load models
encoder_session = ort.InferenceSession("encoder.onnx")
decoder_session = ort.InferenceSession("decoder.onnx")

# Run the encoder (image_tensor: float32, [1, 3, 1024, 1024])
encoder_outputs = encoder_session.run(None, {"image": image_tensor})

# Run the decoder
decoder_outputs = decoder_session.run(None, {
    "image_embed": encoder_outputs[0],
    "high_res_feats_0": encoder_outputs[1],
    "high_res_feats_1": encoder_outputs[2],
    "point_coords": point_coords,
    "point_labels": point_labels,
    "mask_input": mask_input,
    "has_mask_input": has_mask_input
})

masks = decoder_outputs[0]
```

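The `image` input in both examples must be a normalized 1024×1024 RGB tensor. A minimal preprocessing sketch, assuming the ImageNet mean/std normalization used by the reference SAM2 implementation (verify against the preprocessing your export was traced with):

```python
import numpy as np

# ImageNet statistics (assumption: matches the exported model's expected normalization)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_uint8):
    """rgb_uint8: HxWx3 uint8 array, already resized to 1024x1024.
    Returns a float32 tensor of shape [1, 3, 1024, 1024]."""
    img = rgb_uint8.astype(np.float32) / 255.0   # scale to [0, 1]
    img = (img - MEAN) / STD                      # per-channel normalization
    return img.transpose(2, 0, 1)[None]           # HWC -> NCHW
```

Resizing to 1024×1024 is left to your image library of choice (e.g. Pillow or OpenCV).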
## Input/Output Specifications

### Encoder

**Input:**
- `image`: Float32[1, 3, 1024, 1024] - Normalized RGB image

**Outputs:**
- `image_embed`: Float32[1, 256, 64, 64] - Image embeddings
- `high_res_feats_0`: Float32[1, 32, 256, 256] - High-res features (level 0)
- `high_res_feats_1`: Float32[1, 64, 128, 128] - High-res features (level 1)

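These shapes can be sanity-checked against encoder outputs before wiring them into the decoder. A small sketch (the `check_encoder_outputs` helper and the dummy arrays are illustrative, not part of this repository):

```python
import numpy as np

EXPECTED = {
    "image_embed": (1, 256, 64, 64),
    "high_res_feats_0": (1, 32, 256, 256),
    "high_res_feats_1": (1, 64, 128, 128),
}

def check_encoder_outputs(outputs):
    """outputs: dict mapping output name -> ndarray; raise if any shape disagrees."""
    for name, shape in EXPECTED.items():
        if outputs[name].shape != shape:
            raise ValueError(f"{name}: got {outputs[name].shape}, want {shape}")

# Demo with dummy arrays standing in for real encoder outputs
dummy = {n: np.zeros(s, dtype=np.float32) for n, s in EXPECTED.items()}
check_encoder_outputs(dummy)
```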
### Decoder

**Inputs:**
- `image_embed`: Float32[1, 256, 64, 64] - From encoder
- `high_res_feats_0`: Float32[1, 32, 256, 256] - From encoder
- `high_res_feats_1`: Float32[1, 64, 128, 128] - From encoder
- `point_coords`: Float32[1, 2, 2] - Point coordinates [[x, y], [0, 0]]
- `point_labels`: Float32[1, 2] - Point labels [1, -1] (1=foreground, -1=padding)
- `mask_input`: Float32[1, 1, 256, 256] - Previous mask (zeros if none)
- `has_mask_input`: Float32[1] - Flag [0] or [1]

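For reference, the point-prompt inputs above can be assembled with NumPy as follows (the click location is a hypothetical example; coordinates are in the 1024×1024 input frame):

```python
import numpy as np

# Hypothetical click location in the 1024x1024 input frame
x, y = 512.0, 384.0

# One foreground point plus one padding point, matching the shapes above
point_coords = np.array([[[x, y], [0.0, 0.0]]], dtype=np.float32)  # [1, 2, 2]
point_labels = np.array([[1.0, -1.0]], dtype=np.float32)           # [1, 2]
mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)          # no prior mask
has_mask_input = np.zeros(1, dtype=np.float32)                     # flag: 0 = unused
```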
**Outputs:**
- `masks`: Float32[1, 3, 256, 256] - Generated masks (3 candidates)
- `iou_predictions`: Float32[1, 3] - IoU scores for each mask
- `low_res_masks`: Float32[1, 3, 256, 256] - Low-resolution masks

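A common way to consume these outputs is to keep the candidate with the highest predicted IoU and binarize its logits at 0. A sketch under that assumption (the dummy arrays stand in for real decoder outputs):

```python
import numpy as np

def pick_best_mask(masks, iou_predictions, threshold=0.0):
    """Keep the candidate with the highest predicted IoU, binarized at `threshold`."""
    best = int(np.argmax(iou_predictions[0]))
    return masks[0, best] > threshold  # boolean [256, 256] mask

# Demo with dummy decoder outputs
masks = np.zeros((1, 3, 256, 256), dtype=np.float32)
masks[0, 1, 100:150, 100:150] = 5.0  # candidate 1 contains a positive region
iou_predictions = np.array([[0.2, 0.9, 0.5]], dtype=np.float32)
best_mask = pick_best_mask(masks, iou_predictions)
```

The binary mask can then be upsampled to the original image resolution for display.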
## Browser Requirements

- Chrome 113+ (WebGPU is enabled by default on Windows, macOS, and ChromeOS; on Linux, enable `chrome://flags/#enable-unsafe-webgpu`)
- Firefox Nightly with WebGPU enabled
- Safari Technology Preview with WebGPU enabled

## Performance

Typical inference times on Chrome with WebGPU:

- **Encoder**: 2-3s
- **Decoder**: 0.1-0.5s per point

## License

This model is released under the Apache 2.0 license, following the original SAM2 model.

## Citation

```bibtex
@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
  journal={arXiv preprint arXiv:2408.00714},
  year={2024}
}
```

## Related Resources

- **Original SAM2**: [facebookresearch/segment-anything-2](https://github.com/facebookresearch/segment-anything-2)
- **WebGPU Demo**: [Aegis AI SAM2 WebGPU Demo](https://github.com/yourusername/Aegis-AI/tree/main/tools/sam2-webgpu)
- **Conversion Tool**: [SAM2 ONNX Converter](https://github.com/yourusername/Aegis-AI/tree/main/tools/sam2-converter)

## Acknowledgments

- **Meta Research** for the original SAM2 model
- **Microsoft** for ONNX Runtime
- **SamExporter** for conversion tools

---

*Converted and optimized by [Aegis AI](https://github.com/yourusername/Aegis-AI)*