Telesurgery Neural Tokenizer v1.0 Overview
Description:
The Telesurgery Neural Tokenizer processes surgical video inputs by tokenizing frames with a distilled frame autoencoder, optimized for low-latency applications such as telesurgery video streaming.
This model is available for commercial use.
License/Terms of Use:
Governing Terms: Use of this model is governed by the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
Primarily intended for surgical robotics researchers, healthcare AI developers, academic institutions, or companies exploring neural codecs for telesurgery applications, particularly where low latency video streaming is critical.
Model Architecture:
Architecture Type: Convolutional Neural Network with Residual and Attention Blocks (based on Wan2.1 with 2D Convolutions)
Network Architecture: Telesurgery Neural Tokenizer (Custom Architecture, 1GB VRAM Requirement, Optimized for NVIDIA GPUs)
This model was distilled from Wan2.1. Number of model parameters: 12.6M
Input:
Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: Two-Dimensional (2D)
Other Properties Related to Input: Image Resolution: 536x960, 720x1280 or 1080x1920; Image Range: [-1, 1]
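As a hedged illustration of the input contract above (the card specifies RGB frames in the [-1, 1] range but not the exact preprocessing code), an 8-bit frame could be normalized like this; `normalize_frame` is an assumed helper, not part of the released model API:

```python
import numpy as np

def normalize_frame(frame_u8: np.ndarray) -> np.ndarray:
    """Map an HxWx3 uint8 RGB frame into the [-1, 1] float range.

    Illustrative assumption only: the model card states the input
    range is [-1, 1]; this helper performs the standard
    x / 127.5 - 1 conversion and nothing else.
    """
    if frame_u8.dtype != np.uint8 or frame_u8.ndim != 3 or frame_u8.shape[2] != 3:
        raise ValueError("expected an HxWx3 uint8 RGB frame")
    return frame_u8.astype(np.float32) / 127.5 - 1.0

# Example: a 720x1280 mid-gray RGB frame (one of the supported resolutions).
frame = np.full((720, 1280, 3), 128, dtype=np.uint8)
normalized = normalize_frame(frame)
```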
Output:
Output Type(s): Embeddings (encoder)
Output Format: Pytorch Tensor
Output Parameters: Three-Dimensional (3D)
Other Properties Related to Output: Embedding shape: 2 x (H/8) x (W/8), where H and W are the height and width of the original image.
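The embedding format above implies two latent channels at 1/8 spatial resolution. A small sketch (derived solely from the stated formula, not official code) of the resulting tensor shapes for the supported resolutions:

```python
def latent_shape(height: int, width: int) -> tuple:
    """Shape of the 2 x (H/8) x (W/8) embedding for an HxW frame.

    Assumes H and W are divisible by 8, which holds for all
    resolutions listed on this card.
    """
    if height % 8 or width % 8:
        raise ValueError("H and W must be divisible by 8")
    return (2, height // 8, width // 8)

# All supported input resolutions map cleanly onto the latent grid.
shapes = {(h, w): latent_shape(h, w) for h, w in [(536, 960), (720, 1280), (1080, 1920)]}
```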
Output Type(s): Image (decoder)
Output Format: Red, Green, Blue (RGB)
Output Parameters: Two-Dimensional (2D)
Other Properties Related to Output: Minimum Resolution: 480x848, Maximum Resolution: 1080x1920, Image Range: [-1, 1]
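Since decoded images are stated to lie in [-1, 1], converting them back to displayable 8-bit RGB could look like the following; this inverts the usual x / 127.5 - 1 normalization and is an illustrative assumption, not the model's official postprocessing:

```python
import numpy as np

def denormalize_frame(frame: np.ndarray) -> np.ndarray:
    """Map a decoded [-1, 1] RGB frame back to uint8 for display.

    Clips slight overshoot outside [-1, 1] before casting, since
    decoder outputs may not be exactly bounded in practice.
    """
    return np.clip((frame + 1.0) * 127.5, 0, 255).astype(np.uint8)

# A tiny 1x1 "frame" whose three channels span the full output range.
decoded = np.array([[[-1.0, 0.0, 1.0]]], dtype=np.float32)
pixels = denormalize_frame(decoded)
```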
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
- TensorRT
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
v0.1
The Telesurgery Neural Tokenizer can be integrated into an AI system via ONNX or TensorRT runtime engines, supporting NVIDIA Ampere, Blackwell, and Hopper microarchitectures, and Linux-based operating systems. It accepts 2D RGB image frames (numeric vectors) at specific resolutions (536x960, 720x1280, or 1080x1920) for low-latency video streaming in telesurgery scenarios.
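Because the encoder accepts only the specific resolutions listed above, a pre-flight check before handing frames to the runtime can fail fast on unsupported shapes. The resolution set comes straight from this card, but `validate_frame_shape` itself is a hypothetical helper, not part of the released model API:

```python
# Supported encoder input resolutions (height, width) per this card.
ENCODER_RESOLUTIONS = {(536, 960), (720, 1280), (1080, 1920)}

def validate_frame_shape(height: int, width: int) -> None:
    """Raise ValueError if (height, width) is not a supported resolution."""
    if (height, width) not in ENCODER_RESOLUTIONS:
        supported = ", ".join(f"{h}x{w}" for h, w in sorted(ENCODER_RESOLUTIONS))
        raise ValueError(f"{height}x{width} unsupported; expected one of {supported}")

validate_frame_shape(720, 1280)  # a supported resolution passes silently
```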
Training, Testing, and Evaluation Datasets:
Training Dataset:
Link: In-house surgical data (laparoscopic surgeries)
Data Modality:
- Image
Image Training Data Size:
- Less than a Million Images
Text Training Data Size:
- Less than a Billion Tokens
Video Training Data Size:
- 10,000 to 1 Million Hours
Non-Audio, Image, Text Training Data Size:
- Approximately 536x960 to 1080x1920 pixels (RGB images)
Data Collection Method by dataset:
- Human
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): Training set consists of 5765 (5-minute) video items for laparoscopic surgeries. Modality: Video (Image sequences). Content Nature: In-house surgical data.
Testing Dataset:
Link: In-house surgical data (laparoscopic surgeries)
Data Collection Method by dataset:
- Human
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): Testing set consists of 1224 (5-minute) video items for laparoscopic surgeries. Modality: Video (Image sequences). Content Nature: In-house surgical data.
Evaluation Dataset:
Link: In-house surgical dataset (laparoscopic surgeries)
Data Collection Method by dataset:
- Human
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): Evaluation set consists of 1220 (5-minute) video items for laparoscopic surgeries. Modality: Video (Image sequences). Content Nature: In-house surgical data.
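Assuming every item in the splits above is exactly 5 minutes long, the stated item counts translate into total footage as follows (simple arithmetic from the numbers on this card, not an official statistic):

```python
ITEM_MINUTES = 5  # each video item is described as 5 minutes long

# Item counts per split, as stated in the dataset sections above.
splits = {"train": 5765, "test": 1224, "eval": 1220}

# Total hours of footage per split: count * 5 minutes / 60.
hours = {name: count * ITEM_MINUTES / 60 for name, count in splits.items()}
```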
Inference:
Acceleration Engine: TensorRT
Test Hardware:
- A100
- A6000
- RTX 6000 ADA
- RTX 6000 Pro Blackwell
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have proper rights and permissions for all input image and video content. If the input image or video includes people, personal health information, or intellectual property, the generated image or video will not blur or maintain the proportions of the subjects included.
For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy Subcards.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.