merve PRO
-
Running • 25
YOLO26
Process images with advanced object detection and segmentation
-
Running • Featured • 61
YOLO26 WebGPU
Real-time object detection & pose estimation in your browser
-
onnx-community/yolo26x-ONNX
Updated • 685 • 5
-
openvision/yoloe26-n-seg
Zero-Shot Object Detection • Updated • 167 • 2
-
Wuli-art/Qwen-Image-2512-Turbo-LoRA
Text-to-Image • Updated • 8.73k • 202
-
miromind-ai/MiroThinker-v1.5-235B
Text Generation • Updated • 560 • 249
-
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover
Image-to-Image • Updated • 12.2k • 50
-
tencent/Youtu-LLM-2B-Base
Text Generation • Updated • 1.41k • 41
-
facebook/sam3
Mask Generation • Updated • 1.74M • 1.56k
-
Running on Zero • Featured • 106
SAM3 Video Segmentation
Track and label objects in videos using text prompts or clicks
-
onnx-community/sam3-tracker-ONNX
Mask Generation • Updated • 2.56k • 28
-
Running • 23
SAM3 Tracker WebGPU
Segment and extract parts from images by clicking
-
Kwaipilot/KAT-Dev-72B-Exp
Text Generation • 73B • Updated • 21 • 158
-
LiquidAI/LFM2-8B-A1B
Text Generation • 8B • Updated • 16.5k • 301
-
yanolja/YanoljaNEXT-Rosetta-12B-2510
Translation • 12B • Updated • 302 • 30
-
NeuML/colbert-muvera-femto
Sentence Similarity • 243k • Updated • 3 • 20
-
bytedance-research/HuMo
Image-to-Video • Updated • 147 • 212
-
facebook/MobileLLM-R1-950M
Text Generation • 0.9B • Updated • 343 • 280
-
tencent/POINTS-Reader
Image-Text-to-Text • 4B • Updated • 407k • 100
-
baidu/ERNIE-4.5-21B-A3B-Thinking
Text Generation • 22B • Updated • 650 • 772
-
microsoft/VibeVoice-1.5B
Text-to-Speech • 3B • Updated • 183k • 2.21k
-
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview
Image-Text-to-Text • 0.4B • Updated • 36.5k • 82
-
apple/FastVLM-1.5B
Text Generation • 2B • Updated • 1.32k • 76
-
stepfun-ai/Step-Audio-2-mini
Any-to-Any • 8B • Updated • 2.13k • 248
-
openai/gpt-oss-120b
Text Generation • 120B • Updated • 3.33M • 4.49k
-
openai/gpt-oss-20b
Text Generation • 22B • Updated • 5.54M • 4.36k
-
openai/BrowseCompLongContext
Viewer • Updated • 295 • 666 • 46
-
baichuan-inc/Baichuan-M2-32B
Text Generation • 33B • Updated • 116k • 118
-
Wan-AI/Wan2.2-I2V-A14B
Image-to-Video • Updated • 12.3k • 605
-
allenai/olmOCR-7B-0725
Image-Text-to-Text • 8B • Updated • 436 • 64
-
Wan-AI/Wan2.2-T2V-A14B
Text-to-Video • Updated • 3.49k • 420
-
Qwen/Qwen3-235B-A22B-Thinking-2507
Text Generation • Updated • 42.4k • 398
-
nari-labs/Dia-1.6B-0626
Text-to-Speech • 2B • Updated • 25k • 124
-
google/gemma-3n-E4B-it
Image-Text-to-Text • Updated • 128k • 865
-
ByteDance/XVerse
Text-to-Image • Updated • 55 • 89
-
nvidia/llama-nemoretriever-colembed-3b-v1
Visual Document Retrieval • Updated • 677 • 74
-
opendatalab/OmniDocBench
Viewer • Updated • 1.36k • 10.7k • 68
-
nanonets/Nanonets-OCR-s
Image-Text-to-Text • 4B • Updated • 28.8k • 1.58k
-
echo840/MonkeyOCR
Image-Text-to-Text • Updated • 258 • 514
-
Running on Zero • MCP • Featured • 140
Multimodal OCR2
nanonets ocr / smoldocling / monkey ocr / typhoon ocr
-
ByteDance-Seed/BAGEL-7B-MoT
Any-to-Any • 15B • Updated • 597 • 1.18k
-
mistralai/Devstral-Small-2505
24B • Updated • 61.9k • 861
-
ByteDance/Dolphin
Image-Text-to-Text • Updated • 3.12k • 513
-
moondream/moondream-2b-2025-04-14-4bit
Image-Text-to-Text • 1B • Updated • 4.62k • 62
-
moonshotai/Kimi-VL-A3B-Thinking
Image-Text-to-Text • 16B • Updated • 79.3k • 445
-
agentica-org/DeepCoder-14B-Preview
Text Generation • 15B • Updated • 396 • 680
-
HiDream-ai/HiDream-I1-Full
Text-to-Image • Updated • 24.6k • 986
-
OpenGVLab/InternVL3-78B
Image-Text-to-Text • Updated • 148k • 231
-
OpenGVLab/InternVideo2_5_Chat_8B
Video-Text-to-Text • 8B • Updated • 2.03k • 88
-
AIDC-AI/Ovis2-34B
Image-Text-to-Text • 35B • Updated • 491 • 152
-
open-r1/OpenR1-Qwen-7B
Text Generation • 8B • Updated • 20 • 54
-
nomic-ai/nomic-embed-text-v2-moe
Sentence Similarity • 0.5B • Updated • 1.26M • 452
-
ostris/Flex.1-alpha
Text-to-Image • Updated • 524 • 481
-
Qwen/Qwen2.5-Math-PRM-72B
Text Classification • 73B • Updated • 248 • 72
-
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text • 0.5B • Updated • 15.8k • 186
-
deepseek-ai/DeepSeek-R1
Text Generation • 685B • Updated • 538k • 13k
-
HuggingFaceTB/SmolVLM-Instruct
Image-Text-to-Text • Updated • 38.3k • 577
-
Qwen/QwQ-32B-Preview
Text Generation • 33B • Updated • 6.69k • 1.74k
-
nvidia/Hymba-1.5B-Base
Text Generation • 2B • Updated • 462 • 157
-
vidore/colsmolvlm-v0.1
Visual Document Retrieval • Updated • 4 • 55
-
microsoft/LLM2CLIP-EVA02-L-14-336
Zero-Shot Image Classification • Updated • 46 • 60
-
microsoft/LLM2CLIP-EVA02-B-16
Updated • 48 • 10
-
PleIAs/common_corpus
Viewer • Updated • 517M • 43.7k • 343
-
Qwen/Qwen2.5-Coder-32B-Instruct
Text Generation • 33B • Updated • 700k • 1.99k
-
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 74
-
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 19
-
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 48
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 121
-
Runtime error • Featured • 100
LOTUS Normal
Generate high-quality predictions from images
-
Runtime error • 78
LOTUS Depth
Generate depth maps from images and videos
-
jingheya/lotus-depth-g-v1-0
Depth Estimation • Updated • 8.07k • 27
-
jingheya/lotus-depth-d-v1-0
Depth Estimation • Updated • 259 • 5
-
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 560k • 102
-
google/flan-t5-xl
Updated • 169k • 527
-
google/siglip-large-patch16-384
Zero-Shot Image Classification • 0.7B • Updated • 14.9k • 11
-
google/vit-huge-patch14-224-in21k
Image Feature Extraction • 0.6B • Updated • 76.2k • 22
-
facebook/deit-base-distilled-patch16-384
Image Classification • 87.6M • Updated • 58.9k • 7
-
facebook/convnextv2-base-1k-224
Image Classification • 88.7M • Updated • 1.44k • 4
-
facebook/deit-base-distilled-patch16-224
Image Classification • Updated • 5.88k • 32
-
google/vit-base-patch32-384
Image Classification • 88.3M • Updated • 6.81k • 23
-
facebook/maskformer-swin-large-coco
Image Segmentation • 0.2B • Updated • 622 • 27
-
nvidia/segformer-b0-finetuned-ade-512-512
Image Segmentation • 3.75M • Updated • 561k • 179
-
facebook/detr-resnet-50-dc5-panoptic
Image Segmentation • 43M • Updated • 17 • 3
-
nvidia/segformer-b5-finetuned-cityscapes-1024-1024
Image Segmentation • Updated • 114k • 38
-
timbrooks/instruct-pix2pix
Image-to-Image • Updated • 115k • 1.17k
-
TencentARC/t2i-adapter-canny-sdxl-1.0
Image-to-Image • Updated • 3.43k • 52
-
TencentARC/t2i-adapter-sketch-sdxl-1.0
Image-to-Image • Updated • 4.62k • 75
-
CrucibleAI/ControlNetMediaPipeFace
Image-to-Image • Updated • 1.05k • 575
-
Salesforce/blip-image-captioning-large
Image-to-Text • 0.5B • Updated • 680k • 1.45k
-
Salesforce/blip-image-captioning-base
Image-to-Text • Updated • 2.29M • 842
-
microsoft/trocr-base-handwritten
Image-to-Text • 0.3B • Updated • 238k • 476
-
microsoft/git-large-coco
Image-to-Text • 0.4B • Updated • 3.21k • 104
-
Running • 114
Grounding DINO Demo
Cutting edge open-vocabulary object detection app
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 41
BLIP2 with transformers
BLIP2 (cutting edge image captioning) in 🤗 transformers
-
Build error • Featured • 378
IDEFICS Playground
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 64
Owl Tracking
Powerful foundation model for zero-shot object tracking
-
Running • 26
Search and Detect (CLIP/OWL-ViT)
Search and detect objects in images using text queries
-
Running on Zero • Featured • 109
OWLSAM
State-of-the-art open-vocabulary image segmentation ⚡
-
Runtime error • Featured • 84
UDOP
Generate text from document images
-
Runtime error • 40
Pix2struct
Play with all the pix2struct variants in this d
-
Running • 26
Compare Docvqa Models
Compare different visual question answering
-
Runtime error • Featured • 289
DocQuery - Document Query Engine
-
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 39
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 49
-
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 11
-
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • Published • 27
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 100k • 145
-
google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 13.1k • 13
-
google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 17.5k • 29
-
google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 143k • 29
-
depth-anything/Depth-Anything-V2-Small
Depth Estimation • Updated • 9.22k • 76
-
depth-anything/Depth-Anything-V2-Large
Depth Estimation • Updated • 88.5k • 150
-
Running on Zero • 627
Depth Anything V2
Generate depth maps and 3D perception from any image
-
depth-anything/DA-2K
Viewer • Updated • 1.04k • 321 • 16
-
Running • 196
Vidore Leaderboard
Compare and rank visual document retrieval models across different benchmarks
-
Running on CPU Upgrade • 990
Open VLM Leaderboard
VLMEvalKit Evaluation Results Collection
-
Running • Featured • 560
Vision Arena (Testing VLMs side-by-side)
Compare vision models on your images instantly
-
Running • Featured • 85
SEED-Bench Leaderboard
Submit model evaluation results to leaderboard
-
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 23.9k • 113
-
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • Updated • 1.68M • 1.26k
-
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text • Updated • 2.12M • 487
-
Qwen/Qwen2-72B-Instruct
Text Generation • 73B • Updated • 32k • 719
-
google/translategemma-27b-it
Image-Text-to-Text • Updated • 40.9k • 304
-
kakaocorp/kanana-2-30b-a3b-mid-2601
Text Generation • 31B • Updated • 104 • 30
-
black-forest-labs/FLUX.2-klein-base-4B
Image-to-Image • Updated • 61.6k • 84
-
google/translategemma-12b-it
Image-Text-to-Text • Updated • 307k • 250
-
PekingU/rtdetr_v2_r50vd
Object Detection • 43M • Updated • 16.2k • 26
-
ustc-community/dfine-xlarge-obj365
Object Detection • 63.4M • Updated • 1.04k • 4
-
PekingU/rtdetr_v2_r101vd
Object Detection • 76.8M • Updated • 5.3k • 13
-
Running on T4 • 120
RF-DETR
SOTA real-time object detection model
-
facebook/metaclip-2-worldwide-s16
Zero-Shot Image Classification • 0.4B • Updated • 52 • 8
-
facebook/metaclip-2-worldwide-m16
Zero-Shot Image Classification • 0.5B • Updated • 58 • 3
-
facebook/metaclip-2-worldwide-l14
Zero-Shot Image Classification • 1B • Updated • 149k • 12
-
facebook/metaclip-2-worldwide-b32
Zero-Shot Image Classification • 0.6B • Updated • 138 • 6
-
openbmb/MiniCPM4.1-8B
Text Generation • Updated • 18.5k • 382
-
tencent/Hunyuan-MT-7B
Translation • 8B • Updated • 11.9k • 549
-
google/embeddinggemma-300m
Sentence Similarity • Updated • 1.33M • 1.47k
-
moonshotai/Kimi-K2-Instruct-0905
Text Generation • Updated • 13.9k • 671
-
stepfun-ai/step3
Image-Text-to-Text • 321B • Updated • 33.6k • 166
-
nunchaku-ai/nunchaku-flux.1-krea-dev
Text-to-Image • Updated • 9.31k • 119
-
fdtn-ai/Foundation-Sec-8B-Instruct
Text Generation • 8B • Updated • 5.49k • 67
-
Wan-AI/Wan2.2-TI2V-5B-Diffusers
Text-to-Video • Updated • 35.1k • 108
-
nvidia/OpenReasoning-Nemotron-32B
Text Generation • 33B • Updated • 720 • 122
-
ByteDance-Seed/Seed-X-RM-7B
Translation • Updated • 115 • 30
-
LGAI-EXAONE/EXAONE-4.0-32B
Text Generation • 32B • Updated • 12.6k • 277
-
vidore/colqwen-omni-v0.1
Visual Document Retrieval • Updated • 3.53k • 93
-
Qwen/WorldPM-72B
Text Classification • 73B • Updated • 68 • 81
-
Running on Zero • MCP • Featured • 1.48k
LTX Video Fast
ultra-fast video model, LTX 0.9.8 13B distilled
-
BLIP3o/BLIP3o-Pretrain-Long-Caption
Viewer • Updated • 27.2M • 32.6k • 57
-
BLIP3o/BLIP3o-Model-8B
14B • Updated • 578 • 101
-
OpenGVLab/InternVL3-1B-hf
Image-Text-to-Text • 0.9B • Updated • 109k • 10
-
OpenGVLab/InternVL3-2B-hf
Image-Text-to-Text • 2B • Updated • 8.07k • 3
-
OpenGVLab/InternVL3-8B-hf
Image-Text-to-Text • 8B • Updated • 13.7k • 9
-
OpenGVLab/InternVL3-14B-hf
Image-Text-to-Text • 15B • Updated • 3.77k
-
deepseek-ai/DeepSeek-V3-0324
Text Generation • 685B • Updated • 236k • 3.09k
-
Qwen/Qwen2.5-Omni-7B
Any-to-Any • Updated • 302k • 1.86k
-
google/txgemma-27b-chat
Text Generation • 27B • Updated • 967 • 58
-
Running • Featured • 366
Qwen2.5 Omni 7B Demo
Chat with an AI using text, audio, image, or video and hear responses
-
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • Updated • 1.68M • 1.26k
-
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text • Updated • 2.12M • 487
-
CohereLabs/aya-vision-8b
Image-Text-to-Text • 9B • Updated • 45.3k • 316
-
CohereLabs/aya-vision-32b
Image-Text-to-Text • Updated • 176 • 221
-
Running on Zero • 266
Qwen2-VL-7B
Answer questions about any uploaded image
-
Running • 67
UI-TARS
Find click coordinates on images based on instructions
-
Running • 98
Qwen2.5-1M Demo
Ask questions about your uploaded documents instantly
-
Qwen/Qwen2.5-14B-Instruct-1M
Text Generation • 15B • Updated • 10.3k • 333
-
meta-llama/Llama-3.3-70B-Instruct
Text Generation • Updated • 867k • 2.66k
-
Qwen/Qwen2-VL-72B
Image-Text-to-Text • 73B • Updated • 109 • 80
-
google/paligemma2-3b-pt-224
Image-Text-to-Text • Updated • 28.7k • 162
-
tencent/HunyuanVideo
Text-to-Video • Updated • 1.12k • 2.12k
-
ibm-granite/granite-3.0-8b-instruct
Text Generation • Updated • 17.1k • 205
-
ibm-granite/granite-3.0-2b-instruct
Text Generation • 3B • Updated • 4.21k • 47
-
CohereLabs/aya-expanse-8b
Text Generation • 8B • Updated • 84.6k • 421
-
CohereLabs/aya-expanse-32b
Text Generation • 32B • Updated • 5.08k • 288
-
Running on Zero • Featured • 198
DepthCrafter
a super consistent video depth model
-
Paused • Featured • 223
Depth Pro
Generate an inverse depth map from an image
-
Runtime error • 78
LOTUS Depth
Generate depth maps from images and videos
-
apple/DepthPro
Depth Estimation • Updated • 3.59k • 499
-
microsoft/resnet-50
Image Classification • Updated • 179k • 478
-
google/vit-base-patch16-224-in21k
Image Feature Extraction • 86.4M • Updated • 1.1M • 393
-
google/vit-base-patch32-224-in21k
Image Feature Extraction • 88M • Updated • 6.36k • 19
-
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 560k • 102
-
facebook/detr-resnet-50
Object Detection • 41.6M • Updated • 497k • 933
-
facebook/detr-resnet-101-dc5
Object Detection • 60.7M • Updated • 1.76k • 19
-
facebook/detr-resnet-50-dc5
Object Detection • 41.6M • Updated • 1.82k • 6
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 100k • 145
-
openai/clip-vit-large-patch14
Zero-Shot Image Classification • 0.4B • Updated • 7.71M • 1.96k
-
openai/clip-vit-base-patch32
Zero-Shot Image Classification • Updated • 17.2M • 857
-
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Zero-Shot Image Classification • Updated • 52.6k • 306
-
kakaobrain/align-base
Zero-Shot Image Classification • Updated • 10.8k • 30
-
microsoft/xclip-base-patch32
Video Classification • 0.2B • Updated • 158k • 108
-
facebook/timesformer-base-finetuned-k400
Video Classification • Updated • 35.1k • 42
-
facebook/timesformer-base-finetuned-k600
Video Classification • Updated • 831 • 12
-
google/vivit-b-16x2
Video Classification • Updated • 2.16k • 11
-
stabilityai/stable-diffusion-xl-base-1.0
Text-to-Image • Updated • 2.05M • 7.43k
-
warp-ai/wuerstchen
Text-to-Image • Updated • 150 • 176
-
Deci/DeciDiffusion-v1-0
Text-to-Image • Updated • 27 • 140
-
stabilityai/stable-diffusion-xl-refiner-1.0
Image-to-Image • Updated • 332k • 2.02k
-
Running on Zero • Featured • 72
Draw To Search Art
Draw/upload image and search among WikiART using SigLIP
-
Running on CPU Upgrade • 23
Compare Clip Siglip
Compare strong zero-shot image classification models
-
Running on Zero • 13
Multilingual Zero Shot Image Clf
Comparing powerful multilingual zero-shot image clf models
-
BAAI/bunny-phi-2-siglip-lora
Text Generation • Updated • 251 • 48
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 100k • 145
-
google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 13.1k • 13
-
google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 17.5k • 29
-
google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 143k • 29
-
Paused • 21
Video Llava
Generate descriptions by uploading images or videos
-
llava-hf/LLaVA-NeXT-Video-7B-hf
Video-Text-to-Text • 7B • Updated • 87.4k • 122
-
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf
Video-Text-to-Text • 7B • Updated • 1.38k • 11
-
llava-hf/LLaVA-NeXT-Video-7B-32K-hf
Image-Text-to-Text • 8B • Updated • 449 • 8
-
NVEagle/Eagle-X5-13B
Image-Text-to-Text • 15B • Updated • 4 • 15
-
NVEagle/Eagle-X5-13B-Chat
Image-Text-to-Text • 15B • Updated • 4 • 28
-
NVEagle/Eagle-X5-7B
Image-Text-to-Text • 9B • Updated • 26 • 26
-
Runtime error • 64
Eagle X5 13B Chat
Combine text and images to generate responses
-
google/translategemma-27b-it
Image-Text-to-Text β’ Updated β’ 40.9k β’ 304 -
kakaocorp/kanana-2-30b-a3b-mid-2601
Text Generation β’ 31B β’ Updated β’ 104 β’ 30 -
black-forest-labs/FLUX.2-klein-base-4B
Image-to-Image β’ Updated β’ 61.6k β’ β’ 84 -
google/translategemma-12b-it
Image-Text-to-Text β’ Updated β’ 307k β’ 250
-
Running25
YOLO26
π25Process images with advanced object detection and segmentation
-
RunningFeatured61
YOLO26 WebGPU
π61Real-time object detection & pose estimation in your browser
-
onnx-community/yolo26x-ONNX
Updated β’ 685 β’ 5 -
openvision/yoloe26-n-seg
Zero-Shot Object Detection β’ Updated β’ 167 β’ 2
-
Wuli-art/Qwen-Image-2512-Turbo-LoRA
Text-to-Image β’ Updated β’ 8.73k β’ 202 -
miromind-ai/MiroThinker-v1.5-235B
Text Generation β’ Updated β’ 560 β’ 249 -
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover
Image-to-Image β’ Updated β’ 12.2k β’ β’ 50 -
tencent/Youtu-LLM-2B-Base
Text Generation β’ Updated β’ 1.41k β’ 41
-
PekingU/rtdetr_v2_r50vd
Object Detection β’ 43M β’ Updated β’ 16.2k β’ 26 -
ustc-community/dfine-xlarge-obj365
Object Detection β’ 63.4M β’ Updated β’ 1.04k β’ 4 -
PekingU/rtdetr_v2_r101vd
Object Detection β’ 76.8M β’ Updated β’ 5.3k β’ 13 -
Running on T4120
RF-DETR
π₯120SOTA real-time object detection model
-
facebook/sam3
Mask Generation β’ Updated β’ 1.74M β’ 1.56k -
Running on ZeroFeatured106
SAM3 Video Segmentation
π106Track and label objects in videos using text prompts or clicks
-
onnx-community/sam3-tracker-ONNX
Mask Generation β’ Updated β’ 2.56k β’ 28 -
Running23
SAM3 Tracker WebGPU
π―23Segment and extract parts from images by clicking
-
facebook/metaclip-2-worldwide-s16
Zero-Shot Image Classification β’ 0.4B β’ Updated β’ 52 β’ 8 -
facebook/metaclip-2-worldwide-m16
Zero-Shot Image Classification β’ 0.5B β’ Updated β’ 58 β’ 3 -
facebook/metaclip-2-worldwide-l14
Zero-Shot Image Classification β’ 1B β’ Updated β’ 149k β’ 12 -
facebook/metaclip-2-worldwide-b32
Zero-Shot Image Classification β’ 0.6B β’ Updated β’ 138 β’ 6
-
Kwaipilot/KAT-Dev-72B-Exp
Text Generation β’ 73B β’ Updated β’ 21 β’ 158 -
LiquidAI/LFM2-8B-A1B
Text Generation β’ 8B β’ Updated β’ 16.5k β’ 301 -
yanolja/YanoljaNEXT-Rosetta-12B-2510
Translation β’ 12B β’ Updated β’ 302 β’ 30 -
NeuML/colbert-muvera-femto
Sentence Similarity β’ 243k β’ Updated β’ 3 β’ 20
-
bytedance-research/HuMo
Image-to-Video β’ Updated β’ 147 β’ 212 -
facebook/MobileLLM-R1-950M
Text Generation β’ 0.9B β’ Updated β’ 343 β’ 280 -
tencent/POINTS-Reader
Image-Text-to-Text β’ 4B β’ Updated β’ 407k β’ 100 -
baidu/ERNIE-4.5-21B-A3B-Thinking
Text Generation β’ 22B β’ Updated β’ 650 β’ β’ 772
-
openbmb/MiniCPM4.1-8B
Text Generation β’ Updated β’ 18.5k β’ 382 -
tencent/Hunyuan-MT-7B
Translation β’ 8B β’ Updated β’ 11.9k β’ 549 -
google/embeddinggemma-300m
Sentence Similarity β’ Updated β’ 1.33M β’ β’ 1.47k -
moonshotai/Kimi-K2-Instruct-0905
Text Generation β’ Updated β’ 13.9k β’ β’ 671
-
microsoft/VibeVoice-1.5B
Text-to-Speech β’ 3B β’ Updated β’ 183k β’ 2.21k -
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview
Image-Text-to-Text β’ 0.4B β’ Updated β’ 36.5k β’ 82 -
apple/FastVLM-1.5B
Text Generation β’ 2B β’ Updated β’ 1.32k β’ 76 -
stepfun-ai/Step-Audio-2-mini
Any-to-Any β’ 8B β’ Updated β’ 2.13k β’ 248
-
openai/gpt-oss-120b
Text Generation β’ 120B β’ Updated β’ 3.33M β’ β’ 4.49k -
openai/gpt-oss-20b
Text Generation β’ 22B β’ Updated β’ 5.54M β’ β’ 4.36k -
openai/BrowseCompLongContext
Viewer β’ Updated β’ 295 β’ 666 β’ 46 -
baichuan-inc/Baichuan-M2-32B
Text Generation β’ 33B β’ Updated β’ 116k β’ β’ 118
-
stepfun-ai/step3
Image-Text-to-Text β’ 321B β’ Updated β’ 33.6k β’ 166 -
nunchaku-ai/nunchaku-flux.1-krea-dev
Text-to-Image β’ Updated β’ 9.31k β’ 119 -
fdtn-ai/Foundation-Sec-8B-Instruct
Text Generation β’ 8B β’ Updated β’ 5.49k β’ β’ 67 -
Wan-AI/Wan2.2-TI2V-5B-Diffusers
Text-to-Video β’ Updated β’ 35.1k β’ 108
-
Wan-AI/Wan2.2-I2V-A14B
Image-to-Video β’ Updated β’ 12.3k β’ β’ 605 -
allenai/olmOCR-7B-0725
Image-Text-to-Text β’ 8B β’ Updated β’ 436 β’ 64 -
Wan-AI/Wan2.2-T2V-A14B
Text-to-Video β’ Updated β’ 3.49k β’ β’ 420 -
Qwen/Qwen3-235B-A22B-Thinking-2507
Text Generation β’ Updated β’ 42.4k β’ β’ 398
-
nvidia/OpenReasoning-Nemotron-32B
Text Generation β’ 33B β’ Updated β’ 720 β’ β’ 122 -
ByteDance-Seed/Seed-X-RM-7B
Translation β’ Updated β’ 115 β’ 30 -
LGAI-EXAONE/EXAONE-4.0-32B
Text Generation β’ 32B β’ Updated β’ 12.6k β’ 277 -
vidore/colqwen-omni-v0.1
Visual Document Retrieval β’ Updated β’ 3.53k β’ 93
-
nari-labs/Dia-1.6B-0626
Text-to-Speech β’ 2B β’ Updated β’ 25k β’ 124 -
google/gemma-3n-E4B-it
Image-Text-to-Text β’ Updated β’ 128k β’ β’ 865 -
ByteDance/XVerse
Text-to-Image β’ Updated β’ 55 β’ 89 -
nvidia/llama-nemoretriever-colembed-3b-v1
Visual Document Retrieval β’ Updated β’ 677 β’ 74
-
opendatalab/OmniDocBench
Viewer β’ Updated β’ 1.36k β’ 10.7k β’ 68 -
nanonets/Nanonets-OCR-s
Image-Text-to-Text β’ 4B β’ Updated β’ 28.8k β’ 1.58k -
echo840/MonkeyOCR
Image-Text-to-Text β’ Updated β’ 258 β’ 514 -
Running on ZeroMCPFeatured140
Multimodal OCR2
π»140nanonets ocr / smoldocling / monkey ocr / typhoon ocr
-
ByteDance-Seed/BAGEL-7B-MoT
Any-to-Any β’ 15B β’ Updated β’ 597 β’ 1.18k -
mistralai/Devstral-Small-2505
24B β’ Updated β’ 61.9k β’ 861 -
ByteDance/Dolphin
Image-Text-to-Text β’ Updated β’ 3.12k β’ 513 -
moondream/moondream-2b-2025-04-14-4bit
Image-Text-to-Text β’ 1B β’ Updated β’ 4.62k β’ 62
-
Qwen/WorldPM-72B
Text Classification β’ 73B β’ Updated β’ 68 β’ 81 -
Running on ZeroMCPFeatured1.48k
LTX Video Fast
π₯1.48kultra-fast video model, LTX 0.9.8 13B distilled
-
BLIP3o/BLIP3o-Pretrain-Long-Caption
Viewer β’ Updated β’ 27.2M β’ 32.6k β’ 57 -
BLIP3o/BLIP3o-Model-8B
14B β’ Updated β’ 578 β’ 101
-
OpenGVLab/InternVL3-1B-hf
Image-Text-to-Text β’ 0.9B β’ Updated β’ 109k β’ 10 -
OpenGVLab/InternVL3-2B-hf
Image-Text-to-Text β’ 2B β’ Updated β’ 8.07k β’ 3 -
OpenGVLab/InternVL3-8B-hf
Image-Text-to-Text β’ 8B β’ Updated β’ 13.7k β’ 9 -
OpenGVLab/InternVL3-14B-hf
Image-Text-to-Text β’ 15B β’ Updated β’ 3.77k
-
moonshotai/Kimi-VL-A3B-Thinking
Image-Text-to-Text β’ 16B β’ Updated β’ 79.3k β’ 445 -
agentica-org/DeepCoder-14B-Preview
Text Generation β’ 15B β’ Updated β’ 396 β’ β’ 680 -
HiDream-ai/HiDream-I1-Full
Text-to-Image β’ Updated β’ 24.6k β’ β’ 986 -
OpenGVLab/InternVL3-78B
Image-Text-to-Text β’ Updated β’ 148k β’ 231
-
deepseek-ai/DeepSeek-V3-0324
Text Generation β’ 685B β’ Updated β’ 236k β’ β’ 3.09k -
Qwen/Qwen2.5-Omni-7B
Any-to-Any β’ Updated β’ 302k β’ 1.86k -
google/txgemma-27b-chat
Text Generation β’ 27B β’ Updated β’ 967 β’ 58 -
RunningFeatured366
Qwen2.5 Omni 7B Demo
π366Chat with an AI using text, audio, image, or video and hear responses
-
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text β’ Updated β’ 1.68M β’ β’ 1.26k -
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text β’ Updated β’ 2.12M β’ 487 -
CohereLabs/aya-vision-8b
Image-Text-to-Text β’ 9B β’ Updated β’ 45.3k β’ 316 -
CohereLabs/aya-vision-32b
Image-Text-to-Text β’ Updated β’ 176 β’ β’ 221
-
OpenGVLab/InternVideo2_5_Chat_8B
Video-Text-to-Text β’ 8B β’ Updated β’ 2.03k β’ 88 -
AIDC-AI/Ovis2-34B
Image-Text-to-Text β’ 35B β’ Updated β’ 491 β’ 152 -
open-r1/OpenR1-Qwen-7B
Text Generation β’ 8B β’ Updated β’ 20 β’ β’ 54 -
nomic-ai/nomic-embed-text-v2-moe
Sentence Similarity β’ 0.5B β’ Updated β’ 1.26M β’ 452
-
Running on Zero266
Qwen2-VL-7B
π₯266Answer questions about any uploaded image
-
Running67
UI-TARS
π67Find click coordinates on images based on instructions
-
Running98
Qwen2.5-1M Demo
π»98Ask questions about your uploaded documents instantly
-
Qwen/Qwen2.5-14B-Instruct-1M
Text Generation β’ 15B β’ Updated β’ 10.3k β’ β’ 333
-
ostris/Flex.1-alpha
Text-to-Image β’ Updated β’ 524 β’ 481 -
Qwen/Qwen2.5-Math-PRM-72B
Text Classification β’ 73B β’ Updated β’ 248 β’ 72 -
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text β’ 0.5B β’ Updated β’ 15.8k β’ 186 -
deepseek-ai/DeepSeek-R1
Text Generation β’ 685B β’ Updated β’ 538k β’ β’ 13k
-
meta-llama/Llama-3.3-70B-Instruct
Text Generation β’ Updated β’ 867k β’ β’ 2.66k -
Qwen/Qwen2-VL-72B
Image-Text-to-Text β’ 73B β’ Updated β’ 109 β’ 80 -
google/paligemma2-3b-pt-224
Image-Text-to-Text β’ Updated β’ 28.7k β’ 162 -
tencent/HunyuanVideo
Text-to-Video β’ Updated β’ 1.12k β’ β’ 2.12k
-
HuggingFaceTB/SmolVLM-Instruct
Image-Text-to-Text β’ Updated β’ 38.3k β’ 577 -
Qwen/QwQ-32B-Preview
Text Generation β’ 33B β’ Updated β’ 6.69k β’ β’ 1.74k -
nvidia/Hymba-1.5B-Base
Text Generation β’ 2B β’ Updated β’ 462 β’ 157 -
vidore/colsmolvlm-v0.1
Visual Document Retrieval β’ Updated β’ 4 β’ 55
-
microsoft/LLM2CLIP-EVA02-L-14-336
Zero-Shot Image Classification β’ Updated β’ 46 β’ 60 -
microsoft/LLM2CLIP-EVA02-B-16
Updated β’ 48 β’ 10 -
PleIAs/common_corpus
Viewer β’ Updated β’ 517M β’ 43.7k β’ 343 -
Qwen/Qwen2.5-Coder-32B-Instruct
Text Generation β’ 33B β’ Updated β’ 700k β’ β’ 1.99k
-
NVLM: Open Frontier-Class Multimodal LLMs
Paper β’ 2409.11402 β’ Published β’ 74 -
BRAVE: Broadening the visual encoding of vision-language models
Paper β’ 2404.07204 β’ Published β’ 19 -
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper β’ 2403.18814 β’ Published β’ 48 -
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper β’ 2409.17146 β’ Published β’ 121
-
ibm-granite/granite-3.0-8b-instruct
Text Generation β’ Updated β’ 17.1k β’ 205 -
ibm-granite/granite-3.0-2b-instruct
Text Generation β’ 3B β’ Updated β’ 4.21k β’ 47 -
CohereLabs/aya-expanse-8b
Text Generation β’ 8B β’ Updated β’ 84.6k β’ 421 -
CohereLabs/aya-expanse-32b
Text Generation β’ 32B β’ Updated β’ 5.08k β’ β’ 288
-
Runtime errorFeatured100
LOTUS Normal
π100Generate high-quality predictions from images
-
Runtime error78
LOTUS Depth
π78Generate depth maps from images and videos
-
jingheya/lotus-depth-g-v1-0
Depth Estimation β’ Updated β’ 8.07k β’ 27 -
jingheya/lotus-depth-d-v1-0
Depth Estimation β’ Updated β’ 259 β’ 5
-
Running on ZeroFeatured198
DepthCrafter
π¦198a super consistent video depth model
-
PausedFeatured223
Depth Pro
π223Generate an inverse depth map from an image
-
Runtime error78
LOTUS Depth
π78Generate depth maps from images and videos
-
apple/DepthPro
Depth Estimation β’ Updated β’ 3.59k β’ 499
-
facebook/dinov2-large
Image Feature Extraction β’ 0.3B β’ Updated β’ 560k β’ 102 -
google/flan-t5-xl
Updated β’ 169k β’ 527 -
google/siglip-large-patch16-384
Zero-Shot Image Classification β’ 0.7B β’ Updated β’ 14.9k β’ 11 -
google/vit-huge-patch14-224-in21k
Image Feature Extraction β’ 0.6B β’ Updated β’ 76.2k β’ 22
-
microsoft/resnet-50
Image Classification β’ Updated β’ 179k β’ β’ 478 -
google/vit-base-patch16-224-in21k
Image Feature Extraction β’ 86.4M β’ Updated β’ 1.1M β’ 393 -
google/vit-base-patch32-224-in21k
Image Feature Extraction β’ 88M β’ Updated β’ 6.36k β’ 19 -
facebook/dinov2-large
Image Feature Extraction β’ 0.3B β’ Updated β’ 560k β’ 102
-
facebook/deit-base-distilled-patch16-384
Image Classification β’ 87.6M β’ Updated β’ 58.9k β’ 7 -
facebook/convnextv2-base-1k-224
Image Classification β’ 88.7M β’ Updated β’ 1.44k β’ 4 -
facebook/deit-base-distilled-patch16-224
Image Classification β’ Updated β’ 5.88k β’ 32 -
google/vit-base-patch32-384
Image Classification β’ 88.3M β’ Updated β’ 6.81k β’ 23
-
facebook/detr-resnet-50
Object Detection β’ 41.6M β’ Updated β’ 497k β’ β’ 933 -
facebook/detr-resnet-101-dc5
Object Detection β’ 60.7M β’ Updated β’ 1.76k β’ 19 -
facebook/detr-resnet-50-dc5
Object Detection β’ 41.6M β’ Updated β’ 1.82k β’ 6 -
google/owlvit-base-patch32
Zero-Shot Object Detection β’ 0.2B β’ Updated β’ 100k β’ 145
-
facebook/maskformer-swin-large-coco
Image Segmentation β’ 0.2B β’ Updated β’ 622 β’ β’ 27 -
nvidia/segformer-b0-finetuned-ade-512-512
Image Segmentation β’ 3.75M β’ Updated β’ 561k β’ β’ 179 -
facebook/detr-resnet-50-dc5-panoptic
Image Segmentation β’ 43M β’ Updated β’ 17 β’ 3 -
nvidia/segformer-b5-finetuned-cityscapes-1024-1024
Image Segmentation β’ Updated β’ 114k β’ β’ 38
-
openai/clip-vit-large-patch14
Zero-Shot Image Classification β’ 0.4B β’ Updated β’ 7.71M β’ 1.96k -
openai/clip-vit-base-patch32
Zero-Shot Image Classification β’ Updated β’ 17.2M β’ 857 -
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Zero-Shot Image Classification β’ Updated β’ 52.6k β’ 306 -
kakaobrain/align-base
Zero-Shot Image Classification β’ Updated β’ 10.8k β’ 30
-
timbrooks/instruct-pix2pix
Image-to-Image β’ Updated β’ 115k β’ 1.17k -
TencentARC/t2i-adapter-canny-sdxl-1.0
Image-to-Image β’ Updated β’ 3.43k β’ 52 -
TencentARC/t2i-adapter-sketch-sdxl-1.0
Image-to-Image β’ Updated β’ 4.62k β’ 75 -
CrucibleAI/ControlNetMediaPipeFace
Image-to-Image β’ Updated β’ 1.05k β’ 575
-
microsoft/xclip-base-patch32
Video Classification β’ 0.2B β’ Updated β’ 158k β’ 108 -
facebook/timesformer-base-finetuned-k400
Video Classification β’ Updated β’ 35.1k β’ 42 -
facebook/timesformer-base-finetuned-k600
Video Classification β’ Updated β’ 831 β’ 12 -
google/vivit-b-16x2
Video Classification β’ Updated β’ 2.16k β’ 11
-
Salesforce/blip-image-captioning-large
Image-to-Text β’ 0.5B β’ Updated β’ 680k β’ 1.45k -
Salesforce/blip-image-captioning-base
Image-to-Text β’ Updated β’ 2.29M β’ 842 -
microsoft/trocr-base-handwritten
Image-to-Text β’ 0.3B β’ Updated β’ 238k β’ 476 -
microsoft/git-large-coco
Image-to-Text β’ 0.4B β’ Updated β’ 3.21k β’ 104
-
stabilityai/stable-diffusion-xl-base-1.0
Text-to-Image β’ Updated β’ 2.05M β’ β’ 7.43k -
warp-ai/wuerstchen
Text-to-Image β’ Updated β’ 150 β’ 176 -
Deci/DeciDiffusion-v1-0
Text-to-Image β’ Updated β’ 27 β’ 140 -
stabilityai/stable-diffusion-xl-refiner-1.0
Image-to-Image β’ Updated β’ 332k β’ 2.02k
-
Running • 114
Grounding DINO Demo
Cutting edge open-vocabulary object detection app
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 41
BLIP2 with transformers
BLIP2 (cutting edge image captioning) in 🤗 transformers
-
Build error • Featured • 378
IDEFICS Playground
-
Running • Featured • 96
Owlv2
State-of-the-art Zero-shot Object Detection
-
Runtime error • Featured • 64
Owl Tracking
Powerful foundation model for zero-shot object tracking
-
Running • 26
Search and Detect (CLIP/OWL-ViT)
Search and detect objects in images using text queries
-
Running on Zero • Featured • 109
OWLSAM
State-of-the-art open-vocabulary image segmentation ⚡️
-
Running on Zero • Featured • 72
Draw To Search Art
Draw/upload image and search among WikiART using SigLIP
-
Running on CPU Upgrade • 23
Compare Clip Siglip
Compare strong zero-shot image classification models
-
Running on Zero • 13
Multilingual Zero Shot Image Clf
Comparing powerful multilingual zero-shot image clf models
-
BAAI/bunny-phi-2-siglip-lora
Text Generation • Updated • 251 • 48
-
Runtime error • Featured • 84
UDOP
Generate text from document images
-
Runtime error • 40
Pix2struct
Play with all the pix2struct variants in this d
-
Running • 26
Compare Docvqa Models
Compare different visual question answering
-
Runtime error • Featured • 289
DocQuery - Document Query Engine
-
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 39 -
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 49 -
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 11 -
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • Published • 27
-
google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 100k • 145 -
google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 13.1k • 13 -
google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 17.5k • 29 -
google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 143k • 29
-
depth-anything/Depth-Anything-V2-Small
Depth Estimation • Updated • 9.22k • 76 -
depth-anything/Depth-Anything-V2-Large
Depth Estimation • Updated • 88.5k • 150 -
Running on Zero • 627
Depth Anything V2
Generate depth maps and 3D perception from any image
-
depth-anything/DA-2K
Viewer • Updated • 1.04k • 321 • 16
-
Running • 196
Vidore Leaderboard
Compare and rank visual document retrieval models across different benchmarks
-
Running on CPU Upgrade • 990
Open VLM Leaderboard
VLMEvalKit Evaluation Results Collection
-
Running • Featured • 560
Vision Arena (Testing VLMs side-by-side)
Compare vision models on your images instantly
-
Running • Featured • 85
SEED-Bench Leaderboard
Submit model evaluation results to leaderboard
-
Paused • 21
Video Llava
Generate descriptions by uploading images or videos
-
llava-hf/LLaVA-NeXT-Video-7B-hf
Video-Text-to-Text • 7B • Updated • 87.4k • 122 -
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf
Video-Text-to-Text • 7B • Updated • 1.38k • 11 -
llava-hf/LLaVA-NeXT-Video-7B-32K-hf
Image-Text-to-Text • 8B • Updated • 449 • 8
-
NVEagle/Eagle-X5-13B
Image-Text-to-Text • 15B • Updated • 4 • 15 -
NVEagle/Eagle-X5-13B-Chat
Image-Text-to-Text • 15B • Updated • 4 • 28 -
NVEagle/Eagle-X5-7B
Image-Text-to-Text • 9B • Updated • 26 • 26 -
Runtime error • 64
Eagle X5 13B Chat
Combine text and images to generate responses
-
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 23.9k • 113 -
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • Updated • 1.68M • 1.26k -
Qwen/Qwen2-VL-2B-Instruct
Image-Text-to-Text • Updated • 2.12M • 487 -
Qwen/Qwen2-72B-Instruct
Text Generation • 73B • Updated • 32k • 719