🏗️ Building on HF

merve PRO

merve

1982 454 956


https://github.com/merveenoyan/smol-vision

mervenoyann
merveenoyan
merve.bsky.social

AI & ML interests

I love this website. VLMs, vision & co.

Recent Activity

upvoted an article 2 days ago

Grabette: an open system to record robot-manipulation data

upvoted an article 2 days ago

The State of Simulation for Physical AI: An Overview

updated a dataset 2 days ago

merve/vl-test-suite



merve's collections (108)

Weekly Releases (Jul 17, 2026)

thinkingmachines/Inkling

Image-Text-to-Text • 952B • Updated 3 days ago • 34.5k • 1.57k
Lightricks/LTX-2.3-22b-LoRA-Foley-V2A

Text-to-Audio • Updated about 3 hours ago • 30
genzeonplatform/healthcare-brain-diagnosis-icd-ner

Token Classification • 0.1B • Updated 6 days ago • 38 • 20
Hippotes/Ideogram4-Fal-ComfyUI

Updated 5 days ago • 8.1k • 23

Weekly Releases (Jul 03, 2026)

ai-sage/GFusion-10B-A1.8B

Text Generation • 11B • Updated 21 days ago • 2.01k • 21
malcolmrey/krea2

Updated 16 minutes ago • 55
tencent/Hy3-FP8

Text Generation • 299B • Updated 6 days ago • 19.5k • 66
RudySen/Krea2-realism-V2

Text-to-Image • Updated 24 days ago • 21.6k • 94

Weekly Releases (Jun 19, 2026)

Boogu/Boogu-Image-0.1-Turbo

Text-to-Image • 10B • Updated 2 days ago • 954 • 68
datalab-to/lift

Image-Text-to-Text • 10B • Updated Jun 19 • 38.2k • 189
Comfy-Org/Boogu-Image

Updated 6 days ago • 126
Boogu/Boogu-Image-0.1-Turbo-fp8

Text-to-Image • Updated 2 days ago • 470 • 49

Vision Intern

Roboflow/rf-detr-medium

Object Detection • 33.7M • Updated May 20 • 1.02k • 6
Qwen/Qwen3.5-9B

Image-Text-to-Text • 10B • Updated Mar 2 • 10.9M • 1.75k
LiquidAI/LFM2.5-VL-1.6B

Image-Text-to-Text • 2B • Updated Mar 30 • 55.2k • 309
google/gemma-4-E4B-it

Any-to-Any • 8B • Updated 6 days ago • 5.73M • 1.41k

Weekly Releases (Jun 05, 2026)

Comfy-Org/Ideogram-4

Updated 26 days ago • 221
jdopensource/JoyAI-Echo

Text-to-Video • Updated Jun 18 • 4.37k • 166
litert-community/gemma-4-12B-it-litert-lm

Updated 12 days ago • 12.8k • 46
google/gemma-4-12B-it-qat-q4_0-unquantized

Any-to-Any • 12B • Updated 6 days ago • 536k • 69

Weekly Releases (May 22, 2026)

Efficient-Large-Model/SANA-WM_bidirectional

Image-to-Video • Updated May 19 • 126
CohereLabs/command-a-plus-05-2026-w4a4

Image-Text-to-Text • 126B • Updated Jun 16 • 6.65k • 235
FINAL-Bench/Darwin-28B-Coder

Text Generation • 27B • Updated 3 days ago • 129 • 28
LatitudeGames/Equinox-31B

31B • Updated May 22 • 281 • 59

RFDetr

RF-DETR checkpoints converted to be used with 🤗 Transformers

Running on Zero

Agents

Featured

56

RF-DETR Realtime Webcam Demo

🎯


Segment objects in live webcam and uploaded media
Roboflow/rf-detr-base

Object Detection • 32.2M • Updated May 20 • 12.4k • 5
Roboflow/rf-detr-base-2

Object Detection • 32.2M • Updated May 20 • 12
Roboflow/rf-detr-nano

Object Detection • 30.5M • Updated May 20 • 688

Apr 27 Releases

nvidia/Gemma-4-26B-A4B-NVFP4

Text Generation • 14B • Updated May 11 • 1.18M • 122
XiaomiMiMo/MiMo-V2.5-Pro

Text Generation • 1T • Updated 17 days ago • 61.3k • 718
AngelSlim/Hy-MT1.5-1.8B-1.25bit

Translation • 2B • Updated May 26 • 58 • 194
mistralai/Mistral-Medium-3.5-128B-EAGLE

Updated Apr 30 • 202 • 57

apr-17-releases

Curated collection of notable models and spaces from the week of April 17.

OpenMOSS-Team/MOSS-Audio-4B-Instruct

Audio-Text-to-Text • 5B • Updated Apr 14 • 169k • 78
OpenMOSS-Team/MOSS-Audio-8B-Thinking

Audio-Text-to-Text • 9B • Updated Jun 11 • 39.9k • 79
bytedance-research/Timer-S1

Time Series Forecasting • 8B • Updated Apr 21 • 4.32k • 33
BugTraceAI/BugTraceAI-Apex-G4-26B-Q4

25B • Updated May 11 • 2.63k • 87

Apr 3 Releases

netflix/void-model

Video-to-Video • Updated Apr 6 • 960
arcee-ai/Trinity-Large-Thinking

Text Generation • 399B • Updated May 28 • 8.65k • 185
KRAFTON/Raon-VisionEncoder

Feature Extraction • 1B • Updated Apr 1 • 24 • 21
KRAFTON/Raon-SpeechChat-9B

Audio-to-Audio • 10B • Updated 19 days ago • 868 • 38

Multimodal tool calling datasets

AgoraX/OpenImage-FNCall-50k

Viewer • Updated Feb 14, 2024 • 53.3k • 60 • 3
ScaleAI/VisualToolBench

Viewer • Updated Dec 16, 2025 • 1.2k • 1.37k • 5
internlm/ARM-Thinker-Data

Preview • Updated Feb 13 • 34 • 7
zzliang/GRIT

Viewer • Updated Jul 4, 2023 • 20.5M • 604 • 160

Jan 19 Releases

Nemotron ColEmbed V2

Collection

State-of-the-Art Late Interaction Vision-Language Embedding Models • 3 items • Updated 10 days ago • 15
Qwen/Qwen3-TTS-12Hz-1.7B-Base

2B • Updated Jan 23 • 2.39M • 455
fal/flux-2-klein-4B-outpaint-lora

Image-to-Image • Updated Jan 21 • 88
Qwen/Qwen3-TTS-Tokenizer-12Hz

Audio-to-Audio • 0.2B • Updated Jan 29 • 42.8k • 74

YOLO26 Models

YOLO26 models: detection, segmentation, classification, pose, and OBB variants with demos and ONNX variants.

Runtime error

Agents

26

YOLO26

💙


Process images with advanced object detection and segmentation
Running

Featured

65

YOLO26 WebGPU

🏆


Real-time object detection & pose estimation in your browser
onnx-community/yolo26x-ONNX

Updated Jan 18 • 27 • 5
openvision/yoloe26-n-seg

Zero-Shot Object Detection • Updated Jan 15 • 34 • 2

Dec 30 Releases

Wuli-art/Qwen-Image-2512-Turbo-LoRA

Text-to-Image • Updated Jan 8 • 7.56k • 218
miromind-ai/MiroThinker-v1.5-235B

Text Generation • 235B • Updated Mar 20 • 50 • 254
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover

Image-to-Image • Updated Jan 4 • 6.06k • 68
tencent/Youtu-LLM-2B-Base

Text Generation • 2B • Updated Feb 24 • 1.3k • 42

Dec 12 Releases

openai/circuit-sparsity

Text Generation • 0.4B • Updated Dec 12, 2025 • 203 • 208
FunAudioLLM/Fun-CosyVoice3-0.5B-2512

Text-to-Speech • Updated Feb 3 • 24.2k • 594
DiffSynth-Studio/Qwen-Image-i2L

Updated Dec 16, 2025 • 257
Aratako/T5Gemma-TTS-2b-2b

Text-to-Speech • 5B • Updated Apr 3 • 559 • 119

SAM3

facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 2.12M • 2.53k
Running on Zero

Agents

Featured

115

SAM3 Video Segmentation

🐠


Track and label objects in videos using text prompts or clicks
onnx-community/sam3-tracker-ONNX

Mask Generation • Updated Nov 19, 2025 • 633 • 38
Running

30

SAM3 Tracker WebGPU

🎯


Segment images with click points and download cutouts

Oct 6 Releases

Kwaipilot/KAT-Dev-72B-Exp

Text Generation • 73B • Updated Oct 13, 2025 • 669 • 156
LiquidAI/LFM2-8B-A1B

Text Generation • 8B • Updated May 29 • 30.6k • 370
yanolja/YanoljaNEXT-Rosetta-12B-2510

Translation • 12B • Updated Nov 2, 2025 • 675 • 29
NeuML/colbert-muvera-femto

Sentence Similarity • 243k • Updated Jun 15 • 50 • 20

Sep 23 Releases

ByteDance/lynx

Image-to-Video • Updated Sep 27, 2025 • 140
tencent/HunyuanImage-3.0

Text-to-Image • 83B • Updated Jan 28 • 12.3k • 1.1k
meituan-longcat/LongCat-Flash-Thinking

Text Generation • 562B • Updated Sep 24, 2025 • 152 • 148
Qwen/Qwen3Guard-Gen-4B

Text Generation • 4B • Updated Nov 7, 2025 • 187k • 52

Sep 11 Releases

bytedance-research/HuMo

Image-to-Video • Updated Sep 18, 2025 • 59 • 267
facebook/MobileLLM-R1-950M

Text Generation • 0.9B • Updated Sep 30, 2025 • 433 • 359
tencent/POINTS-Reader

Image-Text-to-Text • 4B • Updated Sep 12, 2025 • 62 • 102
baidu/ERNIE-4.5-21B-A3B-Thinking

Text Generation • 22B • Updated Nov 26, 2025 • 14.5k • 787

August 29 Releases

microsoft/VibeVoice-1.5B

Text-to-Speech • 3B • Updated Jan 22 • 52.8k • 2.43k
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview

Image-Text-to-Text • 0.4B • Updated Aug 29, 2025 • 106k • 83
apple/FastVLM-1.5B

Text Generation • 2B • Updated Sep 3, 2025 • 5.93k • 80
stepfun-ai/Step-Audio-2-mini

Any-to-Any • 8B • Updated Feb 14 • 21.9k • 263

Releases August 9

openai/gpt-oss-120b

Text Generation • 120B • Updated Aug 26, 2025 • 4.38M • 5.03k
openai/gpt-oss-20b

Text Generation • 22B • Updated Aug 26, 2025 • 7.93M • 4.84k
openai/BrowseCompLongContext

Viewer • Updated Aug 9, 2025 • 295 • 1.84k • 54
baichuan-inc/Baichuan-M2-32B

Text Generation • 33B • Updated Dec 24, 2025 • 2.08k • 124

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated Aug 7, 2025 • 15.6k • 758
allenai/olmOCR-7B-0725

Image-Text-to-Text • 8B • Updated Aug 26, 2025 • 542 • 64
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated Aug 7, 2025 • 3.8k • 526
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • 235B • Updated Aug 17, 2025 • 20.4k • 408

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated Sep 10, 2025 • 767k • 985
moonshotai/Kimi-K2-Instruct

Text Generation • 1T • Updated Apr 23 • 192k • 2.37k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated Jul 7, 2025 • 143 • 55
Alibaba-NLP/WebSailor-3B

3B • Updated Jul 10, 2025 • 29 • 74

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3, 2025 • 8.97k • 132
google/gemma-3n-E4B-it

Image-Text-to-Text • 8B • Updated Jul 14, 2025 • 29.7k • 919
ByteDance/XVerse

Text-to-Image • Updated Jul 1, 2025 • 64 • 91
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • 4B • Updated Feb 4 • 246 • 74

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated about 1 month ago • 1.66k • 29.8k • 98
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 7.22k • 1.59k
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 352 • 516
Running on Zero

MCP

Featured

143

Multimodal OCR2

💻


FireRed / Nanonets / Monkey / Thyme / Typhoon / SmolDocling

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Apr 16 • 2.39M • 149
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 352 • 516
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Oct 24, 2025 • 75.1k • 284
arcee-ai/Homunculus

Text Generation • 12B • Updated Jun 3, 2025 • 18 • 99

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jan 9 • 810 • 1.21k
mistralai/Devstral-Small-2505

24B • Updated Aug 18, 2025 • 4.84k • 867
ByteDance/Dolphin

Image-Text-to-Text • 0.4B • Updated Jul 16, 2025 • 526 • 517
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22, 2025 • 8.59k • 69

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6, 2025 • 191
stepfun-ai/Step1X-3D

Updated May 13, 2025 • 106
cognition-ai/Kevin-32B

33B • Updated May 6, 2025 • 10 • 163
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • 15B • Updated Nov 10, 2025 • 659 • 127

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30, 2025 • 4 • 31
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • Updated 27 days ago • 690k • 1.53k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • 685B • Updated Apr 30, 2025 • 629 • 831
Qwen/Qwen2.5-Omni-3B

Any-to-Any • 6B • Updated Apr 30, 2025 • 1.69M • 344

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16, 2025 • 136 • 31 • 12
Junfeng5/Liquid_V1_7B

Any-to-Any • 9B • Updated Mar 20, 2025 • 2.04k • 97

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jan 30 • 162k • 449
agentica-org/DeepCoder-14B-Preview

Text Generation • 15B • Updated May 11, 2025 • 620 • 679
HiDream-ai/HiDream-I1-Full

Text-to-Image • 17B • Updated Jul 17, 2025 • 18.3k • 997
OpenGVLab/InternVL3-78B

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 14.6k • 238

March 21 Releases

docling-project/SmolDocling-256M-preview

Image-Text-to-Text • 0.3B • Updated Sep 17, 2025 • 31.2k • 1.62k
sesame/csm-1b

Text-to-Speech • 2B • Updated Dec 1, 2025 • 245k • 2.42k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

24B • Updated Dec 22, 2025 • 251k • 1.38k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated Oct 17, 2025 • 22.3k • 140

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 2.91k • 91
ATH-MaaS/Ovis2-34B

Image-Text-to-Text • 35B • Updated Aug 15, 2025 • 109 • 151
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28, 2025 • 160 • 54
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1, 2025 • 1.64M • 493

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • 406B • Updated Feb 10, 2025 • 254 • 112
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6, 2025 • 565k • 643
mistralai/Mistral-Small-24B-Instruct-2501

24B • Updated Jul 28, 2025 • 85k • 962
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1, 2025 • 11.3k • 3.64k

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • 8B • Updated Jan 19, 2025 • 990 • 483
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17, 2025 • 141 • 77
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8, 2025 • 325k • 196
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27, 2025 • 8.93M • 13.5k

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated Sep 23, 2025 • 2.31M • 1.43k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17, 2025 • 1.69k • 164
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 601 • 30
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11, 2025 • 43 • 57

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 24k • 591
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12, 2025 • 42k • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Nov 26, 2025 • 2.42k • 158
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 191 • 55

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 96 • 62
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8, 2025 • 36 • 11
PleIAs/common_corpus

Viewer • Updated May 6 • 69.9k • 65k • 411
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12, 2025 • 1.32M • 2.09k

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 75
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 20
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 49
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 123

LOTUS 🪷

Running on Zero

Agents

Featured

101

Lotus Normal

🌍


Official Demo of Lotus (https://lotus3d.github.io/)
Running on Zero

Agents

79

Lotus Depth

🚀


Official Demo of Lotus (https://lotus3d.github.io/)
jingheya/lotus-depth-g-v1-0

Depth Estimation • 0.9B • Updated Oct 5, 2024 • 7.07k • 27
jingheya/lotus-depth-d-v1-0

Depth Estimation • 0.9B • Updated Oct 5, 2024 • 81 • 5

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.06M • 115
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 153k • 535
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 46.8k • 12
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 3.75k • 22

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 87.6M • Updated Sep 12, 2023 • 20.9k • 8
facebook/convnextv2-base-1k-224

Image Classification • 88.7M • Updated Feb 17, 2025 • 5.82k • 4
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 9.88k • 34
google/vit-base-patch32-384

Image Classification • 88.3M • Updated Sep 11, 2023 • 6.52k • 23

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 364 • 28
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 3.75M • Updated Jan 14, 2024 • 383k • 193
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 43M • Updated Sep 11, 2023 • 47 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 20.1k • 44

Image-to-Image Models 🎨

Collection of image to image editing, image enhancement (SR, deblur, brighten) and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • 0.9B • Updated Jul 5, 2023 • 31.1k • 1.18k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • 79M • Updated Sep 7, 2023 • 2.02k • 54
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • 79M • Updated Sep 8, 2023 • 2.37k • 77
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • 0.4B • Updated May 19, 2023 • 828 • 575

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 684k • 1.48k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3, 2025 • 1.79M • 868
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11, 2025 • 183k • 502
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 2.98k • 106

Foundation Models for Vision 🧩

Foundation models for computer vision.

Running

Agents

124

Grounding DINO Demo

💻


Cutting edge open-vocabulary object detection app
Running

Agents

Featured

105

Owlv2

👀


State-of-the-art Zero-shot Object Detection
Configuration error

Agents

Featured

41

BLIP2 with transformers

🌖


BLIP2 (cutting edge image captioning) in 🤗transformers
Build error

Agents

Featured

377

IDEFICS Playground

🐨


OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Running

Agents

Featured

105

Owlv2

👀


State-of-the-art Zero-shot Object Detection
Running on Zero

Agents

Featured

64

Owl Tracking

⚡


Powerful foundation model for zero-shot object tracking
Sleeping

26

Search and Detect (CLIP/OWL-ViT)

🦉


Search and detect objects in images using text queries
Running on Zero

Agents

Featured

110

OWLSAM

😻


State-of-the-art open-vocabulary image segmentation ⚡️

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

Running on Zero

Agents

Featured

83

UDOP

🏃


Generate answers or summaries from document images with prompts
Configuration error

Agents

40

Pix2struct

📚


Play with all the pix2struct variants in this d
Running

Agents

26

Compare Docvqa Models

🦀


Compare different visual question answering
Runtime error

Agents

Featured

289

DocQuery — Document Query Engine

🦉


Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 39
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 50
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 12
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 28

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 9.6k • 14
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 6.58k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 54k • 30

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 14.3k • 79
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 48.1k • 156
Running on Zero

Agents

696

Depth Anything V2

🌖


Generate depth map from any photo
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 363 • 17

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Running

Agents

209

Vidore Leaderboard

🥇


Browse and compare visual document retrieval model scores
Running on CPU Upgrade

Agents

1.02k

Open VLM Leaderboard

🌎


VLMEvalKit Evaluation Results Collection
Running

Featured

561

Vision Arena (Testing VLMs side-by-side)

🖼


Explore Vision Arena visual AI demo online
Build error

Agents

Featured

85

SEED-Bench Leaderboard

🏆


Submit model evaluation results to leaderboard

SAM2

All the models and demos for SAM2

merve/sam2-hiera-tiny

Mask Generation • Updated Aug 2, 2024 • 31
merve/sam2-hiera-small

Mask Generation • Updated Aug 2, 2024 • 17 • 2
merve/sam2-hiera-large

Mask Generation • Updated Aug 2, 2024 • 105 • 2
merve/sam2-hiera-base-plus

Mask Generation • Updated Aug 2, 2024 • 71

Multimodal RAG

vidore/colpali-v1.2

Visual Document Retrieval • Updated Mar 14, 2025 • 134k • 112
Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.47M • 1.28k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12, 2025 • 3.13M • 516
Qwen/Qwen2-72B-Instruct

Text Generation • 73B • Updated Oct 8, 2024 • 60.8k • 718

Weekly Releases (Jul 10, 2026)

OrionLLM/GRM-2.6-Plus-0628

Image-Text-to-Text • 28B • Updated 18 days ago • 1.22k • 31
SOLRICKS/ltx-2.3-product-ad-style

Text-to-Video • Updated 17 days ago • 37
robbyant/lingbot-vision-vit-large

Image Feature Extraction • Updated 20 days ago • 1 • 24
robbyant/lingbot-video-moe-30b-a3b

30B • Updated 18 days ago • 1.09k • 121

Weekly Releases (Jun 26, 2026)

deepreinforce-ai/Ornith-1.0-397B

Text Generation • 397B • Updated about 1 month ago • 403k • 245
Winnougan/Krea-2-Base-Turbo-NVFP4-FP8-INT8

Updated 23 days ago • 96
wikeeyang/Flux2-Klein-9B-True-V3

Text-to-Image • 9B • Updated 13 days ago • 42.9k • 111
Cseti/LTX2.3-22B_IC-LoRA-Cameraman_v2

Updated Jun 24 • 52

my recommended vision/mm models

detection, segmentation, OCR, depth, pose, grounding, VLM detection

Roboflow/rf-detr-base

Object Detection • 32.2M • Updated May 20 • 12.4k • 5
Roboflow/rf-detr-seg-large

Image Segmentation • 36.2M • Updated May 20 • 48 • 3
Roboflow/rf-detr-seg-medium

Image Segmentation • 35.7M • Updated May 20 • 388 • 3
facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 2.12M • 2.53k

Weekly Releases (Jun 12, 2026)

PaddlePaddle/PP-OCRv6_medium_det

Image-to-Text • Updated Jun 12 • 84.6k • 26
PaddlePaddle/PP-OCRv6_tiny_det_safetensors

Image-to-Text • 438k • Updated Jun 12 • 263 • 27
moonshotai/Kimi-K2.7-Code

Image-Text-to-Text • 1.1T • Updated Jun 15 • 730k • 1.29k
PaddlePaddle/PP-OCRv6_small_rec_onnx

Image-to-Text • Updated Jun 18 • 6.48k • 19

Weekly Releases (May 29, 2026)

Comfy-Org/PixelDiT

Updated 9 days ago • 76.5k • 135
spiritbuun/buun-Qwen3.6-chat_template

Updated May 28 • 57
avaturn-live/avtr-1

Image-to-Video • Updated May 31 • 911 • 38
Kwai-Keye/Keye-VL-2.0-30B-A3B

Image-Text-to-Text • 31B • Updated Jun 10 • 8.97k • 121

Weekly Releases (May 15, 2026)

internlm/Intern-S2-Preview

Image-Text-to-Text • 36B • Updated May 29 • 15.5k • 115
nvidia/nemotron-3.5-asr-streaming-0.6b

Automatic Speech Recognition • 0.6B • Updated 20 days ago • 884k • 942
internlm/Intern-S2-Preview-FP8

Image-Text-to-Text • 36B • Updated May 22 • 401 • 24
Aratako/Irodori-TTS-500M-v3

Text-to-Speech • 0.5B • Updated May 12 • 115

may-4-releases

Curated collection of notable models, datasets, and spaces from the week.

hadadxyz/OpenSonnet-Lite-MAX

Text Generation • 4B • Updated May 10 • 115 • 30
hadadxyz/OpenSonnet-Lite

Text Generation • 4B • Updated May 6 • 30 • 24
Zyphra/ZAYA1-74B-preview

75B • Updated about 1 month ago • 219 • 48
oumoumad/ltx-2.3-dearchive-lora

Video-to-Video • Updated May 9 • 44

Privacy Filter

openai/privacy-filter

Token Classification • 1B • Updated Apr 22 • 473k • 1.71k
Running on Zero

Agents

18

OPF Image Anonymizer

🦀


Use OAI's Privacy Filter to redact PII info from any image
Running

Featured

65

Privacy Filter WebGPU

🕵


PII detection and text masking in your browser

Apr 7 Releases

FINAL-Bench/Darwin-4B-Opus

Text Generation • 8B • Updated May 15 • 70 • 28
nvidia/nemocurator-speech-bandwidth-filter

Updated Apr 2 • 21
ACE-Step/acestep-v15-xl-turbo

Text-to-Audio • 5B • Updated Apr 7 • 3.87k • 190
EasonXiao-888/SpatialEdit-16B

Image-Text-to-Image • Updated Apr 8 • 25 • 17

super cool vision language datasets

ServiceNow/ui-vision

Viewer • Updated May 7, 2025 • 1.46k • 14k • 22
xxxllz/Chart2Code-160k

Updated Jul 7, 2025 • 147 • 11
ReCAP-Agent/ReCAP-187k-SFT

Viewer • Updated Mar 26 • 188k • 59 • 8
allenai/MolmoPoint-GUISyn

Viewer • Updated Apr 3 • 37k • 434 • 12

Jan 26 Releases

robbyant/lingbot-world-base-cam

Image-to-Video • Updated Feb 2 • 340
nvidia/C-RADIOv4-H

Feature Extraction • 0.7B • Updated Jan 30 • 12.2k • 80
deepseek-ai/DeepSeek-OCR-2

Image-Text-to-Text • 3B • Updated Feb 3 • 2.69M • 1.06k
arcee-ai/Trinity-Large-Base

Text Generation • 399B • Updated May 28 • 282 • 58

Jan 12 Releases

google/translategemma-27b-it

Image-Text-to-Text • 29B • Updated Jan 28 • 35k • 386
kakaocorp/kanana-2-30b-a3b-mid-2601

Text Generation • 31B • Updated Jan 15 • 25 • 31
black-forest-labs/FLUX.2-klein-base-4B

Image-to-Image • 4B • Updated Feb 24 • 154k • 151
google/translategemma-12b-it

Image-Text-to-Text • 13B • Updated Jan 28 • 22.2k • 319

Jan 5 Releases

LiquidAI/LFM2.5-VL-1.6B

Image-Text-to-Text • 2B • Updated Mar 30 • 55.2k • 309
openbmb/AgentCPM-Explore

Text Generation • 4B • Updated Jan 18 • 287 • 418
Phr00t/LTX2-Rapid-Merges

Image-Text-to-Video • Updated Feb 12 • 368
LiquidAI/LFM2.5-1.2B-Base

Text Generation • 1B • Updated Mar 30 • 23.6k • 136

Dec 19 Releases

nvidia/NitroGen

Reinforcement Learning • Updated Feb 5 • 559
google/gemma-scope-2

Updated Dec 19, 2025 • 89
FunAudioLLM/Fun-ASR-MLT-Nano-2512

Automatic Speech Recognition • Updated 4 days ago • 329 • 55
facebook/map-anything-v1

Image-to-3D • 0.6B • Updated Dec 19, 2025 • 443 • 26

Real-time Vision Models

A collection of real-time detectors.

RFDetr

Collection

RF-DETR checkpoints converted to be used with 🤗 Transformers • 15 items • Updated May 27 • 17
PekingU/rtdetr_v2_r50vd

Object Detection • 43M • Updated Feb 6, 2025 • 27.6k • 29
ustc-community/dfine-xlarge-obj365

Object Detection • 63.4M • Updated May 5, 2025 • 2.6k • 5
PekingU/rtdetr_v2_r101vd

Object Detection • 76.8M • Updated Feb 6, 2025 • 8.49k • 14

MetaCLIP2 Multilingual

facebook/metaclip-2-worldwide-s16

Zero-Shot Image Classification • 0.4B • Updated Nov 12, 2025 • 2.7k • 9
facebook/metaclip-2-worldwide-m16

Zero-Shot Image Classification • 0.5B • Updated Nov 12, 2025 • 28 • 4
facebook/metaclip-2-worldwide-l14

Zero-Shot Image Classification • 1B • Updated Nov 12, 2025 • 2.33k • 13
facebook/metaclip-2-worldwide-b32

Zero-Shot Image Classification • 0.6B • Updated Nov 12, 2025 • 552 • 7

Sep 30 Releases

deepseek-ai/DeepSeek-V3.2-Exp

Text Generation • 685B • Updated Nov 18, 2025 • 294k • 992
Qwen3-VL

Collection

37 items • Updated Dec 31, 2025 • 761
SDLM

Collection

Sequential Diffusion Language Models • 4 items • Updated Mar 2 • 8
Ming 2.0

Collection

Ming is the multi-modal series of any-to-any models developed by Ant Ling team. • 14 items • Updated Jun 15 • 35

Sep 16 Releases

ibm-granite/granite-docling-258M

Image-Text-to-Text • 0.3B • Updated Sep 23, 2025 • 299k • 1.23k
XiaomiMiMo/MiMo-Audio-7B-Base

Any-to-Any • 8B • Updated Jun 17 • 193 • 57
decart-ai/Lucy-Edit-Dev

Video-to-Video • 5B • Updated Nov 20, 2025 • 358 • 359
OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 186 • 13

Sep 1 Releases

openbmb/MiniCPM4.1-8B

Text Generation • 8B • Updated Oct 24, 2025 • 140k • 391
tencent/Hunyuan-MT-7B

Translation • 8B • Updated Dec 30, 2025 • 1.71k • 735
google/embeddinggemma-300m

Sentence Similarity • 0.3B • Updated Sep 25, 2025 • 1.71M • 1.81k
moonshotai/Kimi-K2-Instruct-0905

Text Generation • 1T • Updated Jan 30 • 160k • 785

Aug 22 Releases

Qwen/Qwen-Image-Edit

Image-to-Image • 20B • Updated Aug 25, 2025 • 98.9k • 2.46k
internlm/Intern-S1-mini

Image-Text-to-Text • 9B • Updated Mar 29 • 21k • 115
xai-org/grok-2

Updated Nov 5, 2025 • 13.5k • 1.13k
ByteDance-Seed/Seed-OSS-36B-Instruct

Text Generation • 36B • Updated Aug 26, 2025 • 38.9k • 503

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated Jan 29 • 166k • 166
nunchaku-ai/nunchaku-flux.1-krea-dev

Text-to-Image • Updated Nov 16, 2025 • 3.75k • 121
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated Aug 26, 2025 • 17.6k • 71
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • 5B • Updated Aug 9, 2025 • 160k • 149

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated Sep 16, 2025 • 9.4k • 125
ByteDance-Seed/Seed-X-RM-7B

Translation • 7B • Updated Jul 31, 2025 • 65 • 32
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated Aug 4, 2025 • 30.7k • 281
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated Jul 17, 2025 • 7.99k • 94

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated Dec 8, 2025 • 761 • 321
BAAI/MTVCraft

Text-to-Video • Updated Jul 7, 2025 • 17 • 36
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated Sep 11, 2025 • 47.1k • 378
apple/DiffuCoder-7B-Base

8B • Updated Dec 8, 2025 • 511 • 30

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated Jan 30 • 7.25k • 371
mistralai/Mistral-Small-3.2-24B-Instruct-2506

24B • Updated Dec 22, 2025 • 256k • 598
kyutai/stt-1b-en_fr

Automatic Speech Recognition • 1.0B • Updated Nov 18, 2025 • 133
google/magenta-realtime

Updated Aug 29, 2025 • 32 • 554

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12, 2025 • 81.9k • 76
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13, 2025 • 227
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 7.22k • 1.59k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated Oct 17, 2025 • 39.1k • 1.09k

Releases 30 May

All the releases of the week of 30th May.

deepseek-ai/DeepSeek-R1-0528

Text Generation • 685B • Updated May 29, 2025 • 256k • 2.46k
Running on Zero

Agents

Featured

222

BAGEL

🚀


Demo for BAGEL
tencent/HunyuanPortrait

Image-to-Video • Updated May 27, 2025 • 75
XiaomiMiMo/MiMo-7B-RL-0530

Text Generation • 8B • Updated Jun 5, 2025 • 277 • 45

May 16 Releases

Qwen/WorldPM-72B

Text Classification • 73B • Updated May 17, 2025 • 93 • 82
Running on Zero

MCP

Featured

1.54k

LTX Video Fast

🎥


ultra-fast video model, LTX 0.9.8 13B distilled
BLIP3o/BLIP3o-Pretrain-Long-Caption

Viewer • Updated Jun 26, 2025 • 27.2M • 9.21k • 74
BLIP3o/BLIP3o-Model-8B

14B • Updated Jun 4, 2025 • 962 • 103

Any-to-Any Models, Datasets, Spaces

Runtime error

Agents

Featured

83

MMaDA

🌍


Demo for MMaDA: Multimodal Large Diffusion Language Models
Running on Zero

Agents

Featured

222

BAGEL

🚀


Demo for BAGEL
Gen-Verse/MMaDA-8B-Base

Any-to-Any • 8B • Updated May 24, 2025 • 933 • 91
ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jan 9 • 810 • 1.21k

InternVL3 HF

OpenGVLab/InternVL3-1B-hf

Image-Text-to-Text • 0.9B • Updated Apr 23, 2025 • 248k • 10
OpenGVLab/InternVL3-2B-hf

Image-Text-to-Text • 2B • Updated Apr 23, 2025 • 17.6k • 3
OpenGVLab/InternVL3-8B-hf

Image-Text-to-Text • 8B • Updated Apr 23, 2025 • 69.3k • 10
OpenGVLab/InternVL3-14B-hf

Image-Text-to-Text • 15B • Updated Apr 23, 2025 • 7.3k

Multimodal DSE Retrievers

A collection of DSE models for multimodal retrieval

racineai/Flantier-SmolVLM-2B-dse

2B • Updated Jun 18, 2025 • 1 • 11
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26, 2025 • 29.8k • 68
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 67 • 56
llamaindex/vdr-2b-multi-v1

Image-Text-to-Text • 2B • Updated Apr 8 • 12k • 128

March 28 Releases

deepseek-ai/DeepSeek-V3-0324

Text Generation • 685B • Updated Mar 27, 2025 • 838k • 3.15k
Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30, 2025 • 414k • 1.92k
google/txgemma-27b-chat

Text Generation • 27B • Updated Apr 10, 2025 • 441 • • 61
Qwen2.5 Omni 7B Demo 🏆

Chat with text, audio, images, and video, get spoken replies • Running • Agents • Featured • 374

Türkçe VLMler (Turkish VLMs)

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.47M • 1.28k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12, 2025 • 3.13M • 516
CohereLabs/aya-vision-8b

Image-Text-to-Text • 9B • Updated Jan 9 • 40.4k • 325
CohereLabs/aya-vision-32b

Image-Text-to-Text • 33B • Updated Jan 9 • 163 • • 227

Feb 7 Releases 🧣

lerobot/pi0_old

Robotics • 4B • Updated Sep 19, 2025 • 733 • 309
kyutai/hibiki-2b-pytorch-bf16

Translation • Updated May 28, 2025 • 472 • 63
Alpha-VLLM/Lumina-Image-2.0

Text-to-Image • 3B • Updated Mar 30, 2025 • 1.16k • • 370
adyen/DABstep

Viewer • Updated about 18 hours ago • 959k • 5.07k • 52

Models, Jan 27

Qwen2-VL-7B 🔥

Answer questions about uploaded images • Running on Zero • Agents • 269
UI-TARS 🌖

Predict UI click coordinates from a screenshot and instruction • Running • Agents • 67
Qwen2.5-1M Demo 💻

Ask questions about your uploaded documents • Paused • Agents • 101
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29, 2025 • 10.1k • • 342

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5, 2025 • 365k • 1.3k
MiniMaxAI/MiniMax-Text-01

Text Generation • 456B • Updated Jul 3, 2025 • 4.52k • 656
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • 1B • Updated Apr 24, 2025 • 68 • 108
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14, 2025 • 16.4k • 339 • 186

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 655k • • 2.92k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 95 • 80
google/paligemma2-3b-pt-224

Image-Text-to-Text • 3B • Updated Dec 5, 2024 • 10.9k • 175
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6, 2025 • 846 • • 2.23k

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated Jun 2 • 65 • 434
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 1.31k • 466
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Nov 16, 2025 • 565 • 158
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated Apr 8 • 630k • 341

Nov 1 Releases

LongVU 🌖

Answer questions about uploaded videos or images • Running on Zero • Agents • 91
facebook/MobileLLM-1B

Text Generation • Updated May 5, 2025 • 116 • 122
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28, 2025 • 298 • 76
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28, 2025 • 6 • 6

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • 8B • Updated Dec 19, 2024 • 165k • 207
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 11.7k • 48
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated Jan 9 • 25k • 439
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated Jan 9 • 6.36k • • 298

New Depth Models

Recent depth models

DepthCrafter 🦀

a super consistent video depth model • Running on Zero • Agents • Featured • 208
Depth Pro 🚀

Generate an inverse depth map from an image • Paused • Agents • Featured • 223
Lotus Depth 🚀

Official Demo of Lotus (https://lotus3d.github.io/) • Running on Zero • Agents • 79
apple/DepthPro

Depth Estimation • Updated Feb 28, 2025 • 6.77k • 526

Computer Vision Backbones 🧩

A collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • 25.6M • Updated Feb 13, 2024 • 1.02M • • 499
google/vit-base-patch16-224-in21k

Image Feature Extraction • 86.4M • Updated Feb 5, 2024 • 923k • 414
google/vit-base-patch32-224-in21k

Image Feature Extraction • 88M • Updated Dec 8, 2022 • 24.3k • 20
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.06M • 115

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 41.6M • Updated Apr 10, 2024 • 1.62M • • 965
facebook/detr-resnet-101-dc5

Object Detection • 60.7M • Updated Sep 6, 2023 • 4.73k • 19
facebook/detr-resnet-50-dc5

Object Detection • 41.6M • Updated Sep 7, 2023 • 3.06k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150

Zero-shot Image Classification Models 🖼️

This is a collection of models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 9.95M • 2.06k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 23.6M • 988
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22, 2025 • 111k • 314
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 17.9k • 31

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 119k • 114
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 11.6k • 43
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 31.4k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 23.6k • 11

Text-to-Image Models 🥑

stabilityai/stable-diffusion-xl-base-1.0

Text-to-Image • 3B • Updated Oct 30, 2023 • 1.47M • • 7.98k
warp-ai/wuerstchen

Text-to-Image • Updated Mar 12, 2024 • 108 • 177
Deci/DeciDiffusion-v1-0

Text-to-Image • 0.9B • Updated Feb 15, 2024 • 40 • 140
stabilityai/stable-diffusion-xl-refiner-1.0

Image-to-Image • 2B • Updated Sep 25, 2023 • 126k • 2.06k

Segment Anything Model

This collection contains models and demos of SAM and its smaller friends.

facebook/sam-vit-huge

Mask Generation • 0.6B • Updated Jan 11, 2024 • 464k • 197
facebook/sam-vit-base

Mask Generation • 93.7M • Updated Jan 11, 2024 • 1.26M • 171
facebook/sam-vit-large

Mask Generation • 0.3B • Updated Jan 11, 2024 • 10.5k • 34
Grounded SAM 💩

Runtime error • Agents • 43

SigLIP

A collection dedicated to SigLIP applications

Draw To Search Art 🐠

Draw/upload image and search among WikiART using SigLIP • Running on Zero • Agents • Featured • 74
Compare Clip Siglip 🏃

Compare strong zero-shot image classification models • Running on CPU Upgrade • Agents • 23
Multilingual Zero Shot Image Clf 🏢

Comparing powerful multilingual zero-shot image clf models • Runtime error • Agents • 13
BAAI/bunny-phi-2-siglip-lora

Text Generation • Updated Mar 28, 2024 • 70 • 48

SegGPT

A collection of everything SegGPT.

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper • 2212.02499 • Published Dec 5, 2022
SegGPT: Segmenting Everything In Context

Paper • 2304.03284 • Published Apr 6, 2023 • 1
BAAI/seggpt-vit-large

0.4B • Updated 13 days ago • 23.8k • 5
BAAI/SegGPT

Updated 13 days ago • 22

gvhf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 9.6k • 14
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 6.58k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 54k • 30

merve/owl2

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 9.6k • 14
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 6.58k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 54k • 30

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Video Language Models

A collection of video-language models

Video Llava 🐨

Generate descriptions by uploading images or videos • Paused • Agents • 21
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 197k • 126
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 945 • 12
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Nov 11, 2025 • 111 • 9

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 18 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 72 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 120 • 26
Eagle X5 13B Chat 🚀

Combine text and images to generate responses • Runtime error • Agents • 64

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
OWLSAM2 🏃

Runtime error • Agents • Featured • 60

Weekly Releases (Jul 10, 2026)

OrionLLM/GRM-2.6-Plus-0628

Image-Text-to-Text • 28B • Updated 18 days ago • 1.22k • • 31
SOLRICKS/ltx-2.3-product-ad-style

Text-to-Video • Updated 17 days ago • 37
robbyant/lingbot-vision-vit-large

Image Feature Extraction • Updated 20 days ago • 1 • 24
robbyant/lingbot-video-moe-30b-a3b

30B • Updated 18 days ago • 1.09k • 121

Weekly Releases (Jun 26, 2026)

deepreinforce-ai/Ornith-1.0-397B

Text Generation • 397B • Updated about 1 month ago • 403k • 245
Winnougan/Krea-2-Base-Turbo-NVFP4-FP8-INT8

Updated 23 days ago • 96
wikeeyang/Flux2-Klein-9B-True-V3

Text-to-Image • 9B • Updated 13 days ago • 42.9k • 111
Cseti/LTX2.3-22B_IC-LoRA-Cameraman_v2

Updated Jun 24 • 52

my recommended vision/mm models

detection, segmentation, OCR, depth, pose, grounding, VLM detection

Roboflow/rf-detr-base

Object Detection • 32.2M • Updated May 20 • 12.4k • 5
Roboflow/rf-detr-seg-large

Image Segmentation • 36.2M • Updated May 20 • 48 • 3
Roboflow/rf-detr-seg-medium

Image Segmentation • 35.7M • Updated May 20 • 388 • 3
facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 2.12M • 2.53k

Weekly Releases (Jun 12, 2026)

PaddlePaddle/PP-OCRv6_medium_det

Image-to-Text • Updated Jun 12 • 84.6k • 26
PaddlePaddle/PP-OCRv6_tiny_det_safetensors

Image-to-Text • 438k • Updated Jun 12 • 263 • 27
moonshotai/Kimi-K2.7-Code

Image-Text-to-Text • 1.1T • Updated Jun 15 • 730k • • 1.29k
PaddlePaddle/PP-OCRv6_small_rec_onnx

Image-to-Text • Updated Jun 18 • 6.48k • 19

Weekly Releases (May 29, 2026)

Comfy-Org/PixelDiT

Updated 9 days ago • 76.5k • 135
spiritbuun/buun-Qwen3.6-chat_template

Updated May 28 • 57
avaturn-live/avtr-1

Image-to-Video • Updated May 31 • 911 • 38
Kwai-Keye/Keye-VL-2.0-30B-A3B

Image-Text-to-Text • 31B • Updated Jun 10 • 8.97k • 121

Weekly Releases (May 22, 2026)

Efficient-Large-Model/SANA-WM_bidirectional

Image-to-Video • Updated May 19 • 126
CohereLabs/command-a-plus-05-2026-w4a4

Image-Text-to-Text • 126B • Updated Jun 16 • 6.65k • • 235
FINAL-Bench/Darwin-28B-Coder

Text Generation • 27B • Updated 3 days ago • 129 • 28
LatitudeGames/Equinox-31B

31B • Updated May 22 • 281 • 59

Weekly Releases (May 15, 2026)

internlm/Intern-S2-Preview

Image-Text-to-Text • 36B • Updated May 29 • 15.5k • 115
nvidia/nemotron-3.5-asr-streaming-0.6b

Automatic Speech Recognition • 0.6B • Updated 20 days ago • 884k • • 942
internlm/Intern-S2-Preview-FP8

Image-Text-to-Text • 36B • Updated May 22 • 401 • 24
Aratako/Irodori-TTS-500M-v3

Text-to-Speech • 0.5B • Updated May 12 • 115

RFDetr

RF-DETR checkpoints converted to be used with 🤗 Transformers

RF-DETR Realtime Webcam Demo 🎯

Segment objects in live webcam and uploaded media • Running on Zero • Agents • Featured • 56
Roboflow/rf-detr-base

Object Detection • 32.2M • Updated May 20 • 12.4k • 5
Roboflow/rf-detr-base-2

Object Detection • 32.2M • Updated May 20 • 12
Roboflow/rf-detr-nano

Object Detection • 30.5M • Updated May 20 • 688

may-4-releases

Curated collection of notable models, datasets, and spaces from the week.

hadadxyz/OpenSonnet-Lite-MAX

Text Generation • 4B • Updated May 10 • 115 • • 30
hadadxyz/OpenSonnet-Lite

Text Generation • 4B • Updated May 6 • 30 • • 24
Zyphra/ZAYA1-74B-preview

75B • Updated about 1 month ago • 219 • 48
oumoumad/ltx-2.3-dearchive-lora

Video-to-Video • Updated May 9 • 44

Apr 27 Releases

nvidia/Gemma-4-26B-A4B-NVFP4

Text Generation • 14B • Updated May 11 • 1.18M • 122
XiaomiMiMo/MiMo-V2.5-Pro

Text Generation • 1T • Updated 17 days ago • 61.3k • • 718
AngelSlim/Hy-MT1.5-1.8B-1.25bit

Translation • 2B • Updated May 26 • 58 • 194
mistralai/Mistral-Medium-3.5-128B-EAGLE

Updated Apr 30 • 202 • 57

Privacy Filter

openai/privacy-filter

Token Classification • 1B • Updated Apr 22 • 473k • • 1.71k
OPF Image Anonymizer 🦀

Use OAI's Privacy Filter to redact PII info from any image • Running on Zero • Agents • 18
Privacy Filter WebGPU 🕵

PII detection and text masking in your browser • Running • Featured • 65

apr-17-releases

Curated collection of notable models and spaces from the week of April 17.

OpenMOSS-Team/MOSS-Audio-4B-Instruct

Audio-Text-to-Text • 5B • Updated Apr 14 • 169k • 78
OpenMOSS-Team/MOSS-Audio-8B-Thinking

Audio-Text-to-Text • 9B • Updated Jun 11 • 39.9k • 79
bytedance-research/Timer-S1

Time Series Forecasting • 8B • Updated Apr 21 • 4.32k • 33
BugTraceAI/BugTraceAI-Apex-G4-26B-Q4

25B • Updated May 11 • 2.63k • 87

Apr 7 Releases

FINAL-Bench/Darwin-4B-Opus

Text Generation • 8B • Updated May 15 • 70 • 28
nvidia/nemocurator-speech-bandwidth-filter

Updated Apr 2 • 21
ACE-Step/acestep-v15-xl-turbo

Text-to-Audio • 5B • Updated Apr 7 • 3.87k • 190
EasonXiao-888/SpatialEdit-16B

Image-Text-to-Image • Updated Apr 8 • 25 • 17

Apr 3 Releases

netflix/void-model

Video-to-Video • Updated Apr 6 • 960
arcee-ai/Trinity-Large-Thinking

Text Generation • 399B • Updated May 28 • 8.65k • • 185
KRAFTON/Raon-VisionEncoder

Feature Extraction • 1B • Updated Apr 1 • 24 • 21
KRAFTON/Raon-SpeechChat-9B

Audio-to-Audio • 10B • Updated 19 days ago • 868 • 38

super cool vision language datasets

ServiceNow/ui-vision

Viewer • Updated May 7, 2025 • 1.46k • 14k • 22
xxxllz/Chart2Code-160k

Updated Jul 7, 2025 • 147 • 11
ReCAP-Agent/ReCAP-187k-SFT

Viewer • Updated Mar 26 • 188k • 59 • 8
allenai/MolmoPoint-GUISyn

Viewer • Updated Apr 3 • 37k • 434 • 12

Multimodal tool calling datasets

AgoraX/OpenImage-FNCall-50k

Viewer • Updated Feb 14, 2024 • 53.3k • 60 • 3
ScaleAI/VisualToolBench

Viewer • Updated Dec 16, 2025 • 1.2k • 1.37k • 5
internlm/ARM-Thinker-Data

Preview • Updated Feb 13 • 34 • 7
zzliang/GRIT

Viewer • Updated Jul 4, 2023 • 20.5M • 604 • 160

Jan 26 Releases

robbyant/lingbot-world-base-cam

Image-to-Video • Updated Feb 2 • 340
nvidia/C-RADIOv4-H

Feature Extraction • 0.7B • Updated Jan 30 • 12.2k • 80
deepseek-ai/DeepSeek-OCR-2

Image-Text-to-Text • 3B • Updated Feb 3 • 2.69M • 1.06k
arcee-ai/Trinity-Large-Base

Text Generation • 399B • Updated May 28 • 282 • 58

Jan 19 Releases

Nemotron ColEmbed V2

Collection

State-of-the-Art Late Interaction Vision-Language Embedding Models • 3 items • Updated 10 days ago • 15
Qwen/Qwen3-TTS-12Hz-1.7B-Base

2B • Updated Jan 23 • 2.39M • 455
fal/flux-2-klein-4B-outpaint-lora

Image-to-Image • Updated Jan 21 • • 88
Qwen/Qwen3-TTS-Tokenizer-12Hz

Audio-to-Audio • 0.2B • Updated Jan 29 • 42.8k • 74

Jan 12 Releases

google/translategemma-27b-it

Image-Text-to-Text • 29B • Updated Jan 28 • 35k • 386
kakaocorp/kanana-2-30b-a3b-mid-2601

Text Generation • 31B • Updated Jan 15 • 25 • 31
black-forest-labs/FLUX.2-klein-base-4B

Image-to-Image • 4B • Updated Feb 24 • 154k • • 151
google/translategemma-12b-it

Image-Text-to-Text • 13B • Updated Jan 28 • 22.2k • 319

YOLO26 Models

YOLO26 models: detection, segmentation, classification, pose, and OBB variants with demos and ONNX variants.

YOLO26 💙

Process images with advanced object detection and segmentation • Runtime error • Agents • 26
YOLO26 WebGPU 🏆

Real-time object detection & pose estimation in your browser • Running • Featured • 65
onnx-community/yolo26x-ONNX

Updated Jan 18 • 27 • 5
openvision/yoloe26-n-seg

Zero-Shot Object Detection • Updated Jan 15 • 34 • 2

Jan 5 Releases

LiquidAI/LFM2.5-VL-1.6B

Image-Text-to-Text • 2B • Updated Mar 30 • 55.2k • 309
openbmb/AgentCPM-Explore

Text Generation • 4B • Updated Jan 18 • 287 • • 418
Phr00t/LTX2-Rapid-Merges

Image-Text-to-Video • Updated Feb 12 • 368
LiquidAI/LFM2.5-1.2B-Base

Text Generation • 1B • Updated Mar 30 • 23.6k • 136

Dec 30 Releases

Wuli-art/Qwen-Image-2512-Turbo-LoRA

Text-to-Image • Updated Jan 8 • 7.56k • 218
miromind-ai/MiroThinker-v1.5-235B

Text Generation • 235B • Updated Mar 20 • 50 • 254
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover

Image-to-Image • Updated Jan 4 • 6.06k • • 68
tencent/Youtu-LLM-2B-Base

Text Generation • 2B • Updated Feb 24 • 1.3k • 42

Dec 19 Releases

nvidia/NitroGen

Reinforcement Learning • Updated Feb 5 • 559
google/gemma-scope-2

Updated Dec 19, 2025 • 89
FunAudioLLM/Fun-ASR-MLT-Nano-2512

Automatic Speech Recognition • Updated 4 days ago • 329 • 55
facebook/map-anything-v1

Image-to-3D • 0.6B • Updated Dec 19, 2025 • 443 • 26

Dec 12 Releases

openai/circuit-sparsity

Text Generation • 0.4B • Updated Dec 12, 2025 • 203 • 208
FunAudioLLM/Fun-CosyVoice3-0.5B-2512

Text-to-Speech • Updated Feb 3 • 24.2k • 594
DiffSynth-Studio/Qwen-Image-i2L

Updated Dec 16, 2025 • 257
Aratako/T5Gemma-TTS-2b-2b

Text-to-Speech • 5B • Updated Apr 3 • 559 • 119

Real-time Vision Models

A collection of real-time detectors.

RFDetr

Collection

RF-DETR checkpoints converted to be used with 🤗 Transformers • 15 items • Updated May 27 • 17
PekingU/rtdetr_v2_r50vd

Object Detection • 43M • Updated Feb 6, 2025 • 27.6k • 29
ustc-community/dfine-xlarge-obj365

Object Detection • 63.4M • Updated May 5, 2025 • 2.6k • 5
PekingU/rtdetr_v2_r101vd

Object Detection • 76.8M • Updated Feb 6, 2025 • 8.49k • 14

SAM3

facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 2.12M • 2.53k
SAM3 Video Segmentation 🐠

Track and label objects in videos using text prompts or clicks • Running on Zero • Agents • Featured • 115
onnx-community/sam3-tracker-ONNX

Mask Generation • Updated Nov 19, 2025 • 633 • 38
SAM3 Tracker WebGPU 🎯

Segment images with click points and download cutouts • Running • 30

MetaCLIP2 Multilingual

facebook/metaclip-2-worldwide-s16

Zero-Shot Image Classification • 0.4B • Updated Nov 12, 2025 • 2.7k • 9
facebook/metaclip-2-worldwide-m16

Zero-Shot Image Classification • 0.5B • Updated Nov 12, 2025 • 28 • 4
facebook/metaclip-2-worldwide-l14

Zero-Shot Image Classification • 1B • Updated Nov 12, 2025 • 2.33k • 13
facebook/metaclip-2-worldwide-b32

Zero-Shot Image Classification • 0.6B • Updated Nov 12, 2025 • 552 • 7

Oct 6 Releases

Kwaipilot/KAT-Dev-72B-Exp

Text Generation • 73B • Updated Oct 13, 2025 • 669 • • 156
LiquidAI/LFM2-8B-A1B

Text Generation • 8B • Updated May 29 • 30.6k • 370
yanolja/YanoljaNEXT-Rosetta-12B-2510

Translation • 12B • Updated Nov 2, 2025 • 675 • 29
NeuML/colbert-muvera-femto

Sentence Similarity • 243k • Updated Jun 15 • 50 • 20

Sep 30 Releases

deepseek-ai/DeepSeek-V3.2-Exp

Text Generation • 685B • Updated Nov 18, 2025 • 294k • • 992
Qwen3-VL

Collection

37 items • Updated Dec 31, 2025 • 761
SDLM

Collection

Sequential Diffusion Language Models • 4 items • Updated Mar 2 • 8
Ming 2.0

Collection

Ming is the multi-modal series of any-to-any models developed by Ant Ling team. • 14 items • Updated Jun 15 • 35

Sep 23 Releases

ByteDance/lynx

Image-to-Video • Updated Sep 27, 2025 • • 140
tencent/HunyuanImage-3.0

Text-to-Image • 83B • Updated Jan 28 • 12.3k • • 1.1k
meituan-longcat/LongCat-Flash-Thinking

Text Generation • 562B • Updated Sep 24, 2025 • 152 • 148
Qwen/Qwen3Guard-Gen-4B

Text Generation • 4B • Updated Nov 7, 2025 • 187k • • 52

Sep 16 Releases

ibm-granite/granite-docling-258M

Image-Text-to-Text • 0.3B • Updated Sep 23, 2025 • 299k • 1.23k
XiaomiMiMo/MiMo-Audio-7B-Base

Any-to-Any • 8B • Updated Jun 17 • 193 • 57
decart-ai/Lucy-Edit-Dev

Video-to-Video • 5B • Updated Nov 20, 2025 • 358 • 359
OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 186 • 13

Sep 11 Releases

bytedance-research/HuMo

Image-to-Video • Updated Sep 18, 2025 • 59 • 267
facebook/MobileLLM-R1-950M

Text Generation • 0.9B • Updated Sep 30, 2025 • 433 • 359
tencent/POINTS-Reader

Image-Text-to-Text • 4B • Updated Sep 12, 2025 • 62 • 102
baidu/ERNIE-4.5-21B-A3B-Thinking

Text Generation • 22B • Updated Nov 26, 2025 • 14.5k • 787

Sep 1 Releases

openbmb/MiniCPM4.1-8B

Text Generation • 8B • Updated Oct 24, 2025 • 140k • 391
tencent/Hunyuan-MT-7B

Translation • 8B • Updated Dec 30, 2025 • 1.71k • 735
google/embeddinggemma-300m

Sentence Similarity • 0.3B • Updated Sep 25, 2025 • 1.71M • • 1.81k
moonshotai/Kimi-K2-Instruct-0905

Text Generation • 1T • Updated Jan 30 • 160k • • 785

August 29 Releases

microsoft/VibeVoice-1.5B

Text-to-Speech • 3B • Updated Jan 22 • 52.8k • 2.43k
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview

Image-Text-to-Text • 0.4B • Updated Aug 29, 2025 • 106k • 83
apple/FastVLM-1.5B

Text Generation • 2B • Updated Sep 3, 2025 • 5.93k • 80
stepfun-ai/Step-Audio-2-mini

Any-to-Any • 8B • Updated Feb 14 • 21.9k • 263

Aug 22 Releases

Qwen/Qwen-Image-Edit

Image-to-Image • 20B • Updated Aug 25, 2025 • 98.9k • • 2.46k
internlm/Intern-S1-mini

Image-Text-to-Text • 9B • Updated Mar 29 • 21k • 115
xai-org/grok-2

Updated Nov 5, 2025 • 13.5k • 1.13k
ByteDance-Seed/Seed-OSS-36B-Instruct

Text Generation • 36B • Updated Aug 26, 2025 • 38.9k • 503

Releases August 9

openai/gpt-oss-120b

Text Generation • 120B • Updated Aug 26, 2025 • 4.38M • • 5.03k
openai/gpt-oss-20b

Text Generation • 22B • Updated Aug 26, 2025 • 7.93M • • 4.84k
openai/BrowseCompLongContext

Viewer • Updated Aug 9, 2025 • 295 • 1.84k • 54
baichuan-inc/Baichuan-M2-32B

Text Generation • 33B • Updated Dec 24, 2025 • 2.08k • • 124

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated Jan 29 • 166k • 166
nunchaku-ai/nunchaku-flux.1-krea-dev

Text-to-Image • Updated Nov 16, 2025 • 3.75k • 121
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated Aug 26, 2025 • 17.6k • • 71
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • 5B • Updated Aug 9, 2025 • 160k • 149

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated Aug 7, 2025 • 15.6k • • 758
allenai/olmOCR-7B-0725

Image-Text-to-Text • 8B • Updated Aug 26, 2025 • 542 • 64
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated Aug 7, 2025 • 3.8k • • 526
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • 235B • Updated Aug 17, 2025 • 20.4k • • 408

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated Sep 16, 2025 • 9.4k • • 125
ByteDance-Seed/Seed-X-RM-7B

Translation • 7B • Updated Jul 31, 2025 • 65 • 32
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated Aug 4, 2025 • 30.7k • 281
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated Jul 17, 2025 • 7.99k • 94

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated Sep 10, 2025 • 767k • 985
moonshotai/Kimi-K2-Instruct

Text Generation • 1T • Updated Apr 23 • 192k • • 2.37k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated Jul 7, 2025 • 143 • • 55
Alibaba-NLP/WebSailor-3B

3B • Updated Jul 10, 2025 • 29 • 74

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated Dec 8, 2025 • 761 • 321
BAAI/MTVCraft

Text-to-Video • Updated Jul 7, 2025 • 17 • 36
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated Sep 11, 2025 • 47.1k • 378
apple/DiffuCoder-7B-Base

8B • Updated Dec 8, 2025 • 511 • 30

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3, 2025 • 8.97k • 132
google/gemma-3n-E4B-it

Image-Text-to-Text • 8B • Updated Jul 14, 2025 • 29.7k • • 919
ByteDance/XVerse

Text-to-Image • Updated Jul 1, 2025 • 64 • 91
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • 4B • Updated Feb 4 • 246 • 74

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated Jan 30 • 7.25k • 371
mistralai/Mistral-Small-3.2-24B-Instruct-2506

24B • Updated Dec 22, 2025 • 256k • 598
kyutai/stt-1b-en_fr

Automatic Speech Recognition • 1.0B • Updated Nov 18, 2025 • 133
google/magenta-realtime

Updated Aug 29, 2025 • 32 • 554

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated about 1 month ago • 1.66k • 29.8k • 98
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 7.22k • 1.59k
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 352 • 516
Multimodal OCR2 💻

FireRed / Nanonets / Monkey / Thyme / Typhoon / SmolDocling • Running on Zero • MCP • Featured • 143

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12, 2025 • 81.9k • 76
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13, 2025 • 227
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 7.22k • 1.59k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated Oct 17, 2025 • 39.1k • 1.09k

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Apr 16 • 2.39M • 149
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 352 • 516
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Oct 24, 2025 • 75.1k • 284
arcee-ai/Homunculus

Text Generation • 12B • Updated Jun 3, 2025 • 18 • 99

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jan 9 • 810 • 1.21k
mistralai/Devstral-Small-2505

24B • Updated Aug 18, 2025 • 4.84k • 867
ByteDance/Dolphin

Image-Text-to-Text • 0.4B • Updated Jul 16, 2025 • 526 • 517
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22, 2025 • 8.59k • 69

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6, 2025 • 191
stepfun-ai/Step1X-3D

Updated May 13, 2025 • 106
cognition-ai/Kevin-32B

33B • Updated May 6, 2025 • 10 • 163
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • 15B • Updated Nov 10, 2025 • 659 • 127

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30, 2025 • 4 • 31
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • Updated 27 days ago • 690k • 1.53k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • 685B • Updated Apr 30, 2025 • 629 • 831
Qwen/Qwen2.5-Omni-3B

Any-to-Any • 6B • Updated Apr 30, 2025 • 1.69M • 344

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16, 2025 • 136 • 31 • 12
Junfeng5/Liquid_V1_7B

Any-to-Any • 9B • Updated Mar 20, 2025 • 2.04k • 97

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jan 30 • 162k • 449
agentica-org/DeepCoder-14B-Preview

Text Generation • 15B • Updated May 11, 2025 • 620 • • 679
HiDream-ai/HiDream-I1-Full

Text-to-Image • 17B • Updated Jul 17, 2025 • 18.3k • • 997
OpenGVLab/InternVL3-78B

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 14.6k • 238

March 21 Releases

docling-project/SmolDocling-256M-preview

Image-Text-to-Text • 0.3B • Updated Sep 17, 2025 • 31.2k • 1.62k
sesame/csm-1b

Text-to-Speech • 2B • Updated Dec 1, 2025 • 245k • 2.42k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

24B • Updated Dec 22, 2025 • 251k • 1.38k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated Oct 17, 2025 • 22.3k • 140

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 2.91k • 91
ATH-MaaS/Ovis2-34B

Image-Text-to-Text • 35B • Updated Aug 15, 2025 • 109 • 151
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28, 2025 • 160 • • 54
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1, 2025 • 1.64M • 493

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • 406B • Updated Feb 10, 2025 • 254 • 112
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6, 2025 • 565k • • 643
mistralai/Mistral-Small-24B-Instruct-2501

24B • Updated Jul 28, 2025 • 85k • 962
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1, 2025 • 11.3k • 3.64k

Models, Jan 27

Qwen2-VL-7B 🔥

Running on Zero • Agents • 269
Answer questions about uploaded images
UI-TARS 🌖

Running • Agents • 67
Predict UI click coordinates from a screenshot and instruction
Qwen2.5-1M Demo 💻

Paused • Agents • 101
Ask questions about your uploaded documents
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29, 2025 • 10.1k • • 342

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • 8B • Updated Jan 19, 2025 • 990 • 483
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17, 2025 • 141 • 77
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8, 2025 • 325k • 196
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27, 2025 • 8.93M • • 13.5k

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5, 2025 • 365k • 1.3k
MiniMaxAI/MiniMax-Text-01

Text Generation • 456B • Updated Jul 3, 2025 • 4.52k • 656
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • 1B • Updated Apr 24, 2025 • 68 • 108
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14, 2025 • 16.4k • 339 • 186

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated Sep 23, 2025 • 2.31M • 1.43k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17, 2025 • 1.69k • 164
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 601 • 30
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11, 2025 • 43 • 57

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 655k • • 2.92k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 95 • 80
google/paligemma2-3b-pt-224

Image-Text-to-Text • 3B • Updated Dec 5, 2024 • 10.9k • 175
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6, 2025 • 846 • • 2.23k

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 24k • 591
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12, 2025 • 42k • • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Nov 26, 2025 • 2.42k • 158
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 191 • 55

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated Jun 2 • 65 • 434
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 1.31k • 466
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Nov 16, 2025 • 565 • 158
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated Apr 8 • 630k • 341

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 96 • 62
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8, 2025 • 36 • 11
PleIAs/common_corpus

Viewer • Updated May 6 • 69.9k • 65k • 411
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12, 2025 • 1.32M • • 2.09k

Nov 1 Releases

LongVU 🌖

Running on Zero • Agents • 91
Answer questions about uploaded videos or images
facebook/MobileLLM-1B

Text Generation • Updated May 5, 2025 • 116 • 122
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28, 2025 • 298 • 76
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28, 2025 • 6 • 6

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 75
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 20
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 49
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 123

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • 8B • Updated Dec 19, 2024 • 165k • 207
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 11.7k • 48
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated Jan 9 • 25k • 439
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated Jan 9 • 6.36k • • 298

LOTUS 🪷

Lotus Normal 🌍

Running on Zero • Agents • Featured • 101
Official Demo of Lotus (https://lotus3d.github.io/)
Lotus Depth 🚀

Running on Zero • Agents • 79
Official Demo of Lotus (https://lotus3d.github.io/)
jingheya/lotus-depth-g-v1-0

Depth Estimation • 0.9B • Updated Oct 5, 2024 • 7.07k • 27
jingheya/lotus-depth-d-v1-0

Depth Estimation • 0.9B • Updated Oct 5, 2024 • 81 • 5

New Depth Models

Recent depth models

DepthCrafter 🦀

Running on Zero • Agents • Featured • 208
a super consistent video depth model
Depth Pro 🚀

Paused • Agents • Featured • 223
Generate an inverse depth map from an image
Lotus Depth 🚀

Running on Zero • Agents • 79
Official Demo of Lotus (https://lotus3d.github.io/)
apple/DepthPro

Depth Estimation • Updated Feb 28, 2025 • 6.77k • 526

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.06M • 115
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 153k • 535
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 46.8k • 12
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 3.75k • 22

Computer Vision Backbones 🧩

Collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • 25.6M • Updated Feb 13, 2024 • 1.02M • • 499
google/vit-base-patch16-224-in21k

Image Feature Extraction • 86.4M • Updated Feb 5, 2024 • 923k • 414
google/vit-base-patch32-224-in21k

Image Feature Extraction • 88M • Updated Dec 8, 2022 • 24.3k • 20
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.06M • 115

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 87.6M • Updated Sep 12, 2023 • 20.9k • • 8
facebook/convnextv2-base-1k-224

Image Classification • 88.7M • Updated Feb 17, 2025 • 5.82k • • 4
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 9.88k • • 34
google/vit-base-patch32-384

Image Classification • 88.3M • Updated Sep 11, 2023 • 6.52k • • 23

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 41.6M • Updated Apr 10, 2024 • 1.62M • • 965
facebook/detr-resnet-101-dc5

Object Detection • 60.7M • Updated Sep 6, 2023 • 4.73k • 19
facebook/detr-resnet-50-dc5

Object Detection • 41.6M • Updated Sep 7, 2023 • 3.06k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 364 • • 28
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 3.75M • Updated Jan 14, 2024 • 383k • • 193
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 43M • Updated Sep 11, 2023 • 47 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 20.1k • • 44

Zero-shot Image Classification Models 🖼️

A collection of models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 9.95M • 2.06k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 23.6M • 988
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22, 2025 • 111k • 314
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 17.9k • 31

Image-to-Image Models 🎨

Collection of image-to-image editing, image enhancement (super-resolution, deblurring, brightening), and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • 0.9B • Updated Jul 5, 2023 • 31.1k • 1.18k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • 79M • Updated Sep 7, 2023 • 2.02k • 54
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • 79M • Updated Sep 8, 2023 • 2.37k • 77
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • 0.4B • Updated May 19, 2023 • 828 • 575

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 119k • 114
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 11.6k • 43
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 31.4k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 23.6k • 11

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 684k • 1.48k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3, 2025 • 1.79M • 868
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11, 2025 • 183k • 502
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 2.98k • 106

Text-to-Image Models 🥑

stabilityai/stable-diffusion-xl-base-1.0

Text-to-Image • 3B • Updated Oct 30, 2023 • 1.47M • • 7.98k
warp-ai/wuerstchen

Text-to-Image • Updated Mar 12, 2024 • 108 • 177
Deci/DeciDiffusion-v1-0

Text-to-Image • 0.9B • Updated Feb 15, 2024 • 40 • 140
stabilityai/stable-diffusion-xl-refiner-1.0

Image-to-Image • 2B • Updated Sep 25, 2023 • 126k • 2.06k

Foundation Models for Vision 🧩

Foundation models for computer vision.

Grounding DINO Demo 💻

Running • Agents • 124
Cutting edge open-vocabulary object detection app
Owlv2 👀

Running • Agents • Featured • 105
State-of-the-art Zero-shot Object Detection
BLIP2 with transformers 🌖

Configuration error • Agents • Featured • 41
BLIP2 (cutting edge image captioning) in 🤗transformers
IDEFICS Playground 🐨

Build error • Agents • Featured • 377

Segment Anything Model

This collection contains models and demos of SAM and its smaller friends.

facebook/sam-vit-huge

Mask Generation • 0.6B • Updated Jan 11, 2024 • 464k • 197
facebook/sam-vit-base

Mask Generation • 93.7M • Updated Jan 11, 2024 • 1.26M • 171
facebook/sam-vit-large

Mask Generation • 0.3B • Updated Jan 11, 2024 • 10.5k • 34
Grounded SAM 💩

Runtime error • Agents • 43

OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Owlv2 👀

Running • Agents • Featured • 105
State-of-the-art Zero-shot Object Detection
Owl Tracking ⚡

Running on Zero • Agents • Featured • 64
Powerful foundation model for zero-shot object tracking
Search and Detect (CLIP/OWL-ViT) 🦉

Sleeping • 26
Search and detect objects in images using text queries
OWLSAM 😻

Running on Zero • Agents • Featured • 110
State-of-the-art open-vocabulary image segmentation ⚡️

SigLIP

A collection dedicated to SigLIP applications

Draw To Search Art 🐠

Running on Zero • Agents • Featured • 74
Draw/upload image and search among WikiART using SigLIP
Compare Clip Siglip 🏃

Running on CPU Upgrade • Agents • 23
Compare strong zero-shot image classification models
Multilingual Zero Shot Image Clf 🏢

Runtime error • Agents • 13
Comparing powerful multilingual zero-shot image clf models
BAAI/bunny-phi-2-siglip-lora

Text Generation • Updated Mar 28, 2024 • 70 • 48

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

UDOP 🏃

Running on Zero • Agents • Featured • 83
Generate answers or summaries from document images with prompts
Pix2struct 📚

Configuration error • Agents • 40
Play with all the pix2struct variants in this d
Compare Docvqa Models 🦀

Running • Agents • 26
Compare different visual question answering
DocQuery — Document Query Engine 🦉

Runtime error • Agents • Featured • 289

SegGPT

A collection of everything SegGPT.

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper • 2212.02499 • Published Dec 5, 2022
SegGPT: Segmenting Everything In Context

Paper • 2304.03284 • Published Apr 6, 2023 • 1
BAAI/seggpt-vit-large

0.4B • Updated 13 days ago • 23.8k • 5
BAAI/SegGPT

Updated 13 days ago • 22

Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 39
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 50
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 12
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 28

gvhf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 9.6k • 14
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 6.58k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 54k • 30

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 9.6k • 14
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 6.58k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 54k • 30

merve/owl2

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 296k • 150
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 9.6k • 14
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 6.58k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 54k • 30

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 14.3k • 79
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 48.1k • 156
Depth Anything V2 🌖

Running on Zero • Agents • 696
Generate depth map from any photo
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 363 • 17

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Vidore Leaderboard 🥇

Running • Agents • 209
Browse and compare visual document retrieval model scores
Open VLM Leaderboard 🌎

Running on CPU Upgrade • Agents • 1.02k
VLMEvalKit Evaluation Results Collection
Vision Arena (Testing VLMs side-by-side) 🖼

Running • Featured • 561
Explore Vision Arena visual AI demo online
SEED-Bench Leaderboard 🏆

Build error • Agents • Featured • 85
Submit model evaluation results to leaderboard

Video Language Models

A collection of video-language models

Video Llava 🐨

Paused • Agents • 21
Generate descriptions by uploading images or videos
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 197k • 126
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 945 • 12
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Nov 11, 2025 • 111 • 9

SAM2

All the models and demos for SAM2

merve/sam2-hiera-tiny

Mask Generation • Updated Aug 2, 2024 • 31
merve/sam2-hiera-small

Mask Generation • Updated Aug 2, 2024 • 17 • 2
merve/sam2-hiera-large

Mask Generation • Updated Aug 2, 2024 • 105 • 2
merve/sam2-hiera-base-plus

Mask Generation • Updated Aug 2, 2024 • 71

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 18 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 72 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 120 • 26
Eagle X5 13B Chat 🚀

Runtime error • Agents • 64
Combine text and images to generate responses

Multimodal RAG

vidore/colpali-v1.2

Visual Document Retrieval • Updated Mar 14, 2025 • 134k • 112
Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.47M • 1.28k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12, 2025 • 3.13M • 516
Qwen/Qwen2-72B-Instruct

Text Generation • 73B • Updated Oct 8, 2024 • 60.8k • • 718

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
OWLSAM2 🏃

Runtime error • Agents • Featured • 60

RF-DETR Realtime Webcam Demo

YOLO26

YOLO26 WebGPU

SAM3 Video Segmentation

SAM3 Tracker WebGPU

Multimodal OCR2