merve PRO
- Running on Zero • Agents • Featured • 49
RF-DETR Realtime Webcam Demo
🎯 Segment objects in live webcam and uploaded media
- Roboflow/rf-detr-base
Object Detection • 32.2M • Updated • 994 • 3
- Roboflow/rf-detr-base-2
Object Detection • 32.2M • Updated • 207
- Roboflow/rf-detr-nano
Object Detection • 30.5M • Updated • 410
- OpenMOSS-Team/MOSS-Audio-4B-Instruct
Audio-Text-to-Text • 5B • Updated • 3.75k • 72
- OpenMOSS-Team/MOSS-Audio-8B-Thinking
Audio-Text-to-Text • 9B • Updated • 2.01k • 69
- bytedance-research/Timer-S1
Time Series Forecasting • 8B • Updated • 79.1k • 30
- BugTraceAI/BugTraceAI-Apex-G4-26B-Q4
25B • Updated • 530 • 60
- Runtime error • Agents • 26
YOLO26
💙 Process images with advanced object detection and segmentation
- Running • Featured • 65
YOLO26 WebGPU
🏆 Real-time object detection & pose estimation in your browser
- onnx-community/yolo26x-ONNX
Updated • 25 • 5
- openvision/yoloe26-n-seg
Zero-Shot Object Detection • Updated • 235 • 2
- Wuli-art/Qwen-Image-2512-Turbo-LoRA
Text-to-Image • Updated • 15.9k • 217
- miromind-ai/MiroThinker-v1.5-235B
Text Generation • 235B • Updated • 47 • 254
- prithivMLmods/Qwen-Image-Edit-2511-Object-Remover
Image-to-Image • Updated • 3.76k • 64
- tencent/Youtu-LLM-2B-Base
Text Generation • 2B • Updated • 1.12k • 42
- facebook/sam3
Mask Generation • 0.9B • Updated • 1.84M • 2.18k
- Running on Zero • Agents • Featured • 114
SAM3 Video Segmentation
🐠 Track and label objects in videos using text prompts or clicks
- onnx-community/sam3-tracker-ONNX
Mask Generation • Updated • 797 • 37
- Running • 30
SAM3 Tracker WebGPU
🎯 Segment images with click points and download cutouts
- opendatalab/OmniDocBench
Viewer • Updated • 1.65k • 15.3k • 89
- nanonets/Nanonets-OCR-s
Image-Text-to-Text • 4B • Updated • 87k • 1.59k
- echo840/MonkeyOCR
Image-Text-to-Text • Updated • 228 • 515
- Running on Zero • MCP • Featured • 143
Multimodal OCR2
💻 FireRed / Nanonets / Monkey / Thyme / Typhoon / SmolDocling
- moonshotai/Kimi-VL-A3B-Thinking
Image-Text-to-Text • 16B • Updated • 123k • 448
- agentica-org/DeepCoder-14B-Preview
Text Generation • 15B • Updated • 330 • 681
- HiDream-ai/HiDream-I1-Full
Text-to-Image • Updated • 13.5k • 996
- OpenGVLab/InternVL3-78B
Image-Text-to-Text • 78B • Updated • 39.3k • 237
- NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 75
- BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 20
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 49
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 123
- Running on Zero • Agents • Featured • 101
Lotus Normal
🌍 Official Demo of Lotus (https://lotus3d.github.io/)
- Running on Zero • Agents • 78
Lotus Depth
🚀 Official Demo of Lotus (https://lotus3d.github.io/)
- jingheya/lotus-depth-g-v1-0
Depth Estimation • Updated • 7.52k • 27
- jingheya/lotus-depth-d-v1-0
Depth Estimation • Updated • 115 • 5
- facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 2.8M • 113
- google/flan-t5-xl
3B • Updated • 124k • 534
- google/siglip-large-patch16-384
Zero-Shot Image Classification • 0.7B • Updated • 31.6k • 11
- google/vit-huge-patch14-224-in21k
Image Feature Extraction • 0.6B • Updated • 2.3k • 22
- facebook/deit-base-distilled-patch16-384
Image Classification • 87.6M • Updated • 9.37k • 8
- facebook/convnextv2-base-1k-224
Image Classification • 88.7M • Updated • 379 • 4
- facebook/deit-base-distilled-patch16-224
Image Classification • Updated • 6.02k • 34
- google/vit-base-patch32-384
Image Classification • 88.3M • Updated • 6.16k • 23
- facebook/maskformer-swin-large-coco
Image Segmentation • 0.2B • Updated • 99 • 28
- nvidia/segformer-b0-finetuned-ade-512-512
Image Segmentation • 3.75M • Updated • 261k • 190
- facebook/detr-resnet-50-dc5-panoptic
Image Segmentation • 43M • Updated • 20 • 3
- nvidia/segformer-b5-finetuned-cityscapes-1024-1024
Image Segmentation • Updated • 84.1k • 43
- Salesforce/blip-image-captioning-large
Image-to-Text • 0.5B • Updated • 690k • 1.48k
- Salesforce/blip-image-captioning-base
Image-to-Text • Updated • 2.06M • 860
- microsoft/trocr-base-handwritten
Image-to-Text • 0.3B • Updated • 129k • 495
- microsoft/git-large-coco
Image-to-Text • 0.4B • Updated • 3.89k • 105
- Running • Agents • 122
Grounding DINO Demo
💻 Cutting edge open-vocabulary object detection app
- Running • Agents • Featured • 105
Owlv2
👀 State-of-the-art Zero-shot Object Detection
- Configuration error • Agents • Featured • 41
BLIP2 with transformers
🌖 BLIP2 (cutting edge image captioning) in 🤗transformers
- Build error • Agents • Featured • 377
IDEFICS Playground
🐨
- Running • Agents • Featured • 105
Owlv2
👀 State-of-the-art Zero-shot Object Detection
- Running on Zero • Agents • Featured • 64
Owl Tracking
⚡ Powerful foundation model for zero-shot object tracking
- Running • 26
Search and Detect (CLIP/OWL-ViT)
🦉 Search and detect objects in images using text queries
- Runtime error • Agents • Featured • 110
OWLSAM
😻 State-of-the-art open-vocabulary image segmentation ⚡️
- Runtime error • Agents • Featured • 83
UDOP
🏃 Generate text from document images
- Configuration error • Agents • 40
Pix2struct
📚 Play with all the pix2struct variants in this d
- Running • Agents • 26
Compare Docvqa Models
🦀 Compare different visual question answering
- Runtime error • Agents • Featured • 289
DocQuery — Document Query Engine
🦉
- Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 39
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 50
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 12
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • Published • 28
- google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 118k • 148
- google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 5.12k • 13
- google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 7.7k • 29
- google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 16k • 30
- depth-anything/Depth-Anything-V2-Small
Depth Estimation • Updated • 10.5k • 78
- depth-anything/Depth-Anything-V2-Large
Depth Estimation • Updated • 71.9k • 154
- Running on Zero • Agents • 688
Depth Anything V2
🌖 Generate depth map from any photo
- depth-anything/DA-2K
Viewer • Updated • 1.04k • 345 • 17
- Running • Agents • 207
Vidore Leaderboard
🥇 Browse and compare visual document retrieval model scores
- Running on CPU Upgrade • Agents • 1.02k
Open VLM Leaderboard
🌎 VLMEvalKit Evaluation Results Collection
- Running • Featured • 561
Vision Arena (Testing VLMs side-by-side)
🖼 Explore Vision Arena visual AI demo online
- Build error • Agents • Featured • 85
SEED-Bench Leaderboard
🏆 Submit model evaluation results to leaderboard
- internlm/Intern-S2-Preview
Image-Text-to-Text • 36B • Updated • 6.73k • 107
- nvidia/nemotron-3.5-asr-streaming-0.6b
Automatic Speech Recognition • Updated • 4.97k • 359
- internlm/Intern-S2-Preview-FP8
Image-Text-to-Text • 36B • Updated • 136k • 23
- Aratako/Irodori-TTS-500M-v3
Text-to-Speech • 0.5B • Updated • 100
- google/translategemma-27b-it
Image-Text-to-Text • 29B • Updated • 30.3k • 378
- kakaocorp/kanana-2-30b-a3b-mid-2601
Text Generation • 31B • Updated • 39 • 31
- black-forest-labs/FLUX.2-klein-base-4B
Image-to-Image • Updated • 93.9k • 142
- google/translategemma-12b-it
Image-Text-to-Text • 13B • Updated • 76.1k • 306
- PekingU/rtdetr_v2_r50vd
Object Detection • 43M • Updated • 450k • 28
- ustc-community/dfine-xlarge-obj365
Object Detection • 63.4M • Updated • 1.61k • 5
- PekingU/rtdetr_v2_r101vd
Object Detection • 76.8M • Updated • 5.98k • 14
- Running on T4 • Agents • 147
RF-DETR
🔥 SOTA real-time object detection model
- facebook/metaclip-2-worldwide-s16
Zero-Shot Image Classification • 0.4B • Updated • 183 • 9
- facebook/metaclip-2-worldwide-m16
Zero-Shot Image Classification • 0.5B • Updated • 26 • 4
- facebook/metaclip-2-worldwide-l14
Zero-Shot Image Classification • 1B • Updated • 2.42k • 13
- facebook/metaclip-2-worldwide-b32
Zero-Shot Image Classification • 0.6B • Updated • 161 • 7
- deepseek-ai/DeepSeek-V3-0324
Text Generation • 685B • Updated • 830k • 3.13k
- Qwen/Qwen2.5-Omni-7B
Any-to-Any • 11B • Updated • 778k • 1.9k
- google/txgemma-27b-chat
Text Generation • 27B • Updated • 321 • 60
- Running • Agents • Featured • 372
Qwen2.5 Omni 7B Demo
🏆 Chat with text, audio, images, and video, get spoken replies
- Running on Zero • Agents • 269
Qwen2-VL-7B
🔥 Answer questions about uploaded images
- Running • Agents • 67
UI-TARS
🌖 Predict UI click coordinates from a screenshot and instruction
- Paused • Agents • 101
Qwen2.5-1M Demo
💻 Ask questions about your uploaded documents
- Qwen/Qwen2.5-14B-Instruct-1M
Text Generation • 15B • Updated • 40k • 340
- ibm-granite/granite-3.0-8b-instruct
Text Generation • 8B • Updated • 63.1k • 206
- ibm-granite/granite-3.0-2b-instruct
Text Generation • 3B • Updated • 3.67k • 49
- CohereLabs/aya-expanse-8b
Text Generation • 8B • Updated • 36.1k • 434
- CohereLabs/aya-expanse-32b
Text Generation • 32B • Updated • 12k • 294
- Runtime error • Agents • Featured • 207
DepthCrafter
🦀 a super consistent video depth model
- Paused • Agents • Featured • 223
Depth Pro
🚀 Generate an inverse depth map from an image
- Running on Zero • Agents • 78
Lotus Depth
🚀 Official Demo of Lotus (https://lotus3d.github.io/)
- apple/DepthPro
Depth Estimation • Updated • 5.12k • 517
- microsoft/resnet-50
Image Classification • 25.6M • Updated • 266k • 494
- google/vit-base-patch16-224-in21k
Image Feature Extraction • 86.4M • Updated • 1.84M • 410
- google/vit-base-patch32-224-in21k
Image Feature Extraction • 88M • Updated • 6.54k • 19
- facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 2.8M • 113
- facebook/detr-resnet-50
Object Detection • 41.6M • Updated • 273k • 953
- facebook/detr-resnet-101-dc5
Object Detection • 60.7M • Updated • 5.02k • 19
- facebook/detr-resnet-50-dc5
Object Detection • 41.6M • Updated • 71.1k • 6
- google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 118k • 148
- openai/clip-vit-large-patch14
Zero-Shot Image Classification • 0.4B • Updated • 13.9M • 2.03k
- openai/clip-vit-base-patch32
Zero-Shot Image Classification • Updated • 21.3M • 956
- laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Zero-Shot Image Classification • Updated • 62.3k • 312
- kakaobrain/align-base
Zero-Shot Image Classification • Updated • 10.8k • 31
- microsoft/xclip-base-patch32
Video Classification • 0.2B • Updated • 87.4k • 113
- facebook/timesformer-base-finetuned-k400
Video Classification • Updated • 13.5k • 43
- facebook/timesformer-base-finetuned-k600
Video Classification • Updated • 3.19k • 12
- google/vivit-b-16x2
Video Classification • Updated • 15.4k • 11
- Runtime error • Agents • Featured • 74
Draw To Search Art
🐠 Draw/upload image and search among WikiART using SigLIP
- Running on CPU Upgrade • Agents • 23
Compare Clip Siglip
🏃 Compare strong zero-shot image classification models
- Runtime error • Agents • 13
Multilingual Zero Shot Image Clf
🏢 Comparing powerful multilingual zero-shot image clf models
- BAAI/bunny-phi-2-siglip-lora
Text Generation • Updated • 77 • 48
- google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 118k • 148
- google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 5.12k • 13
- google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 7.7k • 29
- google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 16k • 30
- google/owlvit-base-patch32
Zero-Shot Object Detection • 0.2B • Updated • 118k • 148
- google/owlvit-base-patch16
Zero-Shot Object Detection • Updated • 5.12k • 13
- google/owlvit-large-patch14
Zero-Shot Object Detection • Updated • 7.7k • 29
- google/owlv2-base-patch16
Zero-Shot Object Detection • 0.2B • Updated • 16k • 30
- Paused • Agents • 21
Video Llava
🐨 Generate descriptions by uploading images or videos
- llava-hf/LLaVA-NeXT-Video-7B-hf
Video-Text-to-Text • 7B • Updated • 142k • 125
- llava-hf/LLaVA-NeXT-Video-7B-DPO-hf
Video-Text-to-Text • 7B • Updated • 593 • 12
- llava-hf/LLaVA-NeXT-Video-7B-32K-hf
Image-Text-to-Text • 8B • Updated • 143 • 9