VLMs
Paper: Exploring the Potential of Encoder-free Architectures in 3D LMMs (arXiv:2502.09620)
Paper: The Evolution of Multimodal Model Architectures (arXiv:2405.17927)
Paper: What matters when building vision-language models? (arXiv:2405.02246)
Paper: Efficient Architectures for High Resolution Vision-Language Models (arXiv:2501.02584)
Paper: Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637)
Paper: Improving Fine-grained Visual Understanding in VLMs through Text-Only Training (arXiv:2412.12940)
Paper: VILA: On Pre-training for Visual Language Models (arXiv:2312.07533)
Paper: Renaissance: Investigating the Pretraining of Vision-Language Encoders (arXiv:2411.06657)
Paper: Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions (arXiv:2404.07214)
Paper: NanoVLMs: How small can we go and still make coherent Vision Language Models? (arXiv:2502.07838)
Paper: POINTS: Improving Your Vision-language Model with Affordable Strategies (arXiv:2409.04828)
Paper: Unveiling Encoder-Free Vision-Language Models (arXiv:2406.11832)
Paper: Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers (arXiv:2410.14072)
Paper: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (arXiv:2501.03895)
Paper: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (arXiv:2402.03766)
Model: HuggingFaceTB/SmolVLM-256M-Instruct (Image-Text-to-Text)
Model: Qwen/Qwen2.5-VL-3B-Instruct (Image-Text-to-Text, 4B params)
Paper: PaliGemma: A versatile 3B VLM for transfer (arXiv:2407.07726)
Model: marianna13/llava-phi-2-3b (Text Generation, 3B params)
Paper: BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices (arXiv:2411.10640)
Paper: Scalable Vision Language Model Training via High Quality Data Curation (arXiv:2501.05952)
Paper: Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
Paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (arXiv:2412.04467)
Paper: Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
Paper: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs (arXiv:2401.06209)
Paper: Model Composition for Multimodal Large Language Models (arXiv:2402.12750)
Paper: A Review of Multi-Modal Large Language and Vision Models (arXiv:2404.01322)
Paper: The (R)Evolution of Multimodal Large Language Models: A Survey (arXiv:2402.12451)
Paper: TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (arXiv:2312.16862)
Paper: Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy (arXiv:2412.17759)
Paper: TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
Paper: Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model (arXiv:2411.05903)
Paper: Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training (arXiv:2311.14109)
Paper: TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding (arXiv:2501.15513)
Paper: LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model (arXiv:2401.02330)
Paper: MM-LLMs: Recent Advances in MultiModal Large Language Models (arXiv:2401.13601)
Paper: Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance (arXiv:2410.16261)
Paper: Vision-Language Models for Edge Networks: A Comprehensive Survey (arXiv:2502.07855)
Paper: Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities (arXiv:2403.04908)
Model: google/paligemma2-3b-mix-448 (Image-Text-to-Text)
Paper: LLaVA-o1: Let Vision Language Models Reason Step-by-Step (arXiv:2411.10440)
Paper: InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning (arXiv:2502.11573)
Paper: Small Vision-Language Models: A Survey on Compact Architectures and Techniques (arXiv:2503.10665)
Paper: TIPS: Text-Image Pretraining with Spatial Awareness (arXiv:2410.16512)