VLMs
Paper: Exploring the Potential of Encoder-free Architectures in 3D LMMs (arXiv:2502.09620)
Paper: The Evolution of Multimodal Model Architectures (arXiv:2405.17927)
Paper: What matters when building vision-language models? (arXiv:2405.02246)
Paper: Efficient Architectures for High Resolution Vision-Language Models (arXiv:2501.02584)
Paper: Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637)
Paper: Improving Fine-grained Visual Understanding in VLMs through Text-Only Training (arXiv:2412.12940)
Paper: VILA: On Pre-training for Visual Language Models (arXiv:2312.07533)
Paper: Renaissance: Investigating the Pretraining of Vision-Language Encoders (arXiv:2411.06657)
Paper: Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions (arXiv:2404.07214)
Paper: NanoVLMs: How small can we go and still make coherent Vision Language Models? (arXiv:2502.07838)
Paper: POINTS: Improving Your Vision-language Model with Affordable Strategies (arXiv:2409.04828)
Paper: Unveiling Encoder-Free Vision-Language Models (arXiv:2406.11832)
Paper: Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers (arXiv:2410.14072)
Paper: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (arXiv:2501.03895)
Paper: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (arXiv:2402.03766)
Model: HuggingFaceTB/SmolVLM-256M-Instruct (Image-Text-to-Text)
Model: Qwen/Qwen2.5-VL-3B-Instruct (Image-Text-to-Text, 4B params)
Paper: PaliGemma: A versatile 3B VLM for transfer (arXiv:2407.07726)
Model: marianna13/llava-phi-2-3b (Text Generation, 3B params)
Paper: BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices (arXiv:2411.10640)
Paper: Scalable Vision Language Model Training via High Quality Data Curation (arXiv:2501.05952)
Paper: Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
Paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (arXiv:2412.04467)
Paper: Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
Paper: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs (arXiv:2401.06209)
Paper: Model Composition for Multimodal Large Language Models (arXiv:2402.12750)
Paper: A Review of Multi-Modal Large Language and Vision Models (arXiv:2404.01322)
Paper: The (R)Evolution of Multimodal Large Language Models: A Survey (arXiv:2402.12451)
Paper: TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (arXiv:2312.16862)
Paper: Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy (arXiv:2412.17759)
Paper: TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
Paper: Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model (arXiv:2411.05903)
Paper: Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training (arXiv:2311.14109)
Paper: TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding (arXiv:2501.15513)
Paper: LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model (arXiv:2401.02330)
Paper: MM-LLMs: Recent Advances in MultiModal Large Language Models (arXiv:2401.13601)
Paper: Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance (arXiv:2410.16261)
Paper: Vision-Language Models for Edge Networks: A Comprehensive Survey (arXiv:2502.07855)
Paper: Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities (arXiv:2403.04908)
Model: google/paligemma2-3b-mix-448 (Image-Text-to-Text)
Paper: LLaVA-o1: Let Vision Language Models Reason Step-by-Step (arXiv:2411.10440)
Paper: InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning (arXiv:2502.11573)
Paper: Small Vision-Language Models: A Survey on Compact Architectures and Techniques (arXiv:2503.10665)
Paper: TIPS: Text-Image Pretraining with Spatial Awareness (arXiv:2410.16512)