---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- BAAI/Infinity-MM
language:
- en
- zh
base_model:
- google/siglip2-so400m-patch16-512
- Qwen/Qwen2-1.5B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# Flash-VL-2B-Static

[\[📜 Flash-VL Tech Report\]](https://www.arxiv.org/abs/2505.09498)

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/63913b120cf6b11c487ca31d/F0Cc9Vb4Md2RDhUbf-AFP.jpeg)

## Introduction

We are excited to introduce **Flash-VL**, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.
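
To make the token-compression idea concrete, here is a purely illustrative sketch (not Flash-VL's actual operator, whose details are in the tech report): 2×2 average pooling over the vision encoder's patch grid cuts the number of visual tokens handed to the language model by 4×. The shapes assume the SigLIP2-so400m-patch16-512 backbone (a 32×32 patch grid, hidden size 1152); the pooling choice itself is a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

# Illustrative token compression (hypothetical, not Flash-VL's exact
# mechanism): pool a 32x32 grid of ViT patch tokens down to 16x16,
# shrinking the LLM's visual context from 1024 to 256 tokens.
batch, grid, dim = 1, 32, 1152                 # so400m hidden size = 1152
vit_tokens = torch.randn(batch, grid * grid, dim)

x = vit_tokens.transpose(1, 2).reshape(batch, dim, grid, grid)
x = F.avg_pool2d(x, kernel_size=2)             # (batch, dim, 16, 16)
compressed = x.flatten(2).transpose(1, 2)      # (batch, 256, dim)
print(compressed.shape)                        # torch.Size([1, 256, 1152])
```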

### Environment Setup

```bash
pip install torch==2.1.2
pip install transformers==4.50.0.dev0
```
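
Note that `4.50.0.dev0` is a development build of `transformers`, which is typically installed from source (e.g. `pip install git+https://github.com/huggingface/transformers`) rather than from PyPI. A quick sanity check of the environment before loading the model:

```python
import torch
import transformers

# Confirm the pinned versions and that a CUDA device is visible,
# since the usage example below loads the model with device_map='cuda'.
print(torch.__version__)         # expected: 2.1.2
print(transformers.__version__)  # expected: 4.50.0.dev0
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"
```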

### How to use it?

```python
import torch
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoModel, AutoTokenizer, AutoProcessor

model_path = "FlashVL/FlashVL-2B-Static"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='cuda',
)
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = AutoProcessor.from_pretrained(model_path).image_processor

# single-image, single-round conversation
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/0516.png"
response = requests.get(image_url)
pil_image = Image.open(BytesIO(response.content)).convert('RGB')
messages = [{'role': 'user', 'content': "生成图中菜品的菜谱"}]  # "Write a recipe for the dish in the image"
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# single-image, multi-round conversation
messages = [
    {'role': 'user', 'content': '这是什么'},  # "What is this?"
    # a prior assistant turn describing the dish (a tremella and lotus-seed
    # sweet soup with goji berries and walnuts), kept as conversation history
    {'role': 'assistant', 'content': '这是一道看起来像是银耳莲子汤的甜品。'
     '银耳是一种常见的食材,通常用于制作甜品和汤品,具有软糯清润的口感。'
     '莲子常用于中医和食疗中,具有补脾止泻的功效。图片中还可以看到一些'
     '枸杞和核桃,枸杞富含维生素和抗氧化物质,核桃则提供丰富的蛋白质和'
     '健康脂肪。整体来看,这道甜品不仅美味,还具有一定的营养价值。'},
    {'role': 'user', 'content': '对图中菜品卡路里分析'},  # "Analyze the dish's calories"
]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# pure-text, single-round conversation (no image)
messages = [{'role': 'user', 'content': "who are you"}]
answer = model.chat(None, messages, do_sample=False, max_new_tokens=256)
print(answer)
```
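
Because `model.chat` takes the full message history on every call, multi-round chat amounts to appending each reply before the next user turn. A small convenience wrapper (a hypothetical helper, not part of the repository) could look like:

```python
def chat_round(model, image, messages, user_text, **gen_kwargs):
    """Append a user turn, query the model, and record the reply so the
    history can be reused on the next round."""
    messages.append({'role': 'user', 'content': user_text})
    answer = model.chat(image, messages, **gen_kwargs)
    messages.append({'role': 'assistant', 'content': answer})
    return answer

history = []
print(chat_round(model, pil_image, history, "这是什么",  # "What is this?"
                 do_sample=False, max_new_tokens=256))
print(chat_round(model, pil_image, history, "对图中菜品卡路里分析",
                 do_sample=False, max_new_tokens=256))
```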

### Evaluation
| Benchmark | Qwen2-VL-2B | Aquila-VL-2B | InternVL2.5-2B | Flash-VL-2B<sub>s</sub> | Flash-VL-2B<sub>d</sub> | Flash-VL-2B<sub>d-ISS</sub> |
| :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: |
| MMMU<sub>val</sub> | 41.9 | 44.4 | 41.8 | 43.6 | 42.9 | 42.9 |
| MMBench<sup>en</sup> | 74.9 | 78.6 | 74.7 | 78.4 | 78.4 | 79.1 |
| MMBench<sup>cn</sup> | 73.5 | 76.3 | 71.6 | 74.7 | 74.9 | 76.7 |
| MMStar | 48.0 | 54.9 | 54.1 | 53.8 | 54.4 | 54.1 |
| MathVista<sub>testmini</sub> | 43.0 | 59.4 | 50.9 | 59.3 | 58.1 | 61.5 |
| AI2D<sub>test</sub> | 74.1 | 75.0 | 75.1 | 74.2 | 74.1 | 74.4 |
| MMVet | 49.5 | 40.9 | 61.7 | 47.3 | 52.7 | 50.7 |
| HallusionBench | 39.2 | 38.5 | 42.7 | 43.5 | 45.5 | 49.0 |
| OCRBench | 794 | 773 | 800 | 764 | 831 | 843 |
| MME | 1872 | 1813 | 2091 | 1715 | 1866 | 1850 |
| SEEDBench | 71.5 | 78.9 | 73.2 | 73.6 | 73.6 | 74.5 |
| Average | 60.2 | 62.6 | 63.6 | 62.4 | 64.0 | 64.8 |
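
The Average row appears to rescale OCRBench (out of 1000) and MME (out of 2800) to percentages before averaging; under that assumption the arithmetic below reproduces the reported 64.8 for Flash-VL-2B<sub>d-ISS</sub>.

```python
# Reproduce the Average row for the Flash-VL-2B_d-ISS column, rescaling
# OCRBench (max 1000) and MME (max 2800) to percentages first.
scores = [42.9, 79.1, 76.7, 54.1, 61.5, 74.4, 50.7, 49.0,
          843 / 10, 1850 / 28, 74.5]
print(round(sum(scores) / len(scores), 1))  # 64.8
```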

We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate Flash-VL-2B-Static.

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@misc{zhang2025flashvl2boptimizingvisionlanguage,
      title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput},
      author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
      year={2025},
      eprint={2505.09498},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.09498},
}
```