| --- |
| license: apache-2.0 |
| language: |
| - en |
| base_model: |
| - Qwen/Qwen2.5-VL-7B-Instruct |
| pipeline_tag: image-text-to-text |
| library_name: transformers |
| tags: |
| - text-generation-inference |
| - trl |
| - spatial-reasoning |
| - vision-understanding |
| --- |
| |
|  |
|
|
| # **Spatial-VU** |
|
|
| > **Spatial-VU** is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, tailored for **Spatial Reasoning and Vision Understanding**. It generates detailed, descriptive captions across a broad range of visual categories, including images with complex, sensitive, or nuanced content, at varying aspect ratios and resolutions. |
|
|
| # Key Highlights |
|
|
| * **Spatial Reasoning & Vision Understanding**: Fine-tuned to provide accurate and descriptive visual interpretations, enabling deeper understanding of spatial relationships, structures, and context. |
|
|
| * **High-Fidelity Descriptions**: Generates comprehensive captions for general, artistic, technical, abstract, and low-context images. |
|
|
| * **Robust Across Aspect Ratios**: Capable of accurately captioning images with wide, tall, square, and irregular dimensions. |
|
|
| * **Variable Detail Control**: Produces either high-level summaries or fine-grained descriptions, depending on what the prompt asks for. |
|
|
| * **Built on the Qwen2.5-VL Architecture**: Leverages the strengths of the Qwen2.5-VL-7B-Instruct multimodal model for visual reasoning, comprehension, and instruction-following. |
|
|
| * **Multilingual Output Capability**: Supports descriptions in multiple languages (English by default); the output language can be steered through the prompt. |
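
The detail level and output language described in the highlights above are steered purely through the text prompt. The following is a minimal sketch of a helper that builds chat messages in the same format the Quick Start section uses. The function name `build_messages`, the `detail` and `language` parameters, and the preset instructions are illustrative assumptions, not part of the model's API or prompts the model was tuned on.

```python
# Hypothetical helper: the preset prompts are illustrative, not tuned for this model.
DETAIL_PROMPTS = {
    "brief": "Describe this image in one or two sentences.",
    "detailed": "Describe this image in detail, including the spatial "
                "relationships between objects.",
}


def build_messages(image, detail="detailed", language="English"):
    """Build a single-turn chat message list for one image (path or URL)."""
    if detail not in DETAIL_PROMPTS:
        raise ValueError(f"unknown detail level: {detail!r}")
    prompt = DETAIL_PROMPTS[detail]
    if language != "English":
        # Request a different output language directly in the prompt.
        prompt += f" Respond in {language}."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_messages("path/or/url/to/image.jpg", detail="brief", language="French")
```

The resulting `messages` list can be passed to `processor.apply_chat_template` in place of the hand-written one. How well the model follows a given language or length instruction is not documented here, so results should be checked per use case.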
|
|
| # Quick Start with Transformers |
|
|
| ```python |
| # Requires: pip install transformers accelerate qwen-vl-utils |
| from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
| from qwen_vl_utils import process_vision_info |
| |
| # Load the model; dtype and device placement are selected automatically. |
| model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|     "prithivMLmods/Spatial-VU", torch_dtype="auto", device_map="auto" |
| ) |
| |
| processor = AutoProcessor.from_pretrained("prithivMLmods/Spatial-VU") |
| |
| # A single user turn containing one image and one text instruction. |
| messages = [ |
|     { |
|         "role": "user", |
|         "content": [ |
|             { |
|                 "type": "image", |
|                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", |
|             }, |
|             {"type": "text", "text": "Describe this image in detail."}, |
|         ], |
|     } |
| ] |
| |
| # Render the chat template to a prompt string and load the referenced image. |
| text = processor.apply_chat_template( |
|     messages, tokenize=False, add_generation_prompt=True |
| ) |
| image_inputs, video_inputs = process_vision_info(messages) |
| inputs = processor( |
|     text=[text], |
|     images=image_inputs, |
|     videos=video_inputs, |
|     padding=True, |
|     return_tensors="pt", |
| ) |
| # Move inputs to wherever device_map="auto" placed the model. |
| inputs = inputs.to(model.device) |
| |
| # Generate the caption; raise max_new_tokens for longer descriptions. |
| generated_ids = model.generate(**inputs, max_new_tokens=128) |
| # Drop the prompt tokens so that only the newly generated text is decoded. |
| generated_ids_trimmed = [ |
|     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
| ] |
| output_text = processor.batch_decode( |
|     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
| ) |
| print(output_text) |
| ``` |
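
Because the model is meant to handle varying aspect ratios and resolutions, it can help to bound how many visual tokens each image consumes. The base Qwen2.5-VL processor accepts `min_pixels` and `max_pixels` for this, with each visual token covering a 28x28 pixel area. The sketch below assumes this fine-tune keeps the base processor behavior; the helper names `pixel_budget` and `load_processor` are hypothetical, and the default range of 256 to 1280 tokens is only an example.

```python
# Each visual token in Qwen2.5-VL covers a 28x28 pixel area
# (14-pixel patches merged 2x2), per the base model's documentation.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens=256, max_tokens=1280):
    """Convert a visual-token range into min_pixels/max_pixels keyword arguments."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return {
        "min_pixels": min_tokens * PIXELS_PER_TOKEN,
        "max_pixels": max_tokens * PIXELS_PER_TOKEN,
    }


def load_processor(model_id="prithivMLmods/Spatial-VU", **budget_kwargs):
    """Load the processor with a bounded image resolution (assumed to carry over)."""
    # Imported lazily so the budget helper works without transformers installed.
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(model_id, **pixel_budget(**budget_kwargs))
```

A lower `max_tokens` reduces memory use and latency at the cost of fine detail; a higher value does the opposite. Calling `load_processor(max_tokens=768)` would replace the plain `AutoProcessor.from_pretrained` call in the example above.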
|
|
| # Intended Use |
|
|
| * Generating detailed and unfiltered image captions for general-purpose or artistic datasets. |
| * Spatial reasoning and vision understanding research. |
| * Content moderation research, red-teaming, and generative safety evaluations. |
| * Enabling descriptive captioning for visual datasets typically excluded from mainstream models. |
| * Use in creative applications (e.g., storytelling, art generation) that benefit from rich descriptive captions. |
| * Captioning for non-standard aspect ratios and stylized visual content. |
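
For the dataset captioning use cases listed above, one simple pattern is to write one JSON record per image. This is a sketch only: `caption_dataset` and its `caption_fn` argument are hypothetical names, and `caption_fn` stands for any callable that takes an image path and returns a caption string, for example a wrapper around the generation steps in the Quick Start section.

```python
import json
from pathlib import Path


def caption_dataset(image_paths, caption_fn, out_path):
    """Write one JSON line per image: {"image": <path>, "caption": <text>}."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for path in image_paths:
            record = {"image": str(path), "caption": caption_fn(str(path))}
            # ensure_ascii=False keeps non-English captions readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path
```

Keeping the model call behind `caption_fn` makes it easy to add a review or filtering step before captions are stored, which matters given the limitations noted below.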
|
|
| # Limitations |
|
|
| * May produce explicit, sensitive, or offensive descriptions depending on image content and prompts. |
| * Not suitable for deployment in production systems requiring content filtering or moderation. |
| * Caption tone and style can vary with the phrasing of the input prompt. |
| * Accuracy for unfamiliar or synthetic visual styles may vary. |