Text Generation
Transformers
PyTorch
Safetensors
English
phi3_v
Embedding
conversational
custom_code
Instructions to use TIGER-Lab/VLM2Vec-Full with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TIGER-Lab/VLM2Vec-Full with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TIGER-Lab/VLM2Vec-Full", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TIGER-Lab/VLM2Vec-Full", trust_remote_code=True, dtype="auto")
```
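Note that despite the auto-generated text-generation snippet above, VLM2Vec-Full is an embedding model (see the Embedding tag), and the official usage goes through the wrapper in the VLM2Vec GitHub repository. As a rough transformers-only sketch, one might pool the final hidden state of the last token, which is the pooling the VLM2Vec paper describes; the exact forward signature of the custom phi3_v code may differ, so treat this as an assumption rather than the official API (`example.jpg` stands in for any local image):

```python
# Rough sketch, NOT the official API: embed an image+text query by
# last-token pooling of the final hidden state, per the VLM2Vec paper.
# The processor call mirrors the repo's example; the forward kwargs are
# assumed to follow standard transformers behavior.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("TIGER-Lab/VLM2Vec-Full", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "TIGER-Lab/VLM2Vec-Full", trust_remote_code=True, dtype="auto"
).eval()

inputs = processor(
    "<|image_1|> Represent the given image.",
    [Image.open("example.jpg")],
    return_tensors="pt",
)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, return_dict=True)
embedding = out.hidden_states[-1][:, -1]  # [batch, hidden_dim]
```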
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TIGER-Lab/VLM2Vec-Full with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TIGER-Lab/VLM2Vec-Full"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
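Because the server exposes an OpenAI-compatible API, you can also call it with the official `openai` Python client instead of curl (assuming `pip install openai`; the same snippet works against the SGLang server below by switching the port to 30000):

```python
# Call the vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```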
- SGLang
How to use TIGER-Lab/VLM2Vec-Full with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TIGER-Lab/VLM2Vec-Full" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TIGER-Lab/VLM2Vec-Full" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use TIGER-Lab/VLM2Vec-Full with Docker Model Runner:
```shell
docker model run hf.co/TIGER-Lab/VLM2Vec-Full
```
ValueError: not enough values to unpack (expected 5, got 4) when using image+text with TIGER-Lab/VLM2Vec-Full
#8
by NancyWangWXY
Hi TIGER Lab team and community!
I'm currently trying to run the TIGER-Lab/VLM2Vec-Full model both on Google Colab and in my local VS Code environment. I'm strictly following the example code provided on the Hugging Face model card as well as the GitHub repository instructions.
Everything works fine up to the point where I attempt to run inference on an image+text pair using:
```python
from PIL import Image

# Build the query input from the image+text pair, move it to GPU,
# and run the query-side forward pass.
inputs = processor('<|image_1|> Represent the given image with the following question: What is in the image', [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]
```
At this point, I consistently get the following error:
```
ValueError: not enough values to unpack (expected 5, got 4)
```
Digging into the source code, it seems the error originates from phi3_v/image_embedding_phi3_v.py here:
```python
num_images, num_crops, c, h, w = pixel_values.shape
```
Apparently, pixel_values is only 4D at that point (e.g., [1, 3, 336, 336]), whereas the model expects 5D input: [batch_size, num_crops, 3, 336, 336].
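As a stopgap, a shape guard along these lines avoids the crash, though it assumes the 4D case really corresponds to a single crop, which may be exactly the wrong assumption:

```python
# Band-aid, not a fix: insert a singleton num_crops axis when the
# processor returns 4D pixel_values ([num_images, C, H, W]) so the
# unpack in image_embedding_phi3_v.py sees the expected 5D shape.
pixel_values = inputs["pixel_values"]
if pixel_values.dim() == 4:
    inputs["pixel_values"] = pixel_values.unsqueeze(1)  # [num_images, 1, C, H, W]
```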
What I've Tried
I manually printed the shape of pixel_values returned by the processor: it sometimes gives a 5D shape like [1, 5, 3, 336, 336], but other times only 4D, depending on how it's called.
I attempted to force it into 5D with .unsqueeze(1) (the guard sketched above), but that doesn't consistently fix the issue and may be masking the real problem.
I noticed that the example code references a load_processor() function from src.utils, but this function is not present in the current codebase.
My Questions
What is the correct way to use the processor to ensure 5D pixel_values are always returned when needed?
Is there a specific load_processor() function you recommend that handles this more reliably?
Should AutoProcessor.from_pretrained('TIGER-Lab/VLM2Vec-Full') be sufficient? Or does it lack the custom logic needed for phi3_v's multi-crop setup?
Are we supposed to configure num_crops=16 manually somewhere in the processor or preprocessor pipeline?
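For example, is something like this the intended way to set it? I'm guessing at the kwarg here, based on how the Phi-3.5-vision processor accepts num_crops; I haven't verified it against VLM2Vec-Full:

```python
# Guess: pass num_crops when loading the processor, the way the
# Phi-3.5-vision processor allows. Unverified for VLM2Vec-Full.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "TIGER-Lab/VLM2Vec-Full",
    trust_remote_code=True,
    num_crops=16,
)
```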
I'm really excited about this model: it looks incredibly promising for multimodal tasks, and I would love to get it running end-to-end. Any guidance you could provide would be very much appreciated!
Thanks in advance!