Text Generation
Transformers
PyTorch
Safetensors
English
phi3_v
Embedding
conversational
custom_code
Instructions to use TIGER-Lab/VLM2Vec-Full with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TIGER-Lab/VLM2Vec-Full with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TIGER-Lab/VLM2Vec-Full", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TIGER-Lab/VLM2Vec-Full", trust_remote_code=True, dtype="auto")
```
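Note that despite the auto-generated text-generation snippet above, VLM2Vec-Full is an embedding model (see the Embedding tag), and the official usage goes through the wrapper in the VLM2Vec GitHub repository. As a rough transformers-only sketch, one might pool the final hidden state of the last token, which is the pooling the VLM2Vec paper describes; the exact forward signature of the custom phi3_v code may differ, so treat this as an assumption rather than the official API (`example.jpg` stands in for any local image):

```python
# Rough sketch, NOT the official API: embed an image+text query by
# last-token pooling of the final hidden state, per the VLM2Vec paper.
# The processor call mirrors the repo's example; the forward kwargs are
# assumed to follow standard transformers behavior.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("TIGER-Lab/VLM2Vec-Full", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "TIGER-Lab/VLM2Vec-Full", trust_remote_code=True, dtype="auto"
).eval()

inputs = processor(
    "<|image_1|> Represent the given image.",
    [Image.open("example.jpg")],
    return_tensors="pt",
)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, return_dict=True)
embedding = out.hidden_states[-1][:, -1]  # [batch, hidden_dim]
```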
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TIGER-Lab/VLM2Vec-Full with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TIGER-Lab/VLM2Vec-Full"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
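Because the server exposes an OpenAI-compatible API, you can also call it with the official `openai` Python client instead of curl (assuming `pip install openai`; the same snippet works against the SGLang server below by switching the port to 30000):

```python
# Call the vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```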
- SGLang
How to use TIGER-Lab/VLM2Vec-Full with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TIGER-Lab/VLM2Vec-Full" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TIGER-Lab/VLM2Vec-Full" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use TIGER-Lab/VLM2Vec-Full with Docker Model Runner:
```shell
docker model run hf.co/TIGER-Lab/VLM2Vec-Full
```
ValueError: not enough values to unpack (expected 5, got 4) when using image+text with TIGER-Lab/VLM2Vec-Full
#8
by NancyWangWXY
Hi TIGER Lab team and community!
I'm currently trying to run the TIGER-Lab/VLM2Vec-Full model both on Google Colab and in my local VS Code environment. I'm strictly following the example code provided on the Hugging Face model card as well as the GitHub repository instructions.
Everything works fine up to the point where I attempt to run inference on an image+text pair using:
```python
from PIL import Image

# Build the query input from the image+text pair, move it to GPU,
# and run the query-side forward pass.
inputs = processor('<|image_1|> Represent the given image with the following question: What is in the image', [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]
```
At this point, I consistently get the following error:
```
ValueError: not enough values to unpack (expected 5, got 4)
```
Digging into the source code, it seems the error originates from phi3_v/image_embedding_phi3_v.py here:
```python
num_images, num_crops, c, h, w = pixel_values.shape
```
Apparently, pixel_values is only 4D at that point (e.g., [1, 3, 336, 336]), whereas the model expects 5D input: [batch_size, num_crops, 3, 336, 336].
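As a stopgap, a shape guard along these lines avoids the crash, though it assumes the 4D case really corresponds to a single crop, which may be exactly the wrong assumption:

```python
# Band-aid, not a fix: insert a singleton num_crops axis when the
# processor returns 4D pixel_values ([num_images, C, H, W]) so the
# unpack in image_embedding_phi3_v.py sees the expected 5D shape.
pixel_values = inputs["pixel_values"]
if pixel_values.dim() == 4:
    inputs["pixel_values"] = pixel_values.unsqueeze(1)  # [num_images, 1, C, H, W]
```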
What I've Tried
I manually printed the shape of pixel_values returned by the processor: it sometimes gives a 5D shape like [1, 5, 3, 336, 336], but other times only 4D, depending on how it's called.
I attempted to force it into 5D with .unsqueeze(1) (the guard sketched above), but that doesn't consistently fix the issue and may be masking the real problem.
I noticed that the example code references a load_processor() function from src.utils, but this function is not present in the current codebase.
My Questions
What is the correct way to use the processor to ensure 5D pixel_values are always returned when needed?
Is there a specific load_processor() function you recommend that handles this more reliably?
Should AutoProcessor.from_pretrained('TIGER-Lab/VLM2Vec-Full') be sufficient? Or does it lack the custom logic needed for phi3_v's multi-crop setup?
Are we supposed to configure num_crops=16 manually somewhere in the processor or preprocessor pipeline?
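For example, is something like this the intended way to set it? I'm guessing at the kwarg here, based on how the Phi-3.5-vision processor accepts num_crops; I haven't verified it against VLM2Vec-Full:

```python
# Guess: pass num_crops when loading the processor, the way the
# Phi-3.5-vision processor allows. Unverified for VLM2Vec-Full.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "TIGER-Lab/VLM2Vec-Full",
    trust_remote_code=True,
    num_crops=16,
)
```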
I'm really excited about this model: it looks incredibly promising for multimodal tasks, and I would love to get it running end-to-end. Any guidance you could provide would be very much appreciated!
Thanks in advance!