Instructions for using hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration with libraries and local apps.

Libraries

How to use hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration")
model = AutoModelForImageTextToText.from_pretrained("hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
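
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the base URL assumes the server started above is listening on localhost port 8000 (the SGLang commands further down use port 30000 instead). As the checkpoint name suggests, this is a tiny randomly initialized test model, so the returned text is not expected to be meaningful.

```python
import json
import urllib.request

MODEL_ID = "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration"
IMAGE_URL = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"


def build_chat_request(base_url, prompt, image_url, model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server, so it is left commented out):
# request = build_chat_request(
#     "http://localhost:8000", "Describe this image in one sentence.", IMAGE_URL
# )
# print(send_chat_request(request))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.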

Use Docker

docker model run hf.co/hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration

SGLang

How to use hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration with Docker Model Runner:
```
docker model run hf.co/hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration
```

tiny-random-Qwen2VLForConditionalGeneration / README.md

Xenova HF Staff

Update README.md

93d0199 verified over 1 year ago

	---
	library_name: transformers
	tags: []
	---

	# Model Card for Model ID

	<!-- Provide a quick summary of what the model is/does. -->


	## ONNX export code
	```sh
	pip install --upgrade git+https://github.com/huggingface/transformers.git onnx==1.17.0 onnxruntime==1.20.1 optimum==1.23.3 onnxslim==0.1.42
	```


	```py
	import os
	import torch
	from transformers import (
	    AutoProcessor,
	    Qwen2VLForConditionalGeneration,
	    DynamicCache,
	)


	class PatchedQwen2VLForConditionalGeneration(Qwen2VLForConditionalGeneration):
	    def forward(self, *args):
	        inputs_embeds, attention_mask, position_ids, *past_key_values_args = args

	        # Convert past_key_values list to DynamicCache
	        if len(past_key_values_args) == 0:
	            past_key_values = None
	        else:
	            past_key_values = DynamicCache(self.config.num_hidden_layers)
	            for i in range(self.config.num_hidden_layers):
	                key = past_key_values_args.pop(0)
	                value = past_key_values_args.pop(0)
	                past_key_values.update(key_states=key, value_states=value, layer_idx=i)

	        o = super().forward(
	            inputs_embeds=inputs_embeds,
	            attention_mask=attention_mask,
	            position_ids=position_ids,
	            past_key_values=past_key_values,
	        )

	        flattened_past_key_values_outputs = {
	            "logits": o.logits,
	        }
	        output_past_key_values: DynamicCache = o.past_key_values
	        for i, (key, value) in enumerate(
	            zip(output_past_key_values.key_cache, output_past_key_values.value_cache)
	        ):
	            flattened_past_key_values_outputs[f"present.{i}.key"] = key
	            flattened_past_key_values_outputs[f"present.{i}.value"] = value

	        return flattened_past_key_values_outputs


	# Constants
	OUTPUT_FOLDER = "output"
	EMBEDDING_MODEL_NAME = "embed_tokens.onnx"
	TEXT_MODEL_NAME = "decoder_model_merged.onnx"
	VISION_MODEL_NAME = "vision_encoder.onnx"
	TEMP_MODEL_OUTPUT_FOLDER = os.path.join(OUTPUT_FOLDER, "temp")
	FINAL_MODEL_OUTPUT_FOLDER = os.path.join(OUTPUT_FOLDER, "onnx")


	# Load model and processor
	model_id = "hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration"
	model = PatchedQwen2VLForConditionalGeneration.from_pretrained(model_id).eval()
	processor = AutoProcessor.from_pretrained(model_id)


	# Save model configs and processor
	model.config.save_pretrained(OUTPUT_FOLDER)
	model.generation_config.save_pretrained(OUTPUT_FOLDER)
	processor.save_pretrained(OUTPUT_FOLDER)
	os.makedirs(TEMP_MODEL_OUTPUT_FOLDER, exist_ok=True)


	# Configuration values
	## Text model
	text_config = model.config
	num_heads = text_config.num_attention_heads
	num_key_value_heads = text_config.num_key_value_heads
	head_dim = text_config.hidden_size // num_heads
	num_layers = text_config.num_hidden_layers
	hidden_size = text_config.hidden_size

	## Vision model
	vision_config = model.config.vision_config
	channel = vision_config.in_chans
	temporal_patch_size = vision_config.temporal_patch_size
	patch_size = vision_config.spatial_patch_size


	# Dummy input sizes
	grid_t, grid_h, grid_w = [1, 16, 16]
	batch_size = 1
	sequence_length = 16
	num_channels = 3
	past_sequence_length = 0

	image_batch_size = 1 # TODO: Add support for > 1 images
	assert image_batch_size == 1


	# Dummy inputs
	## Embedding inputs
	input_ids = torch.randint(
	    0, model.config.vocab_size, (batch_size, sequence_length), dtype=torch.int64
	)

	## Text inputs
	dummy_past_key_values_kwargs = {
	    f"past_key_values.{i}.{key}": torch.zeros(
	        batch_size,
	        num_key_value_heads,
	        past_sequence_length,
	        head_dim,
	        dtype=torch.float32,
	    )
	    for i in range(num_layers)
	    for key in ["key", "value"]
	}
	inputs_embeds = torch.ones(
	    batch_size, sequence_length, hidden_size, dtype=torch.float32
	)
	attention_mask = torch.ones(batch_size, sequence_length, dtype=torch.int64)
	position_ids = torch.ones(3, batch_size, sequence_length, dtype=torch.int64)

	## Vision inputs
	grid_thw = torch.tensor(
	    [[grid_t, grid_h, grid_w]] * image_batch_size, dtype=torch.int64
	)
	pixel_values = torch.randn(
	    image_batch_size * grid_t * grid_h * grid_w,
	    channel * temporal_patch_size * patch_size * patch_size,
	    dtype=torch.float32,
	)


	# ONNX Exports
	## Embedding model
	embedding_inputs = dict(input_ids=input_ids)
	embedding_inputs_positional = tuple(embedding_inputs.values())
	model.model.embed_tokens(*embedding_inputs_positional)  # Test forward pass
	EMBED_TOKENS_OUTPUT_PATH = os.path.join(TEMP_MODEL_OUTPUT_FOLDER, EMBEDDING_MODEL_NAME)
	torch.onnx.export(
	    model.model.embed_tokens,
	    args=embedding_inputs_positional,
	    f=EMBED_TOKENS_OUTPUT_PATH,
	    export_params=True,
	    opset_version=14,
	    do_constant_folding=True,
	    input_names=list(embedding_inputs.keys()),
	    output_names=["inputs_embeds"],
	    dynamic_axes={
	        "input_ids": {0: "batch_size", 1: "sequence_length"},
	        "inputs_embeds": {0: "batch_size", 1: "sequence_length"},
	    },
	)

	## Text model
	text_inputs = dict(
	    inputs_embeds=inputs_embeds,
	    attention_mask=attention_mask,
	    position_ids=position_ids,
	    **dummy_past_key_values_kwargs,
	)
	text_inputs_positional = tuple(text_inputs.values())
	text_outputs = model.forward(*text_inputs_positional)  # Test forward pass
	TEXT_MODEL_OUTPUT_PATH = os.path.join(TEMP_MODEL_OUTPUT_FOLDER, TEXT_MODEL_NAME)
	torch.onnx.export(
	    model,
	    args=text_inputs_positional,
	    f=TEXT_MODEL_OUTPUT_PATH,
	    export_params=True,
	    opset_version=14,
	    do_constant_folding=True,
	    input_names=list(text_inputs.keys()),
	    output_names=["logits"]
	    + [f"present.{i}.{key}" for i in range(num_layers) for key in ["key", "value"]],
	    dynamic_axes={
	        "inputs_embeds": {0: "batch_size", 1: "sequence_length"},
	        "attention_mask": {0: "batch_size", 1: "sequence_length"},
	        "position_ids": {1: "batch_size", 2: "sequence_length"},
	        **{
	            f"past_key_values.{i}.{key}": {0: "batch_size", 2: "past_sequence_length"}
	            for i in range(num_layers)
	            for key in ["key", "value"]
	        },
	        "logits": {0: "batch_size", 1: "sequence_length"},
	        **{
	            f"present.{i}.{key}": {0: "batch_size", 2: "past_sequence_length + 1"}
	            for i in range(num_layers)
	            for key in ["key", "value"]
	        },
	    },
	)

	## Vision model
	vision_inputs = dict(
	    pixel_values=pixel_values,
	    grid_thw=grid_thw,
	)
	vision_inputs_positional = tuple(vision_inputs.values())
	vision_outputs = model.visual.forward(*vision_inputs_positional)  # Test forward pass
	VISION_ENCODER_OUTPUT_PATH = os.path.join(TEMP_MODEL_OUTPUT_FOLDER, VISION_MODEL_NAME)
	torch.onnx.export(
	    model.visual,
	    args=vision_inputs_positional,
	    f=VISION_ENCODER_OUTPUT_PATH,
	    export_params=True,
	    opset_version=14,
	    do_constant_folding=True,
	    input_names=list(vision_inputs.keys()),
	    output_names=["image_features"],
	    dynamic_axes={
	        "pixel_values": {
	            0: "batch_size * grid_t * grid_h * grid_w",
	            1: "channel * temporal_patch_size * patch_size * patch_size",
	        },
	        "grid_thw": {0: "batch_size"},
	        "image_features": {0: "batch_size * grid_t * grid_h * grid_w"},
	    },
	)


	# Post-processing
	import onnx
	import onnxslim
	from optimum.onnx.graph_transformations import check_and_save_model

	os.makedirs(FINAL_MODEL_OUTPUT_FOLDER, exist_ok=True)
	for name in (EMBEDDING_MODEL_NAME, TEXT_MODEL_NAME, VISION_MODEL_NAME):
	    temp_model_path = os.path.join(TEMP_MODEL_OUTPUT_FOLDER, name)

	    ## Shape inference (especially needed by the vision encoder)
	    onnx.shape_inference.infer_shapes_path(temp_model_path, check_type=True, strict_mode=True)

	    ## Attempt to optimize the model with onnxslim
	    try:
	        model = onnxslim.slim(temp_model_path)
	    except Exception as e:
	        print(f"Failed to slim {temp_model_path}: {e}")
	        model = onnx.load(temp_model_path)

	    ## Save model
	    final_model_path = os.path.join(FINAL_MODEL_OUTPUT_FOLDER, name)
	    check_and_save_model(model, final_model_path)

	## Cleanup
	import shutil
	shutil.rmtree(TEMP_MODEL_OUTPUT_FOLDER)
	```
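
The dummy inputs in the export script follow a fixed naming and shape convention: one `key` and one `value` tensor per layer for the cache, and one row per image patch for `pixel_values`. As a rough, dependency-free sketch, the helpers below reproduce that convention. Their names are hypothetical and not part of the script, and the numeric values in the usage lines are illustrative assumptions rather than values read from this checkpoint's config.

```python
def past_key_value_names(num_layers, prefix="past_key_values"):
    """Return cache tensor names in the order the patched forward() consumes them."""
    return [
        f"{prefix}.{i}.{kind}"
        for i in range(num_layers)
        for kind in ("key", "value")
    ]


def cache_shape(batch_size, num_key_value_heads, head_dim, past_sequence_length=0):
    """Shape of a single key or value tensor; length 0 means an empty cache."""
    return (batch_size, num_key_value_heads, past_sequence_length, head_dim)


def pixel_values_shape(grid_thw, channel, temporal_patch_size, patch_size):
    """Shape of the flattened patch matrix fed to the vision encoder.

    grid_thw is a list of (grid_t, grid_h, grid_w) rows, one per image.
    The export script asserts a single image; summing over rows is an
    assumption about how additional images would be laid out.
    """
    num_patches = sum(t * h * w for t, h, w in grid_thw)
    patch_dim = channel * temporal_patch_size * patch_size * patch_size
    return (num_patches, patch_dim)


# Illustrative values only (assumed, not taken from the model config).
print(past_key_value_names(2))
print(cache_shape(batch_size=1, num_key_value_heads=2, head_dim=8))
print(pixel_values_shape([(1, 16, 16)], channel=3, temporal_patch_size=2, patch_size=14))
```

The same name list, with the prefix `present` instead of `past_key_values`, matches the output names passed to `torch.onnx.export` for the text model.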


	## Model Details

	### Model Description

	<!-- Provide a longer summary of what this model is. -->

	This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

	- Developed by: [More Information Needed]
	- Funded by [optional]: [More Information Needed]
	- Shared by [optional]: [More Information Needed]
	- Model type: [More Information Needed]
	- Language(s) (NLP): [More Information Needed]
	- License: [More Information Needed]
	- Finetuned from model [optional]: [More Information Needed]

	### Model Sources [optional]

	<!-- Provide the basic links for the model. -->

	- Repository: [More Information Needed]
	- Paper [optional]: [More Information Needed]
	- Demo [optional]: [More Information Needed]

	## Uses

	<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

	### Direct Use

	<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

	[More Information Needed]

	### Downstream Use [optional]

	<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

	[More Information Needed]

	### Out-of-Scope Use

	<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

	[More Information Needed]

	## Bias, Risks, and Limitations

	<!-- This section is meant to convey both technical and sociotechnical limitations. -->

	[More Information Needed]

	### Recommendations

	<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

	Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	[More Information Needed]

	## Training Details

	### Training Data

	<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

	[More Information Needed]

	### Training Procedure

	<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

	#### Preprocessing [optional]

	[More Information Needed]


	#### Training Hyperparameters

	- Training regime: [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

	#### Speeds, Sizes, Times [optional]

	<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

	[More Information Needed]

	## Evaluation

	<!-- This section describes the evaluation protocols and provides the results. -->

	### Testing Data, Factors & Metrics

	#### Testing Data

	<!-- This should link to a Dataset Card if possible. -->

	[More Information Needed]

	#### Factors

	<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

	[More Information Needed]

	#### Metrics

	<!-- These are the evaluation metrics being used, ideally with a description of why. -->

	[More Information Needed]

	### Results

	[More Information Needed]

	#### Summary



	## Model Examination [optional]

	<!-- Relevant interpretability work for the model goes here -->

	[More Information Needed]

	## Environmental Impact

	<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

	- Hardware Type: [More Information Needed]
	- Hours used: [More Information Needed]
	- Cloud Provider: [More Information Needed]
	- Compute Region: [More Information Needed]
	- Carbon Emitted: [More Information Needed]

	## Technical Specifications [optional]

	### Model Architecture and Objective

	[More Information Needed]

	### Compute Infrastructure

	[More Information Needed]

	#### Hardware

	[More Information Needed]

	#### Software

	[More Information Needed]

	## Citation [optional]

	<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

	BibTeX:

	[More Information Needed]

	APA:

	[More Information Needed]

	## Glossary [optional]

	<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

	[More Information Needed]

	## More Information [optional]

	[More Information Needed]

	## Model Card Authors [optional]

	[More Information Needed]

	## Model Card Contact

	[More Information Needed]