Azhuvath (Rajeev MA)

commented on Run Gemma 4 on Intel® Arc™ GPUs Out-Of-the-Box about 19 hours ago

Thanks for the log. This error is different from the earlier disable_sliding_window issue.

The current failure happens earlier, during model config loading: the transformers version in the container does not yet recognize the new gemma4_unified architecture. So this is a dependency/version mismatch.

The stack we verified for Gemma-4-12B on XPU was:

vLLM: ef3af56

vllm-xpu-kernels: 06e909e

torch: 2.12.0+xpu

transformers: 5.10.0.dev0

Please also check the installed transformers version and upgrade it if it is older than the one listed above.
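A quick way to run that check inside the container is sketched below. It is a minimal sketch, not part of the verified setup: the helper names are hypothetical, only torch and transformers are compared (the vLLM and vllm-xpu-kernels commit hashes cannot be read reliably from package metadata), and the `CONFIG_MAPPING` lookup assumes transformers is importable.

```python
from importlib import metadata

# Versions reported as verified in this thread. The two commit hashes
# (vLLM, vllm-xpu-kernels) are not checked here.
VERIFIED = {"torch": "2.12.0+xpu", "transformers": "5.10.0.dev0"}


def installed_versions(names):
    """Return {package: version or None} for the given distribution names."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed, expected=VERIFIED):
    """List packages whose installed version differs from the expected one."""
    return [
        f"{name}: installed {installed.get(name)}, verified {want}"
        for name, want in expected.items()
        if installed.get(name) != want
    ]


def recognizes(model_type):
    """True/False if transformers knows model_type, None if it cannot be imported."""
    try:
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING
    except ImportError:
        return None
    return model_type in CONFIG_MAPPING
```

Printing `mismatches(installed_versions(VERIFIED))` and `recognizes("gemma4_unified")` from the container's Python shows whether the environment matches the stack above and whether the architecture from the error message is known to the installed transformers.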

Building a wheel from vllm-xpu-kernels looks difficult; I keep running into one error or another. Let me raise a ticket.

commented on Run Gemma 4 on Intel® Arc™ GPUs Out-Of-the-Box 1 day ago

Hi,
At the moment, I do not have a separate prebuilt public Docker image with this exact stack to point you to. The recommended route for now is to pin the vLLM version (git checkout ef3af56) and then proceed with the Docker flow in the blog (docker/Dockerfile.xpu).
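For reference, the two steps can be written out as commands. This is only a sketch: the image tag is a hypothetical placeholder, the commands are assumed to run from the root of a vLLM checkout, and any extra build arguments the blog uses are not reproduced here.

```python
import shlex

VLLM_COMMIT = "ef3af56"               # commit named in this thread
DOCKERFILE = "docker/Dockerfile.xpu"  # path named in this thread
IMAGE_TAG = "vllm-xpu-gemma4"         # hypothetical tag, pick your own


def build_steps(commit=VLLM_COMMIT, dockerfile=DOCKERFILE, tag=IMAGE_TAG):
    """Return the commands, as argv lists, to run from a vLLM checkout."""
    return [
        ["git", "checkout", commit],
        ["docker", "build", "-f", dockerfile, "-t", tag, "."],
    ]


def as_shell(steps):
    """Render argv lists as shell lines with safe quoting."""
    return "\n".join(shlex.join(step) for step in steps)
```

`print(as_shell(build_steps()))` prints the lines to paste into a terminal; each argv list can also be passed to `subprocess.run(..., check=True)`.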

@yintongl I tried this approach first. It did not work, which is why I tried rebuilding vllm-xpu-kernels.

WARNING 06-15 04:19:59 [argparse_utils.py:257] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in a future version.
(APIServer pid=428) INFO 06-15 04:19:59 [api_utils.py:339]
(APIServer pid=428) INFO 06-15 04:19:59 [api_utils.py:339] █ █ █▄ ▄█
(APIServer pid=428) INFO 06-15 04:19:59 [api_utils.py:339] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.22.1rc1.dev200+gef3af56a9
(APIServer pid=428) INFO 06-15 04:19:59 [api_utils.py:339] █▄█▀ █ █ █ █ model google/gemma-4-12B
(APIServer pid=428) INFO 06-15 04:19:59 [api_utils.py:339] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=428) INFO 06-15 04:19:59 [api_utils.py:339]
(APIServer pid=428) INFO 06-15 04:19:59 [api_utils.py:273] non-default args: {'model_tag': 'google/gemma-4-12B', 'host': '0.0.0.0', 'model': 'google/gemma-4-12B', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 8192, 'enforce_eager': True, 'served_model_name': ['google/gemma-4-12B'], 'tensor_parallel_size': 2, 'block_size': 64, 'gpu_memory_utilization': 0.7, 'max_num_batched_tokens': 8192}
(APIServer pid=428) Traceback (most recent call last):
(APIServer pid=428) File "/opt/venv/bin/vllm", line 10, in <module>
(APIServer pid=428) sys.exit(main())
(APIServer pid=428) ^^^^^^
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 95, in main
(APIServer pid=428) args.dispatch_function(args)
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 148, in cmd
(APIServer pid=428) uvloop.run(run_server(args))
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=428) return __asyncio.run(
(APIServer pid=428) ^^^^^^^^^^^^^^
(APIServer pid=428) File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=428) return runner.run(main)
(APIServer pid=428) ^^^^^^^^^^^^^^^^
(APIServer pid=428) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=428) return self._loop.run_until_complete(task)
(APIServer pid=428) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=428) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=428) return await main
(APIServer pid=428) ^^^^^^^^^^
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 663, in run_server
(APIServer pid=428) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 677, in run_server_worker
(APIServer pid=428) async with build_async_engine_client(
(APIServer pid=428) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=428) return await anext(self.gen)
(APIServer pid=428) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 98, in build_async_engine_client
(APIServer pid=428) async with build_async_engine_client_from_engine_args(
(APIServer pid=428) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=428) return await anext(self.gen)
(APIServer pid=428) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
(APIServer pid=428) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=428) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1735, in create_engine_config
(APIServer pid=428) model_config = self.create_model_config()
(APIServer pid=428) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1565, in create_model_config
(APIServer pid=428) return ModelConfig(
(APIServer pid=428) ^^^^^^^^^^^^
(APIServer pid=428) File "/opt/venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=428) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=428) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=428) Value error, The checkpoint you are trying to load has model type gemma4_unified but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
(APIServer pid=428)
(APIServer pid=428) You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git [type=value_error, input_value=ArgsKwargs((), {'model': ...nderer_num_workers': 1}), input_type=ArgsKwargs]
(APIServer pid=428) For further information visit https://errors.pydantic.dev/2.12/v/value_error
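Separately from the gemma4_unified error, the first line of the log warns that --model is deprecated for vllm serve. A sketch of the same invocation with the model as a positional argument is below. The option values are copied from the "non-default args" line of the log; the helper name is hypothetical, and this change only addresses the warning, not the transformers mismatch.

```python
def serve_command(model, options):
    """Build a `vllm serve` argv with the model as the positional argument."""
    argv = ["vllm", "serve", model]
    for flag, value in options.items():
        if value is True:
            argv.append(flag)                # boolean switch, no value
        elif value is not False and value is not None:
            argv.extend([flag, str(value)])  # flag followed by its value
    return argv


# Values taken from the "non-default args" line in the log above.
OPTIONS = {
    "--host": "0.0.0.0",
    "--trust-remote-code": True,
    "--dtype": "bfloat16",
    "--max-model-len": 8192,
    "--enforce-eager": True,
    "--served-model-name": "google/gemma-4-12B",
    "--tensor-parallel-size": 2,
    "--block-size": 64,
    "--gpu-memory-utilization": 0.7,
    "--max-num-batched-tokens": 8192,
}
```

`" ".join(serve_command("google/gemma-4-12B", OPTIONS))` gives the command line to run in the container.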

commented on Run Gemma 4 on Intel® Arc™ GPUs Out-Of-the-Box 2 days ago

For the newly released gemma-4-12B, we have verified the following dependencies on XPU:
vllm: ef3af56
vllm-xpu-kernels: 06e909e
torch: 2.12.0+xpu
transformers: 5.10.0.dev0

Please give it a try.

Building vllm-xpu-kernels looks challenging. Is there another way to get a Docker image with this stack?

commented on Run Gemma 4 on Intel® Arc™ GPUs Out-Of-the-Box 4 days ago

Hi, thanks for trying it out.
Based on the traceback, this does not look like an Intel XPU kernel/runtime failure. It is specifically a config compatibility bug in the disable_sliding_window path, triggered by the stricter config behavior in newer HF Transformers.

As a quick workaround, please retry without --disable-sliding-window and without forcing --max-model-len=8192. In parallel, we should submit a fix to upstream vLLM following the issue you created so that disabling sliding window does not mutate the HF config field to None.
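If the launch command lives in a script, the workaround amounts to dropping those two flags from the argument list. A minimal sketch, with a hypothetical helper name and a placeholder model name:

```python
# Flags to drop for the workaround; the value says whether the flag
# is followed by a separate value token (`--max-model-len 8192`).
WORKAROUND_DROPS = {"--disable-sliding-window": False, "--max-model-len": True}


def drop_flags(argv, drops=WORKAROUND_DROPS):
    """Return argv without the given flags, in `--flag value` or `--flag=value` form."""
    kept, skip_next = [], False
    for arg in argv:
        if skip_next:              # value belonging to a dropped flag
            skip_next = False
            continue
        name = arg.split("=", 1)[0]
        if name in drops:
            # Only the space-separated form carries its value in the next token.
            skip_next = drops[name] and "=" not in arg
            continue
        kept.append(arg)
    return kept
```

For example, `drop_flags(["vllm", "serve", "MODEL", "--disable-sliding-window", "--max-model-len=8192"])` leaves only `["vllm", "serve", "MODEL"]`.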

Thanks for the suggestions. It works fine without --disable-sliding-window. I then tried google/gemma-4-12B, and it does not work. Do I need to upgrade transformers?