ValueError Rope Scaling
Hi, when I try to deploy the model via an HF Inference Endpoint I get this: [Server Message] Endpoint Failed to Start.
With the following details:
Exit code: 1. The tail of the logged config dump and the error:

```
        1.0,
        1.0,
        1.0,
        1.0
      ],
      "type": "longrope"
    },
    "rope_theta": 10000.0,
    "transformers_version": "4.48.3",
    "use_cache": true,
    "vocab_size": 200064
  }
ValueError: rope_scaling's short_factor field must have length 64, got 48
{"timestamp":"2025-03-17T11:18:59.041473Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2025-03-17T11:18:59.041521Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart
```
I'm pretty new to using HF Endpoints, so I just wanted to know whether there's a way I can fix it myself, or if I need to wait for a model update or something like that.
Same thing here too:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in <cell line: 0>()
----> 1 generate(PHI3, messages)

5 frames
/usr/local/lib/python3.11/dist-packages/transformers/models/phi3/configuration_phi3.py in _rope_scaling_validation(self)
    206             )
    207         if not len(rope_scaling_short_factor) == self.hidden_size // self.num_attention_heads // 2:
--> 208             raise ValueError(
    209                 f"rope_scaling's short_factor field must have length {self.hidden_size // self.num_attention_heads // 2}, got {len(rope_scaling_short_factor)}"
    210             )

ValueError: rope_scaling's short_factor field must have length 64, got 48
```
Hi @clawvyrin and @3mar2000 ,
Thanks for your interest!
Yes, support for the new model has already been added in the latest transformers (v4.49.0) and vLLM (v0.7.3).
Can you upgrade your transformers version, or patch in the changes below and try?
vLLM: https://github.com/vllm-project/vllm/pull/12718
HF: https://github.com/huggingface/transformers/pull/35947
Hi @ykim362 , I'm not using Transformers but InferenceClient and it still doesn't work, unfortunately.
I am trying to deploy the model directly from this page: https://huggingface.co/microsoft/Phi-4-mini-instruct
If you want to fix it yourself, patch the validation in `configuration_phi3.py` so it accounts for the partial rotary factor (hard-coded here as 0.75):

```python
rotary_ndims = int(self.hidden_size // self.num_attention_heads * 0.75)
if not len(rope_scaling_short_factor) == rotary_ndims // 2:
    raise ValueError(
        f"rope_scaling's short_factor field must have length {rotary_ndims // 2}, got {len(rope_scaling_short_factor)}"
    )
```
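To see why the validator expects 64 while the shipped config has 48, here is the arithmetic. The dimensions below (hidden size 3072, 24 attention heads) are what Phi-4-mini reportedly uses; check your own config.json rather than taking them as authoritative:

```python
# Illustrative Phi-4-mini-style dimensions; verify against your config.json
hidden_size = 3072
num_attention_heads = 24
partial_rotary_factor = 0.75

head_dim = hidden_size // num_attention_heads         # 128 dims per head
# Old validation rotates the full head dimension:
print(head_dim // 2)                                  # 64 -- what older transformers expects
# With partial RoPE only 75% of each head's dims are rotated:
rotary_ndims = int(head_dim * partial_rotary_factor)  # 96
print(rotary_ndims // 2)                              # 48 -- what the repo's config actually ships
```

That mismatch (64 expected, 48 present) is exactly the error in the logs above.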
+1
The RoPE scaling error in microsoft/Phi-4-mini-instruct typically surfaces when the rope_scaling config field doesn't match what the installed version of transformers expects. Phi-4-mini uses a longrope configuration with a partial rotary factor, and older transformers releases validate the short_factor/long_factor lengths without accounting for it. The fix is usually one of two things: either upgrade transformers to at least the version the maintainers point to above (v4.49.0), or, if you're loading with trust_remote_code=True, make sure the local config.json rope_scaling dict includes all required keys, e.g. the newer "rope_type" key rather than the older "type" field. You can verify what your config.json actually contains by running AutoConfig.from_pretrained("microsoft/Phi-4-mini-instruct") and inspecting the rope_scaling attribute directly.
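A minimal sketch of that key rename, assuming the newer schema simply swaps "type" for "rope_type" (check the validator in your installed transformers before relying on this):

```python
def normalize_rope_scaling(rope_scaling):
    """Rename the legacy "type" key to "rope_type" without mutating the input.

    Sketch only: newer transformers validates "rope_type"; this is not the
    library's canonical migration path.
    """
    scaled = dict(rope_scaling)
    if "type" in scaled and "rope_type" not in scaled:
        scaled["rope_type"] = scaled.pop("type")
    return scaled

# Phi-4-mini-style dict with the legacy key (illustrative values)
cfg = {"type": "longrope", "short_factor": [1.0] * 48, "long_factor": [1.0] * 48}
print(normalize_rope_scaling(cfg)["rope_type"])  # longrope
```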
If you're loading this model as part of a multi-agent pipeline, there's an additional wrinkle worth noting: if different agents or worker processes load the model with different transformers versions in the same environment (common in git-worktree-style multi-agent setups), you can get inconsistent RoPE validation errors that are hard to reproduce. The config validation happens at load time and is version-sensitive, so environment isolation matters more than people expect here.
One concrete debugging step: post the full traceback, including the failing line in configuration_phi3.py (Phi-4-mini reuses the Phi-3 modeling code), along with your transformers.__version__. The error message usually tells you exactly which key is missing or malformed in the rope_scaling dict, and that narrows it down quickly.
The same root cause can show up slightly differently depending on where you load the model: the config.json in the repo has fields that are only valid for newer transformers versions. First thing to check: run pip show transformers and compare against the minimum version specified in the model card. If you're on anything below v4.49.0 you'll likely hit this. The fix is usually just upgrading, but if you're pinned to an older version for other reasons, you can manually patch the rope_scaling dict in the config to strip unrecognized keys before loading.
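A minimal version-gate sketch, in pure Python (the v4.49.0 threshold comes from the maintainer's reply earlier in the thread; the helper names are made up for illustration):

```python
def parse_version(v):
    """Turn "4.48.3" into (4, 48, 3) for tuple comparison; ignores pre-release tags."""
    return tuple(int(part) for part in v.split(".")[:3] if part.isdigit())

MIN_TRANSFORMERS = (4, 49, 0)  # per the maintainer's reply in this thread

def supports_phi4_mini(installed_version):
    return parse_version(installed_version) >= MIN_TRANSFORMERS

print(supports_phi4_mini("4.48.3"))  # False -- the version in the failing config dump
print(supports_phi4_mini("4.49.0"))  # True
```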
If you're loading the model programmatically, you can work around it by patching the config before loading (the exact set of keys your version accepts is version-dependent, so treat the allowed set here as a starting point):

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("microsoft/Phi-4-mini-instruct")
print(config.rope_scaling)  # inspect what the repo actually ships
# Keep only the keys your transformers version's longrope validator recognizes
allowed = {"rope_type", "type", "short_factor", "long_factor"}
config.rope_scaling = {k: v for k, v in config.rope_scaling.items() if k in allowed}
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct", config=config)
```
The exact fields transformers validates against are defined in modeling_rope_utils.py in the transformers source; it's worth diffing that against what's in the model's config.json to see exactly which key is tripping the validator.
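A quick way to do that diff programmatically. The allowed-key set below is a placeholder for illustration; pull the real one out of the modeling_rope_utils.py you actually have installed:

```python
def unrecognized_rope_keys(rope_scaling, allowed):
    """Return the keys in a rope_scaling dict that the validator would reject."""
    return sorted(set(rope_scaling) - set(allowed))

# Placeholder allowed set; replace with the keys your transformers version validates
ALLOWED = {"rope_type", "short_factor", "long_factor"}
cfg = {"type": "longrope", "short_factor": [1.0] * 48, "long_factor": [1.0] * 48}
print(unrecognized_rope_keys(cfg, ALLOWED))  # ['type'] -- the legacy key tripping newer validators
```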
One broader note: this kind of silent config mismatch becomes a real headache in multi-agent pipelines where you're dynamically loading models at runtime. In the infrastructure we build at AgentGraph, we've started treating model config validation as part of the agent provisioning step: catching these errors before an agent is assigned work rather than mid-execution. The RoPE issue is a good example of why version pinning and config schema validation deserve more attention than they usually get in agent deployment setups.