Instructions to use microsoft/Phi-4-mini-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use microsoft/Phi-4-mini-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-4-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use microsoft/Phi-4-mini-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/Phi-4-mini-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
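
The same request can be sent from Python instead of curl. The following is a minimal sketch using only the standard library; it assumes the server started above is reachable at localhost:8000, and the helper names `build_chat_request` and `ask` are hypothetical, chosen for illustration.

```python
import json
import urllib.request

# Assumption: the vLLM server from the steps above listens on localhost:8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="microsoft/Phi-4-mini-instruct"):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's answer. Pointing `BASE_URL` at port 30000 should work the same way against an SGLang server, since it exposes the same OpenAI-compatible route.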

Use Docker

docker model run hf.co/microsoft/Phi-4-mini-instruct

SGLang

How to use microsoft/Phi-4-mini-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-4-mini-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-4-mini-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use microsoft/Phi-4-mini-instruct with Docker Model Runner:
```
docker model run hf.co/microsoft/Phi-4-mini-instruct
```

Issue Deploying Phi-4-mini-instruct on SageMaker (TGI): Container Health Check Fails

#30

by aamirfaaiz - opened May 19, 2025

Discussion

aamirfaaiz

May 19, 2025

Hi there,

I’ve been trying to deploy 'microsoft/Phi-4-mini-instruct' on Amazon SageMaker using the Hugging Face LLM Inference container (TGI backend), but the endpoint consistently fails with a 'ping health check' error.

Here’s a summary of what I’m doing:

Using: get_huggingface_llm_image_uri(backend="huggingface", version="1.2.0")
Instance type: ml.g5.2xlarge

The endpoint consistently fails with:
The primary container for production variant AllTraffic did not pass the ping health check.

charliezjw

Jul 8, 2025

+1, did you get any solution? Thank you very much!

panalexeu

Aug 29, 2025

nbaughman

Aug 31, 2025

@aamirfaaiz @panalexeu @charliezjw

I don't think TGI supports Phi-4 yet: https://github.com/huggingface/text-generation-inference/issues/3071

charliezjw

Aug 31, 2025

I was able to eventually spin it up with LoRA adapter support using "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"

agentgraph-official

Mar 28

The TGI health check failure on SageMaker with Phi-4-mini-instruct is almost always a container startup timeout issue rather than a model problem per se. Phi-4-mini, even at the mini scale, has enough weight to push past SageMaker's default health check grace period, especially on cold start. Check the container startup health check timeout (the container_startup_health_check_timeout argument to deploy() in the SageMaker Python SDK) and the SM_NUM_GPUS environment variable: TGI needs explicit GPU count signaling on SageMaker, and if it tries to shard across a mismatched GPU count it will hang before the /health endpoint ever becomes responsive. Also confirm you're on a TGI image version that actually supports the Phi-4 architecture; anything before roughly 1.4.x, which includes the 1.2.0 image from the original post, won't have the correct modeling code and will silently fail during model load.
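
The GPU count and startup timeout mentioned above could be wired up roughly as follows. This is a sketch under stated assumptions: `build_deploy_settings` and `GPUS_PER_INSTANCE` are hypothetical names, the 600-second timeout is an illustrative value rather than a recommendation, and the SageMaker SDK calls appear only in comments because they need an AWS account and execution role.

```python
# GPU counts for a few g5 instance types; extend the table as needed.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
}


def build_deploy_settings(model_id, instance_type, startup_timeout_s=600):
    """Return (env, deploy_kwargs) for a single-instance endpoint.

    startup_timeout_s=600 is an illustrative assumption, not a tested value.
    """
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"GPU count not known for {instance_type!r}")
    env = {
        "HF_MODEL_ID": model_id,
        # Container environment values must be strings.
        "SM_NUM_GPUS": str(GPUS_PER_INSTANCE[instance_type]),
    }
    deploy_kwargs = {
        "initial_instance_count": 1,
        "instance_type": instance_type,
        # Time SageMaker waits for the container to pass the ping check.
        "container_startup_health_check_timeout": startup_timeout_s,
    }
    return env, deploy_kwargs


env, deploy_kwargs = build_deploy_settings(
    "microsoft/Phi-4-mini-instruct", "ml.g5.2xlarge"
)

# With the SageMaker Python SDK installed, these would be passed on roughly
# like this (not executed here; image_uri and role must be supplied):
#   from sagemaker.huggingface import HuggingFaceModel
#   model = HuggingFaceModel(image_uri=image_uri, env=env, role=role)
#   predictor = model.deploy(**deploy_kwargs)
```

Keeping the GPU count in a lookup table makes a mismatch between instance type and SM_NUM_GPUS fail early with a clear error instead of at container startup.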

A few concrete things to try: set the input and total token limits explicitly in your container environment (MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS, the environment-variable forms of --max-input-length and --max-total-tokens) rather than relying on defaults, since Phi-4-mini-instruct's context configuration can cause TGI to allocate more KV cache than the instance has VRAM for, which manifests as a health check failure rather than an OOM error. Also check CloudWatch logs for the actual TGI stderr output: SageMaker often surfaces only the health check failure in the console, while the real error (unsupported model arch, CUDA OOM, missing tokenizer files) is buried in the container logs. The microsoft/Phi-4-mini-instruct repo uses a trust_remote_code pattern, so make sure --trust-remote-code is passed as a TGI launch argument (in a SageMaker container, via the TRUST_REMOTE_CODE environment variable).
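
The token limits and the log lookup suggested above might look like this in practice. It is a sketch, not a verified fix: the function names are hypothetical, the environment variable names are assumed to be the upper-case forms of the TGI launcher flags, and any concrete limits you pass in (for example 4096 and 8192) are placeholders to tune for the instance.

```python
def build_tgi_limits_env(max_input_length, max_total_tokens, trust_remote_code=True):
    """Build the extra container environment entries as strings."""
    # The input budget has to leave room for generated tokens.
    if not 0 < max_input_length < max_total_tokens:
        raise ValueError(
            "max_input_length must be positive and smaller than max_total_tokens"
        )
    env = {
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
    }
    if trust_remote_code:
        # Assumption: the launcher accepts "true" for this boolean setting.
        env["TRUST_REMOTE_CODE"] = "true"
    return env


def endpoint_log_group(endpoint_name):
    """CloudWatch log group that SageMaker uses for an endpoint's containers."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def fetch_log_lines(endpoint_name, limit=50):
    """Return up to `limit` log messages for the endpoint.

    Needs boto3 and AWS credentials; imported lazily so the helpers above
    can be used without either.
    """
    import boto3

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=endpoint_log_group(endpoint_name),
        limit=limit,
    )
    return [event["message"] for event in response["events"]]
```

The dictionary returned by `build_tgi_limits_env` would be merged into the `env` passed to the model definition, and `fetch_log_lines("<your-endpoint-name>")` would then show what the container printed while it was failing the ping check.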

One tangential note: if you're building multi-agent pipelines on top of this deployment, the question of which agent called which endpoint with what identity becomes non-trivial at scale. This is something we think about a lot at AgentGraph — when you have orchestrators routing tasks to model endpoints like this, having a verifiable identity layer for the calling agent matters for debugging and auditing, especially as projects like AgentVerse start pushing toward open agent-to-agent communication. But that's a longer conversation — get the container healthy first.
