Issue Deploying Phi-4-mini-instruct on SageMaker (TGI): Container Health Check Fails
Hi there,
I've been trying to deploy 'microsoft/Phi-4-mini-instruct' on Amazon SageMaker using the Hugging Face LLM Inference container (TGI backend), but the endpoint consistently fails with a 'ping health check' error.
Here's a summary of what I'm doing:
- Using: get_huggingface_llm_image_uri(backend="huggingface", version="1.2.0")
- Instance type: ml.g5.2xlarge
The endpoint consistently fails with:
The primary container for production variant AllTraffic did not pass the ping health check.
+1, did you get any solution? Thank you very much!
+1
@aamirfaaiz @panalexeu @charliezjw
I don't think TGI supports Phi-4 yet: https://github.com/huggingface/text-generation-inference/issues/3071
I was able to eventually spin it up with LoRA adapter support using "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"
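For anyone who wants to try the same route, here is a rough sketch of how that image URI could be plugged into the SageMaker Python SDK. The helper names, the HF_MODEL_ID environment variable, and the role ARN parameter are my assumptions and not part of the original setup. LoRA-specific options are left out because I can't vouch for the exact settings used.

```python
def lmi_image_uri(region: str) -> str:
    """Fill the region placeholder in the DJL LMI image URI quoted above."""
    return (
        f"763104351884.dkr.ecr.{region}.amazonaws.com/"
        "djl-inference:0.33.0-lmi15.0.0-cu128"
    )


def deploy_with_lmi(role_arn: str, region: str):
    """Hypothetical deployment helper; requires the sagemaker package and AWS credentials."""
    # Imported lazily so lmi_image_uri() works without the SDK installed.
    from sagemaker.model import Model

    model = Model(
        image_uri=lmi_image_uri(region),
        role=role_arn,
        # Assumption: the container picks up the model ID from HF_MODEL_ID.
        env={"HF_MODEL_ID": "microsoft/Phi-4-mini-instruct"},
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",
    )
```

Calling lmi_image_uri("us-east-1") only builds the string. Nothing is created in AWS until deploy_with_lmi() is called.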
The TGI health check failure on SageMaker with Phi-4-mini-instruct is almost always a container startup timeout issue rather than a model problem per se. Phi-4-mini, even at the mini scale, has enough weight to push past SageMaker's default health check grace period, especially on cold start. Check your container startup health check timeout (the container_startup_health_check_timeout argument to deploy() in the SageMaker Python SDK) and the SM_NUM_GPUS environment variable. TGI needs explicit GPU count signaling on SageMaker, and if it tries to shard across a mismatched GPU count it will hang before the /health endpoint ever becomes responsive. Also confirm you're on a TGI image version that actually supports the Phi-4 architecture; anything before roughly 1.4.x won't have the correct modeling code and will silently fail during model load.
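A minimal sketch of those two settings, assuming the SageMaker Python SDK, where the startup grace period is passed to deploy() as container_startup_health_check_timeout. The helper names, the 600-second value, and the role and image parameters are placeholders of mine, not values from the original post.

```python
def build_tgi_env(model_id: str, num_gpus: int) -> dict:
    """Environment for the Hugging Face LLM (TGI) container.

    SageMaker expects every environment value to be a string.
    """
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(num_gpus),
    }


def deploy_tgi(role_arn: str, image_uri: str, startup_timeout_s: int = 600):
    """Hypothetical deployment helper; requires the sagemaker package and AWS credentials."""
    # Imported lazily so build_tgi_env() works without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        role=role_arn,
        image_uri=image_uri,
        # ml.g5.2xlarge has a single GPU, so no sharding is requested.
        env=build_tgi_env("microsoft/Phi-4-mini-instruct", num_gpus=1),
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",
        # Gives the container more time to download and load weights
        # before the ping health check is treated as failed.
        container_startup_health_check_timeout=startup_timeout_s,
    )
```

If the endpoint still fails with a longer timeout, the cause is more likely a load error than a slow start.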
A few concrete things to try: set --max-input-length and --max-total-tokens explicitly (on SageMaker these are passed as the MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS environment variables) rather than relying on defaults, since Phi-4-mini-instruct's context configuration can cause TGI to allocate more KV cache than the instance has VRAM for, which manifests as a health check failure rather than an OOM error. Also check CloudWatch logs for the actual TGI stderr output. SageMaker often surfaces only the health check failure in the console, while the real error (unsupported model arch, CUDA OOM, missing tokenizer files) is buried in the container logs. The microsoft/Phi-4-mini-instruct repo uses a trust_remote_code pattern, so make sure --trust-remote-code is passed to the TGI launcher (TRUST_REMOTE_CODE=true in the container environment).
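Here is a rough sketch of those checks in Python. It assumes the TGI launcher reads MAX_INPUT_LENGTH, MAX_TOTAL_TOKENS, and TRUST_REMOTE_CODE from the container environment, and that endpoint logs live under the /aws/sagemaker/Endpoints/ log group prefix. The helper names and the token limits in the usage note are illustrative, not recommended values.

```python
def tgi_limit_env(max_input_length: int, max_total_tokens: int) -> dict:
    """Explicit token limits plus remote-code opt-in, as string env values."""
    # TGI rejects configurations where the input limit is not below the total.
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return {
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
        "TRUST_REMOTE_CODE": "true",
    }


def endpoint_log_group(endpoint_name: str) -> str:
    """CloudWatch log group that SageMaker uses for an endpoint's containers."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def print_recent_logs(endpoint_name: str, limit: int = 50) -> None:
    """Print recent container log lines; requires boto3 and AWS credentials."""
    # Imported lazily so the helpers above work without boto3 installed.
    import boto3

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=endpoint_log_group(endpoint_name),
        limit=limit,
    )
    for event in response["events"]:
        print(event["message"])
```

For example, tgi_limit_env(4096, 8192) can be merged into the env dict passed to the model, and print_recent_logs("my-endpoint") (a hypothetical endpoint name) shows what the container printed before the health check failed.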
One tangential note: if you're building multi-agent pipelines on top of this deployment, the question of which agent called which endpoint with what identity becomes non-trivial at scale. This is something we think about a lot at AgentGraph: when you have orchestrators routing tasks to model endpoints like this, having a verifiable identity layer for the calling agent matters for debugging and auditing, especially as projects like AgentVerse start pushing toward open agent-to-agent communication. But that's a longer conversation; get the container healthy first.