Issue Deploying Phi-4-mini-instruct on SageMaker (TGI): Container Health Check Fails
Hi there,
I’ve been trying to deploy 'microsoft/Phi-4-mini-instruct' on Amazon SageMaker using the Hugging Face LLM Inference container (TGI backend), but the endpoint consistently fails with a 'ping health check' error.
Here’s a summary of what I’m doing:
- Image URI: `get_huggingface_llm_image_uri(backend="huggingface", version="1.2.0")`
- Instance type: `ml.g5.2xlarge`
The endpoint consistently fails with:
The primary container for production variant AllTraffic did not pass the ping health check.
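For context, the setup boils down to the following sketch. The values mirror what's described above; the actual SDK calls are left as comments since they require AWS credentials and an execution role:

```python
# Sketch of the failing setup described in this post. Only the plain
# config values are live code; the real deployment calls (commented out)
# need the sagemaker SDK and AWS credentials.
model_env = {"HF_MODEL_ID": "microsoft/Phi-4-mini-instruct"}
instance_type = "ml.g5.2xlarge"

# Actual deployment path:
#   from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
#   image_uri = get_huggingface_llm_image_uri(backend="huggingface", version="1.2.0")
#   model = HuggingFaceModel(image_uri=image_uri, env=model_env, role=role)
#   model.deploy(initial_instance_count=1, instance_type=instance_type)
```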
+1, did you get any solution? Thank you very much!
+1
@aamirfaaiz @panalexeu @charliezjw
I don't think TGI supports Phi-4 yet: https://github.com/huggingface/text-generation-inference/issues/3071
I was able to eventually spin it up with LoRA adapter support using "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"
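In case it helps anyone, here's roughly what that DJL LMI deployment looks like. The region and env values are assumptions (adapt them to your account); the deploy call itself is commented out since it needs credentials:

```python
# Hedged sketch of deploying via the DJL LMI container mentioned above.
# Region and env settings are assumptions; tune for your setup.
region = "us-east-1"  # assumption: substitute your region
image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/"
    "djl-inference:0.33.0-lmi15.0.0-cu128"
)
env = {"HF_MODEL_ID": "microsoft/Phi-4-mini-instruct"}

# Actual deployment (requires the sagemaker SDK and an execution role):
#   model = sagemaker.Model(image_uri=image_uri, env=env, role=role)
#   model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```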
The TGI health check failure on SageMaker with Phi-4-mini-instruct is almost always a container startup problem rather than a model problem per se. Even at the mini scale, the weights take long enough to download and load that a cold start can outlast SageMaker's default health check grace period. Check your HEALTH_CHECK_TIMEOUT setting and the SM_NUM_GPUS environment variable — TGI needs the GPU count signaled explicitly on SageMaker, and if it tries to shard across a mismatched count it will hang before the /health endpoint ever becomes responsive. Also confirm you're on a TGI image version that actually supports the Phi-4 architecture; anything before roughly 1.4.x won't have the correct modeling code and will silently fail during model load.
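The timeout and GPU-count points above can be sketched like this. The 600-second value is an assumption (tune it to your cold-start time), and `ml.g5.2xlarge` has a single A10G GPU, hence `SM_NUM_GPUS=1`:

```python
# Sketch of the timeout / GPU-count fixes. Values are assumptions;
# adjust for your instance type and model load time.
env = {
    "HF_MODEL_ID": "microsoft/Phi-4-mini-instruct",
    "SM_NUM_GPUS": "1",  # explicit GPU count so TGI doesn't try to shard
}
deploy_kwargs = {
    "initial_instance_count": 1,
    "instance_type": "ml.g5.2xlarge",
    # Give the container longer to load weights before the ping check fails:
    "container_startup_health_check_timeout": 600,
}
# Actual call (needs the sagemaker SDK): model.deploy(**deploy_kwargs)
```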
A few concrete things to try: set --max-input-length and --max-total-tokens explicitly in your container environment rather than relying on defaults, since Phi-4-mini-instruct's long context window can lead TGI to allocate more KV cache than the instance has VRAM for, which surfaces as a health check failure rather than an OOM error. Also check CloudWatch for the actual TGI stderr output — SageMaker often shows only the health check failure in the console, while the real error (unsupported model architecture, CUDA OOM, missing tokenizer files) is buried in the container logs. And if the model repo ships custom modeling code, make sure --trust-remote-code is passed as a TGI launch argument.
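Putting the token-limit advice into env-var form (the SageMaker TGI container maps these onto the launcher flags), here's a sketch. The specific numbers are assumptions sized conservatively for a 24 GB A10G, and the trust-remote-code spelling is my assumption for the env-var equivalent of the launcher flag:

```python
# Sketch of explicit token-limit settings for the TGI container.
# Numbers are assumptions for a 24 GB GPU; tune to your workload.
env = {
    "HF_MODEL_ID": "microsoft/Phi-4-mini-instruct",
    "MAX_INPUT_LENGTH": "4095",      # must be strictly less than MAX_TOTAL_TOKENS
    "MAX_TOTAL_TOKENS": "4096",      # caps KV-cache allocation below VRAM limits
    "TRUST_REMOTE_CODE": "true",     # assumption: env form of --trust-remote-code
}
# Pass env= to HuggingFaceModel(...) when constructing the model object.
```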
One tangential note: if you're building multi-agent pipelines on top of this deployment, the question of which agent called which endpoint with what identity becomes non-trivial at scale. This is something we think about a lot at AgentGraph — when you have orchestrators routing tasks to model endpoints like this, having a verifiable identity layer for the calling agent matters for debugging and auditing, especially as projects like AgentVerse start pushing toward open agent-to-agent communication. But that's a longer conversation — get the container healthy first.