Transformers documentation
Serve CLI
The transformers serve CLI is a lightweight option for local or self-hosted servers. It avoids the extra runtime and operational overhead of dedicated inference engines like vLLM. Use it for evaluation, experimentation, and moderate-load deployments. Features like continuous batching increase throughput and lower latency.
For large-scale production deployments, use vLLM, SGLang, or TGI with a Transformers model as the backend. Learn more in the Inference backends guide.
The transformers serve command spawns a local server compatible with the OpenAI SDK. The server works with many third-party applications and supports the REST APIs below.
- /v1/chat/completions for text and image requests
- /v1/responses supports the Responses API
- /v1/audio/transcriptions for audio transcriptions
- /v1/models lists available models for third-party integrations
- /load_model streams model loading progress via SSE
Install the serving dependencies.
pip install transformers[serving]
Run transformers serve to launch a server. The default server address is http://localhost:8000.
transformers serve
v1/chat/completions
The v1/chat/completions API is based on the Chat Completions API. It supports text and image-based requests for LLMs and VLMs. Use it with curl, the AsyncInferenceClient, or the OpenAI client.
Text-based completions
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'

The command returns the following response.
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
(...)

Text and image-based completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"stream": true,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"max_tokens": 300
}'
The command returns the following response.
data: {"id":"req_0","choices":[{"delta":{"role":"assistant"},"index":0}],"created":1753366665,"model":"Qwen/Qwen2.5-VL-7B-Instruct@main","object":"chat.completion.chunk","system_fingerprint":""}
data: {"id":"req_0","choices":[{"delta":{"content":"The "},"index":0}],"created":1753366701,"model":"Qwen/Qwen2.5-VL-7B-Instruct@main","object":"chat.completion.chunk","system_fingerprint":""}
data: {"id":"req_0","choices":[{"delta":{"content":"image "},"index":0}],"created":1753366701,"model":"Qwen/Qwen2.5-VL-7B-Instruct@main","object":"chat.completion.chunk","system_fingerprint":""}

v1/responses
The v1/responses API is still experimental and there may be bugs. Please open an issue if you encounter any errors.
The Responses API is OpenAI’s latest API endpoint for generation. It supports stateful interactions and integrates built-in tools to extend a model’s capabilities. OpenAI recommends using the Responses API over the Chat Completions API for new projects.
The v1/responses API supports text-based requests for LLMs through the curl command and OpenAI client.
curl http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"stream": true,
"input": "Tell me a three sentence bedtime story about a unicorn."
}'

The command returns the following response.
data: {"response":{"id":"resp_req_0","created_at":1754059817.783648,"model":"Qwen/Qwen2.5-0.5B-Instruct@main","object":"response","output":[],"parallel_tool_calls":false,"tool_choice":"auto","tools":[],"status":"queued","text":{"format":{"type":"text"}}},"sequence_number":0,"type":"response.created"}
data: {"response":{"id":"resp_req_0","created_at":1754059817.783648,"model":"Qwen/Qwen2.5-0.5B-Instruct@main","object":"response","output":[],"parallel_tool_calls":false,"tool_choice":"auto","tools":[],"status":"in_progress","text":{"format":{"type":"text"}}},"sequence_number":1,"type":"response.in_progress"}
data: {"item":{"id":"msg_req_0","content":[],"role":"assistant","status":"in_progress","type":"message"},"output_index":0,"sequence_number":2,"type":"response.output_item.added"}
data: {"content_index":0,"item_id":"msg_req_0","output_index":0,"part":{"annotations":[],"text":"","type":"output_text"},"sequence_number":3,"type":"response.content_part.added"}
data: {"content_index":0,"delta":"","item_id":"msg_req_0","output_index":0,"sequence_number":4,"type":"response.output_text.delta"}
data: {"content_index":0,"delta":"Once ","item_id":"msg_req_0","output_index":0,"sequence_number":5,"type":"response.output_text.delta"}
data: {"content_index":0,"delta":"upon ","item_id":"msg_req_0","output_index":0,"sequence_number":6,"type":"response.output_text.delta"}
data: {"content_index":0,"delta":"a ","item_id":"msg_req_0","output_index":0,"sequence_number":7,"type":"response.output_text.delta"}

v1/audio/transcriptions
The v1/audio/transcriptions endpoint transcribes audio using speech-to-text models. It follows the Audio transcription API format.
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/audio.wav" \
  -F "model=openai/whisper-large-v3"
The command returns the following response.
{
  "text": "Transcribed text from the audio file"
}

v1/models
The v1/models endpoint scans your local Hugging Face cache and returns a list of downloaded models in the OpenAI-compatible format. Third-party tools use this endpoint to discover available models.
Use the command below to download a model before running transformers serve.
transformers download Qwen/Qwen2.5-0.5B-Instruct
The model is now discoverable by the /v1/models endpoint.
curl http://localhost:8000/v1/models
This command returns a JSON object containing the list of models.
Loading models
The /load_model endpoint pre-loads a model and streams progress via Server-Sent Events (SSE). The transformers chat CLI uses it automatically so users see download and loading progress instead of a hanging prompt. It’s also useful for warming up a model before sending inference requests.
Request
curl -N -X POST http://localhost:8000/load_model \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2.5-0.5B-Instruct"}'

The model field is a Hugging Face model identifier, optionally with an @revision suffix (e.g., meta-llama/Llama-3.2-1B-Instruct@main). If no revision is specified, main is assumed.
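A client can mirror the server's defaulting rule with a small helper like this sketch (the split_model_field name is illustrative):

```python
def split_model_field(model: str) -> tuple[str, str]:
    """Split a model field into (repo_id, revision), defaulting the
    revision to "main" when no @revision suffix is present."""
    repo_id, _, revision = model.partition("@")
    return repo_id, revision or "main"

print(split_model_field("meta-llama/Llama-3.2-1B-Instruct@main"))
# ('meta-llama/Llama-3.2-1B-Instruct', 'main')
print(split_model_field("Qwen/Qwen2.5-0.5B-Instruct"))
# ('Qwen/Qwen2.5-0.5B-Instruct', 'main')
```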
Response
The response is an SSE stream (Content-Type: text/event-stream). Each frame is a JSON object on a data: line:
data: {"status": "loading", "model": "Qwen/Qwen2.5-0.5B-Instruct@main", "stage": "processor"}

Every event contains at minimum a status and model field. Additional fields depend on the status:
| Field | Present when | Description |
|---|---|---|
| status | Always | loading, ready, or error |
| model | Always | Canonical model_id@revision |
| stage | status == "loading" | One of processor, config, download, weights (see stages below) |
| progress | download and weights stages | Object with current and total (integer or null) |
| cached | status == "ready" | true if the model was already in memory |
| message | status == "error" | Error description |
Stages
Loading progresses through these stages in order. Some may be skipped (e.g., download is skipped when files are already cached locally).
| Stage | Has progress? | Description |
|---|---|---|
| processor | No | Loading the tokenizer/processor |
| config | No | Loading model configuration |
| download | Yes (bytes) | Downloading model files |
| weights | Yes (items) | Loading weight tensors into memory |
The stream ends with exactly one terminal event: ready (success) or error (failure).
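A small client sketch that consumes this stream with only the standard library (the parse_sse_event and load_model helper names are illustrative, and the server must be running for load_model to work):

```python
import json
import urllib.request

def parse_sse_event(line: str):
    """Return the JSON payload of an SSE `data:` line, or None for
    any other line (comments, blank keep-alive lines)."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())

def load_model(model: str, base_url: str = "http://localhost:8000"):
    """POST to /load_model and yield progress events until the single
    terminal event (status "ready" or "error")."""
    request = urllib.request.Request(
        f"{base_url}/load_model",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            event = parse_sse_event(raw_line.decode("utf-8"))
            if event is None:
                continue
            yield event
            if event["status"] in ("ready", "error"):
                break

# With the server running:
# for event in load_model("Qwen/Qwen2.5-0.5B-Instruct"):
#     print(event)
```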
Examples
Fresh load with download:
data: {"status": "loading", "model": "org/model@main", "stage": "processor"}
data: {"status": "loading", "model": "org/model@main", "stage": "config"}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 0, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 134600000, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 269100000, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 1, "total": 272}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 272, "total": 272}}
data: {"status": "ready", "model": "org/model@main", "cached": false}

Files already cached locally (no download stage):
data: {"status": "loading", "model": "org/model@main", "stage": "processor"}
data: {"status": "loading", "model": "org/model@main", "stage": "config"}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 1, "total": 272}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 272, "total": 272}}
data: {"status": "ready", "model": "org/model@main", "cached": false}

Model already in memory:
data: {"status": "ready", "model": "org/model@main", "cached": true}

Tool calling
The transformers serve server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.
Tool calling is currently limited to the Qwen model family.
Define tools as a list of function specifications following the OpenAI format.
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="<KEY>")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city name, e.g. San Francisco"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "temperature unit"
}
},
"required": ["location"]
}
}
}
]

Pass a dictionary of GenerationConfig parameters to the extra_body argument of create to customize model generation.
generation_config = {
"max_new_tokens": 512,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 50,
"do_sample": True,
"repetition_penalty": 1.1,
"no_repeat_ngram_size": 3,
}
response = client.responses.create(
model="Qwen/Qwen2.5-7B-Instruct",
instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.",
input="What's the weather like in San Francisco?",
tools=tools,
stream=True,
extra_body={"generation_config": json.dumps(generation_config)}
)
for event in response:
print(event)

Port forwarding
The transformers serve server supports port forwarding, which lets you serve models from a remote machine. Make sure you have SSH access from your device to the server, then run the following command on your device to set up port forwarding.
ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
Reproducibility
Add the --force-model <repo_id> argument to pin the server to a single model and ignore per-request model hints. This produces stable, repeatable runs.
transformers serve \
--force-model Qwen/Qwen2.5-0.5B-Instruct \
--continuous-batching \
--dtype "bfloat16"