Instructions to use HuggingFaceTB/SmolLM3-3B-Base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceTB/SmolLM3-3B-Base with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM3-3B-Base")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base") model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base") - Transformers.js
How to use HuggingFaceTB/SmolLM3-3B-Base with Transformers.js:
// npm i @huggingface/transformers import { pipeline } from '@huggingface/transformers'; // Allocate pipeline const pipe = await pipeline('text-generation', 'HuggingFaceTB/SmolLM3-3B-Base'); - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceTB/SmolLM3-3B-Base with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "HuggingFaceTB/SmolLM3-3B-Base" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceTB/SmolLM3-3B-Base", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/HuggingFaceTB/SmolLM3-3B-Base
- SGLang
How to use HuggingFaceTB/SmolLM3-3B-Base with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "HuggingFaceTB/SmolLM3-3B-Base" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceTB/SmolLM3-3B-Base", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "HuggingFaceTB/SmolLM3-3B-Base" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceTB/SmolLM3-3B-Base", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use HuggingFaceTB/SmolLM3-3B-Base with Docker Model Runner:
docker model run hf.co/HuggingFaceTB/SmolLM3-3B-Base
Upload folder using huggingface_hub
#1
by Xenova HF Staff - opened
Example usage:
Transformers.js
import { pipeline, TextStreamer } from "@huggingface/transformers";
// Create a text generation pipeline
const generator = await pipeline(
"text-generation",
"HuggingFaceTB/SmolLM3-3B-Base",
{ dtype: "q4" },
);
// Generate text
const text = "Once upon a time,"
const output = await generator(text, {
max_new_tokens: 512,
do_sample: false,
streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: true}),
});
console.log(output[0].generated_text);
ONNXRuntime
from transformers import AutoConfig, AutoTokenizer
import onnxruntime
import numpy as np
# 1. Load config, processor, and model
path_to_model = "/path/to/model"
model_id = "HuggingFaceTB/SmolLM3-3B-Base"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
decoder_session = onnxruntime.InferenceSession(f"{path_to_model}/model.onnx")
## Set config values
num_key_value_heads = config.num_key_value_heads
head_dim = config.hidden_size // config.num_attention_heads
num_hidden_layers = config.num_hidden_layers
eos_token_id = config.eos_token_id
# 2. Prepare inputs
text = "Once upon a time,"
inputs = tokenizer(text, return_tensors="np")
batch_size = inputs['input_ids'].shape[0]
past_key_values = {
f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
for layer in range(num_hidden_layers)
for kv in ('key', 'value')
}
input_ids = inputs['input_ids']
attention_mask = np.ones_like(input_ids, dtype=np.int64)
position_ids = np.tile(np.arange(1, input_ids.shape[-1] + 1), (batch_size, 1))
# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
logits, *present_key_values = decoder_session.run(None, dict(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
**past_key_values,
))
## Update values for next generation loop
input_ids = logits[:, -1].argmax(-1, keepdims=True)
attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
position_ids = position_ids[:, -1:] + 1
for j, key in enumerate(past_key_values):
past_key_values[key] = present_key_values[j]
generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
if (input_ids == eos_token_id).all():
break
## (Optional) Streaming
print(tokenizer.decode(input_ids[0]), end='', flush=True)
print()
# 4. Output result
print(tokenizer.batch_decode(generated_tokens))
loubnabnl changed pull request status to merged