Instructions for using Tiiny/SmallThinker-3B-Preview with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use Tiiny/SmallThinker-3B-Preview with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Tiiny/SmallThinker-3B-Preview")
model = AutoModelForCausalLM.from_pretrained("Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
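For GPU inference, the same model can be loaded with automatic device placement and dtype selection. This is a minimal sketch using standard `from_pretrained` arguments; it assumes the `accelerate` package is installed, which `device_map="auto"` requires.

```python
# Load the model onto available GPU(s), picking the checkpoint's native precision.
# Assumes `accelerate` is installed (required for device_map="auto").
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Tiiny/SmallThinker-3B-Preview",
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    device_map="auto",   # spread weights across available devices
)
```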
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Tiiny/SmallThinker-3B-Preview with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server (OpenAI-compatible API, port 8000 by default):
vllm serve "Tiiny/SmallThinker-3B-Preview"

# Call the server using curl:
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Tiiny/SmallThinker-3B-Preview",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```
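The same endpoint can be called from Python with the `openai` client. This is a minimal sketch assuming the server above is running locally on port 8000; the API key is a placeholder, since vLLM does not check it unless you start the server with one.

```python
# Minimal Python client for the local vLLM server (OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the vLLM server started above
    api_key="EMPTY",                      # placeholder; not verified by default
)
response = client.chat.completions.create(
    model="Tiiny/SmallThinker-3B-Preview",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```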
Use Docker

```bash
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
- SGLang
How to use Tiiny/SmallThinker-3B-Preview with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server (OpenAI-compatible API):
python3 -m sglang.launch_server \
    --model-path "Tiiny/SmallThinker-3B-Preview" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl:
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Tiiny/SmallThinker-3B-Preview",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```
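SGLang can also run the model in-process without an HTTP server, via its offline engine. This is a sketch based on SGLang's offline `Engine` API; the exact interface has changed between releases, so verify it against the version you have installed.

```python
# Offline inference with SGLang's engine (no HTTP server).
# Sketch based on SGLang's offline Engine API; check your installed version.
import sglang as sgl

llm = sgl.Engine(model_path="Tiiny/SmallThinker-3B-Preview")
outputs = llm.generate(
    ["What is the capital of France?"],
    {"temperature": 0.7, "max_new_tokens": 64},
)
print(outputs[0]["text"])
```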
Use Docker images

```bash
# Run the SGLang server in Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Tiiny/SmallThinker-3B-Preview" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Tiiny/SmallThinker-3B-Preview",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
```

- Docker Model Runner
How to use Tiiny/SmallThinker-3B-Preview with Docker Model Runner:
```bash
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
How to Pair with Larger Models
This model is very popular. Congratulations, and thank you for your work. I have a question:
How should this model work with larger models? Could you provide a specific textual description of the process?
For example, should SmallThinker-3B be instructed to write out its thought processes (without writing down the answer) first, then let the larger model reference these thoughts to generate an answer?
Alternatively, could we not give any additional instructions and instead have SmallThinker-3B directly respond to the questions, followed by the larger model referencing those responses before generating the final answer?
I would appreciate it if you could outline two possible workflows or more specific prompts.
This is a very nice question. I believe one of the most straightforward approaches is speculative decoding: use the smaller model as a "draft model" for the larger one, letting SmallThinker draft the "easy" tokens while the larger model verifies them. This directly improves inference speed. You can try this method with llama.cpp.
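As a sketch of what that looks like with llama.cpp's speculative-decoding example (the GGUF file names below are placeholders, and flag names vary across llama.cpp versions, so check `--help` for yours):

```bash
# Speculative decoding with llama.cpp: the large (target) model verifies
# tokens drafted by SmallThinker. GGUF file names are placeholders; flag
# names vary across llama.cpp versions (older builds use --draft).
#   -m  : target (large) model
#   -md : draft (small) model
./llama-speculative \
    -m models/larger-model-q4_k_m.gguf \
    -md models/smallthinker-3b-preview-q4_k_m.gguf \
    --draft-max 16 \
    -p "What is the capital of France?"
```

Note that the draft and target models must have compatible tokenizers/vocabularies, which is what the vocabulary-size error reported below is about.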
Additionally, if we’re exploring how smaller and larger models can collaborate effectively, one possible method might be to package the smaller model’s response along with the original question and send them together to the larger model. This approach could allow the larger model to leverage the smaller model’s preliminary reasoning while refining and expanding upon it for the final output.
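A minimal sketch of that packaging workflow, assuming both models sit behind OpenAI-compatible endpoints; the URLs, the large model's name, and the prompt wording are illustrative assumptions, not a fixed recipe.

```python
# Two-stage collaboration: SmallThinker drafts a response, then a larger
# model refines it. Endpoints, model names, and prompt wording are
# illustrative assumptions.
from openai import OpenAI

small = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
large = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

question = "What is the capital of France?"

# Stage 1: SmallThinker produces a preliminary answer / reasoning draft.
draft = small.chat.completions.create(
    model="Tiiny/SmallThinker-3B-Preview",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# Stage 2: the larger model references the draft to write the final answer.
final = large.chat.completions.create(
    model="larger-model",  # placeholder for whatever large model you serve
    messages=[{
        "role": "user",
        "content": (
            f"Question: {question}\n\n"
            f"A smaller model produced this draft:\n{draft}\n\n"
            "Referencing the draft where it helps, write the final answer."
        ),
    }],
).choices[0].message.content

print(final)
```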
So far, my focus has primarily been on using speculative decoding to accelerate the larger model’s inference process. I haven’t yet experimented with other methods of collaboration. Thank you for raising such an interesting question—it’s definitely worth exploring further.
If you use vLLM with the QwQ models (e.g., as the speculative-decoding target), an error message is displayed indicating that the vocabulary sizes are inconsistent.
Man, I would need more time to give guidance and explain the reality, but well, little by little. And thanks.
OK. If you speak English or Chinese, no problem; but if you use a different language, big problem: the model mixes English and Chinese words, like Gibraltar speech. That is not real thinking, not a chain of thought; it is a succession of thoughts, which is different. It is not really intelligent: on a real problem it falls into a loop because of this "succession of thoughts."