Instructions to use Tiiny/SmallThinker-3B-Preview with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Tiiny/SmallThinker-3B-Preview with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Tiiny/SmallThinker-3B-Preview")
model = AutoModelForCausalLM.from_pretrained("Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Tiiny/SmallThinker-3B-Preview with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Tiiny/SmallThinker-3B-Preview"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Tiiny/SmallThinker-3B-Preview",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
- SGLang
How to use Tiiny/SmallThinker-3B-Preview with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Tiiny/SmallThinker-3B-Preview" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Tiiny/SmallThinker-3B-Preview",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Tiiny/SmallThinker-3B-Preview" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Tiiny/SmallThinker-3B-Preview",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

- Docker Model Runner
How to use Tiiny/SmallThinker-3B-Preview with Docker Model Runner:
```shell
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
Prompt/token adjustment to stop "overthinking" in unnecessary cases
I was using the model to great effect inside GPT4All with its new Analyze feature. I was hoping you might be able to shed some light on a method of keeping it from being so verbose in its responses when that isn't necessary. Most models after Llama 3.2/Qwen 2.5, for example, are great at questions like this one:
"If Philip walks into a bar and orders a round of drinks for all, there being 12 other customers in the bar and drinks being 5 smeckles a piece, and then later on in the night, during happy hour, after a woman with a dog comes into the bar joining the original customers Phil buys another round for all at happy hour prices, half off, how much would Phil spend with a healthy tip? "
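For reference, here is one plausible reading of the arithmetic in that question, sketched in Python. The assumptions are mine, not stated in the puzzle: Philip drinks in both rounds, the dog does not, the woman joins for the second round only, and a "healthy tip" means 20%.

```python
# Round 1: 12 other customers plus Philip = 13 drinks at 5 smeckles each
round_one = 13 * 5            # 65 smeckles

# Round 2: the woman joins, so 14 drinkers, at half-off happy-hour price
round_two = 14 * 2.5          # 35 smeckles

subtotal = round_one + round_two   # 100 smeckles before tip
tip = subtotal * 0.20              # assuming "healthy tip" = 20%
total = subtotal + tip

print(total)  # 120.0 smeckles under these assumptions
```

Under a different reading (Philip not drinking, or a different tip rate) the number changes, which is part of why a model can spiral into long deliberation here.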
Even the 1.5B model usually gets this right, but your model tends to go through too many steps to maintain coherence. Even though I can use the JavaScript_Interpreter and Code_execution tools to compute things like the factorial of 101, or use the haversine function to measure the distance between any two points in the world, the model seems to lack a solid long-form single response.
Thank you for your suggestion. In fact, we have also noticed that overthinking is a relatively prominent issue. We are currently trying to alleviate it, or to differentiate the level of thinking based on the difficulty of the question. One approach we are considering is incorporating an assessment of the question's difficulty into the response, then customizing the complexity of the response based on that difficulty level.
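The idea described above might be sketched like this. Everything here is hypothetical, assumed for illustration: the heuristic, the thresholds, and the function names do not come from the model's actual training or serving setup.

```python
def estimate_difficulty(question: str) -> str:
    """Hypothetical heuristic: long questions containing numbers are 'hard'."""
    has_numbers = any(ch.isdigit() for ch in question)
    if has_numbers and len(question.split()) > 40:
        return "hard"
    return "easy"

def generation_budget(question: str) -> int:
    """Map the difficulty estimate to a max_new_tokens cap for generation."""
    return 2048 if estimate_difficulty(question) == "hard" else 256

# A simple factual question gets a small reasoning budget
print(generation_budget("What is the capital of France?"))  # 256
```

A real implementation would more likely learn the difficulty signal during training rather than use a surface heuristic, but the routing structure would be similar.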
Well, for smaller models I've found that a simple two-step reasoning method works well with Qwen models, if that helps at all. All in all, though, I've been working pretty hard to get the tokenizer to do what you have it accomplishing, so I really can't complain. Thanks for the awesome model.
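The "two-step reasoning method" isn't spelled out in the thread; one common interpretation, sketched here with entirely hypothetical prompt wording, is to ask the model for a brief plan first and then a concise final answer in a single turn:

```python
def two_step_messages(question: str) -> list:
    """Build a chat requesting a short plan followed by a concise answer.
    The system-prompt wording is illustrative, not a documented prompt."""
    return [
        {"role": "system",
         "content": ("First list the 2-3 steps needed to answer, "
                     "then give the final answer in one short paragraph.")},
        {"role": "user", "content": question},
    ]

msgs = two_step_messages("What is the capital of France?")
print(msgs[0]["role"])  # system
```

A message list like this can be passed directly to the Transformers pipeline shown earlier on this page.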