Instructions to use Tiiny/SmallThinker-3B-Preview with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Tiiny/SmallThinker-3B-Preview with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Tiiny/SmallThinker-3B-Preview")
model = AutoModelForCausalLM.from_pretrained("Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Tiiny/SmallThinker-3B-Preview with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Tiiny/SmallThinker-3B-Preview"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Tiiny/SmallThinker-3B-Preview",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
- SGLang
How to use Tiiny/SmallThinker-3B-Preview with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Tiiny/SmallThinker-3B-Preview" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Tiiny/SmallThinker-3B-Preview",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Tiiny/SmallThinker-3B-Preview" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Tiiny/SmallThinker-3B-Preview",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

- Docker Model Runner
How to use Tiiny/SmallThinker-3B-Preview with Docker Model Runner:
```shell
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
Prompt/token adjustment to stop "overthinking" in unnecessary cases
I was using the model to great effect inside GPT4All with its new Analyze feature. I was hoping you might be able to shed some light on a method of keeping it from being so verbose in its responses when that isn't necessary. Most models after Llama 3.2/Qwen 2.5, for example, are great at questions like this one:
"If Philip walks into a bar and orders a round of drinks for all, there being 12 other customers in the bar and drinks being 5 smeckles a piece, and then later on in the night, during happy hour, after a woman with a dog comes into the bar joining the original customers Phil buys another round for all at happy hour prices, half off, how much would Phil spend with a healthy tip? "
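For reference, here is one plausible reading of the arithmetic in that question, sketched in Python. The assumptions are mine, not stated in the puzzle: Philip drinks in both rounds, the dog does not, the woman joins for the second round only, and a "healthy tip" means 20%.

```python
# Round 1: 12 other customers plus Philip = 13 drinks at 5 smeckles each
round_one = 13 * 5            # 65 smeckles

# Round 2: the woman joins, so 14 drinkers, at half-off happy-hour price
round_two = 14 * 2.5          # 35 smeckles

subtotal = round_one + round_two   # 100 smeckles before tip
tip = subtotal * 0.20              # assuming "healthy tip" = 20%
total = subtotal + tip

print(total)  # 120.0 smeckles under these assumptions
```

Under a different reading (Philip not drinking, or a different tip rate) the number changes, which is part of why a model can spiral into long deliberation here.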
Even the 1.5B model usually gets this right, but your model tends to go through too many steps to maintain coherence. Even though I can use the JavaScript_Interpreter and Code_execution tools to compute things like the factorial of 101, or use the haversine function to measure the distance between any two points in the world, the model seems to lack a solid long-form single response.
Thank you for your suggestion. In fact, we have also noticed that overthinking is a relatively prominent issue. We are currently trying to alleviate it, or to differentiate the level of thinking based on the difficulty of the question. One approach we are considering is incorporating an assessment of the question's difficulty into the response, then customizing the complexity of the response based on that difficulty level.
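The idea described above might be sketched like this. Everything here is hypothetical, assumed for illustration: the heuristic, the thresholds, and the function names do not come from the model's actual training or serving setup.

```python
def estimate_difficulty(question: str) -> str:
    """Hypothetical heuristic: long questions containing numbers are 'hard'."""
    has_numbers = any(ch.isdigit() for ch in question)
    if has_numbers and len(question.split()) > 40:
        return "hard"
    return "easy"

def generation_budget(question: str) -> int:
    """Map the difficulty estimate to a max_new_tokens cap for generation."""
    return 2048 if estimate_difficulty(question) == "hard" else 256

# A simple factual question gets a small reasoning budget
print(generation_budget("What is the capital of France?"))  # 256
```

A real implementation would more likely learn the difficulty signal during training rather than use a surface heuristic, but the routing structure would be similar.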
Well, for smaller models I've found that a simple two-step reasoning method works well with Qwen models, if that helps at all. All in all, though, I've been working pretty hard to get the tokenizer to do what you have it accomplishing, so I really can't complain. Thanks for the awesome model.
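The "two-step reasoning method" isn't spelled out in the thread; one common interpretation, sketched here with entirely hypothetical prompt wording, is to ask the model for a brief plan first and then a concise final answer in a single turn:

```python
def two_step_messages(question: str) -> list:
    """Build a chat requesting a short plan followed by a concise answer.
    The system-prompt wording is illustrative, not a documented prompt."""
    return [
        {"role": "system",
         "content": ("First list the 2-3 steps needed to answer, "
                     "then give the final answer in one short paragraph.")},
        {"role": "user", "content": question},
    ]

msgs = two_step_messages("What is the capital of France?")
print(msgs[0]["role"])  # system
```

A message list like this can be passed directly to the Transformers pipeline shown earlier on this page.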