Table of Contents
This repository is a mirror of the prequantized branch of https://huggingface.co/tiiuae/Falcon-E-3B-Base.
TL;DR
Model Details
Model Description
- Developed by: https://www.tii.ae
- Model type: Causal decoder-only / Base version
- Architecture: Pure-transformer - 1.58bit version
- Language(s) (NLP): English
- License: Falcon-LLM License
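The "1.58bit" label is usually explained by ternary weights: a weight restricted to three values carries log2(3) ≈ 1.58 bits of information. The sketch below works through that arithmetic. Reading the label this way follows the BitNet b1.58 convention and is an assumption here, and `ideal_weight_megabytes` is a hypothetical helper that gives a theoretical lower bound only, not the footprint of any actual checkpoint.

```python
import math

# Assumption: "1.58bit" means each weight takes one of three values (-1, 0, +1).
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ideal_weight_megabytes(num_params: float) -> float:
    """Lower bound for storing ternary weights, in MB (10**6 bytes).

    Real checkpoints are larger: packing formats round up per weight, and
    some tensors may be kept in higher precision.
    """
    return num_params * BITS_PER_TERNARY_WEIGHT / 8 / 1e6


print(f"{BITS_PER_TERNARY_WEIGHT:.2f} bits per ternary weight")
print(f"{ideal_weight_megabytes(3e9):.0f} MB lower bound for 3e9 ternary weights")
```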
Training details
For more details about the training protocol of this model, please refer to the Falcon-E technical blogpost.
Usage
Currently, you can use this model with either the Hugging Face transformers library or the BitNet library. There are multiple ways to interact with the model depending on your target usage. Each model in the Falcon-E series comes in three variants: the BitNet model, the prequantized checkpoint for fine-tuning, and the bfloat16 version of the BitNet model.
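As a minimal sketch, the three variants can be mapped to the `revision` argument of `from_pretrained`. The revision names `bfloat16` and `prequantized` come from the snippets in this card, and the BitNet checkpoint is loaded there without a revision; the variant keys and the helper `from_pretrained_kwargs` are hypothetical names introduced for illustration.

```python
# Revision names as used in this card's snippets. The BitNet checkpoint is
# loaded without an explicit revision, so it maps to None here.
VARIANT_REVISIONS = {
    "bitnet": None,
    "bfloat16": "bfloat16",
    "prequantized": "prequantized",
}


def from_pretrained_kwargs(variant: str) -> dict:
    """Return the extra keyword arguments to pass to `from_pretrained`."""
    if variant not in VARIANT_REVISIONS:
        raise ValueError(f"unknown variant: {variant!r}")
    revision = VARIANT_REVISIONS[variant]
    return {} if revision is None else {"revision": revision}


print(from_pretrained_kwargs("prequantized"))
```

The returned dictionary is meant to be unpacked into the call, for example `AutoModelForCausalLM.from_pretrained(model_id, **from_pretrained_kwargs("bfloat16"))`.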
Inference
🤗 transformers
To perform inference on the BitNet checkpoint, run:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Perform text generation
If you would rather use the classic bfloat16 version, run:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"
revision = "bfloat16"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision=revision,
).to("cuda")

# Perform text generation
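Both snippets above stop at `# Perform text generation`. The following is one possible way to fill in that step, not necessarily the authors' intended code: it loads the tokenizer alongside the model and generates a plain-text continuation, which suits a base (non-chat) model. The helper names `load_model_and_tokenizer` and `generate_text`, the prompt, and the `max_new_tokens` default are assumptions for illustration.

```python
def load_model_and_tokenizer(model_id="tiiuae/Falcon-E-1B-Base", revision=None, device="cuda"):
    """Load model and tokenizer; pass revision="bfloat16" for the bfloat16 variant."""
    # Imported lazily so the helpers can be defined without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    extra = {} if revision is None else {"revision": revision}
    tokenizer = AutoTokenizer.from_pretrained(model_id, **extra)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        **extra,
    ).to(device)
    return model, tokenizer


def generate_text(model, tokenizer, prompt, max_new_tokens=40):
    """Tokenize the prompt, generate a continuation, and decode it to text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # outputs[0] holds the prompt tokens followed by the newly generated ones.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Typical usage would be `model, tokenizer = load_model_and_tokenizer()` followed by `print(generate_text(model, tokenizer, "The capital of France is"))`.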
BitNet
git clone https://github.com/microsoft/BitNet && cd BitNet
pip install -r requirements.txt
python setup_env.py --hf-repo tiiuae/Falcon-E-1B-Base -q i2_s
python run_inference.py -m models/Falcon-E-1B-Base/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
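The two Python commands above share a naming pattern: the GGUF file passed to `run_inference.py` sits under `models/<repo name>/ggml-model-<quant type>.gguf`. The sketch below derives both argument lists from a repository id. The path layout is inferred only from the commands shown here, `gguf_path` and `bitnet_commands` are hypothetical helpers, and whether `setup_env.py` accepts other repositories or quantization types is not confirmed by this card.

```python
from pathlib import PurePosixPath


def gguf_path(hf_repo: str, quant_type: str = "i2_s") -> str:
    """Build the GGUF path, following the layout seen in the commands above."""
    model_name = hf_repo.split("/")[-1]
    return str(PurePosixPath("models") / model_name / f"ggml-model-{quant_type}.gguf")


def bitnet_commands(hf_repo: str, quant_type: str = "i2_s",
                    prompt: str = "You are a helpful assistant") -> list:
    """Return the setup and inference commands as argument lists."""
    return [
        ["python", "setup_env.py", "--hf-repo", hf_repo, "-q", quant_type],
        ["python", "run_inference.py", "-m", gguf_path(hf_repo, quant_type),
         "-p", prompt, "-cnv"],
    ]


for command in bitnet_commands("tiiuae/Falcon-E-1B-Base"):
    print(" ".join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` from inside the cloned BitNet directory.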
Apple mlx-lm
pip install -U mlx-lm
Then:
mlx_lm.generate --model tiiuae/Falcon-E-3B-Instruct --prompt "Implement bubble sort" --max-tokens 100 --temp 0.1
Fine-tuning
For fine-tuning the model, you should load the prequantized revision of the model and use the onebitllms Python package:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
+ from onebitllms import replace_linear_with_bitnet_linear, quantize_to_1bit

model_id = "tiiuae/Falcon-E-1B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
+   revision="prequantized"
)
+ model = replace_linear_with_bitnet_linear(model)

trainer = SFTTrainer(
    model,
    ...
)

trainer.train()
+ quantize_to_1bit(output_directory)
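To give an intuition for what quantizing weights to 1.58 bits involves, here is a pure-Python sketch of absmean ternary quantization as described for BitNet b1.58: weights are divided by their mean absolute value, rounded, and clipped to {-1, 0, +1}. This illustrates the general technique only. Whether `quantize_to_1bit` in onebitllms follows exactly this scheme is an assumption, and `ternary_quantize` is a hypothetical function, not part of that package.

```python
def ternary_quantize(weights, eps=1e-5):
    """Absmean ternary quantization (BitNet b1.58 style) of a flat weight list.

    Returns the ternary values and the scale needed to approximately
    reconstruct the originals as `value * scale`.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


values, scale = ternary_quantize([0.9, -0.05, -1.2, 0.3])
print(values, round(scale, 4))
```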
Evaluation
The following tables report benchmark results from our internal evaluation pipeline.
Note: evaluation results are normalized scores on the tasks of the former Hugging Face Leaderboard v2.
For 1B scale models and below
| Model | Nb Params | Mem Footprint | IFEVAL | Math-Hard | GPQA | MuSR | BBH | MMLU-Pro | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-0.5B | 0.5B | 1GB | 16.27 | 3.93 | 0.0 | 2.08 | 6.95 | 10.06 | 6.55 |
| SmolLM2-360M | 0.36B | 720MB | 21.15 | 1.21 | 0.0 | 7.73 | 5.54 | 1.88 | 6.25 |
| Qwen-2.5-1.5B | 1.5B | 3.1GB | 26.74 | 9.14 | 16.66 | 5.27 | 20.61 | 4.7 | 13.85 |
| Llama-3.2-1B | 1.24B | 2.47GB | 14.78 | 1.21 | 4.37 | 2.56 | 2.26 | 0 | 4.2 |
| SmolLM2-1.7B | 1.7B | 3.4GB | 24.4 | 2.64 | 9.3 | 4.6 | 12.64 | 3.91 | 9.58 |
| Falcon-3-1B-Base | 1.5B | 3GB | 24.28 | 3.32 | 11.34 | 9.71 | 6.76 | 3.91 | 9.89 |
| Hymba-1.5B-Base | 1.5B | 3GB | 22.95 | 1.36 | 7.69 | 5.18 | 10.25 | 0.78 | 8.04 |
| Falcon-E-1B-Base | 1.8B | 635MB | 32.9 | 10.97 | 2.8 | 3.65 | 12.28 | 17.82 | 13.40 |
For 3B scale models
| Model | Nb Params | Mem Footprint | IFEVAL | Math-Hard | GPQA | MuSR | BBH | MMLU-Pro | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Falcon-3-3B-Base | 3B | 6.46GB | 15.74 | 11.78 | 21.58 | 6.27 | 18.09 | 6.26 | 15.74 |
| Qwen2.5-3B | 3B | 6.17GB | 26.9 | 14.8 | 24.3 | 11.76 | 24.48 | 6.38 | 18.1 |
| Falcon-E-3B-Base | 3B | 999MB | 36.67 | 13.45 | 8.67 | 4.14 | 19.83 | 27.16 | 18.32 |
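The Avg. column appears to be the unweighted mean of the six task scores. That reading is an assumption, checked below against the Falcon-E-3B-Base row only, with the scores copied from the table above.

```python
from statistics import mean

# Task scores for Falcon-E-3B-Base, copied from the table above.
falcon_e_3b_base = {
    "IFEVAL": 36.67,
    "Math-Hard": 13.45,
    "GPQA": 8.67,
    "MuSR": 4.14,
    "BBH": 19.83,
    "MMLU-Pro": 27.16,
}

# Unweighted mean over the six tasks, rounded as in the table.
average = round(mean(falcon_e_3b_base.values()), 2)
print(average)
```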
Below are the results for instruction fine-tuned models:
For 1B scale models and below
| Model | Nb Params | Mem Footprint | IFEVAL | Math-Hard | GPQA | MuSR | BBH | MMLU-Pro | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-0.5B-Instruct | 500M | 1GB | 30.71 | 0 | 8.43 | 0.94 | 7.75 | 0 | 6.59 |
| SmolLM2-360M-Instruct | 360M | 720MB | 38.42 | 1.51 | 4.17 | 2.77 | 1.3 | 0.67 | 8.14 |
| Qwen-2.5-1.5B-Instruct | 1.5B | 3.1GB | 44.76 | 22.05 | 19.81 | 3.19 | 19.99 | 0.78 | 18.43 |
| SmolLM2-1.7B-Instruct | 1.7B | 3.4GB | 53.68 | 5.82 | 10.92 | 4.1 | 11.71 | 0 | 15.02 |
| Falcon-3-1B-Instruct | 1.5B | 3GB | 55.57 | 6.34 | 12.96 | 10.56 | 9.32 | 2.24 | 16.16 |
| Hymba-1.5B-Instruct | 1.5B | 3GB | 60.09 | 2.72 | 4.59 | 1.05 | 11.56 | 5.515 | 14.19 |
| Falcon-E-1B-Instruct | 1.8B | 635MB | 54.35 | 9.12 | 16.5 | 2.51 | 19.42 | 9.64 | 18.59 |
For 3B scale models
| Model | Nb Params | Mem Footprint | IFEVAL | Math-Hard | GPQA | MuSR | BBH | MMLU-Pro | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Falcon-3-3B-Instruct | 3B | 6.46GB | 69.77 | 25 | 26.29 | 11.13 | 22.28 | 5.15 | 26.6 |
| Qwen2.5-3B-Instruct | 3B | 6.17GB | 64.75 | 36.78 | 25.8 | 7.57 | 25.05 | 3.02 | 27.16 |
| Falcon-E-3B-Instruct | 3B | 999MB | 60.97 | 15.3 | 23.59 | 2.12 | 26.45 | 7.45 | 22.65 |
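To put the Mem Footprint column in perspective, the sketch below computes how many times larger each 3B instruct model is than Falcon-E-3B-Instruct, using the values from the table above. It assumes decimal units (1GB = 1000MB), which the card does not state, and `to_megabytes` is a hypothetical helper.

```python
def to_megabytes(size: str) -> float:
    """Parse strings like "6.17GB" or "999MB" (assumption: 1GB = 1000MB)."""
    units = {"GB": 1000.0, "MB": 1.0}
    for suffix, factor in units.items():
        if size.endswith(suffix):
            return float(size[: -len(suffix)]) * factor
    raise ValueError(f"unrecognized size: {size!r}")


# Memory footprints copied from the 3B instruct table above.
footprints = {
    "Falcon-3-3B-Instruct": "6.46GB",
    "Qwen2.5-3B-Instruct": "6.17GB",
    "Falcon-E-3B-Instruct": "999MB",
}

baseline = to_megabytes(footprints["Falcon-E-3B-Instruct"])
ratios = {name: to_megabytes(size) / baseline for name, size in footprints.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the Falcon-E-3B-Instruct footprint")
```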
Useful links
- View our release blogpost.
- Learn more about the onebitllms library.
- Feel free to join our Discord server if you have any questions or want to interact with our researchers and developers.
Citation
If the Falcon-E family of models was helpful to your work, please consider citing us.
@misc{tiionebitllms,
title = {Falcon-E, a series of powerful, universal and fine-tunable 1.58bit language models.},
author = {Falcon-LLM Team},
month = {April},
url = {https://falcon-lm.github.io/blog/falcon-edge},
year = {2025}
}
