Instructions for using tiiuae/Falcon-E-3B-Base-prequantized with the libraries and local apps listed below.

Libraries

How to use tiiuae/Falcon-E-3B-Base-prequantized with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/Falcon-E-3B-Base-prequantized")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon-E-3B-Base-prequantized")
model = AutoModelForCausalLM.from_pretrained("tiiuae/Falcon-E-3B-Base-prequantized")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use tiiuae/Falcon-E-3B-Base-prequantized with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tiiuae/Falcon-E-3B-Base-prequantized"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/Falcon-E-3B-Base-prequantized",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
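
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and it assumes the vLLM server from the previous step is listening on `localhost:8000` and returns the usual OpenAI-compatible response shape.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="tiiuae/Falcon-E-3B-Base-prequantized",
                       base_url="http://localhost:8000"):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and parse the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("What is the capital of France?")
# With the server running, the reply text would be read like this:
# reply = send_chat_request(request)
# print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made. For the SGLang server, pass `base_url="http://localhost:30000"` instead.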

Use Docker

docker model run hf.co/tiiuae/Falcon-E-3B-Base-prequantized

SGLang

How to use tiiuae/Falcon-E-3B-Base-prequantized with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tiiuae/Falcon-E-3B-Base-prequantized" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/Falcon-E-3B-Base-prequantized",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tiiuae/Falcon-E-3B-Base-prequantized" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/Falcon-E-3B-Base-prequantized",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use tiiuae/Falcon-E-3B-Base-prequantized with Docker Model Runner:
```
docker model run hf.co/tiiuae/Falcon-E-3B-Base-prequantized
```

TL;DR
Model Details
Training Details
Usage
Evaluation
Citation

This repository is a mirror of the prequantized branch of https://huggingface.co/tiiuae/Falcon-E-3B-Base.

TL;DR

Model Details

Model Description

Developed by: https://www.tii.ae
Model type: Causal decoder-only / Base version
Architecture: Pure-transformer - 1.58bit version
Language(s) (NLP): English
License: Falcon-LLM License
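
The "1.58bit" in the architecture line matches log2(3) ≈ 1.58, the number of bits needed to encode a weight that takes one of three values (-1, 0, 1). The sketch below illustrates this with the absmean ternary quantization described for BitNet b1.58. This card does not say which quantizer Falcon-E uses, so treat the scheme as an assumption and the function name `absmean_ternary` as hypothetical.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Quantize a list of floats to {-1, 0, 1} using an absmean scale.

    Assumed scheme (BitNet b1.58 style): divide by the mean absolute value,
    round to the nearest integer, then clip to the range [-1, 1].
    """
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

quantized, scale = absmean_ternary([0.4, -1.2, 0.05, 0.9])
print(f"{bits_per_weight:.2f} bits per weight")
print(quantized, scale)
```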

Training details

For more details about the training protocol of this model, please refer to the Falcon-E technical blogpost.

Usage

You can currently use this model through either the Hugging Face transformers library or the BitNet library. There are several ways to interact with the model depending on your target usage. Each model in the Falcon-E series comes in three variants: the BitNet model, the prequantized checkpoint for fine-tuning, and the bfloat16 version of the BitNet model.
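
As a rough sketch of how the three variants map onto repository revisions: the snippets in this card load the BitNet checkpoint without a `revision` argument and select the other two with `revision="bfloat16"` and `revision="prequantized"`. The helper `load_kwargs` and the variant labels below are hypothetical names used only for illustration.

```python
# Variant label -> branch of the model repository.
# None means the default branch, which the inference snippet in this card
# uses for the BitNet checkpoint.
VARIANT_REVISIONS = {
    "bitnet": None,
    "bfloat16": "bfloat16",
    "prequantized": "prequantized",
}


def load_kwargs(variant):
    """Return the extra keyword arguments for from_pretrained (hypothetical helper)."""
    if variant not in VARIANT_REVISIONS:
        raise ValueError(f"unknown variant: {variant!r}")
    revision = VARIANT_REVISIONS[variant]
    return {} if revision is None else {"revision": revision}


# Usage with transformers (not executed here):
# model = AutoModelForCausalLM.from_pretrained(
#     "tiiuae/Falcon-E-1B-Base",
#     torch_dtype=torch.bfloat16,
#     **load_kwargs("bfloat16"),
# )
print(load_kwargs("prequantized"))
```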

Inference

🤗 transformers

To run inference on the BitNet checkpoint:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"

model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.bfloat16,
).to("cuda")

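# Perform text generation
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))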

If you would rather use the classic bfloat16 version, run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"
revision = "bfloat16"

model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.bfloat16,
  revision=revision,
).to("cuda")

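# Perform text generation
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))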

BitNet

git clone https://github.com/microsoft/BitNet && cd BitNet
pip install -r requirements.txt
python setup_env.py --hf-repo tiiuae/Falcon-E-1B-Base -q i2_s
python run_inference.py -m models/Falcon-E-1B-Base/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv

Apple mlx-lm

pip install -U mlx-lm

Then:

mlx_lm.generate --model tiiuae/Falcon-E-3B-Instruct --prompt "Implement bubble sort" --max-tokens 100 --temp 0.1

Fine-tuning

For fine-tuning the model, you should load the prequantized revision of the model and use the onebitllms Python package:

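import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
# onebitllms-specific import
from onebitllms import replace_linear_with_bitnet_linear, quantize_to_1bit

model_id = "tiiuae/Falcon-E-1B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="prequantized",  # onebitllms-specific: load the prequantized weights
)
# onebitllms-specific: swap the linear layers for BitNet linear layers
model = replace_linear_with_bitnet_linear(model)

trainer = SFTTrainer(
    model,
    ...
)

trainer.train()

# onebitllms-specific: quantize the fine-tuned checkpoint.
# output_directory is not defined in this snippet; it is presumably the
# directory the trainer saved its checkpoint to.
quantize_to_1bit(output_directory)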

Evaluation

The tables below report results from our internal benchmark pipeline.

Note: evaluation results are normalized scores on the tasks of the former Hugging Face Leaderboard v2.

For 1B scale models and below

Model	Nb Params	Mem Footprint	IFEVAL	Math-Hard	GPQA	MuSR	BBH	MMLU-Pro	Avg.
Qwen-2.5-0.5B	0.5B	1GB	16.27	3.93	0.0	2.08	6.95	10.06	6.55
SmolLM2-360M	0.36B	720MB	21.15	1.21	0.0	7.73	5.54	1.88	6.25
Qwen-2.5-1.5B	1.5B	3.1GB	26.74	9.14	16.66	5.27	20.61	4.7	13.85
Llama-3.2-1B	1.24B	2.47GB	14.78	1.21	4.37	2.56	2.26	0	4.2
SmolLM2-1.7B	1.7B	3.4GB	24.4	2.64	9.3	4.6	12.64	3.91	9.58
Falcon-3-1B-Base	1.5B	3GB	24.28	3.32	11.34	9.71	6.76	3.91	9.89
Hymba-1.5B-Base	1.5B	3GB	22.95	1.36	7.69	5.18	10.25	0.78	8.04
Falcon-E-1B-Base	1.8B	635MB	32.9	10.97	2.8	3.65	12.28	17.82	13.40

For 3B scale models

Model	Nb Params	Mem Footprint	IFEVAL	Math-Hard	GPQA	MuSR	BBH	MMLU-Pro	Avg.
Falcon-3-3B-Base	3B	6.46GB	15.74	11.78	21.58	6.27	18.09	6.26	15.74
Qwen2.5-3B	3B	6.17GB	26.9	14.8	24.3	11.76	24.48	6.38	18.1
Falcon-E-3B-Base	3B	999MB	36.67	13.45	8.67	4.14	19.83	27.16	18.32
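
The card does not state how the Avg. column is computed. Assuming it is the unweighted mean of the six task scores, the Falcon-E base rows can be checked with a short script; the scores are copied from the tables above, and the name `average_score` is hypothetical.

```python
from statistics import mean

# Column order: IFEVAL, Math-Hard, GPQA, MuSR, BBH, MMLU-Pro
scores = {
    "Falcon-E-1B-Base": (32.9, 10.97, 2.8, 3.65, 12.28, 17.82),
    "Falcon-E-3B-Base": (36.67, 13.45, 8.67, 4.14, 19.83, 27.16),
}


def average_score(task_scores):
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(mean(task_scores), 2)


averages = {name: average_score(values) for name, values in scores.items()}
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

For these two rows the computed means agree with the reported Avg. values of 13.40 and 18.32.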

Below are the results for instruction fine-tuned models:

For 1B scale models and below

Model	Nb Params	Mem Footprint	IFEVAL	Math-Hard	GPQA	MuSR	BBH	MMLU-Pro	Avg.
Qwen-2.5-0.5B-Instruct	0.5B	1GB	30.71	0	8.43	0.94	7.75	0	6.59
SmolLM2-360M-Instruct	0.36B	720MB	38.42	1.51	4.17	2.77	1.3	0.67	8.14
Qwen-2.5-1.5B-Instruct	1.5B	3.1GB	44.76	22.05	19.81	3.19	19.99	0.78	18.43
SmolLM2-1.7B	1.7B	3.4GB	53.68	5.82	10.92	4.1	11.71	0	15.02
Falcon-3-1B-Instruct	1.5B	3GB	55.57	6.34	12.96	10.56	9.32	2.24	16.16
Hymba-1.5B-Instruct	1.5B	3GB	60.09	2.72	4.59	1.05	11.56	5.515	14.19
Falcon-E-1B-Instruct	1.8B	635MB	54.35	9.12	16.5	2.51	19.42	9.64	18.59

For 3B scale models

Model	Nb Params	Mem Footprint	IFEVAL	Math-Hard	GPQA	MuSR	BBH	MMLU-Pro	Avg.
Falcon-3-3B-Instruct	3B	6.46GB	69.77	25	26.29	11.13	22.28	5.15	26.6
Qwen2.5-3B-Instruct	3B	6.17GB	64.75	36.78	25.8	7.57	25.05	3.02	27.16
Falcon-E-3B-Instruct	3B	999MB	60.97	15.3	23.59	2.12	26.45	7.45	22.65

Useful links

View our release blogpost.
Learn more about onebitllms library.
Join our Discord server if you have questions or want to talk with our researchers and developers.

Citation

If the Falcon-E family of models was helpful in your work, please cite us.

@misc{tiionebitllms,
    title = {Falcon-E, a series of powerful, universal and fine-tunable 1.58bit language models.},
    author = {Falcon-LLM Team},
    month = {April},
    url = {https://falcon-lm.github.io/blog/falcon-edge},
    year = {2025}
}