---
language:
- en
license: apache-2.0
tags:
- math
- reasoning
- agent
- qwen
- grpo
- reinforcement-learning
base_model: Qwen/Qwen3-4B-Thinking-2507
datasets:
- nvidia/OpenMathReasoning
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-generation
---

# DeepMath: A Lightweight Math Reasoning Agent

<img src="https://cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/ndb_WmPavW1MONAjsGpYT.jpeg" style="width:600px" alt="An LLM is using a calculator to answer questions." />

## Model Description

**DeepMath** is a 4B-parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) and trained with **GRPO (Group Relative Policy Optimization)**, DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.

- **Developed by:** Intel AI Labs
- **Model type:** Causal language model with agent capabilities
- **Language:** English
- **Base model:** [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
- **License:** Apache 2.0
- **Blog:** 🔗 <https://huggingface.co/blog/intel-deepmath>
- **Repository:** 💻 [https://github.com/IntelLabs/DeepMath](https://github.com/IntelLabs/DeepMath)

## Key Features

✅ **Code-driven reasoning:** Generates short Python snippets for intermediate computational steps
✅ **Sandboxed execution:** No file I/O, no network calls, strict timeouts (see the sketch below)
✅ **Improved accuracy:** Offloading computation to the executor reduces arithmetic errors
✅ **Reduced verbosity:** Up to 66% shorter outputs compared to the baseline
✅ **Safe and auditable:** Deterministic execution with readable code snippets
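
The actual executor lives in the [repository](https://github.com/IntelLabs/DeepMath); as an illustration of the sandbox model described above (allow-listed imports plus a per-snippet timeout in a separate process), here is a minimal sketch. The `ALLOWED_MODULES` list and the convention of returning the answer in a `result` variable are assumptions for the example, not the project's actual configuration:

```python
import multiprocessing

# Hypothetical allow-list; DeepMath's actual list may differ.
ALLOWED_MODULES = {"math", "fractions", "itertools", "sympy"}

def _worker(snippet: str, queue: multiprocessing.Queue) -> None:
    import builtins
    real_import = builtins.__import__

    def guarded_import(name, *args, **kwargs):
        # Reject imports outside the allow-list (no os, socket, pathlib, ...).
        if name.split(".")[0] not in ALLOWED_MODULES:
            raise ImportError(f"module {name!r} is not allow-listed")
        return real_import(name, *args, **kwargs)

    builtins.__import__ = guarded_import
    scope: dict = {}
    try:
        exec(snippet, scope)  # run the model-generated snippet
        queue.put(("ok", scope.get("result")))  # convention: answer in `result`
    except Exception as exc:
        queue.put(("error", repr(exc)))

def run_sandboxed(snippet: str, timeout_s: float = 5.0):
    """Execute a snippet in a child process with a hard per-snippet timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(snippet, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():  # timeout: kill the child and report failure
        proc.terminate()
        proc.join()
        return ("error", "timeout")
    return queue.get() if not queue.empty() else ("error", "no output")

if __name__ == "__main__":
    print(run_sandboxed("import math\nresult = math.comb(100, 50)"))
```

Note that an import allow-list alone is not a complete sandbox (built-ins such as `open` remain reachable); see the Safety Considerations section below for deployment-level isolation.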

## Model Architecture

DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking, with the following components:

- **Agent Interface:** Outputs special tokens that trigger Python code execution during reasoning (sketched below)
- **Executor:** Sandboxed Python environment with allow-listed modules
- **Safety Constraints:** Per-snippet timeouts, no file/network access
- **Training Method:** GRPO with accuracy and code-generation rewards
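
To make the agent interface concrete, here is a minimal sketch of the generate-execute-resume loop. The `<code>`/`<output>` markers and the `model.generate` interface are placeholders, not the model's real special tokens; the actual agent builds on SmolAgents and vLLM (see Training Infrastructure):

```python
# Hypothetical special tokens; the real token strings are model-specific.
CODE_START, CODE_END = "<code>", "</code>"

def agent_generate(model, prompt: str, max_steps: int = 50) -> str:
    trace = prompt
    for _ in range(max_steps):
        # Generate until the model finishes or closes a code block.
        chunk = model.generate(trace, stop=[CODE_END])
        trace += chunk
        if CODE_START not in chunk:
            break  # no tool call in this step: the answer is complete
        snippet = chunk.split(CODE_START, 1)[1]  # extract the Python snippet
        status, value = run_sandboxed(snippet)   # executor from the sketch above
        # Splice the execution result back into the reasoning trace.
        trace += f"{CODE_END}\n<output>{status}: {value}</output>\n"
    return trace
```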

<figure>
<img src="https://cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/zOcvJ2DY61QZyozarsKbT.png" style="width:400px" alt="Changes to vLLM client and server in TRL library." />
<figcaption><p><em>Figure 1: The vLLM client and server were modified so that the DeepMath agent generates the candidates while still using the vLLM backend.</em></p></figcaption>
</figure>

## Training Details

### Training Data

- **Dataset:** [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) (tool-usage subset)
- **Note:** GRPO training uses only the problems, not the reference solutions
- **In-context Learning:** 4 solved examples demonstrating the agent-call syntax and patterns

### Training Procedure

**GRPO (Group Relative Policy Optimization)** fine-tuning with:

- **Accuracy Reward:** +1 for a correct final answer
- **Code Generation Reward:** +1 for using code snippets (weighted 10:1 vs. accuracy)
- **Length Constraint:** GRPO completions limited to 5k tokens
- **Temperature Scheduling:** Linear schedule from T=1.2 to T=0.7 over the course of training
- **Infrastructure:** Modified vLLM client and server in the TRL library

A sketch of how these reward signals map onto TRL's GRPO trainer follows.
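
TRL's `GRPOTrainer` accepts a list of reward functions that receive the sampled completions plus any dataset columns as keyword arguments and return one score per completion, with relative weights settable via `GRPOConfig(reward_weights=...)`. The `\boxed{}` parsing, the `<code>` marker, and the `answer` column name below are illustrative assumptions (plain-text completions assumed), as is the helper for the linear temperature schedule:

```python
import re

def accuracy_reward(completions, answer, **kwargs):
    # +1 when the final \boxed{...} answer matches the reference column.
    rewards = []
    for completion, ref in zip(completions, answer):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        rewards.append(1.0 if match and match.group(1).strip() == str(ref) else 0.0)
    return rewards

def code_reward(completions, **kwargs):
    # +1 when the completion actually calls the Python executor.
    return [1.0 if "<code>" in c else 0.0 for c in completions]

def temperature(step: int, total_steps: int) -> float:
    # Linear temperature schedule T=1.2 -> T=0.7 over training.
    return 1.2 + (0.7 - 1.2) * step / max(total_steps, 1)
```

The 10:1 weighting noted in the list above would then be expressed through `reward_weights` when the two functions are passed to the trainer.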

### Training Infrastructure

- Base inference engine: [vLLM](https://github.com/vllm-project/vllm)
- Agent framework: Based on [SmolAgents](https://github.com/huggingface/smolagents/)
- Training framework: Modified [TRL](https://github.com/huggingface/trl) GRPO trainer

## Performance

### Benchmark Results

We evaluated DeepMath on four mathematical reasoning datasets, reporting **majority@16** accuracy and mean output length (majority@16 is sketched below):
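
For reference, majority@16 samples 16 completions per problem and scores the answer chosen by plurality vote; a minimal sketch:

```python
from collections import Counter

def majority_at_k(sampled_answers: list[str]) -> str:
    """Return the most common answer among k sampled completions."""
    return Counter(sampled_answers).most_common(1)[0][0]

# e.g., 16 sampled final answers for one problem
answers = ["5050"] * 9 + ["5049"] * 4 + ["5051"] * 3
assert majority_at_k(answers) == "5050"
```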

<img src="https://cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/mBuINzNvjDKdZEuIqzJeO.png" style="width:800px" alt="Main results table showing performance across MATH500, AIME, HMMT, and HLE datasets."/>

**Key Findings:**

- **Accuracy:** Improved performance on the challenging datasets (AIME, HMMT, HLE)
- **Efficiency:** Up to a **66% reduction** in output length
- **Robustness:** Consistent improvements when combining the agent with GRPO training

### Evaluation Datasets

- **MATH500:** Subset of the MATH dataset
- **AIME:** American Invitational Mathematics Examination problems
- **HMMT:** Harvard-MIT Mathematics Tournament problems
- **HLE:** Humanity's Last Exam problems

<figure>
<img src="https://cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/a-kn3oHdlxTP_L-63N9LX.png" style="width:700px" alt="Output example showing Python code generation and execution." />
<figcaption><p><em>Figure 2: Example output where Python code is generated, evaluated, and the result is inserted into the reasoning trace.</em></p></figcaption>
</figure>

## Usage

### Installation

```bash
# Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/IntelLabs/DeepMath.git
cd DeepMath

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .
```

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/deepmath-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Example problem
problem = "What is the sum of the first 100 positive integers?"

# Qwen3-Thinking models expect chat-formatted input
messages = [{"role": "user", "content": problem}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=3000)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

### Inference with Agent

For full agent capabilities with sandboxed Python execution:

```bash
python inference.py \
  +model.use_vllm=true \
  +model.math_agent=true \
  +model.examples=deep_math/fewshot.txt \
  model.generation.max_new_tokens=3000 \
  +model.max_agent_output=20000 \
  +model.max_steps=50 \
  model.model_name_or_path=Intel/deepmath-v1 \
  hf_tag=HuggingFaceH4/MATH-500 \
  generated_file=output.jsonl
```

See the [repository](https://github.com/IntelLabs/DeepMath) for complete usage examples.

## Limitations and Biases

### Limitations

- **Scope:** Optimized for mathematical reasoning tasks; may not generalize to other domains
- **Problem Types:** Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
- **Model Size:** 4B parameters may limit reasoning depth on extremely complex problems
- **Code Execution:** Requires a sandboxed environment for full agent capabilities

### Safety Considerations

⚠️ **Code Execution Risk:** This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:

- Carefully manage attack surfaces
- Enforce rate limits
- Use proper isolation (containers, VMs); see the example below
- Monitor resource usage
- Validate generated code before execution in production
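
As one illustration of container-level isolation (generic Docker flags, not the project's official deployment; the image name is hypothetical):

```bash
# Run the executor with no network, capped memory/CPU/PIDs, a read-only
# filesystem, and all Linux capabilities dropped.
docker run --rm \
  --network none \
  --memory 512m --cpus 1 \
  --pids-limit 64 \
  --read-only --cap-drop ALL \
  deepmath-executor:latest
```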

### Ethical Considerations

- The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
- Generated code should be reviewed before execution in production environments
- The model may reflect biases present in the training data

## Citation

If you use DeepMath in your research, please cite:

```bibtex
@software{deepmath2025,
  author    = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title     = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year      = {2025},
  publisher = {Intel AI Labs},
  url       = {https://github.com/IntelLabs/DeepMath}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/IntelLabs/DeepMath).