---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
model_type: causal-lm
base_model: Qwen/Qwen3-4B
tags:
- reasoning
- tree-of-thoughts
- gnn
- self-improving
- autonomous-training
- multi-agent
- variance-curriculum
- reinforcement-learning
- Trident
datasets:
- gsm8k
- mmlu
- gpqa
- arc-challenge
- truthfulqa
- winogrande
metrics:
- accuracy
inference: true
training: true
model-index:
- name: TRIDENT
  results:
  - task:
      type: text-generation
    dataset:
      name: GSM8K
      type: gsm8k
      split: test
    metrics:
    - type: accuracy
      value: 86.58
  - task:
      type: text-generation
    dataset:
      name: MMLU
      type: mmlu
      split: test
    metrics:
    - type: accuracy
      value: 72.61
  - task:
      type: text-generation
    dataset:
      name: GPQA
      type: gpqa
      split: test
    metrics:
    - type: accuracy
      value: 42.42
  - task:
      type: text-generation
    dataset:
      name: ARC-Challenge
      type: arc-challenge
      split: test
    metrics:
    - type: accuracy
      value: 59.0
---

# TRIDENT

**TRIDENT** is a reasoning-focused 4B-parameter language model that improves its own reasoning capability through **algorithmic self-improvement** rather than parameter scaling.

The model is built on **Qwen3-4B** and enhanced using the **TRIDENT framework**: a combination of GNN-guided Tree-of-Thoughts search, multi-agent reasoning policies, and variance-based self-training.

---
## Overview

Traditional large language model training depends on:
- Human-written reasoning traces
- Manually curated preference datasets
- Static fine-tuning pipelines

**TRIDENT removes these dependencies.**

Instead, the model:
1. Explores multiple reasoning paths
2. Evaluates them using a learned GNN policy
3. Selects high-uncertainty problems automatically
4. Generates its own training supervision
5. Distills improvements back into the model using LoRA (sketched below)
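
A minimal, runnable sketch of this loop is below. Every helper in it (`generate_paths`, `verify`, the selection settings) is a toy stand-in invented for illustration, not the released TRIDENT implementation; the GNN scoring of step 2 is sketched separately in the Tree-of-Thoughts section.

```python
# Toy sketch of the TRIDENT self-improvement loop; all helpers are
# hypothetical stand-ins, not the released training code.
import random
import statistics

def generate_paths(problem, k):
    # Placeholder: sample k candidate reasoning paths for a problem.
    return [f"{problem}::path{i}" for i in range(k)]

def verify(path):
    # Placeholder for a verifiable task reward (e.g. exact-match answer check).
    return float(random.random() > 0.5)

def self_improvement_round(problems, k=8, top_n=2):
    scored = []
    for problem in problems:
        paths = generate_paths(problem, k)        # 1. explore reasoning paths
        rewards = [verify(p) for p in paths]      # reward each path
        variance = statistics.pvariance(rewards)  # 3. inconsistency signal
        scored.append((variance, problem, paths, rewards))
    # Keep the highest-variance problems as the next curriculum slice.
    selected = sorted(scored, key=lambda s: s[0], reverse=True)[:top_n]
    # 4. The best verified path per problem becomes self-generated supervision;
    #    step 5 would distill these traces back into the model with LoRA.
    return [max(zip(p, r), key=lambda x: x[1])[0] for _, _, p, r in selected]

print(self_improvement_round(["q1", "q2", "q3", "q4"]))
```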

---

## Core Capabilities

### GNN-Guided Tree-of-Thoughts
Reasoning is represented as a directed graph of intermediate states.
A 3-layer Graph Convolutional Network predicts a **promise score** for each branch, guiding exploration and pruning.
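
A minimal sketch of such a scorer in plain PyTorch, assuming node features and a row-normalized adjacency matrix; the layer widths here are illustrative, not the released architecture:

```python
# 3-layer GCN that assigns each reasoning state a promise score in (0, 1).
import torch
import torch.nn as nn

class PromiseGCN(nn.Module):
    def __init__(self, in_dim=64, hidden=128):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(in_dim, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
        ])
        self.head = nn.Linear(hidden, 1)  # scalar promise score per node

    def forward(self, x, a_hat):
        # Each GCN layer aggregates neighbor features (a_hat @ x), then projects.
        for layer in self.layers:
            x = torch.relu(layer(a_hat @ x))
        return torch.sigmoid(self.head(x)).squeeze(-1)

# Toy thought-graph: 5 reasoning states with random features and edges.
n = 5
adj = torch.eye(n) + torch.rand(n, n).round()  # self-loops + random edges
a_hat = adj / adj.sum(dim=1, keepdim=True)     # row-normalize adjacency
scores = PromiseGCN()(torch.randn(n, 64), a_hat)
print(scores)  # higher score = more promising branch to expand
```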

### Multi-Agent Reasoning
Four internal agents (Conservative, Exploratory, Balanced, Reflective) vote on reasoning actions to balance exploration and correctness.
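
The card specifies the four roles and that they vote; how each role scores a candidate action is an assumption made here purely for illustration:

```python
# Hypothetical scoring profiles for the four voting agents.
def conservative(a):  # favors high verifier confidence
    return a["confidence"]

def exploratory(a):   # favors novel, less-visited branches
    return a["novelty"]

def balanced(a):
    return 0.5 * a["confidence"] + 0.5 * a["novelty"]

def reflective(a):    # penalizes actions that contradict earlier steps
    return a["confidence"] - a["contradiction"]

AGENTS = [conservative, exploratory, balanced, reflective]

def vote(actions):
    # Each agent casts one vote for its top-ranked action; ties break by order.
    tally = [0] * len(actions)
    for agent in AGENTS:
        tally[max(range(len(actions)), key=lambda i: agent(actions[i]))] += 1
    return max(range(len(actions)), key=lambda i: tally[i])

actions = [
    {"confidence": 0.9, "novelty": 0.1, "contradiction": 0.0},
    {"confidence": 0.6, "novelty": 0.8, "contradiction": 0.2},
]
print(vote(actions))  # index of the action chosen by majority vote
```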

### Variance-Based Curriculum
Problems are selected for training based on **reward variance**, targeting examples where the model is inconsistent and the learning signal is highest.
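
A short sketch of the selection criterion, assuming several sampled rollouts per problem with binary verifiable rewards (the data here is illustrative):

```python
# Variance-based curriculum selection over sampled rollout rewards.
import statistics

def select_curriculum(reward_samples, top_n):
    """reward_samples: {problem_id: [0/1 reward per sampled rollout]}."""
    # Variance peaks when the model succeeds about half the time; always-right
    # and always-wrong problems carry little learning signal.
    by_variance = sorted(
        reward_samples.items(),
        key=lambda kv: statistics.pvariance(kv[1]),
        reverse=True,
    )
    return [pid for pid, _ in by_variance[:top_n]]

samples = {
    "easy":  [1, 1, 1, 1],  # variance 0.00 -> skipped
    "mixed": [1, 0, 1, 0],  # variance 0.25 -> selected
    "hard":  [0, 0, 0, 0],  # variance 0.00 -> skipped
}
print(select_curriculum(samples, top_n=1))  # -> ['mixed']
```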

### Self-Generative Reasoning Loop
No human-annotated reasoning traces are used.
The model autonomously generates, evaluates, and curates its own reasoning data.

### Stable Training
A multi-layer reward stabilization mechanism prevents:
- Reward collapse
- Loss explosions
- Infinite failure loops

The architecture is compatible with future GRPO-style reinforcement learning.
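
The card names the failure modes but not the guards, so the layers below (reward clipping, an EMA baseline, a consecutive-failure cap) are assumptions chosen to illustrate the idea; the advantage it returns is the quantity a GRPO-style update would consume:

```python
# Hypothetical multi-layer reward stabilizer.
class RewardStabilizer:
    def __init__(self, clip=5.0, ema=0.99, max_failures=3):
        self.clip, self.ema, self.max_failures = clip, ema, max_failures
        self.baseline, self.failures = 0.0, 0

    def step(self, reward):
        # Layer 1: clip raw rewards so one outlier cannot explode the loss.
        reward = max(-self.clip, min(self.clip, reward))
        # Layer 2: subtract an EMA baseline to keep the signal centered and
        # to resist reward collapse when all rollouts start failing.
        self.baseline = self.ema * self.baseline + (1 - self.ema) * reward
        advantage = reward - self.baseline
        # Layer 3: cap consecutive non-positive-advantage steps to break
        # infinite failure loops (e.g. by resampling easier problems).
        self.failures = self.failures + 1 if advantage <= 0 else 0
        return advantage, self.failures >= self.max_failures

stab = RewardStabilizer()
for r in [0.0, 0.0, 0.0, 1.0]:
    print(stab.step(r))  # (advantage, should_resample)
```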

---

## Benchmark Results

Accuracy comparison against the base model:

| Benchmark | Qwen3-4B | TRIDENT |
|---|---|---|
| GSM8K (5-shot) | 74.14 | **86.58** |
| MMLU (5-shot) | 47.70 | **72.61** |
| ARC-C (25-shot) | 54.0 | **59.0** |
| GPQA (0-shot) | 28.28 | **42.42** |
| Winogrande (0-shot) | 59.6 | **67.08** |
| TruthfulQA (0-shot) | **54.9** | 54.7 |

**Highlight:** a +14.14 percentage-point improvement on **GPQA (0-shot)**.

---

## Intended Use

TRIDENT is suitable for:
- Multi-step mathematical reasoning
- Scientific and logical inference
- Hard QA benchmarks
- Planning and hypothesis exploration
- Research on reasoning systems
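
Since the model ships as a `transformers` causal LM, it loads with the standard API. A minimal quickstart, in which the repository id is a placeholder for this repo's actual path and the generation settings are illustrative defaults:

```python
# Load and query TRIDENT with the standard transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shivik-labs/TRIDENT"  # placeholder: substitute this repo's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "A train travels 120 km in 1.5 hours. What is its average speed?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```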

---

## Limitations

- Higher inference-time compute than single-pass models
- Not optimized for low-latency chat
- Best used where reasoning depth matters more than speed

---

## Ethical Considerations

- No human-written reasoning traces used
- No preference data collection
- Training relies on verifiable task rewards
- Like all LLMs, may hallucinate outside structured reasoning workflows

---

## Paper

https://www.shivik.in/shivik-labs/trident

## Citation

```bibtex
@article{puri2025trident,
  title={TRIDENT: Thought-based Reasoning and Improvement through Deep Exploration of Neuronal Trees},
  author={Puri, Shivansh and Khandelwal, Abhisek and Joshi, Vedant and Yadav, Akash},
  year={2025},
  url={https://www.shivik.in/shivik-labs/trident}
}
```