LiteResearcher-4B-SFT

This is the SFT cold-start checkpoint for LiteResearcher — a scalable agentic RL training framework for deep-research agents.

It is the initial policy used to launch the two-stage curriculum RL training that produces the final simplex-ai-inc/LiteResearcher-4B model.

If you are looking for the final RL model, please use simplex-ai-inc/LiteResearcher-4B. If you want to reproduce the RL training from scratch, this is the checkpoint you need.

Model details

  • Base model: Qwen/Qwen3-4B-Thinking-2507
  • Architecture: Qwen3ForCausalLM (36 layers, hidden 2560, 32 heads, GQA 8 KV heads)
  • Max position embeddings: 262,144 (RoPE θ = 5,000,000)
  • Precision: bfloat16
  • Total params: ~4B
  • Training framework: LLaMA-Factory

Training recipe

Item Value
Stage SFT (cold-start before RL)
Base model Qwen/Qwen3-4B-Thinking-2507
Dataset simplex-ai-inc/LiteResearcher-Data (~68.2k SFT trajectories)
Max sequence length 64K (cutoff_len=65536)
Global batch size 128 (per-device bs 2 × grad-accum 8 × 8 GPUs)
Epochs 1
Optimizer steps 533
Learning rate 2.0e-5, cosine, 10% warmup
Final train loss ≈ 0.447 (starting loss ≈ 1.19)

The SFT trajectories teach the model the ReAct think → search → visit → answer loop and the strict <answer>...</answer> output contract used by the RL environment. Because the base is the Thinking-2507 variant, the model preserves long chain-of-thought behavior inside <think>...</think> blocks, which is what the downstream RL curriculum builds on.

How to use

As the initial policy for RL (recommended use)

# In the LiteResearcher training scripts (Training/ folder of the repo)
export MODEL_PATH=$(hf download simplex-ai-inc/LiteResearcher-4B-SFT \
                                --local-dir ./literesearcher_sft)

Then follow the Stage-1 / Stage-2 RL instructions in the LiteResearcher repository.

Stand-alone inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "simplex-ai-inc/LiteResearcher-4B-SFT"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

The model expects the same ReAct system prompt and tool schema used by LiteResearcher (see Inference/ in the repo).

Citation

If you use this checkpoint in academic work, please cite the LiteResearcher project — see the GitHub README for the BibTeX entry.

License

Apache-2.0, inheriting from the Qwen3-4B-Thinking-2507 base model.

Downloads last month
1
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for simplex-ai-inc/LiteResearcher-4B-SFT

Finetuned
(237)
this model

Dataset used to train simplex-ai-inc/LiteResearcher-4B-SFT