Atem-8B
Ancient logic. Modern intelligence.
An 8B reasoning model trained via a single CoT-preserving SFT pass directly on Qwen3-8B, distilling multi-domain reasoning capability from frontier teacher models while keeping the base model's native thinking capability intact.
Overview
Atem-8B is an 8B parameter reasoning model built via a single supervised fine-tuning pass on raw Qwen3-8B. Like Atem-4B, it uses a CoT-preserving single-pass design — building reasoning capability on top of the base model's intact native foundation rather than erasing and rebuilding thinking in separate stages. Atem-8B is trained on a larger corpus (~91K records before filtering vs ~63K for 4B) with higher per-source caps, producing a model with broader reasoning coverage across mathematics, coding, science, and general domains.
This is the most thoroughly evaluated model in the Atem series, benchmarked across nine tasks including a custom flexible GSM8K evaluator that diagnoses the formatting shift introduced by CoT training.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Training method | Single-pass CoT-Preserving LoRA SFT |
| LoRA config | r=64, alpha=128, dropout=0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Parameters | ~8.37B |
| Trainable (LoRA) params | 174,587,904 (2.09% of base) |
| Training records | 58,980 (after token-length filtering) |
| Think / No-think split | 85% / 15% |
| Epochs | 2 (ceiling; early stopping patience=3, never triggered) |
| Effective batch size | 64 (batch 4 × grad accum 16) |
| Learning rate | 1e-4, cosine schedule, 5% warmup |
| Max sequence length | 6,144 tokens |
| Precision | bfloat16 (full 16-bit LoRA, not QLoRA) |
| Hardware | NVIDIA A100-SXM4 80GB |
| Runtime | 7h40m |
| License | Apache 2.0 |
Design Notes
Single combined pass. The earlier Atem-0.6B pipeline erased Qwen3's native thinking mode in Stage 1 then re-imposed an externally-distilled style in Stage 2. This introduced measurable capability costs — the base model's exposed reasoning self-corrected on problems the no-think version got wrong, and ARC-Challenge regressed after Stage 2. Atem-8B skips the erasure entirely: one pass, intact native reasoning, external CoT styles layered on a foundation that still works.
Full 16-bit LoRA. At 8B with an 80GB A100, full 16-bit LoRA requires ~33GB — comfortably within budget. It is both marginally faster and marginally more accurate than QLoRA at equivalent effective batch sizes, since QLoRA pays compute overhead on quantize/dequantize operations at each step.
r=64, alpha=128. r=64 on Qwen3-8B adds 174,587,904 trainable parameters, 2.09% of the model, somewhat lower than the 3.11% of the Atem-4B baseline. At a fixed rank, LoRA parameters grow roughly linearly with hidden size, while the base model's weight matrices grow roughly quadratically, so proportional capacity falls as model size grows. r=96 would bring the ratio to about 3.1%, close to the 4B reference point. Not a blocker for this run, and noted for future iterations.
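The trainable-parameter count can be reproduced with a little arithmetic. The sketch below assumes the published Qwen3-8B dimensions (36 layers, hidden size 4096, 32 query heads and 8 key/value heads of dimension 128, MLP width 12288); none of these are stated in this card, so treat them as assumptions. A LoRA adapter on a projection with input width d_in and output width d_out adds r × (d_in + d_out) parameters.

```python
# LoRA adds two matrices per adapted projection, A (r x d_in) and B (d_out x r),
# for a total of r * (d_in + d_out) trainable parameters.

# Assumed Qwen3-8B dimensions (not stated in this card).
NUM_LAYERS = 36
HIDDEN = 4096
Q_DIM = 32 * 128   # query heads x head dimension
KV_DIM = 8 * 128   # key/value heads x head dimension (grouped-query attention)
MLP_DIM = 12288

# (d_in, d_out) for each of the seven target projections in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Total LoRA parameters across all layers for a given rank."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return NUM_LAYERS * per_layer


for r in (64, 96):
    print(f"r={r}: {lora_params(r):,} trainable parameters")
```

Under these assumptions r=64 gives 174,587,904, matching the Model Details table, and r=96 scales that figure by 1.5.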
Corpus scale. Atem-8B draws from the same eight source datasets as Atem-4B but with higher per-source caps: 91,017 total records before ratio adjustment vs ~63,563 for 4B, yielding 58,980 usable training examples after token-length filtering at 6,144 tokens.
Intended Use
Atem-8B is designed for general reasoning tasks where structured, step-by-step thinking adds value:
- Multi-step mathematical reasoning
- Code explanation, implementation, and debugging
- Analytical reasoning and argument evaluation
- Scientific explanation requiring technical depth
- Commonsense reasoning and physical intuition
- Logic, fallacy identification, and conditional reasoning
- Concept explanation across diverse domains
Training Data
Atem-8B was trained on a corpus assembled from eight source datasets (ten rows in the table below, because the Kimi-K2.5 dataset contributes three subsets) covering mathematics, coding, general reasoning, scientific reasoning, and medical reasoning. All sources include explicit chain-of-thought reasoning traces; 85% of training records were formatted with full think traces and 15% as direct answers.
| Dataset | Records | Source / Teacher |
|---|---|---|
| mitroitskii/OpenR1-Math-220k-formatted | ~10,938 | DeepSeek-R1 — Mathematics (correctness-filtered) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Claude Opus 4.6 — Trace Inversion |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Math) | 8,000 | Kimi K2.5 — Mathematical Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Distillation) | 8,000 | Kimi K2.5 — General Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (PHD-Science) | 8,000 | Kimi K2.5 — Scientific Reasoning |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| FreedomIntelligence/medical-o1-reasoning-SFT | 7,500 | Medical reasoning (English config) |
| Modotte/CodeX-2M-Thinking | 15,000 | Mixed — Coding with CoT |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | ~8,014 | DeepSeek-V4-Pro |
| nvidia/OpenCodeReasoning | 15,000 | Mixed — Competitive coding |
| Total (pre-filter pool) | 91,017 | |
| Total (post-filter, trained on) | 58,980 | |
Non-English reasoning traces (primarily CJK) were filtered at the trace level using an ASCII-ratio threshold; records with CJK traces were retained as no-think records rather than discarded. The 34.3% filter rate reflects the same 6,144-token ceiling that filtered 32.7% of the Atem-4B corpus — the longest, most complex reasoning traces from competitive programming and advanced mathematics exceed this limit.
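A minimal sketch of the trace-level language filter described above. The threshold value (0.9) and the record field names (`think`, `answer`, `mode`) are hypothetical; the card states only that an ASCII-ratio threshold was used and that records with CJK traces were kept as no-think examples.

```python
def ascii_ratio(text: str) -> float:
    """Fraction of characters that are ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


def route_record(record: dict, threshold: float = 0.9) -> dict:
    """Keep a mostly-ASCII trace; otherwise drop the trace but keep the record."""
    trace = record.get("think", "")
    if ascii_ratio(trace) >= threshold:
        return {**record, "mode": "think"}
    # Non-English trace: retain the question/answer pair as a no-think example.
    return {**record, "think": "", "mode": "no_think"}


english = {"think": "First add 2 and 3 to get 5.", "answer": "5"}
cjk = {"think": "首先把2和3相加得到5。", "answer": "5"}
print(route_record(english)["mode"], route_record(cjk)["mode"])
```

Routing rather than discarding preserves the answer supervision even when the reasoning trace itself is unusable.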
Training Configuration
# Key hyperparameters
lora_r = 64
lora_alpha = 128
lora_dropout = 0.05
max_seq_length = 6144
learning_rate = 1e-4
lr_scheduler = 'cosine'
warmup_ratio = 0.05
batch_size = 4
grad_accumulation = 16 # effective batch size: 64
num_epochs = 2 # ceiling — early stopping patience=3
eval_steps = 150
early_stopping_patience = 3
early_stopping_threshold = 0.001
nothink_ratio = 0.15
load_in_4bit = False # full 16-bit LoRA
dtype = bfloat16
Training used Unsloth with train_on_responses_only masking. Early stopping was configured with patience=3 and threshold=0.001 — it did not trigger, as validation loss improved at every checkpoint throughout the full 2-epoch run.
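Response-only masking means the loss is computed on assistant tokens only. The sketch below illustrates the idea with plain Python lists; it is not Unsloth's implementation, the function name is hypothetical, and the token IDs are made up. The value -100 is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    """Return labels in which every token before `response_start` is ignored by the loss."""
    if not 0 <= response_start <= len(input_ids):
        raise ValueError("response_start must lie within the sequence")
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Hypothetical token IDs: four prompt tokens followed by three response tokens.
ids = [11, 12, 13, 14, 21, 22, 23]
print(mask_prompt_labels(ids, response_start=4))
```

Only the three response tokens keep their labels, so gradients come from the model's own answer and reasoning trace rather than from reproducing the user prompt.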
Loss Curve
| Step | Train Loss | Val Loss |
|---|---|---|
| 150 | 0.8661 | 0.8367 |
| 300 | 0.7971 | 0.8120 |
| 450 | 0.8006 | 0.7978 |
| 600 | 0.7992 | 0.7880 |
| 750 | 0.7791 | 0.7822 |
| 900 | 0.7879 | 0.7770 |
| 1050 | 0.7328 | 0.7758 |
| 1200 | 0.7357 | 0.7734 |
| 1350 | 0.7223 | 0.7711 |
| 1500 | 0.7461 | 0.7697 |
| 1650 | 0.7501 | 0.7691 |
| 1800 | 0.7691 | 0.7688 |
| Final (1844) | 0.7847 (avg) | 0.7688 |
Validation loss fell at every logged checkpoint, from 0.8367 at step 150 to 0.7688 at step 1800, and never turned upward, so there is no sign of overfitting. Val loss sat below train loss at several checkpoints (steps 150, 450, 600, 900 and, marginally, 1800), a known artifact when dropout is active during training but not during evaluation; at the remaining checkpoints it tracked above train loss. Because val loss improved across all 12 logged checkpoints, the early stopping mechanism was never needed.
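The early-stopping behaviour can be checked against the table with a small simulation. This is a simplified model of patience/threshold logic, not the trainer's actual callback: it assumes a checkpoint counts as an improvement only if it beats the best loss so far by more than the threshold. The inputs are the twelve validation losses logged at steps 150 to 1800.

```python
VAL_LOSS = [0.8367, 0.8120, 0.7978, 0.7880, 0.7822, 0.7770,
            0.7758, 0.7734, 0.7711, 0.7697, 0.7691, 0.7688]


def simulate_early_stopping(losses, patience=3, threshold=0.001):
    """Return (triggered, max_counter) for a lower-is-better metric."""
    best = None
    counter = 0
    max_counter = 0
    for loss in losses:
        # Assumed rule: reset only when the gain over the best value exceeds the threshold.
        if best is None or best - loss > threshold:
            counter = 0
        else:
            counter += 1
        max_counter = max(max_counter, counter)
        if best is None or loss < best:
            best = loss
        if counter >= patience:
            return True, max_counter
    return False, max_counter


triggered, max_counter = simulate_early_stopping(VAL_LOSS)
print(f"triggered={triggered}, max patience counter={max_counter}")
```

Under this assumed rule the run never stops early, although the last two gains (0.0006 and 0.0003) fall below the 0.001 threshold, so the patience counter reaches 2 of 3 by step 1800.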
Evaluation
Benchmark Results
Evaluated against base Qwen3-8B (Qwen/Qwen3-8B) using lm-evaluation-harness. Both models were loaded in 4-bit for evaluation. GSM8K flexible extraction uses a custom evaluator that accepts #### answer, \boxed{answer}, and prose formats — see note below.
| Task | Base (Qwen3-8B) | Atem-8B | Delta |
|---|---|---|---|
| ARC-Challenge (0-shot, acc_norm) | 56.5% | 56.9% | +0.4pp — |
| GSM8K strict (5-shot, exact_match) | 86.7% | 83.3% | −3.4pp ⚠ |
| GSM8K flexible (5-shot, custom) | 86.7% | 85.6% | −1.1pp — |
| HellaSwag (0-shot, acc_norm) | 74.5% | 76.2% | +1.7pp ✓ |
| MMLU (0-shot, acc) | 72.9% | 72.9% | +0.0pp — |
| Winogrande (0-shot, acc) | 67.2% | 71.8% | +4.6pp ✓ |
| PIQA (0-shot, acc) | 76.2% | 78.1% | +1.9pp ✓ |
| OpenBookQA (0-shot, acc_norm) | 41.4% | 43.2% | +1.8pp ✓ |
| BoolQ (0-shot, acc) | 85.9% | 84.3% | −1.6pp — |
Winogrande (+4.6pp, 2.5σ) is the headline result — the largest gain in the evaluation set. Commonsense pronoun resolution is format-independent and tests exactly the kind of contextual reasoning that CoT training is designed to improve.
HellaSwag (+1.7pp, 2.8σ) uses normalised log-likelihood scoring over multiple-choice options — format-independent and not influenced by generation style. A genuine reasoning signal.
PIQA, OpenBookQA both positive. All four commonsense and reasoning tasks improved. The direction is consistent and matches the expected effect of training on structured reasoning traces.
MMLU exactly tied at 72.9%. The CoT training neither added nor removed knowledge breadth — the correct expected behaviour for SFT on reasoning data.
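The σ values quoted in this section are consistent with a two-proportion z-score that treats the base and fine-tuned runs as independent samples. The card does not state how they were computed, so the sketch below is one plausible reconstruction; the sample sizes are assumptions based on the standard evaluation splits (Winogrande 1,267, HellaSwag 10,042, BoolQ 3,270).

```python
import math


def two_proportion_z(p_base: float, p_new: float, n: int) -> float:
    """z-score of (p_new - p_base) for two independent samples of size n."""
    se = math.sqrt(p_base * (1 - p_base) / n + p_new * (1 - p_new) / n)
    return (p_new - p_base) / se


# (base accuracy, Atem-8B accuracy, assumed number of evaluation examples)
TASKS = {
    "Winogrande": (0.672, 0.718, 1267),
    "HellaSwag": (0.745, 0.762, 10042),
    "BoolQ": (0.859, 0.843, 3270),
}

for name, (base, new, n) in TASKS.items():
    print(f"{name}: z = {two_proportion_z(base, new, n):+.1f}")
```

Because both models answer the same questions, a paired test such as McNemar's would usually be tighter; the unpaired form is used here only because it reproduces the quoted figures.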
GSM8K — Formatting Shift Analysis
The strict-match GSM8K regression (−3.4pp) was investigated using a custom flexible extractor that accepts multiple answer formats: #### {number} (lm_eval standard), \boxed{number} (LaTeX, common in mathematics literature), prose declarations, and last-number fallback.
| Extraction method | Atem-8B | Base |
|---|---|---|
| Strict-match (#### only) | 83.3% | 86.7% |
| Flexible extraction | 85.6% | ~86.7% |
| Recovered by flexible | +2.3pp | — |
68% of the observed regression was a formatting artifact. The training corpus — OpenR1-Math, DeepSeek-V4-Pro, Kimi-K2.5 — uses \boxed{answer} (LaTeX notation, standard in academic and competition mathematics) rather than the #### answer format specific to the GSM8K dataset. The SFT pass has shifted Atem's preferred answer format from #### toward \boxed{}. lm_eval's strict-match regex only searches for ####, so correct answers in \boxed{} format count as wrong.
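A minimal sketch of a flexible answer extractor of the kind described above. The actual evaluator is not reproduced in this card, so the regular expressions, the function names, and the precedence order (#### first, then \boxed{}, then a prose declaration, then the last number in the text) are assumptions for illustration.

```python
import re
from typing import Optional

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

# Tried in order; the first pattern that matches wins.
PATTERNS = [
    re.compile(r"####\s*\$?(" + NUMBER + ")"),                # GSM8K / lm_eval convention
    re.compile(r"\\boxed\{\s*\$?(" + NUMBER + r")\s*\}"),    # LaTeX boxed answer
    re.compile(r"answer\s+is\s*:?\s*\$?(" + NUMBER + ")", re.IGNORECASE),  # prose
]


def _normalise(number: str) -> str:
    """Strip thousands separators so '1,250' and '1250' compare equal."""
    return number.replace(",", "")


def extract_answer(text: str) -> Optional[str]:
    """Return the final numeric answer as a string, or None if no number is found."""
    for pattern in PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return _normalise(matches[-1])
    numbers = re.findall(NUMBER, text)  # last-number fallback
    return _normalise(numbers[-1]) if numbers else None


print(extract_answer("So the total is 18.\n#### 18"))
print(extract_answer("The final answer is \\boxed{18}."))
```

A strict matcher would use only the first pattern, which is why a correct \boxed{18} response scores as wrong under strict-match.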
The true capability gap after accounting for formatting is approximately −1.1pp, not −3.4pp. The base model retains a small advantage on this benchmark, plausibly because its instruction tuning included GSM8K-format data, so it naturally reproduces the #### convention.
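The share of the regression attributed to formatting follows directly from the three accuracies in the table above:

```latex
\begin{aligned}
\text{strict gap}       &= 83.3 - 86.7 = -3.4\ \text{pp} \\
\text{flexible gap}     &= 85.6 - 86.7 = -1.1\ \text{pp} \\
\text{recovered}        &= 85.6 - 83.3 = 2.3\ \text{pp} \\
\text{formatting share} &= \frac{2.3}{3.4} \approx 0.68
\end{aligned}
```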
BoolQ (−1.6pp, 1.8σ) is borderline — sitting between noise and statistical significance. BoolQ requires committing to a binary yes/no answer; it's possible the more exploratory CoT training style slightly disadvantaged decisive binary classification. Worth monitoring on future runs.
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain why switching doors in the Monty Hall problem gives a 2/3 probability of winning.",
    }
]

# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)
Unsloth (faster inference)
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-8B",
    max_seq_length=6144,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "A train travels from A to B at 60 km/h and returns at 90 km/h. What is the average speed?",
    }
]

# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))
Ollama
# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-8B:Q4_K_M
# Higher quality
ollama run hf.co/EphAsad/Atem-8B:Q5_K_M
# Near-lossless
ollama run hf.co/EphAsad/Atem-8B:Q8_0
llama.cpp
llama-server -hf EphAsad/Atem-8B:Q4_K_M
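Once llama-server is running it exposes an OpenAI-compatible HTTP API. The sketch below queries it using only the Python standard library. It assumes the server's default address (http://localhost:8080); the helper names are hypothetical, and the `model` field is included for API compatibility and is an assumption about what the server expects.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080/v1") -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": "EphAsad/Atem-8B:Q4_K_M",  # assumed label; adjust to your setup
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the local server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the reply as a string; thinking-mode output may include the reasoning trace depending on server settings.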
Sampling Parameters
Use temperature=0.6, top_p=0.95, top_k=20 for thinking mode — Qwen3's published recommendation, used throughout this evaluation. Do not use greedy decoding with thinking mode enabled.
System Prompt
Atem-8B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
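To override the built-in identity, pass a system message ahead of the user turn. The helper below is a hypothetical convenience function, and the override text in the example is made up; the resulting list can be passed to `tokenizer.apply_chat_template` exactly as in the Usage examples.

```python
from typing import Optional


def build_messages(user_prompt: str, system_prompt: Optional[str] = None) -> list[dict]:
    """Prepend a system message only when an override is supplied."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# No override: the chat template's default identity applies.
default_messages = build_messages("What is 17 * 23?")

# Hypothetical override text, for illustration only.
custom_messages = build_messages(
    "What is 17 * 23?",
    system_prompt="You are Atem, a concise maths tutor. Show each step briefly.",
)
print(custom_messages)
```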
Available Files
| File | Size | Description |
|---|---|---|
| model-XXXX-of-00004.safetensors (×4) | ~16.4 GB total | Full bfloat16 merged weights |
| Atem-8b.Q4_K_M.gguf | 5.03 GB | 4-bit quantised (recommended) |
| Atem-8b.Q5_K_M.gguf | 5.85 GB | 5-bit quantised |
| Atem-8b.Q8_0.gguf | 8.71 GB | 8-bit quantised (near-lossless) |
Known Limitations
GSM8K formatting shift. As documented in the evaluation section, the SFT corpus uses \boxed{} notation for mathematical answers rather than the #### format specific to the GSM8K benchmark. This creates a systematic measurement gap under strict-match evaluation (−3.4pp), of which 68% is a formatting artifact. Under flexible extraction the true gap is approximately −1.1pp. For production use, \boxed{answer} is standard in mathematical contexts.
6,144 token sequence ceiling. The training corpus's longest reasoning traces (competitive programming, advanced mathematics) exceed 6,144 tokens and were dropped during formatting. The model has not been exposed to very long chain-of-thought traces; raising max_new_tokens at inference time provides budget for longer outputs but does not recover training coverage of ultra-long traces.
LoRA proportional capacity. r=64 represents 2.09% of the 8B model, lower than the 3.11% of the Atem-4B baseline: at a fixed rank, LoRA parameters grow roughly linearly with hidden size while the base weight matrices grow roughly quadratically. r=96 would more closely match the 4B proportional reference. Not a blocker, but noted for future runs.
No RLHF or DPO. Atem-8B has not undergone preference optimisation. Responses are accurate and structured but may not be as reliably aligned with user preferences in open-ended creative or instructional tasks compared to models that have undergone preference training.
Roadmap
- Atem-14B: Single CoT-preserving pass on Qwen3-14B, r=128 (3.10% proportional capacity), with GSM8K-format examples added to the corpus to restore the #### answer convention
Citation
@misc{atem_8b_2026,
author = {Asad, Zain},
title = {Atem-8B: An 8B CoT-Preserving Reasoning Model via
Single-Pass SFT on Qwen3},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/EphAsad/Atem-8B}},
}
License
Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-8B.
Built independently by Zain Asad — EphAsad
