# SocratTeachLLM
A LoRA fine-tuned GLM4-9B-Chat model trained to act as a Socratic teacher in structured educational dialogues. It generates heuristic questions and formative feedback that guide students through a principled sequence of reasoning stages, following the KELE framework (Peng et al., EMNLP 2025 Findings).
Original model: yuanpan/SocratTeachLLM — this repository is a copy with an expanded README.
## What It Does
SocratTeachLLM is designed for the teacher role in a dual-agent Socratic tutoring system. A separate consultant agent (e.g., GPT-4o or Qwen) selects a teaching strategy from a predefined set of 34 Socratic rules (SocRule); SocratTeachLLM then generates the actual dialogue turn implementing that strategy.
Teaching proceeds through five stages, which group the 34 SocRule strategies:
| Stage | Name | State codes | Description |
|---|---|---|---|
| a | Initiation | a1 | Student poses the question; dialogue begins |
| b | Concept Probing | b2–b7 | Teacher probes prior knowledge and surfaces misconceptions |
| c | Inductive Reasoning | c8–c29 | Core teaching stage — guides the student toward generalizations; can repeat many turns |
| d | Answer Derivation | d30–d33 | Help the student arrive at the correct answer |
| e | Summary | e34 | Consolidate and reinforce learning |
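When wiring the consultant and teacher agents together, it is handy to have this stage/state mapping available in code. The dictionary below is purely illustrative (it just mirrors the table above) and is not part of the original repository.

```python
# Illustrative SocRule stage lookup mirroring the table above.
SOCRULE_STAGES = {
    "a": {"name": "Initiation",          "states": ["a1"]},
    "b": {"name": "Concept Probing",     "states": [f"b{i}" for i in range(2, 8)]},   # b2-b7
    "c": {"name": "Inductive Reasoning", "states": [f"c{i}" for i in range(8, 30)]},  # c8-c29
    "d": {"name": "Answer Derivation",   "states": [f"d{i}" for i in range(30, 34)]}, # d30-d33
    "e": {"name": "Summary",             "states": ["e34"]},
}

def stage_of(state_code: str) -> str:
    """Return the stage name for a state code such as 'c15'."""
    return SOCRULE_STAGES[state_code[0]]["name"]
```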
The model was fine-tuned on SocratDataset: 6,803 multi-turn Socratic dialogues covering 42,000+ interaction turns across elementary school science topics in Chinese.
## Published Performance
Results from Table 1 of the KELE paper (test set: 680 dialogues, 4,245 single-turn examples):
| Model | ROUGE-1 | ROUGE-2 | BLEU-4 | PRR | NDAR | SPR | IAR | Guidance | Logicality | Flexibility |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 38.25 | 22.35 | 29.93 | 72.13 | 81.19 | 85.00 | 87.74 | 4.35 | 4.50 | 4.33 |
| Qwen2.5-7B | 40.95 | 15.27 | 24.96 | 59.02 | 80.52 | 60.00 | 76.45 | 3.87 | 3.96 | 3.87 |
| Qwen2.5-14B | 43.79 | 17.06 | 26.63 | 65.21 | 78.57 | 74.00 | 80.81 | 3.99 | 4.15 | 4.03 |
| Qwen2.5-32B | 46.22 | 19.90 | 28.85 | 65.57 | 83.13 | 81.00 | 84.68 | 4.12 | 4.44 | 4.21 |
| EduChat-13B | 34.75 | 9.91 | 21.11 | 47.62 | 90.73 | 51.00 | 69.02 | 2.93 | 3.42 | 3.18 |
| SocraticLM-7B | 18.63 | 5.56 | 10.93 | 26.83 | 30.26 | 36.00 | 27.05 | 2.62 | 2.88 | 2.78 |
| SocratTeachLLM (this model) | 57.40 | 33.63 | 41.96 | 75.13 | 94.71 | 87.00 | 89.03 | 4.66 | 4.53 | 4.45 |
Metric definitions:
- PRR — Problem Relevance Rate: teacher question relates directly to the problem
- NDAR — No Direct Answer Rate: teacher avoids giving away the answer
- SPR — Summary Pass Rate: correct and complete final summary
- IAR — Instruction Adherence Rate: teacher follows the consultant's recommended strategy
- Guidance / Logicality / Flexibility — GPT-4o judge scores on a 1–5 scale (B.5 rubric)
SocratTeachLLM outperforms GPT-4o on every reported metric despite having only ~9B parameters.
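To reproduce the lexical-overlap metrics on your own outputs, the Hugging Face `evaluate` library is one option. The snippet below is only a sketch: the paper evaluates Chinese dialogue, so its exact tokenization may differ (for Chinese text you would likely pass a character-level or jieba-based tokenizer).

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["the teacher asks a short guiding question"]  # model outputs
references  = ["the teacher asks a guiding question"]        # gold teacher turns

# ROUGE-1/ROUGE-2 F-scores; pass tokenizer=... for non-whitespace languages.
print(rouge.compute(predictions=predictions, references=references))

# max_order=4 corresponds to BLEU-4.
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references],
                   max_order=4))
```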
## Training Details
| Setting | Value |
|---|---|
| Base model | GLM4-9B-Chat |
| Method | LoRA |
| Epochs | 3 |
| Learning rate | 5e-5 |
| Batch size | 16 |
| Train split | 6,123 dialogues (90%) |
| Test split | 680 dialogues (10%) |
| Hardware | 2× NVIDIA A800 80GB |
| Dataset | SocratDataset (6,803 records, Chinese) |
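The fine-tuning script itself is not included in this repository; the sketch below shows a roughly equivalent LoRA setup with peft. The LoRA rank/alpha and target module name are assumptions (they are not reported in the table above), and the base-model ID may need adjusting.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "zai-org/glm-4-9b-chat",  # GLM4-9B-Chat base
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                                # assumption: rank not reported above
    lora_alpha=32,                       # assumption
    lora_dropout=0.05,                   # assumption
    target_modules=["query_key_value"],  # assumption: ChatGLM fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```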
## Training Objective

The model is trained to maximize

`P(teacher_response | dialogue_history, evaluation, action)`
The evaluation (consultant's stage/state assessment) and action (recommended strategy) fields are required conditioning signals. At inference time, a consultant agent produces these before the teacher agent generates its response. Without the consultant outputs as conditioning, the model will underperform.
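The exact prompt template lives in the KELE code; the helper below only sketches the idea of packing the dialogue history plus the consultant's `evaluation` and `action` into a single teacher prompt, with hypothetical wording.

```python
def build_teacher_messages(dialogue_history, evaluation, action):
    """Hypothetical prompt assembly; see the KELE repository for the real template.

    dialogue_history: list of {"role": "teacher" | "student", "content": str}
    evaluation:       consultant's stage/state assessment, e.g. "c12"
    action:           the SocRule strategy the consultant recommends next
    """
    history = "\n".join(f"{t['role']}: {t['content']}" for t in dialogue_history)
    user_prompt = (
        f"Dialogue so far:\n{history}\n\n"
        f"Evaluation (current state): {evaluation}\n"
        f"Action (recommended strategy): {action}\n\n"
        "As the Socratic teacher, write the next teacher turn that applies this "
        "strategy without revealing the final answer."
    )
    return [{"role": "user", "content": user_prompt}]
```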
## Model Architecture
| Parameter | Value |
|---|---|
| Base model | GLM4-9B-Chat (ChatGLMForConditionalGeneration) |
| Total parameters | ~9.4B |
| Layers | 40 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| FFN hidden size | 13,696 |
| KV channels | 128 |
| Vocabulary size | 151,552 |
| Max context length | 131,072 tokens (128K) |
| Storage dtype | bfloat16 |
| Attention | Grouped-query (2 KV groups), RoPE (ratio 500) |
| Normalization | RMSNorm |
| Weight files | 4× safetensors shards (~18.8 GB total) |
Generation defaults: temperature 0.8, top-p 0.8.
## Usage
### Transformers (recommended, ~19 GB VRAM)

The model uses custom modeling code, so `trust_remote_code=True` is required.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ulises-c/SocratTeachLLM"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What do you think causes the seasons to change?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=True is needed for temperature/top_p to take effect
outputs = model.generate(
    inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.8
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### 4-bit NF4 via bitsandbytes (~6.5 GB VRAM)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# model_id is reused from the snippet above
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
### vLLM (OpenAI-compatible endpoint)

```bash
vllm serve ulises-c/SocratTeachLLM \
  --served-model-name SocratTeachLLM \
  --dtype bfloat16 \
  --trust-remote-code
```
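Once the server is up (port 8000 by default), any OpenAI-compatible client can call it. The example below assumes the default host/port and the served model name from the command above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="SocratTeachLLM",  # matches --served-model-name
    messages=[{"role": "user", "content": "What do you think causes the seasons to change?"}],
    temperature=0.8,
    top_p=0.8,
    max_tokens=512,
)
print(response.choices[0].message.content)
```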
### Ollama

This repo includes a Modelfile (auto-generated by LlamaFactory) with the correct ChatGLM4 stop sequences and a 4,096-token context window.

```bash
ollama create SocratTeachLLM -f Modelfile
ollama run SocratTeachLLM
```
Note: Ollama caps context at 4,096 tokens. For the full 128K context, use Transformers or vLLM.
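For programmatic access to the Ollama build, the standard local REST API works (default port 11434); this is a generic example rather than project-specific code.

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "SocratTeachLLM",
        "messages": [{"role": "user", "content": "What do you think causes the seasons to change?"}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```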
## Built With This Model
csen-346 is a downstream course project (CSEN 346 NLP, Santa Clara University) that reproduces and extends the KELE framework using this model as the teacher agent.
Key integration details:
- Teacher: SocratTeachLLM, served via FastAPI (4-bit on RTX 3070) or vLLM (bfloat16 on RTX 5090 / SCU WAVE cluster L40S); a minimal serving sketch follows this list
- Consultant: GPT-4o (baseline) or Qwen3.5-9B (local variant)
- Evaluation: 680-dialogue test split of SocratDataset, automated with ROUGE, BLEU, and GPT-4o judge (B.5 rubric)
- English extension: An English translation of the training dataset is available at ulises-c/SocratDataset-EN
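A minimal FastAPI wrapper in the spirit of that setup might look like the sketch below; the endpoint path and payload shape are assumptions, not the project's actual API.

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
model_id = "ulises-c/SocratTeachLLM"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

class TeachRequest(BaseModel):
    messages: list[dict]  # e.g. [{"role": "user", "content": "..."}]

@app.post("/generate")  # hypothetical endpoint name
def generate(req: TeachRequest):
    inputs = tokenizer.apply_chat_template(
        req.messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.8
    )
    return {"response": tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)}
```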
To download the weights locally:

```bash
hf download ulises-c/SocratTeachLLM --local-dir ~/hf_models/SocratTeachLLM
```
## Training Data
| Property | Value |
|---|---|
| Dataset | ulises-c/SocratDataset |
| Dialogues | 6,803 |
| Turns | 42,000+ |
| Domain | Elementary school science (grades 1–6) |
| Language | Chinese (Simplified) |
| Train split | 6,123 dialogues (90%) |
| Test split | 680 dialogues (10%) |
| Strategies | 34 SocRule teaching strategies |
An English translation of the training data is available at ulises-c/SocratDataset-EN.
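Both datasets can be loaded with the datasets library; the split and field names below are assumptions, so check the dataset cards for the exact schema.

```python
from datasets import load_dataset

ds = load_dataset("ulises-c/SocratDataset")          # Chinese original
# ds_en = load_dataset("ulises-c/SocratDataset-EN")  # English translation

print(ds)                   # available splits and sizes
first_split = next(iter(ds))
print(ds[first_split][0])   # first record; field names depend on the dataset card
```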
## Citation
If you use this model, please cite the original KELE paper:
```bibtex
@inproceedings{peng-etal-2025-kele,
  title     = {{KELE}: A Multi-Agent Framework for Structured {S}ocratic Teaching with Large Language Models},
  author    = {Peng, Yuan and others},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  year      = {2025},
  url       = {https://aclanthology.org/2025.findings-emnlp.888/}
}
```
## Related Resources
| Resource | Link |
|---|---|
| KELE paper (EMNLP 2025 Findings) | https://aclanthology.org/2025.findings-emnlp.888/ |
| KELE GitHub repository | https://github.com/yuanpan1020/KELE |
| Original model | https://huggingface.co/yuanpan/SocratTeachLLM |
| Training data (Chinese) | https://huggingface.co/datasets/ulises-c/SocratDataset |
| Training data (English translation) | https://huggingface.co/datasets/ulises-c/SocratDataset-EN |
| Evaluation + inference code | https://github.com/ulises-c/csen-346 |