# SocratTeachLLM
A LoRA fine-tuned GLM4-9B-Chat model trained to act as a Socratic teacher in structured educational dialogues. It generates heuristic questions and formative feedback that guide students through a principled sequence of reasoning stages, following the KELE framework (Peng et al., EMNLP 2025 Findings).
Original model: yuanpan/SocratTeachLLM — this repository is a copy with an expanded README.
## What It Does
SocratTeachLLM is designed for the teacher role in a dual-agent Socratic tutoring system. A separate consultant agent (e.g., GPT-4o or Qwen) selects a teaching strategy from a predefined set of 34 Socratic rules (SocRule); SocratTeachLLM then generates the actual dialogue turn implementing that strategy.
Teaching proceeds through five stages, which group the 34 SocRule strategies:
| Stage | Name | State codes | Description |
|---|---|---|---|
| a | Initiation | a1 | Student poses the question; dialogue begins |
| b | Concept Probing | b2–b7 | Teacher probes prior knowledge and surfaces misconceptions |
| c | Inductive Reasoning | c8–c29 | Core teaching stage — guides the student toward generalizations; can repeat many turns |
| d | Answer Derivation | d30–d33 | Help the student arrive at the correct answer |
| e | Summary | e34 | Consolidate and reinforce learning |
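When wiring the consultant and teacher agents together, it is handy to have this stage/state mapping available in code. The dictionary below is purely illustrative (it just mirrors the table above) and is not part of the original repository.

```python
# Illustrative SocRule stage lookup mirroring the table above.
SOCRULE_STAGES = {
    "a": {"name": "Initiation",          "states": ["a1"]},
    "b": {"name": "Concept Probing",     "states": [f"b{i}" for i in range(2, 8)]},   # b2-b7
    "c": {"name": "Inductive Reasoning", "states": [f"c{i}" for i in range(8, 30)]},  # c8-c29
    "d": {"name": "Answer Derivation",   "states": [f"d{i}" for i in range(30, 34)]}, # d30-d33
    "e": {"name": "Summary",             "states": ["e34"]},
}

def stage_of(state_code: str) -> str:
    """Return the stage name for a state code such as 'c15'."""
    return SOCRULE_STAGES[state_code[0]]["name"]
```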
The model was fine-tuned on SocratDataset: 6,803 multi-turn Socratic dialogues covering 42,000+ interaction turns across elementary school science topics in Chinese.
## Published Performance
Results from Table 1 of the KELE paper (test set: 680 dialogues, 4,245 single-turn examples):
| Model | ROUGE-1 | ROUGE-2 | BLEU-4 | PRR | NDAR | SPR | IAR | Guidance | Logicality | Flexibility |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 38.25 | 22.35 | 29.93 | 72.13 | 81.19 | 85.00 | 87.74 | 4.35 | 4.50 | 4.33 |
| Qwen2.5-7B | 40.95 | 15.27 | 24.96 | 59.02 | 80.52 | 60.00 | 76.45 | 3.87 | 3.96 | 3.87 |
| Qwen2.5-14B | 43.79 | 17.06 | 26.63 | 65.21 | 78.57 | 74.00 | 80.81 | 3.99 | 4.15 | 4.03 |
| Qwen2.5-32B | 46.22 | 19.90 | 28.85 | 65.57 | 83.13 | 81.00 | 84.68 | 4.12 | 4.44 | 4.21 |
| EduChat-13B | 34.75 | 9.91 | 21.11 | 47.62 | 90.73 | 51.00 | 69.02 | 2.93 | 3.42 | 3.18 |
| SocraticLM-7B | 18.63 | 5.56 | 10.93 | 26.83 | 30.26 | 36.00 | 27.05 | 2.62 | 2.88 | 2.78 |
| SocratTeachLLM (this model) | 57.40 | 33.63 | 41.96 | 75.13 | 94.71 | 87.00 | 89.03 | 4.66 | 4.53 | 4.45 |
Metric definitions:
- PRR — Problem Relevance Rate: teacher question relates directly to the problem
- NDAR — No Direct Answer Rate: teacher avoids giving away the answer
- SPR — Summary Pass Rate: correct and complete final summary
- IAR — Instruction Adherence Rate: teacher follows the consultant's recommended strategy
- Guidance / Logicality / Flexibility — GPT-4o judge scores on a 1–5 scale (B.5 rubric)
SocratTeachLLM outperforms GPT-4o on every reported metric despite having only ~9B parameters.
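To reproduce the lexical-overlap metrics on your own outputs, the Hugging Face `evaluate` library is one option. The snippet below is only a sketch: the paper evaluates Chinese dialogue, so its exact tokenization may differ (for Chinese text you would likely pass a character-level or jieba-based tokenizer).

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["the teacher asks a short guiding question"]  # model outputs
references  = ["the teacher asks a guiding question"]        # gold teacher turns

# ROUGE-1/ROUGE-2 F-scores; pass tokenizer=... for non-whitespace languages.
print(rouge.compute(predictions=predictions, references=references))

# max_order=4 corresponds to BLEU-4.
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references],
                   max_order=4))
```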
## Training Details
| Setting | Value |
|---|---|
| Base model | GLM4-9B-Chat |
| Method | LoRA |
| Epochs | 3 |
| Learning rate | 5e-5 |
| Batch size | 16 |
| Train split | 6,123 dialogues (90%) |
| Test split | 680 dialogues (10%) |
| Hardware | 2× NVIDIA A800 80GB |
| Dataset | SocratDataset (6,803 records, Chinese) |
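The fine-tuning script itself is not included in this repository; the sketch below shows a roughly equivalent LoRA setup with peft. The LoRA rank/alpha and target module name are assumptions (they are not reported in the table above), and the base-model ID may need adjusting.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "zai-org/glm-4-9b-chat",  # GLM4-9B-Chat base
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                                # assumption: rank not reported above
    lora_alpha=32,                       # assumption
    lora_dropout=0.05,                   # assumption
    target_modules=["query_key_value"],  # assumption: ChatGLM fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```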
## Training Objective

The model is trained to maximize

`P(teacher_response | dialogue_history, evaluation, action)`
The evaluation (consultant's stage/state assessment) and action (recommended strategy) fields are required conditioning signals. At inference time, a consultant agent produces these before the teacher agent generates its response. Without the consultant outputs as conditioning, the model will underperform.
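The exact prompt template lives in the KELE code; the helper below only sketches the idea of packing the dialogue history plus the consultant's `evaluation` and `action` into a single teacher prompt, with hypothetical wording.

```python
def build_teacher_messages(dialogue_history, evaluation, action):
    """Hypothetical prompt assembly; see the KELE repository for the real template.

    dialogue_history: list of {"role": "teacher" | "student", "content": str}
    evaluation:       consultant's stage/state assessment, e.g. "c12"
    action:           the SocRule strategy the consultant recommends next
    """
    history = "\n".join(f"{t['role']}: {t['content']}" for t in dialogue_history)
    user_prompt = (
        f"Dialogue so far:\n{history}\n\n"
        f"Evaluation (current state): {evaluation}\n"
        f"Action (recommended strategy): {action}\n\n"
        "As the Socratic teacher, write the next teacher turn that applies this "
        "strategy without revealing the final answer."
    )
    return [{"role": "user", "content": user_prompt}]
```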
## Model Architecture
| Parameter | Value |
|---|---|
| Base model | GLM4-9B-Chat (ChatGLMForConditionalGeneration) |
| Total parameters | ~9.4B |
| Layers | 40 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| FFN hidden size | 13,696 |
| KV channels | 128 |
| Vocabulary size | 151,552 |
| Max context length | 131,072 tokens (128K) |
| Storage dtype | bfloat16 |
| Attention | Grouped-query (2 KV groups), RoPE (ratio 500) |
| Normalization | RMSNorm |
| Weight files | 4× safetensors shards (~18.8 GB total) |
Generation defaults: temperature 0.8, top-p 0.8.
## Usage
### Transformers (recommended, ~19 GB VRAM)

The model uses custom modeling code, so `trust_remote_code=True` is required.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ulises-c/SocratTeachLLM"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What do you think causes the seasons to change?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=True is needed for temperature/top_p to take effect
outputs = model.generate(
    inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.8
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### 4-bit NF4 via bitsandbytes (~6.5 GB VRAM)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# model_id is reused from the snippet above
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
### vLLM (OpenAI-compatible endpoint)

```bash
vllm serve ulises-c/SocratTeachLLM \
  --served-model-name SocratTeachLLM \
  --dtype bfloat16 \
  --trust-remote-code
```
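Once the server is up (port 8000 by default), any OpenAI-compatible client can call it. The example below assumes the default host/port and the served model name from the command above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="SocratTeachLLM",  # matches --served-model-name
    messages=[{"role": "user", "content": "What do you think causes the seasons to change?"}],
    temperature=0.8,
    top_p=0.8,
    max_tokens=512,
)
print(response.choices[0].message.content)
```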
### Ollama

This repo includes a Modelfile (auto-generated by LlamaFactory) with the correct ChatGLM4 stop sequences and a 4,096-token context window.

```bash
ollama create SocratTeachLLM -f Modelfile
ollama run SocratTeachLLM
```
Note: Ollama caps context at 4,096 tokens. For the full 128K context, use Transformers or vLLM.
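For programmatic access to the Ollama build, the standard local REST API works (default port 11434); this is a generic example rather than project-specific code.

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "SocratTeachLLM",
        "messages": [{"role": "user", "content": "What do you think causes the seasons to change?"}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```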
## Built With This Model
csen-346 is a downstream course project (CSEN 346 NLP, Santa Clara University) that reproduces and extends the KELE framework using this model as the teacher agent.
Key integration details:
- Teacher: SocratTeachLLM, served via FastAPI (4-bit on RTX 3070) or vLLM (bfloat16 on RTX 5090 / SCU WAVE cluster L40S); a minimal serving sketch follows this list
- Consultant: GPT-4o (baseline) or Qwen3.5-9B (local variant)
- Evaluation: 680-dialogue test split of SocratDataset, automated with ROUGE, BLEU, and GPT-4o judge (B.5 rubric)
- English extension: An English translation of the training dataset is available at ulises-c/SocratDataset-EN
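A minimal FastAPI wrapper in the spirit of that setup might look like the sketch below; the endpoint path and payload shape are assumptions, not the project's actual API.

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
model_id = "ulises-c/SocratTeachLLM"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

class TeachRequest(BaseModel):
    messages: list[dict]  # e.g. [{"role": "user", "content": "..."}]

@app.post("/generate")  # hypothetical endpoint name
def generate(req: TeachRequest):
    inputs = tokenizer.apply_chat_template(
        req.messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.8
    )
    return {"response": tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)}
```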
To download the weights locally:

```bash
hf download ulises-c/SocratTeachLLM --local-dir ~/hf_models/SocratTeachLLM
```
## Training Data
| Property | Value |
|---|---|
| Dataset | ulises-c/SocratDataset |
| Dialogues | 6,803 |
| Turns | 42,000+ |
| Domain | Elementary school science (grades 1–6) |
| Language | Chinese (Simplified) |
| Train split | 6,123 dialogues (90%) |
| Test split | 680 dialogues (10%) |
| Strategies | 34 SocRule teaching strategies |
An English translation of the training data is available at ulises-c/SocratDataset-EN.
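Both datasets can be loaded with the datasets library; the split and field names below are assumptions, so check the dataset cards for the exact schema.

```python
from datasets import load_dataset

ds = load_dataset("ulises-c/SocratDataset")          # Chinese original
# ds_en = load_dataset("ulises-c/SocratDataset-EN")  # English translation

print(ds)                   # available splits and sizes
first_split = next(iter(ds))
print(ds[first_split][0])   # first record; field names depend on the dataset card
```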
## Citation
If you use this model, please cite the original KELE paper:
```bibtex
@inproceedings{peng-etal-2025-kele,
  title     = {{KELE}: A Multi-Agent Framework for Structured {S}ocratic Teaching with Large Language Models},
  author    = {Peng, Yuan and others},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  year      = {2025},
  url       = {https://aclanthology.org/2025.findings-emnlp.888/}
}
```
## Related Resources
| Resource | Link |
|---|---|
| KELE paper (EMNLP 2025 Findings) | https://aclanthology.org/2025.findings-emnlp.888/ |
| KELE GitHub repository | https://github.com/yuanpan1020/KELE |
| Original model | https://huggingface.co/yuanpan/SocratTeachLLM |
| Training data (Chinese) | https://huggingface.co/datasets/ulises-c/SocratDataset |
| Training data (English translation) | https://huggingface.co/datasets/ulises-c/SocratDataset-EN |
| Evaluation + inference code | https://github.com/ulises-c/csen-346 |