One for All
Collection
Multi-teacher geometry distillation: 6 teachers, one 0.5B student (deku). Trained model, GGUF build, soul-space viz data, and the live Gradio Space.
How to use build-small-hackathon/deku with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "build-small-hackathon/deku")

Deku is a Qwen2.5-0.5B-Instruct model fine-tuned with LoRA via gated CKA geometry distillation (Path B). It absorbs representation structure from 6 heterogeneous teacher LLMs simultaneously, without access to teacher logits or shared tokenizers.
| Teacher | Parameters | Hidden dim |
|---|---|---|
| Qwen2.5-1.5B-Instruct | 1.5B | 1536 |
| SmolLM2-1.7B-Instruct | 1.7B | 2048 |
| Phi-3.5-mini-instruct | 3.8B | 3072 |
| gemma-2-2b-it | 2.7B | 2304 |
| MiniCPM-2B-sft-bf16 | 2.7B | 2304 |
| Nemotron-Mini-4B-Instruct | 4B | 3072 |
Path B — geometry-only, tokenizer-agnostic:
The CKA geometry loss aligns the relational structure of representations (which samples are similar to which) rather than raw activation values, making it robust to dimension mismatch and tokenizer differences.
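As a rough illustration of why this holds, here is a minimal NumPy sketch of linear CKA, one common CKA variant. The card does not say whether Deku's training uses the linear or a kernel form, or how token activations are pooled into per-sample vectors, so `linear_cka`, the toy dimensions, and the `1 - CKA` loss are assumptions made for the example.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows.

    x: (n_samples, d_x), y: (n_samples, d_y). The feature widths may differ.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
n_samples, d_student, d_teacher = 256, 16, 24  # toy sizes, not the real hidden dims

student = rng.standard_normal((n_samples, d_student))

# Embed the student features isometrically into a wider space and rescale them.
# The sample-to-sample geometry is unchanged, only the coordinates differ.
q, _ = np.linalg.qr(rng.standard_normal((d_teacher, d_student)))  # (24, 16), orthonormal columns
same_geometry = 3.0 * student @ q.T                               # (256, 24)

# An independent random matrix shares no structure with the student.
unrelated = rng.standard_normal((n_samples, d_teacher))

cka_same_geometry = linear_cka(student, same_geometry)
cka_unrelated = linear_cka(student, unrelated)

# One plausible geometry loss: push CKA toward 1.
geometry_loss = 1.0 - cka_unrelated
```

The score stays at 1 (up to floating-point error) when the wider representation preserves the student's pairwise structure, and drops sharply for unrelated features, even though the two matrices never share a feature dimension.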
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
"build-small-hackathon/deku",
torch_dtype="auto",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("build-small-hackathon/deku")
inputs = tokenizer("Explain gradient descent in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- `gating.pt`: load with `torch.load("gating.pt")`; a state_dict for an `nn.Linear(896, 6)`
- `projections.pt`: a list of 6 `nn.Linear` state dicts (teacher_i → student space)
- `gating.npz`: for a torch-free gate
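For the torch-free path, a gate of this shape could be applied along the following lines. This is a sketch under assumptions: the array names `weight` and `bias`, the softmax normalization, and the random placeholder values are all illustrative, since the key names inside `gating.npz` and the gate's activation are not documented here. Only the shapes (896 inputs, 6 teachers) come from the file description.

```python
import numpy as np

STUDENT_DIM = 896   # input width of the gate, per nn.Linear(896, 6)
NUM_TEACHERS = 6    # one output per teacher

def gate_weights(hidden, weight, bias):
    """Apply a linear gate and normalize the result to per-teacher weights.

    hidden: (batch, 896) student hidden states.
    weight: (6, 896), the same layout as nn.Linear.weight.
    bias:   (6,)
    Softmax is an assumption; the real gate may normalize differently.
    """
    logits = hidden @ weight.T + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Placeholder parameters so the sketch runs without the checkpoint.
# With the real file one might write something like
#   params = np.load("gating.npz"); weight = params["weight"]
# but those key names are hypothetical; check params.files first.
rng = np.random.default_rng(0)
weight = rng.standard_normal((NUM_TEACHERS, STUDENT_DIM)) * 0.02
bias = np.zeros(NUM_TEACHERS)

hidden = rng.standard_normal((4, STUDENT_DIM))  # stand-in for 4 pooled student states
weights = gate_weights(hidden, weight, bias)    # shape (4, 6)
```

Each row of `weights` is non-negative and sums to 1, giving one weight per teacher for the corresponding input.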