---
license: mit
language:
- en
tags:
- causal-lm
- scientific-language-model
- mathematics
- arxiv
- research
library_name: transformers
---

# KiteFish-A1-1.5B

**KiteFish-A1-1.5B** is a ~1.5B parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources across mathematics, computer science, and theoretical physics.

📄 **Paper:** https://arxiv.org/abs/2602.17288
💻 **GitHub:** https://github.com/kitefishai/KiteFish-A1-1.5B-Math

This is a **base scientific language model** (not instruction-tuned).

## Overview

KiteFish-A1-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.

**Training Scale**
- ~52B pretraining tokens
- ~5B additional post-training tokens
- ~200GB processed scientific corpus
- LLaMA-compatible tokenizer (~102k vocab)
- 2× NVIDIA A100 (80GB) GPUs
- 24 experimental training runs

The focus of this project is *scientific language modeling robustness*, not benchmark optimization.

## Model Architecture

- 24 Transformer layers
- Hidden size: 2048
- FFN size: 5504
- 16 attention heads
- Context length: 4096 (trained on 768-token sequences)
- Dense LLaMA-style architecture

**Optimization**
- AdamW
- Learning rate: 2e-4
- Warmup: 500 steps
- Weight decay: 0.1
- Gradient accumulation: 32
- bf16 mixed precision
- Gradient checkpointing enabled

**Validation Perplexity:** ~4.2 (held-out scientific corpus)

## Intended Use

KiteFish-A1-1.5B is suitable for:
- Scientific text modeling research
- Mathematical language modeling experiments
- Pretraining initialization for domain fine-tuning
- Tokenization and symbolic modeling research
- Studying how models learn LaTeX structure

It is **not optimized for:**
- Instruction following
- Chat-based applications
- General conversational AI
- Benchmark leaderboard performance

## Performance Notes

This model was trained under moderate compute constraints and without instruction tuning or alignment stages.

Observed characteristics:
- Strong familiarity with scientific writing style
- Stable LaTeX structural modeling
- Reasonable symbolic fluency
- Limited reasoning depth
- Low downstream benchmark accuracy without fine-tuning

Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning (see the fine-tuning sketch at the end of this card).

## Limitations

- Not instruction-tuned
- No RLHF or preference alignment
- Trained at 768-token sequence length
- Domain restricted to selected arXiv categories
- Not optimized for reasoning benchmarks
- General NLP benchmark scores may be low

This release is intended primarily for research and experimentation.

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model in your research, please cite:

```
@article{kitefish_a1_2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={...},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}
```
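
## Fine-Tuning Sketch

As noted in the performance notes, downstream use generally requires SFT or LoRA adaptation. The snippet below is a minimal, illustrative LoRA sketch using the Hugging Face `peft` library; it is not the training recipe from the paper. The target module names (`q_proj`, `k_proj`, `v_proj`, `o_proj`) are assumed from the LLaMA-style architecture described above, and the single training step on a toy LaTeX string stands in for a real fine-tuning corpus and loop.

```python
# Minimal LoRA fine-tuning sketch (illustrative only, not the paper's recipe).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "KiteFishAI/KiteFish-A1-1.5B-Math"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.train()

# Attach low-rank adapters to the attention projections.
# Module names are assumed from the LLaMA-style architecture above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# AdamW over the adapter parameters only, mirroring the pretraining optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)

# One illustrative training step on a toy LaTeX-style example; a real run
# would iterate over a tokenized domain corpus instead.
batch = tokenizer(
    r"\begin{theorem} Every bounded monotone sequence converges. \end{theorem}",
    return_tensors="pt",
)
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In a real run you would wrap this in a `Trainer` or custom loop over a domain corpus and save the resulting adapter with `model.save_pretrained(...)` for later reuse.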