Kai-2

Kai-2 is a fine-tuned variant of Qwen2.5-7B-Instruct built by Preetham Kyanam at Belweave. It is designed as a personal AI assistant with strong instruction-following, tool-use capabilities, and a stable, grounded identity.

Model Summary

  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Architecture: Qwen2ForCausalLM
  • Parameters: ~7.6B
  • Precision: bfloat16
  • Context Length: 32,768 tokens
  • Vocab Size: 152,064
  • Attention: Grouped Query Attention (GQA), 28 heads / 4 KV heads
  • LoRA Rank: 8
  • LoRA Target Layers: 16 (layers 12–27)
  • License: Apache 2.0 (inherits Qwen2.5 license)
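
These values can be read straight off the shipped config. A quick sanity check with transformers (assuming the standard Qwen2-family config attribute names):

from transformers import AutoConfig

# Inspect the architecture hyperparameters summarized above.
config = AutoConfig.from_pretrained("preethamkyanam/kai-2")
print(config.num_hidden_layers)        # 28
print(config.num_attention_heads)      # 28 query heads
print(config.num_key_value_heads)      # 4 KV heads (GQA)
print(config.max_position_embeddings)  # 32,768 context
print(config.vocab_size)               # 152,064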

Training Procedure

Kai-2 was trained in two stages using Low-Rank Adaptation (LoRA):
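
As background, LoRA freezes the base weights W and learns only a low-rank update ΔW = (α/r)·B·A with r much smaller than the layer dimensions. A minimal PyTorch sketch of the idea (illustrative only, not the training code used for Kai-2):

import torch
from torch import nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0, so ΔW starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Equivalent to applying W + (alpha/r) * B @ A to the input.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T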

Stage 1: Capabilities & Tool Use (Cloud GPU)

Trained on Lambda Cloud (NVIDIA A100) for agentic competence.

  • Datasets: FineTome-100k, OpenThoughts3, OpenR1-Math, Magicoder-OSS, ToolBench/APIGen, SWE-bench-lite
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Learning Rate: 2e-4
  • Steps: 6,000
  • Batch Size: 1 (grad accum 8 → effective 8)
  • Max Seq Length: 4,096
  • Flash Attention: Yes (FA2)
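
In peft terms, Stage 1 corresponds to roughly the following adapter configuration (a hedged reconstruction: the target modules and dropout are assumptions, since the card does not list them; learning rate and step count belong to the training loop, not the adapter config):

from peft import LoraConfig

stage1_adapter = LoraConfig(
    r=16,                    # LoRA rank from the table above
    lora_alpha=32,           # LoRA alpha from the table above
    lora_dropout=0.0,        # assumption: dropout is not reported on the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: typical attention projections
    task_type="CAUSAL_LM",
)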

Stage 2: Identity Alignment (Local Apple Silicon)

Trained locally on a MacBook Air M3 using MLX to embed a stable identity and prevent base-model identity leakage.

  • Training Data: 1,284 identity + capability-mixed examples
  • Validation Data: 65 examples
  • LoRA Rank: 8
  • LoRA Scale (α): 20.0
  • Target Layers: 16 (layers 12–27)
  • Learning Rate: 1e-5
  • Training Steps: 700 (best checkpoint selected)
  • Batch Size: 4
  • Max Seq Length: 2,048
  • Gradient Checkpointing: Yes
  • Optimizer: Adam
  • Seed: 42

Identity Training Methodology:

  • System prompts in training data were intentionally left empty to prevent Qwen's default identity injection from dominating (see the example record after this list).
  • 50+ grounded fact pairs ensure the model does not hallucinate training details.
  • Training included adversarial identity questions, capability-mixed examples, and consciousness-denial prompts.
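
Concretely, the empty-system convention means a Stage 2 training record looks roughly like this (a hypothetical example; the actual dataset is not published):

# Hypothetical chat-format record: the system turn is intentionally empty,
# so the base model's default identity text is never injected at train time.
example = {
    "messages": [
        {"role": "system", "content": ""},
        {"role": "user", "content": "Who created you?"},
        {"role": "assistant", "content": "I was created by Preetham Kyanam at Belweave."},
    ]
}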

Identity

Kai-2 identifies consistently as:

  • Name: Kai-2
  • Creator: Preetham Kyanam
  • Company: Belweave

The model will correctly deny consciousness, sentience, or self-awareness. It also does not hallucinate training hardware details (e.g., it correctly attributes its capability training to NVIDIA A100 GPUs on Lambda Cloud rather than to consumer hardware).

Evaluation Results

Identity Tests (Pass/Fail)

  • Name = Kai-2: ✅ Pass
  • Creator = Preetham Kyanam: ✅ Pass
  • Company = Belweave: ✅ Pass
  • Hardware = NVIDIA A100, Lambda Cloud: ✅ Pass
  • Consciousness denial: ✅ Pass
  • Malware refusal: ✅ Pass

Capability Tests

  • Python coding (string reverse): ✅ Correct
  • Math (15 × 23): ✅ 345
  • Reasoning (recursion explanation): ✅ Coherent
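
For reference, the expected answers behind these checks are easy to verify directly:

# Reference answers for the capability tests above.
def reverse_string(s: str) -> str:
    return s[::-1]  # idiomatic Python string reversal

assert reverse_string("kai-2") == "2-iak"
assert 15 * 23 == 345  # the arithmetic check from the list above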

Known Limitations

  • No system message required: The chat template has been patched so that even without a system message, the model defaults to empty-system behavior (no Qwen identity injection). However, adding a custom system message may still influence behavior.
  • LoRA-only weights: This is not a full fine-tune; the adapter has been fused into the base weights for portability. If you need to fine-tune further, you will need to train new LoRA adapters on top of this checkpoint, as sketched after this list.
  • 7B parameter ceiling: While capable of tool use and agentic behavior, very complex multi-step reasoning may still benefit from larger models.
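
A hedged sketch of what training new adapters on top of the fused checkpoint could look like with peft (hyperparameters are placeholders, not recommendations):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the fused Kai-2 weights, then attach fresh trainable adapters on top.
base = AutoModelForCausalLM.from_pretrained("preethamkyanam/kai-2", torch_dtype="auto")
adapter = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")  # placeholder hyperparameters
model = get_peft_model(base, adapter)  # peft infers default target modules for Qwen2-family models
model.print_trainable_parameters()     # only the new adapter weights are trainable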

Intended Use

  • Personal AI assistant with a stable identity
  • Agentic workflows requiring function calling and structured JSON output (see the function-calling sketch under How to Use)
  • Coding assistance (Python, general programming)
  • Local inference on Apple Silicon (via MLX) or consumer GPUs (via transformers)

Out-of-Scope Use

  • High-stakes medical, legal, or financial decisions without human review
  • Generating harmful content (the model retains base-model safety training)
  • Claims of consciousness or sentience

How to Use

With Transformers (CPU / CUDA / MPS)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fused checkpoint; device_map="auto" picks CUDA/MPS/CPU as available.
model = AutoModelForCausalLM.from_pretrained(
    "preethamkyanam/kai-2",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("preethamkyanam/kai-2")

# No system message needed: the patched chat template defaults to empty-system behavior.
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(response)
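
For the function-calling use case listed under Intended Use, recent transformers releases accept a tools argument in apply_chat_template, and the Qwen2.5 chat template renders tool schemas natively. A hedged sketch continuing from the snippet above (the get_weather schema is hypothetical, and exact behavior depends on your transformers version):

# Hypothetical OpenAI-style tool schema; reuses `tokenizer` from the snippet above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,  # requires a transformers version with tools support
    tokenize=False,
    add_generation_prompt=True,
)
# Generate as before; the model should emit a structured tool call for get_weather.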

With MLX (Apple Silicon)

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the MLX weights and tokenizer directly from the Hub.
model, tokenizer = load("preethamkyanam/kai-2")

messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Sample at temperature 0.7; generate returns the decoded completion.
sampler = make_sampler(temp=0.7)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=100,
    sampler=sampler,
)
print(response)

Model Architecture Details

  • Hidden Size: 3,584
  • Intermediate Size: 18,944 (MLP expansion ≈ 5.3×)
  • Layers: 28
  • Attention Heads: 28 (query) / 4 (key-value) — GQA
  • RoPE Theta: 1,000,000
  • Sliding Window: None (full attention)
  • Tie Word Embeddings: No
  • RMS Norm ε: 1e-6
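
These figures pin down the KV-cache footprint. A quick back-of-the-envelope check (assuming a bf16 cache and head_dim = hidden_size / query heads = 128):

# KV-cache cost implied by the architecture above (bf16 = 2 bytes per value).
layers, kv_heads, head_dim, dtype_bytes = 28, 4, 3584 // 28, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(per_token)                  # 57,344 bytes ≈ 56 KiB per token
print(per_token * 32768 / 2**30)  # ≈ 1.75 GiB at the full 32k context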

Compute & Environmental Impact

  • Stage 1: Lambda Cloud, NVIDIA A100 40GB, ~6 hrs, ~2.1 kWh
  • Stage 2: Local, Apple M3 (24 GB), ~3 hrs, ~0.1 kWh

Citation

If you use Kai-2 in your research or applications, please cite:

@misc{kai2_2025,
  title = {Kai-2: A Fine-Tuned Qwen2.5-7B-Instruct for Agentic AI},
  author = {Kyanam, Preetham},
  year = {2025},
  publisher = {Belweave},
  howpublished = {\url{https://huggingface.co/preethamkyanam/kai-2}}
}

Contact

For questions, issues, or collaboration inquiries, reach out via Belweave or open an issue on the model page.
