Experimental global target bits‑per‑weight quantization of Qwen/Qwen3.6-27B and Qwen/Qwen3.6-35B-A3B.
Unlike standard llama.cpp quantizations that rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters the most, and produces high quality models that meet a precise global file size target.
Key Advantages:
- VRAM Maximization: Can generate high quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24GB VRAM).
- Data-Driven Precision: Quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD size trade-offs.
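To make the idea concrete, here is a minimal sketch of how a global bits-per-weight budget could be spread across tensors according to measured error. It is not the actual llama.cpp implementation: the `TensorInfo` class, the `allocate` function, the integer bit widths, and the error numbers are all hypothetical and chosen only for illustration. It starts every tensor at its lowest width, then repeatedly upgrades whichever tensor gives the largest error reduction per extra bit while the budget still holds.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    name: str
    n_weights: int
    # Hypothetical measured quantization error at each candidate bit width.
    error_at_bits: dict[int, float]


def target_bpw(size_budget_bytes: float, total_weights: float) -> float:
    """Global bits per weight implied by a byte budget for the weights alone."""
    return size_budget_bytes * 8 / total_weights


def allocate(tensors: list[TensorInfo], target: float) -> tuple[dict[str, int], float]:
    """Greedy allocation: returns the chosen bit width per tensor and the achieved bpw."""
    choice = {t.name: min(t.error_at_bits) for t in tensors}
    total = sum(t.n_weights for t in tensors)
    budget_bits = target * total
    used = sum(choice[t.name] * t.n_weights for t in tensors)

    while True:
        best = None
        for t in tensors:
            cur = choice[t.name]
            higher = sorted(b for b in t.error_at_bits if b > cur)
            if not higher:
                continue  # already at the highest candidate width
            nxt = higher[0]
            extra = (nxt - cur) * t.n_weights
            if used + extra > budget_bits:
                continue  # this upgrade would exceed the global budget
            gain = (t.error_at_bits[cur] - t.error_at_bits[nxt]) / extra
            if best is None or gain > best[0]:
                best = (gain, t, nxt, extra)
        if best is None:
            break  # no affordable upgrade is left
        _, t, nxt, extra = best
        choice[t.name] = nxt
        used += extra

    return choice, used / total


# Toy data (made up): the attention tensor is far more sensitive than the FFN tensor.
tensors = [
    TensorInfo("attn", 1000, {2: 10.0, 4: 2.0, 8: 0.5}),
    TensorInfo("ffn", 3000, {2: 3.0, 4: 2.5, 8: 2.4}),
]
choice, achieved = allocate(tensors, target=4.0)
print(choice, achieved)
```

In this toy run the sensitive tensor absorbs the spare bits and the large, insensitive one stays at the lowest width, so the result lands at or below the target. In practice a weights-only budget has to be smaller than total VRAM, since the KV cache and activations also need room.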
Full benchmarks (PPL, KLD, ARC, GPQA, MMLU, etc.) and methodology are in the model cards.
I tested it on all the tricky scenarios where most LLMs usually face-plant—and guess what? It didn’t flop.
295B total params, 21B active params, 256K context window. Built on MoE architecture, it delivers trillion-parameter-level performance with a much smaller footprint. Long-context capabilities get a massive upgrade.
Agent abilities stand out this time: tool calling, workflow orchestration, and autonomous planning are far more stable in real business scenarios. AI PPT generation in Tencent Docs is also significantly smoother and more reliable.
Real-world tests on WorkBuddy show first-token latency down 54%, success rate over 99.99%, and an Agent workflow that ran continuously for 495 steps.
Its Coding Agent achieved top-tier results on both SWE-Bench Verified and Terminal-Bench 2.0.
Now open-sourced on GitHub, HuggingFace, and ModelScope. Available on TokenHub at just 1.2 RMB per million tokens.
Darwin-TTS: 3% of an LLM's Brain Makes TTS Speak with Emotion — Zero Training
We blended 3% of Qwen3-1.7B (LLM) FFN weights into Qwen3-TTS-1.7B's talker module. The result: emotionally enhanced speech synthesis — with zero training, zero data, and zero GPU hours.
Qwen3-1.7B (LLM) and Qwen3-TTS-1.7B's talker share an identical architecture: same hidden_size (2048), same layers (28), same heads (16). This enabled pure 1:1 weight blending across 84 FFN tensors with a single lerp operation. At 3% blend, emotion appears. At 5%, emotion intensifies. At 10%, the model breaks, producing 655-second outputs for a 3-second sentence, because the LLM's "keep generating" pattern overwhelms the TTS stop signal.
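The blend described above can be sketched in a few lines. This is a dependency-free illustration, not the released Darwin-TTS code: plain Python lists stand in for tensors, and the tensor names (`mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`) and the `blend_ffn` helper are assumptions. Three FFN projections per layer across 28 layers would be consistent with the 84 tensors quoted. Each blended weight is `tts + alpha * (llm - tts)`, which is what PyTorch's in-place `Tensor.lerp_` computes.

```python
# Assumed FFN tensor name fragments; real checkpoints may use different names.
FFN_KEYS = ("mlp.gate_proj", "mlp.up_proj", "mlp.down_proj")


def lerp(a, b, t):
    """Elementwise a + t * (b - a) for two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("shape mismatch: tensors must have the same size")
    return [x + t * (y - x) for x, y in zip(a, b)]


def blend_ffn(tts_state, llm_state, alpha):
    """Blend only FFN tensors present in both models; leave all others untouched."""
    out = dict(tts_state)
    blended = 0
    for name, weight in tts_state.items():
        if any(key in name for key in FFN_KEYS) and name in llm_state:
            out[name] = lerp(weight, llm_state[name], alpha)
            blended += 1
    return out, blended


def toy_state(value, n_layers=28):
    """Build a toy state dict: three FFN tensors and one attention tensor per layer."""
    state = {}
    for i in range(n_layers):
        for key in FFN_KEYS:
            state[f"layers.{i}.{key}.weight"] = [value] * 4
        state[f"layers.{i}.self_attn.q_proj.weight"] = [value] * 4
    return state


tts = toy_state(1.0)  # stand-in for the TTS talker weights
llm = toy_state(0.0)  # stand-in for the LLM weights
merged, n_blended = blend_ffn(tts, llm, alpha=0.03)
print(n_blended)
```

Because the blend is a straight interpolation, the 3%, 5%, and 10% settings in the post correspond to `alpha` values of 0.03, 0.05, and 0.10. The attention tensors are left unchanged in this sketch.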
To our knowledge, this is the first training-free cross-modal weight transfer between an LLM and a TTS model. Prior work requires adapter training (SmolTolk, 2025), fine-tuning (CSLM, 2025), or massive end-to-end compute (GPT-4o). Darwin-TTS achieves cross-modal capability transfer in under 2 minutes on CPU.
The key insight: TTS models with LLM backbones already "think" in language. We're just restoring 3% of the original LLM's language understanding patterns — particularly those related to emotional semantics and prosody planning. The code is three lines: load the model, load the LLM FFN, call p.lerp_(llm_weight, 0.03).
From the creators of the Darwin Evolutionary Merge Framework. Darwin LLM V7 achieved GPQA Diamond 86.9% (HF Benchmark #3) through CMA-ES optimized FFN crossbreeding. Darwin-TTS extends this principle from LLM-to-LLM merging into cross-modal LLM-to-TTS transfer. Apache 2.0.