moonshotai/Kimi-K2.7-Code optimized for running on a Mac Studio M3 Ultra.

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 3-bit MoE baseline with important always-on layers at higher precision.
  • Fits into ~460 GB memory, leaving enough room for a smaller utility model.

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Kimi-K2.7-Code-MLX-3.6bit

Benchmarks

TBD

Methodology

Quantized with a mlx-lm fork. MLX quantization options differ than llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
Downloads last month
34
Safetensors
Model size
1T params
Tensor type
BF16
U32
F32
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for spicyneuron/Kimi-K2.7-Code-MLX-3.6bit

Quantized
(11)
this model