spicyneuron
/

MiniMax-M2.7-MLX-4.9bit

Text Generation

4-bit precision

Model card Files Files and versions

MiniMax-M2.7-MLX-4.9bit / README.md

spicyneuron's picture

Update README.md

5f69518 verified 4 days ago

|

history blame contribute delete

2.01 kB

	---
	pipeline_tag: text-generation
	license: other
	license_name: modified-mit
	license_link: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
	library_name: mlx
	tags:
	- mlx
	base_model: MiniMaxAI/MiniMax-M2.7
	---

	[MiniMax-M2.7](MiniMaxAI/MiniMax-M2.7) optimized for MLX.

	- A mixed-precision quant that balances speed, memory, and accuracy.
	- 4 bit baseline with important layers at 5, 6, 8, and BF16.

	# Usage

	```sh
	# Start server at http://localhost:8080/chat/completions
	uvx --from mlx-lm mlx_lm.server \
	--host 127.0.0.1 \
	--port 8080 \
	--model spicyneuron/MiniMax-M2.7-MLX-4.9bit
	```

	# Benchmarks

	metric \| mlx-community_MiniMax-M2.7-4bit \| baa-ai_MiniMax-M2.7-RAM-155GB-MLX \| 4.9 bit (this model)
	--- \| --- \| --- \| ---
	bpw \| 4.501 \| 5.4278 \| 4.915
	peak memory (1024/512) \| 129.632 \| 156.051 \| 141.458
	prompt tok/s (1024) \| 739.996 ± 1.565 \| 708.147 ± 0.818 \| 723.742 ± 0.880
	gen tok/s (512) \| 48.703 ± 0.116 \| 40.253 ± 0.077 \| 42.270 ± 0.143
	perplexity \| 9.120 ± 0.047 \| 8.835 ± 0.045 \| 4.590 ± 0.027
	hellaswag \| 0.504 ± 0.011 \| 0.509 ± 0.011 \| 0.512 ± 0.011
	piqa \| 0.786 ± 0.01 \| 0.787 ± 0.01 \| 0.791 ± 0.009
	winogrande \| 0.636 ± 0.014 \| 0.661 ± 0.013 \| 0.666 ± 0.013

	Tested on a Mac Studio M3 Ultra with:

	```
	mlx_lm.perplexity --sequence-length 2048 --seed 123
	mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
	mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000
	mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 2000
	mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 2000
	```

	# Methodology

	Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922), drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
	MLX quantization options differ than llama.cpp, but the principles are the same:

	- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
	- More tolerant layers like MoE experts get lower precision