CodeLens-7B-MLX

MLX conversions of sriksven/CodeLens-7B at several oQ quantization levels (4-, 5-, 6-, and 8-bit), each in bfloat16 and fp16.

| Directory | oQ Level | dtype | Size |
| --- | --- | --- | --- |
| CodeLens-7B-oQ4-bf16 | 4-bit | bfloat16 | 4.2 GB |
| CodeLens-7B-oQ4-fp16 | 4-bit | fp16 | 4.2 GB |
| CodeLens-7B-oQ5-bf16 | 5-bit | bfloat16 | 5.1 GB |
| CodeLens-7B-oQ5-fp16 | 5-bit | fp16 | 5.1 GB |
| CodeLens-7B-oQ6-bf16 | 6-bit | bfloat16 | 5.9 GB |
| CodeLens-7B-oQ6-fp16 | 6-bit | fp16 | 5.9 GB |
| CodeLens-7B-oQ8-bf16 | 8-bit | bfloat16 | 7.5 GB |
| CodeLens-7B-oQ8-fp16 | 8-bit | fp16 | 7.5 GB |
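The directory names follow a regular pattern, so picking a single variant can be scripted. The sketch below is only an illustration: it assumes the variants are top-level folders of the xunker/CodeLens-7B-MLX repository and that the third-party `huggingface_hub` package is installed. `variant_dir` and `download_variant` are hypothetical helper names, not part of any tool mentioned here.

```python
from pathlib import Path

REPO_ID = "xunker/CodeLens-7B-MLX"  # assumed repo id, taken from the model page
OQ_LEVELS = (4, 5, 6, 8)            # quantization levels listed in the table
DTYPES = ("bf16", "fp16")           # dtype suffixes used in the directory names


def variant_dir(oq_level: int, dtype: str) -> str:
    """Build the directory name for one variant, e.g. CodeLens-7B-oQ6-fp16."""
    if oq_level not in OQ_LEVELS:
        raise ValueError(f"unknown oQ level: {oq_level}")
    if dtype not in DTYPES:
        raise ValueError(f"unknown dtype: {dtype}")
    return f"CodeLens-7B-oQ{oq_level}-{dtype}"


def download_variant(oq_level: int, dtype: str, dest: str = "models") -> Path:
    """Download only one variant's files instead of the whole repository."""
    # Imported lazily so the name helper above works without the package.
    from huggingface_hub import snapshot_download

    subdir = variant_dir(oq_level, dtype)
    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"{subdir}/*"],  # assumes one folder per variant
        local_dir=dest,
    )
    return Path(root) / subdir
```

Calling `download_variant(6, "fp16")` would fetch only the oQ6 fp16 folder, provided the repository layout matches the assumption above.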

Why choose FP16 over BFLOAT16/BF16?

On older Apple Silicon (M1 and M2), fp16 can be faster. Here are the details from Muhammad Raza:

A lot of MLX builds ship as bf16, and on the M1 and M2 that data type does not get the accelerated path that fp16 does. During prefill those weights run un-accelerated and the penalty multiplies across every input token, which is part of why some “MLX is slow” reports come from older hardware. [...]

If you are on an M1 or M2 and MLX feels sluggish, check this before you blame the format.
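As a rough illustration of that advice, the choice can be derived from the chip name. This is a minimal sketch: `recommended_dtype` and `detect_chip_name` are hypothetical helpers, and the fallback to bf16 on M3 and later is an assumption, since the quote above only covers M1 and M2.

```python
import re
import subprocess


def recommended_dtype(chip_name: str) -> str:
    """Return "fp16" for M1/M2 chips, otherwise "bf16" (assumed default)."""
    match = re.search(r"\bM(\d+)\b", chip_name)
    if match and int(match.group(1)) <= 2:
        return "fp16"
    # Assumption: newer chips are not affected, so bf16 is kept as default.
    return "bf16"


def detect_chip_name() -> str:
    """Read the chip name on macOS, e.g. "Apple M1 Pro"."""
    result = subprocess.run(
        ["sysctl", "-n", "machdep.cpu.brand_string"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

On a Mac, `recommended_dtype(detect_chip_name())` gives the suffix to look for in the directory table.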

Test Results

The results below are for the oQ6 variants, measured with oMLX 0.4.4 on a 2021 MacBook Pro (M1 Pro).

tl;dr:

With FP16, Prompt Processing Tokens Per Second (ppTPS, aka "prefill speed") is about 60% higher, which cuts Time to First Token (TTFT) by roughly 38%.

However, Token Generation (tgTPS) improves only moderately: around 5-6% for single requests, and roughly 19-23% in the 2x and 4x batch runs.
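The percentages can be recomputed from the single-request tables in this section. The snippet below is a small sketch that copies those numbers by hand; `percent_change` is a helper name chosen here for illustration.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (BFLOAT16 value, FP16 value), copied from the single-request tables.
single_request = {
    "pp 4096 / tg 128": {
        "TTFT(ms)": (24727.9, 15226.8),
        "ppTPS": (165.6, 269.0),
        "tgTPS": (25.7, 27.0),
    },
    "pp 16384 / tg 128": {
        "TTFT(ms)": (111811.1, 69595.4),
        "ppTPS": (146.5, 235.4),
        "tgTPS": (20.8, 22.1),
    },
}

for test, metrics in single_request.items():
    for name, (bf16, fp16) in metrics.items():
        change = percent_change(bf16, fp16)
        print(f"{test:18s} {name:9s} {change:+.1f}%")
```

A negative change is an improvement for TTFT (lower is better), while positive is an improvement for ppTPS and tgTPS.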

BFLOAT16

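Single request results

| Test | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pp 4096 / tg 128 | 24727.9 | 39.3 | 165.6 | 25.7 | 29.7 | 142.1 | 6.84 GB |
| pp 16384 / tg 128 | 111811.1 | 48.4 | 146.5 | 20.8 | 118.0 | 140.0 | 7.69 GB |

Batch results

| Batch | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| 1x baseline | 25.7 | 165.6 | 24727.9 | 29.7 | 1.00x |
| 2x | 30.0 | 164.0 | 12487.3 | 21.0 | 1.17x |
| 4x | 31.9 | 239.7 | 16885.7 | 33.1 | 1.24x |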

FP16

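Single request results

| Test | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pp 4096 / tg 128 | 15226.8 | 37.3 | 269.0 | 27.0 | 20.0 | 211.6 | 6.84 GB |
| pp 16384 / tg 128 | 69595.4 | 45.6 | 235.4 | 22.1 | 75.4 | 219.0 | 7.69 GB |

Batch results

| Batch | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| 1x baseline | 27.0 | 269.0 | 15226.8 | 20.0 | 1.00x |
| 2x | 36.9 | 266.6 | 7681.4 | 14.6 | 1.37x |
| 4x | 38.0 | 363.3 | 11084.7 | 24.8 | 1.41x |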
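For readers unfamiliar with the column names, the numbers appear consistent with two simple relationships: ppTPS is the prompt length divided by TTFT, and the batch Speedup is tgTPS divided by the 1x baseline tgTPS. This is inferred from the figures above, not taken from oMLX documentation, so treat the sketch below as a sanity check rather than a definition.

```python
def prefill_tps(prompt_tokens: int, ttft_ms: float) -> float:
    """Prompt tokens processed per second, assuming TTFT is all prefill."""
    return prompt_tokens / (ttft_ms / 1000.0)


def batch_speedup(tg_tps: float, baseline_tg_tps: float) -> float:
    """Generation throughput relative to the 1x baseline run."""
    return tg_tps / baseline_tg_tps


# (dtype, prompt tokens, TTFT in ms, reported ppTPS) from the tables above.
single_runs = [
    ("bf16", 4096, 24727.9, 165.6),
    ("bf16", 16384, 111811.1, 146.5),
    ("fp16", 4096, 15226.8, 269.0),
    ("fp16", 16384, 69595.4, 235.4),
]

for dtype, tokens, ttft_ms, reported in single_runs:
    derived = prefill_tps(tokens, ttft_ms)
    print(f"{dtype} pp {tokens}: derived {derived:.1f}, reported {reported}")

# (dtype, batch label, tgTPS, baseline tgTPS) from the batch tables above.
batch_runs = [
    ("bf16", "2x", 30.0, 25.7),
    ("bf16", "4x", 31.9, 25.7),
    ("fp16", "2x", 36.9, 27.0),
    ("fp16", "4x", 38.0, 27.0),
]

for dtype, label, tg_tps, baseline in batch_runs:
    print(f"{dtype} {label}: {batch_speedup(tg_tps, baseline):.2f}x")
```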

Hardware and Software

These were converted to MLX using oMLX 0.4.4 on a 32GB MacBook Pro 2021 (M1 Pro). I cleared all my RAM so you don't have to.

License

Apache 2.0, as per original model.


Model tree for xunker/CodeLens-7B-MLX

Base model: Qwen/Qwen2.5-7B
