---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- code
- code-review
- programming
- qwen2.5
- bug-detection
- mlx
datasets:
- sahil2801/CodeAlpaca-20k
language:
- en
pipeline_tag: text-generation
library_name: transformers
model-index:
- name: CodeLens-7B
  results: []
---
# [CodeLens-7B-MLX](https://huggingface.co/xunker/CodeLens-7B-MLX)
MLX conversions of [sriksven/CodeLens-7B](https://huggingface.co/sriksven/CodeLens-7B) at several oQ quantization levels and in two dtypes (bfloat16 and fp16).
Directory | oQ Level | dtype | size
----------------------------------------------|----------|----------|------
[CodeLens-7B-oQ4-bf16](CodeLens-7B-oQ4-bf16/) | 4-bit | bfloat16 | 4.2GB
[CodeLens-7B-oQ4-fp16](CodeLens-7B-oQ4-fp16/) | 4-bit | fp16 | 4.2GB
[CodeLens-7B-oQ5-bf16](CodeLens-7B-oQ5-bf16/) | 5-bit | bfloat16 | 5.1GB
[CodeLens-7B-oQ5-fp16](CodeLens-7B-oQ5-fp16/) | 5-bit | fp16 | 5.1GB
[CodeLens-7B-oQ6-bf16](CodeLens-7B-oQ6-bf16/) | 6-bit | bfloat16 | 5.9GB
[CodeLens-7B-oQ6-fp16](CodeLens-7B-oQ6-fp16/) | 6-bit | fp16 | 5.9GB
[CodeLens-7B-oQ8-bf16](CodeLens-7B-oQ8-bf16/) | 8-bit | bfloat16 | 7.5GB
[CodeLens-7B-oQ8-fp16](CodeLens-7B-oQ8-fp16/) | 8-bit | fp16 | 7.5GB
## Why choose FP16 over BFLOAT16/BF16?
On older Apple Silicon (M1 and M2), fp16 can be faster. Here are the details from [Muhammad Raza](https://muhammadraza.me/2026/gguf-vs-mlx-decision-guide/#two-traps-that-will-flip-your-results):
> A lot of MLX builds ship as bf16, and **on the M1 and M2 that data type does not get the accelerated path that fp16 does**. During prefill those weights run un-accelerated and the penalty multiplies across every input token, which is part of why some “MLX is slow” reports come from older hardware. [...]
>
> If you are on an M1 or M2 and MLX feels sluggish, check this before you blame the format.
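
As a rough illustration of the advice above, here is a minimal sketch that maps a chip name to one of the directories in this repo. The helper names (`pick_dtype`, `variant_dir`, `download_variant`) are hypothetical, and defaulting to bf16 on anything newer than M2 is an assumption, not something measured here. The download step uses `snapshot_download` from `huggingface_hub` and is only imported when called.

```python
import os
import re

REPO_ID = "xunker/CodeLens-7B-MLX"
BIT_LEVELS = (4, 5, 6, 8)  # oQ levels listed in the table above


def pick_dtype(chip: str) -> str:
    """Return 'fp16' for M1/M2-family chips, otherwise 'bf16'.

    The bf16 default for newer chips is an assumption for illustration.
    """
    match = re.search(r"\bM(\d+)\b", chip)
    if match and int(match.group(1)) <= 2:
        return "fp16"
    return "bf16"


def variant_dir(bits: int, chip: str) -> str:
    """Build the directory name, e.g. 'CodeLens-7B-oQ6-fp16'."""
    if bits not in BIT_LEVELS:
        raise ValueError(f"unsupported oQ level: {bits}")
    return f"CodeLens-7B-oQ{bits}-{pick_dtype(chip)}"


def download_variant(bits: int, chip: str) -> str:
    """Download only the chosen directory and return its local path."""
    from huggingface_hub import snapshot_download  # needs network access

    folder = variant_dir(bits, chip)
    root = snapshot_download(repo_id=REPO_ID, allow_patterns=[f"{folder}/*"])
    return os.path.join(root, folder)
```

On macOS the chip name can be read with `sysctl -n machdep.cpu.brand_string`, which reports a string such as `Apple M1 Pro`.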
### Test Results
Using oQ6, here are the results from oMLX 0.4.4 on a MacBook Pro 2021 (M1 Pro).
**tl;dr**:
Prompt-processing throughput (ppTPS, aka "prefill speed") is about 60% higher with FP16,
which cuts Time to First Token (TTFT) by about 38%.
Token generation (tgTPS) improves only modestly for single requests, by about 5-6%.
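
The relative differences can be recomputed from the single-request tables in this section. The sketch below copies those figures by hand. The dictionary layout and the `pct_change` helper are illustrative only and are not part of oMLX.

```python
# Single-request figures copied from the BFLOAT16 and FP16 tables.
BF16 = {
    "pp 4096 / tg 128": {"ttft_ms": 24727.9, "pp_tps": 165.6, "tg_tps": 25.7},
    "pp 16384 / tg 128": {"ttft_ms": 111811.1, "pp_tps": 146.5, "tg_tps": 20.8},
}
FP16 = {
    "pp 4096 / tg 128": {"ttft_ms": 15226.8, "pp_tps": 269.0, "tg_tps": 27.0},
    "pp 16384 / tg 128": {"ttft_ms": 69595.4, "pp_tps": 235.4, "tg_tps": 22.1},
}


def pct_change(old: float, new: float) -> float:
    """Percentage change going from old to new."""
    return (new - old) / old * 100.0


def compare(bf16: dict, fp16: dict) -> dict:
    """Per-test percentage change of each metric, FP16 relative to BF16."""
    return {
        test: {k: pct_change(bf16[test][k], fp16[test][k]) for k in bf16[test]}
        for test in bf16
    }


for test, deltas in compare(BF16, FP16).items():
    line = ", ".join(f"{k}: {v:+.1f}%" for k, v in deltas.items())
    print(f"{test}: {line}")
```

A negative `ttft_ms` change means the first token arrives sooner; positive `pp_tps` and `tg_tps` changes mean higher throughput.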
#### BFLOAT16
##### Single request results
Test | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem
------------------|----------|----------|-------|-------|--------|------------|--------
pp 4096 / tg 128 | 24727.9 | 39.3 | 165.6 | 25.7 | 29.7 | 142.1 | 6.84 GB
pp 16384 / tg 128 | 111811.1 | 48.4 | 146.5 | 20.8 | 118.0 | 140.0 | 7.69 GB
##### Batch results
Batch | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup
------------|-------|-------|-------------|--------|--------
1x baseline | 25.7 | 165.6 | 24727.9 | 29.7 | 1.00x
2x | 30.0 | 164.0 | 12487.3 | 21.0 | 1.17x
4x | 31.9 | 239.7 | 16885.7 | 33.1 | 1.24x
#### FP16
##### Single request results
Test | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem
------------------|----------|----------|-------|-------|--------|------------|--------
pp 4096 / tg 128 | 15226.8 | 37.3 | 269.0 | 27.0 | 20.0 | 211.6 | 6.84 GB
pp 16384 / tg 128 | 69595.4 | 45.6 | 235.4 | 22.1 | 75.4 | 219.0 | 7.69 GB
##### Batch results
Batch | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup
------------|-------|-------|-------------|--------|--------
1x baseline | 27.0 | 269.0 | 15226.8 | 20.0 | 1.00x
2x | 36.9 | 266.6 | 7681.4 | 14.6 | 1.37x
4x | 38.0 | 363.3 | 11084.7 | 24.8 | 1.41x
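
The Speedup column in both batch tables appears to be the batch tgTPS divided by the 1x baseline tgTPS. That reading is an inference from the numbers, not something oMLX's output states. A minimal sketch that reproduces the column under that assumption:

```python
# tgTPS per batch size, copied from the BFLOAT16 and FP16 batch tables.
BATCH_TG_TPS = {
    "bf16": {1: 25.7, 2: 30.0, 4: 31.9},
    "fp16": {1: 27.0, 2: 36.9, 4: 38.0},
}


def speedups(tg_tps: dict) -> dict:
    """Ratio of each batch's tgTPS to the 1x baseline, rounded to 2 places."""
    baseline = tg_tps[1]
    return {batch: round(value / baseline, 2) for batch, value in tg_tps.items()}


for dtype, table in BATCH_TG_TPS.items():
    print(dtype, speedups(table))
```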
## Hardware and Software
These were converted to MLX using [oMLX](https://github.com/jundot/omlx) [0.4.4](https://github.com/jundot/omlx/releases/tag/v0.4.4) on a 32GB MacBook Pro 2021 (M1 Pro). I cleared all my RAM so you don't have to.
## License
Apache 2.0, as per the [original model](https://huggingface.co/sriksven/CodeLens-7B).