---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
  - code
  - code-review
  - programming
  - qwen2.5
  - bug-detection
  - mlx
datasets:
  - sahil2801/CodeAlpaca-20k
language:
  - en
pipeline_tag: text-generation
library_name: transformers
model-index:
  - name: CodeLens-7B
    results: []
---

# [CodeLens-7B-MLX](https://huggingface.co/xunker/CodeLens-7B-MLX)

MLX conversions of [sriksven/CodeLens-7B](https://huggingface.co/sriksven/CodeLens-7B) at several oQ quantization levels, each in two dtypes.

Directory                                     | oQ Level | dtype    | size
----------------------------------------------|----------|----------|------
[CodeLens-7B-oQ4-bf16](CodeLens-7B-oQ4-bf16/) | 4-bit    | bfloat16 | 4.2GB
[CodeLens-7B-oQ4-fp16](CodeLens-7B-oQ4-fp16/) | 4-bit    | fp16     | 4.2GB
[CodeLens-7B-oQ5-bf16](CodeLens-7B-oQ5-bf16/) | 5-bit    | bfloat16 | 5.1GB
[CodeLens-7B-oQ5-fp16](CodeLens-7B-oQ5-fp16/) | 5-bit    | fp16     | 5.1GB
[CodeLens-7B-oQ6-bf16](CodeLens-7B-oQ6-bf16/) | 6-bit    | bfloat16 | 5.9GB
[CodeLens-7B-oQ6-fp16](CodeLens-7B-oQ6-fp16/) | 6-bit    | fp16     | 5.9GB
[CodeLens-7B-oQ8-bf16](CodeLens-7B-oQ8-bf16/) | 8-bit    | bfloat16 | 7.5GB
[CodeLens-7B-oQ8-fp16](CodeLens-7B-oQ8-fp16/) | 8-bit    | fp16     | 7.5GB
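
Since every variant lives in its own directory, you may want to fetch only one of them instead of the whole repository. The sketch below is one way to do that, assuming the directories sit at the root of `xunker/CodeLens-7B-MLX` as the relative links in the table suggest, and that `huggingface_hub` is installed. The helper names `variant_dir` and `download_variant` are made up for this example.

```python
import os

REPO_ID = "xunker/CodeLens-7B-MLX"  # repository linked in the title of this card
BITS = (4, 5, 6, 8)                 # oQ levels listed in the table
DTYPES = ("bf16", "fp16")           # dtype suffixes used in the directory names


def variant_dir(bits: int, dtype: str) -> str:
    """Build a directory name from the table, e.g. CodeLens-7B-oQ6-fp16."""
    if bits not in BITS:
        raise ValueError(f"bits must be one of {BITS}, got {bits!r}")
    if dtype not in DTYPES:
        raise ValueError(f"dtype must be one of {DTYPES}, got {dtype!r}")
    return f"CodeLens-7B-oQ{bits}-{dtype}"


def download_variant(bits: int, dtype: str) -> str:
    """Download a single variant and return its local path."""
    # Imported lazily so variant_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    name = variant_dir(bits, dtype)
    # allow_patterns restricts the download to files under the chosen directory.
    # Assumption: the directory is at the repository root.
    snapshot = snapshot_download(repo_id=REPO_ID, allow_patterns=[f"{name}/*"])
    return os.path.join(snapshot, name)
```

For example, `download_variant(6, "fp16")` would fetch only `CodeLens-7B-oQ6-fp16/` and return the local directory, which you can then point your MLX runtime at.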

## Why choose FP16 over BFLOAT16/BF16?

On older Apple Silicon (M1 and M2), fp16 can be faster. Here are the details from [Muhammad Raza](https://muhammadraza.me/2026/gguf-vs-mlx-decision-guide/#two-traps-that-will-flip-your-results):

> A lot of MLX builds ship as bf16, and **on the M1 and M2 that data type does not get the accelerated path that fp16 does**. During prefill those weights run un-accelerated and the penalty multiplies across every input token, which is part of why some “MLX is slow” reports come from older hardware. [...]
>
> If you are on an M1 or M2 and MLX feels sluggish, check this before you blame the format.

### Test Results

These results compare the two oQ6 variants, benchmarked with oMLX 0.4.4 on a 2021 MacBook Pro (M1 Pro).

**tl;dr**:

Prompt Processing Tokens Per Second (ppTPS, aka "prefill speed") is about 60% higher
when using FP16, and Time to First Token (TTFT) drops by about 38%.

However, Token Generation (tgTPS) only increases moderately, by around 5-6% (roughly 1.3 tokens per second).

#### BFLOAT16

##### Single request results
Test              | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem
------------------|----------|----------|-------|-------|--------|------------|--------
pp 4096 / tg 128  | 24727.9  | 39.3     | 165.6 | 25.7  | 29.7   | 142.1      | 6.84 GB
pp 16384 / tg 128 | 111811.1 | 48.4     | 146.5 | 20.8  | 118.0  | 140.0      | 7.69 GB

##### Batch results
Batch       | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup
------------|-------|-------|-------------|--------|--------
1x baseline | 25.7  | 165.6 | 24727.9     | 29.7   | 1.00x
2x          | 30.0  | 164.0 | 12487.3     | 21.0   | 1.17x
4x          | 31.9  | 239.7 | 16885.7     | 33.1   | 1.24x

#### FP16

##### Single request results
Test              | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem
------------------|----------|----------|-------|-------|--------|------------|--------
pp 4096 / tg 128  | 15226.8  | 37.3     | 269.0 | 27.0  | 20.0   | 211.6      | 6.84 GB
pp 16384 / tg 128 | 69595.4  | 45.6     | 235.4 | 22.1  | 75.4   | 219.0      | 7.69 GB

##### Batch results
Batch       | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup
------------|-------|-------|-------------|--------|--------
1x baseline | 27.0  | 269.0 | 15226.8     | 20.0   | 1.00x
2x          | 36.9  | 266.6 | 7681.4      | 14.6   | 1.37x
4x          | 38.0  | 363.3 | 11084.7     | 24.8   | 1.41x
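
To check the comparison yourself, the relative differences can be recomputed from the single-request tables above. This is a minimal sketch: the numbers are copied from those tables, while the dictionary layout and the function names `percent_change` and `compare` are just choices made for this example.

```python
# Single-request figures copied from the BFLOAT16 and FP16 tables above.
RESULTS = {
    "pp 4096 / tg 128": {
        "bf16": {"ttft_ms": 24727.9, "pp_tps": 165.6, "tg_tps": 25.7},
        "fp16": {"ttft_ms": 15226.8, "pp_tps": 269.0, "tg_tps": 27.0},
    },
    "pp 16384 / tg 128": {
        "bf16": {"ttft_ms": 111811.1, "pp_tps": 146.5, "tg_tps": 20.8},
        "fp16": {"ttft_ms": 69595.4, "pp_tps": 235.4, "tg_tps": 22.1},
    },
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(test: str) -> dict:
    """Percent change of each metric when moving from bf16 to fp16."""
    bf16 = RESULTS[test]["bf16"]
    fp16 = RESULTS[test]["fp16"]
    return {metric: percent_change(bf16[metric], fp16[metric]) for metric in bf16}


if __name__ == "__main__":
    for test in RESULTS:
        deltas = compare(test)
        print(test, {metric: f"{value:+.1f}%" for metric, value in deltas.items()})
```

A negative change in `ttft_ms` is an improvement, since lower latency is better. Positive changes in `pp_tps` and `tg_tps` mean higher throughput.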

## Hardware and Software

These models were converted to MLX using [oMLX](https://github.com/jundot/omlx) [0.4.4](https://github.com/jundot/omlx/releases/tag/v0.4.4) on a 32GB 2021 MacBook Pro (M1 Pro). I cleared all my RAM so you don't have to.

## License

Apache 2.0, as per the [original model](https://huggingface.co/sriksven/CodeLens-7B).