---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- code
- code-review
- programming
- qwen2.5
- bug-detection
- mlx
datasets:
- sahil2801/CodeAlpaca-20k
language:
- en
pipeline_tag: text-generation
library_name: transformers
model-index:
- name: CodeLens-7B
  results: []
---

# [CodeLens-7B-MLX](https://huggingface.co/xunker/CodeLens-7B-MLX)

MLX version of [sriksven/CodeLens-7B](https://huggingface.co/sriksven/CodeLens-7B) in various oQ levels and dtypes.

Directory                                     | oQ Level | dtype    | Size
----------------------------------------------|----------|----------|-------
[CodeLens-7B-oQ4-bf16](CodeLens-7B-oQ4-bf16/) | 4-bit    | bfloat16 | 4.2 GB
[CodeLens-7B-oQ4-fp16](CodeLens-7B-oQ4-fp16/) | 4-bit    | fp16     | 4.2 GB
[CodeLens-7B-oQ5-bf16](CodeLens-7B-oQ5-bf16/) | 5-bit    | bfloat16 | 5.1 GB
[CodeLens-7B-oQ5-fp16](CodeLens-7B-oQ5-fp16/) | 5-bit    | fp16     | 5.1 GB
[CodeLens-7B-oQ6-bf16](CodeLens-7B-oQ6-bf16/) | 6-bit    | bfloat16 | 5.9 GB
[CodeLens-7B-oQ6-fp16](CodeLens-7B-oQ6-fp16/) | 6-bit    | fp16     | 5.9 GB
[CodeLens-7B-oQ8-bf16](CodeLens-7B-oQ8-bf16/) | 8-bit    | bfloat16 | 7.5 GB
[CodeLens-7B-oQ8-fp16](CodeLens-7B-oQ8-fp16/) | 8-bit    | fp16     | 7.5 GB

## Why choose FP16 over BFLOAT16/BF16?

On older Apple Silicon (M1 and M2), fp16 can be faster. Here are the details from [Muhammad Raza](https://muhammadraza.me/2026/gguf-vs-mlx-decision-guide/#two-traps-that-will-flip-your-results):

> A lot of MLX builds ship as bf16, and **on the M1 and M2 that data type does not get the accelerated path that fp16 does**. During prefill those weights run un-accelerated and the penalty multiplies across every input token, which is part of why some “MLX is slow” reports come from older hardware. [...]
>
> If you are on an M1 or M2 and MLX feels sluggish, check this before you blame the format.

### Test Results

Using oQ6, here are the results from oMLX 0.4.4 on a MacBook Pro 2021 (M1 Pro).
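
To check the FP16-versus-BF16 comparison yourself, the ratios can be recomputed from the `pp 4096 / tg 128` rows of the single-request tables in this section. This is only a sketch: the helper name `pct_faster` and the dictionary layout are illustrative choices, not part of oMLX or its benchmark output.

```python
# Values copied by hand from the "pp 4096 / tg 128" rows of the
# single-request tables (BFLOAT16 and FP16) in this section.
bf16 = {"ttft_ms": 24727.9, "pp_tps": 165.6, "tg_tps": 25.7}
fp16 = {"ttft_ms": 15226.8, "pp_tps": 269.0, "tg_tps": 27.0}


def pct_faster(slow: float, fast: float) -> float:
    """Return how much larger `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


# ppTPS and tgTPS are rates (higher is better): compare FP16 against BF16.
prefill_gain = pct_faster(bf16["pp_tps"], fp16["pp_tps"])
decode_gain = pct_faster(bf16["tg_tps"], fp16["tg_tps"])

# TTFT is a latency (lower is better): compare how much longer BF16 takes.
ttft_gain = pct_faster(fp16["ttft_ms"], bf16["ttft_ms"])

print(f"prefill speed (ppTPS): FP16 is {prefill_gain:.1f}% faster")
print(f"time to first token:   FP16 is {ttft_gain:.1f}% faster")
print(f"generation (tgTPS):    FP16 is {decode_gain:.1f}% faster")
```

Swapping in the `pp 16384 / tg 128` rows gives the same comparison for the longer prompt.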

**tl;dr**: Time to First Token (TTFT) and Prompt Processing Tokens Per Second (ppTPS, aka "prefill speed") are about 60% faster when using FP16. Token Generation (tgTPS) improves only modestly, by about 5-6% for single requests.

#### BFLOAT16

##### Single request results

Test              | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem
------------------|----------|----------|-------|-------|--------|------------|--------
pp 4096 / tg 128  | 24727.9  | 39.3     | 165.6 | 25.7  | 29.7   | 142.1      | 6.84 GB
pp 16384 / tg 128 | 111811.1 | 48.4     | 146.5 | 20.8  | 118.0  | 140.0      | 7.69 GB

##### Batch results

Batch       | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup
------------|-------|-------|-------------|--------|--------
1x baseline | 25.7  | 165.6 | 24727.9     | 29.7   | 1.00x
2x          | 30.0  | 164.0 | 12487.3     | 21.0   | 1.17x
4x          | 31.9  | 239.7 | 16885.7     | 33.1   | 1.24x

#### FP16

##### Single request results

Test              | TTFT(ms) | TPOT(ms) | ppTPS | tgTPS | E2E(s) | Throughput | PeakMem
------------------|----------|----------|-------|-------|--------|------------|--------
pp 4096 / tg 128  | 15226.8  | 37.3     | 269.0 | 27.0  | 20.0   | 211.6      | 6.84 GB
pp 16384 / tg 128 | 69595.4  | 45.6     | 235.4 | 22.1  | 75.4   | 219.0      | 7.69 GB

##### Batch results

Batch       | tgTPS | ppTPS | avgTTFT(ms) | E2E(s) | Speedup
------------|-------|-------|-------------|--------|--------
1x baseline | 27.0  | 269.0 | 15226.8     | 20.0   | 1.00x
2x          | 36.9  | 266.6 | 7681.4      | 14.6   | 1.37x
4x          | 38.0  | 363.3 | 11084.7     | 24.8   | 1.41x

## Hardware and Software

These models were converted to MLX using [oMLX](https://github.com/jundot/omlx) [0.4.4](https://github.com/jundot/omlx/releases/tag/v0.4.4) on a 32 GB MacBook Pro 2021 (M1 Pro).

I cleared all my RAM so you don't have to.

## License

Apache 2.0, as per the [original model](https://huggingface.co/sriksven/CodeLens-7B).