Use it from Swift
Add the package
Package.swift:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
Platforms: iOS 18+ / macOS 15+.
Download + chat (one call)
import CoreMLLLM
// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML")
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
Multi-turn: keep a [CoreMLLLM.Message] array, append each user/assistant turn, and pass the whole history to generate(_:) again. Call llm.reset() to start a new conversation (clears the KV cache).
Qwen3.5-2B → Core ML (ANE chunked)
Core ML port of Qwen/Qwen3.5-2B, split into 4 INT8 chunks + a raw fp16 embedding sidecar so every chunk fits the iPhone ANE single-mlprogram compile envelope.
iPhone 17 Pro (A18) measured: 17 tok/s decode, ~200 MB phys_footprint, 0 GB sustained Metal heap, ~91 % ANE op placement across all 4 body chunks. First-load ANE compile ≈ 15 min across chunks (cached after).
Files
qwen3_5_2b_decode_chunks/
├── chunk_a.mlpackage   # 340 MB · embed + layers 0-5 + their states
├── chunk_b.mlpackage   # 340 MB · layers 6-11 + states
├── chunk_c.mlpackage   # 340 MB · layers 12-17 + states
├── chunk_d.mlpackage   # 850 MB · layers 18-23 + final_norm + lm_head
└── embed_weight.bin    # 1.02 GB · raw fp16 embed table (248320 × 2048)
All 5 pieces are required. They chain hidden→hidden across chunks per token, plus 48 state tensors (24 layers × 2 states each) carried inside the mlpackages.
The embed is not an mlpackage on purpose: Swift mmaps the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows a prompt actually touches get paged in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to phys_footprint.
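The same behavior is easy to check from Python with np.memmap; a tiny sketch (path and row index are placeholders):

import numpy as np

# Map the table read-only: nothing is read until a row is touched.
table = np.memmap("embed_weight.bin", dtype=np.float16, mode="r",
                  shape=(248320, 2048))
# One row is 2048 x 2 bytes = 4 KB, so indexing it faults in a single
# page; the rest of the 1.02 GB file stays clean and evictable.
row = np.asarray(table[4242])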
What this repo does NOT ship
- No model_config.json: Core ML serializes input/output shapes into each .mlpackage directly; coremltools loads it without an external config.
- No tokenizer: fetch it from the base model:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
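Since the shapes live inside the packages, a chunk's exact input/output names can be read off its spec with coremltools; a quick sketch, assuming the chunks are already downloaded into the working directory:

import coremltools as ct

# load_spec reads only the model description, not the weight blobs
spec = ct.utils.load_spec("qwen3_5_2b_decode_chunks/chunk_a.mlpackage")
for inp in spec.description.input:
    print("input: ", inp.name)
for out in spec.description.output:
    print("output:", out.name)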
Standalone usage (Python / Mac)
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML")
root = f"{local}/qwen3_5_2b_decode_chunks"
chunks = [
    ct.models.MLModel(f"{root}/chunk_{x}.mlpackage")
    for x in ("a", "b", "c", "d")
]
embed = np.memmap(f"{root}/embed_weight.bin",
                  dtype=np.float16, mode="r",
                  shape=(248320, 2048))
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
Per decode step (a code sketch follows this list):
- Look up embed[token_id] → hidden (1, 1, 2048) fp16.
- Pass hidden + scalar inputs (position, cos, sin) + the state slice to chunk_a.predict(...); take its hidden_out and updated states.
- Repeat for chunk_b, chunk_c, chunk_d. chunk_d emits logits (1, 1, 248320) fp16; argmax (or sample) it and feed it back as input_token for the next step.
- Map each new_state_* output to the next call's matching state_* input.
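A compressed version of that loop, continuing from the setup above. This is a sketch, not the reference implementation: the feature names (hidden, position, cos, sin, hidden_out, the state_*/new_state_* pairs), the cos/sin shapes, and the int32 position dtype are assumptions; verify them against each chunk's spec and the reference loop linked below.

def init_states(model):
    # Zero-init every state_* input using the shapes stored in the spec.
    return {
        i.name: np.zeros(tuple(i.type.multiArrayType.shape), dtype=np.float16)
        for i in model.get_spec().description.input
        if i.name.startswith("state_")
    }

# Partial RoPE per the Architecture section: rotary_dim=64, rope_theta=1e7.
inv_freq = 1e7 ** (-np.arange(0, 64, 2, dtype=np.float32) / 64)

states = [init_states(c) for c in chunks]
ids = tok.apply_chat_template([{"role": "user", "content": "Hello!"}],
                              add_generation_prompt=True)
token, generated = ids[0], []

for pos in range(2048):                          # prefill + decode, one token at a time
    hidden = np.asarray(embed[token])[None, None, :]   # (1, 1, 2048) fp16 row via mmap
    ang = pos * inv_freq
    cos = np.cos(ang).astype(np.float16)[None]   # shape (1, 32) assumed
    sin = np.sin(ang).astype(np.float16)[None]
    for i, chunk in enumerate(chunks):
        out = chunk.predict({
            "hidden": hidden,
            "position": np.array([pos], dtype=np.int32),
            "cos": cos, "sin": sin,
            **states[i],
        })
        hidden = out.get("hidden_out", hidden)   # chunk_d emits logits instead
        states[i] = {k[len("new_"):]: v          # new_state_* -> state_* for next step
                     for k, v in out.items() if k.startswith("new_state_")}
    if pos + 1 < len(ids):
        token = ids[pos + 1]                     # still feeding the prompt
    else:
        token = int(np.argmax(out["logits"][0, 0]))   # greedy; sample here if preferred
        if token == tok.eos_token_id:
            break
        generated.append(token)

print(tok.decode(generated))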
Full reference Python loop: conversion/qwen35_2b_chunks_parity.py.
iOS / Mac app
Qwen35Generator.swift handles the chunk chaining + embed mmap. Tap Qwen3.5 2B (ANE) in the model picker.
Architecture
Hybrid Gated DeltaNet + GQA, 24 layers, interleaved [L L L F] × 6.
|  | linear_attention | full_attention |
|---|---|---|
| count | 18 | 6 |
| state A | (1, 6144, 4) | (1, 2, 2048, 256) |
| state B | (1, 16, 128, 128) | (1, 2, 2048, 256) |
Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048.
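To make the interleave and partial-rotary figures concrete, a quick check (pure arithmetic from the numbers above):

# [L L L F] repeated 6 times: 18 linear_attention + 6 full_attention layers
pattern = ["L", "L", "L", "F"] * 6
assert pattern.count("L") == 18 and pattern.count("F") == 6
# Partial rotary: only the first quarter of each 256-dim head is rotated
assert int(256 * 0.25) == 64            # rotary_dim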
Conversion
python conversion/build_qwen35_2b_decode_chunks.py \
    --out-dir ./output \
    --max-seq 2048 --nbits 8
License
Apache 2.0 (inherits from the base model).