Use it from Swift
Add the package
Package.swift:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
Platforms: iOS 18+ / macOS 15+.
Download + chat (one call)
import CoreMLLLM
// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML")
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}
Multi-turn: keep a [CoreMLLLM.Message] array, append each user/assistant turn, and pass the whole history to generate(_:) again. Call llm.reset() to start a new conversation (clears the KV cache).
Qwen3.5-2B → Core ML (ANE chunked)
Core ML port of Qwen/Qwen3.5-2B, split into 4 INT8 chunks + a raw fp16 embedding sidecar so every chunk fits the iPhone ANE single-mlprogram compile envelope.
iPhone 17 Pro (A18) measured: 17 tok/s decode, ~200 MB phys_footprint, 0 GB sustained Metal heap, ~91 % ANE op placement across all 4 body chunks. First-load ANE compile ≈ 15 min across chunks (cached after).
Files
qwen3_5_2b_decode_chunks/
├── chunk_a.mlpackage   # 340 MB · embed + layers 0-5 + their states
├── chunk_b.mlpackage   # 340 MB · layers 6-11 + states
├── chunk_c.mlpackage   # 340 MB · layers 12-17 + states
├── chunk_d.mlpackage   # 850 MB · layers 18-23 + final_norm + lm_head
└── embed_weight.bin    # 1.02 GB · raw fp16 embed table (248320 × 2048)
All 5 pieces are required. They chain hidden→hidden across chunks per token, plus 48 state tensors (24 layers × 2 states each) carried inside the mlpackages.
The embed is not an mlpackage on purpose: Swift mmaps the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows a prompt actually touches get paged in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to phys_footprint.
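The same behavior is easy to check from Python with np.memmap; a tiny sketch (path and row index are placeholders):

import numpy as np

# Map the table read-only: nothing is read until a row is touched.
table = np.memmap("embed_weight.bin", dtype=np.float16, mode="r",
                  shape=(248320, 2048))
# One row is 2048 x 2 bytes = 4 KB, so indexing it faults in a single
# page; the rest of the 1.02 GB file stays clean and evictable.
row = np.asarray(table[4242])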
What this repo does NOT ship
- No model_config.json: Core ML serializes input/output shapes into each .mlpackage directly; coremltools loads it without an external config.
- No tokenizer: fetch it from the base model:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
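Since the shapes live inside the packages, a chunk's exact input/output names can be read off its spec with coremltools; a quick sketch, assuming the chunks are already downloaded into the working directory:

import coremltools as ct

# load_spec reads only the model description, not the weight blobs
spec = ct.utils.load_spec("qwen3_5_2b_decode_chunks/chunk_a.mlpackage")
for inp in spec.description.input:
    print("input: ", inp.name)
for out in spec.description.output:
    print("output:", out.name)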
Standalone usage (Python / Mac)
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML")
root = f"{local}/qwen3_5_2b_decode_chunks"
chunks = [
    ct.models.MLModel(f"{root}/chunk_{x}.mlpackage")
    for x in ("a", "b", "c", "d")
]
embed = np.memmap(f"{root}/embed_weight.bin",
                  dtype=np.float16, mode="r",
                  shape=(248320, 2048))
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
Per decode step (a code sketch follows this list):
- Look up embed[token_id] → hidden (1, 1, 2048) fp16.
- Pass hidden + scalar inputs (position, cos, sin) + the state slice to chunk_a.predict(...); take its hidden_out and updated states.
- Repeat for chunk_b, chunk_c, chunk_d. chunk_d emits logits (1, 1, 248320) fp16; argmax (or sample) it and feed it back as input_token for the next step.
- Map each new_state_* output to the next call's matching state_* input.
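A compressed version of that loop, continuing from the setup above. This is a sketch, not the reference implementation: the feature names (hidden, position, cos, sin, hidden_out, the state_*/new_state_* pairs), the cos/sin shapes, and the int32 position dtype are assumptions; verify them against each chunk's spec and the reference loop linked below.

def init_states(model):
    # Zero-init every state_* input using the shapes stored in the spec.
    return {
        i.name: np.zeros(tuple(i.type.multiArrayType.shape), dtype=np.float16)
        for i in model.get_spec().description.input
        if i.name.startswith("state_")
    }

# Partial RoPE per the Architecture section: rotary_dim=64, rope_theta=1e7.
inv_freq = 1e7 ** (-np.arange(0, 64, 2, dtype=np.float32) / 64)

states = [init_states(c) for c in chunks]
ids = tok.apply_chat_template([{"role": "user", "content": "Hello!"}],
                              add_generation_prompt=True)
token, generated = ids[0], []

for pos in range(2048):                          # prefill + decode, one token at a time
    hidden = np.asarray(embed[token])[None, None, :]   # (1, 1, 2048) fp16 row via mmap
    ang = pos * inv_freq
    cos = np.cos(ang).astype(np.float16)[None]   # shape (1, 32) assumed
    sin = np.sin(ang).astype(np.float16)[None]
    for i, chunk in enumerate(chunks):
        out = chunk.predict({
            "hidden": hidden,
            "position": np.array([pos], dtype=np.int32),
            "cos": cos, "sin": sin,
            **states[i],
        })
        hidden = out.get("hidden_out", hidden)   # chunk_d emits logits instead
        states[i] = {k[len("new_"):]: v          # new_state_* -> state_* for next step
                     for k, v in out.items() if k.startswith("new_state_")}
    if pos + 1 < len(ids):
        token = ids[pos + 1]                     # still feeding the prompt
    else:
        token = int(np.argmax(out["logits"][0, 0]))   # greedy; sample here if preferred
        if token == tok.eos_token_id:
            break
        generated.append(token)

print(tok.decode(generated))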
Full reference Python loop: conversion/qwen35_2b_chunks_parity.py.
iOS / Mac app
Qwen35Generator.swift handles the chunk chaining + embed mmap. Tap Qwen3.5 2B (ANE) in the model picker.
Architecture
Hybrid Gated DeltaNet + GQA, 24 layers, interleaved [L L L F] × 6.
|  | linear_attention | full_attention |
|---|---|---|
| count | 18 | 6 |
| state A | (1, 6144, 4) | (1, 2, 2048, 256) |
| state B | (1, 16, 128, 128) | (1, 2, 2048, 256) |
Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048.
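To make the interleave and partial-rotary figures concrete, a quick check (pure arithmetic from the numbers above):

# [L L L F] repeated 6 times: 18 linear_attention + 6 full_attention layers
pattern = ["L", "L", "L", "F"] * 6
assert pattern.count("L") == 18 and pattern.count("F") == 6
# Partial rotary: only the first quarter of each 256-dim head is rotated
assert int(256 * 0.25) == 64            # rotary_dim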
Conversion
python conversion/build_qwen35_2b_decode_chunks.py \
    --out-dir ./output \
    --max-seq 2048 --nbits 8
License
Apache 2.0 (inherits from the base model).