Use it from Swift

Add the package

Package.swift:

.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.

Download + chat (one call)

import CoreMLLLM

// First call pulls the bundle from this repo to Documents/Models/.
// Subsequent calls reuse the on-disk copy.
let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/qwen3.5-2B-CoreML")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream {
    print(chunk, terminator: "")
}

Multi-turn: keep a [CoreMLLLM.Message] array, append each user/assistant turn, and pass the whole history to generate(_:) again. Call llm.reset() to start a new conversation (this clears the KV cache).

Qwen3.5-2B — Core ML (ANE chunked)

Core ML port of Qwen/Qwen3.5-2B, split into 4 INT8 chunks plus a raw fp16 embedding sidecar so that every chunk fits inside the iPhone ANE's single-mlprogram compile envelope.

Measured on iPhone 17 Pro (A18): 17 tok/s decode, ~200 MB phys_footprint, no sustained Metal heap, ~91% ANE op placement across all 4 body chunks. First-load ANE compile takes ≈ 15 min across the chunks (cached afterwards).

Files

qwen3_5_2b_decode_chunks/
├── chunk_a.mlpackage      # 340 MB — embed + layers 0-5 + their states
├── chunk_b.mlpackage      # 340 MB — layers 6-11 + states
├── chunk_c.mlpackage      # 340 MB — layers 12-17 + states
├── chunk_d.mlpackage      # 850 MB — layers 18-23 + final_norm + lm_head
└── embed_weight.bin       # 1.02 GB — raw fp16 embed table (248320 × 2048)

All five pieces are required. Per token, the hidden state chains hidden → hidden across the chunks, and 48 state tensors (24 layers × 2 states each) are carried inside the mlpackages.

The embed is not an mlpackage on purpose: Swift mmaps the raw fp16 file so the 1 GB embed table stays in clean virtual pages and only the rows actually touched per prompt page in. Loading the embed as a Core ML weight would dequantize the entire table into the CPU heap and add ~1 GB to phys_footprint.
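The mmap behaviour described above can be sketched in a few lines. This is a runnable stand-in, not the repo's loader: the real sidecar is the 248320 × 2048 fp16 file, while here a tiny table is generated on the fly so the snippet runs without the download; `lookup` is an illustrative helper name.

```python
import os
import tempfile

import numpy as np

VOCAB, HIDDEN = 64, 2048  # stand-in vocab size for the demo (real vocab: 248320)

# Generate a tiny fp16 table on disk so the snippet is self-contained.
path = os.path.join(tempfile.mkdtemp(), "embed_weight.bin")
rng = np.random.default_rng(0)
rng.standard_normal((VOCAB, HIDDEN)).astype(np.float16).tofile(path)

# mode="r" maps the file read-only: the pages stay clean, and only rows that
# are actually indexed get faulted in, which keeps the resident footprint small.
embed = np.memmap(path, dtype=np.float16, mode="r", shape=(VOCAB, HIDDEN))

def lookup(token_id: int) -> np.ndarray:
    """Return the (1, 1, HIDDEN) fp16 hidden state for one token id."""
    return np.array(embed[token_id], dtype=np.float16).reshape(1, 1, HIDDEN)

h = lookup(3)
```

The same `np.memmap` call appears in the standalone Python usage below; the Swift side gets the equivalent behaviour by mmapping the file directly.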

What this repo does NOT ship

  • No model_config.json β€” Core ML serializes input/output shapes into each .mlpackage directly. coremltools loads it without external config.
  • No tokenizer β€” fetch from the base model:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")

Standalone usage (Python / Mac)

import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download("mlboydaisuke/qwen3.5-2B-CoreML")
root = f"{local}/qwen3_5_2b_decode_chunks"

chunks = [
    ct.models.MLModel(f"{root}/chunk_{x}.mlpackage")
    for x in ("a", "b", "c", "d")
]
embed = np.memmap(f"{root}/embed_weight.bin",
                  dtype=np.float16, mode="r",
                  shape=(248320, 2048))
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")

Per decode step:

  1. Look up embed[token_id] → hidden (1, 1, 2048) fp16
  2. Pass hidden + scalar inputs (position, cos, sin) + state slice to chunk_a.predict(...), take its hidden_out and updated states.
  3. Repeat for chunk_b, chunk_c, chunk_d.
  4. chunk_d emits logits (1, 1, 248320) fp16; argmax (or sample) it and feed back as input_token for the next step.
  5. Map new_state_* outputs to the next call's state_* inputs.

Full reference Python loop: conversion/qwen35_2b_chunks_parity.py.
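The five steps above can be sketched with the chunks mocked out. This is a structural sketch only: the input/output names (`hidden`, `hidden_out`, `state_*`, `new_state_*`, `logits`) are assumptions taken from the step list, the per-chunk state count is simplified to 12 (6 layers × 2 states), and `MockChunk` stands in for `ct.models.MLModel` so the snippet runs without the model download — the parity script in conversion/ is authoritative.

```python
import numpy as np

HIDDEN, VOCAB = 2048, 248320
N_STATES = 12  # 6 layers x 2 states per chunk in this sketch

class MockChunk:
    """Stand-in for ct.models.MLModel: passes hidden through, bumps states."""
    def __init__(self, emits_logits: bool = False):
        self.emits_logits = emits_logits

    def predict(self, inputs: dict) -> dict:
        out = {f"new_state_{i}": inputs[f"state_{i}"] + 1 for i in range(N_STATES)}
        if self.emits_logits:
            logits = np.zeros((1, 1, VOCAB), dtype=np.float16)
            logits[0, 0, 42] = 1.0  # deterministic argmax for the demo
            out["logits"] = logits
        else:
            out["hidden_out"] = inputs["hidden"]
        return out

chunks = [MockChunk(), MockChunk(), MockChunk(), MockChunk(emits_logits=True)]
states = [{f"state_{i}": np.zeros((1,), np.float16) for i in range(N_STATES)}
          for _ in chunks]

def decode_step(hidden, position, cos, sin) -> int:
    for k, chunk in enumerate(chunks):
        out = chunk.predict({"hidden": hidden, "position": position,
                             "cos": cos, "sin": sin, **states[k]})
        # Step 5: map new_state_* outputs back to next call's state_* inputs.
        states[k] = {f"state_{i}": out[f"new_state_{i}"] for i in range(N_STATES)}
        hidden = out.get("hidden_out", hidden)  # steps 2-3: chain hidden
    # Step 4: chunk_d emitted logits; argmax becomes the next input token.
    return int(np.argmax(out["logits"]))

next_token = decode_step(np.zeros((1, 1, HIDDEN), np.float16), 0, None, None)
```

Swapping `MockChunk` for the four real `MLModel` objects (and real state shapes) recovers the reference loop.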

iOS / Mac app

Qwen35Generator.swift handles the chunk chaining + embed mmap. Tap Qwen3.5 2B (ANE) in the model picker.

Architecture

Hybrid Gated DeltaNet + GQA, 24 layers, interleaved [L L L F] × 6 (L = linear attention, F = full attention).

          linear_attention     full_attention
count     18                   6
state A   (1, 6144, 4)         (1, 2, 2048, 256)
state B   (1, 16, 128, 128)    (1, 2, 2048, 256)

Hidden=2048, vocab=248320, head_dim=256 (full attn), rotary partial=0.25 (rotary_dim=64), rope_theta=1e7, max_seq=2048.
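With partial rotary, only the first rotary_dim = 64 of the 256-dim full-attention heads get position encoding. A minimal sketch of the cos/sin inputs implied by these numbers — the exact tensor layout the chunks expect (half-duplicated here, per the common RoPE convention) is an assumption; check the conversion script for the real one:

```python
import numpy as np

ROTARY_DIM = 64    # 0.25 * head_dim 256
ROPE_THETA = 1e7

def rope_cos_sin(position: int):
    """Return fp16 cos/sin tables of shape (1, 1, ROTARY_DIM) for one position."""
    # Standard RoPE frequencies: theta^(-2i/rotary_dim) for i in [0, rotary_dim/2)
    inv_freq = ROPE_THETA ** (-np.arange(0, ROTARY_DIM, 2) / ROTARY_DIM)
    angles = position * inv_freq                  # (32,)
    angles = np.concatenate([angles, angles])     # duplicated halves -> (64,)
    return (np.cos(angles).astype(np.float16)[None, None, :],
            np.sin(angles).astype(np.float16)[None, None, :])

# At position 0 every angle is 0, so cos is all ones and sin all zeros.
cos, sin = rope_cos_sin(0)
```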

Conversion

python conversion/build_qwen35_2b_decode_chunks.py \
  --out-dir ./output \
  --max-seq 2048 --nbits 8

License

Apache 2.0 (inherits from the base model).
