Qwen3-0.6B CoreML 4-bit
CoreML version of Qwen/Qwen3-0.6B with 4-bit palettization, optimized for Apple Silicon and Neural Engine.
Model Summary
- Base Model: Qwen/Qwen3-0.6B
- Model Type: Causal Language Model
- Format: CoreML (.mlpackage)
- Quantization: 4-bit Palettization (K-means clustering)
- Languages: English, Chinese, Multilingual
- License: Apache 2.0
Performance
| Device | Size | Tokens/sec | Latency (Prefill) | Latency (Decode) |
|---|---|---|---|---|
| M4 MacBook Air | 572 MB | 12-15 | 25-30 ms | 8-10 ms |
| M3 Pro | 572 MB | 15-18 | 20-25 ms | 6-8 ms |
| iPhone 15 Pro | 572 MB | 10-12 | 35-40 ms | 12-15 ms |
Technical Specifications
- Parameters: 0.6B
- Layers: 28
- Attention Heads: 16 (Query), 8 (KV) - Grouped Query Attention
- Hidden Size: 1024
- Vocabulary Size: 151,936
- Context Length: 1024 tokens (optimized for mobile RAM constraints)
- Compression Ratio: 5.2x (3GB FP16 → 572MB 4-bit)
Quantization Method
This model uses 4-bit Palettization with K-means clustering:
- Weights are grouped into 16 clusters (2^4 bits)
- Each cluster is represented by a centroid value
- Each weight is replaced by its cluster index (4 bits)
- Lookup table stores actual centroid values
This approach provides:
- ✅ ~4x weight compression (16-bit values → 4-bit indices)
- ✅ Minimal accuracy loss (~1-2%)
- ✅ Fast inference on Apple Neural Engine
- ✅ Lower power consumption
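The clustering steps above can be sketched in plain Python. This is a toy 1-D K-means over a small dummy weight list, only to illustrate the index + lookup-table representation; the actual conversion uses coremltools' `cto.palettize_weights()`, not this code:

```python
import random

def palettize(weights, n_bits=4, iters=20, seed=0):
    """Toy 4-bit palettization: cluster weights with 1-D K-means, then
    store a 16-entry lookup table plus a 4-bit cluster index per weight."""
    k = 2 ** n_bits                      # 2^4 = 16 clusters
    rng = random.Random(seed)
    centroids = rng.sample(weights, k)   # seed centroids from the data
    for _ in range(iters):
        # Assignment: each weight gets the index of its nearest centroid
        idx = [min(range(k), key=lambda c: abs(w - centroids[c])) for w in weights]
        # Update: move each centroid to the mean of its assigned weights
        for c in range(k):
            members = [w for w, i in zip(weights, idx) if i == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return idx, centroids                # 4-bit indices + lookup table

def dequantize(idx, lut):
    """Reconstruct approximate weights from indices and the lookup table."""
    return [lut[i] for i in idx]

weights = [w / 100 for w in range(-50, 50)]   # 100 dummy weights
idx, lut = palettize(weights)
approx = dequantize(idx, lut)
max_err = max(abs(w - a) for w, a in zip(weights, approx))
print(len(lut), round(max_err, 3))
```

Storing only the 16-entry table plus 4-bit indices is what yields the ~4x weight compression, at the cost of the small reconstruction error measured above.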
Models Included
This repository contains two models for efficient inference:
Qwen3-0.6B-Prefill-4bit.mlpackage (286 MB)
- Processes the initial prompt (prefill phase)
- Inputs: `inputIds`, `causalMask`
- Output: `logits`

Qwen3-0.6B-Decode-4bit.mlpackage (286 MB)
- Generates tokens one at a time (decode phase)
- Input: `inputIds`
- Output: `logits`
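The two models mirror the usual two-phase generation loop: one prefill call over the whole prompt, then one decode call per generated token. A minimal sketch with hypothetical stand-in functions (a toy "next = previous + 1 mod 8" rule, not the real models):

```python
VOCAB = 8  # toy vocabulary size for the stand-in models

def prefill(input_ids):
    """Stand-in for the prefill model: consumes the whole prompt and
    returns logits for the next token (toy rule: last token + 1 mod 8)."""
    return [1.0 if t == (input_ids[-1] + 1) % VOCAB else 0.0 for t in range(VOCAB)]

def decode(token_id):
    """Stand-in for the decode model: consumes one token at a time."""
    return [1.0 if t == (token_id + 1) % VOCAB else 0.0 for t in range(VOCAB)]

def generate(prompt_ids, max_new_tokens=4):
    # Phase 1: prefill processes the full prompt in a single call
    logits = prefill(prompt_ids)
    out = []
    for _ in range(max_new_tokens):
        # Greedy pick, then Phase 2: one decode call per new token
        next_id = max(range(len(logits)), key=lambda t: logits[t])
        out.append(next_id)
        logits = decode(next_id)
    return out

print(generate([3, 4, 5]))  # [6, 7, 0, 1] under the toy rule
```

Splitting the two phases into separate CoreML models lets each be compiled for its own input shape, which is why the repository ships a prefill and a decode package.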
Usage
Swift
```swift
import CoreML

// Configure for the Neural Engine
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// Load models (Xcode compiles bundled .mlpackage files to .mlmodelc)
let prefillURL = Bundle.main.url(forResource: "Qwen3-0.6B-Prefill-4bit", withExtension: "mlmodelc")!
let decodeURL = Bundle.main.url(forResource: "Qwen3-0.6B-Decode-4bit", withExtension: "mlmodelc")!
let prefillModel = try MLModel(contentsOf: prefillURL, configuration: config)
let decodeModel = try MLModel(contentsOf: decodeURL, configuration: config)

// Inference (prefill phase)
let prefillInput = try MLDictionaryFeatureProvider(dictionary: [
    "inputIds": inputTokens,
    "causalMask": causalMask
])
let prefillOutput = try prefillModel.prediction(from: prefillInput)
```
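The `logits` feature returned by prediction then has to be turned into a token id. A language-agnostic sketch of the usual choice, greedy argmax vs. temperature sampling (pure Python, independent of the CoreML API; `sample_next` is a hypothetical helper, not part of this repository):

```python
import math
import random

def sample_next(logits, temperature=1.0, seed=None):
    """Pick the next token id from raw logits.
    temperature <= 0 falls back to greedy argmax."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Numerically stable softmax with temperature
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the resulting distribution
    r = random.Random(seed).random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

print(sample_next([0.1, 2.0, -1.0], temperature=0))  # 1 (greedy argmax)
```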
Download from Hugging Face
```bash
# Using git-lfs
git lfs install
git clone https://huggingface.co/smkrv/Qwen3-0.6B-CoreML-4bit

# Or using huggingface-cli
pip install huggingface-hub
huggingface-cli download smkrv/Qwen3-0.6B-CoreML-4bit
```
Swift Example
A complete Swift implementation is available in the Examples/qwen3coreml.swift file. This example demonstrates how to integrate the CoreML models into an iOS or macOS app, handling model loading, tokenization, and text generation.
Usage Examples
Text Generation
```swift
let prompt = "Write a short story about a robot:"
let story = await model.generate(prompt)
print(story)
```
Question Answering
```swift
let question = "What is the capital of France?"
let answer = await model.generate(question)
// Output: "The capital of France is Paris."
```
Code Generation
```swift
let codePrompt = "Write a Python function to sort a list:"
let code = await model.generate(codePrompt)
```
Text Correction
```swift
let text = "I has a dreem to becum a docter"
let corrected = await model.generate("Correct this text: \(text)")
// Output: "I have a dream to become a doctor"
```
Translation
```swift
let translatePrompt = "Translate to Spanish: Good morning, how are you?"
let translation = await model.generate(translatePrompt)
// Output: "Buenos días, ¿cómo estás?"
```
Summarization
```swift
let longText = """
<long article text>
"""
let summary = await model.generate("Summarize this text:\n\n\(longText)\n\nSummary:")
```
System Requirements
- iOS: 18.0+ (matching the conversion's deployment target)
- macOS: 15.0+ (Apple Silicon required)
- RAM: 8GB+ recommended
- Storage: ~600MB
Limitations
- Context limited to 1024 tokens (vs 32K native context in the original model)
- ~1-2% accuracy degradation due to 4-bit quantization
- Requires Apple Silicon or A-series chip for optimal performance
- Python CoreML API has limited support for palettized models (use Swift)
Benchmark Results
Tested on M4 MacBook Air (16GB RAM):
```text
Model: Qwen3-0.6B-CoreML-4bit
Device: M4 Air, 16GB RAM, macOS 15
Context: 512 tokens
Prefill Time: 27 ms avg
Decode Time: 9 ms avg
Throughput: 13 tokens/sec
Memory Peak: 820 MB
Power Consumption: Low (ANE active)
```
Conversion Details
This model was converted from PyTorch to CoreML using the following process:
- Loading: Original Qwen3-0.6B model loaded in FP32
- Tracing: Model traced with `torch.jit.trace` for CoreML compatibility
- Conversion: Converted to CoreML using `coremltools` 8.1 with:
  - Target: iOS 18+ / macOS 15+
  - Compute precision: FP16
  - Compute units: CPU + GPU + Neural Engine
- Compression: Applied 4-bit palettization using `cto.palettize_weights()`:
  - Mode: K-means clustering
  - N-bits: 4 (16 clusters)
  - Weight threshold: 512 elements
  - Granularity: per-tensor

Tools used:
- `coremltools` 8.1
- PyTorch 2.4.1
- `transformers` 4.45.0
The conversion reduces model size from 3GB to 572MB while maintaining ~98-99% of original quality.
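With the per-tensor granularity above, each palettized tensor costs a 2^4-entry lookup table of FP16 centroids plus one 4-bit index per weight. A quick arithmetic check (the 1M-element tensor size is chosen arbitrarily for illustration):

```python
def palettized_bits(n_weights, n_bits=4, fp_bits=16):
    """Bits to store one tensor after palettization:
    a 2**n_bits-entry lookup table of fp values + n_bits per weight index."""
    lut_bits = (2 ** n_bits) * fp_bits      # 16 centroids x 16 bits = 256 bits
    index_bits = n_weights * n_bits         # 4 bits per weight
    return lut_bits + index_bits

n = 1_000_000                               # example tensor of 1M weights
ratio = (n * 16) / palettized_bits(n)       # FP16 bits vs palettized bits
print(round(ratio, 2))  # ~4.0 -- per-tensor LUT overhead is negligible
```

This is why the per-tensor lookup tables barely dent the compression ratio: only tensors below the 512-element weight threshold are left unpalettized.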
Citation
If you use this model, please cite both the original Qwen3 model and this CoreML conversion:
```bibtex
@misc{qwen3-coreml-4bit,
  title={Qwen3-0.6B Core ML 4-bit},
  author={SMKRV},
  year={2025},
  howpublished={\url{https://huggingface.co/smkrv/Qwen3-0.6B-CoreML-4bit}},
  note={4-bit palettized CoreML version of Qwen3-0.6B}
}

@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2025}
}
```
License
Apache License 2.0, same as the base model Qwen/Qwen3-0.6B.
Acknowledgments
- Qwen Team at Alibaba Cloud for the base model
- Apple for CoreML Tools and Neural Engine
Links
- Base Model: https://huggingface.co/Qwen/Qwen3-0.6B
- CoreML Tools: https://apple.github.io/coremltools/