# Qwen2.5 0.5B Instruct MLC-LLM
This is a quantized, optimized build of Qwen2.5 0.5B Instruct, compiled with MLC-LLM for WebGPU deployment in browsers.
## Model Details
- Base Model: Qwen2.5 0.5B Instruct
- Quantization: q4f32_1 (4-bit quantization with float32 scale)
- Context Window: 2048 tokens
- Prefill Chunk Size: 512 tokens
- Target: WebGPU deployment via WebLLM
- Memory Usage: ~800MB VRAM
- Total Parameters: 494,032,768
- Bits per Parameter: 5.004
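The effective bit width is higher than 4 because the float32 group scales are folded into the count. Multiplying it out reproduces the parameter size listed under Memory Requirements below; a quick sanity check, not shipped code:

```javascript
// Effective storage = parameters × bits-per-parameter / 8
const bytes = (494032768 * 5.004) / 8;
console.log((bytes / 1024 / 1024).toFixed(2) + " MB"); // "294.70 MB", matching the figure below
```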
## Usage
### With WebLLM
```javascript
import * as webllm from '@mlc-ai/web-llm';

const customModels = [
  {
    model: "https://huggingface.co/rubenz-org/qwen2-5-0-5b-instruct-mlc",
    model_id: "qwen2.5-0.5b-instruct-custom",
    // Use official WebLLM WASM runtime for compatibility
    model_lib: "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Qwen2.5-0.5B-Instruct-q4f32_1-MLC-1k.wasm",
    vram_required_MB: 800,
    low_resource_required: true,
    overrides: {
      context_window_size: 2048,
      prefill_chunk_size: 512,
    },
  },
];

const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels },
});
```
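On first load the weight shards are downloaded and cached by the browser, which can take a while; WebLLM's engine config accepts an `initProgressCallback` for surfacing this to users. A sketch, reusing `customModels` from above:

```javascript
const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels },
  // Called repeatedly while shards download and the runtime initializes
  initProgressCallback: (report) => console.log(report.text),
});
```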
### Example Chat
```javascript
const response = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello! How can you help me today?' }],
  stream: true,
  temperature: 0.7,
  max_tokens: 512,
});

// Each chunk carries an incremental delta in the OpenAI streaming format
for await (const chunk of response) {
  const content = chunk.choices[0]?.delta?.content || '';
  if (content) {
    console.log(content);
  }
}
```
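For one-shot completions you can omit `stream: true` and read the whole reply from the OpenAI-style response object:

```javascript
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize WebGPU in one sentence.' }],
  temperature: 0.7,
  max_tokens: 128,
});
console.log(reply.choices[0].message.content);
```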
## Files
- `mlc-chat-config.json`: Model configuration for MLC-LLM
- `params_shard_*.bin`: Quantized model parameters (8 shards)
- `tokenizer.json`, `vocab.json`, `merges.txt`: Tokenizer files
- `tensor-cache.json`: Parameter metadata cache
- Uses the official WebLLM WASM runtime for compatibility
## Performance
This model is optimized for browser deployment with:
- Reduced memory footprint through 4-bit quantization
- WebGPU acceleration for efficient inference
- Chunked prefill for better memory management
- Low resource requirements for edge deployment
- Compatible with official WebLLM runtime libraries
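To get a rough decode-throughput figure on your own hardware, time a short completion and divide by the token count from the OpenAI-style `usage` field. This includes prefill time, so treat it as a lower bound:

```javascript
const t0 = performance.now();
const res = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Write a haiku about GPUs.' }],
  max_tokens: 64,
});
const seconds = (performance.now() - t0) / 1000;
console.log(`~${(res.usage.completion_tokens / seconds).toFixed(1)} tokens/s`);
```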
## Browser Compatibility
- Chrome 113+ with WebGPU enabled
- Edge 113+ with WebGPU enabled
- Firefox with WebGPU experimental support
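Before creating the engine, you can feature-detect WebGPU with the standard `navigator.gpu` API and fall back gracefully on unsupported browsers:

```javascript
async function hasWebGPU() {
  // navigator.gpu is absent on browsers without WebGPU support
  if (!("gpu" in navigator)) return false;
  try {
    // A null adapter means the API exists but no usable GPU was found
    return (await navigator.gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}
```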
## Memory Requirements
- Parameters: 294.70 MB
- Temporary buffer: 334.75 MB
- Total without KV cache: 629.45 MB
- Total with 4K-token KV cache: 725.45 MB
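The 96 MB gap between the two totals is consistent with a float32 KV cache at 4096 tokens for Qwen2.5-0.5B's architecture (24 layers, 2 key/value heads of dimension 64 under grouped-query attention, per the public model config). The arithmetic below is an illustrative check, not shipped code:

```javascript
// KV cache bytes = 2 (K and V) × layers × tokens × kv_heads × head_dim × bytes/elem
const kvBytes = 2 * 24 * 4096 * 2 * 64 * 4; // float32 → 4 bytes per element
console.log(kvBytes / 1024 / 1024 + " MB"); // 96 MB = 725.45 - 629.45
```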
## License
This model is released under the same license as the original Qwen2.5 model (Apache 2.0).