Qwen2.5 0.5B Instruct MLC-LLM

This is a quantized build of Qwen2.5 0.5B Instruct, compiled with MLC-LLM for in-browser WebGPU inference via WebLLM.

Model Details

  • Base Model: Qwen2.5 0.5B Instruct
  • Quantization: q4f32_1 (4-bit weights with float32 group scales)
  • Context Window: 2048 tokens
  • Prefill Chunk Size: 512 tokens
  • Target: WebGPU deployment via WebLLM
  • Memory Usage: ~800MB VRAM
  • Total Parameters: 494,032,768
  • Bits per Parameter: 5.004
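As a quick sanity check, the bits-per-parameter figure determines the quantized parameter size; a sketch of the arithmetic (values taken from the list above):

```javascript
// q4f32_1 stores 4-bit weights plus float32 group scales, so the effective
// rate lands slightly above 4 bits per parameter.
const totalParams = 494_032_768;
const bitsPerParam = 5.004;
const paramBytes = (totalParams * bitsPerParam) / 8;
const paramMiB = paramBytes / (1024 * 1024);
console.log(paramMiB.toFixed(1)); // ≈ 294.7, matching the Memory Requirements section
```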

Usage

With WebLLM

import * as webllm from '@mlc-ai/web-llm';

const customModels = [
  {
    model: "https://huggingface.co/rubenz-org/qwen2-5-0-5b-instruct-mlc",
    model_id: "qwen2.5-0.5b-instruct-custom",
    // Use official WebLLM WASM runtime for compatibility
    model_lib: "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Qwen2.5-0.5B-Instruct-q4f32_1-MLC-1k.wasm",
    vram_required_MB: 800,
    low_resource_required: true,
    overrides: {
      context_window_size: 2048,
      prefill_chunk_size: 512,
    },
  },
];

const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels }
});
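Model download and compilation can take a while on first load, so it is worth surfacing progress to the user. WebLLM's CreateMLCEngine accepts an initProgressCallback option that receives an InitProgressReport; the sketch below assumes its progress (0..1) and text fields, so verify against the WebLLM version you use:

```javascript
// Format an InitProgressReport into a short status line (assumed fields:
// report.progress in [0, 1] and report.text, a human-readable message).
function formatProgress(report) {
  return `${Math.round(report.progress * 100)}% ${report.text}`;
}

// Hooking it up (sketch):
// const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
//   appConfig: { model_list: customModels },
//   initProgressCallback: (report) => console.log(formatProgress(report)),
// });
```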

Example Chat

const response = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello! How can you help me today?' }],
  stream: true,
  temperature: 0.7,
  max_tokens: 512,
});

for await (const chunk of response) {
  const content = chunk.choices[0]?.delta?.content || '';
  if (content) {
    // Each streamed chunk; in a UI, append to the existing text instead,
    // since console.log adds a newline per chunk.
    console.log(content);
  }
}

Files

  • mlc-chat-config.json: Model configuration for MLC-LLM
  • params_shard_*.bin: Quantized model parameters (8 shards)
  • tokenizer.json, vocab.json, merges.txt: Tokenizer files
  • tensor-cache.json: Parameter metadata cache
  • No model library WASM is bundled; the model loads with the official WebLLM WASM runtime, as shown in the usage example above

Performance

This model is optimized for browser deployment with:

  • Reduced memory footprint through 4-bit quantization
  • WebGPU acceleration for efficient inference
  • Chunked prefill for better memory management
  • Low resource requirements for edge deployment
  • Compatible with official WebLLM runtime libraries

Browser Compatibility

  • Chrome 113+ with WebGPU enabled
  • Edge 113+ with WebGPU enabled
  • Firefox with experimental WebGPU support enabled
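Before creating an engine, you can feature-detect WebGPU. The standard entry point is navigator.gpu, which is undefined in browsers without WebGPU; a minimal sketch:

```javascript
// Minimal WebGPU feature detection. `navigator.gpu` is only present in
// browsers with WebGPU; note that even then, requestAdapter() may resolve
// to null if no suitable GPU adapter is available.
function supportsWebGPU(nav) {
  return typeof nav === "object" && nav !== null && "gpu" in nav && !!nav.gpu;
}

// Browser usage (sketch):
// if (!supportsWebGPU(navigator)) {
//   // show an unsupported-browser message or fall back
// } else {
//   const adapter = await navigator.gpu.requestAdapter();
//   // adapter === null means WebGPU exists but no adapter is available
// }
```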

Memory Requirements

  • Without KV cache: 629.45 MB
  • With 4K KV cache: 725.45 MB
  • Parameters: 294.70 MB
  • Temporary buffer: 334.75 MB
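The figures above are internally consistent; a quick sketch of how they relate (all values in MB as listed):

```javascript
// Parameters plus the temporary buffer give the no-KV-cache total;
// the difference to the with-KV-cache total is the KV cache itself.
const paramsMB = 294.70;
const tempBufferMB = 334.75;
const withoutKV = paramsMB + tempBufferMB;  // 629.45 MB, as listed
const kvCacheMB = 725.45 - withoutKV;       // 96.00 MB for the KV cache
console.log(withoutKV.toFixed(2), kvCacheMB.toFixed(2));
```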

License

This model is distributed under the same license as the original Qwen2.5 model (Apache 2.0).
