Qwen2.5 0.5B Instruct MLC-LLM

This is a quantized build of Qwen2.5 0.5B Instruct, compiled with MLC-LLM for in-browser WebGPU inference via WebLLM.

Model Details

  • Base Model: Qwen2.5 0.5B Instruct
  • Quantization: q4f32_1 (4-bit weights with float32 group scales)
  • Context Window: 2048 tokens
  • Prefill Chunk Size: 512 tokens
  • Target: WebGPU deployment via WebLLM
  • Memory Usage: ~800MB VRAM
  • Total Parameters: 494,032,768
  • Bits per Parameter: 5.004
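As a quick sanity check, the bits-per-parameter figure determines the quantized parameter size; a sketch of the arithmetic (values taken from the list above):

```javascript
// q4f32_1 stores 4-bit weights plus float32 group scales, so the effective
// rate lands slightly above 4 bits per parameter.
const totalParams = 494_032_768;
const bitsPerParam = 5.004;
const paramBytes = (totalParams * bitsPerParam) / 8;
const paramMiB = paramBytes / (1024 * 1024);
console.log(paramMiB.toFixed(1)); // ≈ 294.7, matching the Memory Requirements section
```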

Usage

With WebLLM

import * as webllm from '@mlc-ai/web-llm';

const customModels = [
  {
    model: "https://huggingface.co/rubenz-org/qwen2-5-0-5b-instruct-mlc",
    model_id: "qwen2.5-0.5b-instruct-custom",
    // Use official WebLLM WASM runtime for compatibility
    model_lib: "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Qwen2.5-0.5B-Instruct-q4f32_1-MLC-1k.wasm",
    vram_required_MB: 800,
    low_resource_required: true,
    overrides: {
      context_window_size: 2048,
      prefill_chunk_size: 512,
    },
  },
];

const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels }
});
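Model download and compilation can take a while on first load, so it is worth surfacing progress to the user. WebLLM's CreateMLCEngine accepts an initProgressCallback option that receives an InitProgressReport; the sketch below assumes its progress (0..1) and text fields, so verify against the WebLLM version you use:

```javascript
// Format an InitProgressReport into a short status line (assumed fields:
// report.progress in [0, 1] and report.text, a human-readable message).
function formatProgress(report) {
  return `${Math.round(report.progress * 100)}% ${report.text}`;
}

// Hooking it up (sketch):
// const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
//   appConfig: { model_list: customModels },
//   initProgressCallback: (report) => console.log(formatProgress(report)),
// });
```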

Example Chat

const response = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello! How can you help me today?' }],
  stream: true,
  temperature: 0.7,
  max_tokens: 512,
});

for await (const chunk of response) {
  const content = chunk.choices[0]?.delta?.content || '';
  if (content) {
    // Each streamed chunk; in a UI, append to the existing text instead,
    // since console.log adds a newline per chunk.
    console.log(content);
  }
}

Files

  • mlc-chat-config.json: Model configuration for MLC-LLM
  • params_shard_*.bin: Quantized model parameters (8 shards)
  • tokenizer.json, vocab.json, merges.txt: Tokenizer files
  • tensor-cache.json: Parameter metadata cache
  • No model library WASM is bundled; the model loads with the official WebLLM WASM runtime, as shown in the usage example above

Performance

This model is optimized for browser deployment with:

  • Reduced memory footprint through 4-bit quantization
  • WebGPU acceleration for efficient inference
  • Chunked prefill for better memory management
  • Low resource requirements for edge deployment
  • Compatible with official WebLLM runtime libraries

Browser Compatibility

  • Chrome 113+ with WebGPU enabled
  • Edge 113+ with WebGPU enabled
  • Firefox with experimental WebGPU support enabled
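Before creating an engine, you can feature-detect WebGPU. The standard entry point is navigator.gpu, which is undefined in browsers without WebGPU; a minimal sketch:

```javascript
// Minimal WebGPU feature detection. `navigator.gpu` is only present in
// browsers with WebGPU; note that even then, requestAdapter() may resolve
// to null if no suitable GPU adapter is available.
function supportsWebGPU(nav) {
  return typeof nav === "object" && nav !== null && "gpu" in nav && !!nav.gpu;
}

// Browser usage (sketch):
// if (!supportsWebGPU(navigator)) {
//   // show an unsupported-browser message or fall back
// } else {
//   const adapter = await navigator.gpu.requestAdapter();
//   // adapter === null means WebGPU exists but no adapter is available
// }
```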

Memory Requirements

  • Without KV cache: 629.45 MB
  • With 4K KV cache: 725.45 MB
  • Parameters: 294.70 MB
  • Temporary buffer: 334.75 MB
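The figures above are internally consistent; a quick sketch of how they relate (all values in MB as listed):

```javascript
// Parameters plus the temporary buffer give the no-KV-cache total;
// the difference to the with-KV-cache total is the KV cache itself.
const paramsMB = 294.70;
const tempBufferMB = 334.75;
const withoutKV = paramsMB + tempBufferMB;  // 629.45 MB, as listed
const kvCacheMB = 725.45 - withoutKV;       // 96.00 MB for the KV cache
console.log(withoutKV.toFixed(2), kvCacheMB.toFixed(2));
```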

License

This model is distributed under the same license as the original Qwen2.5 model (Apache 2.0).
