---
language:
- en
license: apache-2.0
base_model: allenai/Molmo-72B-0924
tags:
- awq
- quantized
- 4-bit
- vision-language
- molmo
- qwen2
- clip
- llm-compressor
- vllm
library_name: transformers
pipeline_tag: image-text-to-text
---

# Molmo-72B AWQ 4-bit (Text-Only Quantization)

This is a 4-bit AWQ quantized version of [allenai/Molmo-72B-0924](https://huggingface.co/allenai/Molmo-72B-0924) produced with [LLM Compressor](https://github.com/vllm-project/llm-compressor).

## Key Features

- ✅ **Qwen2-72B text decoder quantized** (4-bit AWQ) - ~74% size reduction
- ✅ **OpenAI CLIP vision encoder preserved** (FP16) - maintains visual quality
- ✅ **State-of-the-art VLM performance** - among the best open VLMs
- ✅ **Selective quantization** - only LLM layers are quantized; vision components are untouched
- ✅ **vLLM compatible** - fast inference with vLLM (see the example below)
- ✅ **Trained on PixMo** - 1M curated image-text pairs

## Model Details

- **Base Model:** allenai/Molmo-72B-0924 (73B parameters)
- **Architecture:** Molmo (Qwen2-72B decoder + OpenAI CLIP vision encoder)
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Quantization Scheme:** W4A16 (4-bit weights, 16-bit activations)
- **Calibration Dataset:** Flickr30k (512 samples)

## Size Comparison

| Metric | Value |
|--------|-------|
| **Original (FP16)** | ~145.0 GB |
| **Quantized (W4A16)** | ~37.78 GB |
| **Reduction** | ~73.9% |
| **Memory Saved** | ~107.2 GB |

## What Was Quantized

**Quantized (4-bit):**

- Qwen2-72B decoder layers (text/language model)
- Text-processing linear layers in the decoder

**Preserved (FP16):**

- OpenAI CLIP vision encoder (maintains visual understanding quality)
- Vision-text connectors
- Embeddings
- Language model head

This selective quantization keeps vision understanding quality nearly identical to the original model while significantly reducing size (a sketch of such a recipe appears in the "Reproducing the Quantization" section below).

## About Molmo-72B

Molmo-72B is one of the most powerful open vision-language models:

- **Text Decoder:** Qwen2-72B (state-of-the-art 72B LLM)
- **Vision Encoder:** OpenAI CLIP (proven vision backbone)
- **Training Data:** PixMo - 1 million highly curated image-text pairs
- **Performance:** Competitive with GPT-4V on many benchmarks

## Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained(
    "ronantakizawa/molmo-72b-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)
model = AutoModelForCausalLM.from_pretrained(
    "ronantakizawa/molmo-72b-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# Process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe what you see in this image."
)

# Move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate output
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# Decode the generated tokens
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```
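### Inference with vLLM

The checkpoint can also be served with vLLM. The snippet below is a minimal sketch and has not been verified against this exact checkpoint: the prompt string, `max_model_len`, and `tensor_parallel_size` are assumptions you will likely need to adjust for your hardware and for Molmo's chat format (see the base model card for the exact template).

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Assumptions: context length and GPU split are illustrative, not requirements.
llm = LLM(
    model="ronantakizawa/molmo-72b-awq-w4a16",
    trust_remote_code=True,       # Molmo uses custom modeling code
    max_model_len=4096,           # assumption: keep short to fit KV cache in memory
    tensor_parallel_size=2,       # assumption: split the ~38 GB of weights over 2 GPUs
)

image = Image.open("example.jpg").convert("RGB")

outputs = llm.generate(
    {
        # Assumption: simple User/Assistant prompt; replace with Molmo's
        # actual chat template from the base model card.
        "prompt": "User: Describe what you see in this image. Assistant:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=200, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```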
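## Reproducing the Quantization

For reference, a selective quantization of this kind can be expressed as an LLM Compressor recipe along the following lines. This is a sketch rather than the exact script used to produce this checkpoint: the `ignore` regexes for the vision tower and connector are hypothetical module-name patterns, and the calibration here uses Flickr30k captions as plain text for simplicity, while the settings otherwise mirror the Quantization Details listed below.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "allenai/Molmo-72B-0924"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# 512 Flickr30k captions as calibration text (image inputs omitted in this sketch)
ds = load_dataset("nlphuji/flickr30k", split="test[:512]")
ds = ds.map(
    lambda sample: processor.tokenizer(
        sample["caption"][0], truncation=True, max_length=2048
    ),
    remove_columns=ds.column_names,
)

# Quantize decoder linear layers to W4A16; keep the vision encoder, connector,
# and LM head in FP16 (the regex patterns are illustrative, not verified names).
recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",
    ignore=["lm_head", "re:.*vision.*", "re:.*image_projector.*"],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="molmo-72b-awq-w4a16",
)
```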
## Quantization Details

- **Method:** AWQ (Activation-aware Weight Quantization)
- **Pipeline:** Independent (`BasicPipeline`) for layer-by-layer quantization
- **Calibration:** 512 Flickr30k image-text pairs
- **Max Sequence Length:** 2048 tokens
- **Why AWQ:** Activation-aware quantization preserves the weights that matter most for output quality

## Limitations

- May show slight quality degradation on complex text generation compared to FP16
- The vision encoder is NOT quantized (intentional, to preserve quality)
- Requires vLLM or Transformers with AWQ (compressed-tensors) support

## Important Notes

### Transparent Images

Ensure images are in RGB format:

```python
from PIL import Image

image = Image.open(...)
if image.mode != "RGB":
    image = image.convert("RGB")
```

## License

Apache 2.0 (same as the base model)

## Citation

```bibtex
@misc{molmo-72b-awq,
  title={Molmo-72B AWQ 4-bit},
  author={ronantakizawa},
  year={2025},
  url={https://huggingface.co/ronantakizawa/molmo-72b-awq-w4a16}
}
```

## Acknowledgements

- Base model by the [Allen Institute for AI](https://allenai.org/)
- Quantization with [LLM Compressor](https://github.com/vllm-project/llm-compressor)

---

🤖 Generated with [LLM Compressor](https://github.com/vllm-project/llm-compressor)