nanoVLM ExecuTorch (Quantized)

This repository contains the ExecuTorch (.pte) export of nanoVLM, optimized for on-device inference.

Model Details

  • Base Model: nanoVLM-230M-8k
  • Format: ExecuTorch .pte (optimized for deployment)
  • Quantization: int8 weight-only quantization
  • Total Size: ~515 MB (5.3x smaller than unquantized)
  • Components: 6 separate .pte files

Files

  • vision_encoder.pte - Vision encoder (SigLIP-B/16)
  • modality_projector.pte - Projects vision features to language space
  • language_decoder_prefill.pte - Language decoder prefill phase
  • language_decoder_decode.pte - Language decoder decode phase with KV cache
  • token_embedding.pte - Token embedding lookup
  • lm_head.pte - Language model output head
  • config.json - Model configuration
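
Before loading anything, it can help to verify the download. The snippet below checks that all six .pte files and config.json are present and prints the configuration keys (a minimal sketch; the local executorch_models directory name is an assumption matching the download command in the Quick Start below).

import json
from pathlib import Path

model_dir = Path("executorch_models")  # assumed local download directory
expected = [
    "vision_encoder.pte",
    "modality_projector.pte",
    "language_decoder_prefill.pte",
    "language_decoder_decode.pte",
    "token_embedding.pte",
    "lm_head.pte",
    "config.json",
]
missing = [name for name in expected if not (model_dir / name).exists()]
print("Missing files:", missing or "none")

with open(model_dir / "config.json") as f:
    config = json.load(f)
print("Config keys:", sorted(config))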

Quick Start

# Install dependencies
pip install executorch torch pillow transformers

# Download model
huggingface-cli download infil00p/nanoVLM-230M-8k-executorch --local-dir executorch_models

# Clone nanoVLM repo for test script
git clone https://github.com/huggingface/nanoVLM
cd nanoVLM

# Run inference test
python test_executorch_pte.py --model_dir ../executorch_models --image assets/image.png

Usage Example

from executorch.extension.pybindings.portable_lib import _load_for_executorch
import torch

# Load models
vision_encoder = _load_for_executorch("vision_encoder.pte")
modality_projector = _load_for_executorch("modality_projector.pte")
prefill_decoder = _load_for_executorch("language_decoder_prefill.pte")
decode_decoder = _load_for_executorch("language_decoder_decode.pte")
token_embedding = _load_for_executorch("token_embedding.pte")
lm_head = _load_for_executorch("lm_head.pte")

# Run inference (see test_executorch_pte.py for full example)
# 1. Encode image with vision_encoder
# 2. Project with modality_projector
# 3. Combine with text embeddings from token_embedding
# 4. Run prefill_decoder for initial KV cache
# 5. Autoregressive decode with decode_decoder
# 6. Get logits with lm_head

For a complete working implementation, see test_executorch_pte.py in the nanoVLM repository.
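
The condensed sketch below strings the six steps together for a single image tile and a single decode step, reusing the modules loaded above. It is illustrative only: the preprocessing, the input/output ordering of the prefill and decode programs, and the .forward call convention are assumptions; the real KV-cache plumbing lives in test_executorch_pte.py.

import torch

# vision_encoder, modality_projector, token_embedding, prefill_decoder,
# decode_decoder, and lm_head are the modules loaded above.

# Placeholder inputs: a preprocessed 512x512 RGB tile and prompt token ids.
pixel_values = torch.rand(1, 3, 512, 512)
input_ids = torch.tensor([[1, 2, 3]])

# 1. Encode the image tile.
vision_features = vision_encoder.forward([pixel_values])[0]

# 2. Project vision features into the language embedding space.
image_embeds = modality_projector.forward([vision_features])[0]

# 3. Embed the prompt tokens and concatenate with the image embeddings.
text_embeds = token_embedding.forward([input_ids])[0]
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

# 4. Prefill: run the full sequence once; assume the first output is the final
#    hidden states and the remaining outputs are the KV cache tensors.
prefill_out = prefill_decoder.forward([inputs_embeds])
hidden, kv_cache = prefill_out[0], prefill_out[1:]

# 5-6. One decoding step: project the last hidden state to vocabulary logits
#      and greedily pick the next token. A full loop would embed next_token and
#      call decode_decoder.forward(...) with the cached keys/values, repeating
#      until an end-of-sequence token appears.
logits = lm_head.forward([hidden[:, -1:, :]])[0]
next_token = torch.argmax(logits, dim=-1)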

Performance

Test Results:

  • ✅ All forward pass tests passed
  • ✅ Full inference test with image splitting (17 images, 4×4 grid)
  • ✅ Generated coherent captions

Example output:

"A close-up photograph captures a tabby cat with a focused gaze, sitting on a patterned surface. The cat's fur exhibits a mix of dark..."

Quantization Impact:

  • Size reduction: 5.3x smaller (528 MB vs 2.8 GB)
  • Accuracy: Minimal loss with int8 weight-only quantization
  • Optimized for on-device deployment

Model Architecture

  • Vision Encoder: SigLIP-B/16 (ViT, 768 hidden dim, 512×512 patches)
  • Language Model: SmolLM2-135M (576 hidden dim, 30 blocks, 8192 context)
  • Modality Projector: Pixel shuffle + linear (64 image tokens per 512×512 patch; see the sketch after this list)
  • Image Resolution: Up to 2048×2048 with automatic grid splitting
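
The 64-token figure follows from the patch arithmetic: a 512×512 tile with 16×16 patches yields a 32×32 grid (1,024 patch features); folding each 4×4 neighborhood into the channel dimension via pixel shuffle leaves an 8×8 grid of 64 tokens with 768×16 = 12,288 channels, which a linear layer maps to the 576-dim language space. The sketch below illustrates that shape flow (the shuffle factor of 4 and the exact memory layout are assumptions inferred from the numbers above, not the exported module's code).

import torch
import torch.nn as nn

hidden_dim, lm_dim, grid, factor = 768, 576, 32, 4  # 512 / 16 = 32 patches per side; factor 4 assumed

def pixel_shuffle_project(x, proj):
    # x: (batch, grid*grid, hidden_dim) vision features for one 512x512 tile
    b, n, c = x.shape
    s = int(n ** 0.5)                                 # 32
    x = x.view(b, s, s, c)
    # Fold each factor-by-factor block of patches into the channel dimension.
    x = x.view(b, s // factor, factor, s // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (s // factor) ** 2, factor * factor * c)
    return proj(x)                                    # (batch, 64, lm_dim)

proj = nn.Linear(hidden_dim * factor * factor, lm_dim)  # 12288 -> 576
tokens = pixel_shuffle_project(torch.randn(1, grid * grid, hidden_dim), proj)
print(tokens.shape)  # torch.Size([1, 64, 576])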

Export Details

Exported using:

python export_executorch.py --checkpoint lusxvr/nanoVLM --output_dir executorch_models --quantize

  • Quantization Method: int8 weight-only using torchao (sketched below)
  • ExecuTorch Version: Compatible with PyTorch 2.x ExecuTorch runtime
  • Input Constraints: Fixed 512×512 image size per patch
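
For reference, int8 weight-only quantization with torchao generally follows the pattern below. This is a sketch of the public torchao API, not the exact code in export_executorch.py; the stand-in module is purely illustrative.

import torch
from torchao.quantization import quantize_, int8_weight_only

# Stand-in for the eager-mode nanoVLM submodule that would be quantized
# before torch.export / ExecuTorch lowering.
model = torch.nn.Sequential(torch.nn.Linear(576, 576))

# Replace the weights of supported layers (nn.Linear) with int8 weight-only
# quantized versions; activations remain in floating point.
quantize_(model, int8_weight_only())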

License

Apache 2.0
