nanoVLM ExecuTorch (Quantized)

This repository contains the ExecuTorch (.pte) export of nanoVLM, optimized for on-device inference.

Model Details

  • Base Model: nanoVLM-230M-8k
  • Format: ExecuTorch .pte (optimized for deployment)
  • Quantization: int8 weight-only quantization
  • Total Size: ~515 MB (5.3x smaller than unquantized)
  • Components: 6 separate .pte files

Files

  • vision_encoder.pte - Vision encoder (SigLIP-B/16)
  • modality_projector.pte - Projects vision features to language space
  • language_decoder_prefill.pte - Language decoder prefill phase
  • language_decoder_decode.pte - Language decoder decode phase with KV cache
  • token_embedding.pte - Token embedding lookup
  • lm_head.pte - Language model output head
  • config.json - Model configuration
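
Before loading anything, it can help to verify the download. The snippet below checks that all six .pte files and config.json are present and prints the configuration keys (a minimal sketch; the local executorch_models directory name is an assumption matching the download command in the Quick Start below).

import json
from pathlib import Path

model_dir = Path("executorch_models")  # assumed local download directory
expected = [
    "vision_encoder.pte",
    "modality_projector.pte",
    "language_decoder_prefill.pte",
    "language_decoder_decode.pte",
    "token_embedding.pte",
    "lm_head.pte",
    "config.json",
]
missing = [name for name in expected if not (model_dir / name).exists()]
print("Missing files:", missing or "none")

with open(model_dir / "config.json") as f:
    config = json.load(f)
print("Config keys:", sorted(config))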

Quick Start

# Install dependencies
pip install executorch torch pillow transformers

# Download model
huggingface-cli download infil00p/nanoVLM-230M-8k-executorch --local-dir executorch_models

# Clone nanoVLM repo for test script
git clone https://github.com/huggingface/nanoVLM
cd nanoVLM

# Run inference test
python test_executorch_pte.py --model_dir ../executorch_models --image assets/image.png

Usage Example

from executorch.extension.pybindings.portable_lib import _load_for_executorch
import torch

# Load models
vision_encoder = _load_for_executorch("vision_encoder.pte")
modality_projector = _load_for_executorch("modality_projector.pte")
prefill_decoder = _load_for_executorch("language_decoder_prefill.pte")
decode_decoder = _load_for_executorch("language_decoder_decode.pte")
token_embedding = _load_for_executorch("token_embedding.pte")
lm_head = _load_for_executorch("lm_head.pte")

# Run inference (see test_executorch_pte.py for full example)
# 1. Encode image with vision_encoder
# 2. Project with modality_projector
# 3. Combine with text embeddings from token_embedding
# 4. Run prefill_decoder for initial KV cache
# 5. Autoregressive decode with decode_decoder
# 6. Get logits with lm_head

For a complete working implementation, see test_executorch_pte.py in the nanoVLM repository.
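
The condensed sketch below strings the six steps together for a single image tile and a single decode step, reusing the modules loaded above. It is illustrative only: the preprocessing, the input/output ordering of the prefill and decode programs, and the .forward call convention are assumptions; the real KV-cache plumbing lives in test_executorch_pte.py.

import torch

# vision_encoder, modality_projector, token_embedding, prefill_decoder,
# decode_decoder, and lm_head are the modules loaded above.

# Placeholder inputs: a preprocessed 512x512 RGB tile and prompt token ids.
pixel_values = torch.rand(1, 3, 512, 512)
input_ids = torch.tensor([[1, 2, 3]])

# 1. Encode the image tile.
vision_features = vision_encoder.forward([pixel_values])[0]

# 2. Project vision features into the language embedding space.
image_embeds = modality_projector.forward([vision_features])[0]

# 3. Embed the prompt tokens and concatenate with the image embeddings.
text_embeds = token_embedding.forward([input_ids])[0]
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

# 4. Prefill: run the full sequence once; assume the first output is the final
#    hidden states and the remaining outputs are the KV cache tensors.
prefill_out = prefill_decoder.forward([inputs_embeds])
hidden, kv_cache = prefill_out[0], prefill_out[1:]

# 5-6. One decoding step: project the last hidden state to vocabulary logits
#      and greedily pick the next token. A full loop would embed next_token and
#      call decode_decoder.forward(...) with the cached keys/values, repeating
#      until an end-of-sequence token appears.
logits = lm_head.forward([hidden[:, -1:, :]])[0]
next_token = torch.argmax(logits, dim=-1)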

Performance

Test Results:

  • ✅ All forward pass tests passed
  • ✅ Full inference test with image splitting (17 images, 4×4 grid)
  • ✅ Generated coherent captions

Example output:

"A close-up photograph captures a tabby cat with a focused gaze, sitting on a patterned surface. The cat's fur exhibits a mix of dark..."

Quantization Impact:

  • Size reduction: 5.3x smaller (528 MB vs 2.8 GB)
  • Accuracy: Minimal loss with int8 weight-only quantization
  • Optimized for on-device deployment

Model Architecture

  • Vision Encoder: SigLIP-B/16 (ViT, 768 hidden dim, 512×512 patches)
  • Language Model: SmolLM2-135M (576 hidden dim, 30 blocks, 8192 context)
  • Modality Projector: Pixel shuffle + linear (64 image tokens per 512×512 patch; see the sketch after this list)
  • Image Resolution: Up to 2048×2048 with automatic grid splitting
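
The 64-token figure follows from the patch arithmetic: a 512×512 tile with 16×16 patches yields a 32×32 grid (1,024 patch features); folding each 4×4 neighborhood into the channel dimension via pixel shuffle leaves an 8×8 grid of 64 tokens with 768×16 = 12,288 channels, which a linear layer maps to the 576-dim language space. The sketch below illustrates that shape flow (the shuffle factor of 4 and the exact memory layout are assumptions inferred from the numbers above, not the exported module's code).

import torch
import torch.nn as nn

hidden_dim, lm_dim, grid, factor = 768, 576, 32, 4  # 512 / 16 = 32 patches per side; factor 4 assumed

def pixel_shuffle_project(x, proj):
    # x: (batch, grid*grid, hidden_dim) vision features for one 512x512 tile
    b, n, c = x.shape
    s = int(n ** 0.5)                                 # 32
    x = x.view(b, s, s, c)
    # Fold each factor-by-factor block of patches into the channel dimension.
    x = x.view(b, s // factor, factor, s // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (s // factor) ** 2, factor * factor * c)
    return proj(x)                                    # (batch, 64, lm_dim)

proj = nn.Linear(hidden_dim * factor * factor, lm_dim)  # 12288 -> 576
tokens = pixel_shuffle_project(torch.randn(1, grid * grid, hidden_dim), proj)
print(tokens.shape)  # torch.Size([1, 64, 576])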

Export Details

Exported using:

python export_executorch.py --checkpoint lusxvr/nanoVLM --output_dir executorch_models --quantize

  • Quantization Method: int8 weight-only using torchao (sketched below)
  • ExecuTorch Version: Compatible with PyTorch 2.x ExecuTorch runtime
  • Input Constraints: Fixed 512×512 image size per patch
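
For reference, int8 weight-only quantization with torchao generally follows the pattern below. This is a sketch of the public torchao API, not the exact code in export_executorch.py; the stand-in module is purely illustrative.

import torch
from torchao.quantization import quantize_, int8_weight_only

# Stand-in for the eager-mode nanoVLM submodule that would be quantized
# before torch.export / ExecuTorch lowering.
model = torch.nn.Sequential(torch.nn.Linear(576, 576))

# Replace the weights of supported layers (nn.Linear) with int8 weight-only
# quantized versions; activations remain in floating point.
quantize_(model, int8_weight_only())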

License

Apache 2.0
