# nanoVLM ExecuTorch (Quantized)
This repository contains the ExecuTorch (.pte) export of nanoVLM, optimized for on-device inference.
## Model Details

- Base Model: lusxvr/nanoVLM-230M-8k
- Format: ExecuTorch `.pte` (optimized for deployment)
- Quantization: int8 weight-only quantization
- Total Size: ~515 MB (5.3x smaller than the unquantized model)
- Components: 6 separate `.pte` files
## Files

- `vision_encoder.pte` - Vision encoder (SigLIP-B/16)
- `modality_projector.pte` - Projects vision features to language space
- `language_decoder_prefill.pte` - Language decoder prefill phase
- `language_decoder_decode.pte` - Language decoder decode phase with KV cache
- `token_embedding.pte` - Token embedding lookup
- `lm_head.pte` - Language model output head
- `config.json` - Model configuration
## Quick Start

```bash
# Install dependencies
pip install executorch torch pillow transformers

# Download model
huggingface-cli download infil00p/nanoVLM-230M-8k-executorch --local-dir executorch_models

# Clone nanoVLM repo for test script
git clone https://github.com/huggingface/nanoVLM
cd nanoVLM

# Run inference test
python test_executorch_pte.py --model_dir ../executorch_models --image assets/image.png
```
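The files can also be fetched programmatically with `huggingface_hub` instead of the CLI; this is a small equivalent of the download step above:

```python
from huggingface_hub import snapshot_download

# Downloads all .pte files plus config.json into ./executorch_models
snapshot_download(
    repo_id="infil00p/nanoVLM-230M-8k-executorch",
    local_dir="executorch_models",
)
```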
## Usage Example

```python
from executorch.extension.pybindings.portable_lib import _load_for_executorch
import torch

# Load models
vision_encoder = _load_for_executorch("vision_encoder.pte")
modality_projector = _load_for_executorch("modality_projector.pte")
prefill_decoder = _load_for_executorch("language_decoder_prefill.pte")
decode_decoder = _load_for_executorch("language_decoder_decode.pte")
token_embedding = _load_for_executorch("token_embedding.pte")
lm_head = _load_for_executorch("lm_head.pte")

# Run inference (see test_executorch_pte.py for full example)
# 1. Encode image with vision_encoder
# 2. Project with modality_projector
# 3. Combine with text embeddings from token_embedding
# 4. Run prefill_decoder for initial KV cache
# 5. Autoregressive decode with decode_decoder
# 6. Get logits with lm_head
```
For a complete working implementation, see test_executorch_pte.py in the nanoVLM repository.
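The snippet below sketches how the first three of those steps could be wired together. Treat it as illustrative only: the tensor shapes, the placeholder preprocessing and token ids, and the assumption that each exported program is called via `module.forward([...])` are assumptions, and the prefill/decode signatures (KV-cache layout, position handling) are defined by the export script, so `test_executorch_pte.py` remains the reference.

```python
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

vision_encoder = _load_for_executorch("vision_encoder.pte")
modality_projector = _load_for_executorch("modality_projector.pte")
token_embedding = _load_for_executorch("token_embedding.pte")

# 1. One 512x512 patch; replace this random tensor with real preprocessing
#    (resize + normalize) from the nanoVLM image processor.
pixels = torch.randn(1, 3, 512, 512)

# 2. Vision features -> language-model embedding space.
vision_feats = vision_encoder.forward([pixels])[0]            # assumed (1, N, 768)
image_embeds = modality_projector.forward([vision_feats])[0]  # assumed (1, 64, 576)

# 3. Embed prompt tokens (placeholder ids) and prepend the image tokens.
prompt_ids = torch.tensor([[1, 2, 3]], dtype=torch.long)
text_embeds = token_embedding.forward([prompt_ids])[0]
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

# 4-6. Prefill, autoregressive decode, and lm_head follow the same
#      module.forward([...]) pattern; see test_executorch_pte.py for the
#      exact KV-cache inputs and outputs expected by those programs.
```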
## Performance

Test Results:
- ✅ All forward pass tests passed
- ✅ Full inference test with image splitting (17 images, 4x4 grid)
- ✅ Generated coherent captions
Example output:
"A close-up photograph captures a tabby cat with a focused gaze, sitting on a patterned surface. The cat's fur exhibits a mix of dark..."
Quantization Impact:
- Size reduction: 5.3x smaller (528 MB vs 2.8 GB)
- Accuracy: Minimal loss with int8 weight-only quantization
- Optimized for on-device deployment
## Model Architecture

- Vision Encoder: SigLIP-B/16 (ViT, 768 hidden dim, 512×512 patches)
- Language Model: SmolLM2-135M (576 hidden dim, 30 blocks, 8192 context)
- Modality Projector: Pixel shuffle + linear, 64 image tokens per 512×512 patch (see the sketch after this list)
- Image Resolution: Up to 2048×2048 with automatic grid splitting
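For intuition on the token count: a 512×512 patch with 16×16 pixel patches yields 32×32 = 1024 vision tokens, and a pixel shuffle with factor 4 folds each 4×4 neighborhood into one token, leaving 8×8 = 64 tokens that a linear layer maps into the 576-dim language space. A minimal sketch of that idea (not the actual nanoVLM module, which lives in the GitHub repository):

```python
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):
    """Hypothetical sketch of a pixel-shuffle modality projector:
    (B, 1024, 768) vision tokens -> (B, 64, 576) language tokens."""
    def __init__(self, vit_dim=768, lm_dim=576, factor=4):
        super().__init__()
        self.factor = factor
        self.proj = nn.Linear(vit_dim * factor * factor, lm_dim)

    def forward(self, x):                       # x: (B, 1024, 768)
        b, n, d = x.shape
        s = int(n ** 0.5)                       # 32x32 token grid
        f = self.factor
        x = x.view(b, s // f, f, s // f, f, d)  # split grid into f x f blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (s // f) ** 2, d * f * f)
        return self.proj(x)                     # (B, 64, 576)

tokens = torch.randn(1, 1024, 768)
print(PixelShuffleProjector()(tokens).shape)    # torch.Size([1, 64, 576])
```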
## Export Details

Exported using:

```bash
python export_executorch.py --checkpoint lusxvr/nanoVLM --output_dir executorch_models --quantize
```

- Quantization Method: int8 weight-only using `torchao` (illustrated below)
- ExecuTorch Version: Compatible with PyTorch 2.x ExecuTorch runtime
- Input Constraints: Fixed 512×512 image size per patch
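As a hedged illustration of int8 weight-only quantization with `torchao` (recent torchao releases expose `quantize_` with `int8_weight_only`; the actual export script may use a different quantization flow before lowering to ExecuTorch):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# Stand-in module; in the real export this would be the nanoVLM language model.
model = torch.nn.Sequential(torch.nn.Linear(576, 2304), torch.nn.Linear(2304, 576))

# Swap eligible Linear weights for int8 representations (weights only;
# activations stay in floating point), then proceed with torch.export / lowering.
quantize_(model, int8_weight_only())
```

Because only the stored weights are quantized and compute still runs in floating point, this scheme trades some compression headroom for the minimal accuracy loss noted above.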
## Related Links

- Original Model: [lusxvr/nanoVLM-230M-8k](https://huggingface.co/lusxvr/nanoVLM-230M-8k)
- GitHub Repository: [huggingface/nanoVLM](https://github.com/huggingface/nanoVLM)
- ExecuTorch: [pytorch.org/executorch](https://pytorch.org/executorch)
## License
Apache 2.0