# MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

## Model Overview
MicroVLM-V is a compact vision-language model (~215 MB) that combines:
- Vision Encoder: DeiT-Tiny (5.7M params)
- Language Model: Qwen2.5-0.5B (4-bit quantized, 315M params)
- Alignment: FIBER fusion at layers [6, 8, 10]
- Episodic Memory: Larimar GPM (512 slots, 4.8M params)
Checkpoint: `best` (best-alignment checkpoint); a sketch of how these components fit together follows below.
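The checkpoint alone does not document the wiring between these components, so the following is a minimal, illustrative sketch of the intended data flow (class and argument names are placeholders, not the repository's actual API): DeiT-Tiny patch features are projected by the multimodal adapter, passed through the Larimar-style episodic memory, and consumed by FIBER cross-attention inside the frozen Qwen2.5 backbone.

```python
import torch.nn as nn

class MicroVLMVSketch(nn.Module):
    """Illustrative data flow only; the real classes live in euhidaman/MicroVLM-V."""

    def __init__(self, vision_encoder, fused_language_model, adapter, memory):
        super().__init__()
        self.vision_encoder = vision_encoder  # DeiT-Tiny (FP16)
        self.fused_lm = fused_language_model  # Qwen2.5-0.5B, frozen 4-bit, FIBER fusion at layers 6/8/10
        self.adapter = adapter                # trainable multimodal projection
        self.memory = memory                  # Larimar GPM, 512 slots

    def forward(self, images, input_ids, attention_mask):
        vision_feats = self.vision_encoder(images)   # patch features
        vision_tokens = self.adapter(vision_feats)   # project to LM hidden width
        vision_tokens = self.memory(vision_tokens)   # episodic read/write
        # FIBER-style cross-attention inside the backbone attends to vision_tokens
        return self.fused_lm(input_ids, attention_mask, vision_tokens)
```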
## Model Architecture

### Parameter Distribution
| Component | Total Parameters | Trainable | Status |
|---|---|---|---|
| Total Model | 334.5M | 13.8M | 4.1% trainable |
| Vision Encoder | 8.8M | 3.3M | FIBER fusion trainable |
| Language Model | 315.1M | 0 | Frozen (4-bit) |
| Multimodal Adapter | 5.0M | 5.0M | Fully trainable |
| Episodic Memory | 4.8M | 4.8M | Fully trainable |
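The split above can be sanity-checked directly from the checkpoint. A minimal sketch that groups stored tensors by their top-level module prefix (the exact prefix names depend on how the repository names its submodules):

```python
import torch
from collections import defaultdict

checkpoint = torch.load('model.pt', map_location='cpu')
state = checkpoint['model_state_dict']

# Count stored elements per top-level module (prefix before the first '.')
counts = defaultdict(int)
for name, tensor in state.items():
    counts[name.split('.')[0]] += tensor.numel()

# Note: 4-bit weights are stored packed, so the frozen language model reports
# fewer stored elements here than its logical 315M parameters.
for module, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{module:25s} {n / 1e6:8.1f}M")
```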
### Quantization Status
| Component | Quantization |
|---|---|
| Vision Encoder | FP16 |
| Language Model | 4-bit |
| Episodic Memory | FP32 |
Estimated Model Size: ~214.6 MB
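Loading the 4-bit language model requires `bitsandbytes`. Below is a minimal sketch of loading the Qwen2.5-0.5B backbone in 4-bit via Hugging Face `transformers`; the exact quantization settings used for this checkpoint (e.g. NF4 vs. FP4) are assumptions, so check `config.json` if they differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed 4-bit configuration; adjust to match the repository's config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

language_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    quantization_config=bnb_config,
    device_map="auto",
)
```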
## Training Details

### Configuration
- Dataset: CC12M (Conceptual 12M) - 3M training samples
- Batch Size: 512
- Training Time: ~0.64 hours on 2x A100 80GB
- Throughput: ~332 samples/sec
- Total FLOPs: 2088 PFLOPs
### FIBER Alignment
- Mode: Fusion-in-Backbone (FIBER-style)
- Fusion Layers: [6, 8, 10]
- ITC Weight: 1.0
- ITM Weight: 0.5
- ITC Queue Size: 256
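For reference, the two alignment objectives are combined with the weights listed above. A minimal sketch of the combined loss (simplified to in-batch ITC without the 256-entry queue; names are illustrative, not the repository's):

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, itm_logits, itm_labels,
                   itc_weight=1.0, itm_weight=0.5, temperature=0.07):
    """Weighted sum of image-text contrastive (ITC) and matching (ITM) losses."""
    # ITC: symmetric InfoNCE over the in-batch similarity matrix
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = (F.cross_entropy(logits, targets) +
           F.cross_entropy(logits.t(), targets)) / 2

    # ITM: binary matched/mismatched classification on fused pair representations
    itm = F.cross_entropy(itm_logits, itm_labels)

    return itc_weight * itc + itm_weight * itm
```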
### Training Metrics (Best Checkpoint)
- Best Alignment Similarity: 0.0249 (step 25)
- Final ITM Loss: ~0.53
- Final Token Loss: ~0.056
- Training stopped early at step 1500 (alignment plateau)
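As a rough cross-check, these figures are mutually consistent: 1,500 steps × 512 samples/step ≈ 768k samples, which at ~332 samples/sec comes to roughly 0.64 hours, matching the reported training time.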
## Usage

### Loading the Model
```python
import torch

# Load the checkpoint on CPU
checkpoint = torch.load('model.pt', map_location='cpu')

# Access the model state dict
model_state = checkpoint['model_state_dict']

# Inspect training metadata stored alongside the weights
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")
```
### Inference Example
```python
import torch
import torchvision.transforms as transforms
from PIL import Image
from transformers import AutoTokenizer

# Standard ImageNet preprocessing for the DeiT-Tiny encoder (224x224 input)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Tokenize the paired text; the base LM is Qwen2.5-0.5B, so its tokenizer is
# assumed here -- adjust if the repository ships its own tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokens = tokenizer("a photo of a dog", return_tensors="pt")

# Forward pass (after loading the model as shown above)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask'],
    )
```
## Repository Contents
- `model.pt` - Best alignment checkpoint
- `statistics.json` - Training statistics
- `config.json` - Model configuration
- `README.md` - This model card
## Requirements
```bash
# Quotes keep the shell from treating ">=" as output redirection
pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm          # DeiT vision encoder
pip install bitsandbytes  # 4-bit quantization
```
## License

Apache 2.0

## Links
- GitHub Repository: euhidaman/MicroVLM-V
- Branch: FocusedAttention
## Limitations
- This is the Stage 1 alignment checkpoint, focused on vision-language alignment
- Best suited for image-text matching and alignment tasks
- Further fine-tuning may be needed for generation tasks
Uploaded: 2025-12-08 14:53:01 UTC