MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

📋 Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:

  • Vision Encoder: DeiT-Tiny (5.7M params)
  • Language Model: Qwen2.5-0.5B (4-bit quantized, 315M params)
  • Alignment: FIBER fusion at layers [6, 8, 10]
  • Episodic Memory: Larimar GPM (512 slots, 4.8M params)

Checkpoint: best (checkpoint with the best alignment similarity)
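
The episodic memory is a fixed bank of 512 slots that the rest of the model can read from. As a rough illustration only (this is not the exact Larimar GPM read/write mechanism, and the slot dimension of 256 is an assumed value), a slot-memory read can be sketched as attention over a learned slot matrix:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotMemory(nn.Module):
    """Illustrative 512-slot episodic memory read (not the actual Larimar GPM)."""
    def __init__(self, num_slots=512, dim=256):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learned memory bank

    def forward(self, query):                             # query: (batch, dim)
        attn = F.softmax(query @ self.slots.t(), dim=-1)  # (batch, num_slots) addressing weights
        return attn @ self.slots                          # (batch, dim) retrieved memory

memory = SlotMemory()
readout = memory(torch.randn(4, 256))                     # -> shape (4, 256)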


📊 Model Architecture

Parameter Distribution

Component            Total Parameters   Trainable   Status
Total Model          334.5M             13.8M       4.1% trainable
Vision Encoder       8.8M               3.3M        FIBER fusion trainable
Language Model       315.1M             0           Frozen (4-bit)
Multimodal Adapter   5.0M               5.0M        Fully trainable
Episodic Memory      4.8M               4.8M        Fully trainable
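
Once the model object is instantiated (the model class itself is not shipped in this card), the split in the table above can be reproduced with a small helper:

def count_parameters(model):
    # Report total vs. trainable parameter counts, as in the table above.
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total: {total / 1e6:.1f}M | Trainable: {trainable / 1e6:.1f}M "
          f"({100 * trainable / total:.1f}%)")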

Quantization Status

Component         Quantization
Vision Encoder    FP16
Language Model    4-bit ✓
Episodic Memory   FP32

Estimated Model Size: ~214.6 MB
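
The 4-bit language model is the kind of component typically loaded through bitsandbytes via transformers. A minimal sketch follows; the NF4 quantization type and FP16 compute dtype are assumptions, since the exact quantization config used for this checkpoint is not recorded in this card:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed 4-bit settings; the checkpoint's actual quantization config may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
language_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    quantization_config=bnb_config,
)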


🏋️ Training Details

Configuration

  • Dataset: CC12M (Conceptual 12M), 3M-sample training subset
  • Batch Size: 512
  • Training Time: ~0.64 hours on 2× A100 80GB
  • Throughput: ~332 samples/sec
  • Total FLOPs: 2,088 PFLOPs
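
These figures are mutually consistent: 1,500 optimizer steps (see the early-stopping note below) at batch size 512 amount to roughly 768k samples, which matches the reported wall-clock time at ~332 samples/sec:

steps, batch_size, throughput = 1500, 512, 332   # values reported on this card
samples_seen = steps * batch_size                # 768,000 samples
hours = samples_seen / throughput / 3600         # ~0.64 hours, matching the figure above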

FIBER Alignment

  • Mode: Fusion-in-Backbone (FIBER-style)
  • Fusion Layers: [6, 8, 10]
  • ITC Weight: 1.0
  • ITM Weight: 0.5
  • ITC Queue Size: 256
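
For intuition, the ITC/ITM weights above correspond to a combined objective of the form sketched below. This is a minimal, generic formulation: the temperature value is an assumption, and the memory-queue handling and negative sampling for ITM are omitted:

import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE (image-text contrastive) over in-batch or queued pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def itm_loss(match_logits, match_labels):
    # Binary image-text matching head over fused features (matched vs. mismatched pairs).
    return F.cross_entropy(match_logits, match_labels)

def alignment_loss(image_emb, text_emb, match_logits, match_labels,
                   itc_weight=1.0, itm_weight=0.5):
    return (itc_weight * itc_loss(image_emb, text_emb)
            + itm_weight * itm_loss(match_logits, match_labels))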

Training Metrics (Best Checkpoint)

  • Best Alignment Similarity: 0.0249 (reached at step 25)
  • Final ITM Loss: ~0.53
  • Final Token Loss: ~0.056
  • Training stopped: early stopping at step 1500 (alignment similarity plateaued)

💻 Usage

Loading the Model

import torch

# Load checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')

# Access model state dict
model_state = checkpoint['model_state_dict']

# Get training info
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")

Inference Example

import torch
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Prepare image (ImageNet normalization, 224x224 input for DeiT-Tiny)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Tokenize the paired text; the base Qwen2.5-0.5B tokenizer is assumed here
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
tokens = tokenizer('a photo of a cat', return_tensors='pt')

# Forward pass (model must already be loaded, see "Loading the Model" above)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )

📁 Repository Contents

  • model.pt - Best alignment checkpoint
  • statistics.json - Training statistics
  • config.json - Model configuration
  • README.md - This model card
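
The training statistics and model configuration ship as plain JSON, so they can be inspected without loading the checkpoint (the exact field names are not documented in this card):

import json

# Load the training statistics and model configuration shipped with the checkpoint.
with open('statistics.json') as f:
    stats = json.load(f)
with open('config.json') as f:
    config = json.load(f)

print(sorted(stats.keys()))
print(sorted(config.keys()))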

⚙️ Requirements

pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm  # For DeiT vision encoder
pip install bitsandbytes  # For 4-bit quantization

📜 License

Apache 2.0 License


⚠️ Limitations

  • This is the Stage 1 alignment checkpoint; it focuses on vision-language alignment rather than generation
  • Best for: image-text matching and alignment tasks
  • May need further fine-tuning for generation tasks

Uploaded: 2025-12-08 14:53:01 UTC
