MicroVLM-V: Vision-Language Model with FIBER Alignment & Episodic Memory

📋 Model Overview

MicroVLM-V is a compact vision-language model (~215 MB) that combines:

  • Vision Encoder: DeiT-Tiny (5.7M params)
  • Language Model: Qwen2.5-0.5B (4-bit quantized, 315M params)
  • Alignment: FIBER fusion at layers [6, 8, 10]
  • Episodic Memory: Larimar GPM (512 slots, 4.8M params)

Checkpoint: best (checkpoint with the best alignment similarity)
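
The episodic memory is a fixed bank of 512 slots that the rest of the model can read from. As a rough illustration only (this is not the exact Larimar GPM read/write mechanism, and the slot dimension of 256 is an assumed value), a slot-memory read can be sketched as attention over a learned slot matrix:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotMemory(nn.Module):
    """Illustrative 512-slot episodic memory read (not the actual Larimar GPM)."""
    def __init__(self, num_slots=512, dim=256):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learned memory bank

    def forward(self, query):                             # query: (batch, dim)
        attn = F.softmax(query @ self.slots.t(), dim=-1)  # (batch, num_slots) addressing weights
        return attn @ self.slots                          # (batch, dim) retrieved memory

memory = SlotMemory()
readout = memory(torch.randn(4, 256))                     # -> shape (4, 256)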


📊 Model Architecture

Parameter Distribution

Component            Total Parameters   Trainable   Status
Total Model          334.5M             13.8M       4.1% trainable
Vision Encoder       8.8M               3.3M        FIBER fusion trainable
Language Model       315.1M             0           Frozen (4-bit)
Multimodal Adapter   5.0M               5.0M        Fully trainable
Episodic Memory      4.8M               4.8M        Fully trainable
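
Once the model object is instantiated (the model class itself is not shipped in this card), the split in the table above can be reproduced with a small helper:

def count_parameters(model):
    # Report total vs. trainable parameter counts, as in the table above.
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total: {total / 1e6:.1f}M | Trainable: {trainable / 1e6:.1f}M "
          f"({100 * trainable / total:.1f}%)")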

Quantization Status

Component         Quantization
Vision Encoder    FP16
Language Model    4-bit ✓
Episodic Memory   FP32

Estimated Model Size: ~214.6 MB
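
The 4-bit language model is the kind of component typically loaded through bitsandbytes via transformers. A minimal sketch follows; the NF4 quantization type and FP16 compute dtype are assumptions, since the exact quantization config used for this checkpoint is not recorded in this card:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed 4-bit settings; the checkpoint's actual quantization config may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
language_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    quantization_config=bnb_config,
)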


🏋️ Training Details

Configuration

  • Dataset: CC12M (Conceptual 12M), 3M-sample training subset
  • Batch Size: 512
  • Training Time: ~0.64 hours on 2× A100 80GB
  • Throughput: ~332 samples/sec
  • Total FLOPs: 2,088 PFLOPs
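
These figures are mutually consistent: 1,500 optimizer steps (see the early-stopping note below) at batch size 512 amount to roughly 768k samples, which matches the reported wall-clock time at ~332 samples/sec:

steps, batch_size, throughput = 1500, 512, 332   # values reported on this card
samples_seen = steps * batch_size                # 768,000 samples
hours = samples_seen / throughput / 3600         # ~0.64 hours, matching the figure above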

FIBER Alignment

  • Mode: Fusion-in-Backbone (FIBER-style)
  • Fusion Layers: [6, 8, 10]
  • ITC Weight: 1.0
  • ITM Weight: 0.5
  • ITC Queue Size: 256
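
For intuition, the ITC/ITM weights above correspond to a combined objective of the form sketched below. This is a minimal, generic formulation: the temperature value is an assumption, and the memory-queue handling and negative sampling for ITM are omitted:

import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE (image-text contrastive) over in-batch or queued pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def itm_loss(match_logits, match_labels):
    # Binary image-text matching head over fused features (matched vs. mismatched pairs).
    return F.cross_entropy(match_logits, match_labels)

def alignment_loss(image_emb, text_emb, match_logits, match_labels,
                   itc_weight=1.0, itm_weight=0.5):
    return (itc_weight * itc_loss(image_emb, text_emb)
            + itm_weight * itm_loss(match_logits, match_labels))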

Training Metrics (Best Checkpoint)

  • Best Alignment Similarity: 0.0249 (reached at step 25)
  • Final ITM Loss: ~0.53
  • Final Token Loss: ~0.056
  • Training stopped: early stopping at step 1500 (alignment similarity plateaued)

💻 Usage

Loading the Model

import torch

# Load checkpoint
checkpoint = torch.load('model.pt', map_location='cpu')

# Access model state dict
model_state = checkpoint['model_state_dict']

# Get training info
print(f"Global step: {checkpoint.get('global_step', 'N/A')}")
print(f"Best alignment: {checkpoint.get('best_correct_sim', 'N/A')}")

Inference Example

import torch
from PIL import Image
import torchvision.transforms as transforms
from transformers import AutoTokenizer

# Prepare image (ImageNet normalization, 224x224 input for DeiT-Tiny)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

image = Image.open('example.jpg').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Tokenize the paired text; the base Qwen2.5-0.5B tokenizer is assumed here
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
tokens = tokenizer('a photo of a cat', return_tensors='pt')

# Forward pass (model must already be loaded, see "Loading the Model" above)
with torch.no_grad():
    outputs = model(
        images=image_tensor,
        input_ids=tokens['input_ids'],
        attention_mask=tokens['attention_mask']
    )

📁 Repository Contents

  • model.pt - Best alignment checkpoint
  • statistics.json - Training statistics
  • config.json - Model configuration
  • README.md - This model card
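
The training statistics and model configuration ship as plain JSON, so they can be inspected without loading the checkpoint (the exact field names are not documented in this card):

import json

# Load the training statistics and model configuration shipped with the checkpoint.
with open('statistics.json') as f:
    stats = json.load(f)
with open('config.json') as f:
    config = json.load(f)

print(sorted(stats.keys()))
print(sorted(config.keys()))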

⚙️ Requirements

pip install "torch>=2.0.0"
pip install "transformers>=4.30.0"
pip install timm  # For DeiT vision encoder
pip install bitsandbytes  # For 4-bit quantization

📜 License

Apache 2.0 License


⚠️ Limitations

  • This is the Stage 1 alignment checkpoint; it focuses on vision-language alignment rather than generation
  • Best for: image-text matching and alignment tasks
  • May need further fine-tuning for generation tasks

Uploaded: 2025-12-08 14:53:01 UTC
