MixLoRA-Qwen2VL: Multimodal Instruction Tuning with Conditional Mixture of LoRA
This model is a MixLoRA (Mixture of LoRA) adaptation of Qwen2-VL-7B, trained on 19 diverse multimodal datasets using continuous learning with label-based expert routing.
Model Description
- Base Model: Qwen2-VL-7B (7.7B parameters)
- Architecture: Conditional Mixture of Adapters (CMOA) with 8 LoRA experts
- Training Method: Continuous learning across 19 datasets with label-based expert selection
- Total Size: ~7.7B parameters (base) + ~83 MB (LoRA adapters)
- Expert Selection: Label-based routing (Uni, Syn, Red categories)
- LoRA Configuration: Rank 64, Alpha 16, Dropout 0.05
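For reference, the LoRA hyperparameters above map roughly onto a standard peft configuration. The sketch below is illustrative only; the target_modules list is an assumption (common Qwen2 attention projections), not a confirmed detail of this model's training setup.

```python
from peft import LoraConfig

# Sketch of the per-expert LoRA hyperparameters (rank 64, alpha 16, dropout 0.05).
# target_modules are assumed typical Qwen2 projection layers, not confirmed from training.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```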
Training Details
Datasets (19 total)
The model was trained continuously on 19 multimodal datasets, grouped into three categories:
Uni Datasets (7) → Experts [0, 1]
- screen2words - UI understanding
- decimer - Chemical structure recognition
- fer2013 - Facial emotion recognition
- ucmerced - Land use classification
- resisc45 - Remote sensing image classification
- inaturalist - Species identification
- enrico - Mobile UI component detection
Syn Datasets (6) → Experts [3, 4]
- hateful_memes - Multimodal hate speech detection
- ny_cartoon - Cartoon caption understanding
- memotion - Meme emotion analysis
- scienceqa - Science question answering
- memecap - Meme captioning
- mmimdb - Movie genre classification from posters
Red Datasets (6) → Experts [6, 7]
- vqarad - Medical visual question answering
- ok-vqa - Knowledge-based VQA
- path-vqa - Pathology VQA
- slake - Medical VQA
- nlvr - Natural Language for Visual Reasoning
- flickr30k - Image captioning
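The category-to-expert assignment above amounts to a fixed lookup. The sketch below is purely illustrative; the dictionaries and helper function are hypothetical, not part of the released training code.

```python
# Hypothetical sketch of label-based expert routing: each dataset category
# maps to a fixed pair of LoRA experts, as listed above.
CATEGORY_TO_EXPERTS = {
    "Uni": [0, 1],
    "Syn": [3, 4],
    "Red": [6, 7],
}

DATASET_TO_CATEGORY = {
    "screen2words": "Uni", "decimer": "Uni", "fer2013": "Uni", "ucmerced": "Uni",
    "resisc45": "Uni", "inaturalist": "Uni", "enrico": "Uni",
    "hateful_memes": "Syn", "ny_cartoon": "Syn", "memotion": "Syn",
    "scienceqa": "Syn", "memecap": "Syn", "mmimdb": "Syn",
    "vqarad": "Red", "ok-vqa": "Red", "path-vqa": "Red",
    "slake": "Red", "nlvr": "Red", "flickr30k": "Red",
}

def select_experts(dataset_name: str) -> list[int]:
    """Return the expert pair used for a given training dataset."""
    return CATEGORY_TO_EXPERTS[DATASET_TO_CATEGORY[dataset_name]]
```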
Training Configuration
- Method: Continuous learning (each dataset builds on previous)
- Batch Size: 4 per device
- Gradient Accumulation: 4 steps
- Learning Rate: 2e-4
- Epochs per Dataset: 1
- Total Training Time: ~15-20 hours (1x 40GB GPU)
- Sequence Length: 2048 tokens
- Vision Encoder: CLIP-ViT-Large-336
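As a rough guide, these settings might correspond to a transformers TrainingArguments like the one below. This is a sketch under stated assumptions; output_dir, bf16, and save_strategy are guesses rather than values taken from the actual training scripts.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; output_dir, bf16, and
# save_strategy are assumptions, not copied from the training scripts.
training_args = TrainingArguments(
    output_dir="./mixlora-qwen2vl-checkpoints",  # hypothetical path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,          # one epoch per dataset
    bf16=True,                   # assumed, matching the bfloat16 inference setup
    save_strategy="epoch",       # assumed: checkpoint after each dataset
)
```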
Key Features
- 8 LoRA Experts: Mixture of 8 specialized experts, selecting 2 per forward pass
- Label-Based Routing: Automatic expert selection based on dataset category
- Continuous Learning: Sequential training preserving knowledge across datasets
- Grayscale Image Support: Handles RGB, grayscale, and black/white images
- Multi-Task: Trained on VQA, captioning, classification, and reasoning tasks
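To illustrate the first two features, here is a conceptual sketch of how a mixture-of-LoRA linear layer could combine a frozen base projection with the two experts chosen by the router. This is an illustration only, not the released implementation; the class name and structure are hypothetical.

```python
import torch
import torch.nn as nn

class MixLoRALinearSketch(nn.Module):
    """Conceptual sketch: a frozen base linear layer plus 8 LoRA experts,
    of which two (chosen by the label-based router) are active per forward pass."""

    def __init__(self, base: nn.Linear, num_experts: int = 8, rank: int = 64, alpha: int = 16):
        super().__init__()
        self.base = base
        self.scaling = alpha / rank
        self.lora_A = nn.ModuleList(
            [nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts)]
        )
        self.lora_B = nn.ModuleList(
            [nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor, active_experts: list[int]) -> torch.Tensor:
        out = self.base(x)
        for idx in active_experts:  # e.g. [0, 1] for Uni inputs
            out = out + self.scaling * self.lora_B[idx](self.lora_A[idx](x))
        return out
```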
Usage
Loading the Model
```python
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
from peft import PeftModel
import torch

# Load base model (Qwen2-VL uses its dedicated class rather than AutoModelForCausalLM)
base_model_name = "Qwen/Qwen2-VL-7B"
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "sxj1215/mixlora-qwen2vl-19datasets")

# Load processor and tokenizer
processor = AutoProcessor.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained("sxj1215/mixlora-qwen2vl-19datasets")
```
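As an optional sanity check (assuming a standard PEFT setup), you can summarize how many adapter parameters were attached and switch the model to evaluation mode:

```python
# Optional sanity check: report adapter vs. base parameter counts, then disable dropout
model.print_trainable_parameters()
model.eval()
```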
Inference Example
```python
from PIL import Image
from qwen_vl_utils import process_vision_info
import torch

# Load image
image = Image.open("example.jpg")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode only the newly generated tokens (strip the prompt)
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated[0], skip_special_tokens=True)
print(response)
```
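Because the card advertises grayscale and black/white image support (and a related bug fix), a defensive pattern is to normalize images to RGB before passing them to the processor. This is a suggested precaution, not a requirement of the released code:

```python
# Suggested precaution: convert grayscale or black/white inputs to 3-channel RGB
image = Image.open("grayscale_example.png")  # hypothetical file
if image.mode != "RGB":
    image = image.convert("RGB")
```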
Model Architecture
```
Qwen2-VL-7B (Base)
├── Vision Encoder: CLIP-ViT-Large-336
├── MLP Projector: 2-layer with GELU
└── Language Model: Qwen2-7B with 8 LoRA Experts
    ├── Experts 0, 1: Uni tasks (7 datasets)
    ├── Experts 3, 4: Syn tasks (6 datasets)
    └── Experts 6, 7: Red tasks (6 datasets)
```
Expert Selection: Based on the dataset label (Uni/Syn/Red), inputs are automatically routed to the corresponding expert pair.
Training Methodology
Continuous Learning Strategy
- Dataset 1 (screen2words): Train from the base model → Save checkpoint
- Dataset 2 (decimer): Load checkpoint → Continue training → Save
- Datasets 3-19: Repeat, each building on all previous datasets
This ensures:
- Knowledge accumulation across all 19 datasets
- Reduced risk of catastrophic forgetting
- Each expert specializes in its category (Uni/Syn/Red)
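In outline, the loop looks like the sketch below; the dataset order and expert pairs come from this card, while train_on_dataset and the checkpoint handling are hypothetical placeholders.

```python
# Hypothetical outline of the continuous-learning loop described above.
DATASETS = [
    "screen2words", "decimer", "fer2013", "ucmerced", "resisc45", "inaturalist", "enrico",
    "hateful_memes", "ny_cartoon", "memotion", "scienceqa", "memecap", "mmimdb",
    "vqarad", "ok-vqa", "path-vqa", "slake", "nlvr", "flickr30k",
]

checkpoint = None  # the first dataset starts from the base model
for name in DATASETS:
    experts = select_experts(name)      # label-based expert pair (see earlier sketch)
    checkpoint = train_on_dataset(      # hypothetical training helper
        dataset=name,
        active_experts=experts,
        resume_from=checkpoint,         # continue from the previous checkpoint
        epochs=1,
    )
```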
Bug Fixes Applied
8 critical bugs were fixed during development:
- HuggingFace Hub version compatibility
- TrainerControl initialization
- Checkpoint state_dict validation
- Resume logic for continuous training
- Variable scope issues
- Optimizer state handling
- Critical: Continuous training checkpoint logic
- Critical: Grayscale/BW image processing
See training repository for full bug documentation.
Performance
The model demonstrates strong performance across diverse multimodal tasks:
- Visual Question Answering (multiple domains)
- Image Captioning
- Image Classification
- Visual Reasoning
- Meme Understanding
- Medical Image Analysis
- Scientific Reasoning
Specific benchmark scores coming soon
Limitations
- Trained primarily on English datasets
- May reflect biases present in the training data
- Performs best on tasks similar to the training datasets
- Requires ~16GB VRAM for inference (bfloat16)
Citation
If you use this model, please cite the original MixLoRA paper:
```bibtex
@article{shen2024multimodal,
  title={Multimodal Instruction Tuning with Conditional Mixture of LoRA},
  author={Shen, Ying and Xu, Zhiyang and Wang, Qifan and Cheng, Yu and Yin, Wenpeng and Huang, Lifu},
  journal={arXiv preprint arXiv:2402.15896},
  year={2024}
}
```
And the Qwen2-VL paper:
```bibtex
@article{qwen2vl,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
```
License
This model inherits the license from Qwen2-VL-7B. The LoRA adapters are released under Apache 2.0.
Model Card Authors
sxj1215
Training Details
- Trained by: sxj1215
- Training Framework: HuggingFace Transformers + PEFT
- Training Hardware: 1x 40GB GPU
- Training Duration: ~15-20 hours
- Date: November 2024
- Funding: z-lab