MixLoRA-Qwen2VL: Multimodal Instruction Tuning with Conditional Mixture of LoRA
This model is a MixLoRA (Mixture of LoRA) adaptation of Qwen2-VL-7B, trained on 19 diverse multimodal datasets using continuous learning with label-based expert routing.
Model Description
- Base Model: Qwen2-VL-7B (7.7B parameters)
- Architecture: Conditional Mixture of Adapters (CMOA) with 8 LoRA experts
- Training Method: Continuous learning across 19 datasets with label-based expert selection
- Total Size: ~7.7B parameters (base) + ~83 MB (LoRA adapters)
- Expert Selection: Label-based routing (Uni, Syn, Red categories)
- LoRA Configuration: Rank 64, Alpha 16, Dropout 0.05
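For reference, the LoRA hyperparameters above map roughly onto a standard peft configuration. The sketch below is illustrative only; the target_modules list is an assumption (common Qwen2 attention projections), not a confirmed detail of this model's training setup.

```python
from peft import LoraConfig

# Sketch of the per-expert LoRA hyperparameters (rank 64, alpha 16, dropout 0.05).
# target_modules are assumed typical Qwen2 projection layers, not confirmed from training.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```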
Training Details
Datasets (19 total)
The model was trained continuously on 19 multimodal datasets, grouped into three categories:
Uni Datasets (7) → Experts [0, 1]
- screen2words - UI understanding
- decimer - Chemical structure recognition
- fer2013 - Facial emotion recognition
- ucmerced - Land use classification
- resisc45 - Remote sensing image classification
- inaturalist - Species identification
- enrico - Mobile UI component detection
Syn Datasets (6) → Experts [3, 4]
- hateful_memes - Multimodal hate speech detection
- ny_cartoon - Cartoon caption understanding
- memotion - Meme emotion analysis
- scienceqa - Science question answering
- memecap - Meme captioning
- mmimdb - Movie genre classification from posters
Red Datasets (6) → Experts [6, 7]
- vqarad - Medical visual question answering
- ok-vqa - Knowledge-based VQA
- path-vqa - Pathology VQA
- slake - Medical VQA
- nlvr - Natural Language for Visual Reasoning
- flickr30k - Image captioning
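The category-to-expert assignment above amounts to a fixed lookup. The sketch below is purely illustrative; the dictionaries and helper function are hypothetical, not part of the released training code.

```python
# Hypothetical sketch of label-based expert routing: each dataset category
# maps to a fixed pair of LoRA experts, as listed above.
CATEGORY_TO_EXPERTS = {
    "Uni": [0, 1],
    "Syn": [3, 4],
    "Red": [6, 7],
}

DATASET_TO_CATEGORY = {
    "screen2words": "Uni", "decimer": "Uni", "fer2013": "Uni", "ucmerced": "Uni",
    "resisc45": "Uni", "inaturalist": "Uni", "enrico": "Uni",
    "hateful_memes": "Syn", "ny_cartoon": "Syn", "memotion": "Syn",
    "scienceqa": "Syn", "memecap": "Syn", "mmimdb": "Syn",
    "vqarad": "Red", "ok-vqa": "Red", "path-vqa": "Red",
    "slake": "Red", "nlvr": "Red", "flickr30k": "Red",
}

def select_experts(dataset_name: str) -> list[int]:
    """Return the expert pair used for a given training dataset."""
    return CATEGORY_TO_EXPERTS[DATASET_TO_CATEGORY[dataset_name]]
```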
Training Configuration
- Method: Continuous learning (each dataset builds on previous)
- Batch Size: 4 per device
- Gradient Accumulation: 4 steps
- Learning Rate: 2e-4
- Epochs per Dataset: 1
- Total Training Time: ~15-20 hours (1x 40GB GPU)
- Sequence Length: 2048 tokens
- Vision Encoder: CLIP-ViT-Large-336
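As a rough guide, these settings might correspond to a transformers TrainingArguments like the one below. This is a sketch under stated assumptions; output_dir, bf16, and save_strategy are guesses rather than values taken from the actual training scripts.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; output_dir, bf16, and
# save_strategy are assumptions, not copied from the training scripts.
training_args = TrainingArguments(
    output_dir="./mixlora-qwen2vl-checkpoints",  # hypothetical path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,          # one epoch per dataset
    bf16=True,                   # assumed, matching the bfloat16 inference setup
    save_strategy="epoch",       # assumed: checkpoint after each dataset
)
```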
Key Features
- 8 LoRA Experts: Mixture of 8 specialized experts, selecting 2 per forward pass
- Label-Based Routing: Automatic expert selection based on dataset category
- Continuous Learning: Sequential training preserving knowledge across datasets
- Grayscale Image Support: Handles RGB, grayscale, and black/white images
- Multi-Task: Trained on VQA, captioning, classification, and reasoning tasks
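To illustrate the first two features, here is a conceptual sketch of how a mixture-of-LoRA linear layer could combine a frozen base projection with the two experts chosen by the router. This is an illustration only, not the released implementation; the class name and structure are hypothetical.

```python
import torch
import torch.nn as nn

class MixLoRALinearSketch(nn.Module):
    """Conceptual sketch: a frozen base linear layer plus 8 LoRA experts,
    of which two (chosen by the label-based router) are active per forward pass."""

    def __init__(self, base: nn.Linear, num_experts: int = 8, rank: int = 64, alpha: int = 16):
        super().__init__()
        self.base = base
        self.scaling = alpha / rank
        self.lora_A = nn.ModuleList(
            [nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts)]
        )
        self.lora_B = nn.ModuleList(
            [nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor, active_experts: list[int]) -> torch.Tensor:
        out = self.base(x)
        for idx in active_experts:  # e.g. [0, 1] for Uni inputs
            out = out + self.scaling * self.lora_B[idx](self.lora_A[idx](x))
        return out
```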
Usage
Loading the Model
```python
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
from peft import PeftModel
import torch

# Load base model (Qwen2-VL uses its dedicated class rather than AutoModelForCausalLM)
base_model_name = "Qwen/Qwen2-VL-7B"
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "sxj1215/mixlora-qwen2vl-19datasets")

# Load processor and tokenizer
processor = AutoProcessor.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained("sxj1215/mixlora-qwen2vl-19datasets")
```
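As an optional sanity check (assuming a standard PEFT setup), you can summarize how many adapter parameters were attached and switch the model to evaluation mode:

```python
# Optional sanity check: report adapter vs. base parameter counts, then disable dropout
model.print_trainable_parameters()
model.eval()
```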
Inference Example
```python
from PIL import Image
from qwen_vl_utils import process_vision_info
import torch

# Load image
image = Image.open("example.jpg")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode only the newly generated tokens (strip the prompt)
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated[0], skip_special_tokens=True)
print(response)
```
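Because the card advertises grayscale and black/white image support (and a related bug fix), a defensive pattern is to normalize images to RGB before passing them to the processor. This is a suggested precaution, not a requirement of the released code:

```python
# Suggested precaution: convert grayscale or black/white inputs to 3-channel RGB
image = Image.open("grayscale_example.png")  # hypothetical file
if image.mode != "RGB":
    image = image.convert("RGB")
```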
Model Architecture
```
Qwen2-VL-7B (Base)
├── Vision Encoder: CLIP-ViT-Large-336
├── MLP Projector: 2-layer with GELU
└── Language Model: Qwen2-7B with 8 LoRA Experts
    ├── Experts 0, 1: Uni tasks (7 datasets)
    ├── Experts 3, 4: Syn tasks (6 datasets)
    └── Experts 6, 7: Red tasks (6 datasets)
```
Expert Selection: Based on the dataset label (Uni/Syn/Red), inputs are automatically routed to the corresponding expert pair.
Training Methodology
Continuous Learning Strategy
- Dataset 1 (screen2words): Train from the base model → Save checkpoint
- Dataset 2 (decimer): Load checkpoint → Continue training → Save
- Datasets 3-19: Repeat, each building on all previous datasets
This ensures:
- Knowledge accumulation across all 19 datasets
- Reduced risk of catastrophic forgetting
- Each expert specializes in its category (Uni/Syn/Red)
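In outline, the loop looks like the sketch below; the dataset order and expert pairs come from this card, while train_on_dataset and the checkpoint handling are hypothetical placeholders.

```python
# Hypothetical outline of the continuous-learning loop described above.
DATASETS = [
    "screen2words", "decimer", "fer2013", "ucmerced", "resisc45", "inaturalist", "enrico",
    "hateful_memes", "ny_cartoon", "memotion", "scienceqa", "memecap", "mmimdb",
    "vqarad", "ok-vqa", "path-vqa", "slake", "nlvr", "flickr30k",
]

checkpoint = None  # the first dataset starts from the base model
for name in DATASETS:
    experts = select_experts(name)      # label-based expert pair (see earlier sketch)
    checkpoint = train_on_dataset(      # hypothetical training helper
        dataset=name,
        active_experts=experts,
        resume_from=checkpoint,         # continue from the previous checkpoint
        epochs=1,
    )
```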
Bug Fixes Applied
8 critical bugs were fixed during development:
- HuggingFace Hub version compatibility
- TrainerControl initialization
- Checkpoint state_dict validation
- Resume logic for continuous training
- Variable scope issues
- Optimizer state handling
- Critical: Continuous training checkpoint logic
- Critical: Grayscale/BW image processing
See training repository for full bug documentation.
Performance
The model demonstrates strong performance across diverse multimodal tasks:
- Visual Question Answering (multiple domains)
- Image Captioning
- Image Classification
- Visual Reasoning
- Meme Understanding
- Medical Image Analysis
- Scientific Reasoning
Specific benchmark scores coming soon
Limitations
- Trained primarily on English datasets
- May reflect biases present in the training data
- Performs best on tasks similar to the training datasets
- Requires ~16GB VRAM for inference (bfloat16)
Citation
If you use this model, please cite the original MixLoRA paper:
```bibtex
@article{shen2024multimodal,
  title={Multimodal Instruction Tuning with Conditional Mixture of LoRA},
  author={Shen, Ying and Xu, Zhiyang and Wang, Qifan and Cheng, Yu and Yin, Wenpeng and Huang, Lifu},
  journal={arXiv preprint arXiv:2402.15896},
  year={2024}
}
```
And the Qwen2-VL paper:
```bibtex
@article{qwen2vl,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
```
License
This model inherits the license from Qwen2-VL-7B. The LoRA adapters are released under Apache 2.0.
Model Card Authors
sxj1215
Training Details
- Trained by: sxj1215
- Training Framework: HuggingFace Transformers + PEFT
- Training Hardware: 1x 40GB GPU
- Training Duration: ~15-20 hours
- Date: November 2024
- Funding: z-lab