Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
Abstract
DoGe is a dual-decoupling framework that enhances vision-language models by separating context learning (a Thinker) from problem solving (a Solver) to yield better-grounded reward signals, and by using an evolving curriculum-learning pipeline to improve data diversity.
Recent vision-language models (VLMs) achieve remarkable reasoning ability through reinforcement learning (RL), which offers a feasible path toward continuously self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially scarce in specialized domains such as chemistry, earth science, and multimodal mathematics. Existing remedies such as synthetic data and self-rewarding mechanisms suffer from narrow distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, policy entropy collapses, and training destabilizes. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to learn from context before solving problems, refocusing on the problem-context scenarios that synthetic-data methods overlook. First, by decoupling the learning process into two components, a Thinker and a Solver, we can quantify the reward signal for each and train with a two-stage RL post-training procedure that moves from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe builds an evolving curriculum-learning pipeline: an expanded native domain-knowledge corpus and an iteratively evolving pool of seed problems. Experiments show that DoGe consistently outperforms the baseline across diverse benchmarks, offering a scalable pathway toward self-evolving LVLMs.
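The abstract states the mechanism densely, so a minimal sketch may help. The Python sketch below illustrates the dual decoupling and the two-stage reward assignment under stated assumptions: `policy.generate_context`, `policy.solve`, and both reward functions are hypothetical placeholders (the paper does not publish this API), and the stage-1 reward is rendered as verifiable-fact coverage purely for illustration.

```python
# Minimal sketch of DoGe-style dual decoupling; all names are illustrative
# placeholders, not the authors' released implementation.
from dataclasses import dataclass

@dataclass
class Rollout:
    context: str   # Thinker output: an interpretation of the problem scenario
    answer: str    # Solver output: the final answer, conditioned on that context
    reward: float  # decoupled reward used for the RL update

def context_reward(context: str, reference_facts: list[str]) -> float:
    """Placeholder: fraction of verifiable scene facts the Thinker recovers."""
    return sum(f in context for f in reference_facts) / max(len(reference_facts), 1)

def answer_reward(answer: str, gold: str) -> float:
    """Placeholder: exact-match correctness, as in verifiable-reward RL."""
    return float(answer.strip() == gold.strip())

def doge_rollout(policy, sample: dict, stage: int) -> Rollout:
    # Thinker pass: describe and interpret the visual context, no solving yet.
    context = policy.generate_context(sample["image"], sample["question"])
    # Solver pass: answer the question conditioned on the Thinker's context.
    answer = policy.solve(sample["image"], sample["question"], context)
    if stage == 1:
        # Stage 1: free context exploration -- reward context fidelity only.
        reward = context_reward(context, sample["facts"])
    else:
        # Stage 2: practical task solving -- reward answer correctness.
        reward = answer_reward(answer, sample["gold"])
    return Rollout(context, answer, reward)
```

Splitting the reward this way is what lets the two stages be trained separately: stage 1 never sees the answer reward, so the Thinker cannot be hacked through answer-pattern shortcuts.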
Community
Experiment Results
We evaluate DoGe on 7 benchmarks covering:
- General visual reasoning & hallucination (MMMU, MMStar, HallBench)
- Specialized domain reasoning (MathVision, MathVista, ChemBench, MSEarthMCQ)
3B-level Models Performance
| Method | MMMU | MMStar | HallBench | MathVision | MathVista | ChemBench | MSEarthMCQ | Avg. |
|---|---|---|---|---|---|---|---|---|
| InternVL2.5-2B | 43.6 | 53.7 | 42.6 | 13.5 | 51.3 | - | - | - |
| Visionary-3B | 40.7 | 50.5 | 59.8 | 17.1 | 54.7 | 40.8 | 38.2 | 43.1 |
| Qwen2.5VL-3B* (Base) | 41.0 | 49.3 | 60.6 | 18.7 | 48.8 | 43.4 | 40.8 | 43.2 |
| DoGe-3B (Iter1) | 46.6 | 54.5 | 61.5 | 21.7 | 🔥57.9 | 45.8 | 🔥48.3 | 48.0 |
| DoGe-3B (Iter2) | 48.9 | 52.5 | 🔥62.5 | 23.1 | 54.2 | 🔥47.7 | 46.2 | 47.9 |
| DoGe-3B (Iter3) | 🔥50.2 | 🔥54.7 | 61.8 | 🔥24.2 | 57.0 | 46.9 | 47.3 | 🔥48.9 |
| ⬆️ Max Gain (vs. Base) | +9.2 | +5.4 | +1.9 | +5.5 | +9.1 | +4.3 | +7.5 | +5.7 |
7B-level Models Performance
| Method | MMMU | MMStar | HallBench | MathVision | MathVista | ChemBench | MSEarthMCQ | Avg. |
|---|---|---|---|---|---|---|---|---|
| InternVL2.5-8B | 48.9 | 62.8 | 50.1 | 22.0 | 64.4 | - | - | - |
| Vision-R1-7B | 46.9 | 60.8 | 66.7 | 🔥29.0 | 68.5 | 46.0 | 44.1 | 51.7 |
| Qwen2.5VL-7B* (Base) | 49.9 | 60.7 | 66.3 | 23.6 | 64.1 | 48.6 | 43.3 | 50.9 |
| DoGe-7B (Iter1) | 53.1 | 🔥63.2 | 54.4 | 24.3 | 62.1 | 48.7 | 46.4 | 50.3 |
| DoGe-7B (Iter2) | 50.9 | 60.0 | 🔥68.3 | 25.3 | 🔥68.8 | 🔥49.0 | 🔥46.5 | 52.7 |
| DoGe-7B (Iter3) | 🔥53.6 | 63.0 | 68.0 | 25.2 | 68.3 | 48.5 | 45.8 | 🔥53.2 |
| ⬆️ Max Gain (vs. Base) | +3.7 | +2.5 | +2.0 | +1.7 | +4.7 | +0.4 | +3.2 | +2.3 |
Key Takeaways ✨
- Stable Self-Evolution: DoGe improves average performance over three self-evolution iterations for both the 3B and 7B models (the iteration loop is sketched after this list)
- Domain Generalization:
  - 3B models: +5.7-point average gain over the base model across all benchmarks
  - 7B models: +2.3-point average gain, while maintaining an edge over strong baselines
- Hallucination Reduction: up to +2.0 points on HallBench, mitigating visual hallucination
- Data Efficiency: strong results in data-scarce domains (chemistry, earth science) where manual annotations are limited
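The iteration-over-iteration trends above come from DoGe's evolving curriculum: each round expands the native domain-knowledge corpus into new candidate problems and refreshes the seed pool before the next round of two-stage RL. The sketch below is a hedged reading of that loop; `synthesize_problems`, `solve_rate`, `rl_train`, and the 0.1–0.9 difficulty window are assumptions for illustration, not the paper's published criteria.

```python
# Hedged sketch of the self-evolution loop behind Iter1-Iter3; helper
# functions and the pool-filtering heuristic are placeholders.

def synthesize_problems(corpus, model):
    """Placeholder: draft new seed problems grounded in the domain corpus."""
    return [model.draft_problem(doc) for doc in corpus]

def solve_rate(model, problem, k: int = 8) -> float:
    """Placeholder: empirical pass rate over k sampled attempts."""
    return sum(model.attempt(problem) for _ in range(k)) / k

def rl_train(model, pool, stages=(1, 2)):
    """Placeholder for the two-stage RL post-training (context, then solving)."""
    return model

def evolve(model, seed_pool, corpus, n_iters: int = 3):
    for _ in range(n_iters):
        # Grow the pool with problems synthesized from the expanded corpus.
        candidates = seed_pool + synthesize_problems(corpus, model)
        # Keep frontier problems: neither trivially solved nor hopeless,
        # so the curriculum evolves along with the model's ability.
        seed_pool = [p for p in candidates if 0.1 < solve_rate(model, p) < 0.9]
        # Two-stage RL post-training on the refreshed pool.
        model = rl_train(model, seed_pool)
    return model
```

Re-filtering the pool against the current model each round is one plausible way to keep rewards informative as the model improves, which matches the stable gains reported across iterations.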
Related papers recommended by the Semantic Scholar API:
- VisPlay: Self-Evolving Vision-Language Models from Images (2025)
- EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards (2025)
- ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model (2025)
- Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs (2025)
- VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation (2025)
- Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start (2025)
- Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization (2025)
