Update README.md
---
license: apache-2.0
---

<div align='center'>
<h1>EVEv2: Improved Baselines for Encoder-Free Vision-Language Models</h1>
<h3><a href="https://xxxx">EVEv2: Improved Baselines for Encoder-Free Vision-Language Models</a></h3>

[Haiwen Diao*](https://scholar.google.com/citations?user=46eCjHQAAAAJ&hl=zh-CN), [Xiaotong Li*](https://scholar.google.com/citations?hl=zh-CN&user=cpCE_T4AAAAJ), [Yufeng Cui*](https://scholar.google.com/citations?user=5Ydha2EAAAAJ&hl=zh-CN&oi=ao), [Yueze Wang*](https://openreview.net/profile?id=~Yueze_Wang1), [Haoge Deng](https://scholar.google.com/citations?user=S2sbvjgAAAAJ&hl=zh-CN), [Ting Pan](https://scholar.google.com/citations?user=qQv6YbsAAAAJ&hl=zh-CN), [Wenxuan Wang](https://scholar.google.com/citations?hl=zh-CN&user=75OyC-oAAAAJ), [Huchuan Lu📧](https://scholar.google.com/citations?user=D3nE0agAAAAJ&hl=zh-CN), [Xinlong Wang📧](https://scholar.google.com/citations?user=DPz0DjYAAAAJ&hl=zh-CN)

Dalian University of Technology; Beijing Academy of Artificial Intelligence; Peking University;
Beijing University of Posts and Telecommunications; University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

| [Paper](https://xxxx) | [Code](https://github.com/baaivision/EVE) |
</div>

Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promise of monolithic multimodal systems that are structurally simple and efficient to deploy.
We systematically analyze the performance gap between VLMs built on pre-trained vision encoders, discrete tokenizers, and minimalist visual layers trained from scratch, examining in depth the under-explored characteristics of encoder-free VLMs. We thereby develop the most efficient strategy for building encoder-free VLMs that rival mainstream encoder-based ones.
After an in-depth investigation, we launch EVEv2.0, a new family of encoder-free VLMs that explores the potential of network architectures and training recipes for efficiently constructing multimodal models.
We demonstrate that: (i) sufficient modality-aware decomposition and hierarchical association within one unified model eliminate interference between vision and language; (ii) comprehensive optimization of the training recipe provides an effective training pathway tailored to encoder-free VLMs.
Extensive evaluation shows that EVEv2.0 represents a significant step towards a pure decoder-only architecture across modalities, demonstrating superior data-scaling efficiency and strong vision-reasoning capability.
## Model Weights

We release the instruction-tuned weights of **EVEv2**.

| Model name | Weight |
| ---------- | ------------------------------------------------------- |
| **EVE-7B-HD-v2.0** | [🤗 HF link](https://huggingface.co/BAAI/EVE-7B-HD-v2.0) (28GB) |
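
As a quick reference, the checkpoint can be pulled straight from the Hugging Face Hub. The snippet below is a minimal sketch using the `huggingface_hub` library, not an official EVE script; the destination directory is just an example, and the actual loading/inference code lives in the [Code](https://github.com/baaivision/EVE) repository.

```python
# Minimal sketch (not an official EVE utility): download the EVE-7B-HD-v2.0
# checkpoint (~28GB) from the Hugging Face Hub using huggingface_hub.
# Install first: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="BAAI/EVE-7B-HD-v2.0",              # repo from the table above
    local_dir="./checkpoints/EVE-7B-HD-v2.0",   # example destination; choose any path
)
print(f"Weights available at: {local_path}")
```

See the GitHub repository linked above for how to load and run the model on top of the downloaded weights.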
## ✒️ Citation

If **EVE** is helpful for your research, please consider giving it a **star** ⭐ and a **citation** 📝:

```bibtex
xxxx
```