eve-qwen2 · Safetensors
Paranioar committed (verified)
Commit e86d9e4 · 1 Parent(s): aaf4400

Update README.md

Files changed (1)
  1. README.md +7 -5
README.md CHANGED
@@ -14,11 +14,13 @@ Beijing University of Posts and Telecommunications; University of Chinese Academ
 | [Paper](https://xxxx) | [Code](https://github.com/baaivision/EVE) |
 </div>
 
-Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for monolithic multimodal systems with structural simplicity and efficient deployment.
-We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We thereby develop the most efficient strategy for developing encoder-free VLMs that rival mainstream encoder-based ones.
-After an in-depth investigation, we launch EVEv2.0, a new family of encoder-free VLMs that explore network potentials and training recipes for efficiently constructing multi-modality paradigms.
-We identify and evaluate that: (i) Sufficient modality-aware decomposition and hierarchical association inside one unified model eliminate interference between vision and language. (ii) Comprehensive optimization towards training recipe provides an effective training pathway tailored for encoder-free VLMs.
-On extensive evaluation, our EVEv2.0 represents a significant step towards developing a pure decoder-only architecture across modalities, demonstrating superior data-scaling efficiency and powerful vision-reasoning capability.
+Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment.
+We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.
+After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs.
+We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities.
+(ii) A well-designed training strategy enables effective optimization for encoder-free VLMs.
+Through extensive evaluation, our EVEv2.0 represents a thorough study of developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability.
+
 
 ## Model Weights
 
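As a rough illustration of claim (i) in the updated abstract, the sketch below shows one way modality-aware decomposition inside a single decoder block could look: self-attention is shared so vision and text tokens can interact, while each modality keeps its own normalization and feed-forward parameters. This is a minimal sketch under stated assumptions, not the EVEv2.0 implementation; all names (`ModalityAwareBlock`, `ffn_vis`, `ffn_txt`, `modality_mask`) are hypothetical.

```python
# Hypothetical sketch of modality-aware decomposition in one decoder block.
# Not taken from the EVE/EVEv2.0 codebase; names and structure are assumptions.
import torch
import torch.nn as nn


class ModalityAwareBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        # Shared self-attention lets vision and text tokens interact.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Modality-specific norms and feed-forward "experts" keep the two
        # modalities decomposed inside the same unified block.
        self.norm_vis = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)
        self.ffn_vis = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, modality_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_mask: (batch, seq), True for vision tokens.
        h, _ = self.attn(x, x, x, need_weights=False)
        x = x + h
        # For clarity both experts run on all tokens; each token keeps only
        # the output of its own modality's expert.
        m = modality_mask.unsqueeze(-1)
        out = torch.where(m, self.ffn_vis(self.norm_vis(x)), self.ffn_txt(self.norm_txt(x)))
        return x + out


# Usage: route each token through its modality's parameters.
block = ModalityAwareBlock(dim=64, n_heads=4)
tokens = torch.randn(2, 10, 64)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :4] = True  # pretend the first 4 tokens are vision tokens
print(block(tokens, mask).shape)  # torch.Size([2, 10, 64])
```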