MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer • Paper • 2509.16197 • Published Sep 19 • 56
LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence • Paper • 2509.12203 • Published Sep 15 • 19
A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning • Paper • 2509.15937 • Published Sep 19 • 20
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits • Paper • 2509.11362 • Published Sep 14 • 4