From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence Paper • 2511.18538 • Published 13 days ago • 238
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity Paper • 2511.03146 • Published Nov 5 • 7
RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization Paper • 2511.04285 • Published about 1 month ago • 7
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs Paper • 2511.07250 • Published 26 days ago • 17
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains Paper • 2511.10984 • Published 23 days ago • 4
VideoScore2: Think before You Score in Generative Video Evaluation Paper • 2509.22799 • Published Sep 26 • 25
Towards Personalized Deep Research: Benchmarks and Evaluations Paper • 2509.25106 • Published Sep 29 • 29
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution Paper • 2509.25301 • Published Sep 29 • 19
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation Paper • 2509.25849 • Published Sep 30 • 47
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs Paper • 2510.10689 • Published Oct 12 • 46
ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems Paper • 2510.11652 • Published Oct 13 • 28
Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures Paper • 2510.14616 • Published Oct 16 • 11
A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning Paper • 2510.12838 • Published Oct 13 • 24
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning Paper • 2509.13160 • Published Sep 16 • 29
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning Paper • 2509.02544 • Published Sep 2 • 124