Abstract
Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as multiple-choice question answering in benchmarks like VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified Elo rating system for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a "gold standard" from a carefully curated subset of human annotations, demonstrating that our arena aligns strongly with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity, pushing models toward more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline evaluation, we introduce VideoAutoBench as an auxiliary benchmark, in which human annotators label winners in a subset of VideoAutoArena battles; we then use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
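The abstract mentions a modified Elo rating system for ranking models from pairwise battles. The paper's specific modifications are not described here, so the sketch below shows only the textbook Elo update that such arena-style rankings build on; the K-factor and initial rating are illustrative assumptions, not values from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one battle.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    The K-factor of 32 is a conventional default, assumed here.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Example: two models start at an (assumed) base rating of 1200;
# model A wins one judged battle, so rating mass shifts from B to A.
ra, rb = elo_update(1200.0, 1200.0, score_a=1.0)
print(ra, rb)  # 1216.0 1184.0
```

Iterating this update over many automatically judged battles yields a leaderboard that can be extended continuously as new models enter the arena.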
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
| Editors | Cristina Ceballo |
| Place of Publication | Nashville |
| Publisher | IEEE |
| Pages | 8461-8474 |
| Number of pages | 14 |
| ISBN (Electronic) | 9798331543648 |
| ISBN (Print) | 9798331543655 |
| DOIs | |
| Publication status | Published - 10 Jun 2025 |
| Event | The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Music City Center, Nashville, United States. Duration: 11 Jun 2025 → 15 Jun 2025. Conference website: https://cvpr.thecvf.com/Conferences/2025 · Conference proceedings: https://ieeexplore.ieee.org/xpl/conhome/11091818/proceeding · Conference program: https://media.eventhosts.cc/Conferences/CVPR2025/CVPR_main_conf_2025.pdf |
Publication series
| Name | IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
|---|---|
| Publisher | IEEE |
| ISSN (Print) | 1063-6919 |
| ISSN (Electronic) | 2575-7075 |
Conference
| Conference | The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 |
|---|---|
| Abbreviated title | CVPR 2025 |
| Country/Territory | United States |
| City | Nashville |
| Period | 11/06/25 → 15/06/25 |
| Internet address | https://cvpr.thecvf.com/Conferences/2025 |
User-Defined Keywords
- video analysis
- benchmarks
- large multimodal models