VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

Abstract

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice question answering in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation—and due to the prohibitive cost and slow pace of human annotation for video tasks—we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena’s framework, designed to automatically assess LMMs’ video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a "gold standard" using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
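The abstract refers to a modified ELO Rating System for ranking models across battles but does not spell out the modification. As a rough illustration only, the sketch below shows a standard pairwise Elo update as it might be applied to a single battle between two LMMs; the K-factor, initial rating, and function names are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a standard Elo update for pairwise model battles.
# NOTE: the K-factor, initial rating, and function names are assumptions;
# VideoAutoArena uses a *modified* Elo system whose details are not given here.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one battle; score_a is 1 (A wins), 0.5 (tie), or 0 (B wins)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: start both models at 1000 and record a win for model A.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
```

In an arena setting such an update would be applied after each judged battle, so ratings remain comparable as new models and new battles are added over time.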
Original language: English
Title of host publication: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Editors: Cristina Ceballo
Place of publication: Nashville
Publisher: IEEE
Pages: 8461-8474
Number of pages: 14
ISBN (Electronic): 9798331543648
ISBN (Print): 9798331543655
DOIs
Publication status: Published - 10 Jun 2025
Event: The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Music City Center, Nashville, United States
Duration: 11 Jun 2025 – 15 Jun 2025
https://cvpr.thecvf.com/Conferences/2025 (Conference website)
https://ieeexplore.ieee.org/xpl/conhome/11091818/proceeding (Conference proceedings)
https://media.eventhosts.cc/Conferences/CVPR2025/CVPR_main_conf_2025.pdf (Conference program)

Publication series

Name: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
Abbreviated title: CVPR 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 – 15/06/25
Internet address

User-Defined Keywords

  • video analysis
  • benchmarks
  • large multimodal models
