MM-UAVBench

How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?

1Tsinghua University, 2Nankai University, 3Northwestern Polytechnical University, 4Chinese Academy of Sciences, 5Harbin Institute of Technology
*Equal Contribution, Corresponding author

Introduction

We introduce MM-UAVBench, a comprehensive benchmark designed to evaluate the perception, cognition, and planning abilities of multimodal large language models (MLLMs) in low-altitude UAV scenarios. It has three main characteristics:

  • Comprehensive Task Design. It comprises 19 tasks across three key capability dimensions and incorporates UAV-specific considerations, notably multi-level cognition (at the object, scene, and event levels) and planning that involves both aerial and ground agents.
  • Diverse Real-World Scenarios. We collect real-world UAV videos and images from diverse data sources, encompassing 1,549 video clips and 2,873 images with an average resolution of 1622 × 1033.
  • High-quality Human Annotations. We manually annotate 16 tasks, while 3 additional tasks are derived by rule-based transformation of the manual labels, yielding 5,702 multiple-choice QA pairs in total (a sketch of an assumed record layout follows this list).
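Each item is posed as a multiple-choice question over an image or a video clip. The snippet below is a minimal sketch of how such items might be represented and scored; the field names ("task", "media", "question", "options", "answer") and the JSON-lines layout are assumptions for illustration, not the benchmark's actual release format.

```python
# Hypothetical record layout and scoring for MM-UAVBench-style
# multiple-choice QA. All field names below are assumptions.
import json

def load_items(path):
    """Load QA items from a JSON-lines file (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def accuracy(items, predictions):
    """Fraction of items whose predicted option letter matches the answer key."""
    correct = sum(
        1 for item, pred in zip(items, predictions)
        if pred.strip().upper() == item["answer"].strip().upper()
    )
    return correct / max(len(items), 1)

if __name__ == "__main__":
    # Example record (hypothetical): one image-based perception question.
    example = {
        "task": "object_counting",          # one of the 19 tasks
        "media": "images/clip_000123.jpg",  # image or video-clip path
        "question": "How many vehicles are visible on the road?",
        "options": {"A": "2", "B": "4", "C": "6", "D": "8"},
        "answer": "B",
    }
    print(accuracy([example], ["B"]))  # -> 1.0
```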

Dataset Example

BibTeX

@article{dai2025mm,
  title={MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?},
  author={Dai, Shiqi and Ma, Zizhi and Luo, Zhicong and Yang, Xuesong and Huang, Yibin and Zhang, Wanyue and Chen, Chi and Guo, Zonghao and Xu, Wang and Sun, Yufei and others},
  journal={arXiv preprint arXiv:2512.23219},
  year={2025},
  url={https://arxiv.org/abs/2512.23219}
}