Towards Sparse Video Understanding and Reasoning

Published in CVPR, 2026

Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu.

Read the paper | Project page

Project Overview

Long videos often contain far more visual evidence than a language model needs to answer a question, but selecting the right evidence is difficult: the model must preserve the relevant moments while avoiding unnecessary visual context. This project studies sparse video understanding and reasoning, with the goal of identifying compact visual evidence that still supports accurate video-language reasoning.

The work focuses on how to select informative video segments, how sparse evidence affects downstream reasoning, and how to evaluate whether a model’s answer is grounded in the right parts of the video. The project page collects the paper, project materials, and updates for this line of work.

Key Ideas

  • Sparse visual evidence can make long-video reasoning more efficient and interpretable.
  • The selected evidence should preserve the information needed for answering video-language questions.
  • Evaluation should consider both final answer quality and whether the reasoning process uses relevant visual content.
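
As a rough illustration of the first and third ideas, the sketch below keeps a fixed budget of frames and then checks how much annotated evidence survives the cut. It is not the paper's method: the per-frame relevance scores, the helper names `select_sparse_frames` and `evidence_recall`, and the recall-style grounding measure are all hypothetical assumptions made for this example.

```python
from typing import List, Sequence, Set


def select_sparse_frames(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring frames, in temporal order.

    `scores` is assumed to hold one question-conditioned relevance score per
    frame; how such scores are produced is outside the scope of this sketch.
    """
    if budget <= 0:
        return []
    # Rank frame indices by descending score, then keep the top `budget`.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the reasoning model sees frames chronologically.
    return sorted(ranked[:budget])


def evidence_recall(selected: Sequence[int], relevant: Set[int]) -> float:
    """Fraction of annotated relevant frames that appear in the selection."""
    if not relevant:
        return 1.0  # nothing to recover, so the selection is trivially complete
    return len(relevant.intersection(selected)) / len(relevant)


# Toy example: six frames, a budget of three, two annotated evidence frames.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7]
selected = select_sparse_frames(scores, budget=3)
recall = evidence_recall(selected, relevant={1, 3})
print(selected, recall)
```

A measure like this only covers whether the evidence was kept. Answer accuracy would still need to be reported separately, since a model can answer correctly without using the relevant frames.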