Interpretability Track - The 3rd Perception Test Challenge ICCV 2025


Use Perception Test as a benchmark for VLM interpretability and win prizes!

Visualization of the attention maps of PerceptionLM, with four video frames as input, overlaid with ground-truth object tracks. Only high-activity layers are shown. The task is: "The person uses multiple similar objects to play an occlusion game. How many such objects does the person use?"

Overview

The goal of the Interpretability track is to encourage the development of techniques that provide insights into how state-of-the-art perception models make their decisions. Methods can vary in nature, e.g. behavioral/black-box analyses, mechanistic interpretability, or simply visualizations that explain why a model succeeds or fails at one or more Perception Test tasks.

How to participate

You can use any method you prefer, as long as it highlights convincingly how a model solves (or fails at solving) one or more tasks in the Perception Test.

You will be asked to submit a Colab notebook demonstrating your explanations or predictions on example videos from the Perception Test benchmark, as well as a short tech report (max 2 pages) describing the analysed models, how they were initially trained, and the methods used for interpretability and analysis.

Bonus points for work that leverages the different types of annotations in the Perception Test to design quantitative explainability methods, e.g. showing correlations between the saliency / attention maps produced when answering video QAs and the ground-truth object tracks.
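As one concrete instance, a pointing-game-style metric measures what fraction of the attention mass falls inside the annotated boxes, compared against a uniform baseline. Below is a minimal sketch; the function names and the (x0, y0, x1, y1) pixel-coordinate box convention are our own illustrative assumptions, not part of the benchmark API:

```python
import numpy as np

def saliency_in_boxes(saliency: np.ndarray, boxes: list[tuple[int, int, int, int]]) -> float:
    """Fraction of total saliency mass that falls inside ground-truth boxes.

    saliency: (H, W) non-negative map (e.g. attention rolled out to pixels).
    boxes:    list of (x0, y0, x1, y1) ground-truth boxes in pixel coords.
    """
    mask = np.zeros_like(saliency, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    total = saliency.sum()
    if total == 0:
        return 0.0
    return float(saliency[mask].sum() / total)

def coverage_gain(saliency: np.ndarray, boxes: list) -> float:
    """Ratio against a uniform-attention baseline: values > 1 mean the
    model concentrates more mass on annotated objects than chance."""
    mask = np.zeros_like(saliency, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    baseline = mask.mean()  # fraction of pixels covered by boxes
    return saliency_in_boxes(saliency, boxes) / max(baseline, 1e-8)
```

Averaging such a score over all videos of a task (rather than cherry-picked examples) is what turns a visualization into a quantitative explainability method.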

Perception Test tasks

The Perception Test multiple-choice videoQA dataset contains 132 unique questions. Each question is applied to multiple videos (from 20 up to more than 1000 videos). We define a Perception Test task as a videoQA question together with all the videos it is applied to across the train / valid / test splits of the benchmark.

A meaningful interpretability method should identify a pattern or explanation that applies to all or most of the videos within one or more tasks.
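To make this grouping concrete, here is a minimal sketch that recovers tasks by grouping videos by question text. The file name mc_question_train.json and the "mc_question" / "question" field names are assumptions about the released annotation format, so verify them against the actual JSON:

```python
import json
from collections import defaultdict

# Assumed layout: video_id -> per-video metadata with an "mc_question" list.
with open("mc_question_train.json") as f:
    annotations = json.load(f)

tasks = defaultdict(list)  # question text -> video ids it is applied to
for video_id, meta in annotations.items():
    for q in meta["mc_question"]:
        tasks[q["question"]].append(video_id)

print(f"{len(tasks)} unique questions in this split")
# Largest tasks first:
for question, videos in sorted(tasks.items(), key=lambda kv: -len(kv[1]))[:5]:
    print(f"{len(videos):4d}  {question}")
```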

Example of a Perception Test task

Question: "The person uses multiple similar objects to play an occlusion game. How many such objects does the person use?"

Options: a) 4 b) 2 c) 3

For this task, there are 116 (out of 2184) train, 305 (out of 5900) valid, and 189 (out of 3525) test videos; see the example below. These videos also have ground-truth annotations for object tracks, action segments, and sound segments, which can be used to quantitatively verify whether the model attends to the relevant spatio-temporal regions in the frames.

Maximum activations over all layers and heads. The attention has high overlap with the objects in the first and last frames (as shown by the boxes).
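To line attention maps up with these annotations, you first need the ground-truth boxes indexed by frame. A short sketch under assumed field names ("object_tracks", "frame_ids", "bounding_boxes"); check them against the released annotation schema:

```python
from collections import defaultdict

def boxes_per_frame(video_meta: dict) -> dict[int, list]:
    """Collect ground-truth object boxes indexed by frame id.

    Field names ("object_tracks", "frame_ids", "bounding_boxes") are
    assumptions about the annotation schema; adjust to the released JSON.
    """
    frames = defaultdict(list)
    for track in video_meta.get("object_tracks", []):
        for frame_id, box in zip(track["frame_ids"], track["bounding_boxes"]):
            frames[frame_id].append(box)
    return dict(frames)
```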

Resources and Examples

We provide a starter kit with examples for generating visualizations on videos from the Perception Test, as well as some reference works that study LLMs or image VLMs, which might be applicable to video VLMs.

Featured at the top of this page is an example visualization of the ground-truth object tracks overlaid with PerceptionLM's visual attention, computed with four frames as input. At intermediate layers, we can see which regions the model attends to most strongly (some of the cups, but also other areas!):

The figure above shows the maximum activations over all the layers overlaid on the video frames.

You can find this visualization in the PerceptionLM demo notebook linked below, which uses TransformerLens.
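As a starting point, the generic TransformerLens pattern for caching attention and reducing it with a max over layers and heads looks like the sketch below. It runs on text-only GPT-2 so that it is self-contained; the demo notebook wraps PerceptionLM in the same way, and for a video VLM you would reshape the image-token scores back into the patch grid before overlaying them on the frames:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The person stacks three cups on the table.")
_, cache = model.run_with_cache(tokens)

# cache["pattern", l] has shape [batch, n_heads, query_pos, key_pos]
patterns = torch.stack(
    [cache["pattern", l] for l in range(model.cfg.n_layers)]
)  # [n_layers, batch, n_heads, seq, seq]

# Attention from the final (answer-generating) position to every earlier
# token, maximised over layers and heads -- mirroring the figures above.
saliency = patterns[:, 0, :, -1, :].amax(dim=(0, 1))  # [seq]

for tok, score in zip(model.to_str_tokens(tokens), saliency.tolist()):
    print(f"{score:.3f}  {tok!r}")
```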

Suggested resources and ideas to explore (please ignore if not useful for your approach):

Judging Criteria and Timeline

Submissions will be evaluated single-blind by the organising team, according to the criteria below:

The timeline for this track follows the main challenge:

Upload your final Colab notebook and tech report to this Google Form.
