How strong is your perception model? Can it track objects and points even through strong occlusions? Can it localise actions and sounds? Can it answer questions that require memory and understanding of physics, abstraction, and semantics? Can it reason about descriptive, explanatory, predictive, and counterfactual situations? Can it reason over hour-long videos?
Put your model to the test and win prizes totalling 50K EUR across 5 tracks!
NEW this year: VQA is a unified track containing not only regular video QA, but also questions on point tracking, object tracking, and action localisation, all posed in a video QA format.
NEW this year: We have 2 guest tracks: KiVA (an image evaluation probing visual analogy skills), and Physics-IQ (assessing whether generative models produce physics-aware videos).