Do large multimodal models truly understand the spatial
structure of the world they perceive? Given that they are largely trained on passive, internet-scale data, how do they represent environments across
scales, from the tabletop in a tea preparation video to the city-scale layout of an hour-long walking tour?
Put your model to the test and win prizes totalling 20K EUR!
NEW this year: we are running a unified videoQA track probing spatial intelligence in both tabletop and city-scale videos. A single model must handle both scales to be eligible.