Do large multimodal models truly understand the spatial
structure of the world they perceive? Given that they are largely trained on passive, internet-scale data, how do they represent environments across
scales, from the tabletop in a tea preparation video to the city-scale layout of an hour-long walking tour?
Put your model to the test and win prizes totalling 20K EUR!
NEW this year: we are running a unified videoQA track probing spatial intelligence in both tabletop and city-scale videos. A single model must handle both scales to be eligible.