Do large multimodal models truly understand the spatial structure of the world they perceive? Given that they are largely trained on passive, internet-scale data, how do they represent environments across scales, from the tabletop in a tea preparation video to the city-scale layout of an hour-long walking tour?
Put your model to the test and win prizes totalling 20K EUR!
NEW this year: KilometerVision track probing spatial intelligence in walking tour videos, including questions on distance estimation, landmark recognition, compass directions, and map understanding.
NEW this year: KilometerAudio track probing multimodal audio-video understanding in hour-long walking tour videos.