October 20, 2025 (1:00 pm – 5:30 pm)
Room 306 A | Zoom Link (code: farmcans)
Our multi-modal spatial intelligence workshop aims to bring together researchers from computer vision, robotics, graphics, and NLP for a half-day of talks and discussion at the intersection of visual understanding, multimodal learning, and embodied AI. Our focus is on placing multi-modal large language models (MLLMs) at the core of spatial intelligence: exploring how they can learn, interpret, and act on spatial information from images, videos, and 3D data.
13:00 – 13:10 | Welcome & Introduction |
13:10 – 13:50 | Generate Robotic Data with Spatial Intelligence | Yue Wang (USC / NVIDIA)
13:50 – 14:30 | Towards Spatial Supersensing | Saining Xie (NYU / Google DeepMind)
14:30 – 15:10 | Why is Spatial Understanding Hard for VLMs? | Manling Li (Northwestern University)
15:10 – 15:30 | ☕ Coffee Break & Social |
15:30 – 16:10 | Visual Reasoning Will Be Bigger Than Language Reasoning | Ranjay Krishna (UW / AI2)
16:10 – 16:50 | On Latent Abilities Underlying Spatial Intelligence | Qianqian Wang (UC Berkeley)
16:50 – 17:20 | Panel Discussion: Future of Multimodal Spatial Intelligence |
17:20 – 17:30 | Concluding Remarks |
For any questions, please reach out to the primary contact: Songyou Peng