June 3, 2026 · 8:10 AM – 12:30 PM
Room 601
Our multi-modal spatial intelligence (MUSI) workshop addresses how multimodal large language models (MLLMs) understand, reason about, and interact with spatial information from the physical world. The multimodal nature of spatial intelligence—requiring integration of images, videos, and 3D data—necessitates bringing together researchers from diverse domains: computer vision, robotics, graphics, and NLP. While recent MLLMs show promising visual-spatial capabilities, fundamental questions remain about spatial relationships, 3D environment modeling, and real-world spatial reasoning. This workshop explores how MLLMs learn spatial representations across modalities, advance world modeling and embodied AI, and address ethical considerations. We aim to establish benchmarks and foster cross-disciplinary collaboration to advance spatial reasoning in multimodal AI.
| 08:10 – 08:20 | Welcome & Introduction |
| 08:20 – 08:50 | Keynote Talk 1: Katerina Fragkiadaki (Carnegie Mellon University) |
| 08:50 – 09:20 | Keynote Talk 2: Angel X. Chang (Simon Fraser University) [Slides] |
| 09:20 – 09:50 | Keynote Talk 3: Chuang Gan (UMass Amherst / MIT-IBM Watson AI Lab) |
| 09:50 – 10:05 | ☕ Coffee Break & Social |
| 10:05 – 10:35 | Keynote Talk 4: Roozbeh Mottaghi (Skild AI / University of Washington) |
| 10:30 – 12:30 | Poster Session: ExHall A · Board IDs 142 – 149 |
| 10:35 – 11:05 | Keynote Talk 5: Saining Xie (NYU / AMI Labs) [Slides] |
| 11:05 – 11:35 | Keynote Talk 6: Ranjay Krishna (University of Washington / Allen Institute for AI) |
| 11:35 – 12:05 | Keynote Talk 7: Kristen Grauman (University of Texas at Austin) |
| 12:05 – 12:30 | Closing Remarks |
We invite submissions on topics including, but not limited to:
| Submission Deadline | March 13, 2026 (23:59 AoE) |
| Author Notification | April 3, 2026 (23:59 AoE) |
| Camera Ready | April 24, 2026 (23:59 AoE) |
| Workshop Date | June 3, 2026 (morning) |
*All deadlines are Anywhere on Earth (AoE). Timelines are subject to change.
Submissions are closed. The OpenReview portal is no longer accepting new papers.
The workshop will be non-archival. Authors of accepted papers retain the full copyright of their work and are free to submit extended versions to conferences or journals.
Congratulations to all accepted authors!
ExHall A · 10:30 AM – 12:30 PM · Poster board IDs 142 – 149 (2 posters per board)
Synthetic Counterfactual World Models for Multimodal Spatial Reasoning in Low-Resource 3D Domains
Mahule Roy, Subhas Roy
Synthesis of Interactive and Expansive Apartment Environments
ChunTeng Chen
CoT-PL: Chain-of-Thought Pseudo-Labeling for Open-Vocabulary Object Detection
Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim
SPOT: Structured Prompting with Object-centric Tokens for open-world scene graphs
Mengqi Zhang, Sahil Khose, Fiona Ryan, Judy Hoffman
Can VLMs Handle Multi-hop Compositional Spatial Reasoning?
Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang
ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search
Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon
Improving Scene Text Recognition in Multimodal Large Language Models using Visual Text Grounding
Shashank Krishna Vempati, Chetan Arora
Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Mahtab Bigverdi, Linjie Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang, Tuhin Kundu, Chris Dongjoo Kim, Zelun Luo, Ranjay Krishna, Linda Shapiro
When Spatial Reasoning Goes in Circles: Measuring Ordinal Consistency in Multimodal LLMs via Tournament Theory
Kaustubh S. Bukkapatnam, Rayan Malik, Atharv Kanchi
Bridging the Granularity Gap: Object-Centric Masking for Contextual Visual Learning
Jike Zhong
MindBlock: Probing Spatial Assembly and Structure in Unified Multimodal Models
Baiqiao Yin, Junhao Liu, Han Yin, Heyang Yu, Tingxuan Zhang, Zhiheng Li, Chengzu Li, Jihan Yang, Manling Li, Chen Feng, Yiming Li
Multi-Modal Manipulation via Multi-Modal Policy Consensus
ICRA 2026 · Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Rose Driggs-Campbell
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
CVPR 2026 · Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao
Name That Part: 3D Part Segmentation and Naming
CVPR 2026 Findings · Soumava Paul, Prakhar Kaushik, Ankit Vaidya, Anand Bhattad, Alan Yuille
Theory of Space: Evaluating Multimodal Spatial Belief through Active Exploration
ICLR 2026 · Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
CVPR 2026 · Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
CVPR 2026 · Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay
Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound
CVPR 2026 · Hyeonggon Ryu, Joon Son Chung, David Harwath
A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
ICLR 2026 Workshop (ESR) · Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
CVPR 2026 · Vaibhav Agrawal, Rishubh Parihar, Pradhaan S Bhat, Ravi Kiran Sarvadevabhatla, Venkatesh Babu Radhakrishnan
We sincerely thank the following reviewers for their time, expertise, and thoughtful feedback during the peer review process.
For any inquiries about the workshop, please reach out via email: