2nd Workshop on Multimodal Spatial Intelligence

June 3, 2026 · 8:10 AM – 12:30 PM

Room 601

About the Workshop

The Multimodal Spatial Intelligence (MUSI) workshop examines how multimodal large language models (MLLMs) understand, reason about, and interact with spatial information from the physical world. Spatial intelligence is inherently multimodal: it requires integrating images, videos, and 3D data, which calls for bringing together researchers from computer vision, robotics, graphics, and NLP. While recent MLLMs show promising visual-spatial capabilities, fundamental questions remain about how they represent spatial relationships, model 3D environments, and reason about space in the real world. The workshop explores how MLLMs learn spatial representations across modalities and how they can advance world modeling and embodied AI, and it addresses the ethical considerations involved. We aim to establish benchmarks and foster cross-disciplinary collaboration to advance spatial reasoning in multimodal AI.

Keywords:

Spatial Reasoning · Multimodal Large Language Model · World Models · Embodied AI · 3D Understanding

Program

08:10 – 08:20	Welcome & Introduction
08:20 – 08:50	Keynote Talk 1: Katerina Fragkiadaki (Carnegie Mellon University)
08:50 – 09:20	Keynote Talk 2: Angel X. Chang (Simon Fraser University)
09:20 – 09:50	Keynote Talk 3: Chuang Gan (UMass Amherst / MIT-IBM Watson AI Lab)
09:50 – 10:05	☕ Coffee Break & Social
10:05 – 10:35	Keynote Talk 4: Roozbeh Mottaghi (Skild AI / University of Washington)
10:30 – 12:30	Poster Session (ExHall A · Board IDs 142 – 149)
10:35 – 11:05	Keynote Talk 5: Saining Xie (NYU / AMI Labs)
11:05 – 11:35	Keynote Talk 6: Ranjay Krishna (University of Washington / Allen Institute for AI)
11:35 – 12:05	Keynote Talk 7: Kristen Grauman (University of Texas at Austin)
12:05 – 12:30	Closing Remarks

Call for Papers

Topics

We invite submissions on topics including, but not limited to:

Spatial Reasoning in Multimodal LLMs
World Models for Physical Understanding
Embodied Agents and VLA Models
3D Scene Understanding, Generation, and Reconstruction
Open-Vocabulary 2D/3D Perception and Reasoning
Temporal and Causal Reasoning in Dynamic Environments
Multimodal Interaction, Grounding, and Planning
Neuro-symbolic Approaches for Spatial Intelligence
Benchmarks and Datasets for Spatial Reasoning
Trust, Ethics, and Societal Impact of Spatial AI

Important Dates

Submission Deadline	March 13, 2026 (23:59 AoE)
Author Notification	April 3, 2026 (23:59 AoE)
Camera Ready	April 24, 2026 (23:59 AoE)
Workshop Date	June 3, 2026 (morning)

*All deadlines are Anywhere on Earth (AoE). Timelines are subject to change.

Submission Guidelines

Eligibility: We welcome both new work and papers previously accepted at other venues.
Format: For new work, papers must be submitted in the CVPR 2026 format. Previously accepted papers may be submitted in their original format, but must still be anonymized for the review process.
Length: Max 8 pages (excluding references).
Review: Double-blind peer review.
Presentation: Accepted papers will be presented as posters.

Submit via OpenReview

Submissions are closed. The OpenReview portal is no longer accepting new papers.

Publication

The workshop is non-archival. Authors of accepted papers retain full copyright of their work and are free to submit extended versions to conferences or journals.

Accepted Papers

Congratulations to all accepted authors!

Poster presentations

ExHall A · 10:30 AM – 12:30 PM · Poster board IDs 142 – 149 (2 posters per board)

Synthetic Counterfactual World Models for Multimodal Spatial Reasoning in Low-Resource 3D Domains

Mahule Roy, Subhas Roy

Synthesis of Interactive and Expansive Apartment Environments

ChunTeng Chen

CoT-PL: Chain-of-Thought Pseudo-Labeling for Open-Vocabulary Object Detection

Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim

SPOT: Structured Prompting with Object-centric Tokens for open-world scene graphs

Mengqi Zhang, Sahil Khose, Fiona Ryan, Judy Hoffman

Can VLMs Handle Multi-hop Compositional Spatial Reasoning?

Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon

Improving Scene Text Recognition in Multimodal Large Language Models using Visual Text Grounding

Shashank Krishna Vempati, Chetan Arora

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Mahtab Bigverdi, Linjie Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang, Tuhin Kundu, Chris Dongjoo Kim, Zelun Luo, Ranjay Krishna, Linda Shapiro

When Spatial Reasoning Goes in Circles: Measuring Ordinal Consistency in Multimodal LLMs via Tournament Theory

Kaustubh S. Bukkapatnam, Rayan Malik, Atharv Kanchi

Bridging the Granularity Gap: Object-Centric Masking for Contextual Visual Learning

Jike Zhong

MindBlock: Probing Spatial Assembly and Structure in Unified Multimodal Models

Baiqiao Yin, Junhao Liu, Han Yin, Heyang Yu, Tingxuan Zhang, Zhiheng Li, Chengzu Li, Jihan Yang, Manling Li, Chen Feng, Yiming Li

Multi-Modal Manipulation via Multi-Modal Policy Consensus
ICRA 2026

Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Rose Driggs-Campbell

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
CVPR 2026

Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

Name That Part: 3D Part Segmentation and Naming
CVPR 2026 Findings

Soumava Paul, Prakhar Kaushik, Ankit Vaidya, Anand Bhattad, Alan Yuille

Theory of Space: Evaluating Multimodal Spatial Belief through Active Exploration
ICLR 2026

Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
CVPR 2026

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
CVPR 2026

Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay

Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound
CVPR 2026

Hyeonggon Ryu, Joon Son Chung, David Harwath

A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
ICLR 2026 Workshop (ESR)

Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
CVPR 2026

Vaibhav Agrawal, Rishubh Parihar, Pradhaan S Bhat, Ravi Kiran Sarvadevabhatla, Venkatesh Babu Radhakrishnan


Reviewer Acknowledgement

We sincerely thank the following reviewers for their time, expertise, and thoughtful feedback during the peer review process.

Siyi Chen
Daehyeon Choi
Jaewoo Jung
Minseo Kim
Seonho Lee
Damiano Marsili
Kiet T. Nguyen
Chanho Park
Zekun Qi
Marc Unzueta
Austin T. Wang
Chun-Hsiao Yeh
Baiqiao Yin
Jin Yoo
Sangwoo Youn

Contact

For any inquiries about the workshop, please reach out via email:

Juil Koo (63days@kaist.ac.kr)
Phillip Y. Lee (phillip0701@kaist.ac.kr)

Invited Speakers

Katerina Fragkiadaki

Angel X. Chang

Chuang Gan

Roozbeh Mottaghi

Saining Xie

Ranjay Krishna

Kristen Grauman

Organizers

Juil Koo

Phillip Y. Lee

Songyou Peng

Mikaela Angelina Uy

Sanja Fidler

Leonidas Guibas

Minhyuk Sung
