UniOMA
Gromov–Wasserstein structural alignment for multimodal robot perception (ICRA 2026)
UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception — Xinrui Zu, Kevin Sebastian Luck, Shujian Yu. Accepted to ICRA 2026 Late-Breaking Results and the ICRA 2026 “From Data to Decisions” workshop (Poster Session 2). [Paper]
Contrastive losses learn which sensor readings correspond across modalities, yet stay blind to the internal geometry each modality carries. UniOMA closes this structural alignment gap with a geometry-aware Gromov–Wasserstein (GW) regularizer that can be added to any contrastive multimodal objective.
Standard InfoNCE-style losses capture instance-level correspondence (which sample matches which) but are invariant to transformations that distort each modality’s internal relational geometry. UniOMA encodes each sensor’s intra-modality similarity structure and uses a Gromov–Wasserstein distance to draw every modality toward a dynamically learned consensus geometry, the barycenter. The regularizer is plug-and-play across InfoNCE, Symile, GRAM, and related objectives, scales linearly in the number of modalities (so it extends to three or more), and targets robot perception with underexplored modalities (force, tactile, IMU, proprioception), where sample density varies widely across sensors. Across five robotic benchmarks, including a new real-robot Vision–IMU–Proprioception dataset, UniOMA consistently improves state regression, classification, and cross-modal retrieval.
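To make the idea of a structural regularizer concrete, here is a minimal NumPy sketch. It is not the UniOMA implementation; it rests on several simplifying assumptions that are ours, not the paper's: samples are paired across modalities, so the transport coupling is fixed to the identity; the modality weights are given rather than learned; cosine similarity is used as the intra-modality structure; and the consensus geometry is the weighted average of similarity matrices, which is the squared-loss GW barycenter only under that fixed coupling. The function names (`similarity_matrix`, `gw_cost`, `structural_regularizer`) are hypothetical.

```python
import numpy as np


def similarity_matrix(z):
    """Cosine-similarity matrix (n x n) for a batch of n embeddings."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T


def gw_cost(c1, c2, coupling):
    """Squared-loss Gromov-Wasserstein objective for a FIXED coupling T:

        sum_{i,j,k,l} (c1[i,k] - c2[j,l])**2 * T[i,j] * T[k,l]

    computed by expanding the square instead of looping over four indices.
    """
    p = coupling.sum(axis=1)  # marginal over the first space
    q = coupling.sum(axis=0)  # marginal over the second space
    term1 = p @ (c1 ** 2) @ p
    term2 = q @ (c2 ** 2) @ q
    cross = np.sum((c1 @ coupling @ c2.T) * coupling)
    return term1 + term2 - 2.0 * cross


def structural_regularizer(embeddings, weights):
    """Weighted GW cost of each modality's geometry to a consensus geometry.

    Assumption: identity coupling (sample i in one modality matches sample i
    in every other), and fixed, non-learned weights that sum to 1.
    """
    n = embeddings[0].shape[0]
    coupling = np.eye(n) / n
    sims = [similarity_matrix(z) for z in embeddings]
    # Under fixed identity couplings, the squared-loss barycenter is the
    # weighted average of the similarity matrices.
    consensus = sum(w * c for w, c in zip(weights, sims))
    return sum(w * gw_cost(c, consensus, coupling) for w, c in zip(weights, sims))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_a = rng.normal(size=(8, 4))                     # hypothetical modality A
    rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
    z_b = z_a @ rotation                              # same geometry, rotated
    z_c = rng.normal(size=(8, 4))                     # unrelated geometry

    print("rotated copy:", structural_regularizer([z_a, z_b], [0.5, 0.5]))
    print("unrelated   :", structural_regularizer([z_a, z_c], [0.5, 0.5]))
```

In this sketch a rotated copy of an embedding incurs essentially zero penalty, because rotation preserves pairwise cosine similarities, while an embedding with unrelated internal geometry is penalized. In training, a term of this kind would be added to the contrastive loss with a scalar weight; how UniOMA solves for the coupling and learns the barycenter weights is described in the paper and is not reproduced here.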
As a byproduct, the learned barycenter weights yield a label-free per-modality structural-importance ranking, surfacing which sensor most shapes a given task — proprioception in one setting, depth in another — without task labels or repeated retraining.