AAAI 2026
Efficient Modality Translation via Arbitrary Conditioning and Wasserstein Regularization
Abstract
The central challenge in multimodal generative modeling lies in accurately approximating the joint data distribution even when some modalities are missing. Existing multimodal VAEs address this by designing increasingly complex encoding architectures that rely on modality-specific encoders, factorized posteriors, and custom inference procedures; amortizing the encoding parameters across these components restricts their ability to capture relations among modalities. We challenge this paradigm by introducing a model trained for arbitrary conditioning, i.e., generating any modality given a subset of observed modalities and a logical index indicating which modalities are present or missing. This enables a single unified encoder to handle any subset of modalities while capturing inter-modal relationships via a compact, shared posterior. We find that, to work efficiently in the multimodal setting, arbitrary conditioning requires replacing the KL divergence with Wasserstein regularization, which allows more dispersed latent embeddings that support learning over diverse data and modality subsets. This insight exposes a critical deficiency in existing methods: their KL regularization tends to concentrate individual embeddings near the standard Gaussian prior, even though those embeddings come from very diverse subsets of multimodal inputs. We prove that Wasserstein regularization ensures that the aggregate latent distribution, spanning all conditioning subsets, aligns with the prior without requiring mixture models or auxiliary inference tricks. Empirically, the proposed model improves cross-modal generation and yields better reconstructions than state-of-the-art multimodal VAEs.
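To make the setup concrete, below is a minimal, hypothetical PyTorch sketch of the two ideas the abstract describes: a single encoder conditioned on a binary presence mask that is shared across all modality subsets, and a closed-form squared 2-Wasserstein penalty toward the standard Gaussian prior in place of the usual KL term. The class name, layer sizes, masking scheme, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): one shared encoder/decoder
# over all modalities, with missing modalities zero-filled and flagged by
# a binary mask, plus a closed-form 2-Wasserstein regularizer.
import torch
import torch.nn as nn

class MaskedMultimodalVAE(nn.Module):
    """A single unified encoder/decoder pair for any conditioning subset."""

    def __init__(self, modality_dims, latent_dim=32, hidden=256):
        super().__init__()
        in_dim = sum(modality_dims) + len(modality_dims)  # data + mask flags
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, sum(modality_dims)),  # all modalities jointly
        )

    def forward(self, x, mask):
        # x: concatenated modalities, zero-filled where missing
        # mask: one binary flag per modality (1 = observed, 0 = missing)
        mu, logvar = self.encoder(torch.cat([x, mask], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decoder(z), mu, logvar

def wasserstein2_to_standard_normal(mu, logvar):
    """Squared 2-Wasserstein distance from N(mu, diag(sigma^2)) to N(0, I).

    For diagonal Gaussians this has the closed form
        W2^2 = ||mu||^2 + sum_i (sigma_i - 1)^2,
    which, unlike the KL term, does not push every individual posterior to
    collapse onto the prior and so tolerates more dispersed embeddings.
    """
    sigma = (0.5 * logvar).exp()
    return (mu ** 2).sum(dim=-1) + ((sigma - 1.0) ** 2).sum(dim=-1)

# Usage sketch: sample a random conditioning subset per example.
model = MaskedMultimodalVAE(modality_dims=[784, 10])
x = torch.randn(8, 794)
mask = torch.randint(0, 2, (8, 2)).float()
x_masked = x * torch.repeat_interleave(mask, torch.tensor([784, 10]), dim=-1)
recon, mu, logvar = model(x_masked, mask)
loss = (recon - x).pow(2).mean() \
     + 0.1 * wasserstein2_to_standard_normal(mu, logvar).mean()
```

Training over randomly drawn masks forces the shared posterior to serve every conditioning subset, which is where the more dispersed, Wasserstein-regularized latent space described in the abstract is claimed to help.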
Context
- Venue
- AAAI Conference on Artificial Intelligence