Paper title: Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs
Authors: Huyen Vo, María Martínez-García, and Isabel Valera
—
Link: https://vothuckhanhhuyen.github.io/assets/pdf/Holder_ICML2026.pdf
Abstract: Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence, i.e., they struggle to generate realistic and diverse samples that are, at the same time, semantically consistent across modalities. Recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.
