Machine Learning – Publications
1.
Vo, Huyen; Martínez-García, María; Valera, Isabel
Holder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs Proceedings Article
In: 2026.
@inproceedings{nokey,
title = {Holder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs},
author = {Huyen Vo and María Martínez-García and Isabel Valera},
url = {https://vothuckhanhhuyen.github.io/assets/pdf/Holder_ICML2026.pdf},
year = {2026},
date = {2026-03-11},
urldate = {2026-03-11},
abstract = {Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence—i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.},
keywords = {huyen, isabel, maria},
pubstate = {published},
tppubtype = {inproceedings}
}
2.
Vo, Huyen Thuc Khanh; Valera, Isabel
Hellinger Multimodal Variational Autoencoders Proceedings Article Spotlight
In: The 29th International Conference on Artificial Intelligence and Statistics, 2026.
@inproceedings{vo2026hellinger,
title = {Hellinger Multimodal Variational Autoencoders},
author = {Huyen Thuc Khanh Vo and Isabel Valera},
url = {https://openreview.net/forum?id=mxHyYltMUa},
year = {2026},
date = {2026-01-01},
urldate = {2026-01-01},
booktitle = {The 29th International Conference on Artificial Intelligence and Statistics},
abstract = {Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with α=0.5, which corresponds to the unique symmetric member of the α-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.},
keywords = {huyen, isabel, spotlight},
pubstate = {published},
tppubtype = {inproceedings}
}
