Data Augmentation with Diffusion for Open-Set Semi-Supervised Learning

Seonghyun Ban; Heesan Kong; Kee-Eung Kim

doi:10.52202/079017-2074

NeurIPS 2024

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

Semi-supervised learning (SSL) seeks to utilize unlabeled data to overcome the limited amount of labeled data and improve model performance. However, many SSL methods struggle in real-world scenarios, particularly when the unlabeled data contains a large number of irrelevant instances that do not belong to any class in the labeled data. Previous approaches often downweight instances from irrelevant classes to mitigate the negative impact of class distribution mismatch on model training. However, by discarding irrelevant instances, they risk losing valuable information such as invariance, regularity, and diversity within the data. In this paper, we propose a data-centric generative augmentation approach that leverages a diffusion model to enrich labeled data using both labeled and unlabeled samples. A key challenge is extracting the diversity inherent in the unlabeled data while mitigating the generation of samples irrelevant to the labeled data. To tackle this issue, we combine diffusion model training with a discriminator that identifies and reduces the impact of irrelevant instances. We also demonstrate that a diffusion model trained in this way can even convert an irrelevant instance into a relevant one, yielding highly effective synthetic data for training. Through a comprehensive suite of experiments, we show that our data augmentation approach significantly enhances the performance of SSL methods, especially in the presence of class distribution mismatch.
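
The abstract does not spell out how the discriminator is combined with diffusion training. As a rough illustration only, one possible reading is that each sample's denoising loss is scaled by a relevance score from the discriminator, so that irrelevant unlabeled instances contribute less. The function name `weighted_denoising_loss`, the use of a mean squared noise-prediction error, and the normalization by the sum of scores are all assumptions made for this sketch, not details taken from the paper.

```python
def weighted_denoising_loss(eps_pred, eps_true, relevance):
    """Relevance-weighted mean squared noise-prediction error.

    eps_pred, eps_true: lists of equal-length vectors (one per sample),
        the predicted and true noise for each sample.
    relevance: list of scores in [0, 1], one per sample. These stand in
        for hypothetical discriminator outputs (1 = looks like a labeled
        class, 0 = looks irrelevant).
    """
    # Per-sample mean squared error between predicted and true noise.
    per_sample = [
        sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
        for pred, true in zip(eps_pred, eps_true)
    ]
    total = sum(relevance)
    if total == 0:
        # Every sample was judged irrelevant: nothing to learn from.
        return 0.0
    # Weighted average, so low-relevance samples have little influence.
    return sum(w * l for w, l in zip(relevance, per_sample)) / total
```

With scores of `[1.0, 0.0]`, only the first sample affects the loss; with equal scores, the result reduces to the ordinary batch mean. In a real training loop the same weighting would be applied to tensors inside an autodiff framework.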
