IJCAI Conference 2025 Conference Paper
Denoising Diffusion Models are Good General Gaze Feature Learners
- Guanzhong Zeng
- Jingjing Wang
- Pengwei Yin
- Zefu Xu
- Mingyang Zhou
Since the collection of labeled gaze data is laborious and time-consuming, methods which can learn generalizable features by leveraging large-scale available unlabeled data are desirable. In recent years, we have witnessed the tremendous capabilities of diffusion models in generating images as well as their potential in feature representation learning. In this paper, we investigate whether they can acquire discriminative representations for gaze estimation via generative pre-training. To achieve this goal, we propose a self-supervised learning framework with diffusion models for gaze estimation, called GazeDiff. Specifically, we utilize a conditional diffusion model to generate target image with gaze direction specified by the reference image as the pre-training task. To facilitate the diffusion model to learn gaze related features as condition, we propose a disentangling feature learning strategy, which first learns appearance feature, head pose feature, and eye direction feature respectively, and then combines them as the conditional features. Extensive experiments demonstrate denoising diffusion models are also good general gaze feature learners.