Arrow Research search

Author name cluster

Jufeng Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

AAAI Conference 2026 Conference Paper

Collaborative Feature Matching with Progressive Correspondence Learning

  • Xin Liu
  • Yanbing Han
  • Rong Qin
  • Bing Wang
  • Jufeng Yang

Accurate feature matching between image pairs is fundamental for various computer vision applications. In the detector-based pipeline, the feature matcher aims to find the optimal feature correspondences, and the match filter is used for further removing mismatches. However, their connection is rarely exploited since they are usually treated as two separate issues in previous methods, which may lead to suboptimal results. In this paper, we propose an end-to-end collaborative feature matching (CFM) method, which contains a keypoint learning (KL) module and a correspondence learning (CL) module, to bridge the gap between the two types of work. The former improves the discrimination of keypoints and provides high-quality dynamic matches for the CL module. The latter further captures the rich context of matches and gives effective feedback to the KL module. These two modules reinforce each other in a progressive manner. Besides, we develop an efficient version of CFM, named ECFM, which uses an adaptive sampling strategy to avoid the negative influence of uninformative keypoints. Experimental results indicate that both methods outperform state-of-the-art competitors in the tasks of relative pose estimation and visual localization.
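
For context, the decoupled matcher-plus-filter baseline the abstract contrasts with can be sketched in a few lines: mutual nearest-neighbour matching followed by a separate ratio-test filter. This is only the classical pipeline, not CFM itself, and the function and parameter names are illustrative.

```python
import numpy as np

def match_then_filter(desc_a, desc_b, ratio=0.8):
    """Detector-based baseline: a matcher (mutual nearest neighbours)
    followed by a separate mismatch filter (ratio test). CFM instead
    couples keypoint learning and correspondence learning end-to-end.

    desc_a: (N, D) and desc_b: (M, D) L2-normalised descriptors.
    Returns a list of (i, j) index pairs.
    """
    sim = desc_a @ desc_b.T                        # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)                     # best match in B for each A
    nn_ba = sim.argmax(axis=0)                     # best match in A for each B
    matches = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] != i:                          # keep mutual matches only
            continue
        best, second = np.sort(sim[i])[::-1][:2]
        if second / max(best, 1e-8) < ratio:       # filter out ambiguous matches
            matches.append((i, int(j)))
    return matches
```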

NeurIPS Conference 2025 Conference Paper

BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes

  • Lishen Qu
  • Zhihao Liu
  • Shihao Zhou
  • LUO YAQI
  • Jie Liang
  • Hui Zeng
  • Lei Zhang
  • Jufeng Yang

Flicker artifacts in short-exposure images are caused by the interplay between the row-wise exposure mechanism of rolling shutter cameras and the temporal intensity variations of alternating current (AC)-powered lighting. These artifacts typically appear as uneven brightness distribution across the image, forming noticeable dark bands. Beyond compromising image quality, this structured noise also affects high-level tasks, such as object detection and tracking, where reliable lighting is crucial. Despite the prevalence of flicker, the lack of a large-scale, realistic dataset has been a significant barrier to advancing research in flicker removal. To address this issue, we present BurstDeflicker, a scalable benchmark constructed using three complementary data acquisition strategies. First, we develop a Retinex-based synthesis pipeline that redefines the goal of flicker removal and enables controllable manipulation of key flicker-related attributes (e.g., intensity, area, and frequency), thereby facilitating the generation of diverse flicker patterns. Second, we capture 4,000 real-world flicker images from different scenes, which help the model better understand the spatial and temporal characteristics of real flicker artifacts and generalize more effectively to in-the-wild scenarios. Finally, due to the non-repeatable nature of dynamic scenes, we propose a green-screen method to incorporate motion into image pairs while preserving real flicker degradation. Comprehensive experiments demonstrate the effectiveness of our dataset and its potential to advance research in flicker removal.
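
A minimal sketch of how AC flicker can be synthesized onto a clean image, assuming a simple sinusoidal illumination model sampled row by row by a rolling shutter; the parameter names (frequency, readout_time, intensity) are illustrative and this is not the dataset's actual Retinex-based pipeline.

```python
import numpy as np

def add_ac_flicker(img, frequency=100.0, readout_time=1 / 60, intensity=0.4, phase=0.0):
    """Overlay sinusoidal row-wise banding that mimics AC-powered lighting
    sampled by a rolling shutter (simplified sketch, not the BurstDeflicker pipeline).

    img: float array in [0, 1], shape (H, W) or (H, W, 3).
    frequency: flicker frequency in Hz (100/120 Hz for 50/60 Hz mains).
    readout_time: time for the shutter to sweep all rows, in seconds.
    intensity: modulation depth of the banding.
    """
    h = img.shape[0]
    t = np.linspace(0.0, readout_time, h)                  # exposure time of each row
    gain = 1.0 - intensity * 0.5 * (1.0 + np.sin(2 * np.pi * frequency * t + phase))
    gain = gain.reshape(h, *([1] * (img.ndim - 1)))        # broadcast over columns/channels
    return np.clip(img * gain, 0.0, 1.0)
```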

NeurIPS Conference 2025 Conference Paper

FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering

  • Lishen Qu
  • Zhihao Liu
  • Jinshan Pan
  • Shihao Zhou
  • Jinglei Shi
  • Duosheng Chen
  • Jufeng Yang

Lens flare occurs when shooting towards strong light sources, significantly degrading the visual quality of images. Due to the difficulty of capturing flare-corrupted and flare-free image pairs in the real world, existing datasets are typically synthesized in 2D by overlaying artificial flare templates onto background images. However, the lack of flare diversity in templates and the neglect of physical principles in the synthesis process hinder models trained on these datasets from generalizing well to real-world scenarios. To address these challenges, we propose a new physics-informed method for flare data generation, which consists of three stages: parameterized template creation, illumination-law-aware 2D synthesis, and physics-engine-based 3D rendering. This yields a mixed flare dataset that incorporates both 2D and 3D perspectives, namely FlareX. The dataset offers 9,500 2D templates derived from 95 flare patterns and 3,000 flare image pairs rendered from 60 3D scenes. Furthermore, we design a masking approach to obtain real-world flare-free images from their corrupted counterparts to measure the performance of the model on real-world images. Extensive experiments demonstrate the effectiveness of our method and dataset.
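
The conventional 2D synthesis route mentioned in the abstract, overlaying a flare template onto a background image, can be sketched as below; FlareX's physics-informed constraints and 3D rendering are not modeled in this sketch.

```python
import numpy as np

def composite_flare(background, flare, mode="screen"):
    """Overlay a flare template onto a clean image, the common 2D synthesis
    recipe that FlareX builds on and extends (illustrative only).

    Both inputs are float arrays in [0, 1] with the same shape.
    """
    if mode == "add":
        out = background + flare
    else:
        # Screen blending keeps highlights without hard additive clipping.
        out = 1.0 - (1.0 - background) * (1.0 - flare)
    return np.clip(out, 0.0, 1.0)
```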

NeurIPS Conference 2025 Conference Paper

Hybrid Re-matching for Continual Learning with Parameter-Efficient Tuning

  • Weicheng Wang
  • Guoli Jia
  • Xialei Liu
  • Liang Lin
  • Jufeng Yang

Continual learning seeks to enable a model to assimilate knowledge from non-stationary data streams without catastrophic forgetting. Recently, methods based on Parameter-Efficient Tuning (PET) have achieved superior performance without storing any historical exemplars: they train a small set of task-specific parameters upon a frozen pre-trained model, and the tailored parameters are retrieved to guide predictions during inference. However, relying solely on pre-trained features for parameter matching exacerbates the inconsistency between the training and inference phases, thereby constraining overall performance. To address this issue, we propose HRM-PET, which makes full use of the richer downstream knowledge inherently contained in the trained parameters. Specifically, we introduce a hybrid re-matching mechanism that exploits the initial predicted distribution to facilitate parameter selection. Direct re-matching addresses samples whose initial matching is incorrect but whose prediction nonetheless reveals the correct task identity. Moreover, confidence-based re-matching is specifically designed to handle other, more challenging mismatched samples that cannot be calibrated by the former. Besides, to acquire task-invariant knowledge for better matching, we integrate a cross-task instance relationship distillation module into the PET-based method. Extensive experiments conducted on four datasets under five pre-trained settings demonstrate that HRM-PET performs favorably against state-of-the-art methods. The code is available at https://github.com/wei-cheng777/HRM-PET.
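
A hedged sketch of the re-matching idea, assuming a typical PET setup where a task is first selected by key similarity over frozen features and the initial prediction is then allowed to overrule that choice; all names (task_keys, predict_with_prompt, class_to_task) are hypothetical, not HRM-PET's API.

```python
import numpy as np

def rematch_predict(feature, task_keys, predict_with_prompt, class_to_task):
    """feature: (D,) frozen pre-trained feature of a query sample.
    task_keys: (T, D) learned key per task-specific parameter set.
    predict_with_prompt(task_id, feature) -> logits over all seen classes.
    class_to_task: dict mapping a class id to the task that introduced it.
    """
    task = int(np.argmax(task_keys @ feature))       # initial matching on pre-trained features
    logits = predict_with_prompt(task, feature)
    pred_task = class_to_task[int(np.argmax(logits))]
    if pred_task != task:                            # direct re-matching via the predicted class
        task = pred_task
        logits = predict_with_prompt(task, feature)
    return task, logits
```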

ICML Conference 2025 Conference Paper

MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

  • Zhicheng Zhang
  • Wuyou Xia
  • Chenxi Zhao 0002
  • Zhou Yan
  • Xiaoqiang Liu
  • Yongjie Zhu
  • Wenyu Qin
  • Pengfei Wan

Multimodal large language models (MLLMs) have recently shown strong capacity in integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while leaving the mixing of multimodal tokens through attention less explored, posing challenges for high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling interaction between the visual and language modalities. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
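
A generic sketch of masked scaled dot-product attention, the mechanism on which the abstract's adaptive masked attention builds; MODA's duplex modality spaces and its specific masking patterns are not reproduced here.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with a per-token-pair mask.

    q: (Nq, D), k, v: (Nk, D); mask: (Nq, Nk) with 1 = attend, 0 = block.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask.astype(bool), scores, -1e9)        # block masked pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

For example, a block-diagonal mask restricts tokens to attend within their own modality, while an all-ones mask recovers standard cross-modal attention.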

NeurIPS Conference 2025 Conference Paper

VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

  • Zhicheng Zhang
  • Weicheng Wang
  • Yongjie Zhu
  • Wenyu Qin
  • Pengfei Wan
  • Di Zhang
  • Jufeng Yang

Understanding and predicting emotions from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions, characterized by their open-set, dynamic, and context-dependent properties, poses challenges in understanding complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

NeurIPS Conference 2024 Conference Paper

To Err Like Human: Affective Bias-Inspired Measures for Visual Emotion Recognition Evaluation

  • Chenxi Zhao
  • Jinglei Shi
  • Liqiang Nie
  • Jufeng Yang

Accuracy is a commonly adopted performance metric in various classification tasks, measuring the proportion of correctly classified samples among all samples. It assumes equal importance for all classes, hence equal severity for all misclassifications. However, in emotional classification, due to the psychological similarities between emotions, misclassifying a certain emotion into one class may be more severe than into another; e.g., misclassifying 'excitement' as 'anger' is apparently more severe than as 'awe'. Although highly meaningful for many applications, metrics capable of measuring these cases of misclassification in visual emotion recognition tasks have yet to be explored. In this paper, based on Mikel's emotion wheel from psychology, we propose a novel approach for evaluating performance in visual emotion recognition, which takes into account the distance on the emotion wheel between different emotions to mimic the psychological nuances of emotions. Experimental results on semi-supervised emotion recognition and a user study show that our proposed metric is more effective than accuracy at assessing performance and conforms to the cognitive laws of human emotions. The code is available at https://github.com/ZhaoChenxi-nku/ECC.
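
One plausible instantiation of a wheel-distance-aware score, assuming the eight Mikel emotions in a circular order; the ordering and the exact weighting are assumptions of this sketch, not necessarily the paper's metric.

```python
import numpy as np

# One plausible wheel order for Mikel's eight emotions (an assumption).
WHEEL = ["amusement", "awe", "contentment", "excitement",
         "anger", "disgust", "fear", "sadness"]

def wheel_distance(i, j, n=8):
    """Shortest circular distance between two emotions on the wheel."""
    d = abs(i - j)
    return min(d, n - d)

def severity_aware_score(y_true, y_pred, n=8):
    """Accuracy-like score that discounts a misclassification by how far
    apart the two emotions sit on the wheel (a sketch, not the paper's
    exact formula)."""
    d = np.array([wheel_distance(t, p, n) for t, p in zip(y_true, y_pred)])
    return float(np.mean(1.0 - d / (n // 2)))
```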

AAAI Conference 2020 Conference Paper

An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

  • Sicheng Zhao
  • Yunsheng Ma
  • Yang Gu
  • Jufeng Yang
  • Tengfei Xing
  • Pengfei Xu
  • Runbo Hu
  • Hua Chai

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e., polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at https://github.com/maysonma/VAANet.
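
A hedged sketch of a polarity-consistent cross-entropy: the per-sample loss is up-weighted when the predicted emotion falls in the opposite polarity of the ground truth. The polarity table and the weighting below are illustrative, not the released VAANet code.

```python
import torch
import torch.nn.functional as F

# Hypothetical polarity table for 8 emotion classes: 1 = positive, 0 = negative.
POLARITY = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])

def polarity_consistent_ce(logits, target, penalty=0.5):
    """Cross-entropy with an extra penalty on polarity-violating predictions
    (a sketch of the idea, not the exact loss used in the paper)."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pred = logits.argmax(dim=1)
    mismatch = (POLARITY[pred] != POLARITY[target]).float()
    return ((1.0 + penalty * mismatch) * ce).mean()
```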

AAAI Conference 2020 Conference Paper

Re-Attention for Visual Question Answering

  • Wenya Guo
  • Ying Zhang
  • Xiaoping Wu
  • Jufeng Yang
  • Xiangrui Cai
  • Xiaojie Yuan

Visual Question Answering (VQA) requires a simultaneous understanding of images and questions. Existing methods achieve good performance by focusing on both key objects in images and key words in questions. However, the answer also contains rich information that can help to better describe the image and generate more accurate attention maps. In this paper, to utilize the information in the answer, we propose a re-attention framework for the VQA task. We first associate image and question by calculating the similarity of each object-word pair in the feature space. Then, based on the answer, the learned model re-attends the corresponding visual objects in images and reconstructs the initial attention map to produce consistent results. Benefiting from the re-attention procedure, the question can be better understood and a satisfactory answer generated. Extensive experiments on the benchmark dataset demonstrate that the proposed method performs favorably against the state-of-the-art approaches.
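
A rough sketch of the attention-and-re-attention idea under simplifying assumptions: an initial attention map built from object-word similarity, an answer-guided map over the same objects, and a consistency term between them. The feature shapes and the combination rule are illustrative, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reattend(obj_feats, word_feats, ans_feat):
    """obj_feats: (K, D) region features, word_feats: (T, D) question words,
    ans_feat: (D,) answer embedding."""
    sim = obj_feats @ word_feats.T                 # (K, T) object-word similarity
    init_att = softmax(sim.max(axis=1))            # initial attention over objects
    re_att = softmax(obj_feats @ ans_feat)         # answer-guided re-attention
    # The gap between the two maps can serve as a consistency training signal.
    consistency = float(np.sum((init_att - re_att) ** 2))
    return init_att, re_att, consistency
```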

AAAI Conference 2019 Conference Paper

Learning from Web Data Using Adversarial Discriminative Neural Networks for Fine-Grained Classification

  • Xiaoxiao Sun
  • Liyi Chen
  • Jufeng Yang

Fine-grained classification focuses on recognizing the subordinate categories of one field, which requires a large number of labeled images that are expensive to obtain. Utilizing web data has been an attractive option to meet the demand for training data for convolutional neural networks (CNNs), especially when well-labeled data are scarce. However, directly training on such easily obtained images often leads to unsatisfactory performance due to factors such as noisy labels. This has been conventionally addressed by reducing the noise level of web data. In this paper, we take a fundamentally different view and propose an adversarial discriminative loss to advocate representation coherence between standard and web data. This is further encapsulated in a simple, scalable and end-to-end trainable multi-task learning framework. We experiment on three public datasets using large-scale web data to evaluate the effectiveness and generalizability of the proposed approach. Extensive experiments demonstrate that our approach performs favorably against the state-of-the-art methods.
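
An adversarial discriminative loss of this kind is commonly implemented with a gradient reversal layer and a domain discriminator; the sketch below shows that standard construction in PyTorch, with the caveat that whether the paper uses exactly this operator is an assumption.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal layer: identity in the forward pass, negated
    (and scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversing gradients pushes the feature extractor towards
        # representations the domain discriminator cannot separate.
        return -ctx.lambd * grad_output, None

def domain_adversarial_loss(features, domain_labels, discriminator, lambd=1.0):
    """features: (N, D) pooled features from standard and web images;
    domain_labels: (N,) LongTensor with 0 = standard, 1 = web;
    discriminator: nn.Module producing 2-way domain logits."""
    reversed_feats = GradReverse.apply(features, lambd)
    logits = discriminator(reversed_feats)
    return F.cross_entropy(logits, domain_labels)
```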

AAAI Conference 2018 Conference Paper

Automatic Model Selection in Subspace Clustering via Triplet Relationships

  • Jufeng Yang
  • Jie Liang
  • Kai Wang
  • Yong-Liang Yang
  • Ming-Ming Cheng

This paper addresses both the model selection (i.e., estimating the number of clusters K) and subspace clustering problems in a unified model. Real data typically distribute on a union of low-dimensional sub-manifolds embedded in a high-dimensional ambient space. In this regard, state-of-the-art subspace clustering approaches first learn the affinity among samples, followed by spectral clustering to generate the segmentation. However, arguably, the intrinsic geometrical structures among samples are rarely considered in the optimization process. In this paper, we propose to simultaneously estimate K and segment the samples according to the local similarity relationships derived from the affinity matrix. Given the correlations among samples, we define a novel data structure termed the Triplet, each of which reflects high relevance and locality among three samples that should be segmented into the same subspace. While the traditional pairwise distance can be close between inter-cluster samples lying on the intersection of two subspaces, such wrong assignments can be avoided by the hyper-correlation derived from the proposed triplets, thanks to the complementarity of multiple constraints. Subsequently, we propose to greedily optimize a new model selection reward to estimate K according to the correlations between inter-cluster triplets. We simultaneously optimize a fusion reward based on the similarities between triplets and clusters to generate the final segmentation. Extensive experiments on the benchmark datasets demonstrate the effectiveness and robustness of the proposed approach.
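
A minimal sketch of enumerating Triplet structures from an affinity matrix, assuming a triplet is three samples that are mutually among each other's k strongest neighbours; the exact construction rule in the paper may differ.

```python
import numpy as np

def find_triplets(affinity, k=5):
    """Enumerate triplets of samples that are mutually among each other's
    k strongest neighbours in the affinity matrix.

    affinity: (N, N) symmetric non-negative affinity matrix.
    Returns a list of (i, j, m) index triples with i < j < m.
    """
    aff = np.array(affinity, dtype=float)
    np.fill_diagonal(aff, -np.inf)                 # a sample is not its own neighbour
    knn = [set(np.argsort(row)[::-1][:k]) for row in aff]
    triplets = []
    n = aff.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if j not in knn[i] or i not in knn[j]:
                continue
            for m in range(j + 1, n):
                if m in knn[i] and i in knn[m] and m in knn[j] and j in knn[m]:
                    triplets.append((i, j, m))
    return triplets
```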

AAAI Conference 2018 Conference Paper

Retrieving and Classifying Affective Images via Deep Metric Learning

  • Jufeng Yang
  • Dongyu She
  • Yu-Kun Lai
  • Ming-Hsuan Yang

Affective image understanding has been extensively studied in the last decade as more and more users express emotion via visual content. While current algorithms based on convolutional neural networks aim to distinguish emotional categories in a discrete label space, the task is inherently ambiguous. This is mainly because emotional labels with the same polarity (i.e., positive or negative) are highly related, unlike concrete object concepts such as cat, dog and bird. To the best of our knowledge, few methods focus on leveraging this characteristic of emotions for affective image understanding. In this work, we address the problem of understanding affective images via deep metric learning and propose a multi-task deep framework to optimize both retrieval and classification goals. We propose sentiment constraints adapted from the triplet constraints, which are able to explore the hierarchical relation of emotion labels. We further exploit the sentiment vector as an effective representation to distinguish affective images, utilizing the texture representation derived from convolutional layers. Extensive evaluations on four widely-used affective datasets, i.e., Flickr and Instagram, IAPSa, Art Photo, and Abstract Paintings, demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods on both affective image retrieval and classification tasks.
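
One way to express sentiment constraints adapted from triplet constraints is a triplet loss whose margin grows when the negative sample comes from the opposite polarity; the polarity table and margins below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical polarity table: classes 0-3 positive, 4-7 negative.
POLARITY = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])

def sentiment_triplet_loss(anchor, positive, negative, anchor_label, negative_label,
                           margin_same=0.2, margin_cross=0.5):
    """Triplet loss with a polarity-dependent margin encoding the emotion
    label hierarchy (a sketch, not the paper's sentiment constraints).

    anchor/positive/negative: (N, D) embeddings; labels: (N,) class ids.
    """
    same_polarity = (POLARITY[anchor_label] == POLARITY[negative_label]).float()
    margin = same_polarity * margin_same + (1.0 - same_polarity) * margin_cross
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```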

IJCAI Conference 2018 Conference Paper

Text Emotion Distribution Learning via Multi-Task Convolutional Neural Network

  • Yuxiang Zhang
  • Jiamei Fu
  • Dongyu She
  • Ying Zhang
  • Senzhang Wang
  • Jufeng Yang

Emotion analysis of online user-generated textual content is important for natural language processing and social media analytics tasks. Most previous emotion analysis approaches focus on identifying users' emotional states from text by classifying emotions into one of a finite set of categories, e.g., joy, surprise, anger and fear. However, emotion analysis is inherently ambiguous, since a single sentence can evoke multiple emotions with different intensities. To address this problem, we introduce emotion distribution learning and propose a multi-task convolutional neural network for text emotion analysis. The end-to-end framework optimizes the distribution prediction and classification tasks simultaneously, and is able to learn robust representations for distribution datasets with annotations from different voters. While most works adopt the majority voting scheme for ground-truth labeling, we also propose a lexicon-based strategy to generate distributions from a single label, which provides prior information for the emotion classification. Experiments conducted on five public text datasets (i.e., SemEval, Fairy Tales, ISEAR, TEC, CBET) demonstrate that our proposed method performs favorably against the state-of-the-art approaches.
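
A minimal sketch of the joint objective such a multi-task framework optimizes: a KL-divergence term fitting the annotator distribution plus a cross-entropy term on the dominant emotion. The weighting and the use of the argmax as the dominant label are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def multitask_emotion_loss(logits, target_dist, alpha=0.5):
    """Joint distribution-learning and classification objective.

    logits: (N, C) network outputs; target_dist: (N, C) label distributions.
    """
    log_prob = F.log_softmax(logits, dim=1)
    kl = F.kl_div(log_prob, target_dist, reduction="batchmean")   # fit the distribution
    dominant = target_dist.argmax(dim=1)                          # majority-voted label
    ce = F.cross_entropy(logits, dominant)                        # classify the dominant emotion
    return alpha * kl + (1.0 - alpha) * ce
```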

AAAI Conference 2018 Conference Paper

Understanding Image Impressiveness Inspired by Instantaneous Human Perceptual Cues

  • Jufeng Yang
  • Yan Sun
  • Jie Liang
  • Yong-Liang Yang
  • Ming-Ming Cheng

With the explosion of visual information nowadays, millions of digital images are available to users. How to efficiently explore a large set of images and retrieve useful information thus becomes extremely important. Unfortunately, only some of the images impress the user at first glance. Others that make little sense in human perception are often discarded, while still costing valuable time and space. Therefore, it is important to identify these two kinds of images to relieve the load on online repositories and accelerate the information retrieval process. However, most of the existing image properties, e.g., memorability and popularity, are based on repeated human interactions, which limits the research and application of evaluating image quality in terms of instantaneous impression. In this paper, we propose a novel image property, called impressiveness, that measures how images impress people within a short-term contact. This is based on an impression-driven model inspired by a number of important human perceptual cues. To achieve this, we first collect three datasets in various domains, which are labeled according to the instantaneous sensation of the annotators. Then we investigate the impressiveness property via six established human perceptual cues as well as the corresponding features from pixel to semantic levels. Subsequently, we verify the consistency of the impressiveness, which can be quantitatively measured by multiple visual representations, and evaluate their latent relationships. Finally, we apply the proposed impressiveness property to rank images for an efficient image recommendation system.

IJCAI Conference 2017 Conference Paper

Joint Image Emotion Classification and Distribution Learning via Deep Convolutional Neural Network

  • Jufeng Yang
  • Dongyu She
  • Ming Sun

Visual sentiment analysis is attracting more and more attention with the increasing tendency to express emotions through visual content. Recent algorithms based on convolutional neural networks (CNNs) have considerably advanced emotion classification, which aims to distinguish differences among emotional categories and assigns a single dominant label to each image. However, the task is inherently ambiguous since an image usually evokes multiple emotions and its annotation varies from person to person. In this work, we address the problem via label distribution learning (LDL) and develop a multi-task deep framework that jointly optimizes both classification and distribution prediction. While the proposed method is best suited to distribution datasets with annotations from different voters, the majority voting scheme is widely adopted as the ground truth in this area, and few datasets provide multiple affective labels. Hence, we further exploit two weak forms of prior knowledge, expressed as similarity information between labels, to generate an emotional distribution for each category. Experiments conducted on both distribution datasets, i.e., Emotion6, Flickr_LDL and Twitter_LDL, and the largest single-emotion dataset, i.e., Flickr and Instagram, demonstrate that the proposed method outperforms the state-of-the-art approaches.
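
The prior-knowledge step, turning a single dominant label into a soft emotion distribution via label similarity, can be sketched as follows; the similarity source (e.g., polarity or wheel distance) and the temperature are illustrative assumptions.

```python
import numpy as np

def label_to_distribution(label, similarity, temperature=1.0):
    """Turn a single dominant emotion label into a soft distribution using
    prior similarity between emotion classes (illustrative; the paper's two
    specific priors are not reproduced here).

    label: int class id; similarity: (C, C) matrix with larger values for
    psychologically closer emotions.
    """
    weights = np.exp(similarity[label] / temperature)
    return weights / weights.sum()
```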

AAAI Conference 2017 Conference Paper

Learning Visual Sentiment Distributions via Augmented Conditional Probability Neural Network

  • Jufeng Yang
  • Ming Sun
  • Xiaoxiao Sun

Visual sentiment analysis is attracting more and more attention with the increasing tendency to express emotions through images. While most existing works assign a single dominant emotion to each image, we address the sentiment ambiguity by label distribution learning (LDL), motivated by the fact that an image usually evokes multiple emotions. Two new algorithms are developed based on the conditional probability neural network (CPNN). First, we propose BCPNN, which encodes the image label into a binary representation to replace the signless integers used in CPNN, and employs it as part of the input to the neural network. Then, we train our ACPNN model by adding noise to ground-truth labels and augmenting the affective distributions. Since current datasets are mostly annotated for single-label learning, we build two new datasets, one relabeled on the popular Flickr dataset and the other collected from Twitter. These datasets contain 20,745 images with multiple affective labels, over ten times larger than the existing ones. Experimental results show that the proposed methods outperform the state-of-the-art works on our large-scale datasets and other publicly available benchmarks.
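
A hedged sketch of the two ingredients named in the abstract: the binary label encoding used by BCPNN and a noise-augmentation step of the kind ACPNN applies to ground-truth distributions; the specific noise model is an assumption.

```python
import numpy as np

def binary_label_code(label, num_classes):
    """BCPNN-style input encoding: a one-hot (binary) vector in place of the
    signless integer label used by the original CPNN."""
    code = np.zeros(num_classes)
    code[label] = 1.0
    return code

def augment_distribution(dist, noise_scale=0.05, rng=None):
    """ACPNN-style augmentation sketch: perturb a ground-truth emotion
    distribution with small noise and renormalise."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.clip(dist + rng.normal(0.0, noise_scale, size=dist.shape), 0.0, None)
    return noisy / noisy.sum()
```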