Arrow Research search

Author name cluster

Feng Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers (30)

AAAI Conference 2026 Conference Paper

Learning Procedural-Aware Video Representations Through State-Grounded Hierarchy Unfolding

  • Jinghan Zhao
  • Yifei Huang
  • Feng Lu

Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, "task" and "step" descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce "states", i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to first ground representations in states before associating them with steps and, ultimately, high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pretraining strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.
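The following is a high-level sketch, not the paper's implementation, of the progressive "unfolding" schedule the abstract describes: the video encoder is aligned first with state descriptions, then with step descriptions, and finally with task descriptions, using a standard video-text contrastive objective at each stage. The encoder modules, dataloaders, and hyperparameters here are placeholder assumptions.

```python
import torch
import torch.nn.functional as F


def video_text_alignment_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss between L2-normalized video and text embeddings."""
    logits = video_emb @ text_emb.T / temperature
    targets = torch.arange(video_emb.shape[0])
    return F.cross_entropy(logits, targets)


def progressive_pretrain(video_encoder, text_encoder, loaders, optimizer):
    # loaders: one dataloader per semantic level, ordered from states to tasks,
    # so representations are grounded in states before steps and tasks.
    for level in ["state", "step", "task"]:
        for clips, texts in loaders[level]:
            v = F.normalize(video_encoder(clips), dim=-1)
            t = F.normalize(text_encoder(texts), dim=-1)
            loss = video_text_alignment_loss(v, t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


# Sanity check of the alignment loss with dummy embeddings.
v = F.normalize(torch.randn(8, 256), dim=-1)
t = F.normalize(torch.randn(8, 256), dim=-1)
print(float(video_text_alignment_loss(v, t)))
```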

AAAI Conference 2026 Conference Paper

MentalGuide: Towards Multi-Turn, State-Aware and Strategy-Driven Conversations for Mental Health Support

  • Jinwei He
  • Feng Lu

The global shortage of psychiatrists has become a critical issue, and the advent of large language models (LLMs) presents new opportunities to address this challenge. However, existing approaches continue to underperform in multi-turn mental health counseling, particularly in the arrangement of counseling strategies. To overcome these limitations, we propose MentalGuide, a state-aware and strategy-driven conversation framework designed for multi-turn mental health support. Our method integrates expert-derived prior probabilities of counseling strategies tailored to the target client's state with the reasoning capabilities of LLMs. This enables effective strategy formulation and strategy-driven response generation, without the need for additional training. Experimental results show that MentalGuide surpasses baselines in automated and human expert evaluations, demonstrating the closest alignment with real-world multi-turn counseling dynamics.

AAAI Conference 2026 Conference Paper

MoEGaze: A Mixture of Experts Approach for Generalizable Gaze Estimation

  • Zheng Liu
  • Feng Lu

Existing gaze estimation models often struggle to generalize to unseen users, primarily due to significant variations in individual appearance. Empirical observations reveal that performance improves when the visual appearance of test subjects closely resembles that of training subjects. Motivated by this, we propose a generalizable gaze estimation framework MoEGaze based on the Mixture of Experts (MoE) architecture. During training, the model extracts appearance features from facial images and uses them to route samples to specialized gaze expert networks, each tailored to a specific subset of appearances. Rather than directly predicting gaze, each expert outputs intermediate gaze features, which are dynamically aggregated according to the input appearance and then mapped to gaze prediction. This dynamic routing design enables the model to effectively adapt to users with diverse appearances, while also facilitating easier training on sub-datasets with smaller appearance variations. Extensive experiments demonstrate that our method achieves superior cross-domain performance compared to existing approaches, with an average improvement of 27.6% across four cross-domain metrics over the baseline. Furthermore, MoEGaze surpasses baselines trained on the full dataset while requiring only 10% of the training data.
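As an illustration of the routing idea described above, here is a minimal mixture-of-experts gaze head in PyTorch: an appearance feature routes each sample over gaze experts, each expert outputs an intermediate feature, and the weighted combination is mapped to a gaze prediction. The module names, feature sizes, and two-feature interface are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class MoEGazeHead(nn.Module):
    def __init__(self, feat_dim=512, num_experts=4, hidden_dim=256):
        super().__init__()
        # Router maps an appearance feature to a distribution over experts.
        self.router = nn.Sequential(nn.Linear(feat_dim, num_experts), nn.Softmax(dim=-1))
        # Each expert outputs an intermediate gaze feature, not a gaze value.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU()) for _ in range(num_experts)]
        )
        # Shared head maps the aggregated feature to a 2D gaze (pitch, yaw).
        self.gaze_head = nn.Linear(hidden_dim, 2)

    def forward(self, appearance_feat, image_feat):
        weights = self.router(appearance_feat)                                     # (B, E)
        expert_feats = torch.stack([e(image_feat) for e in self.experts], dim=1)   # (B, E, H)
        fused = (weights.unsqueeze(-1) * expert_feats).sum(dim=1)                  # (B, H)
        return self.gaze_head(fused)


# Usage with dummy features standing in for a backbone's outputs.
head = MoEGazeHead()
gaze = head(torch.randn(8, 512), torch.randn(8, 512))
print(gaze.shape)  # torch.Size([8, 2])
```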

NeurIPS Conference 2025 Conference Paper

3DPE-Gaze: Unlocking the Potential of 3D Facial Priors for Generalized Gaze Estimation

  • Yangshi Ge
  • Yiwei Bao
  • Feng Lu

In recent years, face-based deep-learning gaze estimation methods have achieved significant advancements. However, while face images provide supplementary information beneficial for gaze inference, the substantial extraneous information they contain also increases the risk of overfitting during model training and compromises generalization capability. To alleviate this problem, we propose the 3DPE-Gaze framework, which explicitly models 3D facial priors for feature decoupling and generalized gaze estimation. The 3DPE-Gaze framework consists of two core modules: the 3D Geometric Prior Module (3DGP), which incorporates the FLAME model to parameterize facial structures and gaze-irrelevant facial appearances while extracting gaze features, and the Semantic Concept Alignment Module (SCAM), which separates gaze-related and gaze-unrelated concepts through CLIP-guided contrastive learning. Finally, the 3DPE-Gaze framework combines 3D facial landmarks as priors for generalized gaze estimation. Experimental results show that 3DPE-Gaze outperforms existing state-of-the-art methods on four major cross-domain tasks, with particularly outstanding performance in challenging scenarios such as lighting variations, extreme head poses, and glasses occlusion.

NeurIPS Conference 2025 Conference Paper

In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

  • Taiying Peng
  • Jiacheng Hua
  • Miao Liu
  • Feng Lu

The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in a unified coordinate frame, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions using only global visual tokens. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.

AAAI Conference 2025 Conference Paper

NaFV-Net: An Adversarial Four-view Network for Mammogram Classification

  • Feng Lu
  • Yuxiang Hou
  • Wei Li
  • Xiangying Yang
  • Haibo Zheng
  • Wenxi Luo
  • Leqing Chen
  • Yuyang Cao

Breast cancer remains a leading cause of mortality among women, with millions of new cases diagnosed annually. Early detection through screening is crucial. Using neural networks to improve the accuracy of breast cancer screening has become increasingly important. In accordance with radiologists' practices, we propose using images from the unaffected side to create adversarial samples with critical medical implications in our adversarial learning process. By introducing beneficial perturbations, this method aims to reduce overconfidence and improve the precision and robustness of breast cancer classification. Our proposed framework is an adversarial quadruple-view classification network (NaFV-Net) incorporating images from both affected and unaffected perspectives. By comprehensively capturing local and global information and implementing adversarial learning from four mammography views, this framework allows for the fusion of features and the integration of medical principles and radiologist evaluation techniques, thus facilitating the accurate identification and characterization of breast tissues. Extensive experiments have shown the high effectiveness of our model in accurately distinguishing between benign and malignant findings, demonstrating state-of-the-art classification performance on both internal and public datasets.

AAAI Conference 2025 Conference Paper

TdAttenMix: Top-Down Attention Guided Mixup

  • Zhiming Wang
  • Lin Gu
  • Feng Lu

CutMix is a data augmentation strategy that cuts and pastes image patches to mix up training data. Existing methods pick either random or salient areas, which are often inconsistent with the labels, thus misguiding the training model. To the best of our knowledge, we are the first to integrate human gaze to guide CutMix. Since human attention is driven by both high-level recognition and low-level cues, we propose a controllable Top-down Attention Guided Module to obtain a general artificial attention that balances top-down and bottom-up attention. The proposed TdAttenMix then picks the patches and adjusts the label mixing ratio so as to focus on regions relevant to the current label. Experimental results demonstrate that our TdAttenMix outperforms existing state-of-the-art mixup methods across eight different benchmarks. Additionally, we introduce a new metric based on human gaze and use this metric to investigate the issue of image-label inconsistency.
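A minimal sketch of the attention-guided mixing idea, assuming an externally supplied per-pixel attention map: pick the most attended patch from image B, paste it into image A, and set the label mixing ratio from the attention mass inside the pasted region rather than its raw area. The box-selection rule and ratio rule below are illustrative, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


def attention_guided_cutmix(img_a, img_b, attn_b, patch=64):
    """img_*: (C, H, W) image tensors; attn_b: (H, W) attention map for img_b."""
    # Slide a patch-sized window and find the region of img_b with highest attention.
    pooled = F.avg_pool2d(attn_b[None, None], patch, stride=8)
    idx = pooled.flatten().argmax()
    ys = int(idx // pooled.shape[-1]) * 8
    xs = int(idx % pooled.shape[-1]) * 8
    mixed = img_a.clone()
    mixed[:, ys:ys + patch, xs:xs + patch] = img_b[:, ys:ys + patch, xs:xs + patch]
    # Label ratio: fraction of img_b's total attention mass inside the pasted box.
    lam_b = attn_b[ys:ys + patch, xs:xs + patch].sum() / attn_b.sum()
    return mixed, 1.0 - lam_b, lam_b  # mixed image, weight for label_a, weight for label_b


img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
attn_b = torch.rand(224, 224)
mixed, wa, wb = attention_guided_cutmix(img_a, img_b, attn_b)
print(mixed.shape, float(wa), float(wb))
```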

NeurIPS Conference 2025 Conference Paper

Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

  • Feng Lu
  • Tong Jin
  • Canming Ye
  • Xiangyuan Lan
  • Yunpeng Liu
  • Chun Yuan

Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
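A schematic sketch of the implicit-aggregation idea described above: learnable aggregation tokens are prepended to the patch tokens before a transformer block, processed jointly by self-attention, and the output aggregation tokens are concatenated as the global descriptor. The block here is a stock nn.TransformerEncoderLayer standing in for the backbone's own blocks, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitAggregation(nn.Module):
    def __init__(self, dim=384, num_agg_tokens=4, num_heads=6):
        super().__init__()
        self.agg_tokens = nn.Parameter(torch.zeros(1, num_agg_tokens, dim))
        nn.init.trunc_normal_(self.agg_tokens, std=0.02)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, patch_tokens):                 # (B, N, D) tokens from the backbone
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.agg_tokens.expand(B, -1, -1), patch_tokens], dim=1)
        tokens = self.block(tokens)                  # joint processing via self-attention
        agg = tokens[:, : self.agg_tokens.shape[1]]  # keep only the aggregation tokens
        desc = agg.flatten(1)                        # concatenate them as the descriptor
        return F.normalize(desc, dim=-1)


vpr = ImplicitAggregation()
print(vpr(torch.randn(2, 196, 384)).shape)  # torch.Size([2, 1536])
```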

JBHI Journal 2024 Journal Article

cbPPGGAN: A Generic Enhancement Framework for Unpaired Pulse Waveforms in Camera-Based Photoplethysmography

  • Ze Yang
  • Haofei Wang
  • Bo Liu
  • Feng Lu

Camera-based photoplethysmography (cbPPG) is a non-contact technique that measures cardiac-related blood volume alterations in skin surface vessels through the analysis of facial videos. While traditional approaches can estimate heart rate (HR) under different illuminations, their accuracy can be affected by motion artifacts, leading to poor waveform fidelity and hindering further analysis of heart rate variability (HRV); deep learning-based approaches reconstruct high-quality pulse waveforms, yet their performance significantly degrades under illumination variations. In this work, we aim to leverage the strengths of these two methods and propose a framework that possesses favorable generalization capabilities while maintaining waveform fidelity. For this purpose, we propose the cbPPGGAN, an enhancement framework for cbPPG that enables the flexible incorporation of both unpaired and paired data sources in the training process. Based on the waveforms extracted by traditional approaches, the cbPPGGAN reconstructs high-quality waveforms that enable accurate HR estimation and HRV analysis. In addition, to address the lack of paired training data in real-world applications, we propose a cycle consistency loss that guarantees time-frequency consistency before/after mapping. The method enhances the waveform quality of traditional POS approaches in different illumination tests (BH-rPPG) and cross-datasets (UBFC-rPPG) with mean absolute error (MAE) values of 1.34 bpm and 1.65 bpm, and average beat-to-beat (AVBB) values of 27.46 ms and 45.28 ms, respectively. Experimental results demonstrate that the cbPPGGAN enhances cbPPG signal quality and outperforms the state-of-the-art approaches in HR estimation and HRV analysis. The proposed framework opens a new pathway toward accurate HR estimation in an unconstrained environment.
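A rough sketch of a time-frequency cycle-consistency term, assuming two generators G (raw to enhanced) and F (enhanced to raw): the reconstructed waveform should match the input both in the time domain and in its magnitude spectrum. The exact loss used by cbPPGGAN may differ; this only illustrates the before/after consistency idea stated in the abstract.

```python
import torch
import torch.nn.functional as F


def time_frequency_cycle_loss(x, x_cycled, alpha=1.0):
    """x, x_cycled: (B, T) pulse waveforms, where x_cycled = F(G(x))."""
    time_term = F.l1_loss(x_cycled, x)
    # Compare magnitude spectra so the cycle also preserves frequency content.
    spec = torch.fft.rfft(x, dim=-1).abs()
    spec_cycled = torch.fft.rfft(x_cycled, dim=-1).abs()
    freq_term = F.l1_loss(spec_cycled, spec)
    return time_term + alpha * freq_term


x = torch.randn(4, 300)                       # ~10 s of waveform at 30 fps
x_cycled = x + 0.05 * torch.randn_like(x)     # stand-in for a near-perfect cycle
print(float(time_frequency_cycle_loss(x, x_cycled)))
```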

AAAI Conference 2024 Conference Paper

CGS-Mask: Making Time Series Predictions Intuitive for All

  • Feng Lu
  • Wei Li
  • Yifei Sun
  • Cheng Song
  • Yufei Ren
  • Albert Y. Zomaya

Artificial intelligence (AI) has immense potential in time series prediction, but most explainable tools have limited capabilities in providing a systematic understanding of important features over time. These tools typically rely on evaluating a single time point, overlook the time ordering of inputs, and neglect the time-sensitive nature of time series applications. These factors make it difficult for users, particularly those without domain knowledge, to comprehend AI model decisions and obtain meaningful explanations. We propose CGS-Mask, a post-hoc and model-agnostic cellular genetic strip mask-based saliency approach to address these challenges. CGS-Mask uses consecutive time steps as a cohesive entity to evaluate the impact of features on the final prediction, providing binary and sustained feature importance scores over time. Our algorithm optimizes the mask population iteratively to obtain the optimal mask in a reasonable time. We evaluated CGS-Mask on synthetic and real-world datasets, and it outperformed state-of-the-art methods in elucidating the importance of features over time. According to our pilot user study via a questionnaire survey, CGS-Mask is the most effective approach in presenting easily understandable time series prediction results, enabling users to comprehend the decision-making process of AI models with ease.
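An illustrative sketch of the strip-mask idea above: each feature is masked over a contiguous run of time steps (a "strip"), and a strip's importance is scored by how much the model's prediction changes when that strip is replaced by a baseline value. The cellular genetic search over mask populations is omitted; this only shows how a strip mask perturbs the input, with hypothetical helper names.

```python
import numpy as np


def apply_strip_mask(x, feature, start, length, baseline=0.0):
    """x: (T, F) time series; returns a copy with one strip masked out."""
    masked = x.copy()
    masked[start:start + length, feature] = baseline
    return masked


def strip_importance(model_fn, x, feature, start, length):
    # Larger change in the prediction => more important strip of time steps.
    return float(abs(model_fn(x) - model_fn(apply_strip_mask(x, feature, start, length))))


# Toy model: the prediction is the mean of feature 0 over the last 5 steps.
model_fn = lambda series: series[-5:, 0].mean()
x = np.random.rand(48, 3)
print(strip_importance(model_fn, x, feature=0, start=43, length=5))
```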

AAAI Conference 2024 Conference Paper

Deep Homography Estimation for Visual Place Recognition

  • Feng Lu
  • Shuting Dong
  • Lijun Zhang
  • Bingxi Liu
  • Xiangyuan Lan
  • Dongmei Jiang
  • Chun Yuan

Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, the hierarchical VPR methods have received considerable attention due to the trade-off between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This makes existing methods compromise to train the network only in global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract the features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. And it is more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
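A simplified sketch of a re-projection-error-of-inliers objective: warp matched keypoints from one image with a predicted homography and penalize their distance to the corresponding points in the other image, counting only pairs within an inlier threshold. How DHE-VPR obtains matches and handles inliers during training is not reproduced here; the threshold and fallback are assumptions.

```python
import torch


def reprojection_inlier_loss(H, pts_src, pts_dst, inlier_thresh=3.0):
    """H: (3, 3) predicted homography; pts_*: (N, 2) matched pixel coordinates."""
    ones = torch.ones(pts_src.shape[0], 1)
    warped = torch.cat([pts_src, ones], dim=1) @ H.T         # (N, 3) homogeneous coords
    warped = warped[:, :2] / warped[:, 2:].clamp(min=1e-8)   # back to pixel coords
    err = (warped - pts_dst).norm(dim=1)
    inliers = err < inlier_thresh
    if inliers.sum() == 0:
        return err.mean()          # fall back when nothing is within the threshold
    return err[inliers].mean()


H = torch.eye(3)
pts = torch.rand(100, 2) * 224
print(float(reprojection_inlier_loss(H, pts, pts + 0.5)))
```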

AAAI Conference 2024 Conference Paper

Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation

  • Lijun Zhang
  • Kangkang Zhou
  • Feng Lu
  • Xiang-Dong Zhou
  • Yu Shi

Most Graph Convolutional Network-based 3D human pose estimation (HPE) methods focus on single-view 3D HPE and utilize certain spatial graphs, leaving key problems such as depth ambiguity, insufficient feature representation, or limited receptive fields unresolved. To address these issues, we propose a multi-view 3D HPE framework based on a deep semantic graph transformer, which adaptively learns and fuses multi-view significant semantic features of human nodes to improve 3D HPE performance. First, we propose a deep semantic graph transformer encoder to enrich spatial feature information. It deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations. Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty. To enhance the pose spatial representation, deep spatial semantic features are interacted and fused across different viewpoints during monocular feature extraction. Furthermore, long-term relevant temporal dependencies are modeled and spatial-temporal information from all viewpoints is fused to intermediately supervise the depth. Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results. It can effectively enhance pose features, mitigate depth ambiguity in single-view 3D HPE, and improve 3D HPE performance without requiring camera parameters. Codes and models are available at https://github.com/z0911k/SGraFormer.

AAAI Conference 2024 Conference Paper

Gaze from Origin: Learning for Generalized Gaze Estimation by Embedding the Gaze Frontalization Process

  • Mingjie Xu
  • Feng Lu

Gaze estimation aims to accurately estimate the direction or position at which a person is looking. With the development of deep learning techniques, a number of gaze estimation methods have been proposed and achieved state-of-the-art performance. However, these methods are limited to within-dataset settings, whose performance drops when tested on unseen datasets. We argue that this is caused by infinite and continuous gaze labels. To alleviate this problem, we propose using gaze frontalization as an auxiliary task to constrain gaze estimation. Based on this, we propose a novel gaze domain generalization framework named Gaze Frontalization-based Auxiliary Learning (GFAL) Framework which embeds the gaze frontalization process, i.e., guiding the feature so that the eyeball can rotate and look at the front (camera), without any target domain information during training. Experimental results show that our proposed framework is able to achieve state-of-the-art performance on gaze domain generalization task, which is competitive with or even superior to the SOTA gaze unsupervised domain adaptation methods.

AAAI Conference 2024 Conference Paper

Gaze Target Detection by Merging Human Attention and Activity Cues

  • Yaokun Yang
  • Yihan Yin
  • Feng Lu

Despite achieving impressive performance, current methods for detecting gaze targets, which depend on visual saliency and spatial scene geometry, continue to face challenges when it comes to detecting gaze targets within intricate image backgrounds. One of the primary reasons for this lies in the oversight of the intricate connection between human attention and activity cues. In this study, we introduce an innovative approach that amalgamates the visual saliency detection with the body-part & object interaction both guided by the soft gaze attention. This fusion enables precise and dependable detection of gaze targets amidst intricate image backgrounds. Our approach attains state-of-the-art performance on both the Gazefollow benchmark and the GazeVideoAttn benchmark. In comparison to recent methods that rely on intricate 3D reconstruction of a single input image, our approach, which solely leverages 2D image information, still exhibits a substantial lead across all evaluation metrics, positioning it closer to human-level performance. These outcomes underscore the potent effectiveness of our proposed method in the gaze target detection task.

NeurIPS Conference 2024 Conference Paper

SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition

  • Feng Lu
  • Xinyao Zhang
  • Canming Ye
  • Shuting Dong
  • Lijun Zhang
  • Xiangyuan Lan
  • Chun Yuan

Visual place recognition (VPR) is an essential task for multiple applications such as augmented reality and robot localization. Over the past decade, mainstream methods in the VPR area have been to use feature representation based on global aggregation, as exemplified by NetVLAD. These features are suitable for large-scale VPR and robust against viewpoint changes. However, the VLAD-based aggregation methods usually learn a large number of (e.g., 64) clusters and their corresponding cluster centers, which directly leads to a high dimension of the yielded global features. More importantly, when there is a domain gap between the data in training and inference, the cluster centers determined on the training set are usually improper for inference, resulting in a performance drop. To this end, we first attempt to improve NetVLAD by removing the cluster center and setting only a small number of (e.g., only 4) clusters. The proposed method not only simplifies NetVLAD but also enhances the generalizability across different domains. We name this method SuperVLAD. In addition, by introducing ghost clusters that will not be retained in the final output, we further propose a very low-dimensional 1-Cluster VLAD descriptor, which has the same dimension as the output of GeM pooling but performs notably better. Experimental results suggest that, when paired with a transformer-based backbone, our SuperVLAD shows better domain generalization performance than NetVLAD with significantly fewer parameters. The proposed method also surpasses state-of-the-art methods with lower feature dimensions on several benchmark datasets. The code is available at https://github.com/lu-feng/SuperVLAD.
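A minimal sketch of a "VLAD without cluster centers" aggregation, following the abstract: patch tokens are softly assigned to a small number of clusters (e.g., 4), and the assigned features themselves are summed per cluster instead of residuals to learned centers. The assignment layer, dimensions, and normalization scheme are illustrative assumptions, and ghost clusters are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperVLADSketch(nn.Module):
    def __init__(self, dim=384, num_clusters=4):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)   # soft-assignment logits per token

    def forward(self, tokens):                       # (B, N, D) patch tokens
        a = F.softmax(self.assign(tokens), dim=-1)   # (B, N, K) soft assignments
        # Sum features per cluster, weighted by assignment (no center subtraction).
        vlad = torch.einsum("bnk,bnd->bkd", a, tokens)
        vlad = F.normalize(vlad, dim=-1)             # intra-cluster normalization
        return F.normalize(vlad.flatten(1), dim=-1)  # (B, K * D) global descriptor


agg = SuperVLADSketch()
print(agg(torch.randn(2, 196, 384)).shape)  # torch.Size([2, 1536])
```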

ICLR Conference 2024 Conference Paper

Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

  • Feng Lu
  • Lijun Zhang
  • Xiangyuan Lan
  • Shuting Dong
  • Yaowei Wang 0001
  • Chun Yuan 0003

Recent studies show that vision models pre-trained in generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between the tasks of model pre-training and VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR is still a key issue to address. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method to achieve both global and local adaptation efficiently, in which only lightweight adapters are tuned without adjusting the pre-trained model. Besides, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of the two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.

AAAI Conference 2024 Conference Paper

UVAGaze: Unsupervised 1-to-2 Views Adaptation for Gaze Estimation

  • Ruicong Liu
  • Feng Lu

Gaze estimation has become a subject of growing interest in recent research. Most of the current methods rely on single-view facial images as input. Yet, it is hard for these approaches to handle large head angles, leading to potential inaccuracies in the estimation. To address this issue, adding a second-view camera can help better capture eye appearance. However, existing multi-view methods have two limitations. 1) They require multi-view annotations for training, which are expensive. 2) More importantly, during testing, the exact positions of the multiple cameras must be known and match those used in training, which limits the application scenario. To address these challenges, we propose a novel 1-view-to-2-views (1-to-2 views) adaptation solution in this paper, the Unsupervised 1-to-2 Views Adaptation framework for Gaze estimation (UVAGaze). Our method adapts a traditional single-view gaze estimator for flexibly placed dual cameras. Here, the "flexibly" means we place the dual cameras in arbitrary places regardless of the training data, without knowing their extrinsic parameters. Specifically, the UVAGaze builds a dual-view mutual supervision adaptation strategy, which takes advantage of the intrinsic consistency of gaze directions between both views. In this way, our method can not only benefit from common single-view pre-training, but also achieve more advanced dual-view gaze estimation. The experimental results show that a single-view estimator, when adapted for dual views, can achieve much higher accuracy, especially in cross-dataset settings, with a substantial improvement of 47.0%. Project page: https://github.com/MickeyLLG/UVAGaze.
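A simplified sketch of the dual-view consistency at the heart of the adaptation: the gaze direction predicted from each camera's image should agree once mapped into a common frame by the relative rotation between the cameras. UVAGaze works without known extrinsics, so the rotation R here should be read as a jointly optimized variable rather than a calibrated input; only the consistency term is shown, with assumed shapes.

```python
import torch
import torch.nn.functional as F


def dual_view_consistency(gaze_view1, gaze_view2, R):
    """gaze_view*: (B, 3) unit gaze vectors; R: (3, 3) rotation from view 2 to view 1."""
    g2_in_1 = gaze_view2 @ R.T                           # express view-2 gaze in view-1 frame
    cos = F.cosine_similarity(gaze_view1, g2_in_1, dim=-1)
    return (1.0 - cos).mean()                            # 0 when the two views agree exactly


g1 = F.normalize(torch.randn(8, 3), dim=-1)
R = torch.eye(3)
print(float(dual_view_consistency(g1, g1, R)))  # ~0 for identical predictions
```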

AAAI Conference 2023 Conference Paper

A Composite Multi-Attention Framework for Intraoperative Hypotension Early Warning

  • Feng Lu
  • Wei Li
  • Zhiqiang Zhou
  • Cheng Song
  • Yifei Sun
  • Yuwei Zhang
  • Yufei Ren
  • Xiaofei Liao

Intraoperative hypotension (IOH) event warning plays a crucial role in preventing postoperative complications, such as postoperative delirium and mortality. Despite significant efforts, two fundamental problems limit its wide clinical use. The well-established IOH event warning systems are often built on proprietary medical devices that may not be available in all hospitals. The warnings are also triggered mainly by a predefined IOH event that might not be suitable for all patients. This work proposes a composite multi-attention (CMA) framework to tackle these problems by conducting short-term predictions of user-definable IOH events using vital signals at a low sampling rate together with demographic characteristics. Our framework leverages a multi-modal fusion network that takes four vital signals and three demographic characteristics as input modalities. For each modality, a multi-attention mechanism is used for feature extraction for better model training. Experiments on two large-scale real-world datasets show that our method can achieve up to 94.1% accuracy on IOH event early warning while the signal sampling rate is reduced by 3000 times. Our proposed CMA achieves a mean absolute error of 4.50 mmHg on the most challenging 15-minute mean arterial pressure prediction task, a 42.9% error reduction compared to existing solutions.

ICRA Conference 2023 Conference Paper

AANet: Aggregation and Alignment Network with Semi-hard Positive Sample Mining for Hierarchical Place Recognition

  • Feng Lu
  • Lijun Zhang
  • Shuting Dong
  • Baifan Chen
  • Chun Yuan 0003

Visual place recognition (VPR) is one of the research hotspots in robotics, which uses visual information to locate robots. Recently, the hierarchical two-stage VPR methods have become popular in this field due to the trade-off between accuracy and efficiency. These methods retrieve the top-k candidate images using the global features in the first stage, then re-rank the candidates by matching the local features in the second stage. However, they usually require additional algorithms (e.g., RANSAC) for geometric consistency verification in re-ranking, which is time-consuming. Here we propose a Dynamically Aligning Local Features (DALF) algorithm to align the local features under spatial constraints. It is significantly more efficient than the methods that need geometric consistency verification. We present a unified network capable of extracting global features for retrieving candidates via an aggregation module and aligning local features for re-ranking via the DALF alignment module. We call this network AANet. Meanwhile, many works use the simplest positive samples in triplets for weakly supervised training, which limits the ability of the network to recognize harder positive pairs. To address this issue, we propose a Semi-hard Positive Sample Mining (ShPSM) strategy to select appropriate hard positive images for training more robust VPR networks. Extensive experiments on four benchmark VPR datasets show that the proposed AANet can outperform several state-of-the-art methods with less time consumption. The code is released at https://github.com/Lu-Feng/AANet.

AAAI Conference 2023 Short Paper

AsT: An Asymmetric-Sensitive Transformer for Osteonecrosis of the Femoral Head Detection (Student Abstract)

  • Haoyang Chen
  • Shuai Liu
  • Feng Lu
  • Wei Li
  • Bin Sheng
  • Mi Li
  • Hai Jin
  • Albert Y. Zomaya

Early diagnosis of osteonecrosis of the femoral head (ONFH) can inhibit the progression and improve femoral head preservation. The radiographic difference between early ONFH and healthy femoral heads is not apparent to the naked eye. It is also hard to produce a large dataset to train the classification model. In this paper, we propose the Asymmetric-Sensitive Transformer (AsT), which captures the uneven development of the bilateral femoral heads to enable robust ONFH detection. Our ONFH detection applies the self-attention mechanism to femoral head regions while conferring sensitivity to uneven development through the attention-shared transformer. Real-world experimental studies show that AsT achieves the best performance, with an AUC of 0.9313, in the early diagnosis of ONFH and can reliably identify misdiagnosed cases.

IJCAI Conference 2023 Conference Paper

DFVSR: Directional Frequency Video Super-Resolution via Asymmetric and Enhancement Alignment Network

  • Shuting Dong
  • Feng Lu
  • Zhe Wu
  • Chun Yuan

Recently, techniques utilizing frequency-based methods have gained significant attention, as they exhibit exceptional restoration capabilities for detail and structure in video super-resolution tasks. However, most of these frequency-based methods have three major limitations: 1) insufficient exploration of object motion information, 2) inadequate enhancement for high-fidelity regions, and 3) loss of spatial information during convolution. In this paper, we propose a novel network, Directional Frequency Video Super-Resolution (DFVSR), to address these limitations. Specifically, we reconsider object motion from a new perspective and propose Directional Frequency Representation (DFR), which not only borrows the property of frequency representation of detail and structure information but also contains the direction information of the object motion that is extremely significant in videos. Based on this representation, we propose Directional Frequency-Enhanced Alignment (DFEA), which uses double enhancements of task-related information to ensure the retention of high-fidelity frequency regions and generate high-quality alignment features. Furthermore, we design a novel asymmetrical U-shaped network architecture to progressively fuse these alignment features and produce the final result. This architecture enables intercommunication at the same level of resolution between the encoder and decoder to supplement spatial information. Powered by the above designs, our method achieves superior performance over state-of-the-art models in both quantitative and qualitative evaluations.

AAAI Conference 2023 Short Paper

ES-Mask: Evolutionary Strip Mask for Explaining Time Series Prediction (Student Abstract)

  • Yifei Sun
  • Cheng Song
  • Feng Lu
  • Wei Li
  • Hai Jin
  • Albert Y. Zomaya

Machine learning models are increasingly used in time series prediction with promising results. However, model explanation for time series prediction lags behind model development and gives users little insight into model decisions. This paper proposes ES-Mask, a post-hoc and model-agnostic evolutionary strip mask-based saliency approach for time series applications. ES-Mask designs a mask consisting of strips with the same salient value in consecutive time steps to produce binary and sustained feature importance scores over time for easy understanding and interpretation of time series. ES-Mask uses an evolutionary algorithm to search for the optimal mask by manipulating strips in rounds, and is thus model-agnostic, involving no internal model states in the search. Initial experiments on the MIMIC-III dataset show that ES-Mask outperforms state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Learning a Generalized Gaze Estimator from Gaze-Consistent Feature

  • Mingjie Xu
  • Haofei Wang
  • Feng Lu

A gaze estimator computes the gaze direction based on face images. Most existing gaze estimation methods perform well under within-dataset settings but cannot generalize to unseen domains. In particular, the ground-truth labels in the unseen domain are often unavailable. In this paper, we propose a new domain generalization method based on gaze-consistent features. Our idea is to treat gaze-irrelevant factors as unfavorable interference and disturb the training data against them, so that the model cannot fit these gaze-irrelevant factors and instead fits only the gaze-consistent features. To this end, we first disturb the training data via adversarial attack or data augmentation based on the gaze-irrelevant factors, i.e., identity, expression, illumination and tone. Then we extract the gaze-consistent features by aligning the gaze features from disturbed data with non-disturbed gaze features. Experimental results show that our proposed method achieves state-of-the-art performance on the gaze domain generalization task. Furthermore, our proposed method also improves domain adaptation performance on gaze estimation. Our work provides new insight into the gaze domain generalization task.

JBHI Journal 2022 Journal Article

Explainable Diabetic Retinopathy Detection and Retinal Image Generation

  • Yuhao Niu
  • Lin Gu
  • Yitian Zhao
  • Feng Lu

Though deep learning has shown successful performance in classifying the label and severity stage of certain diseases, most such models give few explanations of how they make predictions. Inspired by Koch's Postulates, the foundation in evidence-based medicine (EBM) to identify the pathogen, we propose to exploit the interpretability of deep learning applications in medical diagnosis. By isolating neuron activation patterns from a diabetic retinopathy (DR) detector and visualizing them, we can determine the symptoms that the DR detector identifies as evidence to make its prediction. To be specific, we first define novel pathological descriptors using activated neurons of the DR detector to encode both spatial and appearance information of lesions. Then, to visualize the symptoms encoded in the descriptor, we propose Patho-GAN, a new network to synthesize medically plausible retinal images. By manipulating these descriptors, we can even arbitrarily control the position, quantity, and categories of generated lesions. We also show that our synthesized images carry the symptoms directly related to diabetic retinopathy diagnosis. Our generated images are both qualitatively and quantitatively superior to those produced by previous methods. Besides, compared to existing methods that take hours to generate an image, our seconds-level generation speed gives the method the potential to be an effective solution for data augmentation.

AAAI Conference 2022 Conference Paper

PureGaze: Purifying Gaze Feature for Generalizable Gaze Estimation

  • Yihua Cheng
  • Yiwei Bao
  • Feng Lu

Gaze estimation methods learn eye gaze from facial features. However, among the rich information in the facial image, real gaze-relevant features only correspond to subtle changes in the eye region, while other gaze-irrelevant features like illumination, personal appearance and even facial expression may affect the learning in an unexpected way. This is a major reason why existing methods show significant performance degradation in cross-domain/dataset evaluation. In this paper, we tackle the cross-domain problem in gaze estimation. Different from common domain adaptation methods, we propose a domain generalization method to improve the cross-domain performance without touching target samples. The domain generalization is realized by gaze feature purification. We eliminate gaze-irrelevant factors such as illumination and identity to improve the cross-domain performance. We design a plug-and-play self-adversarial framework for the gaze feature purification. The framework enhances not only our baseline but also existing gaze estimation methods directly and significantly. To the best of our knowledge, we are the first to propose domain generalization methods in gaze estimation. Our method achieves not only state-of-the-art performance among typical gaze estimation methods but also competitive results among domain adaptation methods. The code is released at https://github.com/yihuacheng/PureGaze.

AAAI Conference 2020 Conference Paper

A Coarse-to-Fine Adaptive Network for Appearance-Based Gaze Estimation

  • Yihua Cheng
  • Shiyao Huang
  • Fei Wang
  • Chen Qian
  • Feng Lu

Human gaze is essential for various appealing applications. Aiming at more accurate gaze estimation, a series of recent works propose to utilize face and eye images simultaneously. Nevertheless, face and eye images only serve as independent or parallel feature sources in those works, and the intrinsic correlation between their features is overlooked. In this paper we make the following contributions: 1) We propose a coarse-to-fine strategy which estimates a basic gaze direction from the face image and refines it with the corresponding residual predicted from eye images. 2) Guided by the proposed strategy, we design a framework which introduces a bi-gram model to bridge gaze residual and basic gaze direction, and an attention component to adaptively acquire suitable fine-grained features. 3) Integrating the above innovations, we construct a coarse-to-fine adaptive network named CA-Net and achieve state-of-the-art performance on MPIIGaze and EyeDiap.
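A bare-bones sketch of the coarse-to-fine decomposition described above: a face branch predicts a basic gaze direction and an eye branch predicts a residual that refines it. The real CA-Net adds a bi-gram model and adaptive attention between the two stages, which are omitted; feature dimensions and head layers here are assumptions.

```python
import torch
import torch.nn as nn


class CoarseToFineGaze(nn.Module):
    def __init__(self, face_dim=512, eye_dim=256):
        super().__init__()
        self.basic_head = nn.Linear(face_dim, 2)      # coarse gaze from the face feature
        self.residual_head = nn.Linear(eye_dim, 2)    # refinement from eye features

    def forward(self, face_feat, eye_feat):
        basic = self.basic_head(face_feat)            # basic gaze direction
        residual = self.residual_head(eye_feat)       # residual correction
        return basic + residual                       # refined (pitch, yaw)


model = CoarseToFineGaze()
print(model(torch.randn(4, 512), torch.randn(4, 256)).shape)  # torch.Size([4, 2])
```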

AAAI Conference 2020 Conference Paper

An Integrated Enhancement Solution for 24-Hour Colorful Imaging

  • Feifan Lv
  • Yinqiang Zheng
  • Yicheng Li
  • Feng Lu

The current industry practice for 24-hour outdoor imaging is to use a silicon camera supplemented with near-infrared (NIR) illumination. This results in color images with poor contrast in the daytime and an absence of chrominance at nighttime. To resolve this dilemma, all existing solutions try to capture RGB and NIR images separately. However, they need additional hardware support and suffer from various drawbacks, including short service life, high price, specific usage scenarios, etc. In this paper, we propose a novel and integrated enhancement solution that produces clear color images, whether in abundant-sunlight daytime or extremely low-light nighttime. Our key idea is to separate the VIS and NIR information from mixed signals, and enhance the VIS signal adaptively with the NIR signal as assistance. To this end, we build an optical system to collect a new VIS-NIR-MIX dataset and present a physically meaningful image processing algorithm based on a CNN. Extensive experiments show outstanding results, which demonstrate the effectiveness of our solution.

AAAI Conference 2020 Conference Paper

Separate in Latent Space: Unsupervised Single Image Layer Separation

  • Yunfei Liu
  • Feng Lu

Many real-world vision tasks, such as reflection removal from a transparent surface and intrinsic image decomposition, can be modeled as single image layer separation. However, this problem is highly ill-posed, requiring accurately aligned and hard-to-collect triplet data to train CNN models. To address this problem, this paper proposes an unsupervised method that requires no ground-truth data triplets in training. At the core of the method are two assumptions about data distributions in the latent spaces of different layers, based on which a novel unsupervised layer separation pipeline can be derived. The method can then be constructed on the GANs framework with self-supervision and cycle consistency constraints, etc. Experimental results demonstrate that it outperforms existing unsupervised methods on both synthetic and real-world tasks. The method also shows its ability to solve a more challenging multi-layer separation task.

AAAI Conference 2019 Conference Paper

Pathological Evidence Exploration in Deep Retinal Image Diagnosis

  • Yuhao Niu
  • Lin Gu
  • Feng Lu
  • Feifan Lv
  • Zongji Wang
  • Imari Sato
  • Zijian Zhang
  • Yangyan Xiao

Though deep learning has shown successful performance in classifying the label and severity stage of certain diseases, most such models give little evidence of how they make predictions. Here, we propose to exploit the interpretability of deep learning applications in medical diagnosis. Inspired by Koch's Postulates, a well-known strategy in medical research to identify the properties of a pathogen, we define a pathological descriptor that can be extracted from the activated neurons of a diabetic retinopathy detector. To visualize the symptoms and features encoded in this descriptor, we propose a GAN-based method to synthesize a pathological retinal image given the descriptor and a binary vessel segmentation. Besides, with this descriptor, we can arbitrarily manipulate the position and quantity of lesions. As verified by a panel of 5 licensed ophthalmologists, our synthesized images carry the symptoms that are directly related to diabetic retinopathy diagnosis. The panel survey also shows that our generated images are both qualitatively and quantitatively superior to those of existing methods.

IJCAI Conference 2018 Conference Paper

Mixed Causal Structure Discovery with Application to Prescriptive Pricing

  • Wei Wenjuan
  • Feng Lu
  • Liu Chunchen

Prescriptive pricing is one of the most advanced pricing techniques, which derives the optimal price strategy to maximize future profit/revenue by carrying out a two-stage process: demand modeling and price optimization. Demand modeling tries to reveal price-demand laws by discovering causal relationships among demands, prices, and objective factors, which is the foundation of price optimization. Existing methods use either regression or causal learning to uncover the price-demand relations, but suffer from limitations in either accuracy/efficiency or mixed data-type processing, all of which are actual requirements in practical pricing scenarios. This paper proposes a novel demand modeling technique for practical usage. Concretely, we propose a new locally consistent information criterion named MIC, and derive MIC-based inference algorithms for an accurate recovery of causal structure on a mixed factor space. Experiments on simulated/real datasets show the superiority of our new approach in both price-demand law recovery and demand forecasting, as well as promising performance in supporting optimal pricing.