Arrow Research search

Author name cluster

Feng Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

EAAI Journal 2026 Journal Article

An efficient knowledge tracing model via Mamba Contextual Encoding and Dynamic Sparse Attention mechanism

  • RuiJuan Zhang
  • Feng Zhang
  • Cong Liu

Knowledge Tracing (KT) predicts learners’ future performance by analyzing their historical learning records. While deep learning-based knowledge tracing models have significantly improved prediction performance, they suffer from substantial computational overhead and inefficiency when handling long interaction sequences. To address this problem, we propose an efficient knowledge tracing model named Mamba Contextual Encoding and Dynamic Sparse Attention Mechanism-based Knowledge Tracing (MCSKT). Firstly, leveraging the selective state space structure and linear-time complexity of Mamba, we design a dual-encoder composed of a question encoder and a knowledge encoder, which structurally disentangles the contextual dependencies at the question level and concept level. This design enhances semantic modeling capability while maintaining computational efficiency. Secondly, we propose a dynamic k-sparse attention mechanism to overcome the adaptability constraints inherent in traditional sparse attention methods that rely on manually configured static thresholds. This novel mechanism dynamically adjusts the filtering range of historical interactions, adaptively balancing noise suppression and critical information retention, while significantly reducing computational complexity. Experimental results demonstrate that MCSKT achieves an average improvement of 3.7% in Area Under the Curve (AUC) and 2.9% in Accuracy (ACC) across four public datasets. Moreover, compared with the state-of-the-art model, MCSKT achieves substantial acceleration, running approximately 10.1 times faster during training and 3.5 times faster during inference. In addition, the growth rate of time consumption for MCSKT is markedly slower than that of competing models as sequence length increases, highlighting its advantage in processing long-sequence data.
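
The k-sparse attention idea can be illustrated with a toy NumPy sketch (not the authors' implementation; a fixed `budget_ratio` here stands in for MCSKT's dynamically adjusted filtering range):

```python
import numpy as np

def k_sparse_attention(q, k, v, budget_ratio=0.5):
    """Sketch of a k-sparse attention step: for each query, only the
    top-k highest-scoring historical interactions are kept; the rest
    are masked out before the softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])               # (Tq, Tk)
    keep = max(1, int(np.ceil(budget_ratio * k.shape[0])))
    thresh = np.sort(scores, axis=-1)[:, -keep][:, None]  # k-th largest per row
    masked = np.where(scores >= thresh, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
out, w = k_sparse_attention(q, k, v, budget_ratio=0.3)
# each query attends to exactly ceil(0.3 * 10) = 3 keys
print((w > 0).sum(axis=-1))
```

What distinguishes the paper's mechanism from static sparse attention is that the budget is adjusted dynamically per input rather than fixed as it is here.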

AAAI Conference 2026 Conference Paper

DuoCast: Duo-Probabilistic Diffusion for Precipitation Nowcasting

  • Penghui Wen
  • Mengwei He
  • Patrick Filippi
  • Na Zhao
  • Feng Zhang
  • Thomas Francis Bishop
  • Zhiyong Wang
  • Kun Hu

Accurate short-term precipitation forecasting is critical for weather-sensitive decision-making in agriculture, transportation, and disaster response. Existing deep learning approaches often struggle to balance global structural consistency with local detail preservation, especially under complex meteorological conditions. We propose DuoCast, a dual-diffusion framework that decomposes precipitation forecasting into low- and high-frequency components modeled in orthogonal latent subspaces. We theoretically prove that this frequency decomposition reduces prediction error compared to conventional single branch U-Net diffusion models. In DuoCast, the low-frequency model captures large-scale trends via convolutional encoders conditioned on weather front dynamics, while the high-frequency model refines fine-scale variability using a self-attention-based architecture. Experiments on four benchmark radar datasets show that DuoCast consistently outperforms state-of-the-art baselines, achieving superior accuracy in both spatial detail and temporal evolution.
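
The frequency decomposition at the heart of DuoCast can be sketched at toy scale (an illustrative NumPy example, not the paper's learned latent subspaces): a smoothed field serves as the low-frequency component and the residual as the high-frequency one, so the split is lossless.

```python
import numpy as np

def frequency_split(field, kernel=5):
    """Split a 2-D precipitation field into low- and high-frequency parts.
    The low band is a simple box-filter smoothing (a stand-in for a learned
    low-frequency branch); the high band is the residual, so the two parts
    sum exactly back to the original field."""
    pad = kernel // 2
    padded = np.pad(field, pad, mode="edge")
    low = np.zeros_like(field, dtype=float)
    for dy in range(kernel):
        for dx in range(kernel):
            low += padded[dy:dy + field.shape[0], dx:dx + field.shape[1]]
    low /= kernel * kernel
    return low, field - low

rng = np.random.default_rng(1)
radar = rng.random((32, 32))
low, high = frequency_split(radar)
print(np.allclose(low + high, radar))  # lossless decomposition
```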

AAAI Conference 2026 Conference Paper

Vision-language Incremental Learning with Dual Class-individual Memory

  • Fuhai Chen
  • Feng Zhang
  • XiaoGuang Ma
  • Yiyi Zhou
  • Jiarong Liu
  • Xuri Ge

The emergence of multimodal technologies has propelled Vision-Language Incremental Learning (VLIL) into a research spotlight. Current VLIL approaches predominantly inherit unimodal paradigms, failing to address fundamental distinctions between visual and linguistic modalities. Crucially, the semantic gap between images and text creates divergent learning dynamics: visual data exhibits rich, distributed information while textual representations remain explicit and compact. Consequently, textual elements align with class-specific tasks, whereas individual images inherently span multiple such tasks, creating dual bottlenecks in class-level memory allocation and scene-level knowledge transfer. To overcome these challenges, we propose DCIM (Dual Class-Individual Memory), a novel framework featuring complementary mechanisms for vision-language continual learning. For class-level constraints, our Hierarchical Class Memory Management (HCMM) strategy dynamically allocates memory resources across object categories. It employs forgetting simulation to identify and preserve the most vulnerable samples, ensuring robust long-term knowledge retention. For scene-level adaptation, the Scene Reconstruction Memory (SRM) module captures generalized environmental representations, enabling contextual transfer to novel classes and disambiguation of semantically related concepts within shared scenes. Extensive experiments on two vision-language tasks, i.e., visual question answering (VQA) and image captioning (IC), demonstrate the effectiveness and excellent generalization ability of our approach, achieving superior performance under continual learning settings.

ICRA Conference 2025 Conference Paper

Dynamic Compact Consensus Tracking for Aerial Robots

  • Xiaolou Sun
  • Zhibin Quan
  • Feng Zhang
  • Yuntian Li
  • Chunyan Wang
  • Wufei Si
  • Wenhui Ni
  • Runwei Guan

Existing one-stream trackers have attracted widespread attention. However, they are not applicable in real-time aerial robot tracking systems due to substantial computational overhead, especially when dynamic templates are introduced. To address this issue, we propose a novel Dynamic Compact Consensus Tracker (DC²T), constructed by stacking blocks that each consist of a Compact Token Encoder (CTE) and Dynamic Consensus Attention (DCA). Unlike traditional methods that convert images into a large number of tokens, the CTE, inspired by “superpixel”, extracts a compact set of representative tokens from both initial and dynamic templates, eliminating the need for a large token set. This strategic reduction in the number of compact tokens markedly decreases the computational load of the CTE, enhancing the efficiency of subsequent attention operations. To achieve linear complexity of the DCA, compact dynamic template tokens (as keys) are requeried by search tokens (as queries) to perform dynamic consensus on the aggregated tokens (as values). This arrangement seamlessly incorporates dynamic spatio-temporal features into the DCA while avoiding the computational burden typically associated with dynamic templates. With the aim of further enhancing the system's responsiveness and accuracy, a direct control network is crafted to seamlessly incorporate the prediction of high-level control values into the tracking network, ensuring a cohesive and efficient interaction with the controller. Comprehensive experiments and real-world evaluations have proven DC²T's superior performance, accompanied by a significant reduction in FLOPs. Furthermore, we have conducted experiments that demonstrate the tracker's ability to integrate seamlessly with other technologies such as SLAM and detection, enabling precise tracking of arbitrary objects. The tracker code will be released at github.com/xiaolousun/refine-pytracking.

AAAI Conference 2025 Conference Paper

Multi-Label Few-Shot Image Classification via Pairwise Feature Augmentation and Flexible Prompt Learning

  • Han Liu
  • Yuanyuan Wang
  • Xiaotong Zhang
  • Feng Zhang
  • Wei Wang
  • Fenglong Ma
  • Hong Yu

Multi-label few-shot image classification is a crucial and challenging task due to limited annotated data and elusive category specificity. However, research on this topic is still in the rudimentary stage and few methods are available. Existing methods either leverage data augmentation to alleviate data scarcity or utilize label features as auxiliary knowledge to eliminate the negative effect caused by irrelevant categories, but they ignore the utilization of image region features for data augmentation and overlook learning appropriate text features to better match the image features of specific categories. Moreover, these methods focus on only one side and do not effectively tackle the above two issues simultaneously. In this paper, we introduce a novel prototype-based multi-label few-shot learning framework that seamlessly integrates pairwise feature augmentation and flexible prompt learning. Specifically, by pairwise feature augmentation, we leverage the region features of images in the support set to generate more image features and construct image prototypes, thus alleviating the issue of data scarcity. By flexible prompt learning, we adaptively acquire class-specific prompts to build text prototypes that highly match the image features of specific classes, thereby mitigating the impact of irrelevant classes. Finally, with adaptive learnable parameters, we merge image and text prototypes to obtain the final prototypes, achieving a more powerful classifier for multi-label few-shot image classification. Extensive experimental results demonstrate that our proposed method can push the performance to a higher level.

NeurIPS Conference 2025 Conference Paper

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

  • Zhenyuan Chen
  • Chenxi Wang
  • Ningyu Zhang
  • Feng Zhang

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,351 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC’s ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

EAAI Journal 2024 Journal Article

An obstacle avoidance-specific reinforcement learning method based on fuzzy attention mechanism and heterogeneous graph neural networks

  • Feng Zhang
  • Chengbin Xuan
  • Hak-keung Lam

Deep reinforcement learning (RL) is an advancing learning tool for handling robotics control problems. However, it typically suffers from limited sample efficiency and effectiveness. The emergence of Graph Neural Networks (GNNs) enables the integration of RL and graph representation learning techniques. It realises outstanding training performance and transfer capability by forming controlling scenarios into the corresponding graph domain. Nevertheless, existing approaches strongly depend on artificial graph formation processes with intensive bias and cannot propagate messages discriminatively over explicit physical dependence, which leads to restricted flexibility, limited size-transfer capability and suboptimal performance. This paper proposes a fuzzy attention mechanism-based heterogeneous graph neural network (FAM-HGNN) framework for resolving the control problem in the RL context. The FAM emphasises significant connections and weakens trivial connections in a fully connected graph, which mitigates the potential negative influence caused by the artificial graph formation process. The HGNN obtains a higher level of relational inductive bias by conducting graph propagations on a masked graph. Experimental results show that our FAM-HGNN outperforms the multi-layer perceptron-based and existing GNN-based RL approaches regarding training performance and size-transfer capability. We also conducted an ablation study and sensitivity analysis to further validate the efficacy of the proposed method.

AAAI Conference 2024 Conference Paper

Depression Detection via Capsule Networks with Contrastive Learning

  • Han Liu
  • Changya Li
  • Xiaotong Zhang
  • Feng Zhang
  • Wei Wang
  • Fenglong Ma
  • Hongyang Chen
  • Hong Yu

Depression detection is a challenging and crucial task in psychological illness diagnosis. Utilizing online user posts to predict whether a user suffers from depression seems an effective and promising direction. However, existing methods suffer from either poor interpretability brought by black-box models or underwhelming performance caused by a completely separate two-stage model structure. To alleviate these limitations, we propose a novel capsule network integrated with contrastive learning for depression detection (DeCapsNet). The highlights of DeCapsNet can be summarized as follows. First, it extracts symptom capsules from user posts by leveraging meticulously designed symptom descriptions, and then distills them into class-indicative depression capsules. The overall workflow follows an explicit hierarchical reasoning manner and can be well interpreted by the Patient Health Questionnaire-9 (PHQ-9), one of the most widely adopted questionnaires for depression diagnosis. Second, it integrates contrastive learning, which facilitates pulling embeddings from the same class closer while simultaneously pushing embeddings from different classes apart. In addition, by adopting the end-to-end training strategy, it does not necessitate additional data annotation, and mitigates the potential adverse effects from the upstream task on the downstream task. Extensive experiments on three widely-used datasets show that, in both within-dataset and cross-dataset scenarios, our proposed method outperforms other strong baselines significantly.
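
The contrastive objective described above — pulling same-class embeddings together while pushing different classes apart — can be sketched in NumPy (a generic supervised contrastive loss, not DeCapsNet's exact formulation):

```python
import numpy as np

def supervised_contrastive_loss(emb, labels, tau=0.5):
    """For each anchor, average -log(exp(sim_pos/tau) / sum_j exp(sim_j/tau))
    over positives that share the anchor's label."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # cosine similarity
    sim = emb @ emb.T / tau
    n = len(labels)
    total = 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        denom = sum(np.exp(sim[i, j]) for j in range(n) if j != i)
        total += -np.mean([np.log(np.exp(sim[i, j]) / denom) for j in pos])
    return total / n

rng = np.random.default_rng(7)
emb = np.vstack([rng.normal([3, 0], 0.1, size=(4, 2)),   # class 0 cluster
                 rng.normal([0, 3], 0.1, size=(4, 2))])  # class 1 cluster
good = supervised_contrastive_loss(emb, np.array([0, 0, 0, 0, 1, 1, 1, 1]))
bad = supervised_contrastive_loss(emb, np.array([0, 1, 0, 1, 0, 1, 0, 1]))
print(good < bad)  # labels matching the clusters yield the lower loss
```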

AAAI Conference 2024 Conference Paper

Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing

  • Han Liu
  • Siyang Zhao
  • Xiaotong Zhang
  • Feng Zhang
  • Wei Wang
  • Fenglong Ma
  • Hongyang Chen
  • Hong Yu

Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited by two issues: (1) inherent dissimilarities among classes make the transformation of features learned from seen classes to unseen classes both difficult and inefficient; (2) rare labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially for complicated scenarios. To alleviate the above issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without the necessity of training on seen classes. Specifically, for mining more related unseen category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples, and select the most representative ones as category anchors. After that, we convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction to fully leverage the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method can outperform other strong baselines significantly in few-shot and zero-shot tasks, even without using any seen class samples.

NeurIPS Conference 2024 Conference Paper

PLIP: Language-Image Pre-training for Person Representation Learning

  • Jialong Zuo
  • Jiahao Hong
  • Feng Zhang
  • Changqian Yu
  • Hanyu Zhou
  • Changxin Gao
  • Nong Sang
  • Jingdong Wang

Language-image pre-training is an effective technique for learning powerful representations in general domains. However, when directly turning to person representation learning, these general pre-training methods suffer from unsatisfactory performance. The reason is that they neglect critical person-related characteristics, i.e., fine-grained attributes and identities. To address this issue, we propose a novel language-image pre-training framework for person representation learning, termed PLIP. Specifically, we elaborately design three pretext tasks: 1) Text-guided Image Colorization, which aims to establish the correspondence between person-related image regions and fine-grained color-part textual phrases; 2) Image-guided Attributes Prediction, which aims to mine fine-grained attribute information of the person body in the image; and 3) Identity-based Vision-Language Contrast, which aims to correlate the cross-modal representations at the identity level rather than the instance level. Moreover, to implement our pre-training framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES by automatically generating textual annotations. We pre-train PLIP on SYNTH-PEDES and evaluate our models on a range of downstream person-centric tasks. PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings. The code, dataset and weights will be made publicly available.

NeurIPS Conference 2024 Conference Paper

Sim2Real-Fire: A Multi-modal Simulation Dataset for Forecast and Backtracking of Real-world Forest Fire

  • Yanzhi Li
  • Keqiu Li
  • Guohui Li
  • Zumin Wang
  • Changqing Ji
  • Lubo Wang
  • Die Zuo
  • Qing Guo

The latest research on wildfire forecast and backtracking has adopted AI models, which require a large amount of data from wildfire scenarios to capture fire spread patterns. This paper explores using cost-effective simulated wildfire scenarios to train AI models and applying them to the analysis of real-world wildfires. This solution requires AI models to minimize the Sim2Real gap, a brand-new topic in the fire spread analysis research community. To investigate the possibility of minimizing the Sim2Real gap, we collect the Sim2Real-Fire dataset, which contains 1M simulated scenarios with multi-modal environmental information for training AI models. We prepare 1K real-world wildfire scenarios for testing the AI models. We also propose a deep transformer, S2R-FireTr, which excels in considering the multi-modal environmental information for forecasting and backtracking the wildfire. S2R-FireTr surpasses state-of-the-art methods in real-world wildfire scenarios.

ICRA Conference 2024 Conference Paper

Square-Root Inverse Filter-based GNSS-Visual-Inertial Navigation

  • Jun Hu
  • Xiaoming Lang
  • Feng Zhang
  • Yinian Mao
  • Guoquan Huang 0003

While the Global Navigation Satellite System (GNSS) is often used to provide global positioning when available, its intermittency and/or inaccuracy calls for fusion with other sensors. In this paper, we develop a novel GNSS-Visual-Inertial Navigation System (GVINS) that fuses visual, inertial, and raw GNSS measurements within the square-root inverse sliding window filtering (SRI-SWF) framework in a tightly coupled fashion, and is thus termed SRI-GVINS. In particular, for the first time, we deeply fuse the GNSS pseudorange, Doppler shift, single-differenced pseudorange, and double-differenced carrier phase measurements, along with the visual-inertial measurements. Inherited from the SRI-SWF, the proposed SRI-GVINS gains significant numerical stability and computational efficiency over the state-of-the-art methods. Additionally, we propose to use a filter to sequentially initialize the reference frame transformation until it converges, rather than collecting measurements for batch optimization. We also perform online calibration of the GNSS-IMU extrinsic parameters to mitigate possible extrinsic parameter degradation. The proposed SRI-GVINS is extensively evaluated on our own collected UAV datasets and the results demonstrate that the proposed method is able to suppress VIO drift in real time and also show the effectiveness of online GNSS-IMU extrinsic calibration. The experimental validation on public datasets further reveals that the proposed SRI-GVINS outperforms the state-of-the-art methods in terms of both accuracy and efficiency.

NeurIPS Conference 2024 Conference Paper

UQ-Guided Hyperparameter Optimization for Iterative Learners

  • Jiesong Liu
  • Feng Zhang
  • Jiawei Guan
  • Xipeng Shen

Hyperparameter Optimization (HPO) plays a pivotal role in unleashing the potential of iterative machine learning models. This paper addresses a crucial aspect that has largely been overlooked in HPO: the impact of uncertainty in ML model training. The paper introduces the concept of uncertainty-aware HPO and presents a novel approach called the UQ-guided scheme for quantifying uncertainty. This scheme offers a principled and versatile method to empower HPO techniques in handling model uncertainty during their exploration of the candidate space. By constructing a probabilistic model and implementing probability-driven candidate selection and budget allocation, this approach enhances the quality of the resulting model hyperparameters. It achieves a notable performance improvement of over 50% in terms of accuracy regret and exploration time.
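
Probability-driven candidate selection can be sketched as Thompson-style sampling (an illustrative toy, not the paper's exact UQ-guided scheme; the Gaussian means and stds are assumed stand-ins for observed partial-training scores and their quantified uncertainty):

```python
import numpy as np

def uq_select(means, stds, rng):
    """Draw one sample per hyperparameter candidate from its Gaussian
    performance model and pick the best sample, so uncertain candidates
    keep a chance of being explored."""
    return int(np.argmax(rng.normal(means, stds)))

rng = np.random.default_rng(0)
means = np.array([0.60, 0.72, 0.65])   # observed validation scores
stds = np.array([0.01, 0.03, 0.10])    # quantified uncertainty
picks = np.bincount([uq_select(means, stds, rng) for _ in range(2000)],
                    minlength=3)
# the best-mean candidate wins most draws, but the high-uncertainty
# candidate still receives a share of the budget
print(picks.argmax())
```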

AAAI Conference 2023 Conference Paper

Boosting Few-Shot Text Classification via Distribution Estimation

  • Han Liu
  • Feng Zhang
  • Xiaotong Zhang
  • Siyang Zhao
  • Fenglong Ma
  • Xiao-ming Wu
  • Hongyang Chen
  • Hong Yu

Distribution estimation has been demonstrated as one of the most effective approaches for dealing with few-shot image classification, as the low-level patterns and underlying representations can be easily transferred across different tasks in the computer vision domain. However, directly applying this approach to few-shot text classification is challenging, since leveraging the statistics of known classes with sufficient samples to calibrate the distributions of novel classes may cause negative effects due to serious category differences in the text domain. To alleviate this issue, we propose two simple yet effective strategies to estimate the distributions of the novel classes by utilizing unlabeled query samples, thus avoiding the potential negative transfer issue. Specifically, we first assume a class or sample follows the Gaussian distribution, and use the original support set and the nearest few query samples to estimate the corresponding mean and covariance. Then, we augment the labeled samples by sampling from the estimated distribution, which can provide sufficient supervision for training the classification model. Extensive experiments on eight few-shot text classification datasets show that the proposed method outperforms state-of-the-art baselines significantly.
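
The estimate-then-sample idea can be mirrored at toy scale (a NumPy sketch under simplifying assumptions, not the authors' implementation; feature dimensions, counts, and the nearest-query heuristic are illustrative):

```python
import numpy as np

def calibrate_and_sample(support, queries, n_nearest=3, n_aug=50, rng=None):
    """Estimate a class Gaussian from the support set plus its nearest
    unlabeled query features, then sample synthetic labeled features."""
    if rng is None:
        rng = np.random.default_rng(0)
    center = support.mean(axis=0)
    dists = np.linalg.norm(queries - center, axis=1)
    nearest = queries[np.argsort(dists)[:n_nearest]]   # assumed same class
    pool = np.vstack([support, nearest])
    mean = pool.mean(axis=0)
    cov = np.cov(pool.T) + 1e-6 * np.eye(pool.shape[1])  # regularized
    return rng.multivariate_normal(mean, cov, size=n_aug)

rng = np.random.default_rng(42)
support = rng.normal(loc=2.0, size=(5, 4))     # 5 labeled shots
queries = rng.normal(loc=2.0, size=(20, 4))    # unlabeled query features
aug = calibrate_and_sample(support, queries, rng=rng)
print(aug.shape)  # (50, 4): synthetic features for classifier training
```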

NeurIPS Conference 2023 Conference Paper

HQA-Attack: Toward High Quality Black-Box Hard-Label Adversarial Attack on Text

  • Han Liu
  • Zhi Xu
  • Xiaotong Zhang
  • Feng Zhang
  • Fenglong Ma
  • Hongyang Chen
  • Hong Yu
  • Xianchao Zhang

Black-box hard-label adversarial attack on text is a practical and challenging task, as the text data space is inherently discrete and non-differentiable, and only the predicted label is accessible. Research on this problem is still in the embryonic stage and only a few methods are available. Nevertheless, existing methods rely on complex heuristic algorithms or unreliable gradient estimation strategies, which tend to fall into local optima and inevitably consume numerous queries, making it difficult to craft satisfactory adversarial examples with high semantic similarity and a low perturbation rate within a limited query budget. To alleviate the above issues, we propose a simple yet effective framework, named HQA-Attack, to generate high-quality textual adversarial examples under black-box hard-label attack scenarios. Specifically, after initializing an adversarial example randomly, HQA-Attack first constantly substitutes back as many original words as possible, thus shrinking the perturbation rate. Then it leverages the synonym set of the remaining changed words to further optimize the adversarial example in a direction that improves the semantic similarity and satisfies the adversarial condition simultaneously. In addition, during the optimization procedure, it searches for a transition synonym for each changed word, thus avoiding traversing the whole synonym set and reducing the query number to some extent. Extensive experimental results on five text classification datasets, three natural language inference datasets and two real-world APIs show that the proposed HQA-Attack method outperforms other strong baselines significantly.

NeurIPS Conference 2023 Conference Paper

Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction Network for Tone Mapping

  • Feng Zhang
  • Ming Tian
  • Zhiqiang Li
  • Bin Xu
  • Qingbo Lu
  • Changxin Gao
  • Nong Sang

Tone mapping aims to convert high dynamic range (HDR) images to low dynamic range (LDR) representations, a critical task in the camera imaging pipeline. In recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained attention due to their ability to strike a favorable balance between enhancement performance and computational efficiency. However, these methods often fail to deliver satisfactory results in local areas since the look-up table is a global operator for tone mapping, which works based on pixel values and fails to incorporate crucial local information. To this end, this paper aims to address this issue by exploring a novel strategy that integrates global and local operators by utilizing closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we employ image-adaptive 3D LUTs to manipulate the tone in the low-frequency image by leveraging the specific characteristics of the frequency information. Furthermore, we utilize local Laplacian filters to refine the edge details in the high-frequency components in an adaptive manner. Local Laplacian filters are widely used to preserve edge details in photographs, but their conventional usage involves manual tuning and fixed implementation within camera imaging pipelines or photo editing tools. We propose to learn parameter value maps progressively for local Laplacian filters from annotated data using a lightweight network. Our model achieves simultaneous global tone manipulation and local edge detail preservation in an end-to-end manner. Extensive experimental results on two benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
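
The global-plus-local split can be illustrated at toy scale (a NumPy sketch under simplifying assumptions: 2x average pooling for the pyramid's low band, a gamma curve in place of the image-adaptive 3D LUT, and a scalar gain in place of the learned local Laplacian filter parameters):

```python
import numpy as np

def pyramid_decompose(img):
    """One level of a closed-form Laplacian-style pyramid: the low band
    is a 2x average-pooled image; the high band is the residual against
    its nearest-neighbour upsampling, so the split is exactly invertible."""
    h, w = img.shape
    low = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    high = img - np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)
    return low, high

def tone_map(img, gamma=0.5, detail_gain=1.5):
    """Global curve on the low band, local gain on the high band."""
    low, high = pyramid_decompose(img)
    low = low ** gamma           # stand-in for the global 3D LUT
    high = detail_gain * high    # stand-in for the local Laplacian filter
    return np.repeat(np.repeat(low, 2, axis=0), 2, axis=1) + high

rng = np.random.default_rng(3)
hdr = rng.random((8, 8))
low, high = pyramid_decompose(hdr)
# identity settings invert the decomposition exactly
print(np.allclose(tone_map(hdr, gamma=1.0, detail_gain=1.0), hdr))
```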

JBHI Journal 2023 Journal Article

Multi-Task Learning With Hierarchical Guidance for Locating and Stratifying Submucosal Tumors

  • Ruifei Zhang
  • Feng Zhang
  • Si Qin
  • Dejun Fan
  • Chaowei Fang
  • Jie Ma
  • Xiang Wan
  • Guanbin Li

Locating and stratifying the submucosal tumor of the digestive tract from endoscopy ultrasound (EUS) images are of vital significance to the preliminary diagnosis of tumors. However, the above problems are challenging due to the poor appearance contrast between different layers of the digestive tract wall (DTW) and the narrowness of each layer. Few existing deep-learning-based diagnosis algorithms are devised to tackle this issue. In this article, we build a multi-task framework for simultaneously locating and stratifying the submucosal tumor. And considering that awareness of the DTW is critical to the localization and stratification of the tumor, we integrate the DTW segmentation task into the proposed multi-task framework. Beyond sharing a common backbone model, the three tasks are explicitly directed with a hierarchical guidance module, in which the probability map of the DTW itself is used to locally enhance the feature representation for tumor localization, and the probability maps of the DTW and tumor are jointly employed to locally enhance the feature representation for tumor stratification. Moreover, by means of the dynamic class activation map, probability maps of the DTW and tumor are reused to enforce the stratification inference process to pay more attention to DTW and tumor regions, contributing to a reliable and interpretable submucosal tumor stratification model. Additionally, considering that the relation with respect to other structures is beneficial for stratifying tumors, we devise a graph reasoning module to replenish non-local relation knowledge for the stratification branch. Experiments on a Stomach-Esophagus and an Intestinal EUS dataset prove that our method achieves very appealing performance on both tumor localization and stratification, significantly outperforming state-of-the-art object detection approaches.

AAAI Conference 2023 Conference Paper

SSPAttack: A Simple and Sweet Paradigm for Black-Box Hard-Label Textual Adversarial Attack

  • Han Liu
  • Zhi Xu
  • Xiaotong Zhang
  • Xiaoming Xu
  • Feng Zhang
  • Fenglong Ma
  • Hongyang Chen
  • Hong Yu

Hard-label textual adversarial attack is a challenging task, as only the predicted label information is available, and the text space is discrete and non-differentiable. Relevant research is still in its infancy and just a handful of methods have been proposed. However, existing methods suffer from either the high complexity of genetic algorithms or inaccurate gradient estimation, making it arduous to obtain adversarial examples with high semantic similarity and a low perturbation rate under tight-budget scenarios. In this paper, we propose a simple and sweet paradigm for hard-label textual adversarial attack, named SSPAttack. Specifically, SSPAttack first utilizes initialization to generate an adversarial example, and removes unnecessary replacement words to reduce the number of changed words. Then it determines the replacement order and searches for an anchor synonym, thus avoiding going through all the synonyms. Finally, it pushes substitution words towards the original words until an appropriate adversarial example is obtained. The core idea of SSPAttack is simply word swapping. Experimental results on eight benchmark datasets and two real-world APIs show that the performance of SSPAttack is sweet in terms of similarity, perturbation rate and query efficiency.

JBHI Journal 2022 Journal Article

AwCPM-Net: A Collaborative Constraint GAN for 3D Coronary Artery Reconstruction in Intravascular Ultrasound Sequences

  • Menghua Xia
  • Hongbo Yang
  • Yi Huang
  • Yanan Qu
  • Yi Guo
  • Guohui Zhou
  • Feng Zhang
  • Yuanyuan Wang

3D coronary artery reconstruction (3D-CAR) in intravascular ultrasound (IVUS) sequences allows quantitative analyses of vessel properties. Existing methods treat the two main tasks of 3D-CAR separately, namely cardiac phase retrieval (CPR) and membrane border extraction (MBE). They ignore the CPR-MBE connection, which could yield mutual promotion of both tasks. In this paper, we pioneer one-step 3D-CAR via a collaborative constraint generative adversarial network (GAN) named the AwCPM-Net. The AwCPM-Net consists of a dual-task collaborative generator and a dual-task constraint discriminator. The generator combines a self-supervised CPR branch with a semi-supervised MBE branch via a warming-up connection. The discriminator promotes dual-branch predictions simultaneously. The CPR branch requires no annotations and outputs inter-frame deformation fields used for identifying cardiac phases. Deformation fields are additionally constrained by the MBE branch and the discriminator. The MBE branch predicts membrane boundaries for each frame. Two aspects assist the semi-supervised segmentation: annotation augmentation by deformation fields of the CPR branch, and information exploitation on unlabeled images enabled by the GAN design. Trained and tested on an IVUS dataset acquired from atherosclerosis patients, the AwCPM-Net is effective in both CPR and MBE tasks, superior to state-of-the-art IVUS CPR or MBE methods. Hence, the AwCPM-Net reconstructs reliable 3D artery anatomy in the IVUS modality.

IJCAI Conference 2022 Conference Paper

Dite-HRNet: Dynamic Lightweight High-Resolution Network for Human Pose Estimation

  • Qun Li
  • Ziyi Zhang
  • Fu Xiao
  • Feng Zhang
  • Bir Bhanu

A high-resolution network exhibits remarkable capability in extracting multi-scale features for human pose estimation, but fails to capture long-range interactions between joints and has high computational complexity. To address these problems, we present a Dynamic lightweight High-Resolution Network (Dite-HRNet), which can efficiently extract multi-scale contextual information and model long-range spatial dependency for human pose estimation. Specifically, we propose two methods, dynamic split convolution and adaptive context modeling, and embed them into two novel lightweight blocks, which are named dynamic multi-scale context block and dynamic global context block. These two blocks, as the basic component units of our Dite-HRNet, are specially designed for the high-resolution networks to make full use of the parallel multi-resolution architecture. Experimental results show that the proposed network achieves superior performance on both COCO and MPII human pose estimation datasets, surpassing the state-of-the-art lightweight networks. Code is available at: https://github.com/ZiyiZhang27/Dite-HRNet.

EAAI Journal 2022 Journal Article

SEM: Safe exploration mask for q-learning

  • Chengbin Xuan
  • Feng Zhang
  • Hak-keung Lam

Most reinforcement learning algorithms focus on discovering the optimal policy to maximize reward while neglecting the safety issue during the exploration stage, which is not acceptable in industrial applications. This paper concerns an efficient method to improve the safety of the agent during the exploration stage in q-learning without any prior knowledge. We propose a novel approach named the safe exploration mask to reduce the number of safety violations in q-learning by modifying the transition probability of the environment. To this end, a safety indicator function consisting of a distance metric and a controllability metric is designed. The safety indicator function can be learned by the agent through bootstrapping without an additional optimization solver. We prove that the safety indicator function converges in tabular q-learning and introduce two tricks to mitigate divergence in approximation-based q-learning. Based on the safety indicator function, the safe exploration mask is generated to modify the original exploration policy by reducing the transition probability of unsafe actions. Finally, simulations in both discrete and continuous environments demonstrate the advantages, feasibility, and safety of our method in both discrete and continuous q-learning algorithms.
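The masking idea can be sketched for epsilon-greedy action selection in tabular q-learning; the threshold, argument layout, and function name are illustrative assumptions, not the paper's code:

```python
import random

def masked_action(q_row, safety_row, eps, rng, threshold=0.5):
    """Epsilon-greedy selection with a safe-exploration mask (sketch).

    Actions whose learned safety indicator falls below `threshold` are
    masked out, reducing their chance of being taken to zero. The safety
    indicator itself would be bootstrapped during training.
    """
    allowed = [a for a in range(len(q_row)) if safety_row[a] >= threshold]
    if not allowed:                 # fall back if every action looks unsafe
        allowed = list(range(len(q_row)))
    if rng.random() < eps:          # explore, but only among masked-in actions
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q_row[a])  # exploit among safe actions
```

Because the mask only filters the behavior policy, the underlying q-learning update is unchanged; this matches the abstract's framing of modifying the exploration policy rather than the reward.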

NeurIPS Conference 2022 Conference Paper

TREC: Transient Redundancy Elimination-based Convolution

  • Jiawei Guan
  • Feng Zhang
  • Jiesong Liu
  • Hsin-Hsuan Sung
  • Ruofan Wu
  • Xiaoyong Du
  • Xipeng Shen

The intensive computations in convolutional neural networks (CNNs) pose challenges for resource-constrained devices; eliminating redundant computations from convolution is essential. This paper gives a principled method to detect and avoid transient redundancy, a type of redundancy existing in input data or activation maps and hence changing across inferences. By introducing a new form of convolution (TREC), this new method makes transient redundancy detection and avoidance an inherent part of the CNN architecture, and the determination of the best configurations for redundancy elimination part of CNN backward propagation. We provide a rigorous proof of the robustness and convergence of TREC-equipped CNNs. TREC removes over 96% of computations and achieves 3.51x average speedups on microcontrollers with minimal (about 0.7%) accuracy loss.
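The redundancy-elimination idea can be illustrated on the im2col matrix multiply at the heart of convolution; for clarity this sketch deduplicates exactly identical patch rows, whereas TREC detects and clusters merely similar (transient) ones:

```python
import numpy as np

def trec_matmul_sketch(patches, filters):
    """Compute patches @ filters while skipping duplicate patch rows (sketch).

    patches: (n, k) im2col rows; filters: (k, m) flattened filter bank.
    Transient redundancy means which rows repeat changes per inference,
    so the grouping must be recomputed each time.
    """
    uniq, inverse = np.unique(patches, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)      # guard against NumPy-version shape quirks
    out_uniq = uniq @ filters          # one multiply per unique patch row
    return out_uniq[inverse]           # scatter results back to all rows
```

When many activation-map patches repeat (smooth image regions, saturated activations), the number of unique rows is far below `n`, which is the source of the computation savings.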

EAAI Journal 2021 Journal Article

A comprehensive survey on 2D multi-person pose estimation methods

  • Chen Wang
  • Feng Zhang
  • Shuzhi Sam Ge

Human pose estimation is a fundamental yet challenging computer vision task that has been studied by many researchers around the world in recent years. As a basic task in computer vision, multi-person pose estimation is the core component for many practical applications. This paper extensively reviews recent works on multi-person pose estimation. Specifically, we illustrate and analyze popular methods in detail and compare their pros and cons to fill in the gaps existing in other surveys. In addition, the commonly used datasets, evaluation metrics, and open-source systems are also introduced. Finally, we summarize the development of multi-person pose estimation frameworks and discuss the research trends.

AAAI Conference 2021 Conference Paper

Efficient Deep Image Denoising via Class Specific Convolution

  • Lu Xu
  • Jiawei Zhang
  • Xuanye Cheng
  • Feng Zhang
  • Xing Wei
  • Jimmy Ren

Deep neural networks have been widely used in image denoising during the past few years. Even though they achieve great success on this problem, they are computationally inefficient, which makes them inappropriate for mobile devices. In this paper, we propose an efficient deep neural network for image denoising based on pixel-wise classification. Although a computationally efficient network cannot effectively remove noise from arbitrary content, it is still capable of denoising a specific type of pattern or texture. The proposed method follows such a divide-and-conquer scheme. We first use an efficient U-net to pixel-wisely classify pixels in the noisy image based on the local gradient statistics. Then we replace part of the convolution layers in existing denoising networks by the proposed Class Specific Convolution layers (CSConv), which use different weights for different classes of pixels. Quantitative and qualitative evaluations on public datasets demonstrate that the proposed method can reduce the computational costs without sacrificing the performance compared to state-of-the-art algorithms.
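The class-specific convolution idea can be sketched for the 1x1 case: a per-pixel class map selects which weight matrix each pixel is multiplied by. The tensor layout and names are illustrative, not the paper's CSConv implementation:

```python
import numpy as np

def csconv_1x1(x, class_map, weights):
    """1x1 class-specific convolution (sketch).

    x: (H, W, Cin) feature map; class_map: (H, W) integer pixel classes
    (e.g. from a gradient-based classifier); weights: (K, Cin, Cout),
    one weight matrix per class.
    """
    h, w, cin = x.shape
    k, _, cout = weights.shape
    out = np.empty((h, w, cout), dtype=x.dtype)
    for c in range(k):                    # one dense multiply per class
        mask = class_map == c
        out[mask] = x[mask] @ weights[c]  # pixels of class c use weights[c]
    return out
```

The divide-and-conquer payoff is that each class's weight set can stay small, since it only has to handle one type of pattern or texture.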

IJCAI Conference 2020 Conference Paper

PewLSTM: Periodic LSTM with Weather-Aware Gating Mechanism for Parking Behavior Prediction

  • Feng Zhang
  • Ningxuan Feng
  • Yani Liu
  • Cheng Yang
  • Jidong Zhai
  • Shuhao Zhang
  • Bingsheng He
  • Jiazao Lin

In big cities, there are plenty of parking spaces, but we often find nowhere to park. For example, New York has 1.4 million cars and 4.4 million on-street parking spaces, but it is still not easy to find a parking place near our destination, especially during peak hours. The reason is the lack of prediction of parking behavior. If we could predict parking behavior in advance, we could ease this parking problem that affects human well-being. We observe that parking lots have periodic parking patterns, which is an important factor for parking behavior prediction. Unfortunately, existing work ignores such periodic parking patterns in parking behavior prediction, and thus incurs low accuracy. To solve this problem, we propose PewLSTM, a novel periodic weather-aware LSTM model that successfully predicts parking behavior based on historical records, weather, environments, and weekdays. PewLSTM has been successfully integrated into a real parking space reservation system, ThsParking, which is one of the top smart parking platforms in China. Based on 452,480 real parking records in 683 days from 10 parking lots, PewLSTM yields 85.3% parking prediction accuracy, which is about 20% higher than the state-of-the-art parking behavior prediction method. The code and data can be obtained from https://github.com/NingxuanFeng/PewLSTM.

AAAI Conference 2019 Conference Paper

Hierarchical Photo-Scene Encoder for Album Storytelling

  • Bairui Wang
  • Lin Ma
  • Wei Zhang
  • Wenhao Jiang
  • Feng Zhang

In this paper, we propose a novel model with a hierarchical photo-scene encoder and a reconstructor for the task of album storytelling. The photo-scene encoder contains two subencoders, namely the photo and scene encoders, which are stacked together and behave hierarchically to fully exploit the structure information of the photos within an album. Specifically, the photo encoder generates a semantic representation for each photo while exploiting temporal relationships among them. The scene encoder, relying on the obtained photo representations, is responsible for detecting the scene changes and generating scene representations. Subsequently, the decoder dynamically and attentively summarizes the encoded photo and scene representations to generate a sequence of album representations, based on which a story consisting of multiple coherent sentences is generated. In order to fully extract the useful semantic information from an album, a reconstructor is employed to reproduce the summarized album representations based on the hidden states of the decoder. The proposed model can be trained in an end-to-end manner, which results in improved performance over state-of-the-art methods on the public visual storytelling (VIST) dataset. Ablation studies further demonstrate the effectiveness of the proposed hierarchical photo-scene encoder and reconstructor.

AAAI Conference 2014 Conference Paper

k-CoRating: Filling Up Data to Obtain Privacy and Utility

  • Feng Zhang
  • Victor Lee
  • Ruoming Jin

For datasets in Collaborative Filtering (CF) recommendations, even if the identifier is deleted and some trivial perturbation operations are applied to ratings before they are released, there are research results claiming that the adversary could discriminate the individual's identity with a little bit of information. In this paper, we propose k-coRating, a novel privacy-preserving model, to retain data privacy by replacing some null ratings with "well-predicted" scores. They do not only mask the original ratings such that a k-anonymity-like data privacy is preserved, but also enhance the data utility (measured by prediction accuracy in this paper), which shows that the traditional assumption that accuracy and privacy are two goals in conflict is not necessarily correct. We show that the optimal k-coRated mapping is an NP-hard problem and design a naive but efficient algorithm to achieve k-coRating. All claims are verified by experimental results.
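The filling step can be sketched with a deliberately simple item-mean predictor; choosing which null cells to fill so that a k-anonymity-like property holds is the NP-hard part the paper addresses, and is only represented here by an externally supplied mask:

```python
import numpy as np

def co_rate_fill(ratings, fill_mask):
    """Fill chosen null cells with item-mean predictions (sketch).

    ratings: (users, items) matrix with np.nan for null ratings.
    fill_mask: boolean (users, items) marking which null cells to fill;
    in the paper this choice targets a k-anonymity-like property, and
    the predictor would be a proper CF model rather than an item mean.
    """
    item_means = np.nanmean(ratings, axis=0)  # per-item mean over known ratings
    filled = ratings.copy()
    idx = fill_mask & np.isnan(ratings)       # only ever fill null cells
    filled[idx] = np.broadcast_to(item_means, ratings.shape)[idx]
    return filled
```

Because the filled cells carry predicted rather than fabricated-at-random values, the same operation that masks the original rating profile can also improve downstream prediction accuracy, which is the paper's central observation.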

IROS Conference 2004 Conference Paper

Performance study of multi-agent scheduling and coordination framework for maintenance networks

  • Feng Zhang
  • Peter B. Luh
  • Eugene Santos Jr.

Real world maintenance networks often involve multi-organizational scheduling. The traditional centralized methods are not appropriate for solving the maintenance-scheduling problem because of the existence of private information and decision-making authorities at different organizations. Multi-agent scheduling and coordination is able to protect the private information and retain the decision-making authorities at different organizations. However, it presents its own challenges, such as obtaining a high-quality solution in a timely manner, providing the organizations with guidance on operating economically, being able to solve large-scale problems, and so on. In this paper, a price-based multi-agent scheduling and coordination framework is explored and a systematic study is carried out to evaluate the effects of factors on its performance. The empirical results not only show that the framework is able to quickly find high quality solutions for large-scale problems, but also reflect interesting relationships between selected factors such as resource utilization and performance measures such as mean asset turn-around-time.