Arrow Research search

Author name cluster

Feng Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers
2 author rows

Possible papers

57

AAAI Conference 2026 Conference Paper

ADAPT: Adaptive Decentralized Architecture with Perception-Aligned Training for Structural Generalization in Multi-Agent RL

  • Zhixiang Zhang
  • Shuo Chen
  • Yexin Li
  • Feng Wang

Multi-agent reinforcement learning (MARL) excels in cooperative and competitive tasks, but most architectures are tied to fixed input-output sizes and require retraining when the number of perceptible or controllable objects changes. While structural generalization techniques mitigate this, they rely on centralized training, raising concerns about scalability and privacy. We propose ADAPT, the first framework to support structural generalization under a decentralized training and decentralized execution (DTDE) paradigm. Every agent adopts an object-centric view, encoding each observed object into a feature vector and aggregating them into a variable-length set representation. To enable each agent to infer task-level contexts from this dynamic input independently, we propose a dynamic-consistency loss that enforces spatio-temporal alignment between context representations and observed environmental dynamics. Agents then condition their policies on the inferred contexts to make locally aligned decisions. For zero-shot transfer, we propose FINE (Foresight INdex for multi-agEnt), a metric that considers Q-value overestimation and enables cross-policy comparison of long-term impact, facilitating effective policy transfer. Experiments show that ADAPT surpasses existing DTDE methods and outperforms CTDE baselines in zero-shot generalization.

AAAI Conference 2026 Conference Paper

Unlocking Dynamic Inter-Client Spatial Dependencies: A Federated Spatio-temporal Graph Learning Method for Traffic Flow Forecasting

  • Feng Wang
  • Tianxiang Chen
  • Shuyue Wei
  • Qian Chu
  • Yi Zhang
  • Yifan Sun
  • Zhiming Zheng

Spatio-temporal graphs are powerful tools for modeling complex dependencies in traffic time series. However, the distributed nature of real-world traffic data across multiple stakeholders poses significant challenges in modeling and reconstructing inter-client spatial dependencies while adhering to data locality constraints. Existing methods primarily address static dependencies, overlooking their dynamic nature and resulting in suboptimal performance. In response, we propose Federated Spatio-Temporal Graph with Dynamic Inter-Client Dependencies (FedSTGD), a framework designed to model and reconstruct dynamic inter-client spatial dependencies in federated learning. FedSTGD incorporates a federated nonlinear computation decomposition module to approximate complex graph operations. This is complemented by a graph node embedding augmentation module, which alleviates performance degradation arising from the decomposition. These modules are coordinated through a client-server collective learning protocol, which decomposes dynamic inter-client spatial dependency learning tasks into lightweight, parallelizable subtasks. Extensive experiments on four real-world datasets demonstrate that FedSTGD achieves superior performance over state-of-the-art baselines in terms of RMSE, MAE, and MAPE, approaching that of centralized baselines. Ablation studies confirm the contribution of each module in addressing dynamic inter-client spatial dependencies, while sensitivity analysis highlights the robustness of FedSTGD to variations in hyperparameters.

IJCAI Conference 2025 Conference Paper

AlphaGAT: A Two-Stage Learning Approach for Adaptive Portfolio Selection

  • Shicheng Li
  • Jinshan Zhang
  • Feng Wang

Portfolio selection is a critical task in finance, involving the allocation of resources across various assets. However, current methods often struggle to maintain robust performance due to the inherent low signal-to-noise ratio in raw financial data and shifts in data distribution. We propose AlphaGAT, a novel two-stage learning approach for portfolio selection, designed to adapt to different market scenarios. Inspired by the concept of alpha factors, which transform historical market data into actionable signals, the first stage introduces an advanced model named CATimeMixer for alpha factor generation with a novel loss function to improve the effectiveness and robustness. CATimeMixer integrates TimeMixer with Conv1D (C) and cross-asset Attention (A). Specifically, Conv1D enhances TimeMixer by capturing trend and seasonal features across different scales, while cross-asset attention enables TimeMixer to extract interrelationships between different assets. The second stage applies reinforcement learning to dynamically adjust weights, integrating alpha factors into trading signals. Recognizing the varying effectiveness of alpha factors across different periods, our RL agent innovatively transforms the alpha factors into graphs and employs graph attention networks (GAT) to discern the significance of different alpha factors, enhancing policy robustness. Extensive experiments on real-world market data show that our approach outperforms state-of-the-art methods.

ICLR Conference 2025 Conference Paper

Autoregressive Pretraining with Mamba in Vision

  • Sucheng Ren
  • Xianhang Li
  • Haoqin Tu
  • Feng Wang
  • Fangxun Shu
  • Lei Zhang
  • Jieru Mei
  • Linjie Yang

The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies like mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384×384 inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.

AAAI Conference 2025 Conference Paper

Expensive Multi-Objective Bayesian Optimization Based on Diffusion Models

  • Bingdong Li
  • Zixiang Di
  • Yongfan Lu
  • Hong Qian
  • Feng Wang
  • Peng Yang
  • Ke Tang
  • Aimin Zhou

Multi-objective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs). However, effectively modeling complex distributions of the Pareto optimal solutions is difficult with limited function evaluations. Existing Pareto set learning algorithms may exhibit considerable instability in such expensive scenarios, leading to significant deviations between the obtained solution set and the Pareto set (PS). In this paper, we propose a novel Composite Diffusion Model based Pareto Set Learning algorithm (CDM-PSL) for expensive MOBO. CDM-PSL includes both unconditional and conditional diffusion models for generating high-quality samples efficiently. Besides, we introduce a weighting method based on information entropy to balance different objectives. This method is integrated with a guiding strategy to appropriately balance different objectives during the optimization process. Experimental results on both synthetic and real-world problems demonstrate that CDM-PSL attains superior performance compared with state-of-the-art MOBO algorithms.

AAAI Conference 2025 Conference Paper

HEP-NAS: Towards Efficient Few-shot Neural Architecture Search via Hierarchical Edge Partitioning

  • Jianfeng Li
  • Jiawen Zhang
  • Feng Wang
  • Lianbo Ma

One-shot methods have significantly advanced the field of neural architecture search (NAS) by adopting a weight-sharing strategy to reduce search costs. However, the accuracy of performance estimation can be compromised by co-adaptation. Few-shot methods divide the entire supernet into individual sub-supernets by splitting edge by edge to alleviate this issue, yet they neglect relationships among edges, resulting in performance degradation on huge search spaces. In this paper, we introduce HEP-NAS, a hierarchy-wise partition algorithm designed to further enhance accuracy. To begin with, HEP-NAS treats edges sharing the same end node as a hierarchy, permuting and splitting edges within the same hierarchy to directly search for the optimal operation combination for each intermediate node. This approach aligns more closely with the ultimate goal of NAS. Furthermore, HEP-NAS selects the most promising sub-supernet after each segmentation, progressively narrowing the search space in which the optimal architecture may exist. To improve performance evaluation of sub-supernets, HEP-NAS employs search space mutual distillation, stabilizing the training process and accelerating the convergence of each individual sub-supernet. Within a given budget, HEP-NAS enables the splitting of all edges and gradually searches for architectures with higher accuracy. Experimental results across various datasets and search spaces demonstrate the superiority of HEP-NAS compared to state-of-the-art methods.

TMLR Journal 2025 Journal Article

Learning to Prompt Your Domain for Federated Vision-Language Models

  • Guoyizhe Wei
  • Feng Wang
  • Anshul Shah
  • Rama Chellappa

The prompt tuning paradigm, with its great advantages of low parameter count and stable training, has recently inspired numerous applications of CLIP-like vision-language models in federated learning. However, in this work, we posit that under significant domain gaps across federated participants, prompt-based CLIP may easily collapse to non-optimal solutions due to the neglect of domain-aware knowledge. We present a novel prompt tuning method, termed ADAPT, to address this issue by learning both intra- and inter-domain prompts. Specifically, we assign each federated participant a domain-specific prompt and use the image's visual features as a condition to guide the generation of language features, with the underlying idea that the prompted CLIP should detect the input image's domain correspondence before making the prediction of its category. Extensive experiments demonstrate ADAPT's significant efficiency and effectiveness in federated learning. For example, by learning and sharing only 2.1M parameters, ADAPT attains a 69.8% average accuracy over the six domains of DomainNet, which improves the original CLIP accuracy by 16.2%.

NeurIPS Conference 2025 Conference Paper

MagCache: Fast Video Generation with Magnitude-Aware Cache

  • Zehong Ma
  • Longhui Wei
  • Feng Wang
  • Shiliang Zhang
  • Qi Tian

Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically, steadily in most timesteps while rapidly in the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.10×-2.68× speedups on Open-Sora, CogVideoX, Wan 2.1, and HunyuanVideo, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR, under similar computational budgets.
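
As a rough illustration of the skip decision described in the abstract, the snippet below accumulates the deviation of per-step magnitude ratios from 1 and reuses the cached residual while the accumulated error stays under a budget; the ratio values, error budget, consecutive-skip limit, and function names are illustrative assumptions, not MagCache's actual calibration.

```python
import numpy as np

def magcache_schedule(mag_ratios, error_budget=0.06, max_consecutive_skips=2):
    """Decide which diffusion timesteps to skip, given per-step magnitude ratios
    of successive residual outputs (hypothetical calibration values).

    A step is skipped when the accumulated deviation from 1.0 stays within the
    error budget; otherwise the step is computed and the budget resets.
    """
    skip = []
    accumulated_error = 0.0
    consecutive = 0
    for r in mag_ratios:
        accumulated_error += abs(1.0 - r)
        if accumulated_error < error_budget and consecutive < max_consecutive_skips:
            skip.append(True)    # reuse the cached residual at this step
            consecutive += 1
        else:
            skip.append(False)   # recompute; reset the error accumulator
            accumulated_error = 0.0
            consecutive = 0
    return skip

# Example: ratios decay slowly early on and drop toward the end, so early
# steps get skipped while the final steps are always recomputed.
ratios = np.linspace(0.99, 0.80, 30)
print(magcache_schedule(ratios.tolist()))
```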

AAAI Conference 2025 Conference Paper

VisRec: A Semi-Supervised Approach to Visibility Data Reconstruction in Radio Astronomy

  • Ruoqi Wang
  • Haitao Wang
  • Qiong Luo
  • Feng Wang
  • Hejun Wu

Radio telescopes produce visibility data about celestial objects, but these data are sparse and noisy. As a result, images created on raw visibility data are of low quality. Recent studies have used deep learning models to reconstruct visibility data to get cleaner images. However, these methods rely on a substantial amount of labeled training data, which requires significant labeling effort from radio astronomers. Addressing this challenge, we propose VisRec, a model-agnostic semi-supervised learning approach to visibility data reconstruction in radio astronomy. Specifically, VisRec consists of both a supervised learning module and an unsupervised learning module. In the supervised learning module, we introduce a set of data augmentation functions to produce diverse visibility examples. In comparison, the unsupervised learning module in VisRec augments unlabeled data and uses reconstructions from non-augmented visibility as pseudo-labels for training. This hybrid approach allows VisRec to effectively leverage both labeled and unlabeled data. This way, VisRec performs well even when labeled data is scarce. Our evaluation results show that VisRec is applicable to various models, and outperforms all baseline methods in terms of reconstruction quality, robustness, and generalizability.

JBHI Journal 2024 Journal Article

Attention-Based Temporal Graph Representation Learning for EEG-Based Emotion Recognition

  • Chao Li
  • Feng Wang
  • Ziping Zhao
  • Haishuai Wang
  • Björn W. Schuller

Due to the objectivity of emotional expression in the central nervous system, EEG-based emotion recognition can effectively reflect humans' internal emotional states. In recent years, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have made significant strides in extracting local features and temporal dependencies from EEG signals. However, CNNs ignore spatial distribution information from EEG electrodes; moreover, RNNs may encounter issues such as exploding/vanishing gradients and high time consumption. To address these limitations, we propose an attention-based temporal graph representation network (ATGRNet) for EEG-based emotion recognition. Firstly, a hierarchical attention mechanism is introduced to integrate feature representations from both frequency bands and channels ordered by priority in EEG signals. Second, a graph convolutional neural network with top-k operation is utilized to capture internal relationships between EEG electrodes under different emotion patterns. Next, a residual-based graph readout mechanism is applied to accumulate the EEG feature node-level representations into graph-level representations. Finally, the obtained graph-level representations are fed into a temporal convolutional network (TCN) to extract the temporal dependencies between EEG frames. We evaluated our proposed ATGRNet on the SEED, DEAP and FACED datasets. The experimental findings show that the proposed ATGRNet surpasses the state-of-the-art graph-based methods for EEG-based emotion recognition.

JBHI Journal 2024 Journal Article

Developing Deep LSTMs With Later Temporal Attention for Predicting COVID-19 Severity, Clinical Outcome, and Antibody Level by Screening Serological Indicators Over Time

  • Jiaxin Cai
  • Yang Li
  • Baichen Liu
  • Zhixi Wu
  • Shengjun Zhu
  • Qiliang Chen
  • Qing Lei
  • Hongyan Hou

Objective: The clinical course of COVID-19, as well as the immunological reaction, is notable for its extreme variability. Identifying the main associated factors might help understand the disease progression and physiological status of COVID-19 patients. The dynamic changes of the antibody against Spike protein are crucial for understanding the immune response. This work explores a temporal attention (TA) mechanism of deep learning to predict COVID-19 disease severity, clinical outcomes, and Spike antibody levels by screening serological indicators over time. Methods: We use feature selection techniques to filter feature subsets that are highly correlated with the target. The specific deep Long Short-Term Memory (LSTM) models are employed to capture the dynamic changes of disease severity, clinical outcome, and Spike antibody level. We also propose deep LSTMs with a TA mechanism to emphasize the later blood test records because later records often attract more attention from doctors. Results: Risk factors highly correlated with COVID-19 are revealed. LSTM achieves the highest classification accuracy for disease severity prediction. Temporal Attention Long Short-Term Memory (TA-LSTM) achieves the best performance for clinical outcome prediction. For Spike antibody level prediction, LSTM achieves the best performance. Conclusion: The experimental results demonstrate the effectiveness of the proposed models. The proposed models can provide a computer-aided medical diagnostics system by simply using time series of serological indicators.

ICRA Conference 2024 Conference Paper

Frame Fusion with Vehicle Motion Prediction for 3D Object Detection

  • Xirui Li
  • Feng Wang
  • Naiyan Wang
  • Chao Ma 0001

In LiDAR-based 3D detection, history point clouds contain rich temporal information helpful for future prediction. In the same way, history detections should contribute to future detections. In this paper, we propose a detection enhancement method, namely FrameFusion, which improves 3D object detection results by fusing history detection frames. In FrameFusion, we "forward" history frames to the current frame and apply weighted Non-Maximum-Suppression on dense bounding boxes to obtain a fused frame with merged boxes. To "forward" frames, we use vehicle motion models to estimate the future pose of the bounding boxes. Our method is flexible in motion model selection. We explore three motion models in our work and show how the unicycle model and the bicycle model improve turning cases. On Waymo Open Dataset, our FrameFusion method consistently improves the performance of various 3D detectors by about 2.0 vehicle LEVEL 2 APH with negligible latency and slightly enhances the performance of the temporal fusion method MPPNet. We also conduct extensive experiments on motion model selection.
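
The two steps named in the abstract, forwarding history boxes with a motion model and merging dense boxes with weighted NMS, can be sketched roughly as below; this uses a constant-velocity model and axis-aligned BEV boxes for brevity, so it only illustrates the idea rather than the paper's unicycle/bicycle variants.

```python
import numpy as np

def forward_boxes(boxes, velocities, dt):
    """Propagate history boxes [x, y, w, l] to the current frame with a
    constant-velocity model (simplified stand-in for the motion models)."""
    moved = boxes.copy()
    moved[:, :2] += velocities * dt   # shift box centers (x, y)
    return moved

def iou_xywl(a, b):
    """Axis-aligned BEV IoU between one box and an array of boxes."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2
    bx2, by2 = b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2
    iw = np.clip(np.minimum(ax2, bx2) - np.maximum(ax1, bx1), 0, None)
    ih = np.clip(np.minimum(ay2, by2) - np.maximum(ay1, by1), 0, None)
    inter = iw * ih
    union = a[2] * a[3] + b[:, 2] * b[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def weighted_nms(boxes, scores, iou_thresh=0.6):
    """Merge overlapping boxes by score-weighted averaging instead of
    discarding the lower-scored ones."""
    order = np.argsort(-scores)
    boxes, scores = boxes[order], scores[order]
    used = np.zeros(len(boxes), dtype=bool)
    merged_boxes = []
    for i in range(len(boxes)):
        if used[i]:
            continue
        group = (iou_xywl(boxes[i], boxes) >= iou_thresh) & ~used
        used |= group
        w = scores[group]
        merged = (boxes[group] * w[:, None]).sum(0) / w.sum()
        merged_boxes.append((merged, scores[i]))
    return merged_boxes
```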

AAAI Conference 2024 Conference Paper

MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis

  • Wenhao Guan
  • Yishuang Li
  • Tao Li
  • Hukai Huang
  • Feng Wang
  • Jiayan Lin
  • Lingyan Huang
  • Lin Li

The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech within a single system. The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space to enable the input of arbitrary modality as the style prompt in a single system, and 2) efficiently transferring the unified style representation into the given text content, thereby enabling the generation of speech in the prompted style. To address these problems, we propose an aligned multi-modal prompt encoder that embeds different modalities into a unified style space, supporting style transfer for different modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions (SAConv) to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to solve the problem of over-smoothed Mel-spectrograms and generate audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking head. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multi-modal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io.

NeurIPS Conference 2024 Conference Paper

MonkeySee: Space-time-resolved reconstructions of natural images from macaque multi-unit activity

  • Lynn Le
  • Paolo Papale
  • Katja Seeliger
  • Antonio Lozano
  • Thirza Dado
  • Feng Wang
  • Pieter Roelfsema
  • Marcel van Gerven

In this paper, we reconstruct naturalistic images directly from macaque brain signals using a convolutional neural network (CNN) based decoder. We investigate the ability of this CNN-based decoding technique to differentiate among neuronal populations from areas V1, V4, and IT, revealing distinct readout characteristics for each. This research marks a progression from low-level to high-level brain signals, thereby enriching the existing framework for utilizing CNN-based decoders to decode brain activity. Our results demonstrate high-precision reconstructions of naturalistic images, highlighting the efficiency of CNN-based decoders in advancing our knowledge of how the brain's representations translate into pixels. Additionally, we present a novel space-time-resolved decoding technique, demonstrating how temporal resolution in decoding can advance our understanding of neural representations. Moreover, we introduce a learned receptive field layer that sheds light on the CNN-based model's data processing during training, enhancing understanding of its structure and interpretive capacity.

AAAI Conference 2024 Conference Paper

Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification

  • Bohan Li
  • Xiao Xu
  • Xinghao Wang
  • Yutai Hou
  • Yunlong Feng
  • Feng Wang
  • Xuanliang Zhang
  • Qingfu Zhu

Existing image augmentation methods consist of two categories: perturbation-based methods and generative methods. Perturbation-based methods apply pre-defined perturbations to augment an original image, but only locally vary the image, thus lacking image diversity. In contrast, generative methods bring more image diversity in the augmented images but may not preserve semantic consistency, thus may incorrectly change the essential semantics of the original image. To balance image diversity and semantic consistency in augmented images, we propose SGID, a Semantic-guided Generative Image augmentation method with Diffusion models for image classification. Specifically, SGID employs diffusion models to generate augmented images with good image diversity. More importantly, SGID takes image labels and captions as guidance to maintain semantic consistency between the augmented and original images. Experimental results show that SGID outperforms the best augmentation baseline by 1.72% on ResNet-50 (from scratch), 0.33% on ViT (ImageNet-21k), and 0.14% on CLIP-ViT (LAION-2B). Moreover, SGID can be combined with other image augmentation baselines and further improves the overall performance. We demonstrate the semantic consistency and image diversity of SGID through quantitative human and automated evaluations, as well as qualitative case studies.

NeurIPS Conference 2024 Conference Paper

Towards Flexible 3D Perception: Object-Centric Occupancy Completion Augments 3D Object Detection

  • Chaoda Zheng
  • Feng Wang
  • Naiyan Wang
  • Shuguang Cui
  • Zhen Li

While 3D object bounding box (bbox) representation has been widely used in autonomous driving perception, it lacks the ability to capture the precise details of an object's intrinsic geometry. Recently, occupancy has emerged as a promising alternative for 3D scene perception. However, constructing a high-resolution occupancy map remains infeasible for large scenes due to computational constraints. Recognizing that foreground objects only occupy a small portion of the scene, we introduce object-centric occupancy as a supplement to object bboxes. This representation not only provides intricate details for detected objects but also enables higher voxel resolution in practical applications. We advance the development of object-centric occupancy perception from both data and algorithm perspectives. On the data side, we construct the first object-centric occupancy dataset from scratch using an automated pipeline. From the algorithmic standpoint, we introduce a novel object-centric occupancy completion network equipped with an implicit shape decoder that manages dynamic-size occupancy generation. This network accurately predicts the complete object-centric occupancy volume for inaccurate object proposals by leveraging temporal information from long sequences. Our method demonstrates robust performance in completing object shapes under noisy detection and tracking conditions. Additionally, we show that our occupancy features significantly enhance the detection results of state-of-the-art 3D object detectors, especially for incomplete or distant objects in the Waymo Open Dataset.

ECAI Conference 2023 Conference Paper

A Conditional Denoising Diffusion Probabilistic Model for Radio Interferometric Image Reconstruction

  • Ruoqi Wang
  • Zhuoyang Chen
  • Qiong Luo 0001
  • Feng Wang

In radio astronomy, signals from radio telescopes are transformed into images of observed celestial objects, or sources. However, these images, called dirty images, contain real sources as well as artifacts due to signal sparsity and other factors. Therefore, radio interferometric image reconstruction is performed on dirty images, aiming to produce clean images in which artifacts are reduced and real sources are recovered. So far, existing methods have limited success on recovering faint sources, preserving detailed structures, and eliminating artifacts. In this paper, we present VIC-DDPM, a Visibility and Image Conditioned Denoising Diffusion Probabilistic Model. Our main idea is to use both the original visibility data in the spectral domain and dirty images in the spatial domain to guide the image generation process with DDPM. This way, we can leverage DDPM to generate fine details and eliminate noise, while utilizing visibility data to separate signals from noise and retaining spatial information in dirty images. We have conducted experiments in comparison with both traditional methods and recent deep learning based approaches. Our results show that our method significantly improves the resulting images by reducing artifacts, preserving fine details, and recovering dim sources. This advancement further facilitates radio astronomical data analysis tasks on celestial phenomena. Our code is available at https://github.com/RapidsAtHKUST/VIC-DDPM.

NeurIPS Conference 2023 Conference Paper

Echoes Beyond Points: Unleashing the Power of Raw Radar Data in Multi-modality Fusion

  • Yang Liu
  • Feng Wang
  • Naiyan Wang
  • Zhao-Xiang Zhang

Radar is ubiquitous in autonomous driving systems due to its low cost and good adaptability to bad weather. Nevertheless, the radar detection performance is usually inferior because its point cloud is sparse and not accurate due to the poor azimuth and elevation resolution. Moreover, point cloud generation algorithms already drop weak signals to reduce the false targets which may be suboptimal for the use of deep fusion. In this paper, we propose a novel method named EchoFusion to skip the existing radar signal processing pipeline and then incorporate the radar raw data with other sensors. Specifically, we first generate the Bird's Eye View (BEV) queries and then take corresponding spectrum features from radar to fuse with other sensors. By this approach, our method could utilize both rich and lossless distance and speed clues from radar echoes and rich semantic clues from images, making our method surpass all existing methods on the RADIal dataset, and approach the performance of LiDAR. The code will be released on https://github.com/tusen-ai/EchoFusion.

NeurIPS Conference 2023 Conference Paper

Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction

  • Feng Wang
  • Zilong Chen
  • Guokang Wang
  • Yafei Song
  • Huaping Liu

In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas that result in redundancy in storage and computations, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method can reduce the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels with small hash tables. Besides, without the requirement to fit large numbers of temporally redundant features independently, our method is easier to optimize and converges rapidly, with only twenty minutes of training for a 300-frame dynamic scene. We evaluate our method on extensive dynamic scenes. As a result, MSTH obtains consistently better results than previous state-of-the-art methods with only 20 minutes of training time and 130 MB of memory storage.
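
A minimal sketch of the masked combination of 3D and 4D hash features described above; `hash3d`, `hash4d`, and `mask_net` are assumed stand-in modules with the call signatures shown, not the paper's actual implementation.

```python
import torch

def masked_space_time_feature(xyz, t, hash3d, hash4d, mask_net):
    """Blend a static 3D hash feature and a dynamic 4D hash feature with a
    learnable per-position mask in [0, 1].

    xyz: (N, 3) query positions, t: (N, 1) query times.
    hash3d / hash4d / mask_net: assumed encoder modules (hypothetical).
    """
    f_static = hash3d(xyz)                              # (N, F) from 3D hash grid
    f_dynamic = hash4d(torch.cat([xyz, t], dim=-1))     # (N, F) from 4D hash grid
    m = torch.sigmoid(mask_net(xyz))                    # (N, 1) static/dynamic mask
    return m * f_static + (1 - m) * f_dynamic
```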

TIST Journal 2022 Journal Article

AggEnhance: Aggregation Enhancement by Class Interior Points in Federated Learning with Non-IID Data

  • Jinxiang Ou
  • Yunheng Shen
  • Feng Wang
  • Qiao Liu
  • Xuegong Zhang
  • Hairong Lv

Federated learning (FL) is a privacy-preserving paradigm for multi-institutional collaborations, where aggregation is an essential procedure after training on the local datasets. Conventional aggregation algorithms often apply a weighted averaging of the updates generated from distributed machines to update the global model. However, when the data distributions are non-IID, the large discrepancy between the local updates might lead to a poor averaged result and a lower convergence speed, i.e., more iterations required to achieve a certain performance. To solve this problem, this article proposes a novel method named AggEnhance for enhancing the aggregation, where we synthesize a group of reliable samples from the local models and tune the aggregated result on them. These samples, named class interior points (CIPs) in this work, bound the relevant decision boundaries that ensure the performance of the aggregated result. To the best of our knowledge, this is the first work to explicitly design an enhancing method for the aggregation in prevailing FL pipelines. A series of experiments on real data demonstrates that our method noticeably improves convergence in non-IID scenarios. In particular, our approach reduces the iterations by 31.87% on average for the CIFAR10 dataset and 43.90% for the PASCAL VOC dataset. Since our method does not modify other procedures of FL pipelines, it is easy to apply to most existing FL frameworks. Furthermore, it does not require additional data transmitted from the local clients to the global server, thus holding the same security level as the original FL algorithms.

NeurIPS Conference 2022 Conference Paper

DART: Articulated Hand Model with Diverse Accessories and Rich Textures

  • Daiheng Gao
  • Yuliang Xiu
  • Kailin Li
  • Lixin Yang
  • Feng Wang
  • Peng Zhang
  • Bang Zhang
  • Cewu Lu

Hand, the bearer of human productivity and intelligence, is receiving much attention due to the recent fever of digital twins. Among different hand morphable models, MANO has been widely used in the vision and graphics community. However, MANO disregards textures and accessories, which largely limits its power to synthesize photorealistic hand data. In this paper, we extend MANO with Diverse Accessories and Rich Textures, namely DART. DART is composed of 50 daily 3D accessories which vary in appearance and shape, and 325 hand-crafted 2D texture maps covering different kinds of blemishes or make-ups. A Unity GUI is also provided to generate synthetic hand data with user-defined settings, e.g., pose, camera, background, lighting, textures, and accessories. Finally, we release DARTset, which contains large-scale (800K), high-fidelity synthetic hand images, paired with perfectly aligned 3D labels. Experiments demonstrate its superiority in diversity. As a complement to existing hand datasets, DARTset boosts the generalization in both hand pose estimation and mesh recovery tasks. Raw ingredients (textures, accessories), Unity GUI, source code and DARTset are publicly available at dart2022.github.io.

NeurIPS Conference 2022 Conference Paper

Fully Sparse 3D Object Detection

  • Lue Fan
  • Feng Wang
  • Naiyan Wang
  • Zhao-Xiang Zhang

As the perception range of LiDAR increases, LiDAR-based 3D object detection becomes a dominant task in the long-range perception task of autonomous driving. The mainstream 3D object detectors usually build dense feature maps in the network backbone and prediction head. However, the computational and spatial costs on the dense feature map are quadratic to the perception range, which makes them hardly scale up to the long-range setting. To enable efficient long-range LiDAR-based object detection, we build a fully sparse 3D object detector (FSD). The computational and spatial cost of FSD is roughly linear to the number of points and independent of the perception range. FSD is built upon the general sparse voxel encoder and a novel sparse instance recognition (SIR) module. SIR first groups the points into instances and then applies instance-wise feature extraction and prediction. In this way, SIR resolves the issue of center feature missing, which hinders the design of the fully sparse architecture for all center-based or anchor-based detectors. Moreover, SIR avoids the time-consuming neighbor queries in previous point-based methods by grouping points into instances. We conduct extensive experiments on the large-scale Waymo Open Dataset to reveal the working mechanism of FSD, and state-of-the-art performance is reported. To demonstrate the superiority of FSD in long-range detection, we also conduct experiments on Argoverse 2 Dataset, which has a much larger perception range (200 m) than Waymo Open Dataset (75 m). On such a large perception range, FSD achieves state-of-the-art performance and is 2.4× faster than the dense counterpart. Codes will be released.
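
A minimal sketch of the instance-wise feature extraction step in the spirit of SIR, assuming points have already been grouped into instances upstream; the max-pooling choice and function name are illustrative only.

```python
import torch

def instance_wise_pooling(point_feats, instance_ids):
    """Max-pool point features within each predicted instance.

    point_feats: (N, F) per-point features.
    instance_ids: (N,) integer instance label for each point.
    Returns the unique instance ids and one pooled feature per instance.
    """
    uniq = torch.unique(instance_ids)
    pooled = torch.stack([point_feats[instance_ids == i].max(dim=0).values
                          for i in uniq])
    return uniq, pooled
```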

TIST Journal 2022 Journal Article

GPSClean: A Framework for Cleaning and Repairing GPS Data

  • Chenglong Fang
  • Feng Wang
  • Bin Yao
  • Jianqiu Xu

The rise of GPS-equipped mobile devices has led to the emergence of big trajectory data. The collected raw data usually contain errors and anomalies caused by device failure, sensor error, and environment influence. Low-quality data fail to support application requirements, and therefore raw data must be comprehensively cleaned before usage. Existing methods are suboptimal at detecting GPS data errors and repairing them. To solve the problem, we propose a framework called GPSClean to analyze the anomalous data and develop effective methods to repair the data. There are primarily four modules in GPSClean: (i) data preprocessing, (ii) data filling, (iii) data repairing, and (iv) data conversion. For (i), we propose an approach named MDSort (Maximum Disorder Sorting) to efficiently solve the issue of data disorder. For (ii), we propose a method named NNF (Nearest Neighbor Filling) to fill missing data. For (iii), we design an approach named RCSWS (Range Constraints and Sliding Window Statistics) to repair anomalies and also improve the accuracy of data repairing by making use of driving direction. We use 45 million real trajectory records to evaluate our proposal in a prototype database system, SECONDO. Experimental results show that the accuracy of RCSWS is three times higher than the alternative method SCREEN and nearly an order of magnitude higher than the alternative method EWMA.
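
In the spirit of the RCSWS module, the sketch below flags points whose implied speed violates a range constraint and repairs them with a sliding-window median; the speed limit, window size, and point layout are assumptions for illustration, not GPSClean's actual parameters.

```python
import numpy as np

def sliding_window_repair(points, speed_limit=40.0, window=5):
    """Repair GPS points that violate a speed range constraint.

    points: (N, 3) array of [t, x, y], with t in seconds and x, y in meters.
    A point whose implied speed from the previous point exceeds the limit is
    replaced by the median of its neighbors within the sliding window.
    """
    repaired = points.copy()
    for i in range(1, len(points)):
        dt = max(points[i, 0] - points[i - 1, 0], 1e-6)
        speed = np.linalg.norm(repaired[i, 1:] - repaired[i - 1, 1:]) / dt
        if speed > speed_limit:   # range constraint violated: likely a GPS jump
            lo, hi = max(0, i - window), min(len(points), i + window + 1)
            repaired[i, 1:] = np.median(repaired[lo:hi, 1:], axis=0)
    return repaired
```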

JBHI Journal 2022 Journal Article

Particle-Based Calculation and Visualization of Protein Cavities Using SES Models

  • Li Feng
  • Feng Wang
  • Jian Zhang
  • Yong Tang
  • Jing Zhao
  • Lisha Zhou
  • Jiayan Wang
  • Dongliang Guo

The analysis of molecular cavities, where ligands interact with protein structures, plays a critical role in protein structure-based drug design. However, it is challenging because most cavity detection methods define cavity boundaries ambiguously. Cavities are mostly calculated from input parameters, which makes it difficult for users to explore them interactively. In this paper, we propose a novel method for the interactive exploration of cavity calculation and visualization. Firstly, the proposed method combines two solvent-excluded surface (SES) models of a given protein to define the boundaries and provide cavity emission points. Secondly, the system provides a user-guided interactive method that allows users to select cavities with simple clicking operations and to track the cavity identification and filling process based on position constraints. Finally, the selected cavities are rendered with a colorful depth-perception method. Experiments show that our work can effectively identify and calculate cavities.

AAAI Conference 2022 Conference Paper

Tracing Text Provenance via Context-Aware Lexical Substitution

  • Xi Yang
  • Jie Zhang
  • Kejiang Chen
  • Weiming Zhang
  • Zehua Ma
  • Feng Wang
  • Nenghai Yu

Text content created by humans or language models is often stolen or misused by adversaries. Tracing text provenance can help claim the ownership of text content or identify the malicious users who distribute misleading content like machine-generated fake news. There have been some attempts to achieve this, mainly based on watermarking techniques. Specifically, traditional text watermarking methods embed watermarks by slightly altering text format like line spacing and font, which, however, are fragile to cross-media transmissions like OCR. Considering this, natural language watermarking methods represent watermarks by replacing words in original sentences with synonyms from handcrafted lexical resources (e.g., WordNet), but they do not consider the substitution's impact on the overall sentence's meaning. Recently, a transformer-based network was proposed to embed watermarks by modifying the unobtrusive words (e.g., function words), which also impairs the sentence's logical and semantic coherence. Besides, one well-trained network fails on other, different types of text content. To address the limitations mentioned above, we propose a natural language watermarking scheme based on context-aware lexical substitution (LS). Specifically, we employ BERT to suggest LS candidates by inferring the semantic relatedness between the candidates and the original sentence. Based on this, a selection strategy in terms of synchronicity and substitutability is further designed to test whether a word is exactly suitable for carrying the watermark signal. Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences and has better transferability than existing methods. Besides, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark.
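
The candidate-generation step (suggesting context-aware substitutes with a masked language model) can be illustrated as below, using the Hugging Face fill-mask pipeline with bert-base-uncased as a stand-in; the synchronicity/substitutability selection strategy from the paper is not implemented here.

```python
from transformers import pipeline

# Masked-LM pipeline used as a stand-in candidate generator (hypothetical choice).
fill = pipeline("fill-mask", model="bert-base-uncased")

def substitution_candidates(sentence, target_word, top_k=5):
    """Mask one occurrence of target_word and return the model's top replacements."""
    masked = sentence.replace(target_word, fill.tokenizer.mask_token, 1)
    return [c["token_str"] for c in fill(masked, top_k=top_k)
            if c["token_str"].lower() != target_word.lower()]

print(substitution_candidates("The movie was absolutely wonderful.", "wonderful"))
```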

NeurIPS Conference 2021 Conference Paper

Boost Neural Networks by Checkpoints

  • Feng Wang
  • Guoyizhe Wei
  • Qiao Liu
  • Jinxiang Ou
  • Xian Wei
  • Hairong Lv

Training multiple deep neural networks (DNNs) and averaging their outputs is a simple way to improve predictive performance. Nevertheless, the multiplied training cost prevents this ensemble method from being practical and efficient. Several recent works attempt to save and ensemble the checkpoints of DNNs, which only requires the same computational cost as training a single network. However, these methods suffer from either marginal accuracy improvements due to the low diversity of checkpoints or a high risk of divergence due to the cyclical learning rates they adopt. In this paper, we propose a novel method to ensemble the checkpoints, where a boosting scheme is utilized to accelerate model convergence and maximize the checkpoint diversity. We theoretically prove that it converges by reducing exponential loss. The empirical evaluation also indicates that our proposed ensemble outperforms a single model and existing ensembles in terms of accuracy and efficiency. With the same training budget, our method achieves 4.16% lower error on Cifar-100 and 6.96% on Tiny-ImageNet with the ResNet-110 architecture. Moreover, the adaptive sample weights in our method make it an effective solution to address imbalanced class distributions. In the experiments, it yields up to 5.02% higher accuracy over a single EfficientNet-B0 on the imbalanced datasets.
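
An AdaBoost-flavored sketch of weighting saved checkpoints and averaging their predictions is given below; weighting on a held-out validation set and this exact update rule are illustrative assumptions, not the paper's training-time boosting scheme.

```python
import numpy as np

def boosted_checkpoint_ensemble(checkpoint_probs, labels_val):
    """Weight saved checkpoints AdaBoost-style on a validation set, then
    ensemble by a weighted average of their predicted probabilities.

    checkpoint_probs: list of (N, C) softmax outputs, one array per checkpoint.
    labels_val: (N,) ground-truth labels used only to set ensemble weights.
    Assumes each checkpoint does better than chance on the validation set.
    """
    n = len(labels_val)
    sample_w = np.full(n, 1.0 / n)
    alphas = []
    for probs in checkpoint_probs:
        wrong = probs.argmax(1) != labels_val
        err = np.clip((sample_w * wrong).sum(), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)      # checkpoint weight
        alphas.append(alpha)
        sample_w *= np.exp(alpha * wrong)          # up-weight misclassified samples
        sample_w /= sample_w.sum()
    weights = np.array(alphas) / np.sum(alphas)
    return sum(a * p for a, p in zip(weights, checkpoint_probs))
```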

JBHI Journal 2021 Journal Article

Deep Semisupervised Multitask Learning Model and Its Interpretability for Survival Analysis

  • Shengqiang Chi
  • Yu Tian
  • Feng Wang
  • Yu Wang
  • Ming Chen
  • Jingsong Li

Survival analysis is a commonly used method in the medical field to analyze and predict the time of events. In medicine, this approach plays a key role in determining the course of treatment, developing new drugs, and improving hospital procedures. Most of the existing work in this area has addressed the problem by making strong assumptions about the underlying stochastic process. However, these assumptions are usually violated in real-world data. This paper proposes a semisupervised multitask learning (SSMTL) method based on deep learning for survival analysis with or without competing risks. SSMTL transforms the survival analysis problem into a multitask learning problem that includes semisupervised learning and multipoint survival probability prediction. The distribution of survival times and the relationship between covariates and outcomes are modeled directly without any assumptions. Semisupervised loss and ranking loss are used to deal with censored data and the prior knowledge of the nonincreasing trend of the survival probability. Additionally, the importance of prognostic factors is determined, and the time-dependent and nonlinear effects of these factors on survival outcomes are visualized. The prediction performance of SSMTL is better than that of previous models in settings with or without competing risks, and the effects of predictors are successfully described. This study is of great significance for the exploration and application of deep learning methods involving medical structured data and provides an effective deep-learning-based method for survival analysis with complex-structured clinical data.

AAAI Conference 2021 Conference Paper

Exploiting Behavioral Consistence for Universal User Representation

  • Jie Gu
  • Feng Wang
  • Qinghui Sun
  • Zhiquan Ye
  • Xiaoxiao Xu
  • Jingmin Chen
  • Jun Zhang

User modeling is critical for developing personalized services in industry. A common way for user modeling is to learn user representations that can be distinguished by their interests or preferences. In this work, we focus on developing a universal user representation model. The obtained universal representations are expected to contain rich information and be applicable to various downstream applications without further modifications (e.g., user preference prediction and user profiling). Accordingly, we can be free from the heavy work of training task-specific models for every downstream task as in previous works. Specifically, we propose the Self-supervised User Modeling Network (SUMN) to encode behavior data into the universal representation. It includes two key components. The first one is a new learning objective, which guides the model to fully identify and preserve valuable user information under a self-supervised learning framework. The other one is a multi-hop aggregation layer, which benefits the model capacity in aggregating diverse behaviors. Extensive experiments on benchmark datasets show that our approach can outperform state-of-the-art unsupervised representation methods, and even compete with supervised ones.

ECAI Conference 2020 Conference Paper

Black-Box Adversarial Attacks Against Deep Learning Based Malware Binaries Detection with GAN

  • Junkun Yuan
  • Shaofang Zhou
  • Lanfen Lin
  • Feng Wang
  • Jia Cui

For efficient malware detection, there are more and more deep learning methods based on raw software binaries. Recent studies show that deep learning models can easily be fooled into making a wrong decision by introducing subtle perturbations to inputs, which attracts a large influx of work on adversarial attacks. However, most of the existing attack methods are based on manual features (e.g., API calls) or operate in the white-box setting, making the attacks impractical in current real-world scenarios. In this work, we propose a novel attack framework called GAPGAN, which generates adversarial payloads (padding bytes) with generative adversarial networks (GANs). To the best of our knowledge, it is the first work that performs end-to-end black-box attacks at the byte level against deep learning based malware binaries detection. In our attack framework, we map the input discrete malware binaries to a continuous space, then feed them to the generator of GAPGAN to generate adversarial payloads. We append the payloads to the original binaries to craft an adversarial sample while preserving its functionality. We propose to use a dynamic threshold to reduce the loss of effectiveness of the payloads when mapping them from the continuous format back to the original discrete format. To balance the attention of the generator between the payloads and the adversarial samples, we use an automatic weight tuning strategy. We train GAPGAN with both malicious and benign software. Once the training is finished, the generator can generate an adversarial sample with only the input malware in less than twenty milliseconds. We apply GAPGAN to attack the state-of-the-art detector MalConv and achieve a 100% attack success rate while appending payloads of only 2.5% of the total length of the data for detection. We also attack deep learning models with different structures under different defense methods. The experiments show that GAPGAN outperforms other state-of-the-art attack models in efficiency and effectiveness.
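
The append-only crafting step can be sketched as below: the generated payload is mapped back to discrete bytes and concatenated after the original binary so its functionality is untouched. The payload length and the argmax mapping are illustrative assumptions (the paper uses a dynamic threshold), and the GAN itself is not shown.

```python
import numpy as np

def append_adversarial_payload(malware_bytes, payload_logits, payload_len=None):
    """Append a generated payload to a malware binary without touching the
    original bytes, so functionality is preserved.

    malware_bytes: 1-D uint8 array holding the original binary.
    payload_logits: (L, 256) continuous generator output, one row per payload byte.
    """
    if payload_len is None:
        payload_len = int(0.025 * len(malware_bytes))   # ~2.5% of the input length
    # Map the continuous generator output back to discrete bytes (illustrative).
    payload = payload_logits[:payload_len].argmax(axis=1).astype(np.uint8)
    return np.concatenate([malware_bytes, payload])
```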

NeurIPS Conference 2020 Conference Paper

Unsupervised Representation Learning by Invariance Propagation

  • Feng Wang
  • Huaping Liu
  • Di Guo
  • Sun Fuchun

Unsupervised learning methods based on contrastive learning have drawn increasing attention and achieved promising results. Most of them aim to learn representations invariant to instance-level variations, which are provided by different views of the same instance. In this paper, we propose Invariance Propagation to focus on learning representations invariant to category-level variations, which are provided by different instances from the same category. Our method recursively discovers semantically consistent samples residing in the same high-density regions in representation space. We demonstrate a hard sampling strategy to concentrate on maximizing the agreement between the anchor sample and its hard positive samples, which provide more intra-class variations to help capture more abstract invariance. As a result, with a ResNet-50 as the backbone, our method achieves 71.3% top-1 accuracy on ImageNet linear classification and 78.2% top-5 accuracy when fine-tuning on only 1% of labels, surpassing previous results. We also achieve state-of-the-art performance on other downstream tasks, including linear classification on Places205 and Pascal VOC, and transfer learning on small scale datasets.
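
A compact sketch of a hard-positive contrastive objective in the spirit of the abstract, assuming positive and negative embeddings have already been mined; the temperature, number of hard positives, and tensor layout are illustrative choices.

```python
import torch
import torch.nn.functional as F

def hard_positive_contrastive_loss(anchor, positives, negatives, k=4, tau=0.2):
    """Contrastive loss that concentrates on the k hardest positives, i.e. the
    positives least similar to the anchor (requires at least k positives).

    anchor: (D,) embedding; positives: (P, D); negatives: (N, D).
    """
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = positives @ anchor                       # (P,) cosine similarities
    hard_pos = pos_sim.topk(k, largest=False).values   # k hardest positives
    neg_mass = torch.exp(negatives @ anchor / tau).sum()
    # Maximize agreement with hard positives against all negatives.
    pos_exp = torch.exp(hard_pos / tau)
    return -torch.log(pos_exp / (pos_exp + neg_mass)).mean()
```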

AAAI Conference 2019 Conference Paper

Adapting Translation Models for Transcript Disfluency Detection

  • Qianqian Dong
  • Feng Wang
  • Zhen Yang
  • Wei Chen
  • Shuang Xu
  • Bo Xu

Transcript disfluency detection (TDD) is an important component of real-time speech translation systems and has attracted increasing interest in recent years. This paper presents our study on adapting neural machine translation (NMT) models for TDD. We propose a general training framework for rapidly adapting NMT models to the TDD task. In this framework, the main structure of the model is implemented similarly to the NMT model. Additionally, several extended modules and training techniques which are independent of the NMT model are proposed to improve the performance, such as constrained decoding, denoising autoencoder initialization, and a TDD-specific training objective. With the proposed training framework, we achieve significant improvement. However, it is too slow in decoding to be practical. To build a feasible and production-ready solution for TDD, we propose a fast non-autoregressive TDD model following the recently emerged non-autoregressive NMT models. Although we do not assume a specific NMT architecture, we build our TDD model on the basis of the Transformer, the state-of-the-art NMT model. We conduct extensive experiments on the publicly available Switchboard set and an in-house Chinese set. Experimental results show that the proposed model significantly outperforms previous state-of-the-art models.

IJCAI Conference 2017 Conference Paper

MAT: A Multimodal Attentive Translator for Image Captioning

  • Chang Liu
  • Fuchun Sun
  • Changhu Wang
  • Feng Wang
  • Alan Yuille

In this work we formulate the problem of image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Different from most existing work where the whole image is represented by a convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects which serves as the source sequence of the RNN model. In this way, the sequential representation of an image can be naturally translated to a sequence of words, as the target sequence of the RNN model. To represent the image in a sequential way, we extract the object features in the image and arrange them in an order using convolutional neural networks. To further leverage the visual information from the encoded objects, a sequential attention layer is introduced to selectively attend to the objects that are related to generating the corresponding words in the sentences. Extensive experiments are conducted to validate the proposed approach on the popular benchmark dataset MS COCO, and the proposed model surpasses the state-of-the-art methods in all metrics following the dataset splits of previous work. The proposed approach is also evaluated by the evaluation server of the MS COCO captioning challenge, and achieves very competitive results, e.g., a CIDEr of 1.029 (c5) and 1.064 (c40).

AAMAS Conference 2010 Conference Paper

Developing High-level Cognitive Functions for Service Robots

  • Xiaoping Chen
  • Jianmin Ji
  • Jiehui Jiang
  • Guoqiang Jin
  • Feng Wang
  • Jiongkun Xie

The primary target of this work is human-robot collaboration, especially for service robots in complicated application scenarios. Three assumptions and four requirements are identified. State-of-the-art, general-purpose Natural Language Processing (NLP), Commonsense Reasoning (in particular, ASP), and Robotics techniques are integrated in a layered architecture. The architecture and mechanisms have been implemented on a service robot, Ke Jia. Instead of command languages, small limited segments of natural language are employed in spoken dialog between Ke Jia and its users. The information in the dialog is extracted, classified, and transferred into an inner representation by Ke Jia's NLP mechanism, and further used autonomously in problem solving and planning. A series of case studies was conducted on Ke Jia with positive results, verifying its ability to acquire knowledge through spoken dialog with users, to solve problems autonomously by virtue of acquired causal knowledge, and to plan autonomously for complex tasks.