Arrow Research search

Author name cluster

Qiang Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers
2 author rows

Possible papers

57

AAAI Conference 2026 Conference Paper

Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation

  • Shengqian Zhu
  • Chengrong Yu
  • Qiang Wang
  • Ying Song
  • Guangjun Li
  • Jiafei Wu
  • Xiaogang Xu
  • Zhang Yi

Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class annotations. However, existing methods either 1) adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, or 2) focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate the above issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS in this paper. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading cues from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global historical prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms current state-of-the-art methods, highlighting its robustness and generalization capabilities.
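
The calibration idea in this abstract, weighting a distillation term by prototype-to-feature similarity, can be sketched in a few lines. This is an illustrative NumPy toy under our own assumptions, not the paper's implementation; the function and variable names are ours:

```python
import numpy as np

def similarity_calibrated_distillation(feat_student, feat_teacher, prototypes, tau=1.0):
    """Toy sketch: weight a per-pixel L2 distillation term by the cosine
    similarity between teacher features and old-class prototypes, so regions
    that clearly match an old class are distilled harder.

    feat_*:      (H, W, C) feature maps
    prototypes:  (K, C), one prototype per old class
    """
    H, W, C = feat_teacher.shape
    f = feat_teacher.reshape(-1, C)
    # cosine similarity between every pixel feature and every prototype
    f_n = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    p_n = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    sim = f_n @ p_n.T                        # (H*W, K)
    # per-pixel calibration weight: confidence of the best-matching prototype
    w = np.clip(sim.max(axis=1), 0.0, 1.0)  # (H*W,)
    per_pixel = ((feat_student.reshape(-1, C) - f) ** 2).mean(axis=1)
    return float((w * per_pixel).mean() / tau)
```

Pixels whose features closely match an old-class prototype get a weight near 1 and are distilled strongly; ambiguous regions are down-weighted.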

AAAI Conference 2026 Conference Paper

D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos

  • Wenkang Zhang
  • Yan Zhao
  • Qiang Wang
  • Zhixin Xu
  • Li Song
  • Zhengxue Cheng

Free-Viewpoint Video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. Existing dynamic 3D Gaussian Splatting methods couple reconstruction with optimization-dependent compression and customized motion formats, limiting generalization and standardization. To address this, we propose D-FCGS, a novel Feedforward Compression framework for Dynamic Gaussian Splatting. Key innovations include: (1) a standardized Group-of-Frames (GoF) structure with I-P coding, leveraging sparse control points to extract inter-frame motion tensors; (2) a dual prior-aware entropy model that fuses hyperprior and spatial-temporal priors for accurate rate estimation; (3) a control-point-guided motion compensation mechanism and refinement network to enhance view-consistent fidelity. Trained on Gaussian frames derived from multi-view videos, D-FCGS generalizes across diverse scenes in a zero-shot fashion. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression compared to the baseline while preserving visual quality across viewpoints. This work advances feedforward compression of dynamic 3DGS, facilitating scalable FVV transmission and storage for immersive applications.

AAAI Conference 2026 Conference Paper

EC-MVSNet: Enhanced Cascaded Multi-View Stereo with Cross-Scale Relevance Integration

  • Shaoqian Wang
  • Jiadai Sun
  • Bin Fan
  • Qiang Wang
  • Bin Lu
  • Yuchao Dai

Cascade-based multi-scale architectures are currently the mainstream in Multi-view Stereo (MVS), achieving a balance between computational efficiency and reconstruction accuracy. However, existing cascade MVS methods suffer from significant limitations in cross-scale information utilization, where depth estimation processes operate independently across scales without fully exploiting the rich relevance between adjacent scales. To address this fundamental limitation, we propose an Enhanced Cascade Multi-View Stereo framework (EC-MVSNet), which introduces a novel cross-scale relevance integration strategy. Specifically, we introduce a Cross-Scale Feature-based Joint Construction (CFC) module to synergistically combine features from adjacent scales to build more reliable cost volumes. Additionally, a Cross-Scale Probability-guided Enhancement (CPE) module is proposed to propagate depth probability distributions across scales to guide cost volume enhancement. Furthermore, we propose a Monocular Feature-based Refinement (MFR) module to further enhance depth prediction accuracy by leveraging monocular priors. Extensive experiments demonstrate that EC-MVSNet achieves state-of-the-art performance on multiple benchmarks, validating the effectiveness of the cross-scale integration in improving MVS reconstruction quality.

AAAI Conference 2026 Conference Paper

GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery

  • Jizhou Han
  • Chenhao Ding
  • Songlin Dong
  • Yuhang He
  • Shaokun Wang
  • Qiang Wang
  • Yihong Gong

Continual Generalized Category Discovery (C-GCD) requires identifying novel classes from unlabeled data while retaining knowledge of known classes over time. Existing methods typically update classifier weights dynamically, resulting in forgetting and inconsistent feature alignment. We propose GOAL, a unified framework that introduces a fixed Equiangular Tight Frame (ETF) classifier to impose a consistent geometric structure throughout learning. GOAL conducts supervised alignment for labeled samples and confidence-guided alignment for novel samples, enabling stable integration of new classes without disrupting old ones. Experiments on four benchmarks show that GOAL outperforms prior methods, reducing forgetting by 16.1% and boosting novel class discovery by 3.2%, establishing a strong solution for long-horizon continual discovery.
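
The fixed Equiangular Tight Frame classifier mentioned here has a standard closed-form construction. The sketch below (our illustration, with assumed dimensions) builds a simplex ETF whose columns are unit-norm prototypes with pairwise cosine similarity exactly -1/(K-1):

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Construct a K-class simplex ETF in `dim` dimensions (assumes
    dim >= num_classes so a simple orthonormal basis suffices).
    Column k is the fixed, never-updated prototype of class k."""
    K = num_classes
    rng = np.random.default_rng(seed)
    # random orthonormal basis U: (dim, K)
    A = rng.standard_normal((dim, K))
    U, _ = np.linalg.qr(A)
    # centering matrix turns the basis into a regular simplex
    center = np.eye(K) - np.ones((K, K)) / K
    M = np.sqrt(K / (K - 1)) * U @ center
    return M  # (dim, K)
```

Because the prototypes are fixed before training, labeled and unlabeled samples are aligned to the same geometry throughout, which is the consistency property the abstract relies on.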

AAAI Conference 2026 Conference Paper

PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

  • Sijie Wang
  • Qiang Wang
  • Shaohuai Shi

Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing the inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the VAE module into two GPU groups whose executions can also be pipelined to reduce the memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06× to 4.02× speedups over OpenSoraPlan and HunyuanVideo.

AAAI Conference 2026 Conference Paper

SALR: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

  • Longteng Zhang
  • Sen Wu
  • Shuai Hou
  • Zhengyu Qing
  • Zhuo Zheng
  • Danning Ke
  • Qihong Lin
  • Qiang Wang

Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA’s performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of (1 - r/min(d, k)). To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by 2x, and delivers up to a 1.7x inference speedup.
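
The decomposition described, static magnitude pruning of the frozen base weights plus a truncated-SVD adapter recovering the discarded residual, can be illustrated directly. This is our sketch of the general recipe, not SALR's actual code:

```python
import numpy as np

def sparse_plus_lowrank(W, sparsity=0.5, rank=4):
    """Toy version of the recipe: magnitude-prune the frozen base weight W
    to the target sparsity, then recover the discarded residual with a
    rank-`rank` truncated SVD, giving W ~= W_sparse + B @ A
    (a LoRA-style low-rank adapter)."""
    thresh = np.quantile(np.abs(W), sparsity)
    W_sparse = np.where(np.abs(W) >= thresh, W, 0.0)
    residual = W - W_sparse
    # best rank-r approximation of the pruning residual
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    B = U[:, :rank] * S[:rank]   # (d, r)
    A = Vt[:rank, :]             # (r, k)
    return W_sparse, B, A
```

The low-rank term strictly reduces the reconstruction error relative to pruning alone, which is the intuition behind the abstract's per-entry MSE reduction factor.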

AAAI Conference 2026 Conference Paper

Shared & Domain Self-Adaptive Experts with Frequency-Aware Discrimination for Continual Test-Time Adaptation

  • Jianchao Zhao
  • Chenhao Ding
  • Songlin Dong
  • Jiangyang Li
  • Qiang Wang
  • Yuhang He
  • Yihong Gong

This paper focuses on the Continual Test-Time Adaptation (CTTA) task, aiming to enable an agent to continuously adapt to evolving target domains while retaining previously acquired domain knowledge for effective reuse when those domains reappear. Existing shared-parameter paradigms struggle to balance adaptation and forgetting, leading to decreased efficiency and stability. To address this, we propose a frequency-aware shared and self-adaptive expert framework, consisting of two key components: (i) a dual-branch expert architecture that extracts general features and dynamically models domain-specific representations, effectively reducing cross-domain interference and repetitive learning cost; and (ii) an online Frequency-aware Domain Discriminator (FDD), which leverages the robustness of low-frequency image signals for online domain shift detection, guiding dynamic allocation of expert resources for more stable and realistic adaptation. Additionally, we introduce a Continual Repeated Shifts (CRS) benchmark to simulate periodic domain changes for more realistic evaluation. Experimental results show that our method consistently outperforms existing approaches on both classification and segmentation CTTA tasks under standard and CRS settings, with ablations and visualizations confirming its effectiveness and robustness.

YNIMG Journal 2026 Journal Article

When More Control Means Better Choices: Cognitive Control Networks Drive Expected-Value Maximization Under Uncertainty

  • Xia Wu
  • Yuning Geng
  • Yan Chen
  • Shuoxian Zhang
  • Tianhao Liu
  • Shuaipeng You
  • Fang Liu
  • Yunpeng Jiang

Human decision-making under outcome uncertainty often deviates from rational expected-value maximization, frequently falling back on the suboptimal probability matching heuristic. The neurocomputational mechanisms determining individual differences in overcoming this heuristic remain elusive. Here, we investigated how cognitive control capacity (CCC) modulates decision-making under varying levels of outcome uncertainty. Participants with high and low CCC performed a predictive inference task during functional magnetic resonance imaging. Behaviorally, high CCC individuals consistently exhibited a significantly higher proportion of maximizing responses (PMR) across all uncertainty levels. Using hierarchical drift-diffusion modeling, we demonstrated that this optimal performance was driven by more cautious decision thresholds, indicating greater deliberation to resist intuitive shortcuts. At the neural level, while localized activations in the cingulo-opercular network (CON) and frontoparietal network (FPN) reflected the general cognitive burden of escalating uncertainty, functional connectivity analyses revealed a specific neural pathway supporting optimal choices. Crucially, the connectivity within the CON (anterior insula to middle frontal gyrus) acted as a specific neural amplifier, which was absolutely necessary for translating high cognitive capacity into optimal expected-value maximization. The FPN, while tracking uncertainty, did not modulate this capacity-performance link. Together, these findings provide an integrated neurocomputational framework demonstrating how specific cognitive control networks mobilize resources to overcome heuristic tendencies and achieve optimal decisions under uncertainty.
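
The drift-diffusion model used in the analysis has a simple generative form. Below is a minimal simulation (ours, with arbitrary parameter values, not the hierarchical fit used in the paper) showing why a more cautious, i.e. higher, decision threshold yields slower but more drift-consistent (maximizing) choices:

```python
import numpy as np

def simulate_ddm(drift, threshold, n_trials=500, dt=0.005, noise=1.0, seed=0):
    """Minimal drift-diffusion simulation: evidence starts at 0 and
    accumulates with rate `drift` plus Gaussian noise until it hits
    +threshold (upper choice) or -threshold (lower choice).
    Returns (fraction of upper-boundary choices, mean decision time)."""
    rng = np.random.default_rng(seed)
    choices, rts = [], []
    for _ in range(n_trials):
        x, t = 0.0, 0.0
        while abs(x) < threshold:
            x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
            t += dt
        choices.append(x >= threshold)
        rts.append(t)
    return float(np.mean(choices)), float(np.mean(rts))
```

With a positive drift, raising the threshold increases the proportion of drift-consistent responses at the cost of longer decision times, the trade-off the abstract attributes to high-CCC participants.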

IROS Conference 2025 Conference Paper

Applicability Analysis for Optical Cooperative Localization

  • Yixian Li
  • Qiang Wang
  • Jiaxing Wu
  • Wuhong Zhao
  • Shengrong Hu
  • Zhonghu Hao

For optical cooperative localization, which employs optical beacons with prior features as cooperative targets, a fundamental prerequisite is to ensure that the beacons are always captured by the vision sensors during the entire localization process. In other words, there is an applicability issue of optical cooperative localization with respect to the relative range between beacons and vision sensors, whereas a corresponding analysis method has so far been missing. In this work, we propose a general applicability analysis method for optical cooperative localization to fill this gap. We translate this problem into constructing a multi-constraint model incorporating geometric and radiometric constraints to describe the relationship between optical sensor parameters and relative range or depth. For parameterized beacons and vision sensors, the geometric constraint is related to the imaging quantities and the radiometric constraint is determined by the radiation properties. Numerical evaluations are performed based on the range of parameters in practice, and real-world experiments are conducted to validate the proposed applicability analysis. The results demonstrate the effectiveness of the proposed method and are instructive for real-world deployment of optical cooperative localization.

NeurIPS Conference 2025 Conference Paper

Consistent Supervised-Unsupervised Alignment for Generalized Category Discovery

  • Jizhou Han
  • Shaokun Wang
  • Yuhang He
  • Chenhao Ding
  • Qiang Wang
  • Xinyuan Gao
  • Songlin Dong
  • Yihong Gong

Generalized Category Discovery (GCD) focuses on classifying known categories while simultaneously discovering novel categories from unlabeled data. However, previous GCD methods face challenges due to inconsistent optimization objectives and category confusion. This leads to feature overlap and ultimately hinders performance on novel categories. To address these issues, we propose the Neural Collapse-inspired Generalized Category Discovery (NC-GCD) framework. By pre-assigning and fixing Equiangular Tight Frame (ETF) prototypes, our method ensures an optimal geometric structure and a consistent optimization objective for both known and novel categories. We introduce a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and enhances category separability. Additionally, a Semantic Consistency Matcher (SCM) is designed to maintain stable and consistent label assignments across clustering iterations. Our method significantly enhances novel category accuracy, demonstrating its effectiveness.
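
The stable-label-assignment problem the SCM addresses can be illustrated with a toy relabeling step. This is our simplification using brute-force permutation matching over small K; the paper's matcher operates on semantics and is more sophisticated:

```python
import numpy as np
from itertools import permutations

def match_cluster_labels(prev_labels, new_labels, num_clusters):
    """Relabel `new_labels` by the permutation of cluster indices that
    maximizes agreement with `prev_labels`, so cluster identities stay
    consistent across clustering iterations. Brute force, small K only."""
    overlap = np.zeros((num_clusters, num_clusters), dtype=int)
    for p, n in zip(prev_labels, new_labels):
        overlap[n, p] += 1
    best = max(permutations(range(num_clusters)),
               key=lambda perm: sum(overlap[i, perm[i]] for i in range(num_clusters)))
    mapping = {i: b for i, b in enumerate(best)}
    return np.array([mapping[n] for n in new_labels])
```

Without such a step, each re-clustering can permute cluster indices arbitrarily, which destabilizes any loss that treats cluster indices as class labels.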

AAAI Conference 2025 Conference Paper

DualCP: Rehearsal-Free Domain-Incremental Learning via Dual-Level Concept Prototype

  • Qiang Wang
  • Yuhang He
  • Songlin Dong
  • Xiang Song
  • Jizhou Han
  • Haoyu Luo
  • Yihong Gong

Domain-Incremental Learning (DIL) enables vision models to adapt to changing conditions in real-world environments while maintaining the knowledge acquired from previous domains. Given privacy concerns and training time, Rehearsal-Free DIL (RFDIL) is more practical. Inspired by the incremental cognitive process of the human brain, we design Dual-level Concept Prototypes (DualCP) for each class to address the conflict between learning new knowledge and retaining old knowledge in RFDIL. To construct DualCP, we propose a Concept Prototype Generator (CPG) that generates both coarse-grained and fine-grained prototypes for each class. Additionally, we introduce a Coarse-to-Fine calibrator (C2F) to align image features with DualCP. Finally, we propose a Dual Dot-Regression (DDR) loss function to optimize our C2F module. Extensive experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of our method.

AAAI Conference 2025 Conference Paper

Flexible Sharpness-Aware Personalized Federated Learning

  • Xinda Xing
  • Qiugang Zhan
  • Xiurui Xie
  • Yuning Yang
  • Qiang Wang
  • Guisong Liu

Personalized federated learning (PFL) is a new paradigm to address the statistical heterogeneity problem in federated learning. Most existing PFL methods focus on leveraging global and local information such as model interpolation or parameter decoupling. However, these methods often overlook the generalization potential during local client learning. From a local optimization perspective, we propose a simple and general PFL method, Federated learning with Flexible Sharpness-Aware Minimization (FedFSA). Specifically, we emphasize the importance of applying a larger perturbation to critical layers of the local model when using the Sharpness-Aware Minimization (SAM) optimizer. Then, we design a metric, perturbation sensitivity, to estimate the layer-wise sharpness of each local model. Based on this metric, FedFSA can flexibly select the layers with the highest sharpness to employ larger perturbation. Extensive experiments are conducted on four datasets with two types of statistical heterogeneity for image classification. The results show that FedFSA outperforms seven state-of-the-art baselines by up to 8.26% in test accuracy. Besides, FedFSA can be applied to different model architectures and easily integrated into other federated learning methods, achieving a 4.45% improvement.
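
The core mechanism, selecting the sharpest layers and giving only them the larger SAM perturbation, can be sketched as follows. The function names and radius values are our illustrative inventions, not FedFSA's API:

```python
import numpy as np

def sam_perturbation(grad, rho):
    """Standard SAM ascent direction: epsilon = rho * g / ||g||."""
    return rho * grad / (np.linalg.norm(grad) + 1e-12)

def assign_sam_radii(layer_sharpness, top_k, rho_small=0.05, rho_large=0.5):
    """Give the larger perturbation radius only to the top_k layers with the
    highest sharpness estimate (a simplification of the flexible,
    sensitivity-driven layer selection described in the abstract)."""
    order = np.argsort(layer_sharpness)[::-1]       # sharpest first
    radii = np.full(len(layer_sharpness), rho_small)
    radii[order[:top_k]] = rho_large
    return radii
```

Each layer's gradient would then be perturbed with its assigned radius via `sam_perturbation` before the second forward-backward pass of SAM.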

YNIMG Journal 2025 Journal Article

Frequency- and state-dependent dynamics of EEG microstates during propofol anesthesia

  • Yun Zhang
  • Haidong Wang
  • Fei Yan
  • Dawei Song
  • Qiang Wang
  • Yubo Wang
  • Liyu Huang

Electroencephalography microstate analysis has emerged as a powerful tool for investigating brain dynamics during anesthesia-induced unconsciousness. However, existing studies typically analyze EEG signals across broad frequency bands, leaving the frequency-specific temporal characteristics of microstates poorly understood. In this study, we investigated frequency-specific EEG microstate features in the delta (0.5-4 Hz) and EEG-without-delta (4-30 Hz) frequency bands during propofol anesthesia. Sixty-channel EEG recordings were collected from 18 healthy male participants during wakefulness and propofol-induced unconsciousness. Microstate analysis was conducted separately for delta and EEG-without-delta frequency bands and microstate features were compared across frequency bands and conscious states. Our results revealed eight consistent microstate classes (MS1-MS8) with high topographic similarity across frequency bands, while global explained variance (GEV), mean duration (MeanDur), occurrence (Occ), and coverage (Cov) exhibited significant frequency- and state-dependent variations during propofol anesthesia. In the delta band, propofol-induced unconsciousness was associated with significantly longer MeanDur for microstate classes of MS4, MS5, and MS6 (p < 0.05). In the EEG-without-delta band, GEV, Cov, and Occ significantly increased for MS1 and MS3 (p < 0.01) and decreased for MS2 and MS4 (p < 0.05) during unconsciousness. Notably, microstate features in the EEG-without-delta band showed better sensitivity for discriminating conscious states, achieving a classification accuracy of 0.944. These findings emphasize the importance of frequency-specific microstate analysis in unraveling the neural dynamics of anesthesia-induced unconsciousness and highlight its potential clinical applications for improving anesthesia depth monitoring.

YNIMG Journal 2025 Journal Article

Loss aversion and evidence accumulation in short-video addiction: A behavioral and neuroimaging investigation

  • Chang Liu
  • Jinlian Wang
  • Hanbing Li
  • Qianyi Shangguan
  • Weipeng Jin
  • Wenwei Zhu
  • Pinchun Wang
  • Xuyi Chen

Excessive use of short-video platforms not only impairs decision-making processes but also predisposes individuals to addictive behaviors. This study investigated the relationship between short-video addiction (SVA) symptoms and loss aversion (LA), delving into the underlying computational and neural mechanisms using the drift diffusion model (DDM) and the inter-subject representational similarity analysis (IS-RSA). Behavioral analyses revealed a significant negative correlation between SVA symptoms and the LA coefficient (lnλ). Additionally, the DDM-based drift rate (v) was found to mediate this relationship. Neuroimaging analyses further indicated that SVA symptoms were negatively associated with gain-related activity in the right precuneus, while positively correlating with loss-related activity in the right cerebellum and left postcentral gyrus. Notably, precuneus activation during gain processing mediated the relationship between SVA symptoms and both lnλ and drift rate. IS-RSA revealed that inter-subject variations in SVA symptoms were significantly associated with distinct activation patterns related to gain processing in the frontoparietal network (e.g., frontal pole, inferior frontal gyrus, and supramarginal gyrus) and motor network (e.g., precentral), as well as loss-related activation patterns in the motor networks (e.g., postcentral and pre-supplementary motor area). Similar patterns emerged when examining simultaneous gain and loss-related activation patterns. Mediation analyses further demonstrated that functional activation patterns in the motor network mediated the relationships between inter-subject variations in SVA symptoms and both loss-aversion and psychological processing patterns (e.g., decision threshold, drift rate, and non-decision time). 
These findings provide novel insights into the cognitive and neural mechanisms underlying the influence of SVA symptoms on loss aversion, and suggest the critical roles of evidence accumulation speed and specific brain activation patterns, particularly within the cognitive control and motor networks, in shaping decision-making biases associated with addiction.

YNIMG Journal 2025 Journal Article

Neural, psychological, and transcriptomic predictors of short video addiction: A multi-site longitudinal study of fear of missing out and negative affect

  • Chang Liu
  • Hanbing Li
  • Qianyi Shangguan
  • Yuyang Zeng
  • Pinchun Wang
  • Zong Zhang
  • Weipeng Jin
  • Qiang Wang

Short video addiction symptoms (SVAS) have become increasingly prevalent, yet their longitudinal neurobiological basis remains unclear. In a multi-site longitudinal study (n = 280), we examined whether baseline brain features and dispositional traits, negative affect (NA) and fear of missing out (FoMO), predict future SVAS. Participants completed self-report measures and MRI scans at baseline, with follow-up assessments conducted after 5 months to 5 years. Behaviorally, both baseline and follow-up NA and FoMO significantly predicted SVAS. Structurally, baseline gray matter volume (GMV) in the frontal-parietal network (FPN), default mode network (DMN), and hippocampal morphological patterns predicted follow-up SVAS severity. Functionally, baseline regional homogeneity (ReHo) in the FPN, DMN, ventral attention network (VAN), and sensorimotor network (SMN) also predicted SVAS. Parallel multiple mediation analyses revealed a dissociable neural architecture: hippocampal morphological patterns predicted SVAS via the unique indirect effect of follow-up FoMO, whereas DMN functional profiles (e.g., ReHo) predicted SVAS via follow-up NA. Notably, the VAN served as an integrative hub, exerting its influence via the unique indirect effects of both follow-up NA and FoMO. Transcriptomic analyses linked SVAS-related ReHo to two gene sets, namely positively correlated (SVAS-ReHo⁺) and negatively correlated (SVAS-ReHo⁻) genes. SVAS-ReHo⁺ genes were enriched in RNA processing and vascular signaling and expressed in endothelial cells; SVAS-ReHo⁻ genes were enriched in synaptic transmission and expressed in excitatory and inhibitory neurons. Spatial-temporal patterns showed SVAS-ReHo⁺ genes were expressed in subcortical regions across adolescence, whereas SVAS-ReHo⁻ genes were prominent in cortical-limbic areas during postnatal development. Functional decoding linked SVAS-ReHo⁺ genes to sensorimotor function and metabolism, and SVAS-ReHo⁻ genes to emotion and psychiatric risk.
Together, these findings highlight dissociable structural, functional, and molecular pathways through which FoMO and NA contribute to short video addiction development.

YNIMG Journal 2025 Journal Article

Neuroanatomical and functional substrates of the short video addiction and its association with brain transcriptomic and cellular architecture

  • Yuanyuan Gao
  • Ying Hu
  • Jinlian Wang
  • Chang Liu
  • Hohjin Im
  • Weipeng Jin
  • Wenwei Zhu
  • Wei Ge

Short video addiction (SVA) has emerged as a growing behavioral and social issue, driven by the widespread use of digital platforms that provide highly engaging, personalized, and brief video content. We investigated the neuroanatomical and functional substrates of SVA symptoms, alongside brain transcriptomic and cellular characteristics, using Inter-Subject Representational Similarity Analysis (IS-RSA) and transcriptomic approaches. Behaviorally, we found that dispositional envy was associated with SVA. Structurally, SVA was positively correlated with increased morphological volumes in the orbitofrontal cortex (OFC) and bilateral cerebellum. Functionally, the dorsolateral prefrontal cortex (DLPFC), posterior cingulate cortex (PCC), cerebellum, and temporal pole (TP) exhibited heightened spontaneous activity, which was positively correlated with SVA severity. Transcriptomic and cellular analyses also showed specific genes linked to gray matter volume (GMV) associated with SVA, with predominant expression in excitatory and inhibitory neurons. These genes showed distinct spatiotemporal expression patterns in the cerebellum during adolescence. This study offers a comprehensive framework integrating structural, functional, and neurochemical evidence to highlight the neural-transcriptomic underpinnings of SVA symptoms in a non-clinical population.

AAAI Conference 2025 Conference Paper

OAMaskFlow: Occlusion-Aware Motion Mask for Scene Flow

  • Xiongfeng Peng
  • Zhihua Liu
  • Weiming Li
  • Yamin Mao
  • Qiang Wang

Scene flow estimation methods have made significant progress by estimating pixel-wise 3D motion while implicitly learning a motion embedding within an end-to-end differentiable optimization framework. However, the motion embedding learned implicitly is insufficient for grouping pixels into rigid objects in challenging regions, such as occlusions and regions with inconsistent multi-view geometric properties. To address this issue, we propose a novel method for estimating scene flow called OAMaskFlow, which has three novelties. Firstly, we propose the concept of an occlusion-aware motion (OAM) mask and generate the ground truth annotation through photometric and geometric consistency. Secondly, we propose to supervise the motion embedding with the OAM mask to learn an informative and reliable motion representation of the scene. Finally, a 3D motion propagation module is proposed to propagate high-quality 3D motion from reliable pixels to the challenging occluded regions. Experiments show that our proposed OAMaskFlow reduces the EPE3D metric by 21.0% on the FlyingThings3D dataset and the SF-all metric by 24.3% on the KITTI scene flow benchmark compared to the baseline method RAFT-3D. Furthermore, we apply our proposed OAM mask in simultaneous localization and mapping (SLAM) to improve a state-of-the-art method, DROID-SLAM. In comparison, the ATE metric decreases by 65.7% and 58.3% on the TartanAir monocular and stereo datasets respectively.

AAAI Conference 2025 Conference Paper

ParZC: Parametric Zero-Cost Proxies for Efficient NAS

  • Peijie Dong
  • Lujun Li
  • Zhenheng Tang
  • Xiang Liu
  • Zimian Wei
  • Qiang Wang
  • Xiaowen Chu

Recent advancements in Zero-shot Neural Architecture Search (NAS) highlight the ability of zero-cost proxies to identify superior architectures. However, we identify a critical issue with current zero-cost proxies: they aggregate node-wise zero-cost statistics without considering that not all nodes in a neural network equally impact performance estimation. Our observations reveal that node-wise zero-cost statistics vary significantly in their contributions to performance, with each node exhibiting a degree of uncertainty. Based on this insight, we introduce the Parametric Zero-Cost Proxies (ParZC) framework to enhance the adaptability of zero-cost proxies through parameterization. To address node indiscrimination, we propose a Mixer Architecture with Bayesian Network (MABN) to explore the node-wise zero-cost statistics and estimate node-specific uncertainty. Moreover, we propose DiffKendall as a loss function to improve ranking consistency. Comprehensive experiments on NAS-Bench-101, NAS-Bench-201, and NDS demonstrate the superiority of our proposed ParZC compared to existing zero-shot NAS methods. Additionally, we demonstrate the versatility and adaptability of ParZC on the Vision Transformer search space.
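
A differentiable Kendall-style ranking objective, the general idea behind a name like DiffKendall, typically smooths the sign of pairwise differences; the exact formulation in the paper may differ from this sketch of ours:

```python
import numpy as np

def soft_kendall(scores, targets, temperature=1.0):
    """Smooth surrogate for Kendall's tau: replace sign(x) with tanh(x/T) so
    agreement between the proxy ranking (`scores`) and the true ranking
    (`targets`) becomes a differentiable quantity in the scores."""
    s = np.asarray(scores, float)
    t = np.asarray(targets, float)
    i, j = np.triu_indices(len(s), k=1)
    agree = np.tanh((s[i] - s[j]) / temperature) * np.tanh((t[i] - t[j]) / temperature)
    return agree.mean()  # in (-1, 1); near 1 means fully concordant rankings
```

As the temperature shrinks, the surrogate approaches the usual (non-differentiable) Kendall rank correlation.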

IROS Conference 2025 Conference Paper

SAFormer: Spatially Adaptive Transformer for Efficient and Multi-Resolution Occupancy Prediction

  • Song Tang
  • Qiang Wang
  • Xiaowen Chu 0001

Accurate and efficient 3D scene understanding from multi-view images remains a fundamental challenge in autonomous driving. Existing methods often struggle with high-dimensional features, leading to excessive computational costs and memory usage. In this paper, we present SAFormer, a novel transformer-based framework for efficient spatially adaptive occupancy prediction. SAFormer incorporates two key techniques to reduce resource consumption: Octree-based Multi-resolution Feature (OMRF) Learning and Spatial-Adaptive Progressive Query (SAPQ). First, OMRF introduces an Octree-based hierarchical structure to compress multi-resolution 3D feature volumes. Second, SAPQ facilitates efficient information flow across different scales while effectively addressing scene sparsity. It employs a region-aware query mechanism that intelligently allocates computational resources, processing safety-critical regions at high resolution while handling background elements at lower resolutions. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance while significantly reducing inference latency (up to 3×) and memory cost (up to 2.9×). Additional experiments on SSCBench-KITTI-360 further validate our approach’s generalizability. Our approach excels in managing scene sparsity and recognizing small, safety-critical objects, highlighting its potential for practical applications in autonomous driving.

AAAI Conference 2024 Conference Paper

CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning

  • Qingsong Yan
  • Qiang Wang
  • Kaiyong Zhao
  • Jie Chen
  • Bo Li
  • Xiaowen Chu
  • Fei Deng

Neural Radiance Fields have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines to provide extrinsic and intrinsic camera parameters, such as COLMAP. Recent works, like NeRFmm, BARF, and L2G-NeRF, directly treat camera parameters as learnable and estimate them through differential volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to tackle the rotation scenario in practice. To overcome this limitation, we propose a novel camera parameter free neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters inspired by incremental structure from motion. Given a sequence of images, CF-NeRF estimates camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset, NeRFBuster, which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to rotation and achieves state-of-the-art results without providing prior information and constraints.

NeurIPS Conference 2024 Conference Paper

Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models

  • Lujun Li
  • Peijie Dong
  • Zhenheng Tang
  • Xiang Liu
  • Qiang Wang
  • Wenhan Luo
  • Wei Xue
  • Qifeng Liu

In this paper, we present DSA, the first automated framework for discovering sparsity allocation schemes for layer-wise pruning in Large Language Models (LLMs). LLMs have become increasingly powerful, but their large parameter counts make them computationally expensive. Existing pruning methods for compressing LLMs primarily focus on evaluating redundancies and removing element-wise weights. However, these methods fail to allocate adaptive layer-wise sparsities, leading to performance degradation in challenging tasks. We observe that per-layer importance statistics can serve as allocation indications, but their effectiveness depends on the allocation function between layers. To address this issue, we develop an expression discovery framework to explore potential allocation strategies. Our allocation functions involve two steps: reducing element-wise metrics to per-layer importance scores, and mapping layer importance scores to sparsity ratios. To search for the most effective allocation function, we construct a search space consisting of pre-process, reduction, transform, and post-process operations. We leverage an evolutionary algorithm to perform crossover and mutation on superior candidates within the population, guided by performance evaluation. Finally, we seamlessly integrate our discovered functions into various uniform methods, resulting in significant performance improvements. We conduct extensive experiments on multiple challenging tasks such as arithmetic, knowledge reasoning, and multimodal benchmarks spanning GSM8K, MMLU, SQA, and VQA, demonstrating that our DSA method achieves significant performance gains on the LLaMA-1|2|3, Mistral, and OPT models. Notably, the LLaMA-1|2|3 model pruned by our DSA reaches a 4.73%|6.18%|10.65% gain over the state-of-the-art techniques (e.g., Wanda and SparseGPT).
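
The two-step allocation described in the abstract (reduce element-wise metrics to per-layer importance scores, then map scores to sparsity ratios) can be illustrated with a toy sketch. The mean reduction, the exponential transform, and the `allocate_sparsity` helper below are illustrative assumptions for exposition, not the functions DSA's evolutionary search actually discovers.

```python
import numpy as np

def allocate_sparsity(layer_metrics, target_sparsity=0.5, temperature=1.0):
    """Toy two-step allocation: reduce each layer's element-wise metric to a
    scalar importance score, then map scores to per-layer sparsity ratios
    whose mean equals the global target (less important layers pruned more)."""
    # Step 1: reduction -- element-wise metrics to per-layer importance.
    scores = np.array([m.mean() for m in layer_metrics])
    # Step 2: transform -- low importance receives a higher pruning weight.
    weights = np.exp(-scores / temperature)
    weights = weights / weights.sum()
    # Post-process: rescale so the mean sparsity matches the global target.
    ratios = weights * target_sparsity * len(weights)
    return np.clip(ratios, 0.0, 0.95)

rng = np.random.default_rng(0)
# Three toy layers whose weight magnitudes grow from layer 0 to layer 2.
metrics = [np.abs(rng.normal(size=(64, 64))) * s for s in (0.5, 1.0, 2.0)]
ratios = allocate_sparsity(metrics)
print(ratios)  # highest sparsity for the low-importance first layer
```

Rescaling the normalized weights by `target_sparsity * len(weights)` keeps the average sparsity at the global budget while shifting pruning pressure toward low-importance layers.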

AAAI Conference 2024 Conference Paper

DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

  • Xiaoxuan Yu
  • Hao Wang
  • Weiming Li
  • Qiang Wang
  • Soonyong Cho
  • Younghun Sung

Point scene understanding is a challenging task that processes real-world scene point clouds, aiming to segment each object, estimate its pose, and reconstruct its mesh simultaneously. The recent state-of-the-art method first segments each object and then processes each independently with multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries involving their relationship. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make full use of the supervision from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.

ICRA Conference 2024 Conference Paper

DVI-SLAM: A Dual Visual Inertial SLAM Network

  • Xiongfeng Peng
  • Zhihua Liu
  • Weiming Li
  • Ping Tan
  • SoonYong Cho
  • Qiang Wang

Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information and better integrate it with an inertial measurement unit (IMU) in visual SLAM still has potential research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both the photometric factor and the re-projection factor into the end-to-end differentiable structure through a multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors, and it can be further extended to include the IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error for monocular and stereo configurations on the EuRoC dataset is reduced by 45.3% and 36.2% respectively.

YNIMG Journal 2024 Journal Article

Happy people are always similar: The evidence from brain morphological and functional inter-subject correlations

  • Zixi Li
  • Keying Jiang
  • Ye Zhu
  • Hanxiao Du
  • Hohjin Im
  • Yingying Zhu
  • Lei Feng
  • Wenwei Zhu

A fundamental question in the study of happiness is whether there is neural evidence to support a well-known hypothesis that happy people are always similar while unfortunate people have their own misfortunes. To investigate this, we employed several happiness-related questionnaires to identify potential components of happiness, and further investigated and confirmed their associations with personality, mood, aggressive behaviors, and amygdala reactivity to fearful faces within a substantial sample size of college students (n = 570). Additionally, we examined the functional and morphological similarities and differences among happy individuals using the inter-subject representational similarity analysis (IS-RSA). IS-RSA emphasizes the geometric properties in a high-dimensional space constructed by brain or behavioral patterns and focuses on individual subjects. Our behavioral findings unveiled two factors of happiness: individual and social, both of which mediated the effect of personality traits on individual aggression. Subsequently, mood mediated the impact of happiness on aggressive behaviors across two subgroup splits. Functional imaging data revealed that individuals with higher levels of happiness exhibited reduced amygdala reactivity to fearful faces, as evidenced by a conventional face-matching task (n = 104). Moreover, IS-RSA demonstrated that these participants manifested similar neural activation patterns when processing fearful faces within the visual pathway, but not within the emotional network (e.g., amygdala). Morphological observations (n = 425) indicated that individuals with similar high happiness levels exhibited comparable gray matter volume patterns within several networks, including the default mode network, fronto-parietal network, visual network, and attention network. Collectively, these findings offer early neural evidence supporting the proposition that happy individuals may share common neural characteristics.

NeurIPS Conference 2024 Conference Paper

Is Your HD Map Constructor Reliable under Sensor Corruptions?

  • Xiaoshuai Hao
  • Mengchuan Wei
  • Yifan Yang
  • Haimei Zhao
  • Hui Zhang
  • Yi Zhou
  • Qiang Wang
  • Weiming Li

Driving systems often rely on high-definition (HD) maps for precise environmental information, which is crucial for planning and navigation. While current HD map constructors perform well under ideal conditions, their resilience to real-world challenges, e.g., adverse weather and sensor failures, is not well understood, raising safety concerns. This work introduces MapBench, the first comprehensive benchmark designed to evaluate the robustness of HD map construction methods against various sensor corruptions. Our benchmark encompasses a total of 29 types of corruptions that occur from cameras and LiDAR sensors. Extensive evaluations across 31 HD map constructors reveal significant performance degradation of existing methods under adverse weather conditions and sensor failures, underscoring critical safety concerns. We identify effective strategies for enhancing robustness, including innovative approaches that leverage multi-modal fusion, advanced data augmentation, and architectural techniques. These insights provide a pathway for developing more reliable HD map construction methods, which are essential for the advancement of autonomous driving technology. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.

AAAI Conference 2024 Conference Paper

Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation

  • Yasi Wang
  • Hong Liu
  • Chao Zhang
  • Lu Xu
  • Qiang Wang

Homography estimation is a fundamental problem in computer vision. Previous works mainly focus on estimating either a single homography, or multiple homographies based on mesh grid division of the image. In practical scenarios, single homography is inadequate and often leads to a compromised result for multiple planes; while mesh grid multi-homography damages the plane distribution of the scene, and does not fully address the restriction to use homography. In this work, we propose a novel semantics guided multi-homography estimation framework, Mask-Homo, to provide an explicit solution to the multi-plane depth disparity problem. First, a pseudo plane mask generation module is designed to obtain multiple correlated regions that follow the plane distribution of the scene. Then, multiple local homography transformations, each of which aligns a correlated region precisely, are predicted and corresponding warped images are fused to obtain the final result. Furthermore, a new metric, Mask-PSNR, is proposed for more comprehensive evaluation of alignment. Extensive experiments are conducted to verify the effectiveness of the proposed method. Our code is available at https://github.com/SAITPublic/MaskHomo.

YNIMG Journal 2024 Journal Article

Microstructural and functional substrates underlying dispositional greed and its link with trait but not state impulsivity

  • Keying Jiang
  • Jinlian Wang
  • Yuanyuan Gao
  • Xiang Li
  • Hohjin Im
  • Yingying Zhu
  • Hanxiao Du
  • Lei Feng

The interplay between personality traits and impulsivity has long been a central theme in psychology and psychiatry. However, the potential association between Greed Personality Traits (GPT) and impulsivity, encompassing both trait and state impulsivity and future time perspective, remains largely unexplored. To address these issues, we employed questionnaires and an inter-temporal choice task to estimate corresponding trait/state impulsivity and collected multi-modal neuroimaging data (resting-state functional imaging: n = 430; diffusion-weighted imaging: n = 426; task-related functional imaging: n = 53) to investigate the underlying microstructural and functional substrates. Behavioral analyses revealed that GPT mediated the association between time perspective (e.g., present fatalism) and trait impulsivity (e.g., motor impulsivity). Functional imaging analyses further identified that brain activation strengths and patterns related to delay length, particularly in the dorsomedial prefrontal cortex, superior parietal lobule, and cerebellum, were associated with GPT. Moreover, individuals with similar levels of greed exhibited analogous spontaneous brain activity patterns, predominantly in the Default Mode Network (DMN), Fronto-Parietal Network (FPN), and Visual Network (VIS). Diffusion imaging analysis observed specific microstructural characteristics in the spinocerebellar/pontocerebellar fasciculus, internal/external capsule, and corona radiata that support the formation of GPT. Furthermore, the corresponding neural activation pattern, spontaneous neural activity pattern, and analogous functional couplings among the aforementioned brain regions mediated the relationships between time perspective and GPT and between GPT and motor impulsivity. These findings provide novel insights into the possible pathway such as time perspective → dispositional greed → impulsivity and uncover their underlying microstructural and functional substrates.

AAAI Conference 2024 Conference Paper

MmAP: Multi-Modal Alignment Prompt for Cross-Domain Multi-Task Learning

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Ke Yan
  • Shouhong Ding

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns text and visual modalities during the fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assigns a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign a task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while only utilizing approximately 0.09% of trainable parameters.

AAAI Conference 2024 Conference Paper

VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Zhiwen Lin
  • Ke Yan

Large-scale pre-trained models have achieved remarkable success in various computer vision tasks. A standard approach to leverage these models is to fine-tune all model parameters for downstream tasks, which poses challenges in terms of computational and storage costs. Recently, inspired by Natural Language Processing (NLP), parameter-efficient transfer learning has been successfully applied to vision tasks. However, most existing techniques primarily focus on single-task adaptation, and despite limited research on multi-task adaptation, these methods often exhibit suboptimal training/inference efficiency. In this paper, we first propose a once-for-all Vision Multi-Task Adapter (VMT-Adapter), which achieves approximately O(1) training and inference efficiency w.r.t. task number. Concretely, VMT-Adapter shares the knowledge from multiple tasks to enhance cross-task interaction while preserving task-specific knowledge via independent knowledge extraction modules. Notably, since task-specific modules require few parameters, VMT-Adapter can handle an arbitrary number of tasks with a negligible increase of trainable parameters. We also propose VMT-Adapter-Lite, which further reduces the trainable parameters by learning shared parameters between down- and up-projections. Extensive experiments on four dense scene understanding tasks demonstrate the superiority of VMT-Adapter(-Lite), achieving a 3.96% (1.34%) relative improvement compared to single-task full fine-tuning, while utilizing merely ~1% (0.36%) trainable parameters of the pre-trained model.

NeurIPS Conference 2023 Conference Paper

BadTrack: A Poison-Only Backdoor Attack on Visual Object Tracking

  • Bin Huang
  • Jiaqian Yu
  • Yiwei Chen
  • Siyang Pan
  • Qiang Wang
  • Zhi Wang

Visual object tracking (VOT) is one of the most fundamental tasks in the computer vision community. State-of-the-art VOT trackers extract positive and negative examples that are used to guide the tracker to distinguish the object from the background. In this paper, we show that this characteristic can be exploited to introduce new threats and hence propose a simple yet effective poison-only backdoor attack. To be specific, we poison a small part of the training data by attaching a predefined trigger pattern to the background region of each video frame, so that the trigger appears almost exclusively in the extracted negative examples. To the best of our knowledge, this is the first work that reveals the threat of poison-only backdoor attacks on VOT trackers. We experimentally show that our backdoor attack can significantly degrade the performance of both two-stream Siamese and one-stream Transformer trackers on the poisoned data while achieving performance comparable to the benign trackers on the clean data.
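
The poisoning step described above, attaching a predefined trigger to the background region of a frame, can be sketched as follows. The corner-placement rule and the `poison_frame` helper are hypothetical simplifications for illustration, not the paper's exact procedure.

```python
import numpy as np

def poison_frame(frame, bbox, trigger):
    """Stamp the trigger into a background corner, outside the object's
    bounding box, so it is absorbed into the negative (background) examples."""
    h, w = trigger.shape[:2]
    x0, y0, x1, y1 = bbox
    # Hypothetical placement rule: use the top-left corner unless the
    # object's box overlaps it, then fall back to the bottom-right corner.
    if x0 < w and y0 < h:
        frame[-h:, -w:] = trigger
    else:
        frame[:h, :w] = trigger
    return frame

# Poison one 64x64 RGB frame whose object occupies the center region.
frame = np.zeros((64, 64, 3))
trigger = np.ones((8, 8, 3))                 # predefined trigger pattern
poisoned = poison_frame(frame, (20, 20, 40, 40), trigger)
print(poisoned[:8, :8].sum(), poisoned[20:40, 20:40].sum())  # 192.0 0.0
```

In the attack only a small fraction of training frames would be poisoned this way, leaving the object region itself untouched.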

YNIMG Journal 2023 Journal Article

EEG spectral slope: A reliable indicator for continuous evaluation of consciousness levels during propofol anesthesia

  • Yun Zhang
  • Yubo Wang
  • Huanhuan Cheng
  • Fei Yan
  • Dingning Li
  • Dawei Song
  • Qiang Wang
  • Liyu Huang

The level of consciousness undergoes continuous alterations during anesthesia. Prior to the onset of propofol-induced complete unconsciousness, degraded levels of behavioral responsiveness can be observed. However, a reliable index to monitor altered consciousness levels during anesthesia has not been sufficiently investigated. In this study, we obtained 60-channel EEG data from 24 healthy participants during an ultra-slow propofol infusion protocol starting with an initial concentration of 1 μg/ml and a stepwise increase of 0.2 μg/ml in concentration. Consecutive auditory stimuli were delivered every 5 to 6 s, and the response time to the stimuli was used to assess the responsiveness levels. We calculated the spectral slope in a time-resolved manner by extracting 5-second EEG segments at each auditory stimulus and estimated their correlation with the corresponding response time. Our results demonstrated that during slow propofol infusion, the response time to external stimuli increased, while the EEG spectral slope, fitted at 15-45 Hz, became steeper, and a significant negative correlation was observed between them. Moreover, the spectral slope further steepened at deeper anesthetic levels and became flatter during anesthesia recovery. We verified these findings using an external dataset. Additionally, we found that the spectral slope of frontal electrodes over the prefrontal lobe had the best performance in predicting the response time. Overall, this study used a time-resolved analysis to suggest that the EEG spectral slope could reliably track continuously altered consciousness levels during propofol anesthesia. Furthermore, the frontal spectral slope may be a promising index for clinical monitoring of anesthesia depth.
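
As a rough illustration of the 15-45 Hz spectral-slope feature described above, the sketch below fits a line to a periodogram in log-log coordinates. The synthetic segment and this particular estimator are illustrative assumptions, not the study's analysis pipeline.

```python
import numpy as np

def spectral_slope(segment, fs, fmin=15.0, fmax=45.0):
    """Fit a line to the power spectrum in log-log coordinates over
    fmin-fmax Hz; a steeper (more negative) slope indicates deeper sedation."""
    freqs = np.fft.rfftfreq(len(segment), 1 / fs)
    psd = np.abs(np.fft.rfft(segment)) ** 2 / (fs * len(segment))
    band = (freqs >= fmin) & (freqs <= fmax)
    slope, _ = np.polyfit(np.log10(freqs[band]), np.log10(psd[band]), 1)
    return slope

# Synthetic 5-second "EEG" segment with an exact 1/f^2 power spectrum.
fs, n = 250, 1250
rng = np.random.default_rng(1)
freqs = np.fft.rfftfreq(n, 1 / fs)
amp = np.zeros_like(freqs)
amp[1:] = freqs[1:] ** -1.0                  # amplitude ~ f^-1, power ~ f^-2
segment = np.fft.irfft(amp * np.exp(1j * rng.uniform(0, 2 * np.pi, freqs.size)), n)
print(round(spectral_slope(segment, fs), 2))  # close to -2.0
```

On real recordings a smoothed estimate such as Welch's method would typically replace the raw periodogram before fitting.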

JBHI Journal 2023 Journal Article

Multi-Task Learning for Pulmonary Arterial Hypertension Prognosis Prediction Via Memory Drift and Prior Prompt Learning on 3D Chest CT

  • Guanyu Yang
  • Yuting He
  • Yang Lv
  • Yang Chen
  • Jean-Louis Coatrieux
  • Xiaoxuan Sun
  • Qiang Wang
  • Yongyue Wei

Pulmonary arterial hypertension (PAH) prognosis prediction on 3D non-contrast CT images is one of the most important tasks for PAH treatment. It will help clinicians stratify patients into different groups for early diagnosis and timely intervention via automatically extracting the potential biomarkers of PAH to predict mortality. However, it is still a task of great challenges due to the large volume and low-contrast regions of interest in 3D chest CT images. In this paper, we propose the first multi-task learning-based PAH prognosis prediction framework, P²-Net, which effectively optimizes the model and powerfully represents task-dependent features via our Memory Drift (MD) and Prior Prompt Learning (PPL) strategies. 1) Our MD maintains a large memory bank to provide a dense sampling of the deep biomarkers' distribution. Therefore, although the batch size is very small caused by our large volume, a reliable (negative log partial) likelihood loss is still able to be calculated on a representative probability distribution for robust optimization. 2) Our PPL simultaneously learns an additional manual biomarkers prediction task to embed clinical prior knowledge into our deep prognosis prediction task in hidden and explicit ways. Therefore, it will prompt the prediction of deep biomarkers and improve the perception of task-dependent features in our low-contrast regions. Our P²-Net achieves a high prognostic correlation of the prediction and great generalization with the highest 70.19% C-index and 2.14 HR. Extensive experiments with promising results on our PAH prognosis prediction reveal powerful prognosis performance and great clinical significance in PAH treatment. All of our code will be made publicly available online.

AAAI Conference 2023 Conference Paper

Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity

  • Qingsong Yan
  • Qiang Wang
  • Kaiyong Zhao
  • Bo Li
  • Xiaowen Chu
  • Fei Deng

Existing learning-based multi-view stereo (MVS) methods rely on the depth range to build the 3D cost volume and may fail when the range is too large or unreliable. To address this problem, we propose a disparity-based MVS method based on the epipolar disparity flow (E-flow), called DispMVS, which infers the depth information from the pixel movement between two views. The core of DispMVS is to construct a 2D cost volume on the image plane along the epipolar line between each pair (between the reference image and several source images) for pixel matching and fuse uncountable depths triangulated from each pair by multi-view geometry to ensure multi-view consistency. To be robust, DispMVS starts from a randomly initialized depth map and iteratively refines the depth map with the help of the coarse-to-fine strategy. Experiments on the DTU MVS and Tanks&Temples datasets show that DispMVS is not sensitive to the depth range and achieves state-of-the-art results with lower GPU memory.

ICLR Conference 2023 Conference Paper

SketchKnitter: Vectorized Sketch Generation with Diffusion Models

  • Qiang Wang
  • Haoge Deng
  • Yonggang Qi
  • Da Li 0001
  • Yi-Zhe Song

We show vectorized sketch generation can be identified as a reversal of the stroke deformation process. This relationship was established by means of a diffusion model that learns data distributions over the stroke-point locations and pen states of real human sketches. Given randomly scattered stroke-points, sketch generation becomes a process of deformation-based denoising, where the generator rectifies positions of stroke points at each timestep to converge at a recognizable sketch. A key innovation was to embed recognizability into the reverse time diffusion process. It was observed that the estimated noise during the reversal process is strongly correlated with sketch classification accuracy. An auxiliary recurrent neural network (RNN) was consequently used to quantify recognizability during data sampling. It follows that, based on the recognizability scores, a sampling shortcut function can also be devised that renders better quality sketches with fewer sampling steps. Finally it is shown that the model can be easily extended to a conditional generation framework, where given incomplete and unfaithful sketches, it yields one that is more visually appealing and with higher recognizability.

AAAI Conference 2023 Conference Paper

Towards Reliable Neural Machine Translation with Consistency-Aware Meta-Learning

  • Rongxiang Weng
  • Qiang Wang
  • Wensen Cheng
  • Changfeng Zhu
  • Min Zhang

Neural machine translation (NMT) has achieved remarkable success in producing high-quality translations. However, current NMT systems suffer from a lack of reliability, as their outputs are often affected by lexical or syntactic changes in inputs, resulting in large variations in quality. This limitation hinders the practicality and trustworthiness of NMT. A contributing factor to this problem is that NMT models trained with the one-to-one paradigm struggle to handle the source diversity phenomenon, where inputs with the same meaning can be expressed differently. In this work, we treat this problem as a bilevel optimization problem and present a consistency-aware meta-learning (CAML) framework derived from the model-agnostic meta-learning (MAML) algorithm to address it. Specifically, the NMT model with CAML (named CoNMT) first learns a consistent meta representation of semantically equivalent sentences in the outer loop. Subsequently, a mapping from the meta representation to the output sentence is learned in the inner loop, allowing the NMT model to translate semantically equivalent sentences to the same target sentence. We conduct experiments on the NIST Chinese to English task, three WMT translation tasks, and the TED M2O task. The results demonstrate that CoNMT effectively improves overall translation quality and reliably handles diverse inputs.

YNIMG Journal 2022 Journal Article

Functional connectivity maps of theta/alpha and beta coherence within the subthalamic nucleus region

  • Bernadette C.M. van Wijk
  • Wolf-Julian Neumann
  • Daniel Kroneberg
  • Andreas Horn
  • Friederike Irmen
  • Tilmann H. Sander
  • Qiang Wang
  • Vladimir Litvak

The subthalamic nucleus (STN) is a primary target for deep brain stimulation in Parkinson's disease (PD). Although small in size, the STN is commonly partitioned into sensorimotor, cognitive/associative, and limbic subregions based on its structural connectivity profile to cortical areas. We investigated whether such a regional specialization is also supported by functional connectivity between local field potential recordings and simultaneous magnetoencephalography. Using a novel data set of 21 PD patients, we replicated previously reported cortico-STN coherence networks in the theta/alpha and beta frequency ranges, and looked for the spatial distribution of these networks within the STN region. Although theta/alpha and beta coherence peaks were both observed in on-medication recordings from electrode contacts at several locations within and around the STN, sites with theta/alpha coherence peaks were situated at significantly more inferior MNI coordinates than beta coherence peaks. Sites with only theta/alpha coherence peaks, i.e. without distinct beta coherence, were mostly located near the border of sensorimotor and cognitive/associative subregions as defined by a tractography-based atlas of the STN. Peak coherence values were largely unaltered by the medication state of the subject, however, theta/alpha peaks were more often identified in recordings obtained after administration of dopaminergic medication. Our findings suggest the existence of a frequency-specific topography of cortico-STN coherence within the STN, albeit with considerable spatial overlap between functional networks. Consequently, optimization of deep brain stimulation targeting might remain a trade-off between alleviating motor symptoms and avoiding adverse neuropsychiatric side effects.

IJCAI Conference 2022 Conference Paper

Learning from Students: Online Contrastive Distillation Network for General Continual Learning

  • Jin Li
  • Zhong Ji
  • Gang Wang
  • Qiang Wang
  • Feng Gao

The goal of General Continual Learning (GCL) is to preserve learned knowledge and learn new knowledge with constant memory from an infinite data stream where task boundaries are blurry. Distilling the model's response on reserved samples between the old and the new models is an effective way to achieve promising performance on GCL. However, it accumulates the old model's inherent response bias and is not robust to model changes. To this end, we propose an Online Contrastive Distillation Network (OCD-Net) to tackle these problems, which explores the merit of the student model at each time step to guide the training process of the student model. Concretely, the teacher model is devised to help the student model consolidate the learned knowledge; it is trained online by integrating the model weights of the student model to accumulate the new knowledge. Moreover, our OCD-Net incorporates both relation and adaptive response to help the student model alleviate catastrophic forgetting, which is also beneficial for the teacher model to preserve the learned knowledge. Extensive experiments on six benchmark datasets demonstrate that our proposed OCD-Net significantly outperforms state-of-the-art approaches by 3.26%~8.71% with various buffer sizes. Our code is available at https://github.com/lijincm/OCD-Net.
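
A common form of the online weight integration the abstract describes, where the teacher accumulates the student's weights at each step, is an exponential moving average; whether OCD-Net uses exactly this form is an assumption of the toy sketch below.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Online weight integration: the teacher keeps an exponential moving
    average of the student's weights, smoothing out per-step fluctuations."""
    for k in teacher:
        teacher[k] = momentum * teacher[k] + (1.0 - momentum) * student[k]

# Toy two-layer "models" stored as dicts of weight arrays.
rng = np.random.default_rng(0)
student = {"w1": rng.normal(size=(4, 4)), "w2": rng.normal(size=4)}
teacher = {k: v.copy() for k, v in student.items()}

for step in range(100):
    # Stand-in for a gradient step on the student.
    for k in student:
        student[k] += 0.01 * rng.normal(size=student[k].shape)
    ema_update(teacher, student)

# The teacher lags the student slightly but tracks it smoothly.
drift = max(np.abs(teacher[k] - student[k]).max() for k in student)
print(drift < 1.0)  # True
```

The momentum controls the trade-off: values close to 1 make the teacher a stable consolidation of old knowledge, while lower values let it absorb new knowledge faster.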

IJCAI Conference 2021 Conference Paper

Dynamic Rebalancing Dockless Bike-Sharing System based on Station Community Discovery

  • Jingjing Li
  • Qiang Wang
  • Wenqi Zhang
  • Donghai Shi
  • Zhiwei Qin

Influenced by the era of the sharing economy and mobile payment, Dockless Bike-Sharing System (Dockless BSS) is expanding in many major cities. The mobility of users constantly leads to supply and demand imbalance, which seriously affects the total profit and customer satisfaction. In this paper, we propose the Spatio-Temporal Mixed Integer Program (STMIP) with Flow-graphed Community Discovery (FCD) approach to rebalancing the system. Different from existing studies that ignore the route of trucks and adopt a centralized rebalancing, our approach considers the spatio-temporal information of trucks and discovers station communities for truck-based rebalancing. First, we propose the FCD algorithm to detect station communities. Significantly, rebalancing communities decomposes the centralized system into a distributed multi-communities system. Then, by considering the routing and velocity of trucks, we design the STMIP model with the objective of maximizing total profit, to find a repositioning policy for each station community. We design a simulator built on real-world data from DiDi Chuxing to test the algorithm performance. The extensive experimental results demonstrate that our approach outperforms the state-of-the-art approach in terms of service level, profit, and complexity.

JBHI Journal 2021 Journal Article

Effective Brain State Estimation During Propofol-Induced Sedation Using Advanced EEG Microstate Spectral Analysis

  • Yamin Li
  • Wen Shi
  • Zhian Liu
  • Jing Li
  • Qiang Wang
  • Xiangguo Yan
  • Zehong Cao
  • Gang Wang

Brain states are patterns of neuronal synchrony, and the electroencephalogram (EEG) microstate provides a promising tool to characterize and analyze the synchronous neural firing. However, the topographical spectral information for each predominant microstate is still unclear during the switch of consciousness, such as sedation, and the practical usage of the EEG microstate is worth probing. Also, the mechanism behind the anesthetic-induced alternations of brain states remains poorly understood. In this study, an advanced EEG microstate spectral analysis was utilized using multivariate empirical mode decomposition in Hilbert-Huang transform. The practicability was further investigated in scalp EEG recordings during the propofol-induced transition of consciousness. The process of transition from the awake baseline to moderate sedation was accompanied by apparent increases in microstate (A, B, and F) energy, especially in the whole-brain delta band, frontal alpha band and beta band. In comparison to other effective EEG-based parameters that commonly used to measure anesthetic depth, using the selected spectral features reached better performance (80% sensitivity, 90% accuracy) to estimate the brain states during sedation. The changes in microstate energy also exhibited high correlations with individual behavioral data during sedation. In a nutshell, the EEG microstate spectral analysis is an effective method to estimate brain states during propofol-induced sedation, giving great insights into the underlying mechanism. The generated spectral features can be promising markers to dynamically assess the consciousness level.

AAAI Conference 2021 System Paper

Fashion Focus: Multi-modal Retrieval System for Video Commodity Localization in E-commerce

  • Yanhao Zhang
  • Qiang Wang
  • Pan Pan
  • Yun Zheng
  • Cheng Da
  • Siyang Sun
  • Yinghui Xu

Nowadays, live-stream and short-video shopping in E-commerce have grown exponentially. However, sellers must manually match images of the products on sale to the timestamps at which they are exhibited in the untrimmed video, a complicated process. To solve this problem, we present an innovative demonstration of a multi-modal retrieval system called "Fashion Focus", which can precisely localize product images in online videos as the focuses. Different modalities, including visual content, linguistic features, and interaction context, contribute to the localization and are jointly investigated via the presented multi-modal learning. Our system employs two procedures, video content structuring and multi-modal retrieval, to automatically achieve accurate video-to-shop matching. Fashion Focus presents a unified framework that can orient consumers toward relevant product exhibitions while they watch videos and help sellers effectively deliver products through search and recommendation.

ICLR Conference 2020 Conference Paper

Adversarial AutoAugment

  • Xinyu Zhang
  • Qiang Wang
  • Jian Zhang
  • Zhao Zhong

Data augmentation (DA) has been widely utilized to improve generalization in training deep neural networks. Recently, human-designed data augmentation has gradually been replaced by automatically learned augmentation policies. By finding the best policy in a well-designed search space of data augmentations, AutoAugment (Cubuk et al., 2019) can significantly improve validation accuracy on image classification tasks. However, this approach is not computationally practical for large-scale problems. In this paper, we develop an adversarial method, Adversarial AutoAugment, that arrives at a computationally affordable solution by simultaneously optimizing the target-related objective and the augmentation policy search loss. The augmentation policy network attempts to increase the training loss of a target network by generating adversarial augmentation policies, while the target network learns more robust features from harder examples to improve generalization. In contrast to prior work, we reuse the computation in target network training for policy evaluation and dispense with retraining the target network. Compared with AutoAugment, this leads to about a 12x reduction in computing cost and an 11x shortening in time overhead on ImageNet. We report experimental results on CIFAR-10/CIFAR-100 and ImageNet, demonstrating significant improvements over the state of the art. On CIFAR-10, we achieve a top-1 test error of 1.36%, currently the best-performing single model. On ImageNet, we achieve a leading top-1 accuracy of 79.40% on ResNet-50 and 80.00% on ResNet-50-D without extra data.
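The adversarial loop can be illustrated with a deliberately tiny stand-in (a logistic-regression "target network" and three fixed candidate augmentations, all illustrative; the paper uses a learned policy network and deep targets): at each step, the policy picks the augmentation that maximizes the target's current loss, and the target trains on that hardest batch.

```python
import numpy as np

# Toy adversarial-augmentation loop: policy = argmax-loss over a fixed
# candidate set; target = logistic regression trained on the hard batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = np.zeros(2)

def loss(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

augmentations = [lambda X: X, lambda X: X + 0.3, lambda X: X * 1.5]  # candidates

for step in range(200):
    # "Policy": choose the augmentation that is hardest for the current target.
    Xa = max((a(X) for a in augmentations), key=lambda Xa: loss(w, Xa, y))
    p = 1 / (1 + np.exp(-Xa @ w))
    w -= 0.5 * Xa.T @ (p - y) / len(y)   # gradient step on the hard batch
final = loss(w, X, y)                    # clean-data loss after training
```

Despite always training on the adversarially chosen batch, the clean-data loss still drops well below its initial value of log 2, which is the "robust features from harder examples" effect in miniature.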

YNICL Journal 2020 Journal Article

Biotypes of major depressive disorder: Neuroimaging evidence from resting-state default mode network patterns

  • Sugai Liang
  • Wei Deng
  • Xiaojing Li
  • Andrew J. Greenshaw
  • Qiang Wang
  • Mingli Li
  • Xiaohong Ma
  • Tong-Jian Bai

BACKGROUND: Major depressive disorder (MDD) is a heterogeneous disorder associated with aberrant functional connectivity within the default mode network (DMN). This study focused on data-driven identification and validation of potential DMN-pattern-based MDD subtypes to parse the heterogeneity of the disorder. METHODS: The sample comprised 1397 participants, including 690 patients with MDD and 707 healthy controls (HC), recruited from multiple sites of the REST-meta-MDD Project in China. Baseline resting-state functional magnetic resonance imaging (rs-fMRI) data were recorded for each participant. Discriminative DMN features were selected between patients and HC. Patient subgroups were defined by K-means clustering and principal component analysis in the multi-site datasets and validated in an independent single-site dataset. The statistical significance of the resulting clustering was confirmed. Demographic and clinical variables were compared between the identified patient subgroups. RESULTS: Two MDD subgroups with differing DMN functional connectivity profiles were identified in the multi-site datasets and were relatively stable across validation samples. The predominant dysfunctional connectivity profiles were detected among the superior frontal cortex, ventral medial prefrontal cortex, posterior cingulate cortex, and precuneus: one subgroup exhibited increased connectivity (hyperDMN MDD) and the other decreased connectivity (hypoDMN MDD). In the hyperDMN subgroup of the discovery dataset, the severity of depressive symptoms was age-related. The patient subgroups had comparable demographic and clinical symptom variables. CONCLUSIONS: The findings suggest the existence of two neural subtypes of MDD associated with different dysfunctional DMN connectivity patterns, which may provide useful evidence for parsing the heterogeneity of depression and inform the search for personalized treatment strategies.
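The subtype-discovery step (PCA followed by K-means with K=2) can be sketched in plain numpy; the data below are synthetic stand-ins for DMN connectivity features, and the study's actual pipeline additionally involves feature selection and multi-site validation:

```python
import numpy as np

# Synthetic stand-in: two feature "subtypes" with shifted means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 0.5, size=(50, 20)),     # "hyperDMN"-like
               rng.normal(-1.0, 0.5, size=(50, 20))])    # "hypoDMN"-like

# PCA via SVD on centered data, keeping 2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T

# Plain K-means (K=2), a few Lloyd iterations; deterministic init for the sketch.
centers = Z[[0, 50]]
for _ in range(20):
    labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([Z[labels == k].mean(axis=0) for k in range(2)])
```

With well-separated synthetic groups, the two recovered clusters coincide with the two generating distributions, which is the kind of stability the study then checks across independent samples.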

AAAI Conference 2020 Conference Paper

Multi-Speaker Video Dialog with Frame-Level Temporal Localization

  • Qiang Wang
  • Pin Jiang
  • Zhiyi Guo
  • Yahong Han
  • Zhou Zhao

To simulate human interaction in real life, dialog systems are introduced to generate a response to previous chat utterances. There have been several studies of two-speaker video dialogs in the form of question answering. However, more informative semantic cues could be exploited via multi-round chatting or discussion about the video among multiple speakers, so multi-speaker video dialogs are more applicable in real life. Besides, speakers always chat about a sub-segment of a long video fragment for a period of time. Current video dialog systems must be directly given the relevant video sub-segment that the speakers are chatting about, yet it is often hard to accurately spot the corresponding sub-segment in practical applications. In this paper, we introduce a novel task, Multi-Speaker Video Dialog with frame-level Temporal Localization (MSVD-TL), to make video dialog systems more applicable. Given a long video fragment and a set of chat history utterances, MSVD-TL aims to predict the following response and simultaneously localize the relevant video sub-segment at the frame level. We develop a new multi-task model with a response prediction module and a frame-level temporal localization module. We further focus on the characteristics of the video dialog generation process and exploit the relations among the video fragment, the chat history, and the following response to refine their representations. We evaluate our approach on both the Multi-Speaker Video Dialog without frame-level temporal localization (MSVD w/o TL) task and the MSVD-TL task. The experimental results demonstrate that MSVD-TL enhances the applicability of video dialog in real life.

AAAI Conference 2020 Conference Paper

Neural Machine Translation with Joint Representation

  • Yanyang Li
  • Qiang Wang
  • Tong Xiao
  • Tongran Liu
  • Jingbo Zhu

Though early successes of Statistical Machine Translation (SMT) systems are attributed in part to the explicit modelling of the interaction between any two source and target units, e.g., alignment, the recent Neural Machine Translation (NMT) systems resort to the attention which partially encodes the interaction for efficiency. In this paper, we employ Joint Representation that fully accounts for each possible interaction. We sidestep the inefficiency issue by refining representations with the proposed efficient attention operation. The resulting Reformer models offer a new Sequence-to-Sequence modelling paradigm besides the Encoder-Decoder framework and outperform the Transformer baseline in either the small scale IWSLT14 German-English, English-German and IWSLT15 Vietnamese-English or the large scale NIST12 Chinese-English translation tasks by about 1 BLEU point. We also propose a systematic model scaling approach, allowing the Reformer model to beat the state-of-the-art Transformer in IWSLT14 German-English and NIST12 Chinese-English with about 50% fewer parameters. The code is publicly available at https://github.com/lyy1994/reformer.

AAAI Conference 2020 Conference Paper

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild

  • Yueying Kao
  • Weiming Li
  • Qiang Wang
  • Zhouchen Lin
  • Wooshik Kim
  • Sunghoon Hong

Monocular object pose estimation is an important yet challenging computer vision problem. Depth features can provide useful information for pose estimation. However, existing methods rely on real depth images to extract depth features, which limits their applicability. In this paper, we aim to extract RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation. Specifically, we propose a deep convolutional neural network with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module. The embedding module is trained with synthetic pair data to learn a depth-oriented embedding space between RGB and depth images optimized for object pose estimation. The adaptation module further aligns distributions from synthetic to real data. Compared with existing methods, our method does not need any real depth images and can be trained easily with large-scale synthetic data. Extensive experiments and comparisons show that our method achieves the best performance on the challenging public PASCAL 3D+ dataset in all metrics, which substantiates its superiority and the contribution of the above modules.

IJCAI Conference 2019 Conference Paper

A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

  • Shaohuai Shi
  • Kaiyong Zhao
  • Qiang Wang
  • Zhenheng Tang
  • Xiaowen Chu

Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme was proposed to reduce the communication complexity from O(kP) to O(k log P), which significantly boosts system scalability. However, it remained unclear whether the gTop-k sparsification scheme converges in theory. In this paper, we first provide theoretical proofs of the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is of the same order as vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and discuss the impact of the compression ratio on convergence performance.
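The mechanism behind the O(k log P) complexity can be sketched as a pairwise tree merge: each worker keeps its local top-k gradient entries, and at each of the log P merge levels, pairs of workers sum their sparse vectors and re-select the top-k, so only O(k) values travel per level. A minimal numpy sketch (sizes and data are illustrative):

```python
import numpy as np

def topk_sparsify(g, k):
    """Zero all but the k largest-magnitude entries of g."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def gtopk(grads, k):
    """Tree-reduce local top-k vectors, re-sparsifying after each pairwise merge."""
    vecs = [topk_sparsify(g, k) for g in grads]
    while len(vecs) > 1:
        merged = []
        for i in range(0, len(vecs) - 1, 2):
            merged.append(topk_sparsify(vecs[i] + vecs[i + 1], k))
        if len(vecs) % 2:                  # odd worker carries over to next level
            merged.append(vecs[-1])
        vecs = merged
    return vecs[0]

rng = np.random.default_rng(0)
grads = [rng.normal(size=100) for _ in range(8)]   # P = 8 workers, d = 100
g = gtopk(grads, k=4)                              # global sparse update
```

The result stays k-sparse at every merge level, in contrast to plain Top-k aggregation, where gathering all workers' selections costs O(kP).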

YNICL Journal 2019 Journal Article

An enriched granger causal model allowing variable static anatomical constraints

  • Kun Bi
  • Guoping Luo
  • Shui Tian
  • Siqi Zhang
  • Xiaoxue Liu
  • Qiang Wang
  • Qing Lu
  • Zhijian Yao

Anatomical connectivity constrains but does not fully determine functional connectivity, especially when one explores the dynamics over the course of a trial. Therefore, an enriched Granger causal model (GCM) integrating anatomical prior information is proposed in this study to describe dynamic effective connectivity, distinguish depression, and explore its pathogenesis. In the proposed framework, anatomical information is converted via an optimized transformation model and then integrated into the normal GCM through a variational Bayesian model. Magnetoencephalography (MEG) signals and diffusion tensor imaging (DTI) of 24 depressive patients and 24 matched controls were utilized for performance comparison. Applied to sliding-windowed MEG signals under sad facial stimuli, the enriched GCM was used to calculate region-pair dynamic effective connectivity, which was repeatedly sifted via feature selection and fed into different classifiers. In terms of both model error and recognition accuracy, the results supported the superiority of the enriched GCM with anatomical priors over the normal GCM. For effective connectivity with anatomical priors, the best subject discrimination accuracy with an SVM was 85.42% (sensitivity 87.50%, specificity 83.33%). Furthermore, discriminative feature analysis suggested that the enriched GCM, by capturing variable anatomical constraints on function, could better detect the more rigid and less dynamic brain function in depression. The proposed approach is valuable for exploring dynamic functional dysfunction in depression and could be useful for depression recognition.

IJCAI Conference 2018 Conference Paper

An Appearance-and-Structure Fusion Network for Object Viewpoint Estimation

  • Yueying Kao
  • Weiming Li
  • Zairan Wang
  • Dongqing Zou
  • Ran He
  • Qiang Wang
  • Minsu Ahn
  • Sunghoon Hong

Automatic object viewpoint estimation from a single image is an important but challenging problem in the machine intelligence community. Although impressive performance has been achieved, current state-of-the-art methods still have difficulty dealing with the visual ambiguity and structure ambiguity in real-world images. To tackle these problems, this paper proposes a novel Appearance-and-Structure Fusion network, which we call ASFnet, that estimates viewpoint by fusing both appearance and structure information. The structure information is encoded by precise semantic keypoints and can help address the visual ambiguity, while distinguishable appearance features contribute to overcoming the structure ambiguity. Our ASFnet integrates an appearance path and a structure path into an end-to-end network and allows deep features to effectively share supervision from these two complementary aspects. A convolutional layer is learned to fuse the two paths' results adaptively. To balance the influence of the two supervision sources, a piecewise loss-weight strategy is employed during training. Experimentally, our proposed network outperforms state-of-the-art methods on the public PASCAL 3D+ dataset, which verifies the effectiveness of our method and further corroborates the above proposition.

IJCAI Conference 2018 Conference Paper

Do not Lose the Details: Reinforced Representation Learning for High Performance Visual Tracking

  • Qiang Wang
  • Mengdan Zhang
  • Junliang Xing
  • Jin Gao
  • Weiming Hu
  • Steve Maybank

This work presents a novel end-to-end trainable CNN model for high-performance visual object tracking. It learns both low-level fine-grained representations and a high-level semantic embedding space in a mutually reinforced way, and a multi-task learning strategy is proposed to perform correlation analysis on representations from both levels. In particular, a fully convolutional encoder-decoder network is designed to reconstruct the original visual features from the semantic projections, preserving all the geometric information. Moreover, the correlation filter layer, which operates on the fine-grained representations, leverages a global context constraint for accurate object appearance modeling. The correlation filter in this layer is updated online efficiently without network fine-tuning. The proposed tracker therefore benefits from two complementary effects: the adaptability of the fine-grained correlation analysis and the generalization capability of the semantic embedding. Extensive experimental evaluations on four popular benchmarks demonstrate its state-of-the-art performance.
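The correlation filter idea the abstract builds on can be sketched in the frequency domain (a single-image, MOSSE-style ridge-regression filter on raw pixels, not the paper's CNN-feature filter with a global context constraint): learn a filter whose response to the template is a Gaussian peaked on the target, then localize the target at the response argmax.

```python
import numpy as np

def gaussian_response(h, w, cy, cx, sigma=2.0):
    """Desired correlation output: a Gaussian peaked at (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))          # stand-in for a feature map / patch
g = gaussian_response(64, 64, 32, 32)    # target assumed centered at (32, 32)

# Closed-form ridge-regression filter in the Fourier domain.
F = np.fft.fft2(img)
G = np.fft.fft2(g)
lam = 1e-2                               # regularizer
W = (G * np.conj(F)) / (F * np.conj(F) + lam)

# Applying the filter to the training image recovers (approximately) g.
response = np.real(np.fft.ifft2(W * F))
peak = np.unravel_index(np.argmax(response), response.shape)
```

The closed-form Fourier-domain solve is what makes the online update cheap enough to run every frame without network fine-tuning.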

IJCAI Conference 2018 Conference Paper

HCR-Net: A Hybrid of Classification and Regression Network for Object Pose Estimation

  • Zairan Wang
  • Weiming Li
  • Yueying Kao
  • Dongqing Zou
  • Qiang Wang
  • Minsu Ahn
  • Sunghoon Hong

Object pose estimation from a single image is a fundamental and challenging problem in computer vision and robotics. Current methods generally treat pose estimation as either a classification or a regression problem. However, regression-based methods usually suffer from imbalanced training data, while classification methods have difficulty discriminating nearby poses. In this paper, a hybrid CNN model, which we call HCR-Net, is proposed to deal with these issues by integrating a classification network and a regression network. Our model is inspired by the observation that regression methods achieve better accuracy on homogeneously distributed datasets, while classification methods are more effective for coarse quantization of the poses even when the dataset is not well balanced; the two kinds of methods essentially complement each other. We therefore integrate both into a neural network in a hybrid fashion and train it end-to-end with two novel loss functions. As a result, our method surpasses the state-of-the-art methods even with imbalanced training data and much less data augmentation. The experimental results on the challenging PASCAL 3D+ database demonstrate that our method outperforms the state of the art significantly, achieving improvements of up to 4% and 6% on the ACC and AVP metrics, respectively.
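The hybrid decode step can be sketched as follows (the bin layout, head outputs, and function names are illustrative, not HCR-Net's exact parameterization): a classifier picks a coarse angle bin, a regressor predicts the residual within that bin, and the final pose is the bin center plus the residual.

```python
import numpy as np

# Combine a coarse classification head with a per-bin regression head.
def decode_pose(class_logits, residuals, n_bins=8):
    """Final angle (degrees) = center of the winning bin + that bin's residual."""
    bin_width = 360.0 / n_bins
    b = int(np.argmax(class_logits))          # coarse bin from the classifier
    center = (b + 0.5) * bin_width            # bin center in degrees
    return (center + residuals[b]) % 360.0    # add the within-bin residual

logits = np.array([0.1, 0.2, 3.0, 0.1, 0.0, 0.0, 0.1, 0.2])   # bin 2 wins
residuals = np.zeros(8)
residuals[2] = -7.5                           # regressor's within-bin offset
angle = decode_pose(logits, residuals)        # 112.5 - 7.5 = 105.0 degrees
```

This split is why the two heads complement each other: the classifier only has to separate coarse bins (robust to imbalance), while the regressor only fits a small, roughly homogeneous range inside each bin.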

AAAI Conference 2013 Conference Paper

Personalized Recommendation Based on Co-Ranking and Query-Based Collaborative Diffusion

  • Xiao Yang
  • Zhaoxin Zhang
  • Qiang Wang

In this paper, we present an adaptive graph-based personalized recommendation method based on co-ranking and query-based collaborative diffusion. By utilizing the unique network structure of an n-partite heterogeneous graph, we address personalized recommendation with a two-layer ranking process, using a reasonable measure of high- and low-order relationships and analyzing how the user's preference is represented in the graph. The experiments show that this algorithm outperforms traditional CF methods, achieves competitive performance compared with many model-based and graph-based recommendation methods, and has better scalability and flexibility.
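Query-based diffusion on a user-item graph can be sketched as a personalized-PageRank-style walk (the bipartite adjacency matrix, restart weight, and iteration count below are illustrative, not the paper's n-partite construction): scores diffuse from the query user through items and back, with a restart at the query.

```python
import numpy as np

A = np.array([[1, 1, 0, 0],      # user 0 rated items 0, 1
              [0, 1, 1, 0],      # user 1 rated items 1, 2
              [0, 0, 1, 1]],     # user 2 rated items 2, 3
             dtype=float)

n_users, n_items = A.shape
# Row-stochastic two-hop transition: user -> item -> user.
P_ui = A / A.sum(axis=1, keepdims=True)          # user-to-item step
P_iu = (A / A.sum(axis=0, keepdims=True)).T      # item-to-user step

alpha, query = 0.15, 0
u = np.zeros(n_users)
u[query] = 1.0                                   # all mass starts at the query user
for _ in range(50):
    # Restart at the query user with probability alpha, otherwise diffuse.
    u = alpha * (np.arange(n_users) == query) + (1 - alpha) * (u @ P_ui @ P_iu)

item_scores = u @ P_ui                           # one more hop to score items
ranking = np.argsort(-item_scores)               # higher score = stronger recommendation
```

Because the restart keeps pulling mass back to the query user, items close to that user in the graph (here, items 0 and 1) end up scored above distant ones, which is the personalization effect.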