Arrow Research search

Author name cluster

Wei Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

89 papers
2 author rows

Possible papers

JBHI Journal 2026 Journal Article

Efficient Sleep Staging With Bayesian Uncertainty-Guided Active Learning

  • Tianyou Yu
  • Rui Huang
  • Fei Wang
  • Jun Zhang
  • Wei Wu
  • Zhuliang Yu
  • Yuanqing Li
  • Jun Xiao

Automated sleep staging is essential for large-scale and home-based sleep monitoring; however, in routine clinical practice, sleep annotation remains largely dependent on experienced experts performing time-consuming and labor-intensive manual scoring. Existing automatic systems often struggle to adapt reliably to new subjects, limiting their clinical adoption and reinforcing the reliance on expert review. This creates a strong demand for adaptive and efficient sleep staging systems that can substantially reduce annotation workload while preserving expert-level accuracy. We propose BayesSleepNet, a novel framework that integrates Bayesian uncertainty quantification with active learning for adaptive sleep staging. BayesSleepNet employs principled Bayesian modeling by placing distributions over network weights and performing Monte Carlo sampling at inference, enabling explicit quantification of model (epistemic) uncertainty. These uncertainty estimates drive a two-stage sample selection strategy that first fine-tunes the model using representative epochs and subsequently prioritizes persistently uncertain samples for expert review. Across four public sleep datasets, BayesSleepNet consistently improves performance—by 7.60% in accuracy, 8.27% in macro-F1, and 0.104 in Cohen's $\kappa$—while requiring manual annotation of only 20% of data from new subjects. Despite its adaptive learning capability, BayesSleepNet remains computationally lightweight, using substantially fewer parameters than representative high-capacity state-of-the-art models. These results demonstrate the clinical promise of uncertainty-aware active learning as a practical and cost-efficient paradigm for semi-automated sleep staging. Code is available at https://github.com/yuty2009/bayesugal.
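The uncertainty-guided selection described above can be illustrated with a toy sketch. The scoring rule here (predictive entropy of the mean distribution over Monte Carlo passes) is a standard choice for epistemic uncertainty, but BayesSleepNet's exact two-stage criterion may differ:

```python
import numpy as np

def predictive_entropy(mc_probs):
    """Entropy of the mean predictive distribution over T stochastic passes.

    mc_probs: array of shape (T, n_classes) holding class probabilities from
    T Monte Carlo forward passes (e.g. with sampled weights).
    """
    mean_p = mc_probs.mean(axis=0)
    return float(-(mean_p * np.log(mean_p + 1e-12)).sum())

def select_uncertain(mc_probs_per_epoch, budget):
    """Rank sleep epochs by predictive entropy and return the indices of the
    `budget` most uncertain ones for expert review."""
    scores = np.array([predictive_entropy(p) for p in mc_probs_per_epoch])
    return np.argsort(scores)[::-1][:budget]

rng = np.random.default_rng(0)
# Epoch 0: confident and stable across passes; epoch 1: highly ambiguous.
confident = np.tile([0.9, 0.05, 0.05], (10, 1))
uncertain = rng.dirichlet([1, 1, 1], size=10)
picked = select_uncertain([confident, uncertain], budget=1)
assert picked[0] == 1  # the ambiguous epoch is queried first
```

With a 20% annotation budget, only the epochs at the top of this ranking would ever reach the expert.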

AAAI Conference 2026 Conference Paper

Explicit Modeling of Causal Factors and Confounders for Image Classification

  • Wei Wu
  • Lei Meng
  • Zhuang Qi
  • Zixuan Li
  • Yachong Zhang
  • Xiaoshuo Yan
  • Xiangxu Meng

Causal inference has emerged as a promising approach for identifying decisive semantic factors and eliminating spurious correlations in visual representation learning. However, most existing methods rely on latent, data-driven confounder modeling, normally attributing the source of bias to background information while neglecting object-level semantic confusions that commonly occur in complex scenes. This limits their effectiveness in disentangling causal factors from confounding semantics. To address this challenge, we propose an explicit modeling approach for both causal factors and confounders, termed Explicit Modeling Causal Model (EMCM). The proposed framework consists of three key components. The Features Stability Estimation module explicitly models the relationship between visual semantics and class labels by leveraging clustering patterns to perform class-aware separation of causal and confounding factors. It produces class-specific causal factors and confounding factors linked to ambiguous categories. Subsequently, the Discriminative Features Enhancing module integrates causal factors into fused patch features via front-door intervention for stable semantics. In parallel, the Explicit Confounder Modeling and Debiasing Module learns confounders under clear label guidance and derives debiased context features by TDE modeling. This framework leverages two complementary causal perspectives to construct a unified semantic representation that facilitates improved generalization. Extensive experiments on two datasets demonstrate that EMCM effectively disentangles causal and confounding factors in complex scenarios, consistently outperforming state-of-the-art causal debiasing methods and text-guided methods in all metrics.

AAAI Conference 2026 Conference Paper

Introducing Decomposed Causality with Spatiotemporal Object-Centric Representation for Video Classification

  • Yachong Zhang
  • Lei Meng
  • Shuo Xu
  • Zhuang Qi
  • Wei Wu
  • Lei Wu
  • Xiangxu Meng

Video classification requires event-level representations of objects and their interactions. Existing methods typically rely on data-driven approaches, which either learn such features from whole frames or object-centric visual regions. Therefore, the modeling of spatiotemporal interactions among objects is usually overlooked. To address this issue, this paper presents a Decomposition of Synergistic, Unique, and Redundant Causal Representations Learning (SurdCRL) model for video classification, which introduces a newly-proposed SURD causal theory to model the spatiotemporal features of both object dynamics and their in- and cross-frame interactions. Specifically, SurdCRL employs three modules to model the object-centric spatiotemporal dynamics using distinct types of causal components, where the first module Spatial-Temporal Entity Modeling decouples the frame into object and context entities, and employs a temporal message passing block to capture object state changes over time, generating spatiotemporal features as basic causal variables. Second, the Dual-Path Causal Inference module mitigates confounders among causal variables by front-door and back-door interventions, thus enabling the subsequent causal components to reflect their intrinsic effects. Finally, the Causal Composition and Selection module employs the compositional structure-aware attention to project the causal variables and their high-order interactions into the synergistic, unique, and redundant components. Experiments on two benchmarking datasets verify that SurdCRL better captures event-relevant object-centric representation by decomposing spatiotemporal object interactions into three types of causal components.

JBHI Journal 2026 Journal Article

MB-STFormer: A Multi-Band Spectral-Temporal Transformer with Efficient Attention for Enhanced EEG-Based Fatigue Detection

  • Ke Liu
  • Lilong Sun
  • Wenlong Wang
  • Zhenghui Gu
  • Zhuliang Yu
  • Wei Wu

Accurate detection of driver fatigue is critical for preventing traffic accidents. Although electroencephalogram (EEG) signals provide a robust physiological indicator of fatigue, effectively capturing their intricate spatiotemporal-spectral dynamics poses significant challenges. In this paper, we propose MB-STFormer, a novel deep neural network designed for EEG-based fatigue detection, which systematically integrates neurophysiological priors into deep feature learning. The proposed MB-STFormer employs a multi-branch frequency-aware module to extract spatiotemporal features from EEG signals, with each branch dedicated to a distinct frequency sub-band. By leveraging adaptive temporal convolution kernel sizes tailored to each sub-band, the model adeptly captures the inherent rhythmic patterns and temporal dynamics unique to different frequency components. Additionally, we introduce an Efficient Additive Attention mechanism to aggregate global contextual information, thereby addressing the over-smoothing of subtle yet critical features often encountered with conventional transformer self-attention mechanisms. Extensive experiments conducted on three publicly available datasets demonstrate that MB-STFormer achieves state-of-the-art performance while maintaining superior interpretability and generalizability. The proposed framework offers a promising solution for real-world fatigue monitoring systems.

AAAI Conference 2026 Conference Paper

MetaAct-RL: Training Language Models for Reasoning Through Meta-Action-Based Reinforcement Learning

  • Zhiheng Xi
  • Yuhui Wang
  • Yiwen Ding
  • Guanyu Li
  • Senjie Jin
  • Shichun Liu
  • Jixuan Huang
  • Dingwen Yang

Outcome-based reinforcement learning has made notable advances in training language models (LMs) for reasoning. However, without explicit incentives and controls, this paradigm has limitations and instability in eliciting high-quality reasoning trajectories with diverse actions—particularly for models whose pretraining lacked extensive reasoning-related data. To this end, we introduce MetaAct-RL, a new RL framework that frames LMs’ thinking as sequential decision making over meta-actions. In this framework, the model chooses and executes a high-level action at each step—such as forward reasoning, critique, or refinement—to gradually reach the correct answer. To encourage deeper exploration and richer action diversity, and to improve sampling efficiency in the RL optimization process, MetaAct-RL incorporates appropriate length-based reward and regularization, and a key-state restart mechanism. Extensive experiments across six benchmarks show that MetaAct-RL improves reasoning performance by 7.99 points on Llama3.2-1B and 7.17 points on Llama3.1-8B relative to the vanilla RL method. Moreover, on the challenging AIME-2024, our method outperforms vanilla RL by 7.5 points with Qwen2.5-1.5B.

JBHI Journal 2025 Journal Article

A Novel Approach to Explore Internal Cardiac Electrophysiological Pattern under Emotional Stress

  • Hanrui Dong
  • Shijie He
  • Wei Wu
  • Xianbin Zhang
  • Ming Li
  • Richard Millham
  • Guibin Bian
  • Wanqing Wu

Numerous psychological and clinical studies have confirmed a correlation between mental and cardiac health. We aim to explore this relationship further by examining how emotions influence cardiac health. By collecting body surface potential and utilizing the electrocardiographic imaging (ECGI) model, we can noninvasively and continuously reconstruct internal cardiac electrical activity. To enhance the existing ECGI model on various datasets, we propose an information fusion strategy called Emotional Potential Conversion CycleGAN. It enables data alignment across diverse datasets while preserving emotional information, allowing us to reconstruct cardiac electrical activity in various emotional states. Our results demonstrate successful data conversion while maintaining emotional integrity, achieving an impressive 91.92% accuracy in emotion recognition. We further validated this approach using publicly available datasets, WESAD and SWELL, which yielded consistent results. Additionally, we conducted preliminary investigations into the correlation and variability of cardiac activity across different sites under stress. The correlation study indicates a generalized association among various regions of the heart, while variability studies reveal that fluctuations in cardiac electrical activity during stress are primarily concentrated around the atrioventricular node and Purkinje fibers. This suggests a potential risk for pre-excitation syndrome, possibly due to the possible presence of a Kent bundle. Overall, we present a practical approach for studying the interplay between emotional states and cardiac health. Our findings indicate a potential relationship under stress that may provide valuable insights for future research.

JBHI Journal 2025 Journal Article

ADMM-ESINet: A Deep Unrolling Network for EEG Extended Source Imaging

  • Ke Liu
  • Hang Jiang
  • Hu Yang
  • Jun Zhang
  • Zhenghui Gu
  • Zhuliang Yu
  • Yu Zhang
  • Bin Xiao

Electroencephalography (EEG) source imaging (ESI) methods aim to reconstruct cortical sources from scalp EEG signals, a crucial task for understanding the normal brain as well as brain disorders. Traditional model-driven ESI methods face challenges in real-time reconstruction, while deep neural network (DNN)-based ESI methods often struggle with generalization to new data. To address these issues, we propose ADMM-ESINet, a novel deep unfolding neural network for robust and efficient reconstruction of EEG extended sources. ADMM-ESINet leverages a structured sparsity constraint within a regularization framework and employs the Alternating Direction Method of Multipliers (ADMM) to achieve iterative solutions. By unrolling the ADMM algorithm into a cascaded network architecture, ADMM-ESINet effectively integrates prior knowledge, enabling end-to-end, real-time ESI. Crucially, both the regularization parameters and the spatial transform operator are learned directly from the training data. Numerical results demonstrate that ADMM-ESINet surpasses traditional DNN-based methods in generalization ability and accurately reconstructs the location, extent, and temporal dynamics of extended sources, establishing ADMM-ESINet as a promising method for real-time ESI.
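The ADMM iteration that such unrolling networks are built from can be sketched for the classic $\ell_1$-regularized least-squares case. This is a simplification: ADMM-ESINet uses a structured-sparsity regularizer and learns the regularization parameters and transform from data, whereas here `lam` and `rho` are fixed by hand:

```python
import numpy as np

def admm_lasso(A, y, lam=0.1, rho=1.0, n_iter=100):
    """ADMM iterations for min_x 0.5*||Ax - y||^2 + lam*||x||_1.

    Unrolling methods turn this loop into network layers (one layer per
    iteration), with lam/rho and the sparsifying transform learned end-to-end.
    """
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA = A.T @ A + rho * np.eye(n)   # system matrix for the x-update
    Aty = A.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(AtA, Aty + rho * (z - u))                # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0)  # soft-threshold
        u = u + x - z                                                # dual update
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))           # toy lead-field-like operator
x_true = np.zeros(20); x_true[[2, 7]] = [1.5, -2.0]
y = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = admm_lasso(A, y, lam=0.05)
assert np.argmax(np.abs(x_hat)) == 7  # dominant source location recovered
```

A cascaded network replaces each of these three updates with a learnable layer, which is what makes the reconstruction both real-time and data-adaptive.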

NeurIPS Conference 2025 Conference Paper

Beyond Node-Centric Modeling: Sketching Signed Networks with Simplicial Complexes

  • Wei Wu
  • Xuan Tan
  • Yan Peng
  • Ling Chen
  • Fangfang Li
  • Chuan Luo

Signed networks can reflect more complex connections through positive and negative edges, and cost-effective signed network sketching can significantly benefit an important link sign prediction task in the era of big data. Existing signed network embedding algorithms mainly learn node representation in the Graph Neural Network (GNN) framework with the balance theory. However, the node-wise representation learning methods either limit the representational power because they primarily rely on node pairwise relationship in the network, or suffer from severe efficiency issues. Recent research has explored simplicial complexes to capture higher-order interactions and integrated them into GNN frameworks. Motivated by that, we propose EdgeSketch+, a simple and effective edge embedding algorithm beyond traditional node-centric modeling that directly represents edges as low-dimensional vectors without transitioning from node embeddings. The proposed approach maintains a good balance between accuracy and efficiency by exploiting the Locality Sensitive Hashing (LSH) technique to swiftly capture the higher-order information derived from the simplicial complex without any learning process. Experiments show that EdgeSketch+ matches state-of-the-art accuracy while significantly reducing runtime, achieving speedups of up to $546.07\times$ compared to GNN-based methods.
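The LSH idea behind this kind of learning-free sketching can be illustrated with plain SimHash (cosine LSH) over hypothetical edge feature vectors; EdgeSketch+'s actual hash construction over simplicial-complex features may differ:

```python
import numpy as np

def simhash(vec, planes):
    """SimHash: the sign pattern of random projections. Vectors with high
    cosine similarity agree on most bits, so comparing short bit sketches
    approximates comparing the full feature vectors -- with no training."""
    return (planes @ vec) >= 0

def hamming_sim(a, b):
    """Fraction of agreeing bits between two sketches."""
    return float((a == b).mean())

rng = np.random.default_rng(42)
planes = rng.standard_normal((64, 100))    # 64-bit sketch of 100-dim features
e1 = rng.standard_normal(100)              # hypothetical edge feature vector
e2 = e1 + 0.05 * rng.standard_normal(100)  # near-duplicate edge
e3 = rng.standard_normal(100)              # unrelated edge
s1, s2, s3 = simhash(e1, planes), simhash(e2, planes), simhash(e3, planes)
assert hamming_sim(s1, s2) > hamming_sim(s1, s3)
```

Because hashing replaces any learning process, building all sketches is a single pass of matrix multiplications, which is the source of the reported speedups.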

ICLR Conference 2025 Conference Paper

BLEND: Behavior-guided Neural Population Dynamics Modeling via Privileged Knowledge Distillation

  • Zhengrui Guo
  • Fangxu Zhou
  • Wei Wu
  • Qichen Sun
  • Lishuang Feng
  • Jinzhuo Wang
  • Hao Chen 0011

Modeling the nonlinear dynamics of neuronal populations represents a key pursuit in computational neuroscience. Recent research has increasingly focused on jointly modeling neural activity and behavior to unravel their interconnections. Despite significant efforts, these approaches often necessitate either intricate model designs or oversimplified assumptions. Given the frequent absence of perfectly paired neural-behavioral datasets in real-world scenarios when deploying these models, a critical yet understudied research question emerges: how to develop a model that performs well using only neural activity as input at inference, while benefiting from the insights gained from behavioral signals during training? To this end, we propose **BLEND**, the **B**ehavior-guided neura**L** population dynamics mod**E**lling framework via privileged k**N**owledge **D**istillation. By considering behavior as privileged information, we train a teacher model that takes both behavior observations (privileged features) and neural activities (regular features) as inputs. A student model is then distilled using only neural activity. Unlike existing methods, our framework is model-agnostic and avoids making strong assumptions about the relationship between behavior and neural activity. This allows BLEND to enhance existing neural dynamics modeling architectures without developing specialized models from scratch. Extensive experiments across neural population activity modeling and transcriptomic neuron identity prediction tasks demonstrate strong capabilities of BLEND, reporting over 50% improvement in behavioral decoding and over 15% improvement in transcriptomic neuron identity prediction after behavior-guided distillation. Furthermore, we empirically explore various behavior-guided distillation strategies within the BLEND framework and present a comprehensive analysis of effectiveness and implications for model performance. 
Code will be made available at https://github.com/dddavid4real/BLEND.
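The privileged-distillation setup can be sketched with a toy linear model: a teacher that uses both neural and behavior features produces soft targets, and a student that sees only neural features is fitted to them. This is illustrative only, not BLEND's actual architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(p, q):
    """Average KL divergence between row-wise distributions."""
    return float((p * np.log((p + 1e-12) / (q + 1e-12))).sum(axis=1).mean())

rng = np.random.default_rng(0)
n, d_neural, d_behav, k = 200, 8, 4, 3
X = rng.standard_normal((n, d_neural))                        # neural activity (regular feature)
B = X[:, :d_behav] @ rng.standard_normal((d_behav, d_behav))  # behavior (privileged, correlated)
W_teacher = rng.standard_normal((d_neural + d_behav, k))
teacher_p = softmax(np.hstack([X, B]) @ W_teacher)            # teacher sees both inputs

# Student: softmax-linear model on neural activity alone, trained on the
# teacher's soft targets (cross-entropy to soft targets == KL up to a constant).
W_student = np.zeros((d_neural, k))
kl_before = mean_kl(teacher_p, softmax(X @ W_student))
for _ in range(300):
    p = softmax(X @ W_student)
    W_student -= 0.5 * X.T @ (p - teacher_p) / n   # gradient step on soft cross-entropy
kl_after = mean_kl(teacher_p, softmax(X @ W_student))
assert kl_after < kl_before  # student inherits behavior-informed structure
```

At inference the student needs only neural activity, which is exactly the deployment regime the paper targets.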

AAAI Conference 2025 Conference Paper

Causal Inference over Visual-Semantic-Aligned Graph for Image Classification

  • Lei Meng
  • Xiangxian Li
  • Xiaoshuo Yan
  • Haokai Ma
  • Zhuang Qi
  • Wei Wu
  • Xiangxu Meng

Incorporating tagging information to regularize the representation learning of images usually leads to improved performance in image classification by aligning the visual features with the textual ones of higher discriminative power. Existing methods typically follow the predictive approach, which uses tags as the semantic labels for visual input to make predictions. However, they typically face the problem of handling the heterogeneity between modalities. In order to learn accurate visual-semantic mapping, this paper presents a visual-semantic causal association modeling framework termed VSCNet. It aligns visual regions with tags, uses a pre-learned hierarchy of visual and semantic exemplars to refine tag predictions and constructs an augmented heterogeneous graph to perform causal intervention. Specifically, the fine-grained visual-semantic alignment (FVA) module adaptively locates the semantic-intensive regions corresponding to tags. The heterogeneous association refinement (HAR) module associates the visual regions, semantic elements and pre-learned visual prototypes in a heterogeneous graph to filter the error predictions and enrich the information. The causal inference with graphical masking (CIM) module applies self-learned masks to discover the causal nodes and edges in the heterogeneous graph to address the spurious association, forming robust causal representations. Experimental results from two benchmarking datasets show that VSCNet effectively builds the visual-semantic associations from images and leads to better performance than the state-of-the-art methods with enriched predictive information.

JBHI Journal 2025 Journal Article

DMSACNN: Deep Multiscale Attentional Convolutional Neural Network for EEG-Based Motor Decoding

  • Ke Liu
  • Xin Xing
  • Tao Yang
  • Zhuliang Yu
  • Bin Xiao
  • Guoyin Wang
  • Wei Wu

Objective: Accurate decoding of electroencephalogram (EEG) signals has become more significant for the brain-computer interface (BCI). Specifically, motor imagery and motor execution (MI/ME) tasks enable the control of external devices by decoding EEG signals during imagined or real movements. However, accurately decoding MI/ME signals remains a challenge due to the limited utilization of temporal information and ineffective feature selection methods. Methods: This paper introduces DMSACNN, an end-to-end deep multiscale attention convolutional neural network for MI/ME-EEG decoding. DMSACNN incorporates a deep multiscale temporal feature extraction module to capture temporal features at various levels. These features are then processed by a spatial convolutional module to extract spatial features. Finally, a local and global feature fusion attention module is utilized to combine local and global information and extract the most discriminative spatiotemporal features. Main results: DMSACNN achieves impressive accuracies of 78.20%, 96.34% and 70.90% for hold-out analysis on the BCI-IV-2a, High Gamma and OpenBMI datasets, respectively, outperforming most of the state-of-the-art methods. Conclusion and significance: These results highlight the potential of DMSACNN in robust BCI applications. Our proposed method provides a valuable solution to improve the accuracy of the MI/ME-EEG decoding, which can pave the way for more efficient and reliable BCI systems.

NeurIPS Conference 2025 Conference Paper

DynaAct: Large Language Model Reasoning with Dynamic Action Spaces

  • Xueliang Zhao
  • Wei Wu
  • Jian Guan
  • Qintong Li
  • Lingpeng Kong

In modern sequential decision-making systems, the construction of an optimal candidate action space is critical to efficient inference. However, existing approaches either rely on manually defined action spaces that lack scalability or utilize unstructured spaces that render exhaustive search computationally prohibitive. In this paper, we propose a novel framework named \textsc{DynaAct} for automatically constructing a compact action space to enhance sequential reasoning in complex problem-solving scenarios. Our method first estimates a proxy for the complete action space by extracting general sketches observed in a corpus covering diverse complex reasoning problems using large language models. We then formulate a submodular function that jointly evaluates candidate actions based on their utility to the current state and their diversity, and employ a greedy algorithm to select an optimal candidate set. Extensive experiments on six diverse standard benchmarks demonstrate that our approach significantly improves overall performance, while maintaining efficient inference without introducing substantial latency. The implementation is available at \url{https://github.com/zhaoxlpku/DynaAct}.
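The utility-plus-diversity selection can be sketched as greedy maximization of a monotone submodular objective. The facility-location diversity term below is an assumed concrete choice; the paper's actual submodular function may differ:

```python
import numpy as np

def greedy_select(utility, sim, k, lam=1.0):
    """Greedy maximization of a monotone submodular objective
        f(S) = sum_{a in S} utility[a] + lam * sum_j max_{a in S} sim[j, a],
    i.e. per-action utility plus a facility-location coverage/diversity term.
    Greedy selection enjoys the classic (1 - 1/e) approximation guarantee."""
    n = len(utility)
    selected, best_cov = [], np.zeros(n)
    for _ in range(k):
        gains = np.full(n, -np.inf)
        for a in range(n):
            if a in selected:
                continue
            new_cov = np.maximum(best_cov, sim[:, a])
            gains[a] = utility[a] + lam * (new_cov - best_cov).sum()
        a_star = int(np.argmax(gains))
        selected.append(a_star)
        best_cov = np.maximum(best_cov, sim[:, a_star])
    return selected

# Toy case: two near-duplicate high-utility actions plus one distinct action.
utility = np.array([1.0, 0.95, 0.6])
sim = np.array([[1.0, 0.98, 0.1],
                [0.98, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
assert greedy_select(utility, sim, k=2) == [0, 2]  # diversity beats the duplicate
```

The diminishing-returns structure is what lets a cheap greedy pass replace exhaustive search over candidate action sets.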

ICML Conference 2025 Conference Paper

Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

  • Xiang Hu
  • Zhihao Teng
  • Jun Zhao
  • Wei Wu
  • Kewei Tu

Despite the success of Transformers, handling longer contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention, which often requires post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 $\times$ the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-$k$ relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner, which adapts better to causal language models. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is $1000 \times$ the training length.
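The chunk-retrieval step can be sketched with mean-pooled chunk representations and dot-product scoring. This is a simplification: GCA learns the retriever end-to-end through the auto-regressive loss rather than using fixed pooling:

```python
import numpy as np

def topk_past_chunks(hidden, chunk, k):
    """For each chunk, score all strictly earlier chunks by the dot product
    of mean-pooled chunk representations and return the top-k indices."""
    n_chunks = hidden.shape[0] // chunk
    pooled = hidden[: n_chunks * chunk].reshape(n_chunks, chunk, -1).mean(axis=1)
    picks = []
    for i in range(n_chunks):
        if i == 0:
            picks.append([])           # first chunk has no past to retrieve
            continue
        scores = pooled[:i] @ pooled[i]
        order = np.argsort(scores)[::-1][: min(k, i)]
        picks.append(sorted(order.tolist()))
    return picks

rng = np.random.default_rng(3)
h = rng.standard_normal((8 * 4, 16))   # 8 chunks of 4 tokens, 16-dim states
h[6 * 4 : 7 * 4] = h[1 * 4 : 2 * 4]    # chunk 6 repeats chunk 1 (a "passkey")
picks = topk_past_chunks(h, chunk=4, k=2)
assert 1 in picks[6]  # the matching past chunk is retrieved
```

Because each chunk attends only to its own window plus `k` retrieved chunks, the attention cost stays constant regardless of how far back the relevant information lies.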

IJCAI Conference 2025 Conference Paper

Empowering Vision Transformers with Multi-Scale Causal Intervention for Long-Tailed Image Classification

  • Xiaoshuo Yan
  • Zhaochuan Li
  • Lei Meng
  • Zhuang Qi
  • Wei Wu
  • Zixuan Li
  • Xiangxu Meng

Causal inference has emerged as a promising approach to mitigate long-tail classification by handling the biases introduced by class imbalance. However, along with the change of advanced backbone models from Convolutional Neural Networks (CNNs) to Visual Transformers (ViT), existing causal models may not achieve an expected performance gain. This paper investigates the influence of existing causal models on CNNs and ViT variants, highlighting that ViT's global feature representation makes it hard for causal methods to model associations between fine-grained features and predictions, which leads to difficulties in classifying tail classes with similar visual appearance. To address these issues, this paper proposes TSCNet, a two-stage causal modeling method to discover fine-grained causal associations through multi-scale causal interventions. Specifically, in the hierarchical causal representation learning stage (HCRL), it decouples the background and objects, applying backdoor interventions at both the patch and feature level to prevent the model from using class-irrelevant areas to infer labels, which enhances fine-grained causal representation. In the counterfactual logits' bias calibration stage (CLBC), it refines the optimization of the model's decision boundary by adaptively constructing a counterfactual balanced data distribution to remove the spurious associations in the logits caused by data distribution. Extensive experiments conducted on various long-tail benchmarks demonstrate that the proposed TSCNet can eliminate multiple biases introduced by data imbalance, which outperforms existing methods.

NeurIPS Conference 2025 Conference Paper

Generalized and Invariant Single-Neuron In-Vivo Activity Representation Learning

  • Wei Wu
  • Yuxing Lu
  • Zhengrui Guo
  • Chi Zhang
  • Can Liao
  • Yifan Bu
  • Fangxu Zhou
  • Jinzhuo Wang

In computational neuroscience, models representing single-neuron in-vivo activity have become essential for understanding the functional identities of individual neurons. These models, such as implicit representation methods based on Transformer architectures, contrastive learning frameworks, and variational autoencoders, aim to capture the invariant and intrinsic computational features of single neurons. The learned single-neuron computational role representations should remain invariant across changing environments and are affected by their molecular expression and location. Thus, the representations allow for in vivo prediction of the molecular cell types and anatomical locations of single neurons, facilitating advanced closed-loop experimental designs. However, current models face the problem of limited generalizability. This is due to batch effects caused by differences in experimental design, animal subjects, and recording platforms. These confounding factors often lead to overfitting, reducing the robustness and practical utility of the models across various experimental scenarios. Previous studies have not rigorously evaluated how well the models generalize to new animals or stimulus conditions, creating a significant gap in the field. To solve this issue, we present a comprehensive experimental protocol that explicitly evaluates model performance on unseen animals and stimulus types. Additionally, we propose a model-agnostic adversarial training strategy. In this strategy, a discriminator network is used to eliminate batch-related information from the learned representations. The adversarial framework forces the representation model to focus on the intrinsic properties of neurons, thereby enhancing generalizability. Our approach is compatible with all major single-neuron representation models and significantly improves model robustness.
This work emphasizes the importance of generalization in single-neuron representation models and offers an effective solution, paving the way for the practical application of computational models in vivo. It also shows potential for building unified atlases based on single-neuron in vivo activity.

NeurIPS Conference 2025 Conference Paper

Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access

  • Xiang Hu
  • Jiaqi Leng
  • Jun Zhao
  • Kewei Tu
  • Wei Wu

A key advantage of Recurrent Neural Networks (RNNs) over Transformers is that their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose \textbf{H}ierarchical \textbf{S}parse \textbf{A}ttention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-$k$ chunks, and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.

NeurIPS Conference 2025 Conference Paper

KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment

  • Yuxing Lu
  • Wei Wu
  • Xukai Zhao
  • Rui Peng
  • Jinzhuo Wang

Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution, that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM-verified correctness and reducing conflict edges by 18.6% through multi-layer assessments.

ICLR Conference 2025 Conference Paper

Neuron Platonic Intrinsic Representation From Dynamics Using Contrastive Learning

  • Wei Wu
  • Can Liao
  • Zizhen Deng
  • Zhengrui Guo
  • Jinzhuo Wang

The Platonic Representation Hypothesis posits that behind different modalities of data (what we sense or detect), there exists a universal, modality-independent representation of reality. Inspired by this, we treat each neuron as a system, where we can detect the neuron’s multi-segment activity data under different peripheral conditions. We believe that, similar to the Platonic idea, there exists a time-invariant representation behind the different segments of the same neuron, which reflects the intrinsic properties of the neuron’s system. Intrinsic properties include the molecular profiles, brain regions and morphological structure, etc. The optimization objective for obtaining the intrinsic representation of neurons should satisfy two criteria: (I) segments from the same neuron should have a higher similarity than segments from different neurons; (II) the representations should generalize well to out-of-domain data. To achieve this, we employ contrastive learning, treating different segments from the same neuron as positive pairs and segments from different neurons as negative pairs. During the implementation, we chose VICReg, which uses only positive pairs for optimization but indirectly separates dissimilar samples via regularization terms. To validate the efficacy of our method, we first applied it to simulated neuron population dynamics data generated using the Izhikevich model. We successfully confirmed that our approach captures the type of each neuron as defined by preset hyperparameters. We then applied our method to two real-world neuron dynamics datasets, including spatial transcriptomics-derived neuron type annotations and the brain regions where each neuron is located. The learned representations from our model not only predict neuron type and location but also show robustness when tested on out-of-domain data (unseen animals).
This demonstrates the potential of our approach in advancing the understanding of neuronal systems and offers valuable insights for future neuroscience research.
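
The positive-pair objective described above can be sketched as a VICReg-style loss in plain NumPy; the loss weights, batch shapes, and hinge target below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style objective on two batches of segment embeddings
    (rows: neurons, columns: embedding dimensions)."""
    n, d = z_a.shape
    # Invariance: pull embeddings of segments from the same neuron together.
    sim = np.mean((z_a - z_b) ** 2)
    var = cov = 0.0
    for z in (z_a, z_b):
        # Variance: hinge keeps each dimension's std above 1 (anti-collapse).
        std = np.sqrt(z.var(axis=0) + eps)
        var += np.mean(np.maximum(0.0, 1.0 - std))
        # Covariance: penalize squared off-diagonal covariances (decorrelate).
        zc = z - z.mean(axis=0)
        c = (zc.T @ zc) / (n - 1)
        cov += (np.sum(c ** 2) - np.trace(c ** 2)) / d
    return sim_w * sim + var_w * var + cov_w * cov
```

Because only positive pairs enter the invariance term, the variance and covariance regularizers are what keep embeddings of different neurons from collapsing onto a single point.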

AAAI Conference 2025 Conference Paper

ProtCLIP: Function-Informed Protein Multi-Modal Learning

  • Hanjing Zhou
  • Mingze Yin
  • Wei Wu
  • Mingyang Li
  • Kun Fu
  • Jintai Chen
  • Jian Wu
  • Zheng Wang

Multi-modality pre-training paradigms that align protein sequences and biological descriptions have learned general protein representations and achieved promising performance in various downstream applications. However, these works have been unable to replicate the extraordinary success of language-supervised visual foundation models, due to the ineffective usage of aligned protein-text paired data and the lack of an effective function-informed pre-training paradigm. To address these issues, this paper curates a large-scale protein-text paired dataset called ProtAnno with a property-driven sampling strategy, and introduces a novel function-informed protein pre-training paradigm. Specifically, the sampling strategy determines the selection probability based on sample confidence and property coverage, balancing data quality and data quantity in the face of large-scale noisy data. Furthermore, motivated by the significance of protein-specific functional mechanisms, the proposed paradigm explicitly models static and dynamic protein functional segments with two segment-wise pre-training objectives, injecting fine-grained information in a function-informed manner. Leveraging all these innovations, we develop ProtCLIP, a multi-modality foundation model that comprehensively represents function-aware protein embeddings. On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average across five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP as a protein multi-modality foundation model.

IROS Conference 2025 Conference Paper

Real-time Whole-body Motion Planning Based on Optimized NMPC in Static and Dynamic Environments for Mobile Manipulator

  • Wei Wu
  • Ximeng Zhou
  • Fei Yan
  • Shouxing Zhang
  • Yan Zhuang
  • Guiyang Xin

Research on mobile manipulators has recently attracted increasing attention. Ensuring that mobile manipulators can meet obstacle avoidance constraints and efficiently accomplish assigned tasks in dynamic environments remains a significant challenge. To address this issue, this paper proposes an integrated framework for environment perception, real-time planning, and control optimization. Firstly, we develop a fusion map that combines a Euclidean signed distance field (ESDF) with clustered point clouds occupying cubes, enabling robots to perceive more precise environmental information in complex and changing conditions. Secondly, we introduce a novel rapid generation strategy for 6-DOF guide point sequences, which directs the mobile manipulator to follow the most efficient path to the target location while making real-time adjustments to avoid dynamic obstacles. Additionally, utilizing optimized nonlinear model predictive control (NMPC), we design a whole-body motion controller for the mobile manipulator to prevent the system from becoming trapped in local optima, thereby allowing the manipulator to promptly adjust its state while tracking guide points in complex indoor environments. Finally, the proposed algorithm was implemented on a mobile manipulator with an Ackermann base and tested through both simulations and real-world experiments.

NeurIPS Conference 2025 Conference Paper

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

  • Junfei Wu
  • Jian Guan
  • Kaituo Feng
  • Qiang Liu
  • Shu Wu
  • Liang Wang
  • Wei Wu
  • Tieniu Tan

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named Spark, consistently outperforms existing methods across diverse spatial reasoning benchmarks involving maze navigation, static spatial reasoning, video-based reasoning and multi-view-based reasoning tasks, with an average improvement of 11.5%. Ablation studies reveal the critical role of each training stage, with reflective rejection sampling particularly enhancing the model's self-correction capabilities and reasoning potential.

NeurIPS Conference 2025 Conference Paper

RoboScape: Physics-informed Embodied World Model

  • Yu Shang
  • Xin Zhang
  • Yinzhou Tang
  • Lei Jin
  • Chen Gao
  • Wei Wu
  • Yong Li

World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. Our code and demos are available at: https://github.com/tsinghua-fib-lab/RoboScape.

IJCAI Conference 2025 Conference Paper

Semantic-Space-Intervened Diffusive Alignment for Visual Classification

  • Zixuan Li
  • Lei Meng
  • Guoqing Chao
  • Wei Wu
  • Yimeng Yang
  • Xiaoshuo Yan
  • Zhuang Qi
  • Xiangxu Meng

Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features to mimic the distribution of textual features. However, they typically face difficulties in finding such a projection due to differences between the two modalities in both the distribution of class-wise samples and the range of their feature values. To address this issue, this paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA, which models a semantic space as a bridge in the visual-to-textual projection, considering that both types of features share the same class-level information in classification. More importantly, a bi-stage diffusion framework is developed to enable progressive alignment between the two modalities. Specifically, SeDA first employs a Diffusion-Controlled Semantic Learner to model the semantic feature space of visual features by constraining the interactive features of the diffusion model and the category centers of visual features. In the later stage of SeDA, the Diffusion-Controlled Semantic Translator focuses on learning the distribution of textual features from the semantic space. Meanwhile, the Progressive Feature Interaction Network introduces stepwise feature interactions at each alignment step, progressively integrating textual information into mapped features. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods across multiple scenarios.

AAAI Conference 2025 Conference Paper

Synergy of GFlowNet and Protein Language Model Makes a Diverse Antibody Designer

  • Mingze Yin
  • Hanjing Zhou
  • Yiheng Zhu
  • Jialu Wu
  • Wei Wu
  • Mingyang Li
  • Kun Fu
  • Zheng Wang

Antibodies defend our health by binding to antigens with high specificity and potency, primarily relying on the Complementarity-Determining Region (CDR). Yet, current experimental methods for discovering new antibody CDRs are heavily time-consuming. Computational design could alleviate this burden; in particular, protein language models have proven quite beneficial in many recent studies. However, most existing models solely focus on antibody potency and struggle to encapsulate the diverse range of plausible CDR candidates, limiting their effectiveness in real-world scenarios where binding is only one factor among the multitude of drug-forming criteria. In this paper, we introduce PG-AbD, a framework uniting Generative Flow Networks (GFlowNets) and pretrained Protein Language Models (PLMs) to generate highly potent, diverse and novel antibody candidates. We innovatively construct a Product of Experts (PoE) composed of the global-distribution-modeling PLM and the local-distribution-modeling Potts model to serve as the reward function of the GFlowNet. A joint training paradigm is introduced, where the PoE is trained by contrastive divergence with negative samples generated by the GFlowNet, and then guides the GFlowNet to sample diverse antibody candidates. We evaluate PG-AbD on extensive antibody design benchmarks. It significantly outperforms existing methods in diversity (13.5% on RabDab, 31.1% on SabDab) while maintaining optimal potency and novelty. Generated antibodies are also found to form stable, regular 3D structures with their corresponding antigens, demonstrating the great potential of PG-AbD to accelerate real-world antibody discovery.
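
The Product-of-Experts reward can be illustrated generically: two expert distributions over the same candidate set are combined by a normalized element-wise product. This is a sketch only; the paper's experts are a trained PLM and a Potts model, not the toy vectors here:

```python
import numpy as np

def product_of_experts(p_global, p_local):
    """Combine two expert distributions over the same candidates via a
    normalized element-wise product (generic PoE; illustrative only)."""
    scores = np.asarray(p_global, dtype=float) * np.asarray(p_local, dtype=float)
    return scores / scores.sum()
```

A candidate only scores highly when both experts assign it mass, which is the property a PoE reward exploits to balance global plausibility against local sequence statistics.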

NeurIPS Conference 2025 Conference Paper

Theoretical Benefit and Limitation of Diffusion Language Model

  • Guhao Feng
  • Yihan Geng
  • Jian Guan
  • Wei Wu
  • Liwei Wang
  • Di He

Diffusion language models have emerged as a new approach for text generation. By enabling the parallel sampling of multiple tokens in each diffusion step, they appear to offer a more efficient alternative to auto-regressive models. However, our observations show that current open-source diffusion language models require more sampling steps to achieve comparable accuracy on representative tasks, resulting in even higher inference costs than their auto-regressive counterparts. To investigate whether this is an inherent limitation, we conduct a rigorous theoretical analysis of a widely adopted variant: the Masked Diffusion Model (MDM). Surprisingly, our analysis reveals that the conclusion is highly sensitive to the choice of evaluation metric. Under mild conditions, we prove that when the target is near-optimal perplexity, MDMs can achieve this goal in a constant number of sampling steps, independent of sequence length. This result demonstrates that efficiency can, in principle, be attained without compromising generation quality. However, when targeting a low sequence error rate, which is important for assessing the "correctness" of a generated sequence such as a reasoning chain, we show that in the worst case the required sampling steps must scale linearly with sequence length, thereby eliminating the efficiency advantage. Our analysis establishes the first theoretical foundation for understanding the comparative strengths and limitations of MDMs, offering practical guidance on when to favor MDMs over auto-regressive models and vice versa.

IROS Conference 2025 Conference Paper

TIETracker: A CLIP-based RGB-T Tracking via Feature Interaction and Semantic Enhancement

  • Weidai Xia
  • Xingliang Mao
  • Wei Wu
  • Chengzhang Zhu
  • Fangfang Li

The goal of RGB-T tracking is to enhance accuracy and robustness by leveraging the complementary features of the RGB and TIR modalities in complex scenarios. Previous methods have overlooked the power of semantic features in extracting valuable information from different modalities and improving interactions across them. Moreover, using Bounding Boxes (BBox) for target initialization can cause issues like bounding box blurring and tracking drift when the target's appearance changes or it gets occluded. To address these challenges, we propose the CLIP-based RGB-T tracking algorithm TIETracker, which aims to exploit the complementary advantages of multimodality more effectively using textual information. Textual descriptions direct the backbone network to learn target representations across modalities and facilitate the interaction of multi-modal features. Additionally, in scenarios of occlusion and scale transformation that lead to missing or altered target features, textual information adaptively supplements the target representation. This approach also improves the response in the image region of the target, addressing issues with bounding box accuracy and tracking drift. Our extensive evaluation on three leading RGB-T tracking benchmarks demonstrates that TIETracker achieves competitive performance compared to state-of-the-art methods, effectively countering feature loss from changes in target appearance and occlusion.

NeurIPS Conference 2025 Conference Paper

Towards Doctor-Like Reasoning: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients

  • Yuxing Lu
  • Gecheng Fu
  • Wei Wu
  • Xukai Zhao
  • Sin Yee Goi
  • Jinzhuo Wang

Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases, a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor-like reasoning by integrating both explicit clinical knowledge and implicit case-based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags to queries and knowledge sources, together with a hybrid retrieval mechanism that draws on both relevant knowledge and similar patient cases. In addition, a Med-TextGrad module using multi-agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and the patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinement. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor-like medical reasoning systems.

NeurIPS Conference 2025 Conference Paper

Understanding Parametric and Contextual Knowledge Reconciliation within Large Language Models

  • Jun Zhao
  • Yongzhuo Yang
  • Xiang Hu
  • Jingqi Tong
  • Yi Lu
  • Wei Wu
  • Tao Gui
  • Qi Zhang

Retrieval-Augmented Generation (RAG) provides additional contextual knowledge to complement the parametric knowledge in Large Language Models (LLMs). These two sources of knowledge interweave to enhance the accuracy and timeliness of LLM responses. However, the internal mechanisms by which LLMs utilize this knowledge remain unclear. We propose modeling the forward propagation of knowledge as an entity flow, employing this framework to trace LLMs' internal behaviors when processing mixed-source knowledge. Linear probing utilizes a trainable linear classifier to detect specific attributes in hidden layers. However, once trained, a probe cannot adapt to dynamically specified entities. To address this challenge, we construct an entity-aware probe, which introduces special tokens to mark probing targets and employs a small trainable rank-8 LoRA update to process these special markers. We first verify this approach through an attribution experiment, demonstrating that it can accurately detect information about ad-hoc entities in complex hidden states. Next, we trace entity flows across layers to understand how LLMs reconcile conflicting knowledge internally. Our probing results reveal that contextual and parametric knowledge are routed between tokens through distinct sets of attention heads, supporting attention competition only within knowledge types. While conflicting knowledge maintains a residual presence across layers, aligned knowledge from multiple sources gradually accumulates, with the magnitude of this accumulation directly determining its influence on final outputs.
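
The rank-8 update mentioned for the entity-aware probe follows the standard LoRA form; a minimal sketch in NumPy, where the matrix sizes are assumptions chosen for illustration:

```python
import numpy as np

def lora_apply(W, A, B, alpha=1.0):
    """Additive low-rank adaptation of a frozen weight matrix:
    W' = W + alpha * A @ B, with A of shape (d_out, r) and B of
    shape (r, d_in), so the update has rank at most r."""
    return W + alpha * (A @ B)
```

Only A and B (here rank r = 8) would be trained, which is why such a probe stays small compared to the frozen base weights.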

NeurIPS Conference 2024 Conference Paper

AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback

  • Jian Guan
  • Wei Wu
  • Zujie Wen
  • Peng Xu
  • Hongning Wang
  • Minlie Huang

The notable success of large language models (LLMs) has sparked an upsurge in building language agents to complete various complex tasks. We present AMOR, an agent framework based on open-source LLMs, which reasons with external knowledge bases and adapts to specific domains through human supervision of the reasoning process. AMOR builds its reasoning logic over a finite state machine (FSM) that solves problems through autonomous executions and transitions over disentangled modules. This allows humans to provide direct feedback to the individual modules, and thus naturally forms process supervision. Based on this reasoning and feedback framework, we develop AMOR through two-stage fine-tuning: warm-up and adaptation. The former fine-tunes the LLM with examples automatically constructed from various public datasets, enabling AMOR to generalize across different knowledge environments, while the latter tailors AMOR to specific domains using process feedback. Extensive experiments across multiple domains demonstrate the advantage of AMOR over strong baselines, thanks to its FSM-based reasoning and process feedback mechanism. The code and data are publicly available at https://github.com/JianGuanTHU/AMOR.

ICLR Conference 2024 Conference Paper

Augmenting Transformers with Recursively Composed Multi-grained Representations

  • Xiang Hu
  • Qingyang Zhu
  • Kewei Tu
  • Wei Wu

We present ReCAT, a recursive composition augmented Transformer that is able to explicitly model hierarchical syntactic structures of raw texts without relying on gold trees during both learning and inference. Existing research along this line restricts data to follow a hierarchical tree structure and thus lacks inter-span communications. To overcome the problem, we propose a novel contextual inside-outside (CIO) layer that learns contextualized representations of spans through bottom-up and top-down passes, where a bottom-up pass forms representations of high-level spans by composing low-level spans, while a top-down pass combines information inside and outside a span. By stacking several CIO layers between the embedding layer and the attention layers in Transformer, the ReCAT model can perform both deep intra-span and deep inter-span interactions, and thus generate multi-grained representations fully contextualized with other spans. Moreover, the CIO layers can be jointly pre-trained with Transformers, making ReCAT enjoy scaling ability, strong performance, and interpretability at the same time. We conduct experiments on various sentence-level and span-level tasks. Evaluation results indicate that ReCAT can significantly outperform vanilla Transformer models on all span-level tasks and recursive models on natural language inference tasks. More interestingly, the hierarchical structures induced by ReCAT exhibit strong consistency with human-annotated syntactic trees, indicating good interpretability brought by the CIO layers.

NeurIPS Conference 2024 Conference Paper

Bridge-IF: Learning Inverse Protein Folding with Markov Bridges

  • Yiheng Zhu
  • Jialu Wu
  • Qiuyi Li
  • Jiahuan Yan
  • Mingze Yin
  • Wei Wu
  • Mingyang Li
  • Jieping Ye

Inverse protein folding is a fundamental task in computational protein design, which aims to design protein sequences that fold into the desired backbone structures. While the development of machine learning algorithms for this task has seen significant success, the prevailing approaches, which predominantly employ a discriminative formulation, frequently encounter the error accumulation issue and often fail to capture the extensive variety of plausible sequences. To fill these gaps, we propose Bridge-IF, a generative diffusion bridge model for inverse folding, which is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we harness an expressive structure encoder to propose a discrete, informative prior derived from structures, and establish a Markov bridge to connect this prior with native sequences. During the inference stage, Bridge-IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, thereby significantly enhancing generation performance while maintaining parameter-efficient training. Extensive experiments on well-established benchmarks demonstrate that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability. The code is available at https://github.com/violet-sto/Bridge-IF.

NeurIPS Conference 2024 Conference Paper

LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

  • Wei Wu
  • Kecheng Zheng
  • Shuailei Ma
  • Fan Lu
  • Yuxin Guo
  • Yifei Zhang
  • Wei Chen
  • Qingpei Guo

In this work, we empirically confirm that the key reason causing such an issue is that training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. Towards this problem, our initial attempt is to relabel the data with long captions; however, directly learning from these may lead to performance degradation in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short text understanding while greatly enhancing its capability for long text understanding. We further look into whether the model can continuously benefit from longer captions and notice a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset, which consists of 100M long-caption-oriented text-image pairs. Our method achieves superior performance in long-text-image retrieval tasks. The project page is available at https://wuw2019.github.io/lot-lip.

JBHI Journal 2024 Journal Article

MSVTNet: Multi-Scale Vision Transformer Neural Network for EEG-Based Motor Imagery Decoding

  • Ke Liu
  • Tao Yang
  • Zhuliang Yu
  • Weibo Yi
  • Hong Yu
  • Guoyin Wang
  • Wei Wu

Objective: Transformer-based neural networks have been applied to electroencephalography (EEG) decoding for motor imagery (MI). However, most networks focus on applying the self-attention mechanism to extract global temporal information, while the cross-frequency coupling features between different frequencies have been neglected. Additionally, effectively integrating different neural networks poses challenges for the advanced design of decoding algorithms. Methods: This study proposes a novel end-to-end Multi-Scale Vision Transformer Neural Network (MSVTNet) for MI-EEG classification. MSVTNet first extracts local spatio-temporal features at different filtered scales through convolutional neural networks (CNNs). Then, these features are concatenated along the feature dimension to form local multi-scale spatio-temporal feature tokens. Finally, Transformers are utilized to capture cross-scale interaction information and global temporal correlations, providing more distinguishable feature embeddings for classification. Moreover, auxiliary branch loss is leveraged for intermediate supervision to ensure the effective integration of CNNs and Transformers. Results: The performance of MSVTNet was assessed through subject-dependent (session-dependent and session-independent) and subject-independent experiments on three MI datasets, i.e., the BCI Competition IV 2a, 2b and OpenBMI datasets. The experimental results demonstrate that MSVTNet achieves state-of-the-art performance in all analyses. Conclusion: MSVTNet shows superiority and robustness in enhancing MI decoding performance.

JBHI Journal 2024 Journal Article

Sleep Stage Classification Via Multi-View Based Self-Supervised Contrastive Learning of EEG

  • Chen Zhao
  • Wei Wu
  • Haoyi Zhang
  • Ruiyan Zhang
  • Xinyue Zheng
  • Xiangzeng Kong

Self-supervised learning (SSL) is a challenging task in sleep stage classification (SSC) that is capable of mining valuable representations from unlabeled data. However, traditional SSL methods typically focus on single-view learning and do not fully exploit the interactions among information across multiple views. In this study, we focused on a multi-domain view of the same EEG signal and developed a self-supervised multi-view representation learning framework via time series and time–frequency contrasting (MV-TTFC). In the MV-TTFC framework, we built in a cross-domain view contrastive learning prediction task to establish connections between the temporal view and the time–frequency (TF) view, thereby enhancing the information exchange between multiple views. In addition, to improve the quality of the TF view inputs, we introduced an enhanced multisynchrosqueezing transform, which can create high-energy-concentration TF image views to compensate for the inaccurate representations of traditional TF processing techniques. Finally, integrating temporal, TF, and fusion space contrastive learning effectively captured the latent features in EEG signals. We evaluated MV-TTFC on two real-world SSC datasets (SleepEDF-78 and SHHS) and compared it with baseline methods in downstream tasks. Our method exhibited state-of-the-art performance, achieving accuracies of 78.64% and 81.45% on SleepEDF-78 and SHHS, respectively, and macro F1-scores of 70.39% on SleepEDF-78 and 70.47% on SHHS.

NeurIPS Conference 2024 Conference Paper

SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction

  • Wei Wu
  • Xiaoxin Feng
  • Ziyan Gao
  • Yuheng Kan

Data-driven autonomous driving motion generation tasks are frequently impacted by the limitations of dataset size and the domain gap between datasets, which precludes their extensive application in real-world scenarios. To address this issue, we introduce SMART, a novel autonomous driving motion generation paradigm that models vectorized map and agent trajectory data as discrete sequence tokens. These tokens are then processed through a decoder-only transformer architecture trained on the next-token prediction task across spatial-temporal series. This GPT-style method allows the model to learn the motion distribution of real driving scenarios. SMART achieves state-of-the-art performance across most of the metrics on the generative Sim Agents challenge, ranking 1st on the leaderboard of the Waymo Open Motion Dataset (WOMD) while demonstrating remarkable inference speed. Moreover, SMART, as a generative model in the autonomous driving motion domain, exhibits zero-shot generalization capabilities: using only the NuPlan dataset for training and WOMD for validation, SMART achieved a competitive score of 0.72 on the Sim Agents challenge. Lastly, we have collected over 1 billion motion tokens from multiple datasets, validating the model's scalability. These results suggest that SMART has initially demonstrated two important properties, scalability and zero-shot generalization, and preliminarily meets the needs of large-scale real-time simulation applications. We have released all the code to promote the exploration of models for motion generation in the autonomous driving field. The source code is available at https://github.com/rainmaker22/SMART.
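
Mapping continuous trajectories to a discrete vocabulary for next-token prediction can be sketched as per-step displacement binning; the bin layout and token encoding below are assumptions for illustration, not SMART's actual motion vocabulary:

```python
import numpy as np

def tokenize_motion(xy, bins):
    """Map per-step (dx, dy) displacements of a trajectory to discrete
    token ids by binning each axis independently and pairing the bins."""
    deltas = np.diff(np.asarray(xy, dtype=float), axis=0)
    ix = np.digitize(deltas[:, 0], bins)  # bin index of dx per step
    iy = np.digitize(deltas[:, 1], bins)  # bin index of dy per step
    # One id per timestep; vocabulary size is (len(bins) + 1) ** 2.
    return ix * (len(bins) + 1) + iy
```

Once motion is tokenized this way, a decoder-only transformer can be trained on the resulting sequences exactly as a language model is.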

AAAI Conference 2024 Conference Paper

SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection

  • Xin Jin
  • Kai Liu
  • Cong Ma
  • Ruining Yang
  • Fei Hui
  • Wei Wu

Lidar-based 3D detection is one of the significant components of autonomous driving. However, current methods over-focus on improving the performance of 3D Lidar perception, which causes network architectures to become complicated and hard to deploy, making the methods difficult to apply in autonomous driving for real-time processing. In this paper, we propose a high-efficiency network, SwiftPillars, which includes a Swift Pillar Encoder (SPE) and a Multi-scale Aggregation Decoder (MAD). The SPE is constructed from a concise Dual-attention Module with lightweight operators. The Dual-attention Module utilizes feature pooling, matrix multiplication, etc., to speed up point-wise and channel-wise attention extraction and fusion. The MAD interconnects multiple scale features extracted by the SPE with minimal computational cost to improve performance. In our experiments, our proposal achieves 61.3% NDS and 53.2% mAP on the nuScenes dataset. In addition, we evaluate inference time on several platforms (P4, T4, A2, MLU370, RTX3080), where SwiftPillars achieves up to 13.3 ms (75 FPS) on an NVIDIA Tesla T4. Compared with PointPillars, SwiftPillars is on average 26.58% faster in inference speed with equivalent GPUs and achieves a higher mAP of approximately 3.2% on the nuScenes dataset.

NeurIPS Conference 2024 Conference Paper

Tackling Uncertain Correspondences for Multi-Modal Entity Alignment

  • Liyi Chen
  • Ying Sun
  • Shengzhe Zhang
  • Yuyang Ye
  • Wei Wu
  • Hui Xiong

Recently, multi-modal entity alignment has emerged as a pivotal endeavor for the integration of Multi-Modal Knowledge Graphs (MMKGs) originating from diverse data sources. Existing works primarily focus on fully depicting entity features by designing various modality encoders or fusion approaches. However, uncertain correspondences between inter-modal or intra-modal cues, such as weak inter-modal associations, description diversity, and modality absence, still severely hinder the effective exploration of aligned entity similarities. To this end, in this paper, we propose a novel Tackling uncertain correspondences method for Multi-modal Entity Alignment (TMEA). Specifically, to handle diverse attribute knowledge descriptions, we design alignment-augmented abstract representation that incorporates the large language model and in-context learning into attribute alignment and filtering for generating and embedding the attribute abstract. In order to mitigate the influence of the modality absence, we propose to unify all modality features into a shared latent subspace and generate pseudo features via variational autoencoders according to existing modal features. Then, we develop an inter-modal commonality enhancement mechanism based on cross-attention with orthogonal constraints, to address weak semantic associations between modalities. Extensive experiments on two real-world datasets validate the effectiveness of TMEA with a clear improvement over competitive baselines.

ICRA Conference 2023 Conference Paper

A Hybrid Quadratic Programming Framework for Real-Time Embedded Safety-Critical Control

  • Ryan M. Bena
  • Sushmit Hossain
  • Buyun Chen
  • Wei Wu
  • Quan Nguyen 0004

We present a new framework for implementing real-time embedded safety-critical controllers which utilizes hybrid computing to address the issue of limited computational resources, a problem that is particularly prevalent in microrobotics. In our approach, the nominal stabilizing control algorithm is implemented digitally while the safety-critical quadratic program is solved via a dedicated analog resistor array. We apply this hybrid computing architecture to a simulated collision avoidance task for a micro-aerial vehicle and show the benefit relative to a purely-digital implementation. By leveraging analog quadratic programming on the Crazyflie 2.1 micro quadrotor, a reduction in overall processing time from 8.9 ms to 0.6 ms is estimated for this computationally-limited system. We further display the viability of our proposed safety-critical control framework through real-time flight demonstrations, utilizing a novel prototype analog circuit tethered to the Crazyflie. The flight results confirm the functionality of the control structure and prototype circuit while highlighting the overall capabilities of hybrid computing.
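In its simplest form, a safety-critical quadratic program of the kind described above minimally modifies a nominal control input subject to a linear safety constraint; with a single active constraint the solution has a closed form. The sketch below is a minimal illustration of that general idea only (the function name and interface are hypothetical), not the paper's analog-circuit implementation:

```python
def safety_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a.u >= b (one safety constraint).

    Closed form: if the nominal input already satisfies the constraint,
    keep it; otherwise project u_nom onto the boundary hyperplane a.u = b.
    """
    dot = sum(ai * ui for ai, ui in zip(a, u_nom))
    if dot >= b:
        return list(u_nom)  # nominal input is already safe
    norm2 = sum(ai * ai for ai in a)
    lam = (b - dot) / norm2  # multiplier of the active constraint
    return [ui + lam * ai for ui, ai in zip(u_nom, a)]

# A nominal input violating the constraint gets minimally corrected.
u_safe = safety_filter([1.0, 0.0], a=[0.0, 1.0], b=0.5)
```

Real safety filters stack many such constraints into a full QP; the single-constraint case is what makes a compact analog realization plausible.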

IJCAI Conference 2023 Conference Paper

Intent-aware Recommendation via Disentangled Graph Contrastive Learning

  • Yuling Wang
  • Xiao Wang
  • Xiangzhou Huang
  • Yanhua Yu
  • Haoyang Li
  • Mengdi Zhang
  • Zirui Guo
  • Wei Wu

Graph neural network (GNN) based recommender systems have become one of the mainstream trends due to their powerful ability to learn from user behavior data. Understanding user intents from behavior data is the key to recommender systems, which poses two basic requirements for GNN-based recommender systems. One is how to learn complex and diverse intents, especially when user behavior data is usually inadequate in reality. The other is that different behaviors have different intent distributions, so how to establish the relations between them for a more explainable recommender system. In this paper, we present Intent-aware Recommendation via Disentangled Graph Contrastive Learning (IDCL), which simultaneously learns interpretable intents and behavior distributions over those intents. Specifically, we first model the user behavior data as a user-item-concept graph, and design a GNN-based behavior disentangling module to learn the different intents. Then we propose intent-wise contrastive learning to enhance the intent disentangling and meanwhile infer the behavior distributions. Finally, coding rate reduction regularization is introduced to make the behaviors of different intents orthogonal. Extensive experiments demonstrate the effectiveness of IDCL in terms of both substantial improvement and interpretability.

IJCAI Conference 2023 Conference Paper

Local and Global: Temporal Question Answering via Information Fusion

  • Yonghao Liu
  • Di Liang
  • Mengyu Li
  • Fausto Giunchiglia
  • Ximing Li
  • Sirui Wang
  • Wei Wu
  • Lan Huang

Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They neither emphasize the graph structural information between entities in KGs nor explicitly utilize a multi-hop relation path through graph neural networks to enhance answer prediction. (II) They adopt pre-trained language models (LMs) to obtain question representations, focusing merely on the global information related to the question while not highlighting the local information of the entities in KGs. To address these limitations, we introduce a novel model that simultaneously explores both Local information and Global information for the task of temporal KGQA (LGQA). Specifically, we first introduce an auxiliary task in the temporal KG embedding procedure to make timestamp embeddings time-order aware. Then, we design information fusion layers that effectively incorporate local and global information to deepen question understanding. We conduct extensive experiments on two benchmarks, and LGQA significantly outperforms previous state-of-the-art models, especially in difficult questions. Moreover, LGQA can generate interpretable and trustworthy predictions.

IJCAI Conference 2022 Conference Paper

Ensemble Multi-Relational Graph Neural Networks

  • Yuling Wang
  • Hao Xu
  • Yanhua Yu
  • Mengdi Zhang
  • Zhenhao Li
  • Yuji Yang
  • Wei Wu

It is well established that graph neural networks (GNNs) can be interpreted and designed from the perspective of an optimization objective. With this clear optimization objective, the deduced GNN architecture has a sound theoretical foundation, which is able to flexibly remedy the weaknesses of GNNs. However, this optimization objective has only been proved for GNNs on single-relational graphs. Can we infer a new type of GNN for multi-relational graphs by extending this optimization objective, so as to simultaneously solve the issues in previous multi-relational GNNs, e.g., over-parameterization? In this paper, we propose novel ensemble multi-relational GNNs by designing an ensemble multi-relational (EMR) optimization objective. This EMR optimization objective is able to derive an iterative updating rule, which can be formalized as an ensemble message passing (EnMP) layer with multiple relations. We further analyze the nice properties of the EnMP layer, e.g., its relationship with multi-relational personalized PageRank. Finally, new multi-relational GNNs that alleviate the over-smoothing and over-parameterization issues are proposed. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness of the proposed model.

NeurIPS Conference 2022 Conference Paper

Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models

  • Biru Zhu
  • Yujia Qin
  • Ganqu Cui
  • Yangyi Chen
  • Weilin Zhao
  • Chong Fu
  • Yangdong Deng
  • Zhiyuan Liu

Despite the great success of pre-trained language models (PLMs) in a large set of natural language processing (NLP) tasks, there has been a growing concern about their security in real-world applications. Backdoor attack, which poisons a small number of training samples by inserting backdoor triggers, is a typical threat to security. Trained on the poisoned dataset, a victim model would perform normally on benign samples but predict the attacker-chosen label on samples containing pre-defined triggers. The vulnerability of PLMs under backdoor attacks has been proved with increasing evidence in the literature. In this paper, we present several simple yet effective training strategies that could effectively defend against such attacks. To the best of our knowledge, this is the first work to explore the possibility of backdoor-free adaptation for PLMs. Our motivation is based on the observation that, when trained on the poisoned dataset, the PLM's adaptation follows a strict order of two stages: (1) a moderate-fitting stage, where the model mainly learns the major features corresponding to the original task instead of subsidiary features of backdoor triggers, and (2) an overfitting stage, where both features are learned adequately. Therefore, if we could properly restrict the PLM's adaptation to the moderate-fitting stage, the model would neglect the backdoor triggers but still achieve satisfying performance on the original task. To this end, we design three methods to defend against backdoor attacks by reducing the model capacity, training epochs, and learning rate, respectively. Experimental results demonstrate the effectiveness of our methods in defending against several representative NLP backdoor attacks. We also perform visualization-based analysis to attain a deeper understanding of how the model learns different features, and explore the effect of the poisoning ratio. 
Finally, we explore whether our methods could defend against backdoor attacks for pre-trained CV models. The code is publicly available at https://github.com/thunlp/Moderate-fitting.

IJCAI Conference 2022 Conference Paper

Searching for Optimal Subword Tokenization in Cross-domain NER

  • Ruotian Ma
  • Yiding Tan
  • Xin Zhou
  • Xuanting Chen
  • Di Liang
  • Sirui Wang
  • Wei Wu
  • Tao Gui

Input distribution shift is one of the vital problems in unsupervised domain adaptation (UDA). The most popular UDA approaches focus on domain-invariant representation learning (DIRL), trying to align the features from different domains into a similar feature distribution. However, these approaches ignore the direct alignment of input word distributions between domains, which is a vital factor in word-level classification tasks such as cross-domain NER. In this work, we shed new light on cross-domain NER by introducing a subword-level solution, X-Piece, for input word-level distribution shift in NER. Specifically, we re-tokenize the input words of the source domain to approach the target subword distribution, which is formulated and solved as an optimal transport problem. As this approach focuses on the input level, it can also be combined with previous DIRL methods for further improvement. Experimental results show the effectiveness of the proposed method based on BERT-tagger on four benchmark NER datasets. Also, the proposed method is shown to benefit DIRL methods such as DANN.
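The optimal-transport view can be illustrated with a tiny entropic-regularized transport solver: given a source word distribution, a target subword distribution, and a cost matrix, Sinkhorn scaling produces a transport plan whose marginals match the two distributions. This is a generic sketch of the OT machinery only; X-Piece's actual formulation and solver differ in detail:

```python
import math

def sinkhorn(src, tgt, cost, reg=0.1, iters=200):
    """Entropic-regularized transport plan between two distributions.

    Alternately rescales rows and columns of the Gibbs kernel
    K = exp(-cost/reg) until the plan's marginals approach `src`
    (row sums) and `tgt` (column sums).
    """
    n, m = len(src), len(tgt)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [src[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [tgt[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Mass flows along the cheap (diagonal) cells of the cost matrix.
plan = sinkhorn([0.5, 0.5], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
```

In a re-tokenization setting, the plan would say how much of each source word's mass to route to each candidate subword sequence.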

IJCAI Conference 2021 Conference Paper

A Survey on Response Selection for Retrieval-based Dialogues

  • Chongyang Tao
  • Jiazhan Feng
  • Rui Yan
  • Wei Wu
  • Daxin Jiang

Building an intelligent dialogue system capable of naturally and coherently conversing with humans has been a long-standing goal of artificial intelligence. In the past decade, with the development of machine/deep learning technology and the explosive growth of available conversation data in social media, numerous neural models have been developed for context-response matching tasks in retrieval-based dialogue systems, with more fluent and informative responses compared with generative models. This paper presents a comprehensive survey of recent advances in response selection for retrieval-based dialogues. In particular, we first formulate the problem of response selection and review state-of-the-art context-response matching models categorized by their architecture. Then we summarize some recent advances on the research of response selection, including incorporation with extra knowledge and exploration on more effective model learning. Finally, we highlight the challenges which are not yet well addressed in this task and present future research directions.

AAAI Conference 2021 Conference Paper

BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation

  • Haisheng Su
  • Weihao Gan
  • Wei Wu
  • Yu Qiao
  • Junjie Yan

Generating human action proposals in untrimmed videos is an important yet challenging task with wide applications. Current methods often suffer from noisy boundary locations and the inferior quality of the confidence scores used for proposal retrieval. In this paper, we present BSN++, a new framework which exploits a complementary boundary regressor and relation modeling for temporal proposal generation. First, we propose a novel boundary regressor based on the complementary characteristics of both starting and ending boundary classifiers. Specifically, we utilize a U-shaped architecture with nested skip connections to capture rich contexts and introduce a bi-directional boundary matching mechanism to improve boundary precision. Second, to account for the proposal-proposal relations ignored in previous methods, we devise a proposal relation block which includes two self-attention modules, from the aspects of position and channel. Furthermore, we find that there inevitably exist data imbalance problems in the positive/negative proposals and temporal durations, which harm the model performance on tail distributions. To relieve this issue, we introduce a scale-balanced re-sampling strategy. Extensive experiments are conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS14, which demonstrate that BSN++ achieves state-of-the-art performance. Not surprisingly, the proposed BSN++ ranked 1st in the CVPR19 ActivityNet challenge leaderboard on the temporal action localization task.

TIST Journal 2021 Journal Article

Conditional Text Generation for Harmonious Human-Machine Interaction

  • Bin Guo
  • Hao Wang
  • Yasan Ding
  • Wei Wu
  • Shaoyang Hao
  • Yueqi Sun
  • Zhiwen Yu

In recent years, with the development of deep learning, text-generation technology has undergone great changes and now provides many kinds of services for human beings, such as restaurant reservation and daily communication. Automatically generated text is becoming more and more fluent, so researchers have begun to consider more anthropomorphic text-generation technology, that is, conditional text generation, including emotional text generation, personalized text generation, and so on. Conditional Text Generation (CTG) has thus become a research hotspot. As a promising research field, CTG has attracted considerable attention. We therefore aim to give a comprehensive review of the new research trends of CTG. We first summarize several key techniques and illustrate the technical evolution route in the field of neural text generation, based on the concept model of CTG. We further investigate existing CTG fields and propose several general learning models for CTG. Finally, we discuss the open issues and promising research directions of CTG.

AAAI Conference 2021 Conference Paper

Context-Aware Graph Convolution Network for Target Re-identification

  • Deyi Ji
  • Haoran Wang
  • Hanzhe Hu
  • Weihao Gan
  • Wei Wu
  • Junjie Yan

Most existing re-identification methods focus on learning robust and discriminative features with deep convolution networks. However, many of them consider content similarity separately and fail to utilize the context information of the query and gallery sets, e.g., probe-gallery and gallery-gallery relations, so hard samples may not be well handled due to the limited or even misleading information. In this paper, we present a novel Context-Aware Graph Convolution Network (CAGCN), where the probe-gallery relations are encoded into the graph nodes and the graph edge connections are well controlled by the gallery-gallery relations. In this way, hard samples can be addressed with the context information flowing among other easy samples during graph reasoning. Specifically, we adopt an effective hard gallery sampler to obtain high recall for positive samples while keeping a reasonable graph size, which can also mitigate the imbalance problem in the training process with low computation complexity. Experiments show that the proposed method achieves state-of-the-art performance on both person and vehicle re-identification datasets in a plug-and-play fashion with limited overhead.

AAAI Conference 2021 Conference Paper

Correlation-Aware Heuristic Search for Intelligent Virtual Machine Provisioning in Cloud Systems

  • Chuan Luo
  • Bo Qiao
  • Wenqian Xing
  • Xin Chen
  • Pu Zhao
  • Chao Du
  • Randolph Yao
  • Hongyu Zhang

The optimization of resources is crucial for the operation of public cloud systems such as Microsoft Azure, as well as servers dedicated to the workloads of large customers such as Microsoft 365. These optimization tasks often need to take unknown parameters into consideration and can be formulated as Prediction+Optimization problems. This paper proposes a new Prediction+Optimization method named Correlation-Aware Heuristic Search (CAHS) that is capable of accounting for the uncertainty in unknown parameters and delivering effective solutions to difficult optimization problems. We apply this method to solving the predictive virtual machine (VM) provisioning (PreVMP) problem, where the VM provisioning plans are optimized based on the predicted demands of different VM types, to ensure rapid provisioning upon customers’ requests and to pursue high resource utilization. Unlike the current state-of-the-art PreVMP approaches that assume independence among the demands for different VM types, CAHS incorporates demand correlation when conducting prediction and optimization in a novel and effective way. Our experiments on two public benchmarks and one industrial benchmark demonstrate that CAHS can achieve better performance than its nine state-of-the-art competitors. CAHS has been successfully deployed in Microsoft Azure and significantly improved its performance. The main ideas of CAHS have also been leveraged to improve the efficiency and the reliability of the cloud services provided by Microsoft 365.

AAAI Conference 2021 Conference Paper

Empowering Conversational AI is a Trip to Mars: Progress and Future of Open Domain Human-Computer Dialogues

  • Rui Yan
  • Wei Wu

Dialogue systems powered by conversational artificial intelligence (AI) have never been so popular. Interacting with computers through language offers a more natural interface for giving orders and acquiring information---just like human communication. Due to their promising potential as virtual assistants and/or social bots, major NLP, AI and even Search & Mining communities are explicitly calling out for contributions of conversational studies. Learning towards real conversational intelligence is a trip to Mars; perhaps we are yet on Earth. We have achieved substantial progress from recent research outputs, but major obstacles remain to be overcome. In this paper, we present an overview of progress and look forward to future trends so as to shed light on possible directions towards success.

AAAI Conference 2021 Conference Paper

Explaining A Black-box By Using A Deep Variational Information Bottleneck Approach

  • Seojin Bang
  • Pengtao Xie
  • Heewook Lee
  • Wei Wu
  • Eric Xing

Interpretable machine learning has gained much attention recently. Briefness and comprehensiveness are necessary in order to provide a large amount of information concisely when explaining a black-box decision system. However, existing interpretable machine learning methods fail to consider briefness and comprehensiveness simultaneously, leading to redundant explanations. We propose the variational information bottleneck for interpretation, VIBI, a system-agnostic interpretable method that provides a brief but comprehensive explanation. VIBI adopts an information-theoretic principle, the information bottleneck principle, as a criterion for finding such explanations. For each instance, VIBI selects key features that are maximally compressed about an input (briefness), and informative about the decision made by a black-box system on that input (comprehensiveness). We evaluate VIBI on three datasets and compare with state-of-the-art interpretable machine learning methods in terms of both interpretability and fidelity evaluated by humans and quantitative metrics.

AAAI Conference 2021 Conference Paper

Open Domain Dialogue Generation with Latent Images

  • Ze Yang
  • Wei Wu
  • Huang Hu
  • Can Xu
  • Wei Wang
  • Zhoujun Li

We consider grounding open domain dialogues with images. Existing work assumes that both an image and a textual context are available, but image-grounded dialogues by nature are more difficult to obtain than textual dialogues. Thus, we propose learning a response generation model with both image-grounded dialogues and textual dialogues by assuming that the visual scene information at the time of a conversation can be represented by an image, and trying to recover the latent images of the textual dialogues through text-to-image generation techniques. The likelihood of the two types of dialogues is then formulated by a response generator and an image reconstructor that are learned within a conditional variational auto-encoding framework. Empirical studies are conducted in both image-grounded conversation and text-based conversation. In the first scenario, image-grounded dialogues, especially under a low-resource setting, can be effectively augmented by textual dialogues with latent images; while in the second scenario, latent images can enrich the content of responses and at the same time keep them relevant to contexts.

AAAI Conference 2021 Conference Paper

PULNS: Positive-Unlabeled Learning with Effective Negative Sample Selector

  • Chuan Luo
  • Pu Zhao
  • Chen Chen
  • Bo Qiao
  • Chao Du
  • Hongyu Zhang
  • Wei Wu
  • Shaowei Cai

Positive-unlabeled learning (PU learning) is an important case of binary classification where the training data only contains positive and unlabeled samples. The current state-of-the-art approach for PU learning is the cost-sensitive approach, which casts PU learning as a cost-sensitive classification problem and relies on an unbiased risk estimator to correct the bias introduced by the unlabeled samples. However, this approach requires knowledge of the class prior and is subject to potential label noise. In this paper, we propose a novel PU learning approach dubbed PULNS, equipped with an effective negative sample selector, which is optimized by reinforcement learning. Our PULNS approach employs an effective negative sample selector as the agent responsible for selecting negative samples from the unlabeled data. While the selected, likely negative samples can be used to improve the classifier, the performance of the classifier is also used as the reward to improve the selector through the REINFORCE algorithm. By alternating the updates of the selector and the classifier, the performance of both is improved. Extensive experimental studies on 7 real-world application benchmarks demonstrate that PULNS consistently outperforms the current state-of-the-art methods in PU learning, and our experimental results also confirm the effectiveness of the negative sample selector underlying PULNS.
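A REINFORCE-trained selector of this kind can be sketched in a few lines: a Bernoulli policy decides per sample whether to treat it as a likely negative, and the classifier's resulting performance acts as the reward. This is a toy, generic REINFORCE step under assumed linear policy features; the names and interface are illustrative, not PULNS's actual architecture:

```python
import math
import random

def reinforce_step(theta, features, reward_fn, lr=0.1, rng=random):
    """One REINFORCE update of a Bernoulli negative-sample selector.

    Each unlabeled sample x is selected as a likely negative with
    probability sigmoid(theta . x); the downstream classifier's
    performance (abstracted here as reward_fn) reinforces the
    selections that led to a better classifier.
    """
    grads = [0.0] * len(theta)
    actions = []
    for x in features:
        z = sum(t * xi for t, xi in zip(theta, x))
        p = 1.0 / (1.0 + math.exp(-z))
        a = 1 if rng.random() < p else 0
        actions.append(a)
        coef = a - p  # d log pi(a|x) / dz for a Bernoulli policy
        for k, xi in enumerate(x):
            grads[k] += coef * xi
    r = reward_fn(actions)  # e.g. validation score of the retrained classifier
    return [t + lr * r * g for t, g in zip(theta, grads)], actions

rng = random.Random(0)
theta, actions = reinforce_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                                reward_fn=sum, rng=rng)
```

Alternating such selector updates with classifier retraining mirrors the alternating scheme described in the abstract.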

IJCAI Conference 2020 Conference Paper

Intelligent Virtual Machine Provisioning in Cloud Computing

  • Chuan Luo
  • Bo Qiao
  • Xin Chen
  • Pu Zhao
  • Randolph Yao
  • Hongyu Zhang
  • Wei Wu
  • Andrew Zhou

Virtual machine (VM) provisioning is a common and critical problem in cloud computing. In industrial cloud platforms, there are a huge number of VMs provisioned per day. Due to the complexity and resource constraints, provisioning needs to be carefully optimized to make cloud platforms effectively utilize their resources. Moreover, in practice, provisioning a VM from scratch requires a fairly long time, which would degrade the customer experience. Hence, it is advisable to provision VMs ahead of upcoming demands. In this work, we formulate the practical scenario as the predictive VM provisioning (PreVMP) problem, where upcoming demands are unknown and need to be predicted in advance, and then the VM provisioning plan is optimized based on the predicted demands. Further, we propose Uncertainty-Aware Heuristic Search (UAHS) for solving the PreVMP problem. UAHS first models the prediction uncertainty, and then utilizes the prediction uncertainty in optimization. Moreover, UAHS leverages Bayesian optimization to couple prediction and optimization to improve its practical performance. Extensive experiments show that UAHS performs much better than state-of-the-art competitors on two public datasets and an industrial dataset. UAHS has been successfully applied in Microsoft Azure and brought practical benefits in real-world applications.

IJCAI Conference 2020 Conference Paper

Retrieve, Program, Repeat: Complex Knowledge Base Question Answering via Alternate Meta-learning

  • Yuncheng Hua
  • Yuan-Fang Li
  • Gholamreza Haffari
  • Guilin Qi
  • Wei Wu

A compelling approach to complex question answering is to convert the question to a sequence of actions, which can then be executed on the knowledge base to yield the answer, aka the programmer-interpreter approach. Using training questions similar to the test question, meta-learning enables the programmer to adapt quickly to unseen questions and tackle potential distributional biases. However, this comes at the cost of manually labeling similar questions to learn a retrieval model, which is tedious and expensive. In this paper, we present a novel method that automatically learns a retrieval model alternately with the programmer from weak supervision, i.e., the system’s performance with respect to the produced answers. To the best of our knowledge, this is the first attempt to train the retrieval model jointly with the programmer. Our system leads to state-of-the-art performance on a large-scale task for complex question answering over knowledge bases. We have released our code at https://github.com/DevinJake/MARL.

NeurIPS Conference 2020 Conference Paper

Zero-Resource Knowledge-Grounded Dialogue Generation

  • Linxiao Li
  • Can Xu
  • Wei Wu
  • Yufan Zhao
  • Xueliang Zhao
  • Chongyang Tao

While neural conversation models have shown great potentials towards generating informative and engaging responses via introducing external knowledge, learning such a model often requires knowledge-grounded dialogues that are difficult to obtain. To overcome the data challenge and reduce the cost of building a knowledge-grounded dialogue system, we explore the problem under a zero-resource setting by assuming no context-knowledge-response triples are needed for training. To this end, we propose representing the knowledge that bridges a context and a response and the way that the knowledge is expressed as latent variables, and devise a variational approach that can effectively estimate a generation model from independent dialogue corpora and knowledge corpora. Evaluation results on three benchmarks of knowledge-grounded dialogue generation indicate that our model can achieve comparable performance with state-of-the-art methods that rely on knowledge-grounded dialogues for training, and exhibits a good generalization ability over different datasets.

IJCAI Conference 2019 Conference Paper

A Document-grounded Matching Network for Response Selection in Retrieval-based Chatbots

  • Xueliang Zhao
  • Chongyang Tao
  • Wei Wu
  • Can Xu
  • Dongyan Zhao
  • Rui Yan

We present a document-grounded matching network (DGMN) for response selection that can power a knowledge-aware retrieval-based chatbot system. The challenges of building such a model lie in how to ground conversation contexts with background documents and how to recognize important information in the documents for matching. To overcome the challenges, DGMN fuses information in a document and a context into representations of each other, and dynamically determines if grounding is necessary and importance of different parts of the document and the context through hierarchical interaction with a response at the matching step. Empirical studies on two public data sets indicate that DGMN can significantly improve upon state-of-the-art methods and at the same time enjoys good interpretability.

NeurIPS Conference 2019 Conference Paper

Glyce: Glyph-vectors for Chinese Character Representations

  • Yuxian Meng
  • Wei Wu
  • Fei Wang
  • Xiaoya Li
  • Ping Nie
  • Fan Yin
  • Muyu Li
  • Qinghong Han

It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the use of the glyph information in those languages. However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found. In this paper, we address this gap by presenting Glyce, the glyph-vectors for Chinese character representations. We make three major innovations: (1) We use historical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese, etc.) to enrich the pictographic evidence in characters; (2) We design CNN structures (called tianzege-CNN) tailored to Chinese character image processing; and (3) We use image classification as an auxiliary task in a multi-task learning setup to increase the model's ability to generalize. We show that glyph-based models are able to consistently outperform word/char ID-based models in a wide range of Chinese NLP tasks. When combined with BERT, we are able to set new state-of-the-art results for a variety of Chinese NLP tasks, including language modeling, tagging (NER, CWS, POS), sentence pair classification (BQ, LCQMC, XNLI, NLPCC-DBQA), single sentence classification tasks (ChnSentiCorp, the Fudan corpus, iFeng), dependency parsing, and semantic role labeling. For example, the proposed model achieves an F1 score of 81.6 on the OntoNotes dataset of NER, +1.5 over BERT; it achieves an almost perfect accuracy of 99.8% on the Fudan corpus for text classification.

IJCAI Conference 2018 Conference Paper

Efficient Attributed Network Embedding via Recursive Randomized Hashing

  • Wei Wu
  • Bin Li
  • Ling Chen
  • Chengqi Zhang

Attributed network embedding aims to learn a low-dimensional representation for each node of a network, considering both attributes and structure information of the node. However, the learning based methods usually involve substantial cost in time, which makes them impractical without the help of a powerful workhorse. In this paper, we propose a simple yet effective algorithm, named NetHash, to solve this problem only with moderate computing capacity. NetHash employs the randomized hashing technique to encode shallow trees, each of which is rooted at a node of the network. The main idea is to efficiently encode both attributes and structure information of each node by recursively sketching the corresponding rooted tree from bottom (i.e., the predefined highest-order neighboring nodes) to top (i.e., the root node), and particularly, to preserve as much information closer to the root node as possible. Our extensive experimental results show that the proposed algorithm, which does not need learning, runs significantly faster than the state-of-the-art learning-based network embedding methods while achieving competitive or even better performance in accuracy.
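The bottom-up recursive sketching can be illustrated with a toy min-hash of rooted trees: each node's signature min-hashes its own attribute tokens together with its children's signatures, so that information nearer the root dominates. This is a simplified illustration of the idea; NetHash's actual hashing scheme is more refined, and the helper names here are hypothetical:

```python
import hashlib

def _h(seed, token):
    # Deterministic 64-bit hash of a token under a given seed.
    return int.from_bytes(
        hashlib.md5(f"{seed}:{token}".encode()).digest()[:8], "big")

def node_sketch(node, attrs, children, depth, num_perm=4):
    """Min-hash sketch of the rooted tree at `node`.

    Min-hashes the node's own attribute tokens together with the
    (stringified) sketches of its children, recursing `depth` hops,
    so the root's neighborhood shapes the signature the most.
    """
    tokens = list(attrs[node])
    if depth > 0:
        for c in children.get(node, []):
            child = node_sketch(c, attrs, children, depth - 1, num_perm)
            tokens.append("#" + ",".join(map(str, child)))
    return [min(_h(s, t) for t in tokens) for s in range(num_perm)]

def similarity(s1, s2):
    # Estimated Jaccard similarity from two min-hash sketches.
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

attrs = {"a": ["x", "y"], "b": ["x"], "c": ["y"]}
children = {"a": ["b", "c"]}
sketch_a = node_sketch("a", attrs, children, depth=1)
```

Because no parameters are learned, sketching a node costs only a pass over its rooted tree, which is the source of the speedup claimed above.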

IJCAI Conference 2018 Conference Paper

Get The Point of My Utterance! Learning Towards Effective Responses with Multi-Head Attention Mechanism

  • Chongyang Tao
  • Shen Gao
  • Mingyue Shang
  • Wei Wu
  • Dongyan Zhao
  • Rui Yan

The attention mechanism has become a popular and widely used component in sequence-to-sequence models. However, previous research on neural generative dialogue systems tends to generate universal responses, and the attention distribution learned by the model always attends to the same semantic aspect. To solve this problem, in this paper, we propose a novel Multi-Head Attention Mechanism (MHAM) for generative dialog systems, which aims at capturing multiple semantic aspects from the user utterance. Further, a regularizer is formulated to force different attention heads to concentrate on distinct aspects. The proposed mechanism leads to more informative, diverse, and relevant responses. Experimental results show that our proposed model outperforms several strong baselines.
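One generic way to push different attention heads toward different semantic aspects is to penalize overlap between their attention distributions, sketched below. The paper's exact regularizer may differ; this is only an illustration of the mechanism:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def head_overlap_penalty(attn):
    """Grows when different heads attend to the same utterance positions.

    `attn` holds H attention distributions over the same utterance;
    the penalty sums inner products between distinct heads, so adding
    it to the training loss pushes heads apart.
    """
    pen = 0.0
    for i, ai in enumerate(attn):
        for j, aj in enumerate(attn):
            if i != j:
                pen += sum(x * y for x, y in zip(ai, aj))
    return pen

# Heads focused on different words incur almost no penalty.
heads = [softmax([5.0, 0.0, 0.0]), softmax([0.0, 0.0, 5.0])]
```

Identical heads maximize the penalty, while heads with disjoint support incur none, which is exactly the behavior a "concentrate on different aspects" regularizer needs.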

AAAI Conference 2018 Conference Paper

Hierarchical Recurrent Attention Network for Response Generation

  • Chen Xing
  • Yu Wu
  • Wei Wu
  • Yalou Huang
  • Ming Zhou

We study multi-turn response generation in chatbots where a response is generated according to a conversation context. Existing work has modeled the hierarchy of the context, but does not pay enough attention to the fact that words and utterances in the context are differentially important. As a result, they may lose important information in context and generate irrelevant responses. We propose a hierarchical recurrent attention network (HRAN) to model both the hierarchy and the importance variance in a unified framework. In HRAN, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. Empirical studies on both automatic evaluation and human judgment show that HRAN can significantly outperform state-of-the-art models for context based response generation.

AAAI Conference 2018 Conference Paper

Knowledge Enhanced Hybrid Neural Network for Text Matching

  • Yu Wu
  • Wei Wu
  • Can Xu
  • Zhoujun Li

Long text brings a big challenge to neural network based text matching approaches due to their complicated structures. To tackle the challenge, we propose a knowledge enhanced hybrid neural network (KEHNN) that leverages prior knowledge to identify useful information and filter out noise in long text and performs matching from multiple perspectives. The model fuses prior knowledge into word representations by knowledge gates and establishes three matching channels with words, sequential structures of text given by Gated Recurrent Units (GRUs), and knowledge enhanced representations. The three channels are processed by a convolutional neural network to generate high level features for matching, and the features are synthesized as a matching score by a multilayer perceptron. In this paper, we focus on exploring the use of taxonomy knowledge for text matching. Evaluation results from extensive experiments on public data sets of question answering and conversation show that KEHNN can significantly outperform state-of-the-art matching models and particularly improve matching accuracy on pairs with long text.
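A knowledge gate of the kind described can be sketched as a per-dimension sigmoid interpolation between a word embedding and its knowledge vector. The parameter shapes here are hypothetical; KEHNN's actual gates and matching channels are more elaborate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def knowledge_gate(word_vec, know_vec, gate_weights, gate_bias):
    # one gate value per output dimension, driven by the word embedding;
    # the gate decides how much prior knowledge to fuse into the representation
    gates = [sigmoid(sum(w * x for w, x in zip(row, word_vec)) + gate_bias)
             for row in gate_weights]
    return [g * k + (1.0 - g) * w for g, k, w in zip(gates, know_vec, word_vec)]
```

With a saturated gate the output reduces to the knowledge vector; with a closed gate it falls back to the plain word embedding.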

AAAI Conference 2018 Conference Paper

Neural Response Generation With Dynamic Vocabularies

  • Yu Wu
  • Wei Wu
  • Dejian Yang
  • Can Xu
  • Zhoujun Li

We study response generation for open-domain conversation in chatbots. Existing methods assume that words in responses are generated from an identical vocabulary regardless of the input, which not only makes them vulnerable to generic patterns and irrelevant noise, but also incurs a high decoding cost. We propose a dynamic vocabulary sequence-to-sequence (DVS2S) model which allows each input to possess its own vocabulary in decoding. In training, vocabulary construction and response generation are jointly learned by maximizing a lower bound of the true objective with a Monte Carlo sampling method. In inference, the model dynamically allocates a small vocabulary for an input with the word prediction model, and conducts decoding only with the small vocabulary. Because of the dynamic vocabulary mechanism, DVS2S eludes many generic patterns and irrelevant words in generation, and enjoys efficient decoding at the same time. Experimental results on both automatic metrics and human annotations show that DVS2S can significantly outperform state-of-the-art methods in terms of response quality, while requiring only 60% of the decoding time of the most efficient baseline.
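The inference-time mechanism can be sketched as two steps: a word predictor scores each content word for the current input, a per-input vocabulary is assembled (function words always kept), and decoding is restricted to that vocabulary. The thresholding scheme and names below are illustrative assumptions, not the paper's exact procedure.

```python
def allocate_vocabulary(word_probs, always_keep, threshold=0.5):
    # build a per-input decoding vocabulary: function words are always kept,
    # content words only when the word-prediction model scores them highly
    vocab = set(always_keep)
    vocab.update(w for w, p in word_probs.items() if p >= threshold)
    return vocab

def restricted_argmax(scores, vocab):
    # one greedy decoding step limited to the dynamic vocabulary
    candidates = {w: s for w, s in scores.items() if w in vocab}
    return max(candidates, key=candidates.get)
```

Because the decoder never scores words outside the small vocabulary, both generic-pattern leakage and per-step decoding cost drop.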

NeurIPS Conference 2018 Conference Paper

PointCNN: Convolution On X-Transformed Points

  • Yangyan Li
  • Rui Bu
  • Mingchao Sun
  • Wei Wu
  • Xinhan Di
  • Baoquan Chen

We present a simple and general framework for feature learning from point clouds. The key to the success of CNNs is the convolution operator, which is capable of leveraging spatially local correlation in data represented densely in grids (e.g., images). However, point clouds are irregular and unordered, so directly convolving kernels against the features associated with the points discards shape information while remaining variant to point order. To address these problems, we propose to learn an X-transformation from the input points, which is used to simultaneously weight the input features associated with the points and permute them into a latent, potentially canonical order. Then the element-wise product and sum operations of the typical convolution operator are applied to the X-transformed features. The proposed method is a generalization of typical CNNs to feature learning from point clouds; thus we call it PointCNN. Experiments show that PointCNN achieves on-par or better performance than state-of-the-art methods on multiple challenging benchmark datasets and tasks.
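The core operator can be illustrated with plain lists: a learned K x K matrix X weights and permutes the K neighbor features, and a kernel is then applied to the transformed features. In PointCNN, X is predicted by a small network from the neighbor coordinates; here it is simply passed in.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def x_conv(x_matrix, neighbor_feats, kernel):
    # sketch of the X-Conv idea: weight-and-permute the K neighbor features
    # (rows of neighbor_feats, one feature vector per neighbor) with the
    # learned K x K matrix X, then apply the convolution kernel (one weight
    # per neighbor slot) to the transformed features
    transformed = matmul(x_matrix, neighbor_feats)  # K x C
    return [sum(kernel[i] * row[j] for i, row in enumerate(transformed))
            for j in range(len(neighbor_feats[0]))]
```

If X is a permutation matrix, the operator becomes invariant to the original neighbor ordering, which is the property motivating the design.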

AAAI Conference 2017 Conference Paper

Topic Aware Neural Response Generation

  • Chen Xing
  • Wei Wu
  • Yu Wu
  • Jie Liu
  • Yalou Huang
  • Ming Zhou
  • Wei-Ying Ma

We consider incorporating topic information into a sequence-to-sequence framework to generate informative and interesting responses for chatbots. To this end, we propose a topic aware sequence-to-sequence (TA-Seq2Seq) model. The model utilizes topics to simulate the prior human knowledge that guides people to form informative and interesting responses in conversation, and leverages topic information in generation through a joint attention mechanism and a biased generation probability. The joint attention mechanism summarizes the hidden vectors of an input message as context vectors by message attention and synthesizes topic vectors by topic attention from the topic words of the message obtained from a pre-trained LDA model, with these vectors jointly affecting the generation of words in decoding. To increase the possibility of topic words appearing in responses, the model modifies the generation probability of topic words by adding an extra probability item to bias the overall distribution. Empirical studies on both automatic evaluation metrics and human annotations show that TA-Seq2Seq can generate more informative and interesting responses, significantly outperforming state-of-the-art response generation models.
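The biased generation probability can be sketched as adding extra mass to topic words in the output distribution and renormalizing. The flat additive bias below is a simplification of the model's actual probability item.

```python
def bias_topic_words(probs, topic_words, bias=0.1):
    # add an extra probability item to topic words, then renormalize so the
    # result is still a distribution; topic words become more likely to be
    # emitted during decoding
    boosted = {w: p + (bias if w in topic_words else 0.0)
               for w, p in probs.items()}
    total = sum(boosted.values())
    return {w: p / total for w, p in boosted.items()}
```

Non-topic words keep their relative order; only the mass assigned to topic words grows.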

AAAI Conference 2016 Conference Paper

Improving Recommendation of Tail Tags for Questions in Community Question Answering

  • Yu Wu
  • Wei Wu
  • Zhoujun Li
  • Ming Zhou

We study tag recommendation for questions in community question answering (CQA). Tags, which represent the semantic summarization of questions, are useful for navigation and expert finding in CQA and can facilitate content consumption such as searching and mining on these web sites. The task is challenging, as both questions and tags are short and a large fraction of tags are tail tags that occur very infrequently. To solve these problems, we propose matching questions and tags not only by themselves, but also by similar questions and similar tags. The idea is then formalized as a model in which we calculate question-tag similarity using a linear combination of similarity with similar questions and tags, weighted by tag importance. Question similarity, tag similarity, and tag importance are learned in a supervised random walk framework by fusing multiple features. Our model can thus not only accurately identify question-tag similarity for head tags, but also improve the accuracy of recommendation for tail tags. Experimental results show that the proposed method significantly outperforms state-of-the-art methods on tag recommendation for questions. In particular, it improves tail tag recommendation accuracy by a large margin.
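The scoring idea can be sketched as a weighted linear combination: direct question-tag similarity plus evidence propagated from similar questions and similar tags, scaled by tag importance. The weights and averaging scheme here are illustrative assumptions; in the paper all three components are learned in a supervised random walk framework.

```python
def question_tag_score(direct_sim, similar_q_sims, similar_t_sims,
                       tag_importance, alpha=0.5, beta=0.25):
    # direct_sim: similarity between the question and the tag themselves;
    # similar_q_sims / similar_t_sims: similarities contributed by similar
    # questions and similar tags; tag_importance scales the whole score
    q_term = sum(similar_q_sims) / len(similar_q_sims) if similar_q_sims else 0.0
    t_term = sum(similar_t_sims) / len(similar_t_sims) if similar_t_sims else 0.0
    return tag_importance * (alpha * direct_sim + beta * q_term + beta * t_term)
```

The point for tail tags: even when `direct_sim` is near zero (the tag rarely co-occurs with anything), support from similar questions or tags still yields a positive score.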

ECAI Conference 2016 Conference Paper

ShapeLearner: Towards Shape-Based Visual Knowledge Harvesting

  • Huayong Xu
  • Yafang Wang
  • Kang Feng
  • Gerard de Melo
  • Wei Wu
  • Andrei Sharf
  • Baoquan Chen

The deluge of images on the Web has led to a number of efforts to organize images semantically and mine visual knowledge. Despite enormous progress on categorizing entire images or bounding boxes, only few studies have targeted fine-grained image understanding at the level of specific shape contours. For instance, beyond recognizing that an image portrays a cat, we may wish to distinguish its legs, head, tail, and so on. To this end, we present ShapeLearner, a system that acquires such visual knowledge about object shapes and their parts in a semantic taxonomy, and then is able to exploit this hierarchy in order to analyze new kinds of objects that it has not observed before. ShapeLearner jointly learns this knowledge from sets of segmented images. The space of label and segmentation hypotheses is pruned and then evaluated using Integer Linear Programming. Experiments on a variety of shape classes show the accuracy and effectiveness of our method.

AAAI Conference 2015 Conference Paper

Mining Query Subtopics from Questions in Community Question Answering

  • Yu Wu
  • Wei Wu
  • Zhoujun Li
  • Ming Zhou

This paper proposes mining query subtopics from questions in community question answering (CQA). The subtopics are represented as a number of clusters of questions with keywords summarizing the clusters. The task is unique in that the subtopics from questions can not only facilitate user browsing in CQA search, but also describe aspects of queries from a question-answering perspective. The challenges of the task include how to group semantically similar questions and how to find keywords capable of summarizing the clusters. We formulate the subtopic mining task as a non-negative matrix factorization (NMF) problem and further extend the model of NMF to incorporate question similarity estimated from metadata of CQA into learning. Compared with existing methods, our method can jointly optimize question clustering and keyword extraction and encourage the former task to enhance the latter. Experimental results on large scale real world CQA datasets show that the proposed method significantly outperforms the existing methods in terms of keyword extraction, while achieving a comparable performance to the state-of-the-art methods for question clustering.
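The base model the paper extends is standard NMF: factor a nonnegative question-term matrix V into W (question-cluster memberships) and H (cluster-keyword weights). Below is the plain multiplicative-update algorithm in pure Python; the paper's similarity-regularized extension is not shown.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def nmf(v, k, steps=200, eps=1e-9):
    # basic Lee-Seung multiplicative updates: V (m x n) ~ W (m x k) @ H (k x n)
    random.seed(0)
    m, n = len(v), len(v[0])
    w = [[random.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[random.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(steps):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return w, h

def recon_error(v, w, h):
    r = matmul(w, h)
    return sum((v[i][j] - r[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))
```

Rows of H with the largest weights give the cluster keywords; rows of W assign questions to subtopic clusters, which is why the two subtasks can be optimized jointly.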

AAAI Conference 2014 Conference Paper

Double Configuration Checking in Stochastic Local Search for Satisfiability

  • Chuan Luo
  • Shaowei Cai
  • Wei Wu
  • Kaile Su

Stochastic local search (SLS) algorithms have shown effectiveness on satisfiable instances of the Boolean satisfiability (SAT) problem. However, their performance is still unsatisfactory on random k-SAT at the phase transition, which is of significance and is one of the empirically hardest distributions of SAT instances. In this paper, we propose a new heuristic called DCCA, which combines two configuration checking (CC) strategies with different definitions of configuration in a novel way. We use the DCCA heuristic to design an efficient SLS solver for SAT dubbed DCCASat. The experiments show that the DCCASat solver significantly outperforms a number of state-of-the-art solvers on extensive random k-SAT benchmarks at the phase transition. Moreover, DCCASat shows good performance on structured benchmarks, and a combination of DCCASat with a complete solver achieves state-of-the-art performance on structured benchmarks.
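The core configuration checking rule can be sketched directly: a variable becomes ineligible to flip right after it is flipped, and becomes eligible again only once a neighboring variable flips (i.e., its "configuration" changes). DCCA combines two such rules with different configuration definitions; only the single basic rule is shown here.

```python
def make_cc_state(n_vars):
    # all variables start as 'configuration changed' = eligible to flip;
    # variables are 1-indexed, slot 0 is unused
    return [True] * (n_vars + 1)

def flip(var, conf_changed, neighbors):
    # after flipping `var`, it is frozen until a neighbor flips,
    # while its neighbors' configurations are marked as changed
    conf_changed[var] = False
    for u in neighbors[var]:
        conf_changed[u] = True

def cc_candidates(conf_changed):
    # variables the CC heuristic currently allows the SLS solver to pick
    return [v for v in range(1, len(conf_changed)) if conf_changed[v]]
```

This simple bookkeeping is what prevents the solver from immediately re-flipping a variable in an unchanged neighborhood, reducing cycling.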

IJCAI Conference 2013 Conference Paper

Automatic Name-Face Alignment to Enable Cross-Media News Retrieval

  • Yuejie Zhang
  • Wei Wu
  • Yang Li
  • Cheng Jin
  • Xiangyang Xue
  • Jianping Fan

A new algorithm is developed in this paper to support automatic name-face alignment for more accurate cross-media news retrieval. We focus on extracting valuable information from large amounts of news images and their captions, where multi-level image-caption pairs are constructed to characterize both significant names with higher salience and their cohesion with human faces extracted from news images. To remedy the lack of related information for rare names, Web mining is introduced to acquire extra multimodal information. We also emphasize an optimization mechanism based on our Improved Self-Adaptive Simulated Annealing Genetic Algorithm to verify the feasibility of alignment combinations. Our experiments have obtained very positive results.

JMLR Journal 2013 Journal Article

Learning Bilinear Model for Matching Queries and Documents

  • Wei Wu
  • Zhengdong Lu
  • Hang Li

The task of matching data from two heterogeneous domains naturally arises in various areas such as web search, collaborative filtering, and drug design. In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. To the best of our knowledge, however, there has been little work on principled approaches to leveraging both clicks and content to learn a matching model for search. In this paper, we propose a framework for learning to match heterogeneous objects. The framework learns two linear mappings for two objects respectively, and matches them via the dot product of their images after mapping. Moreover, when different regularizations are enforced, the framework renders a rich family of matching models. With orthonormal constraints on mapping functions, the framework subsumes Partial Least Squares (PLS) as a special case. Alternatively, with an $\ell_1$+$\ell_2$ regularization, we obtain a new model called Regularized Mapping to Latent Structures (RMLS). RMLS enjoys many advantages over PLS, including lower time complexity and easy parallelization. To further understand the matching framework, we conduct generalization analysis and apply the result to both PLS and RMLS. We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. We test the efficacy and scalability of RMLS and PLS on large scale web search problems. The results show that both PLS and RMLS can significantly outperform baseline methods, while RMLS substantially speeds up the learning process.
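The scoring function shared by PLS and RMLS is a plain bilinear match: map the query and the document into a common latent space with two learned linear maps, then take the dot product of the images. A minimal sketch (the maps are given here rather than learned):

```python
def map_vec(matrix, vec):
    # apply a linear mapping (list of rows) to a feature vector
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def match_score(l_q, l_d, query_vec, doc_vec):
    # bilinear matching: dot product of the two images in the latent space,
    # i.e. score = (L_q q) . (L_d d)
    q_img = map_vec(l_q, query_vec)
    d_img = map_vec(l_d, doc_vec)
    return sum(a * b for a, b in zip(q_img, d_img))
```

The regularization placed on `l_q` and `l_d` during learning is what distinguishes the family members: orthonormal constraints recover PLS, while $\ell_1$+$\ell_2$ penalties give RMLS.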

IJCAI Conference 2011 Conference Paper

Fusion of Multiple Features and Supervised Learning for Chinese OOV Term Detection and POS Guessing

  • Yuejie Zhang
  • Lei Cen
  • Wei Wu
  • Cheng Jin
  • Xiangyang Xue

In this paper, to support more precise Chinese Out-of-Vocabulary (OOV) term detection and Part-of-Speech (POS) guessing, a unified mechanism is proposed and formulated based on the fusion of multiple features and supervised learning. Besides all the traditional features, the new features for statistical information and global contexts are introduced, as well as some constraints and heuristic rules, which reveal the relationships among OOV term candidates. Our experiments on the Chinese corpora from both People's Daily and SIGHAN 2005 have achieved the consistent results, which are better than those acquired by pure rule-based or statistics-based models. From the experimental results for combining our model with Chinese monolingual retrieval on the data sets of TREC-9, it is found that the obvious improvement for the retrieval performance can also be obtained.

JMLR Journal 2011 Journal Article

Learning a Robust Relevance Model for Search Using Kernel Methods

  • Wei Wu
  • Jun Xu
  • Hang Li
  • Satoshi Oyama

This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem.
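The S-function view can be made concrete with explicit feature maps: a query-side image and a document-side image living in the same term-indexed space, scored by their dot product. The BM25-flavored maps below (idf weighting on the query side, saturated term frequency with `k1` on the document side, length normalization omitted) are an illustrative instance, not the paper's learned model.

```python
def phi_query(tf, idf):
    # query-side image: idf-weighted term presence
    return {t: idf.get(t, 0.0) for t in tf}

def psi_doc(tf, k1=1.2):
    # document-side image: BM25-style saturated term frequency
    return {t: (f * (k1 + 1.0)) / (f + k1) for t, f in tf.items()}

def s_function(q_img, d_img):
    # an S-function: dot product of the two images in the shared space
    return sum(w * d_img.get(t, 0.0) for t, w in q_img.items())
```

Under this view, fixing term mismatch amounts to learning images in which related-but-different terms (e.g., "ny" and "new york") end up close, which is what the S-kernel machinery enables.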

AAAI Conference 2011 Conference Paper

Multi-Task Learning in Square Integrable Space

  • Wei Wu
  • Hang Li
  • Yunhua Hu
  • Rong Jin

Several kernel based methods for multi-task learning have been proposed, which leverage relations among tasks as regularization to enhance the overall learning accuracies. These methods assume that the tasks share the same kernel, which could limit their applications because in practice different tasks may need different kernels. The main challenge of introducing multiple kernels into multiple tasks is that models from different Reproducing Kernel Hilbert Spaces (RKHSs) are not comparable, making it difficult to exploit relations among tasks. This paper addresses the challenge by formalizing the problem in the Square Integrable Space (SIS). Specifically, it proposes a kernel based method which makes use of a regularization term defined in the SIS to represent task relations. We prove a new representer theorem for the proposed approach in SIS. We further derive a practical method for solving the learning problem and conduct consistency analysis of the method. We discuss the relations between our method and an existing method. We also give an SVM based implementation of our method for multi-label classification. Experiments on two real-world data sets show that the proposed method performs better than the existing method.

NeurIPS Conference 2011 Conference Paper

Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

  • Sebastian Kurtek
  • Anuj Srivastava
  • Wei Wu

While signal estimation under random amplitudes, phase shifts, and additive noise is studied frequently, the problem of estimating a deterministic signal under random time-warpings has been relatively unexplored. We present a novel framework for estimating the unknown signal that utilizes the action of the warping group to form an equivalence relation between signals. First, we derive an estimator for the equivalence class of the unknown signal using the notion of Karcher mean on the quotient space of equivalence classes. This step requires the use of the Fisher-Rao Riemannian metric and a square-root representation of signals to enable computations of distances and means under this metric. Then, we define a notion of the center of a class and show that the center of the estimated class is a consistent estimator of the underlying unknown signal. This estimation algorithm has many applications: (1) registration/alignment of functional data, (2) separation of phase/amplitude components of functional data, (3) joint demodulation and carrier estimation, and (4) sparse modeling of functional data. Here we demonstrate only (1) and (2): given signals are temporally aligned using nonlinear warpings and, thus, separated into their phase and amplitude components. The proposed method for signal alignment is shown to have state-of-the-art performance using Berkeley growth, handwritten signatures, and neuroscience spike train data.
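The square-root representation the abstract refers to can be sketched for sampled signals: the square-root slope function q(t) = sign(f'(t)) sqrt(|f'(t)|), computed with finite differences. Under this representation the Fisher-Rao distance between functions reduces to an ordinary L2 distance, which is what makes Karcher means computable. This is a discretized sketch only; the alignment (warping optimization) step is not shown.

```python
import math

def srsf(f, dt):
    # square-root slope function of a uniformly sampled signal f:
    # q_i = sign(f') * sqrt(|f'|) with a forward finite-difference derivative
    q = []
    for i in range(len(f) - 1):
        d = (f[i + 1] - f[i]) / dt
        q.append(math.copysign(math.sqrt(abs(d)), d))
    return q

def l2_dist(q1, q2, dt):
    # L2 distance between two SRSFs; equals the Fisher-Rao distance
    # between the underlying signals (up to discretization)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)) * dt)
```

Comparing signals through their SRSFs rather than their raw values is the step that makes the metric invariant to simultaneous time-warping of both signals.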