Arrow Research search

Author name cluster

Hao Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

92 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Benchmarking LLMs’ Mathematical Reasoning with Unseen Random Variables Questions

  • Zijin Hong
  • Hao Wu
  • Su Dong
  • Junnan Dong
  • Yilin Xiao
  • Yujing Zhang
  • Zhu Wang
  • Feiran Huang

Recent studies have raised significant concerns regarding the reliability of current mathematical benchmarks, highlighting key limitations such as simplistic design and potential data contamination that undermine evaluation accuracy. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we develop question-generating functions to produce random variable questions (RVQs), whose background content mirrors the original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's genuine reasoning capability is reflected through its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1,000 RVQs. Our findings reveal that LLMs exhibit a proficiency imbalance between encountered and "unseen" data distributions. Furthermore, RV-Bench reveals that proficiency generalization across similar mathematical reasoning tasks is limited, but we verified that it can still be effectively elicited through test-time scaling.

AAAI Conference 2026 Conference Paper

DeepSenseMoE: Harnessing Power of Time Series Foundation Models for Few-Shot Human Activity Recognition

  • Zenan Fu
  • Dongzhou Cheng
  • Lei Zhang
  • Wenbo Huang
  • Zhenghao Chen
  • Hao Wu

Recent advances in Time Series Foundation Models (TSFMs) have fundamentally revolutionized general time series analysis across domains like finance, retail, weather, and power. However, how to unlock the hidden capacity of general-purpose TSFMs for wearable activity recognition remains largely unexplored, given severe sensor annotation scarcity and highly heterogeneous sensor data. To address these challenges, we propose DeepSenseMoE, a novel multi-scale convolution-based Mixture of Experts (MoE) module for parameter-efficient fine-tuning of general-purpose TSFMs to sensor-based activity recognition. DeepSenseMoE integrates three key innovations: (1) Multi-scale convolutional experts with different filter sizes responsible for capturing varying sensor contexts; (2) Shared-expert isolation mechanism compressing common activity knowledge into a single shared expert while reducing redundancy among routed experts; and (3) Hierarchical supervised contrastive alignment guiding experts to further learn discriminative activity features. Extensive experiments on three challenging HAR benchmarks demonstrate DeepSenseMoE's superiority, achieving up to 9.5% accuracy gains over state-of-the-art under few-shot and fully supervised settings, with fewer than 1% additional trainable parameters. We hope that this work may establish a solid foundation to accelerate development and deployment of powerful TSFMs in data-scarce wearable activity recognition tasks while reducing the reliance on labeled sensor data.

AAAI Conference 2026 Conference Paper

Diverse Human Driving Vehicle Simulation in Background Traffic for Autonomous Driving Tests

  • Wendi Li
  • Hao Wu
  • Han Gao
  • Bing Mao
  • Fengyuan Xu
  • Sheng Zhong

Realistic background traffic is critical to the simulation platforms for autonomous driving (AD) testing. Given that most vehicles in reality are driven by human beings, introducing human driving (HD) vehicles to the background traffic is necessary to be able to discover more problems of the tested AD vehicle in the simulation stage. However, existing methods rely on ad-hoc rules or data-driven training to mimic partial human driver behaviors, which are not comprehensive and lack transparency. In this work, we design a smart human driving vehicle simulator HDSim which is empowered by cognitively inspired modeling and AI models. HDSim enables diverse, realistic, and scalable HD traffic simulation on AD testing platforms like CARLA in a non-intrusive manner. There are two novel components in HDSim. First, we introduce a driver model to guide the generation of diverse human driving styles by using different combinations of latent cognitive factors in a hierarchy. Second, we design a Perception-Mediated Behavior Influence (PMBI) mechanism to use LLM-assisted perceptual transformations to indirectly fuse driving actions with driving styles. Experiments show that HDSim traffic can help simulation platforms like CARLA to reveal 68% more failures of tested AD vehicles, and the explainability of reported accidents is also improved.

JBHI Journal 2026 Journal Article

EAP-LSTM: A Bi-LSTM-Based Deep Learning Framework for Quantitatively Predicting Enhancer Activity in Drosophila and Human Cell Lines

  • Yao Zhang
  • Lichang Dai
  • Yu Dou
  • Xin Li
  • Chang Lu
  • Hao Wu

Enhancer activity plays a critical role in gene regulation, influencing various biological processes such as development and disease progression. Accurate prediction of enhancer activity is essential for understanding the mechanisms underlying gene regulation and enhancer function. This study introduces a novel deep learning framework, EAP-LSTM (Enhancer Activity Prediction based on Bi-LSTM), to quantitatively predict enhancer activity across different species and cell lines. The model integrates multiple feature modules, including Word2Vec-based representations of DNA sequences, reverse complement k-mer, mismatch k-mer features, and epigenomic data. Evaluated on six cell lines, including five human cell lines (A549, HCT116, HepG2, K562, and MCF-7) and one Drosophila cell line (S2), EAP-LSTM consistently outperforms state-of-the-art models, such as DeepSTARR and HEAP, in all datasets. For example, on the K562 dataset, EAP-LSTM achieves a Pearson correlation coefficient (PCC) of 0.7944, outperforming DeepSTARR and HEAP by 13.65% and 2.73%, respectively. In addition, EAP-LSTM demonstrates strong performance in small-sample learning scenarios, showing clear improvements compared with baseline models. Furthermore, the study investigates the role of transcription factor binding sites (TFBSs) within enhancer regions, identifying critical motifs associated with enhancer activity. These findings not only improve enhancer prediction accuracy but also provide valuable insights into the molecular mechanisms underlying enhancer function.

JBHI Journal 2026 Journal Article

MomicPred: A Cell Cycle Prediction Framework Based on Dual-Branch Multi-Modal Feature Fusion for Single-Cell Multi-Omics Data

  • Zhenqi Shi
  • Linxing Cong
  • Hao Wu

The cell cycle plays a pivotal role in regulating cell fate and stem cell differentiation. As a rate-limiting step in differentiation, its precise regulation is essential for maintaining cellular diversity and tissue homeostasis. Recent advances in single-cell multi-omics technologies have enabled the integration of gene expression data and chromatin structural regulation, thereby enhancing the prediction of cell cycle using multi-omics approaches. However, current algorithms have yet to effectively integrate transcriptome and three-dimensional (3D) genomic data for cell cycle prediction. We propose MomicPred, an innovative dual-branch multi-modal fusion framework designed to predict cell cycle dynamics. This framework integrates transcriptome-derived gene expression data with global chromatin structural insights from 3D genome data. By leveraging the complementary nature of these multi-omics data, MomicPred extracts three core feature sets that uncover cross-layer associations and synergistic interactions between the two omics modalities, enabling high-precision cell cycle prediction. We further evaluate the framework’s performance through various benchmarking strategies, demonstrating its efficiency and robustness. Furthermore, feature importance analysis reveals chromatin structural changes and key biological processes across distinct cell cycle stages, offering new perspectives for future research.

AAAI Conference 2026 Conference Paper

NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation

  • Yuan Gao
  • Hao Wu
  • Fan Xu
  • Yanfei Xiang
  • Ruijian Gou
  • Ruiqi Shu
  • Qingsong Wen
  • Xian Wu

Long-term, high-fidelity simulation of slow-changing physical systems, such as the ocean and climate, presents a fundamental challenge in scientific computing. Traditional autoregressive machine learning models often fail in these tasks as minor errors accumulate and lead to rapid forecast degradation. To address this problem, we propose NeuralOM, a general neural operator framework designed for simulating complex, slow-changing dynamics. NeuralOM's core consists of two key innovations: (1) a Progressive Residual Correction Framework that decomposes the forecasting task into a series of fine-grained refinement steps, effectively suppressing long-term error accumulation; and (2) a Physics-Guided Graph Network whose built-in adaptive messaging mechanism explicitly models multi-scale physical interactions, such as gradient-driven flows and multiplicative couplings, thereby enhancing physical consistency while maintaining computational efficiency. We validate NeuralOM on the challenging task of global Subseasonal-to-Seasonal (S2S) ocean simulation. Extensive experiments demonstrate that NeuralOM not only surpasses state-of-the-art models in forecast accuracy and long-term stability, but also excels in simulating extreme events. For instance, at a 60-day lead time, NeuralOM achieves a 13.3% lower RMSE compared to the best-performing baseline, offering a stable, efficient, and physically-aware paradigm for data-driven scientific computing.

JBHI Journal 2026 Journal Article

Optimizing Accuracy-Efficiency Trade-Offs of On-Device Activity Inference With Star Operation

  • Guangjie Chen
  • Zenan Fu
  • Yetong Sha
  • Di Xiong
  • Lei Zhang
  • Hao Wu
  • Aiguo Song

Lightweight convolution-based neural networks (CNNs) are well suited for sensor-based human activity recognition (HAR) applications on resource-constrained edge devices with faster inference speed. However, the convolutional kernels are often limited to a small window range, which can only capture local details in time series sensor data, thus preventing further performance boost. Though introducing self-attention into convolution can help to handle long-range dependence well, it might significantly slow down actual activity inference speed, due to high computational cost. In this paper, we introduce a new learning paradigm (star operation) and then present a lightweight Dual-Branch High-Order Interactions (DbHoi) block, which is computationally friendly for mobile HAR deployment. The proposed DbHoi block may implicitly transform raw sensor inputs into high-dimensional non-linear features, but actually operate in a low-dimensional feature space (analogous to the design principle of polynomial kernel tricks), without incurring extra computational overhead. Extensive experiments are conducted on three public HAR benchmarks including UCI-HAR, UniMiB-SHAR, and OPPORTUNITY, which demonstrate that our suggested DbHoi can consistently surpass various meticulously designed lightweight networks such as MobileNet, ShuffleNet, and GhostNet. Detailed ablation studies, visualizing representations, and on-device latency analyses further validate our insights with regard to the star operation, while underscoring its practical merit in real-world HAR deployment.

AAAI Conference 2026 Conference Paper

RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

  • Linfeng Dong
  • Yuchen Yang
  • Hao Wu
  • Wei Wang
  • Yuenan Hou
  • Zhihang Zhong
  • Xiao Sun

We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a Cross-Attention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multi-modal analysis in sports.

AAAI Conference 2026 Conference Paper

State Proficiency-Based Adaptive Fine-Tuning for Offline-to-Online Reinforcement Learning

  • Songlin Li
  • Wei Xiao
  • Hao Wu
  • Xiaodan Zhang
  • Daolong An
  • Shuai Lü

In offline-to-online (O2O) reinforcement learning, achieving efficient performance improvement while maintaining training stability remains a critical challenge for effective fine-tuning. Existing O2O methods usually focus on the balance between policy improvement and policy constraint during online fine-tuning. However, they often overlook sample differences, leading to suboptimal performance. To address this challenge, we identify that the effectiveness of policy learning exhibits significant variation across states. Therefore, we propose the notion of state proficiency to capture the degree of effective learning in a given state. We propose State Proficiency-Based Adaptive Fine-Tuning (SPA), a straightforward yet effective method that establishes proficiency-based sample priorities in policy optimization to facilitate effective fine-tuning. Specifically, SPA focuses on low proficiency samples during policy improvement to enhance sample efficiency, while emphasizing high proficiency samples during policy constraint to ensure stable training. Extensive empirical results demonstrate that SPA achieves significant improvements over existing methods, attaining state-of-the-art performance on the D4RL benchmark.

AAAI Conference 2026 Conference Paper

S³-MSD: Large Vision-Language Model for Explainable and Generalizable Multi-modal Sarcasm Detection

  • Zhihong Zhu
  • Fan Zhang
  • Yunyan Zhang
  • Jinghan Sun
  • Guimin Hu
  • Hao Wu
  • Yuyan Chen
  • Bowen Xing

Multimodal sarcasm detection (MSD) aims to identify sarcasm polarity from diverse modalities (i.e., image–text pairs), a task that has received increasing attention. While significant progress has been made, existing approaches still face two major issues: lack of explainability and weak generalizability. In this paper, we introduce a new large vision–language model (LVLM) dubbed S³-MSD for explainable and generalizable MSD through three key components. For explainability, we develop (1) a self-training paradigm that automatically bootstraps answers with explanations, and (2) a self-calibrating mechanism that rectifies flawed explanations. For generalizability, we design (3) a self-focusing module that amplifies visual semantic entities through preference optimization, thereby mitigating textual over-reliance. Experimental results on both in-distribution and out-of-distribution (OOD) benchmarks demonstrate that S³-MSD consistently outperforms state-of-the-art methods in detection performance. Furthermore, the proposed S³-MSD provides persuasive explanations, as verified by both quantitative metrics and human evaluations.

AAAI Conference 2026 Conference Paper

Think How Your Teammates Think: Active Inference Can Benefit Decentralized Execution

  • Hao Wu
  • Shoucheng Song
  • Chang Yao
  • Sheng Han
  • Huaiyu Wan
  • Youfang Lin
  • Kai Lv

In multi-agent systems, explicit cognition of teammates' decision logic serves as a critical factor in facilitating coordination. Communication (i.e., "Tell") can assist in the cognitive development process by information dissemination, yet it is inevitably subject to real-world constraints such as noise, latency, and attacks. Therefore, building the understanding of teammates' decisions without communication remains challenging. To address this, we propose a novel non-communication MARL framework that realizes the construction of cognition through local observation-based modeling (i.e., "Think"). Our framework enables agents to model teammates' active inference process. At first, the proposed method produces three teammate portraits: perception-belief-action. Specifically, we model the teammate's decision process as follows: 1) Perception: observing environments; 2) Belief: forming beliefs; 3) Action: making decisions. Then, we selectively integrate the belief portrait into the decision process based on the accuracy and relevance of the perception portrait. This enables the selection of cooperative teammates and facilitates effective collaboration. Extensive experiments on the SMAC, SMACv2, MPE, and GRF benchmarks demonstrate the superior performance of our method.

ICRA Conference 2025 Conference Paper

A Bio-Inspired Sand-Rolling Robot: Effect of Body Shape on Sand Rolling Performance

  • Xingjue Liao
  • Wenhao Liu
  • Hao Wu
  • Feifei Qian

The capability of effectively moving on complex terrains such as sand and gravel can empower our robots to robustly operate in outdoor environments, and assist with critical tasks such as environment monitoring, search-and-rescue, and supply delivery. Inspired by the Mount Lyell salamander's ability to curl its body into a loop and effectively roll down hill slopes, in this study we develop a sand-rolling robot and investigate how its locomotion performance is governed by the shape of its body. We experimentally tested three different body shapes: Hexagon, Quadrilateral, and Triangle. We found that Hexagon and Triangle can achieve a faster rolling speed on sand, but exhibited more frequent failures of getting stuck. Analysis of the interaction between robot and sand revealed the failure mechanism: the deformation of the sand produced a local “sand incline” underneath robot contact segments, increasing the effective region of supporting polygon (ERSP) and preventing the robot from shifting its center of mass (CoM) outside the ERSP to produce sustainable rolling. Based on this mechanism, a highly-simplified model successfully captured the critical body pitch for each rolling shape to produce sustained rolling on sand, and informed design adaptations that mitigated the locomotion failures and improved robot speed by more than 200%. Our results provide insights into how locomotors can utilize different morphological features to achieve robust rolling motion across deformable substrates.

JBHI Journal 2025 Journal Article

A Rule-Guided Community Detection Method for Identifying Subpopulations in Medical Data

  • Hanyue Liu
  • Hong Yu
  • Hao Wu
  • Guoyin Wang

Precisely identifying and explaining subpopulations in heterogeneous populations is essential to understanding the disease subtype. Using community detection to identify subpopulations is a promising approach. However, there remains an issue in the existing community detection: Current methods for identifying subpopulations in medical data rely solely on separate attribute values, ignoring the important association rules between attribute values. Association rules are crucial in medical diagnosis to determine disease subtypes. Thus, we propose a rule-guided community detection (RGCD) method for precisely identifying homogeneous subpopulations. Specifically, the RGCD incorporates association rules into the original network, thereby constructing an augmented network. It proves that decomposing the embedding vectors obtained from biased random walks on the augmented network is equivalent to decomposing the transition probability matrix. Based on this proof, we enhance the transition probability matrix through rule-guided biased random walks, resulting in the rule-augmented matrix. By performing matrix decomposition and clustering on this matrix, we achieve precise identification of subpopulations. To the best of our knowledge, this is the first work that introduces the incorporation of association rules into community detection. Extensive experiments on 10 real-world datasets from medical fields fully show that the RGCD is more competitive than six state-of-the-art community detection methods. The weighted F1 of RGCD increases by up to 22.62%, compared to the best existing community detection methods. Furthermore, we provide a qualitative depiction of the subpopulations obtained through RGCD and acquire medically significant insights.

NeurIPS Conference 2025 Conference Paper

Breaking the Discretization Barrier of Continuous Physics Simulation Learning

  • Fan Xu
  • Hao Wu
  • Nan Wang
  • Lilan Peng
  • Kun Wang
  • Wei Gong
  • Xibin Zhao

The modeling of complicated time-evolving physical dynamics from partial observations is a long-standing challenge. Particularly, observations can be sparsely distributed in a seemingly random or unstructured manner, making it difficult to capture highly nonlinear features in a variety of scientific and engineering problems. However, existing data-driven approaches are often constrained by fixed spatial and temporal discretization. While some researchers attempt to achieve spatio-temporal continuity by designing novel strategies, they either overly rely on traditional numerical methods or fail to truly overcome the limitations imposed by discretization. To address these, we propose CoPS, a purely data-driven method, to effectively model continuous physics simulation from partial observations. Specifically, we employ a multiplicative filter network to fuse and encode spatial information with the corresponding observations. Then we customize geometric grids and use a message-passing mechanism to map features from the original spatial domain to the customized grids. Subsequently, CoPS models continuous-time dynamics by designing multi-scale graph ODEs, while introducing a Markov-based neural auto-correction module to assist and constrain the continuous extrapolations. Comprehensive experiments demonstrate that CoPS advances the state-of-the-art methods in space-time continuous modeling across various scenarios. The source code is available at https://github.com/Sunxkissed/CoPS.

NeurIPS Conference 2025 Conference Paper

CellVerse: Do Large Language Models Really Understand Cell Biology?

  • Fan Zhang
  • Tianyu Liu
  • Zhihong Zhu
  • Hao Wu
  • Haixin Wang
  • Donghao Zhou
  • Yefeng Zheng
  • Kun Wang

Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs' performance on language-driven single-cell analysis tasks still remains unexplored. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse. Remarkably, the experimental results reveal: (1) Existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CellVerse, while generalist models such as Qwen, Llama, GPT, and DeepSeek family models exhibit preliminary understanding capabilities within the realm of cell biology. (2) The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CellVerse, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis. Project Page: https://cellverse-cuhk.github.io

AAAI Conference 2025 Conference Paper

CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual Alignment of Intent and Timeliness

  • Shoucheng Song
  • Youfang Lin
  • Sheng Han
  • Chang Yao
  • Hao Wu
  • Shuo Wang
  • Kai Lv

Communication has been widely employed to enhance multi-agent collaboration. Previous research has typically assumed delay-free communication, a strong assumption that is challenging to meet in practice. However, real-world agents suffer from channel delays, receiving messages sent at different time points, termed Asynchronous Communication, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle the above delays, this paper proposes a novel framework, Communication Delay-Tolerant Multi-Agent Collaboration (CoDe). At first, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long-term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time-varying delays.

NeurIPS Conference 2025 Conference Paper

Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models

  • Michael Plainer
  • Hao Wu
  • Leon Klein
  • Stephan Günnemann
  • Frank Noe

In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker-Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker-Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.

IJCAI Conference 2025 Conference Paper

From General Relation Patterns to Task-Specific Decision-Making in Continual Multi-Agent Coordination

  • Chang Yao
  • Youfang Lin
  • Shoucheng Song
  • Hao Wu
  • Yuqing Ma
  • Sheng Han
  • Kai Lv

Continual Multi-Agent Reinforcement Learning (Co-MARL) requires agents to address catastrophic forgetting issues while learning new coordination policies with a dynamic team. In this paper, we delve into the core of Co-MARL, namely Relation Patterns, which refer to agents’ general understanding of interactions. In addition to generality, relation patterns exhibit task-specificity when mapped to different action spaces. To this end, we propose a novel method called General Relation Patterns-Guided Task-specific Decision-Maker (RPG). In RPG, agents extract relation patterns from dynamic observation spaces using a relation capturer. These task-agnostic relation patterns are then mapped to different action spaces via a task-specific decision-maker generated by a conditional hypernetwork. To combat forgetting, we further introduce regularization terms on both the relation capturer and the conditional hypernetwork. Results on SMAC and LBF demonstrate that RPG effectively prevents catastrophic forgetting when learning new tasks and achieves zero-shot generalization to unseen tasks.

AAAI Conference 2025 Conference Paper

FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models

  • Mohammadreza Samadi
  • Fred X. Han
  • Mohammad Salameh
  • Hao Wu
  • Fengyu Sun
  • Chunhua Zhou
  • Di Niu

Diffusion models have demonstrated outstanding performance in generative tasks, making them ideal candidates for image editing. Recent studies highlight their ability to apply desired edits effectively by following textual instructions, yet with two key challenges remaining. First, these models struggle to apply multiple edits simultaneously, resulting in computational inefficiencies due to their reliance on sequential processing. Second, relying on textual prompts to determine the editing region can lead to unintended alterations to the image. We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. This approach enables complex editing tasks, such as object movement, by aggregating multiple functions and applying them simultaneously to specific areas. Our experiments demonstrate that FunEditor significantly outperforms recent inference-time optimization methods and fine-tuned models, either quantitatively across various metrics or through visual comparisons or both, on complex tasks like object movement and object pasting. In the meantime, with only 4 steps of inference, FunEditor achieves 5 to 24 times inference speedups over existing popular methods.

ICLR Conference 2025 Conference Paper

Learning Graph Quantized Tokenizers

  • Limei Wang
  • Kaveh Hassani
  • Si Zhang
  • Dongqi Fu
  • Baichuan Yuan
  • Weilin Cong
  • Zhigang Hua
  • Hao Wu

Transformers serve as the backbone architectures of Foundational Models, where domain-specific tokenizers allow them to adapt to various domains. Graph Transformers (GTs) have recently emerged as leading models in geometric deep learning, outperforming Graph Neural Networks (GNNs) in various graph learning tasks. However, the development of tokenizers for graphs has lagged behind other modalities, with existing approaches relying on heuristics or GNNs co-trained with Transformers. To address this, we introduce GQT (Graph Quantized Tokenizer), which decouples tokenizer training from Transformer training by leveraging multi-task graph self-supervised learning, yielding robust and generalizable graph tokens. Furthermore, the GQT utilizes Residual Vector Quantization (RVQ) to learn hierarchical discrete tokens, resulting in significantly reduced memory requirements and improved generalization capabilities. By combining the GQT with token modulation, a Transformer encoder achieves state-of-the-art performance on 20 out of 22 benchmarks, including large-scale homophilic and heterophilic datasets. The implementation is publicly available at https://github.com/limei0307/GQT.

JBHI Journal 2025 Journal Article

Learning Sensor Sample-Reweighting for Dynamic Early-Exit Activity Recognition Via Meta Learning

  • Zenan Fu
  • Lei Zhang
  • Wenbo Huang
  • Dongzhou Cheng
  • Hao Wu
  • Aiguo Song

In recent years, dynamic early-exit has provided a promising paradigm to improve the computational efficiency of deep neural networks by constructing multiple classifiers that let easy samples exit at shallow layers while avoiding redundant computations at deep exits. However, it has seldom been explored in the context of latency-aware human activity recognition (HAR) deployed on wearable devices. In particular, most existing early-exit strategies treat all activity samples equally at each exit during training, which ignores the dynamic early-exit behavior at test time and causes a potential mismatch between training and testing. Intuitively, easy activity samples that often exit earlier at test time should place more emphasis on the training loss of shallow classifiers, while hard activity samples should contribute more to the training loss of deep classifiers. To bridge this gap, this paper introduces a sample-reweighting approach for efficient activity inference, which employs a weight-predicting network to reweight the training loss of different activity samples at every exit. From the perspective of meta learning, a new optimization objective function is designed to jointly optimize both the weight-predicting network and the backbone network. We perform extensive experiments on three popular HAR benchmarks, UCI-HAR, WISDM, and UniMiB-SHAR, which demonstrate that incorporating such test-time early-exit behavior into the conventional training pipeline consistently improves the accuracy-efficiency trade-offs under budgeted batch classification and anytime prediction patterns. Moreover, our approach has a natural advantage in handling the class-imbalance HAR problem. Detailed ablation studies, visualized illustrations, and real hardware deployment are provided to support our claims.

ECAI Conference 2025 Conference Paper

Modularity and Temporal Proximity Enhanced Nonnegative Tensor Latent Factorization for Accurate Dynamic Community Detection

  • Hao Fang
  • Hao Wu

Most real-world networks involved in big data applications are dynamic, making accurate identification of community structures crucial for optimizing and predicting the behavior of network individuals. When addressing dynamic community detection, existing algorithms cannot simultaneously model the spatiotemporal patterns and node attentions appropriately, resulting in a loss of detection accuracy. Motivated by the above issues, this paper presents a Modularity and Temporal proximity enhanced Nonnegative Tensor latent factorization (MTNT) method built on three ideas: a) utilizing the nonnegative RESCAL framework to represent the dynamic evolution and potential community structure; b) developing a modularity enhancement module to guarantee the spatial consistency between the detected communities and the target network’s intrinsic properties; c) introducing the node temporal proximity, calculated by temporal personalized PageRank, into the contrastive loss to significantly boost the features’ community semantics. Extensive experimental results obtained from six dynamic networks from real applications demonstrate that MTNT is superior to state-of-the-art community detectors, and the convergence of MTNT is verified.

ICLR Conference 2025 Conference Paper

Open-CK: A Large Multi-Physics Fields Coupling benchmarks in Combustion Kinetics

  • Zaige Fei
  • Fan Xu 0009
  • Junyuan Mao
  • Yuxuan Liang
  • Qingsong Wen
  • Kun Wang 0056
  • Hao Wu
  • Yang Wang 0015

In this paper, we use the Fire Dynamics Simulator (FDS) combined with supercomputer support to create a Combustion Kinetics (CK) dataset for machine learning and scientific research. This dataset captures the development of fires in industrial parks with high-precision Computational Fluid Dynamics (CFD) simulations. It includes various physical fields such as temperature and pressure, and covers multiple environmental combinations for exploring multi-physics field coupling phenomena. Additionally, we evaluate several advanced machine learning architectures across our Open-CK benchmark using a substantial computational setup of 64 NVIDIA A100 GPUs: (1) vision backbone; (2) spatio-temporal predictive models; (3) operator learning frameworks. These architectures uniquely excel at handling complex physical field data. We also introduce three benchmarks to demonstrate their potential in enhancing the exploration of downstream tasks: (a) capturing continuous changes in combustion kinetics; (b) a neural partial differential equation solver for learning temperature fields and turbulence; (c) reconstruction of sparse physical observations. The Open-CK dataset and benchmarks aim to advance research in combustion kinetics driven by machine learning, providing a reliable baseline for developing and comparing cutting-edge technologies and models. We hope to further promote the application of deep learning in earth sciences. Our project is available at https://github.com/whscience/Open-CK.

AAAI Conference 2025 Conference Paper

PriFold: Biological Priors Improve RNA Secondary Structure Predictions

  • Chenchen Yang
  • Hao Wu
  • Tao Shen
  • Kai Zou
  • Siqi Sun

Predicting RNA secondary structures is crucial for understanding RNA function, designing RNA-based therapeutics, and studying molecular interactions within cells. Existing deep-learning-based methods for RNA secondary structure prediction have mainly focused on local structural properties, often overlooking the global characteristics and evolutionary features of RNA sequences. Guided by biological priors, we propose PriFold, incorporating two key innovations: 1) improving attention mechanism with pairing probabilities to utilize global pairing characteristics, and 2) implementing data augmentation based on RNA covariation to leverage evolutionary information. Our structured enhanced pretraining and finetuning strategy significantly optimizes model performance. Extensive experiments demonstrate that PriFold achieves state-of-the-art (SOTA) results in RNA secondary structure prediction on benchmark datasets such as bpRNA, RNAStrAlign and ArchiveII. These results not only validate our prediction approach but also highlight the potential of integrating biological priors, such as global characteristics and evolutionary information, into RNA structure prediction tasks, opening new avenues for research in RNA biology and bioinformatics.

NeurIPS Conference 2025 Conference Paper

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

  • Yuhao Zhou
  • Yiheng Wang
  • Xuming He
  • Ruoyao Xiao
  • Zhiwei Li
  • Qiantai Feng
  • Zijie Guo
  • Yuejin Yang

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.

AAAI Conference 2025 Conference Paper

VERO: Verification and Zero-Shot Feedback Acquisition for Few-Shot Multimodal Aspect-Level Sentiment Classification

  • Kai Sun
  • Hao Wu
  • Bin Shi
  • Samuel Mensah
  • Peng Liu
  • Bo Dong

Deep learning approaches for multimodal aspect-level sentiment classification (MALSC) often require extensive data, which is costly and time-consuming to obtain. To mitigate this, current methods typically fine-tune small-scale pretrained models like BERT and BART with few-shot examples. While these models have shown success, Large Vision-Language Models (LVLMs) offer significant advantages due to their greater capacity and ability to understand nuanced language in both zero-shot and few-shot settings. However, there is limited work on fine-tuning LVLMs for MALSC. A major challenge lies in selecting few-shot examples that effectively capture the underlying patterns in data for these LVLMs. To bridge this research gap, we propose an acquisition function designed to select challenging samples for the few-shot learning of LVLMs for MALSC. We compare our approach, Verification and ZERO-shot feedback acquisition (VERO), with diverse acquisition functions for few-shot learning in MALSC. Our experiments show that VERO outperforms prior methods, achieving an F1 score improvement of up to 6.07% on MALSC benchmark datasets.

NeurIPS Conference 2024 Conference Paper

Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

  • Yifan Duan
  • Jian Zhao
  • Junyuan Mao
  • Hao Wu
  • Jingyu Xu
  • Shilong Wang
  • Caoyuan Ma
  • Kai Wang

Spatio-temporal (ST) prediction has attracted wide attention in earth sciences, in areas such as meteorological prediction and human mobility perception. However, the scarcity of data coupled with the high expenses involved in sensor deployment results in notable data imbalances. Furthermore, models that are excessively customized and devoid of causal connections further undermine generalizability and interpretability. To this end, we establish a causal framework for ST predictions, termed CaPaint, which aims to identify causal regions in data and endow models with causal reasoning ability in a two-stage process. Going beyond this process, we utilize the back-door adjustment to specifically address the sub-regions identified as non-causal in the upstream phase. Specifically, we employ a novel image inpainting technique. By using a fine-tuned unconditional Diffusion Probabilistic Model (DDPM) as the generative prior, we in-fill the masks defined as environmental parts, offering the possibility of reliable extrapolation for potential data distributions. CaPaint overcomes the high complexity dilemma of optimal ST causal discovery models by reducing the data generation complexity from exponential to quasi-linear levels. Extensive experiments conducted on five real-world ST benchmarks demonstrate that integrating the CaPaint concept allows models to achieve improvements ranging from 4.3% to 77.3%. Moreover, compared to traditional mainstream ST augmenters, CaPaint underscores the potential of diffusion models in ST enhancement, offering a novel paradigm for this field. Our project is available at https://anonymous.4open.science/r/12345-DFCC.

NeurIPS Conference 2024 Conference Paper

Divide-and-Conquer Predictive Coding: a structured Bayesian inference algorithm

  • Eli Sennesh
  • Hao Wu
  • Tommaso Salvatori

Unexpected stimuli induce "error" or "surprise" signals in the brain. The theory of predictive coding promises to explain these observations in terms of Bayesian inference by suggesting that the cortex implements variational inference in a probabilistic graphical model. However, when applied to machine learning tasks, this family of algorithms has yet to perform on par with other variational approaches in high-dimensional, structured inference problems. To address this, we introduce a novel predictive coding algorithm for structured generative models, that we call divide-and-conquer predictive coding (DCPC); it differs from other formulations of predictive coding, as it respects the correlation structure of the generative model and provably performs maximum-likelihood updates of model parameters, all without sacrificing biological plausibility. Empirically, DCPC achieves better numerical performance than competing algorithms and provides accurate inference in a number of problems not previously addressed with predictive coding. We provide an open implementation of DCPC in Pyro on Github.

AAAI Conference 2024 Conference Paper

Earthfarsser: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model

  • Hao Wu
  • Yuxuan Liang
  • Wei Xiong
  • Zhengyang Zhou
  • Wei Huang
  • Shilong Wang
  • Kun Wang

Efficiently modeling spatio-temporal (ST) physical processes and observations presents a challenging problem for the deep learning community. Many recent studies have concentrated on meticulously reconciling various advantages, leading to designed models that are neither simple nor practical. To address this issue, this paper presents a systematic study of the shortcomings of off-the-shelf models, including lack of local fidelity, poor prediction performance over long time-steps, low scalability, and inefficiency. To systematically address these problems, we propose EarthFarseer, a concise framework that combines parallel local convolutions and global Fourier-based transformer architectures, enabling it to dynamically capture local-global spatial interactions and dependencies. EarthFarseer also incorporates multi-scale fully convolutional and Fourier architectures to efficiently and effectively capture the temporal evolution. Our proposal demonstrates strong adaptability across various tasks and datasets, with fast convergence and better local fidelity in long time-step predictions. Extensive experiments and visualizations over eight human society physical and natural physical datasets demonstrate the state-of-the-art performance of EarthFarseer. We release our code at https://github.com/easylearningscores/EarthFarseer.

NeurIPS Conference 2024 Conference Paper

Faster Differentially Private Top-$k$ Selection: A Joint Exponential Mechanism with Pruning

  • Hao Wu
  • Hanwen Zhang

We study the differentially private top-$k$ selection problem, aiming to identify a sequence of $k$ items with approximately the highest scores from $d$ items. Recent work by Gillenwater et al. (2022) employs a direct sampling approach from the vast collection of $O(d^k)$ possible length-$k$ sequences, showing superior empirical accuracy compared to previous pure or approximate differentially private methods. Their algorithm has a time and space complexity of $\tilde{O}(dk)$. In this paper, we present an improved algorithm that achieves time and space complexity of $\tilde{O}(d + k^2)$. Experimental results show that our algorithm runs orders of magnitude faster than their approach, while achieving similar empirical accuracy.
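For orientation, the sketch below shows the classic "peeling" baseline for private top-$k$ selection: it applies the exponential mechanism $k$ times, each with budget $\varepsilon/k$, removing the chosen item after every round. This is a swapped-in textbook baseline, neither the joint mechanism of Gillenwater et al. nor the pruned algorithm of this paper. The function name, the `sensitivity` parameter, and its default of 1 are assumptions made for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def peeling_top_k(
    scores: Sequence[float],
    k: int,
    epsilon: float,
    sensitivity: float = 1.0,  # assumed score sensitivity, hypothetical default
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Select k distinct indices by k rounds of the exponential mechanism."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    rng = rng or random.Random()
    eps_round = epsilon / k  # basic composition: k rounds sum to epsilon
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    for _ in range(k):
        top = max(scores[i] for i in remaining)
        # Subtracting the max score keeps exp() from overflowing and does not
        # change the sampling distribution.
        weights = [
            math.exp(eps_round * (scores[i] - top) / (2.0 * sensitivity))
            for i in remaining
        ]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# With a very large epsilon the output is effectively the exact top-k order.
result = peeling_top_k([5.0, 1.0, 9.0, 3.0], k=2, epsilon=1e4, rng=random.Random(0))
```

Smaller values of `epsilon` flatten the sampling weights, so lower-scoring items are returned more often; that is the accuracy cost of the privacy guarantee.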

JBHI Journal 2024 Journal Article

lncLocator-imb: An Imbalance-Tolerant Ensemble Deep Learning Framework for Predicting Long Non-Coding RNA Subcellular Localization

  • Haibin Liu
  • Dianguo Li
  • Hao Wu

Recent studies have highlighted the critical roles of long non-coding RNAs (lncRNAs) in various biological processes, including but not limited to dosage compensation, epigenetic regulation, cell cycle regulation, and cell differentiation regulation. Consequently, lncRNAs have emerged as a central focus in genetic studies. The identification of the subcellular localization of lncRNAs is essential for gaining insights into crucial information about lncRNA interaction partners, post- or co-transcriptional regulatory modifications, and external stimuli that directly impact the function of lncRNA. Computational methods have emerged as a promising avenue for predicting the subcellular localization of lncRNAs. However, there is a need for additional enhancement in the performance of current methods when dealing with unbalanced data sets. To address this challenge, we propose a novel ensemble deep learning framework, termed lncLocator-imb, for predicting the subcellular localization of lncRNAs. To fully exploit lncRNA sequence information, lncLocator-imb integrates two base classifiers, including convolutional neural networks (CNN) and gated recurrent units (GRU). Additionally, it incorporates two distinct types of features, including the physicochemical pattern feature and the distributed representation of nucleic acids feature. To address the problem of poor performance exhibited by models when confronted with unbalanced data sets, we utilize the label-distribution-aware margin (LDAM) loss function during the training process. Compared with traditional machine learning models and currently available predictors, lncLocator-imb demonstrates more robust category imbalance tolerance. Our study proposes an ensemble deep learning framework for predicting the subcellular localization of lncRNAs. Additionally, a novel approach is presented for the management of different features and the resolution of unbalanced data sets. 
The proposed framework exhibits the potential to serve as a significant resource for various sequence-based prediction tasks, providing a versatile tool that can be utilized by professionals in the fields of bioinformatics and genetics.

JBHI Journal 2024 Journal Article

MaskCAE: Masked Convolutional AutoEncoder via Sensor Data Reconstruction for Self-Supervised Human Activity Recognition

  • Dongzhou Cheng
  • Lei Zhang
  • Lutong Qin
  • Shuoyuan Wang
  • Hao Wu
  • Aiguo Song

Self-supervised Human Activity Recognition (HAR) has been gaining increasing attention in the ubiquitous computing community. Its current focus primarily lies in how to overcome the challenge of manually labeling complicated and intricate sensor data from wearable devices, which is often hard to interpret. However, current self-supervised algorithms encounter three main challenges: performance variability caused by data augmentations in the contrastive learning paradigm, limitations imposed by traditional self-supervised models, and the computational load placed on wearable devices by current mainstream transformer encoders. To comprehensively tackle these challenges, this paper proposes a powerful self-supervised approach for HAR from a novel perspective of denoising autoencoder, the first of its kind to explore how to reconstruct masked sensor data built on a commonly employed, well-designed, and computationally efficient fully convolutional network. Extensive experiments demonstrate that our proposed Masked Convolutional AutoEncoder (MaskCAE) outperforms current state-of-the-art algorithms in self-supervised, fully supervised, and semi-supervised situations without relying on any data augmentations, which fills the gap of masked sensor data modeling in the HAR area. Visualization analyses show that our MaskCAE can effectively capture temporal semantics in time series sensor data, indicating its great potential in modeling abstracted sensor data. An actual implementation is evaluated on an embedded platform.

NeurIPS Conference 2024 Conference Paper

PURE: Prompt Evolution with Graph ODE for Out-of-distribution Fluid Dynamics Modeling

  • Hao Wu
  • Changhu Wang
  • Fan Xu
  • Jinbao Xue
  • Chong Chen
  • Xian-Sheng Hua
  • Xiao Luo

This work studies the problem of out-of-distribution fluid dynamics modeling. Previous works usually design effective neural operators to learn from mesh-based data structures. However, in real-world applications, they would suffer from distribution shifts from the variance of system parameters and temporal evolution of the dynamical system. In this paper, we propose a novel approach named Prompt Evolution with Graph ODE (PURE) for out-of-distribution fluid dynamics modeling. The core of our PURE is to learn time-evolving prompts using a graph ODE to adapt spatio-temporal forecasting models to different scenarios. In particular, our PURE first learns from historical observations and system parameters in the frequency domain to explore multi-view context information, which could effectively initialize prompt embeddings. More importantly, we incorporate the interpolation of observation sequences into a graph ODE, which can capture the temporal evolution of prompt embeddings for model adaptation. These time-evolving prompt embeddings are then incorporated into basic forecasting models to overcome temporal distribution shifts. We also minimize the mutual information between prompt embeddings and observation embeddings to enhance the robustness of our model to different distributions. Extensive experiments on various benchmark datasets validate the superiority of the proposed PURE in comparison to various baselines.

AAAI Conference 2024 Conference Paper

Revisiting Graph-Based Fraud Detection in Sight of Heterophily and Spectrum

  • Fan Xu
  • Nan Wang
  • Hao Wu
  • Xuezhi Wen
  • Xibin Zhao
  • Hai Wan

Graph-based fraud detection (GFD) can be regarded as a challenging semi-supervised node binary classification task. In recent years, Graph Neural Networks (GNNs) have been widely applied to GFD, characterizing the anomalous possibility of a node by aggregating neighbor information. However, fraud graphs are inherently heterophilic, so most GNNs perform poorly due to their assumption of homophily. In addition, due to heterophily and the class imbalance problem, existing models do not fully utilize the precious node label information. To address the above issues, this paper proposes a semi-supervised GNN-based fraud detector, SEC-GFD. This detector includes a hybrid filtering module and a local environmental constraint module, which are used to address the heterophily and label utilization problems respectively. The first module starts from the perspective of the spectral domain and solves the heterophily problem to a certain extent. Specifically, it divides the spectrum into various mixed-frequency bands based on the correlation between spectrum energy distribution and heterophily. Then, in order to make full use of the node label information, a local environmental constraint module is adaptively designed. Comprehensive experimental results on four real-world fraud detection datasets show that SEC-GFD outperforms other competitive graph-based fraud detectors. We release our code at https://github.com/Sunxkissed/SEC-GFD.

NeurIPS Conference 2024 Conference Paper

Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

  • Shuaipeng Li
  • Penghao Zhao
  • Hailin Zhang
  • Xingwu Sun
  • Hao Wu
  • Dian Jiao
  • Weiyan Wang
  • Chengjun Liu

In current deep learning tasks, Adam-style optimizers (such as Adam, Adagrad, RMSprop, Adafactor, and Lion) have been widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of gradients, resulting in more stable convergence curves. The learning rate and the batch size are the most critical hyperparameters for optimizers, which require careful tuning to enable effective convergence. Previous research has shown that the optimal learning rate increases linearly (or follows similar rules) with batch size for SGD-style optimizers. However, this conclusion is not applicable to Adam-style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam-style optimizers through both theoretical analysis and extensive experiments. First, we establish the scaling law between batch sizes and optimal learning rates in the “sign of gradient” case, in which we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak value of the surge gradually moves toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law.

AAAI Conference 2024 Conference Paper

UCMCTrack: Multi-Object Tracking with Uniform Camera Motion Compensation

  • Kefu Yi
  • Kai Luo
  • Xiaolei Luo
  • Jiangui Huang
  • Hao Wu
  • Rongdong Hu
  • Wei Hao

Multi-object tracking (MOT) in video sequences remains a challenging task, especially in scenarios with significant camera movements. This is because targets can drift considerably on the image plane, leading to erroneous tracking outcomes. Addressing such challenges typically requires supplementary appearance cues or Camera Motion Compensation (CMC). While these strategies are effective, they also introduce a considerable computational burden, posing challenges for real-time MOT. In response to this, we introduce UCMCTrack, a novel motion model-based tracker robust to camera movements. Unlike conventional CMC that computes compensation parameters frame-by-frame, UCMCTrack consistently applies the same compensation parameters throughout a video sequence. It employs a Kalman filter on the ground plane and introduces the Mapped Mahalanobis Distance (MMD) as an alternative to the traditional Intersection over Union (IoU) distance measure. By leveraging projected probability distributions on the ground plane, our approach efficiently captures motion patterns and adeptly manages uncertainties introduced by homography projections. Remarkably, UCMCTrack, relying solely on motion cues, achieves state-of-the-art performance across a variety of challenging datasets, including MOT17, MOT20, DanceTrack and KITTI. More details and code are available at https://github.com/corfyi/UCMCTrack.
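To make the distance measure in this abstract more concrete, the sketch below computes a plain Mahalanobis distance between a predicted ground-plane position and a measurement, given a 2x2 covariance. It illustrates only the generic distance; how UCMCTrack maps image-plane detection noise onto the ground plane is not described in the abstract and is not modeled here. The function name and the example covariance are hypothetical.

```python
import math
from typing import Sequence


def mahalanobis_2d(
    measurement: Sequence[float],
    prediction: Sequence[float],
    cov: Sequence[Sequence[float]],
) -> float:
    """Distance between two ground-plane points under a 2x2 covariance."""
    dx = measurement[0] - prediction[0]
    dy = measurement[1] - prediction[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    # Closed-form inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]].
    quad = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(quad)


# Hypothetical example: the position is four times more uncertain along x.
dist = mahalanobis_2d((2.0, 1.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0)))
```

Unlike IoU, this distance remains informative when two boxes no longer overlap, since it scales the offset by the uncertainty in each direction.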

ICLR Conference 2024 Conference Paper

VCR-Graphormer: A Mini-batch Graph Transformer via Virtual Connections

  • Dongqi Fu
  • Zhigang Hua
  • Yan Xie
  • Jin Fang
  • Si Zhang
  • Kaan Sancak
  • Hao Wu
  • Andrey Malevich

Graph transformer has been proven as an effective graph learning method for its adoption of attention mechanism that is capable of capturing expressive representations from complex topological and feature information of graphs. Graph transformer conventionally performs dense attention (or global attention) for every pair of nodes to learn node representation vectors, resulting in quadratic computational costs that are unaffordable for large-scale graph data. Therefore, mini-batch training for graph transformers is a promising direction, but limited samples in each mini-batch cannot support effective dense attention to encode informative representations. Facing this bottleneck, (1) we start by assigning each node a token list that is sampled by personalized PageRank (PPR) and then apply standard multi-head self-attention only on this list to compute its node representations. This PPR tokenization method decouples model training from complex graph topological information and makes heavy feature engineering offline and independent, such that mini-batch training of graph transformers is possible by loading each node's token list in batches. We further prove this PPR tokenization is viable as a graph convolution network with a fixed polynomial filter and jumping knowledge. However, only using personalized PageRank may limit information carried by a token list, which could not support different graph inductive biases for model training. To this end, (2) we rewire graphs by introducing multiple types of virtual connections through structure- and content-based super nodes that enable PPR tokenization to encode local and global contexts, long-range interaction, and heterophilous information into each node's token list, and then formalize our Virtual Connection Ranking based Graph Transformer (VCR-Graphormer).
Overall, VCR-Graphormer needs $O(m + k\log k)$ complexity for graph tokenization as compared to $O(n^{3})$ of previous works. The code is provided at https://github.com/DongqiFu/VCR-Graphormer.
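The PPR tokenization step can be pictured with a small sketch: compute personalized PageRank scores from a source node by power iteration, then keep the highest-scoring nodes as that node's token list. This is a simplified reading of the abstract and not the released implementation. The function names, the top-k selection rule, the dangling-node handling, and the toy graph are all assumptions.

```python
from typing import Dict, List


def personalized_pagerank(
    adj: Dict[int, List[int]], source: int, alpha: float = 0.15, iters: int = 100
) -> Dict[int, float]:
    """Power iteration for p = alpha * e_source + (1 - alpha) * P^T p."""
    scores = {v: (1.0 if v == source else 0.0) for v in adj}
    for _ in range(iters):
        nxt = {v: (alpha if v == source else 0.0) for v in adj}
        for u, nbrs in adj.items():
            if not nbrs:
                # Assumption: a node without out-edges returns its mass to the source.
                nxt[source] += (1.0 - alpha) * scores[u]
                continue
            share = (1.0 - alpha) * scores[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        scores = nxt
    return scores


def token_list(adj: Dict[int, List[int]], source: int, k: int, **kwargs) -> List[int]:
    """Top-k nodes by PPR score (ties broken by node id)."""
    scores = personalized_pagerank(adj, source, **kwargs)
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


# Hypothetical toy graph: an undirected path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
tokens_for_node_0 = token_list(path, source=0, k=3, alpha=0.5)
```

Because each token list depends only on the graph and can be computed ahead of time, training can later load lists in batches without touching the full adjacency structure.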

TMLR Journal 2024 Journal Article

Vision Learners Meet Web Image-Text Pairs

  • Bingchen Zhao
  • Quan Cui
  • Hao Wu
  • Osamu Yoshie
  • Cheng Yang
  • Oisin Mac Aodha

Most recent self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretical view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.

TIST Journal 2023 Journal Article

3D-Guided Frontal Face Generation for Pose-Invariant Recognition

  • Hao Wu
  • Jianyang Gu
  • Xiaojin Fan
  • He Li
  • Lidong Xie
  • Jian Zhao

Although deep learning techniques have achieved extraordinary accuracy in recognizing human faces, the pose variances of images captured in real-world scenarios still hinder reliable model application. To mitigate this gap, we propose to recognize faces by generating frontal face images with a 3D-Guided Deep Pose-Invariant Face Recognition Model (3D-PIM) consisting of a simulator and a refiner module. The simulator employs a 3D Morphable Model (3DMM) to fit the shape and appearance features and recover primary frontal images with less training data. The refiner further enhances the image realism on both global facial structure and local details with adversarial training, while keeping the discriminative identity information consistent with original images. An Adaptive Weighting (AW) metric is then adopted to leverage the complementary information from recovered frontal faces and original profile faces and to obtain credible similarity scores for recognition. Extensive experiments verify the superiority of the proposed “recognition via generation” framework over state-of-the-art methods.

ICLR Conference 2023 Conference Paper

Do We Really Need Complicated Model Architectures For Temporal Networks?

  • Weilin Cong
  • Si Zhang
  • Jian Kang 0008
  • Baichuan Yuan
  • Hao Wu
  • Xin Zhou
  • Hanghang Tong
  • Mehrdad Mahdavi

Recurrent neural network (RNN) and self-attention mechanism (SAM) are the de facto methods to extract spatial-temporal information for temporal graph learning. Interestingly, we found that although both RNN and SAM could lead to a good performance, in practice neither of them is always necessary. In this paper, we propose GraphMixer, a conceptually and technically simple architecture that consists of three components: (1) a link-encoder that is only based on multi-layer perceptrons (MLP) to summarize the information from temporal links, (2) a node-encoder that is only based on neighbor mean-pooling to summarize node information, and (3) an MLP-based link classifier that performs link prediction based on the outputs of the encoders. Despite its simplicity, GraphMixer attains an outstanding performance on temporal link prediction benchmarks with faster convergence and better generalization performance. These results motivate us to rethink the importance of simpler model architecture.

JBHI Journal 2023 Journal Article

FreqSense: Adaptive Sampling Rates for Sensor-Based Human Activity Recognition Under Tunable Computational Budgets

  • Guangyu Yang
  • Lei Zhang
  • Can Bu
  • Shuaishuai Wang
  • Hao Wu
  • Aiguo Song

Recent years have witnessed great success of deep convolutional networks in sensor-based human activity recognition (HAR), yet their practical deployment remains a challenge due to the varying computational budgets required to obtain a reliable prediction. This article focuses on adaptive inference from a novel perspective of signal frequency, which is motivated by an intuition that low-frequency features are enough for recognizing “easy” activity samples, while only “hard” activity samples need temporally detailed information. We propose an adaptive resolution network by combining a simple subsampling strategy with conditional early-exit. Specifically, it comprises multiple subnetworks with different resolutions, where “easy” activity samples are first classified by a lightweight subnetwork using the lowest sampling rate, while the subsequent subnetworks at higher resolutions are applied sequentially whenever the previous one fails to reach a confidence threshold. Such a dynamic decision process can adaptively select a proper sampling rate for each activity sample, conditioned on the input, as the budget varies; it terminates once enough confidence is obtained, hence avoiding excessive computations. Comprehensive experiments on four diverse HAR benchmark datasets demonstrate the effectiveness of our method in terms of accuracy-cost tradeoff. We benchmark the average latency on real hardware.
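The early-exit idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `adaptive_predict`, the strides, the threshold, and the toy subnetworks are hypothetical stand-ins, not the authors' architecture.

```python
def adaptive_predict(signal, subnets, strides=(4, 2, 1), threshold=0.9):
    """Try coarse-to-fine sampling rates; stop at the first confident subnetwork.

    subnets: hypothetical callables mapping a subsampled signal to class probabilities.
    Returns (predicted_label, stride_used).
    """
    label, stride = None, None
    for stride, net in zip(strides, subnets):
        probs = net(signal[::stride])      # simple subsampling = lower sampling rate
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:        # early exit: confident enough, skip finer nets
            break
    return label, stride


# Toy stand-ins for trained subnetworks (fixed outputs, for illustration only).
low_res = lambda x: [0.60, 0.40]
mid_res = lambda x: [0.95, 0.05]
full_res = lambda x: [0.99, 0.01]

signal = list(range(16))
result = adaptive_predict(signal, [low_res, mid_res, full_res])
```

In this toy run the lowest-rate subnetwork is not confident, so the sample exits at the second subnetwork; raising or lowering `threshold` trades accuracy against computation.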

JBHI Journal 2023 Journal Article

IChrom-Deep: An Attention-Based Deep Learning Model for Identifying Chromatin Interactions

  • Pengyu Zhang
  • Hao Wu

Identification of chromatin interactions is crucial for advancing our knowledge of gene regulation. However, due to the limitations of high-throughput experimental techniques, there is an urgent need to develop computational methods for predicting chromatin interactions. In this study, we propose a novel attention-based deep learning model, termed IChrom-Deep, to identify chromatin interactions using sequence features and genomic features. The experimental results based on the datasets of three cell lines demonstrate that the IChrom-Deep achieves satisfactory performance and is superior to the previous methods. We also investigate the effect of DNA sequence and associated features and genomic features on chromatin interactions, and highlight the applicable scenarios of some features, such as sequence conservation and distance. Moreover, we identify a few genomic features that are extremely important across different cell lines, and IChrom-Deep achieves comparable performance with only these significant genomic features versus using all genomic features. It is believed that IChrom-Deep can serve as a useful tool for future studies that seek to identify chromatin interactions.

NeurIPS Conference 2023 Conference Paper

IDEA: An Invariant Perspective for Efficient Domain Adaptive Image Retrieval

  • Haixin Wang
  • Hao Wu
  • Jinan Sun
  • Shikun Zhang
  • Chong Chen
  • Xian-Sheng Hua
  • Xiao Luo

In this paper, we investigate the problem of unsupervised domain adaptive hashing, which leverages knowledge from a label-rich source domain to expedite learning to hash on a label-scarce target domain. Although numerous existing approaches attempt to incorporate transfer learning techniques into deep hashing frameworks, they often neglect the essential invariance for adequate alignment between these two domains. Worse yet, these methods fail to distinguish between causal and non-causal effects embedded in images, rendering cross-domain retrieval ineffective. To address these challenges, we propose an Invariance-acquired Domain AdaptivE HAshing (IDEA) model. Our IDEA first decomposes each image into a causal feature representing label information, and a non-causal feature indicating domain information. Subsequently, we generate discriminative hash codes using causal features with consistency learning on both source and target domains. More importantly, we employ a generative model for synthetic samples to simulate the intervention of various non-causal effects, ultimately minimizing their impact on hash codes for domain invariance. Comprehensive experiments conducted on benchmark datasets validate the superior performance of our IDEA compared to a variety of competitive baselines.

JBHI Journal 2023 Journal Article

ProtoHAR: Prototype Guided Personalized Federated Learning for Human Activity Recognition

  • Dongzhou Cheng
  • Lei Zhang
  • Can Bu
  • Xing Wang
  • Hao Wu
  • Aiguo Song

Federated Learning (FL) has recently attracted great interest in sensor-based human activity recognition (HAR) tasks. However, in real-world environments, sensor data on devices is non-independently and identically distributed (Non-IID), e.g., activity data recorded by most devices is sparse, and sensor data distribution for each client may be inconsistent. As a result, the traditional FL methods in the heterogeneous environment may incur a drifted global model that causes slow convergence and a heavy communication burden. Although some FL methods are gradually being applied to HAR, they are designed for overly ideal scenarios and do not address such Non-IID problems in the real-world setting. It is still a question whether they can be applied to cross-device FL. To tackle this challenge, we propose ProtoHAR, a prototype-guided FL framework for HAR, which aims to decouple the representation and classifier in the heterogeneous FL setting efficiently. It leverages the global prototype to correct the activity feature representation to make the prototype knowledge flow among clients without leaking privacy while solving a better classifier to avoid excessive drift of the local model in personalized training. Extensive experiments are conducted on four publicly available datasets: USC-HAD, UNIMIB-SHAR, PAMAP2, and HARBOX, which are collected in both controlled environments and real-world scenarios. The results show that compared with the state-of-the-art FL algorithms, ProtoHAR achieves the best performance and faster convergence speed in HAR datasets.

ICML Conference 2023 Conference Paper

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

  • Guangxuan Xiao
  • Ji Lin 0002
  • Mickaël Seznec
  • Hao Wu
  • Julien Demouth
  • Song Han 0003

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, and LLaMA family. We demonstrate up to 1.56× speedup and 2× memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.
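The "mathematically equivalent transformation" described above can be sketched as a per-channel rescaling: divide each activation channel by a factor and multiply the matching weight row by the same factor, so the product is unchanged while activation outliers shrink. The sketch below is a plain-Python illustration; the helper names, the toy matrices, and the migration-strength parameter `alpha` are assumptions not stated in the abstract, and no quantization step is shown.

```python
def smooth(X, W, alpha=0.5):
    """Return (X_s, W_s, s) with X_s @ W_s == X @ W up to rounding.

    X: activations, shape [tokens][in_channels]
    W: weights, shape [in_channels][out_channels]
    Assumes every channel has a nonzero activation and weight maximum.
    """
    n_in = len(W)
    s = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in X)   # per-channel activation range
        w_max = max(abs(v) for v in W[j])         # range of the matching weight row
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_in)] for row in X]   # activations shrink
    W_s = [[s[j] * v for v in W[j]] for j in range(n_in)]       # weights absorb the scale
    return X_s, W_s, s


def matmul(A, B):
    """Naive matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


X = [[0.1, 20.0], [-0.2, -16.0]]   # channel 1 carries an outlier
W = [[0.5, -0.4], [0.3, 0.2]]
X_s, W_s, s = smooth(X, W)
```

After smoothing, the largest activation magnitude drops from 20 to roughly 2.4 in this toy case, while `matmul(X_s, W_s)` matches `matmul(X, W)`.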

ICML Conference 2022 Conference Paper

Cross-Space Active Learning on Graph Convolutional Networks

  • Yufei Tao 0001
  • Hao Wu
  • Shiyuan Deng

This paper formalizes cross-space active learning on a graph convolutional network (GCN). The objective is to attain the most accurate hypothesis available in any of the instance spaces generated by the GCN. Subject to the objective, the challenge is to minimize the label cost, measured in the number of vertices whose labels are requested. Our study covers both budget algorithms which terminate after a designated number of label requests, and verifiable algorithms which terminate only after having found an accurate hypothesis. A new separation in label complexity between the two algorithm types is established. The separation is unique to GCNs.

JBHI Journal 2022 Journal Article

Data Integration Using Tensor Decomposition for the Prediction of miRNA-Disease Associations

  • Jiawei Luo
  • Yi Liu
  • Pei Liu
  • Zihan Lai
  • Hao Wu

Dysfunction of miRNAs has an important relationship with diseases by impacting their target genes. Identifying disease-related miRNAs is of great significance to prevent and treat diseases. Integrating information on genes related to miRNAs and/or diseases into computational methods for miRNA-disease association studies is meaningful because of the complexity of biological mechanisms. Therefore, in this study, we propose a novel method based on tensor decomposition, termed TDMDA, to integrate multi-type data for identifying pathogenic miRNAs. First, we construct a three-order association tensor to express the associations of miRNA-disease pairs, the associations of miRNA-gene pairs, and the associations of gene-disease pairs simultaneously. Then, a tensor decomposition-based method with auxiliary information is applied to reconstruct the association tensor for predicting miRNA-disease associations, and the auxiliary information includes biological similarity information and adjacency information. The performance of TDMDA is compared with other advanced methods under 5-fold cross-validation. The experimental results indicate that TDMDA is a competitive method.

JBHI Journal 2022 Journal Article

Dual-Branch Interactive Networks on Multichannel Time Series for Human Activity Recognition

  • Yin Tang
  • Lei Zhang
  • Hao Wu
  • Jun He
  • Aiguo Song

The popularity of convolutional architecture has made sensor-based human activity recognition (HAR) become one primary beneficiary. By simply superimposing multiple convolution layers, the local features can be effectively captured from multi-channel time series sensor data, which could output high-performance activity prediction results. On the other hand, recent years have witnessed great success of the Transformer model, which uses a powerful self-attention mechanism to handle long-range sequence modeling tasks, hence avoiding the shortcoming of local feature representations caused by convolutional neural networks (CNNs). In this paper, we seek to combine the merits of CNN and Transformer to model multi-channel time series sensor data, which might provide compelling recognition performance with fewer parameters and FLOPs based on lightweight wearable devices. To this end, we propose a new Dual-branch Interactive Network (DIN) that inherits the advantages from both CNN and Transformer to handle multi-channel time series for HAR. Specifically, the proposed framework utilizes a two-stream architecture to disentangle local and global features by performing conv-embedding and patch-embedding, where a co-attention mechanism is used to adaptively fuse global-to-local and local-to-global feature representations. We perform extensive experiments on three mainstream HAR benchmark datasets including PAMAP2, WISDM, and OPPORTUNITY, which verify that our method consistently outperforms several state-of-the-art baselines, reaching an F1-score of 92.05%, 98.17%, and 91.55% respectively with fewer parameters and FLOPs. In addition, the practical execution time is validated on an embedded Raspberry Pi P3 system, which demonstrates that our approach is adequately efficient for real-time HAR implementations and serves as a better alternative in ubiquitous HAR computing scenarios. Our model code will be released soon.

ICRA Conference 2022 Conference Paper

Fixed and Sliding FBG Sensors-Based Triaxial Tip Force Sensing for Cable-Driven Continuum Robots

  • Zecai Lin
  • Huanghua Liu
  • Xiaojie Ai
  • Weidong Chen 0001
  • Anzhu Gao
  • Zhenglong Sun 0001
  • Yun Zou
  • Guang-Zhong Yang

Tip force sensing for cable-driven continuum robots is vital to provide the force information for safe and reliable human-robot interaction. However, traditional triaxial force sensors usually have a complicated structure occupying the robot's inner lumen, without enough space for additional instrumental tools. To solve this, this paper proposes a fixed and sliding fiber Bragg grating (FBG) sensors-based triaxial force sensing method for cable-driven continuum robots. The fixed FBG sensors are attached to the circumferential surface of the continuum robot at the tip and base, and the sliding optical fibers with FBG sensors are located in the actuation channels as the sensing integrated pulling cables. This configuration guarantees a compact structure and large inner lumen. Two five-degree-of-freedom (5-DOF) electromagnetic (EM) sensors and a 6-DOF EM sensor are assembled to the tip and the base of the robot respectively, which can obtain the pose of the tip with respect to the base. The tip force in three directions can be decoupled using the information of the Bragg wavelength changes and EM sensors. Results show that the mean errors of force sensing along x-direction, y-direction, and z-direction are 4.1%, 4.7%, and 9.8%, respectively. The proposed sensing method does not rely on the elasticity of the continuum robot, enabling its wide applicability for other cable-driven pseudo-continuum robots.

AAAI Conference 2022 Conference Paper

L-CoDe: Language-Based Colorization Using Color-Object Decoupled Conditions

  • Shuchen Weng
  • Hao Wu
  • Zheng Chang
  • Jiajun Tang
  • Si Li
  • Boxin Shi

Colorizing a grayscale image is inherently an ill-posed problem with multi-modal uncertainty. Language-based colorization offers a natural way of interaction to reduce such uncertainty via a user-provided caption. However, the color-object coupling and mismatch issues make the mapping from word to color difficult. In this paper, we propose L-CoDe, a Language-based Colorization network using color-object Decoupled conditions. A predictor for object-color corresponding matrix (OCCM) and a novel attention transfer module (ATM) are introduced to solve the color-object coupling problem. To deal with color-object mismatch that results in incorrect color-object correspondence, we adopt a soft-gated injection module (SIM). We further present a new dataset containing annotated color-object pairs to provide supervisory signals for resolving the coupling problem. Experimental results show that our approach outperforms state-of-the-art methods conditioned on captions.

IJCAI Conference 2021 Conference Paper

Boosting Offline Reinforcement Learning with Residual Generative Modeling

  • Hua Wei
  • Deheng Ye
  • Zhao Liu
  • Hao Wu
  • Bo Yuan
  • Qiang Fu
  • Wei Yang
  • Zhenhui Li

Offline reinforcement learning (RL) tries to learn the near-optimal policy with recorded offline experience without online exploration. Current offline RL research includes: 1) generative modeling, i.e., approximating a policy using fixed data; and 2) learning the state-action value function. While most research focuses on the state-action function part through reducing the bootstrapping error in value function approximation induced by the distribution shift of training data, the effects of error propagation in generative modeling have been neglected. In this paper, we analyze the error in generative modeling. We propose AQL (action-conditioned Q-learning), a residual generative model to reduce policy approximation error for offline RL. We show that our method can learn more accurate policy approximations in different benchmark datasets. In addition, we show that the proposed offline RL method can learn more competitive AI agents in complex control tasks under the multiplayer online battle arena (MOBA) game, Honor of Kings.

NeurIPS Conference 2021 Conference Paper

FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling

  • Bowen Zhang
  • Yidong Wang
  • Wenxin Hou
  • Hao Wu
  • Jindong Wang
  • Manabu Okumura
  • Takahiro Shinozaki

The recently proposed FixMatch achieved state-of-the-art results on most semi-supervised learning (SSL) benchmarks. However, like other modern SSL algorithms, FixMatch uses a pre-defined constant threshold for all classes to select unlabeled data that contribute to the training, thus failing to consider different learning status and learning difficulties of different classes. To address this issue, we propose Curriculum Pseudo Labeling (CPL), a curriculum learning approach to leverage unlabeled data according to the model's learning status. The core of CPL is to flexibly adjust thresholds for different classes at each time step to let pass informative unlabeled data and their pseudo labels. CPL does not introduce additional parameters or computations (forward or backward propagation). We apply CPL to FixMatch and call our improved algorithm FlexMatch. FlexMatch achieves state-of-the-art performance on a variety of SSL benchmarks, with especially strong performances when the labeled data are extremely limited or when the task is challenging. For example, FlexMatch achieves 13.96% and 18.96% error rate reduction over FixMatch on CIFAR-100 and STL-10 datasets respectively, when there are only 4 labels per class. CPL also significantly boosts the convergence speed, e.g., FlexMatch can use only 1/5 training time of FixMatch to achieve even better performance. Furthermore, we show that CPL can be easily adapted to other SSL algorithms and remarkably improve their performances. We open-source our code at https://github.com/TorchSSL/TorchSSL.
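A simplified sketch of per-class threshold adjustment may help make the idea concrete. It assumes a linear scaling of a base threshold `tau` by each class's normalized count of confident predictions; the function name, the fallback when no prediction is confident, and the example probabilities are illustrative assumptions, and the released TorchSSL code should be consulted for the actual procedure.

```python
def class_thresholds(probs, tau=0.95):
    """Scale a base threshold per class by how well that class is currently learned.

    probs: model output probabilities on unlabeled samples, shape [samples][classes].
    A class's learning status is estimated as the number of samples predicted as
    that class with confidence above tau, normalized by the best-learned class.
    """
    n_classes = len(probs[0])
    counts = [0] * n_classes
    for p in probs:
        c = max(range(n_classes), key=lambda k: p[k])   # predicted class
        if p[c] > tau:
            counts[c] += 1
    top = max(counts)
    if top == 0:
        # Choice made for this sketch: fall back to the fixed threshold.
        return [tau] * n_classes
    return [tau * n / top for n in counts]   # poorly learned classes get lower thresholds


probs = [
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
    [0.03, 0.96, 0.01],
    [0.40, 0.35, 0.25],
]
thresholds = class_thresholds(probs)
```

Here class 0 keeps the full threshold, class 1 gets half of it, and class 2, which has no confident predictions yet, gets a threshold of zero so that its pseudo labels are admitted.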

NeurIPS Conference 2021 Conference Paper

Nested Variational Inference

  • Heiko Zimmermann
  • Hao Wu
  • Babak Esmaeili
  • Jan-Willem van de Meent

We develop nested variational inference (NVI), a family of methods that learn proposals for nested importance samplers by minimizing a forward or reverse KL divergence at each level of nesting. NVI is applicable to many commonly-used importance sampling strategies and provides a mechanism for learning intermediate densities, which can serve as heuristics to guide the sampler. Our experiments apply NVI to (a) sample from a multimodal distribution using a learned annealing path, (b) learn heuristics that approximate the likelihood of future observations in a hidden Markov model, and (c) perform amortized inference in hierarchical deep generative models. We observe that optimizing nested objectives leads to improved sample quality in terms of log average weight and effective sample size.
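The two sample-quality metrics named above have standard definitions for importance weights, sketched here in plain Python. The function names are illustrative, and the code computes only the metrics from given log-weights; it does not implement NVI itself.

```python
import math


def log_mean_weight(log_w):
    """Log of the average importance weight, computed stably in log space."""
    m = max(log_w)
    return m + math.log(sum(math.exp(l - m) for l in log_w) / len(log_w))


def effective_sample_size(log_w):
    """ESS = (sum w)^2 / sum w^2; invariant to rescaling the weights."""
    m = max(log_w)
    w = [math.exp(l - m) for l in log_w]
    return sum(w) ** 2 / sum(x * x for x in w)


uniform = [0.0, 0.0, 0.0, 0.0]        # equal weights: every sample counts
degenerate = [0.0, -50.0, -50.0]      # one weight dominates
ess_uniform = effective_sample_size(uniform)
ess_degenerate = effective_sample_size(degenerate)
```

Equal weights give an effective sample size equal to the number of samples, while a single dominant weight drives it toward one.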

AAAI Conference 2021 Conference Paper

Training Spiking Neural Networks with Accumulated Spiking Flow

  • Hao Wu
  • Yueyi Zhang
  • Wenming Weng
  • Yongting Zhang
  • Zhiwei Xiong
  • Zheng-Jun Zha
  • Xiaoyan Sun
  • Feng Wu

The fast development of neuromorphic hardware promotes Spiking Neural Networks (SNNs) to a thrilling research avenue. Current SNNs, though much more efficient, are less effective compared with leading Artificial Neural Networks (ANNs), especially in supervised learning tasks. Recent efforts further demonstrate the potential of SNNs in supervised learning by introducing approximated backpropagation (BP) methods. To deal with the non-differentiable spike function in SNNs, these BP methods utilize information from the spatio-temporal domain to adjust the model parameters. As the time window and network size increase, the computational complexity of spatio-temporal backpropagation grows dramatically. In this paper, we propose a new backpropagation method for SNNs based on the accumulated spiking flow (ASF), i.e., ASF-BP. In the proposed ASF-BP method, updating parameters does not rely on the spike train of spiking neurons but leverages accumulated inputs and outputs of spiking neurons over the time window, which reduces the BP complexity significantly. We further present an adaptive linear estimation model to approach the dynamic characteristics of spiking neurons statistically. Experimental results demonstrate that with our proposed ASF-BP method, light-weight convolutional SNNs achieve superior performances compared with other spike-based BP methods on both non-neuromorphic (MNIST, CIFAR10) and neuromorphic (CIFAR10-DVS) datasets. The code is available at https://github.com/neural-lab/ASF-BP.

TIST Journal 2021 Journal Article

VSumVis: Interactive Visual Understanding and Diagnosis of Video Summarization Model

  • Guodao Sun
  • Hao Wu
  • Lin Zhu
  • Chaoqing Xu
  • Haoran Liang
  • Binwei Xu
  • Ronghua Liang

With the rapid development of mobile Internet, the popularity of video capture devices has brought a surge in multimedia video resources. Utilizing machine learning methods combined with well-designed features, we could automatically obtain video summarization to relax video resource consumption and retrieval issues. However, there always exists a gap between the summarization obtained by the model and the ones annotated by users. How to help users understand the difference, provide insights in improving the model, and enhance the trust in the model remains challenging in the current study. To address these challenges, we propose VSumVis under a user-centered design methodology, a visual analysis system with multi-feature examination and multi-level exploration, which could help users explore and analyze video content, as well as the intrinsic relationship that exists in our video summarization model. The system contains multiple coordinated views, i.e., video view, projection view, detail view, and sequential frames view. A multi-level analysis process to integrate video events and frames is presented with clusters and nodes visualization in our system. Temporal patterns concerning the difference between the manual annotation score and the saliency score produced by our model are further investigated and distinguished with sequential frames view. Moreover, we propose a set of rich user interactions that enable an in-depth, multi-faceted analysis of the features in our video summarization model. We conduct case studies and interviews with domain experts to provide anecdotal evidence about the effectiveness of our approach. Quantitative feedback from a user study confirms the usefulness of our visual system for exploring the video summarization model.

NeurIPS Conference 2020 Conference Paper

A Variational Approach for Learning from Positive and Unlabeled Data

  • Hui Chen
  • Fangqing Liu
  • Yin Wang
  • Liyue Zhao
  • Hao Wu

Learning binary classifiers only from positive and unlabeled (PU) data is an important and challenging task in many real-world applications, including web text classification, disease gene identification and fraud detection, where negative samples are difficult to verify experimentally. Most recent PU learning methods are developed based on the misclassification risk of the supervised learning type, and they may suffer from inaccurate estimates of class prior probabilities. In this paper, we introduce a variational principle for PU learning that allows us to quantitatively evaluate the modeling error of the Bayesian classifier directly from given data. This leads to a loss function which can be efficiently calculated without involving class prior estimation or any other intermediate estimation problems, and the variational learning method can then be employed to optimize the classifier under general conditions. We illustrate the effectiveness of the proposed variational method on a number of benchmark examples.

AAAI Conference 2020 Conference Paper

DeepDualMapper: A Gated Fusion Network for Automatic Map Extraction Using Aerial Images and Trajectories

  • Hao Wu
  • Hanyuan Zhang
  • Xinyu Zhang
  • Weiwei Sun
  • Baihua Zheng
  • Yuning Jiang

Automatic map extraction is of great importance to urban computing and location-based services. Aerial image and GPS trajectory data refer to two different data sources that could be leveraged to generate the map, although they carry different types of information. Most previous works on data fusion between aerial images and data from auxiliary sensors do not fully utilize the information of both modalities and hence suffer from the issue of information loss. We propose a deep convolutional neural network called DeepDualMapper which fuses the aerial image and trajectory data in a more seamless manner to extract the digital map. We design a gated fusion module to explicitly control the information flows from both modalities in a complementary-aware manner. Moreover, we propose a novel densely supervised refinement decoder to generate the prediction in a coarse-to-fine way. Our comprehensive experiments demonstrate that DeepDualMapper can fuse the information of images and trajectories much more effectively than existing approaches, and is able to generate maps with higher accuracy.

AAAI Conference 2020 Conference Paper

Mastering Complex Control in MOBA Games with Deep Reinforcement Learning

  • Deheng Ye
  • Zhao Liu
  • Mingfei Sun
  • Bei Shi
  • Peilin Zhao
  • Hao Wu
  • Hongsheng Yu
  • Shaojie Yang

We study the reinforcement learning problem of complex action control in the Multi-player Online Battle Arena (MOBA) 1v1 games. This problem involves far more complicated state and action spaces than those of traditional 1v1 games, such as Go and Atari series, which makes it very difficult to search for any policies with human-level performance. In this paper, we present a deep reinforcement learning framework to tackle this problem from the perspectives of both system and algorithm. Our system is of low coupling and high scalability, which enables efficient explorations at large scale. Our algorithm includes several novel strategies, including control dependency decoupling, action mask, target attention, and dual-clip PPO, with which our proposed actor-critic network can be effectively trained in our system. Tested on the MOBA game Honor of Kings, the trained AI agents can defeat top professional human players in full 1v1 games.

JBHI Journal 2020 Journal Article

Multimodal Data Analysis of Alzheimer's Disease Based on Clustering Evolutionary Random Forest

  • Xia-an Bi
  • Xi Hu
  • Hao Wu
  • Yang Wang

Alzheimer's disease (AD) has become a severe medical challenge. Advances in technologies produced high-dimensional data of different modalities including functional magnetic resonance imaging (fMRI) and single nucleotide polymorphism (SNP). Understanding the complex association patterns among these heterogeneous and complementary data is of benefit to the diagnosis and prevention of AD. In this paper, we apply the appropriate correlation analysis method to detect the relationships between brain regions and genes, and propose “brain region-gene pairs” as the multimodal features of the sample. In addition, we put forward a novel data analysis method from technology aspect, cluster evolutionary random forest (CERF), which is suitable for “brain region-gene pairs”. The idea of clustering evolution is introduced to improve the generalization performance of random forest which is constructed by randomly selecting samples and sample features. Through hierarchical clustering of decision trees in random forest, the decision trees with higher similarity are clustered into one class, and the decision trees with the best performance are retained to enhance the diversity between decision trees. Furthermore, based on CERF, we integrate feature construction, feature selection and sample classification to find the optimal combination of different methods, and design a comprehensive diagnostic framework for AD. The framework is validated by the samples with both fMRI and SNP data from ADNI. The results show that we can effectively identify AD patients and discover some brain regions and genes associated with AD significantly based on this framework. These findings are conducive to the clinical treatment and prevention of AD.

NeurIPS Conference 2020 Conference Paper

Stochastic Normalizing Flows

  • Hao Wu
  • Jonas Köhler
  • Frank Noe

The sampling of probability distributions specified up to a normalization constant is an important problem in both machine learning and statistical mechanics. While classical stochastic sampling methods such as Markov Chain Monte Carlo (MCMC) or Langevin Dynamics (LD) can suffer from slow mixing times, there is a growing interest in using normalizing flows in order to learn the transformation of a simple prior distribution to the given target distribution. Here we propose a generalized and combined approach to sample target densities: Stochastic Normalizing Flows (SNF) – an arbitrary sequence of deterministic invertible functions and stochastic sampling blocks. We show that stochasticity overcomes expressivity limitations of normalizing flows resulting from the invertibility constraint, whereas trainable transformations between sampling steps improve efficiency of pure MCMC/LD along the flow. By invoking ideas from non-equilibrium statistical mechanics we derive an efficient training procedure by which both the sampler's and the flow's parameters can be optimized end-to-end, and by which we can compute exact importance weights without having to marginalize out the randomness of the stochastic blocks. We illustrate the representational power, sampling efficiency and asymptotic correctness of SNFs on several benchmarks including applications to sampling molecular systems in equilibrium.

AAAI Conference 2019 Conference Paper

Safeguarded Dynamic Label Regression for Noisy Supervision

  • Jiangchao Yao
  • Hao Wu
  • Ya Zhang
  • Ivor W. Tsang
  • Jun Sun

Learning with noisy labels is imperative in the Big Data era since it reduces expensive labor on accurate annotations. A previous method, learning with noise transition, has enjoyed theoretical guarantees when it is applied to the scenario with the class-conditional noise. However, this approach critically depends on an accurate pre-estimated noise transition, which is usually impractical. A subsequent improvement adapts the pre-estimation in the form of a Softmax layer along with the training progress. However, the parameters in the Softmax layer are highly tweaked for the fragile performance and easily get stuck in undesired local minima. To overcome this issue, we propose a Latent Class-Conditional Noise model (LCCN) that models the noise transition in a Bayesian form. By projecting the noise transition into a Dirichlet-distributed space, the learning is constrained on a simplex instead of some ad hoc parametric space. Furthermore, we specially deduce a dynamic label regression method for LCCN to iteratively infer the latent true labels and jointly train the classifier and model the noise. Our approach theoretically safeguards the bounded update of the noise transition, which avoids arbitrarily tuning via a batch of samples. Extensive experiments have been conducted on controllable noise data with CIFAR-10 and CIFAR-100 datasets, and the agnostic noise data with Clothing1M and WebVision17 datasets. Experimental results have demonstrated that the proposed model outperforms several state-of-the-art methods.

IJCAI Conference 2018 Conference Paper

CAGAN: Consistent Adversarial Training Enhanced GANs

  • Yao Ni
  • Dandan Song
  • Xi Zhang
  • Hao Wu
  • Lejian Liao

Generative adversarial networks (GANs) have shown impressive results; however, the generator and the discriminator are optimized in a finite parameter space, which means their performance still needs to be improved. In this paper, we propose a novel approach of adversarial training between one generator and an exponential number of critics which are sampled from the original discriminative neural network via dropout. As the discrepancy between outputs of different sub-networks for the same sample can measure the consistency of these critics, we encourage the critics to be consistent on real samples and inconsistent on generated samples during training, while the generator is trained to generate consistent samples for different critics. Experimental results demonstrate that our method can obtain state-of-the-art Inception scores of 9.17 and 10.02 on supervised CIFAR-10 and unsupervised STL-10 image generation tasks, respectively, as well as achieve competitive semi-supervised classification results on several benchmarks. Importantly, we demonstrate that our method can maintain stability in training and alleviate mode collapse.

NeurIPS Conference 2018 Conference Paper

Deep Generative Markov State Models

  • Hao Wu
  • Andreas Mardt
  • Luca Pasquali
  • Frank Noe

We propose a deep generative Markov State Model (DeepGenMSM) learning framework for inference of metastable dynamical systems and prediction of trajectories. After unsupervised training on time series data, the model contains (i) a probabilistic encoder that maps from high-dimensional configuration space to a small-sized vector indicating the membership to metastable (long-lived) states, (ii) a Markov chain that governs the transitions between metastable states and facilitates analysis of the long-time dynamics, and (iii) a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories to predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM is demonstrated to provide accurate estimates of the long-time kinetics and generate valid distributions for molecular dynamics (MD) benchmark systems. Remarkably, we show that DeepGenMSMs are able to make long time-steps in molecular configuration space and generate physically realistic structures in regions that were not seen in training data.

IJCAI Conference 2018 Conference Paper

DeepTravel: a Neural Network Based Travel Time Estimation Model with Auxiliary Supervision

  • Hanyuan Zhang
  • Hao Wu
  • Weiwei Sun
  • Baihua Zheng

Estimating the travel time of a path is of great importance to smart urban mobility. Existing approaches are either based on estimating the time cost of each road segment, which cannot capture many complex cross-segment factors, or designed heuristically in a non-learning-based way, which fails to leverage the naturally abundant temporal labels of the data, i.e., the time stamp of each trajectory point. In this paper, we leverage new developments in deep neural networks and propose a novel auxiliary supervision model, namely DeepTravel, that can automatically and effectively extract different features, as well as make full use of the temporal labels of the trajectory data. We have conducted comprehensive experiments on real datasets to demonstrate that DeepTravel outperforms existing approaches.

IJCAI Conference 2017 Conference Paper

Deep Context: A Neural Language Model for Large-scale Networked Documents

  • Hao Wu
  • Kristina Lerman

We propose a scalable neural language model that leverages the links between documents to learn the deep context of documents. Our model, Deep Context Vector, takes advantage of distributed representations to exploit the word order in document sentences, as well as the semantic connections among linked documents in a document network. We evaluate our model on large-scale data collections that include Wikipedia pages, and scientific and legal citation networks. We demonstrate its effectiveness and efficiency on document classification and link prediction tasks.

IJCAI Conference 2017 Conference Paper

Modeling Trajectories with Recurrent Neural Networks

  • Hao Wu
  • Ziyang Chen
  • Weiwei Sun
  • Baihua Zheng
  • Wei Wang

Modeling trajectory data is a building block for many smart-mobility initiatives. Existing approaches apply shallow models such as Markov chains and inverse reinforcement learning to model trajectories, which cannot capture long-term dependencies. On the other hand, deep models such as Recurrent Neural Networks (RNNs) have demonstrated their strength in modeling variable-length sequences. However, directly adopting RNNs to model trajectories is not appropriate because of the unique topological constraints faced by trajectories. Motivated by these findings, we design two RNN-based models that take full advantage of the strength of RNNs in capturing variable-length sequences while addressing the constraints of topological structure on trajectory modeling. Our experimental study based on real taxi trajectory datasets shows that both of our approaches largely outperform the existing approaches.

NeurIPS Conference 2016 Conference Paper

Spectral Learning of Dynamic Systems from Nonequilibrium Data

  • Hao Wu
  • Frank Noe

Observable operator models (OOMs) and related models are among the most important and powerful tools for modeling and analyzing stochastic systems. They exactly describe the dynamics of finite-rank systems and can be efficiently and consistently estimated through spectral learning under the assumption of identically distributed data. In this paper, we investigate the properties of spectral learning without this assumption, as required for analyzing systems with large time scales, and show that the equilibrium dynamics of a system can be extracted from nonequilibrium observation data by imposing an equilibrium constraint. In addition, we propose a binless extension of spectral learning for continuous data. In comparison with other continuous-valued spectral algorithms, the binless algorithm can achieve consistent estimation of equilibrium dynamics with only linear complexity.

IJCAI Conference 2015 Conference Paper

Saul: Towards Declarative Learning Based Programming

  • Parisa Kordjamshidi
  • Dan Roth
  • Hao Wu

We present Saul, a new probabilistic programming language designed to address some of the shortcomings of programming languages that aim at advancing and simplifying the development of AI systems. Such languages need to interact with messy, naturally occurring data, to allow a programmer to specify what needs to be done at an appropriate level of abstraction rather than at the data level, to be developed on a solid theory that supports moving to and reasoning at this level of abstraction and, finally, to support flexible integration of these learning and inference models within an application program. Saul is an object-functional programming language written in Scala that facilitates these by (1) allowing a programmer to learn, name and manipulate named abstractions over relational data; (2) supporting seamless incorporation of trainable (probabilistic or discriminative) components into the program, and (3) providing a level of inference over trainable models to support composition and make decisions that respect domain and application constraints. Saul is developed over a declaratively defined relational data model, can use piecewise learned factor graphs with declaratively specified learning and inference objectives, and it supports inference over probabilistic models augmented with declarative knowledge-based constraints. We describe the key constructs of Saul and exemplify its use in developing applications that require relational feature engineering and structured output prediction.

UAI Conference 2014 Conference Paper

A Bayesian Nonparametric Model for Spectral Estimation of Metastable Systems

  • Hao Wu

The identification of eigenvalues and eigenfunctions from simulation or experimental data is a fundamental and important problem for the analysis of metastable systems, because the dominant spectral components usually contain a lot of essential information about the metastable dynamics on slow timescales. It has been shown that the dynamics of a strongly metastable system can be equivalently described as a hidden Markov model (HMM) under some technical assumptions, and that spectral estimation can be performed through HMM learning. However, spectral estimation with an unknown number of dominant spectra is still a challenge in the framework of traditional HMMs, and the infinite HMMs developed based on stick-breaking processes cannot satisfactorily solve this problem either. In this paper, we analyze the difficulties of spectral estimation for infinite HMMs, and present a new nonparametric model called the stick-breaking half-weighted model (SB-HWM) to address this problem. The SB-HWM defines a sparse prior of eigenvalues and can be applied to Bayesian inference of dominant eigenpairs of metastable systems in a nonparametric manner. We demonstrate by simulations the advantages of applying SB-HWM to spectral estimation.

AAAI Conference 2010 Conference Paper

Modeling Dynamic Multi-Topic Discussions in Online Forums

  • Hao Wu
  • Jiajun Bu
  • Chun Chen
  • Can Wang
  • Guang Qiu
  • Lijun Zhang
  • Jianfeng Shen

In the form of topic discussions, users interact with each other to share knowledge and exchange information in online forums. Modeling the evolution of topic discussions reveals how information propagates on the Internet and can thus help understand sociological phenomena and improve the performance of applications such as recommendation systems. In this paper, we argue that a user’s participation in topic discussions is motivated by either her friends or her own preferences. Inspired by the theory of information flow, we propose dynamic topic discussion models by mining influential relationships between users and individual preferences. Reply relations of users are exploited to construct the fundamental influential social network. The properties of discussed topics and a time lapse factor are also considered in our modeling. Furthermore, we propose a novel measure called ParticipationRank to rank users according to how important they are in the social network and to what extent they prefer to participate in the discussion of a certain topic. The experiments show that our model can simulate the evolution of topic discussions well and predict the tendency of users’ participation accurately.

ICRA Conference 2009 Conference Paper

Environment adapted active multi-focal vision system for object detection

  • Tingting Xu
  • Hao Wu
  • Tianguang Zhang
  • Kolja Kühnlenz
  • Martin Buss

A biologically inspired foveated attention system in an object detection scenario is proposed. A high-performance active multi-focal camera system imitates visual behaviors such as scan, saccade and fixation. Bottom-up attention uses wide-angle stereo data to select a sequence of fixation points in the peripheral field of view. Successive saccade and fixation at high foveal resolution using a telephoto camera enables highly accurate object recognition. Once an object is recognized as a target object, the bottom-up attention model is adapted to the current environment, using the top-down information extracted from this target object. The bottom-up attention model and the object recognition algorithm based on SIFT are implemented using CUDA technology on Graphics Processing Units (GPUs), which greatly accelerates image processing. In the experimental evaluation, all the target objects were detected in different backgrounds. Evident improvements in accuracy, flexibility and efficiency are achieved.