Arrow Research search

Author name cluster

Yu Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

147 papers
2 author rows

Possible papers

147

AAAI Conference 2026 Conference Paper

AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

  • Boxun Xu
  • Yu Wang
  • Zihu Wang
  • Peng Li

Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, which severely limits scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales contributes significantly to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.

JBHI Journal 2026 Journal Article

Automated Prediction of Subsequent Miscarriage Risk in Pregnant Women by Early First-trimester Ultrasound Characteristics Based on the Convolutional Neural Network Model

  • Yu Wang
  • Fangfang Han
  • Chuanjie Lei
  • Qixin Zhang
  • Chenghuan Yin
  • Yueyang Teng
  • Zhengwei Yuan

Predicting the risk of subsequent miscarriage during the first trimester is crucial for optimizing ultrasound surveillance and alleviating psychological distress for pregnant women. This study aims to develop a multi-input convolutional neural network (CNN) that integrates early gestational ultrasound images with clinical measurements to assess miscarriage risk, while providing interpretable visual explanations to support clinical decision-making. Utilizing a retrospective dataset of singleton pregnancies at 6–8 weeks of gestation, we extracted gestational sac regions through three distinct segmentation strategies and trained a CNN-based classifier to differentiate between normal and miscarriage outcomes. The predictive performance was evaluated using accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC). The results indicated that the multi-input CNN outperformed single-input models. Additionally, an edge-expanded segmentation strategy demonstrated superior performance compared to both precise segmentation and rectangular cropping. By incorporating Gradient-weighted Class Activation Mapping (Grad-CAM) and SHapley Additive exPlanations (SHAP), we identified subchorionic regions as significant biomarkers and clarified the differences in attention to features among different models, thereby enhancing the interpretability and clinical applicability of the model for early pregnancy risk assessment.

AAAI Conference 2026 Conference Paper

Boosting Noisy Correspondence Discrimination via Dynamic Neighborhood Semantic Verification

  • Yu Wang
  • Fengxia Han
  • Jianyu Wang

Noisy correspondence, characterized by mismatches in cross-modal data pairs, presents a significant challenge for real-world applications. Current approaches primarily rely on direct cross-modal pairwise similarity metrics, which suffer from two critical limitations: noise sensitivity, where direct similarity calculations are easily corrupted by noisy or ambiguous instances, and contextual blindness, where isolated pairwise comparisons fail to exploit the rich semantic context embedded in neighboring instances. To address these issues, we propose to improve noisy correspondence discrimination through a well-designed Dynamic Neighborhood Semantic association verification paradigm, namely DNS. Specifically, we hypothesize that the matching degree of current samples can be quantified through the interrelationships among their respective semantic neighbors. For this reason, we develop a novel semantic drift distance and local relation proximity based on dynamic neighborhood association. Furthermore, beyond implicit approaches to semantic gap modeling in cross-modal data, we introduce an explicit decomposition framework that disentangles the gap into semantic orientation and scalar magnitude. Through the strategic integration of these proposed mechanisms, DNS achieves substantial enhancement in noisy correspondence discrimination, yielding remarkable performance gains. Extensive experiments on three widely-used benchmark datasets, including Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the superiority of DNS over state-of-the-art methods.

AAAI Conference 2026 Conference Paper

Dual-branch Spatial-Temporal Self-supervised Representation for Enhanced Road Network Learning

  • Qinghong Guo
  • Yu Wang
  • Ji Cao
  • Tongya Zheng
  • Junshu Dai
  • Bingde Hu
  • Shunyu Liu
  • Canghong Jin

Road network representation learning (RNRL) has attracted increasing attention from both researchers and practitioners as various spatiotemporal tasks are emerging. Recent advanced methods leverage Graph Neural Networks (GNNs) and contrastive learning to characterize the spatial structure of road segments in a self-supervised paradigm. However, the spatial heterogeneity and temporal dynamics of road networks raise severe challenges to the neighborhood smoothing mechanism of self-supervised GNNs. To address these issues, we propose a Dual-branch Spatial-Temporal self-supervised representation framework for enhanced road representations, termed DST. On one hand, DST designs a mix-hop transition matrix for graph convolution to incorporate dynamic relations of roads from trajectories. Besides, DST contrasts road representations of the vanilla road network against those of hypergraphs in a spatial self-supervised way. The hypergraph is newly built based on three types of hyperedges to capture long-range relations. On the other hand, DST performs next token prediction as the temporal self-supervised task on the sequences of traffic dynamics based on a causal Transformer, which is further regularized by differentiating traffic modes of weekdays from those of weekends. Extensive experiments against state-of-the-art methods verify the superiority of our proposed framework. Moreover, the comprehensive spatiotemporal modeling facilitates DST to excel in zero-shot learning scenarios.

AAAI Conference 2026 Conference Paper

FracSegmentator: Fracture Instance Segmentation with Trauma-Prior-Guided Contrastive Learning

  • Yanzhen Liu
  • Sutuke Yibulayimu
  • Yang Zhou
  • Yudi Sang
  • Yu Wang

Fracture injuries often lead to complex bone fragmentations, posing significant challenges for accurate segmentation in surgical planning and trauma assessment. Manual annotation of each fragment is time-consuming and inconsistent, while existing automated methods often fail to separate individual fragments due to the wide variation in fracture types, irregular fracture surfaces, and close inter-fragment contact. To address these challenges, we introduce FracSegmentator, a deep learning approach for bone fragment instance segmentation. The model takes extracted bone regions in CT as input and isolates individual fragments by identifying fracture surfaces and separating closely contacting structures. Central to our approach is a Trauma-Prior-Guided Contrastive Learning module, which incorporates clinical knowledge through memory-based attention to better distinguish fractured surfaces from healthy regions. We evaluate FracSegmentator on four datasets that cover a range of anatomical sites and fracture patterns. The method achieves state-of-the-art results across all datasets and demonstrates strong generalization capabilities. By delivering accurate and efficient fragment-level segmentation, FracSegmentator supports critical downstream tasks such as automated fracture diagnosis, surgical planning, and preoperative reduction simulation.

AAAI Conference 2026 Conference Paper

GENMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

  • Kaiyi Huang
  • Yukun Huang
  • Xuefei Ning
  • Zinan Lin
  • Yu Wang
  • Xihui Liu

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Inspired by effective human creative workflow, we propose GENMAC, a multi-agent collaboration framework that enables compositional text-to-video generation. The framework incorporates a three-stage collaborative workflow: DESIGN, GENERATION, and REDESIGN, with an iterative loop between the latter two stages to progressively verify and refine the generated videos. In the DESIGN stage, a large language model (Design Agent) plans objects with layouts, and then a video generation model synthesizes videos in the GENERATION stage. The REDESIGN stage is the most challenging stage that aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of single-agent and naive multi-agent frameworks, we apply a division-of-labor strategy in this stage by introducing a sequence of specialized agents, executed by MLLMs (multimodal large language models): Verification Agent, Suggestion Agent, Correction Agent, and Output Structuring Agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a suite of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GENMAC by generating videos based on long compositional text prompts and achieving state-of-the-art performance on the compositional text-to-video generation benchmark.

JBHI Journal 2026 Journal Article

M$^{3}$-DEGREES Net: Monocular-Guided Metric Marching Depth Estimation With Graph-Based Relevance Ensemble for Endoluminal Surgery

  • Bo Lu
  • Tiancheng Zhou
  • Qingbiao Li
  • Wenzheng Chi
  • Yue Wang
  • Yu Wang
  • Huicong Liu
  • Jia Gu

Robotic endoluminal surgery has gained tremendous attention for its enhanced treatments in gastrointestinal intervention, where navigating surgeons with monocular camera-based metric depth estimation is a vital sector. However, existing methods either rely on external sensors or perform poorly in terms of visual navigation. In this work, we present our M$^{3}$-Degrees Net, a novel monocular vision-guided and graph learning-based network tailored for accurate metric marching depth (MD) estimation. We first leverage a generative model to output a scale-free depth map, providing a depth basis in a coarse granularity. To achieve an optimized and metric MD prediction, a relational graph convolutional network with multi-modal visual knowledge fusion is devised. It utilizes shared salient features between keyframes and encodes their pixel differences on the depth basis as the main node, while a projection length-based node that predicts the MD on a proportional relationship basis is introduced, aiming to enable the network with explicit depth awareness. Moreover, to compensate for rotation-induced MD estimation bias, we model the endoscope's orientation changes as image-level feature shifts, formulating an ego-motion correction node for MD optimization. Lastly, a multi-layer regression network for the metric MD estimation with finer granularity is devised. We validate our network on both public and in-house datasets, and the quantitative results reveal that it can limit the overall MD error under 27.3%, which vastly outperforms the existing methods. Besides, our M$^{3}$-Degrees Net is qualitatively tested on the in-house clinical gastrointestinal endoscopy data, demonstrating its satisfactory performance even under cavity mucus with varying reflections, indicating promising clinical potential.

AAAI Conference 2026 Conference Paper

MedS³: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision

  • Shuyang Jiang
  • Yusheng Liao
  • Zhe Chen
  • Ya Zhang
  • Yanfeng Wang
  • Yu Wang

Medical language models face critical barriers to real-world clinical reasoning applications. Mainstream efforts fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, leaving them far from a versatile, credible, and efficient language model for clinical reasoning. To this end, we propose MedS³, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) to construct rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS³ outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS³ achieves robust and faithful reasoning behavior.

AAAI Conference 2026 Conference Paper

Neural Graph Navigation for Intelligent Subgraph Matching

  • Yuchen Ying
  • Yiyang Dai
  • Wenda Li
  • Wenjie Huang
  • Rui Wang
  • Tongya Zheng
  • Yu Wang
  • Hanyang Yuan

Subgraph matching, a cornerstone of relational pattern detection in domains ranging from biochemical systems to social network analysis, faces significant computational challenges due to the dramatically growing search space. Existing methods address this problem within a filtering-ordering-enumeration framework, in which the enumeration stage recursively matches the query graph against the candidate subgraphs of the data graph. However, the lack of awareness of subgraph structural patterns leads to a costly brute-force enumeration, thereby critically motivating the need for intelligent navigation in subgraph matching. To address this challenge, we propose Neural Graph Navigation (NeuGN), a neuro-heuristic framework that transforms brute-force enumeration into neural-guided search by integrating neural navigation mechanisms into the core enumeration process. By preserving heuristic-based completeness guarantees while incorporating neural intelligence, NeuGN significantly reduces the First Match Steps by up to 98.2% compared to state-of-the-art methods across six real-world datasets.

JBHI Journal 2026 Journal Article

Privacy Preserved Blood Glucose Level Cross-Prediction: An Asynchronous Decentralized Federated Learning Approach

  • Chengzhe Piao
  • Taiyu Zhu
  • Yu Wang
  • Stephanie E Baldeweg
  • Paul Taylor
  • Pantelis Georgiou
  • Jiahao Sun
  • Jun Wang

Newly diagnosed Type 1 Diabetes (T1D) patients often struggle to obtain effective Blood Glucose (BG) prediction models due to the lack of sufficient BG data from Continuous Glucose Monitoring (CGM), presenting a significant “cold start” problem in patient care. Utilizing population models to address this challenge is a potential solution, but collecting patient data for training population models in a privacy-conscious manner is challenging, especially given that such data is often stored on personal devices. To protect privacy while addressing the “cold start” problem in diabetes care, we propose “GluADFL”, blood Glucose prediction by Asynchronous Decentralized Federated Learning. We compared GluADFL with eight baseline methods using four distinct T1D datasets comprising 298 participants, and it demonstrated superior performance in accurately predicting BG levels for cross-patient analysis. Furthermore, patients’ data might be stored and shared across various communication networks in GluADFL, ranging from highly interconnected topologies (e.g., random, which performs best among those tested) to more structured ones (e.g., cluster and ring), making it suitable for various social networks. The asynchronous training framework supports flexible participation: by adjusting the ratio of inactive participants, we found that performance remains stable as long as fewer than 70% are inactive. Our results confirm that GluADFL offers a practical, privacy-preserved solution for BG prediction in T1D, significantly enhancing the quality of diabetes management.

AAAI Conference 2026 Conference Paper

RSPlace: Rotation Sensing Macro Placement via Bidirectional Tree Expansion

  • Tianyi Liu
  • Yaxin Xu
  • Lin Geng
  • Ningzhong Liu
  • Han Sun
  • Yu Wang

Macro placement is a crucial subproblem of chip design, focusing on determining the locations of numerous macros while minimizing multiple metrics. In recent years, reinforcement learning (RL) has gained traction as a favorable technique to improve placement performance. However, existing RL-based placers ignore the orientation of macros, resulting in the state space constrained to two-dimensional discrete coordinates and greatly restricting the exploration opportunities. To address this issue, we propose a novel macro placement method, RSPlace, which guides the bidirectional expansion of the global search tree to offer the RL agent more exploration opportunities, incorporating rotation into the RL-based macro placement solution for the first time. RSPlace intelligently determines the optimal rotation angle to maximize placement benefits by leveraging rotation sensing and placement perturbations. Extensive experiments demonstrate that taking the macro orientation into account substantially broadens the feasible locations and effectively reduces the half-perimeter wirelength (HPWL), thus ensuring that our approach significantly improves the optimization effect compared to the state-of-the-art method.

AAAI Conference 2026 Conference Paper

Topology-aware Knowledge Preservation for Class-Incremental Learning

  • Han Zang
  • Yongfeng Dong
  • Linhao Li
  • Liang Yang
  • Yu Wang

Class Incremental Learning (CIL) aims to enable models to continually learn new classes while retaining previously learned knowledge. The principal challenge in CIL is catastrophic forgetting, which prior approaches typically address by distilling knowledge from the previous model. However, such distillation is often limited to pairwise alignment, failing to preserve the underlying global manifold structure of the feature space, ultimately resulting in semantic drift over time. To capture multi-scale structural patterns in the feature space, we propose a topology-aware distillation framework that leverages persistent homology. Specifically, by enforcing topological alignment across incremental stages, our method ensures structure-consistent knowledge transfer and robust preservation of old classes. Furthermore, we devise a dual-branch architecture with an inverse sampling and dynamic reweighting mechanism that addresses the inherent data imbalance in standard replay-based frameworks. These innovations coalesce into TaKP (Topology-aware Knowledge Preservation), a unified framework designed to enhance knowledge preservation in CIL. Extensive experiments demonstrate that TaKP achieves state-of-the-art performance on multiple benchmarks, significantly improving old-class preservation and average accuracy.

NeurIPS Conference 2025 Conference Paper

A Driving-Style-Adaptive Framework for Vehicle Trajectory Prediction

  • Di Wen
  • Yu Wang
  • Zhigang Wu
  • Zhaocheng He
  • Zhe Wu
  • Zheng Qingfang

Vehicle trajectory prediction serves as a critical enabler for autonomous navigation and intelligent transportation systems. While existing approaches predominantly focus on temporal pattern extraction and vehicle-environment interaction modeling, they exhibit a fundamental limitation in addressing trajectory heterogeneity originating from human driving styles. This oversight constrains prediction reliability in complex real-world scenarios. To bridge this gap, we propose the Driving-Style-Adaptive (DSA) framework, which establishes the first systematic integration of heterogeneous driving behaviors into trajectory prediction models. Specifically, our framework employs a set of basis functions tailored to each driving style to approximate the trajectory patterns. By dynamically combining and adaptively adjusting the degree of these basis functions, DSA not only enhances prediction accuracy but also provides explanatory insights into the prediction process. Extensive experiments on public real-world datasets demonstrate that the DSA framework outperforms state-of-the-art methods.

TIST Journal 2025 Journal Article

Adaptive Target-Oriented Tracking

  • Sixian Chan
  • Xianpeng Zeng
  • Zhoujian Wu
  • Yu Wang
  • Xiaolong Zhou
  • Tinglong Tang
  • Jie Hu

Current one-stream tracking pipelines perform relation modeling early, during feature extraction. However, insufficient discrimination may result in ambiguous relation modeling at this early stage. Moreover, non-target information occupies most of the search image, rendering much of the relation modeling futile. To tackle the above issues, we propose tracking via learning adaptive target-oriented representation, named ATOTrack. We design an untied positional encoding to mark the template tokens and the search region tokens separately, which reduces confusion in the relationship between the template and the search region. Besides, we introduce an Auto-Mask Learner to decouple the target and non-target information in the search region. Interestingly, the Auto-Mask Learner can self-learn to mask ineffective information, yielding an interpretable adaptive target-oriented representation. Extensive experiments demonstrate that ATOTrack is superior to existing methods, achieving state-of-the-art performance on six tracking benchmarks. In particular, ATOTrack establishes a new record on AViST with 57% AO. The code and models will be released soon.

NeurIPS Conference 2025 Conference Paper

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

  • Yu Shang
  • Peijie Liu
  • Yuwei Yan
  • Zijing Wu
  • Leheng Sheng
  • Yuanqing Yu
  • Chumeng Jiang
  • An Zhang

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs’ advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing agentic recommender systems; and (3) the first comprehensive benchmark comparing over 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a maintained leaderboard at https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.

AAAI Conference 2025 Conference Paper

AnyTalk: Multi-modal Driven Multi-domain Talking Head Generation

  • Yu Wang
  • Yunfei Liu
  • Fa-Ting Hong
  • Meng Cao
  • Lijian Lin
  • Yu Li

Cross-domain talking head generation, such as animating a static cartoon animal photo with real human video, is crucial for personalized content creation. However, prior works typically rely on domain-specific frameworks and paired videos, limiting their utility and complicating their architectures with additional motion alignment modules. Addressing these shortcomings, we propose AnyTalk, a unified framework that eliminates the need for paired data and learns a shared motion representation across different domains. The motion is represented by canonical 3D keypoints extracted using an unsupervised 3D keypoint detector. Further, we propose an expression consistency loss to improve the accuracy of facial dynamics in video generation. Additionally, we present AniTalk, a comprehensive dataset designed for advanced multi-modal cross-domain generation. Our experiments demonstrate that AnyTalk excels at generating high-quality, multi-modal talking head videos, showcasing remarkable generalization capabilities across diverse domains.

ICRA Conference 2025 Conference Paper

AVD2: Accident Video Diffusion for Accident Video Description

  • Cheng Li
  • Keyuan Zhou
  • Tong Liu
  • Yu Wang
  • Mingqiao Zhuang
  • Huan-ang Gao
  • Bu Jin
  • Hao Zhao 0002

Traffic accidents present complex challenges for autonomous driving, often featuring unpredictable scenarios that hinder accurate system interpretation and responses. Nonetheless, prevailing methodologies fall short in elucidating the causes of accidents and proposing preventive measures due to the paucity of training data specific to accident scenarios. In this work, we introduce AVD2 (Accident Video Diffusion for Accident Video Description), a novel framework that enhances accident scene understanding by generating accident videos aligned with detailed natural language descriptions and reasoning, resulting in the contributed EMM-AU (Enhanced Multi-Modal Accident Video Understanding) dataset. Empirical results reveal that the integration of the EMM-AU dataset establishes state-of-the-art performance across both automated metrics and human evaluations, markedly advancing the domains of accident analysis and prevention. Project resources are available at https://an-answer-tree.github.io

JBHI Journal 2025 Journal Article

Cascade-based Pancreatic Tumor Segmentation via Interactive Enhancement and Fine Localization

  • Jianxing Ma
  • Yu Wang
  • Shakir Khan
  • Farman Ali

Accurate delineation of pancreatic tumors in computed tomography (CT) images is essential for timely diagnosis, guiding therapeutic strategies, and predicting clinical outcomes in pancreatic cancer. However, the small size, irregular morphology, and complex spatial context of pancreatic tumors pose significant challenges for conventional segmentation methods, often leading to misclassification or omission of tumor regions. To address these limitations, we propose a coarse-to-fine dual-stage segmentation framework. In the first stage, a coarse segmentation network, built on a multi-scale backbone, extracts preliminary tumor regions by leveraging rich contextual features. These coarse predictions are refined by an interaction enhancement module that transforms initial visual cues into tumor-aware priors and spatial weights to localize and crop tumor candidates. In the second stage, a fine segmentation network equipped with a class-aware boundary-refinement loss further enhances the delineation of small tumor structures and inter-class boundaries. Extensive experiments on a benchmark pancreatic tumor dataset demonstrate that our framework achieves superior performance, with an average DSC of 60.24%, consistently outperforming existing baselines and highlighting the effectiveness of the proposed approach.

NeurIPS Conference 2025 Conference Paper

Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

  • Enshu Liu
  • Qian Chen
  • Xuefei Ning
  • Shengen Yan
  • Guohao Dai
  • Zinan Lin
  • Yu Wang

Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model which provides the ground-truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel conditional score distillation loss to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline DD1, DD2 reduces the gap between one-step sampling and the original AR model by 67%, with up to 12.3× training speed-up. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at https://github.com/imagination-research/Distilled-Decoding-2.

AAAI Conference 2025 Conference Paper

Enhancing Contrastive Learning Inspired by the Philosophy of “The Blind Men and the Elephant”

  • Yudong Zhang
  • Ruobing Xie
  • Jiansheng Chen
  • Xingwu Sun
  • Zhanhui Kang
  • Yu Wang

Contrastive learning is a prevalent technique in self-supervised vision representation learning, typically generating positive pairs by applying two data augmentations to the same image. Designing effective data augmentation strategies is crucial for the success of contrastive learning. Inspired by the story of the blind men and the elephant, we introduce JointCrop and JointBlur. These methods generate more challenging positive pairs by leveraging the joint distribution of the two augmentation parameters, thereby enabling contrastive learning to acquire more effective feature representations. To the best of our knowledge, this is the first effort to explicitly incorporate the joint distribution of two data augmentation parameters into contrastive learning. As a plug-and-play framework without additional computational overhead, JointCrop and JointBlur enhance the performance of SimCLR, BYOL, MoCo v1, MoCo v2, MoCo v3, SimSiam, and Dino baselines with notable improvements.

TIST Journal 2025 Journal Article

Fairness and Diversity in Recommender Systems: A Survey

  • Yuying Zhao
  • Yu Wang
  • Yunchao Liu
  • Xueqi Cheng
  • Charu C. Aggarwal
  • Tyler Derr

Recommender systems (RS) are effective tools for mitigating information overload and have seen extensive applications across various domains. However, the single focus on utility goals proves to be inadequate in addressing real-world concerns, leading to increasing attention to fairness-aware and diversity-aware RS. While most existing studies explore fairness and diversity independently, we identify strong connections between these two domains. In this survey, we first discuss each of them individually and then dive into their connections. Additionally, motivated by the concepts of user-level and item-level fairness, we broaden the understanding of diversity to encompass not only the item level but also the user level. With this expanded perspective on user and item-level diversity, we re-interpret fairness studies from the viewpoint of diversity. This fresh perspective enhances our understanding of fairness-related work and paves the way for potential future research directions. Articles discussed in this survey along with public code links are available at: https://github.com/YuyingZhao/Awesome-Fairness-and-Diversity-Papers-in-Recommender-Systems

NeurIPS Conference 2025 Conference Paper

FedMGP: Personalized Federated Learning with Multi-Group Text-Visual Prompts

  • Weihao Bo
  • Yanpeng Sun
  • Yu Wang
  • Xinyu Zhang
  • Zechao Li

In this paper, we introduce FedMGP, a new paradigm for personalized federated prompt learning in vision-language models (VLMs). Existing federated prompt learning (FPL) methods often rely on a single, text-only prompt representation, which leads to client-specific overfitting and unstable aggregation under heterogeneous data distributions. Toward this end, FedMGP equips each client with multiple groups of paired textual and visual prompts, enabling the model to capture diverse, fine-grained semantic and instance-level cues. A diversity loss is introduced to drive each prompt group to specialize in distinct and complementary semantic aspects, ensuring that the groups collectively cover a broader range of local characteristics. During communication, FedMGP employs a dynamic prompt aggregation strategy based on similarity-guided probabilistic sampling: each client computes the cosine similarity between its prompt groups and the global prompts from the previous round, then samples s groups via a softmax-weighted distribution. This soft selection mechanism preferentially aggregates semantically aligned knowledge while still enabling exploration of underrepresented patterns—effectively balancing the preservation of common knowledge with client-specific features. Notably, FedMGP maintains parameter efficiency by redistributing a fixed prompt capacity across multiple groups, achieving state-of-the-art performance with the lowest communication parameters (5.1k) among all federated prompt learning methods. Theoretical analysis shows that our dynamic aggregation strategy promotes robust global representation learning by reinforcing shared semantics while suppressing client-specific noise. Extensive experiments demonstrate that FedMGP consistently outperforms prior approaches in both personalization and domain generalization across diverse federated vision-language benchmarks. The code will be released on https://github.com/weihao-bo/FedMGP.git.
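The similarity-guided sampling step described above can be sketched as follows. This is a minimal stand-alone version under assumptions (prompts flattened to plain vectors, a single global prompt, a `temp` parameter), not the paper's code; it draws `s` groups without replacement from a softmax over cosine similarities.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two non-zero vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def sample_groups(local_groups, global_prompt, s=2, temp=1.0, rng=random):
    """Softmax-weighted sampling without replacement of s prompt groups,
    weighted by cosine similarity to the previous round's global prompt."""
    sims = [cosine(g, global_prompt) for g in local_groups]
    weights = [math.exp(x / temp) for x in sims]
    idx = list(range(len(local_groups)))
    chosen = []
    for _ in range(min(s, len(idx))):
        total = sum(weights[i] for i in idx)
        r = rng.random() * total
        acc, picked = 0.0, None
        for i in idx:
            acc += weights[i]
            if r <= acc:
                picked = i
                break
        if picked is None:  # guard against floating-point shortfall
            picked = idx[-1]
        chosen.append(picked)
        idx.remove(picked)
    return chosen
```

Because the selection is soft rather than top-s, low-similarity groups retain a non-zero chance of being aggregated, which is the "exploration of underrepresented patterns" the abstract refers to.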

NeurIPS Conference 2025 Conference Paper

Graphs Help Graphs: Multi-Agent Graph Socialized Learning

  • Jialu Li
  • Yu Wang
  • Pengfei Zhu
  • Wanyu Lin
  • Xinjie Yao
  • Qinghua Hu

Graphs in the real world are fragmented and dynamic, lacking collaboration akin to that observed in human societies. Existing paradigms suffer from collaborative information collapse and forgetting, leaving collaborative relationships poorly autonomous and interactive information insufficient. Moreover, collaborative information is prone to loss when the graph grows. Effective collaboration in heterogeneous dynamic graph environments thus becomes challenging. Inspired by social learning, this paper presents a Graph Socialized Learning (GSL) paradigm. We provide insights into graph socialization in GSL and boost the performance of agents through effective collaboration. It is crucial to determine with whom, what, and when to share and accumulate information for effective GSL. Thus, we propose the ''Graphs Help Graphs'' (GHG) method to solve these issues. Specifically, it uses a graph-driven organizational structure to select interacting agents and manage interaction strength autonomously. We produce customized synthetic graphs as an interactive medium based on the demands of agents, then apply the synthetic graphs to build prototypes over the life cycle to help select optimal parameters. We demonstrate the effectiveness of GHG in heterogeneous dynamic graphs through an extensive empirical study. The code is available at https://github.com/Jillian555/GHG.

IROS Conference 2025 Conference Paper

HEATS: A Hierarchical Framework for Efficient Autonomous Target Search with Mobile Manipulators

  • Hao Zhang
  • Yifei Wang
  • Weifan Zhang
  • Yu Wang
  • Haoyao Chen

Utilizing robots for autonomous target search in complex and unknown environments can greatly improve the efficiency of search and rescue missions. However, existing methods have shown inadequate performance due to hardware platform limitations, inefficient viewpoint selection strategies, and conservative motion planning. In this work, we propose HEATS, which enhances the search capability of mobile manipulators in complex and unknown environments. We design a target viewpoint planner tailored to the strengths of mobile manipulators, ensuring efficient and comprehensive viewpoint planning. Supported by this, a whole-body motion planner integrates global path search with local IPC optimization, enabling the mobile manipulator to safely and agilely visit target viewpoints, significantly improving search performance. We present extensive simulated and real-world tests, in which our method demonstrates reduced search time, higher target search completeness, and lower movement cost compared to classic and state-of-the-art approaches. Our method will be open-sourced for community benefit.

IROS Conference 2025 Conference Paper

High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects

  • Jialong Xue
  • Wei Gao
  • Yu Wang
  • Chao Ji
  • Dongdong Zhao
  • Shi Yan
  • Shiwu Zhang

High-precision tiny object alignment remains a common and critical challenge for humanoid robots in the real world. To address this problem, this paper proposes a vision-based framework for precisely estimating and controlling the relative position between a handheld tool and a target object for humanoid robots, e.g., a screwdriver tip and a screw head slot. By fusing images from the head and torso cameras on a robot with its head joint angles, the proposed Transformer-based visual servoing method can correct the handheld tool's positional errors effectively, especially at a close distance. Experiments on M4-M8 screws demonstrate an average convergence error of 0.8-1.3 mm and a success rate of 93%-100%. Through comparative analysis, the results validate that this capability of high-precision tiny object alignment is enabled by the Distance Estimation Transformer architecture and the Multi-Perception-Head mechanism proposed in this paper.

AAAI Conference 2025 Conference Paper

Holistic Semantic Representation for Navigational Trajectory Generation

  • Ji Cao
  • Tongya Zheng
  • Qinghong Guo
  • Yu Wang
  • Junshu Dai
  • Shunyu Liu
  • Jie Yang
  • Jie Song

Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.

ICRA Conference 2025 Conference Paper

Human-Robot Cooperative Distribution Coupling for Hamiltonian-Constrained Social Navigation

  • Weizheng Wang 0004
  • Chao Yu
  • Yu Wang
  • Byung-Cheol Min

Navigating in human-filled public spaces is a critical challenge for deploying autonomous robots in real-world environments. This paper introduces NaviDIFF, a novel Hamiltonian-constrained socially-aware navigation framework designed to address the complexities of human-robot interaction and socially-aware path planning. NaviDIFF integrates a port-Hamiltonian framework to model dynamic physical interactions and a diffusion model to manage uncertainty in human-robot cooperation. The framework leverages a spatial-temporal transformer to capture social and temporal dependencies, enabling more accurate spatial-temporal environmental dynamics understanding and port-Hamiltonian physical interactive process construction. Additionally, reinforcement learning from human feedback is employed to fine-tune robot policies, ensuring adaptation to human preferences and social norms. Extensive experiments demonstrate that NaviDIFF outperforms state-of-the-art methods in social navigation tasks, offering improved stability, efficiency, and adaptability. The experimental videos and additional information about this work can be found at: https://sites.google.com/view/NaviDIFF.

JMLR Journal 2025 Journal Article

Learning Global Nash Equilibrium in Team Competitive Games with Generalized Fictitious Cross-Play

  • Zelai Xu
  • Chao Yu
  • Yancheng Liang
  • Yi Wu
  • Yu Wang

Self-play (SP) is a popular multi-agent reinforcement learning framework for competitive games. Despite the empirical success, the theoretical properties of SP are limited to two-player settings. For team competitive games where two teams of cooperative agents compete with each other, we show a counter-example where SP cannot converge to a global Nash equilibrium (NE) with high probability. Policy-Space Response Oracles (PSRO) is an alternative framework that finds NEs by iteratively learning the best response (BR) to previous policies. PSRO can be directly extended to team competitive games with unchanged convergence properties by learning team BRs, but its repeated training from scratch makes it hard to scale to complex games. In this work, we propose Generalized Fictitious Cross-Play (GFXP), a novel algorithm that inherits benefits from both frameworks. GFXP simultaneously trains an SP-based main policy and a counter population. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the BRs to the main policy's checkpoints. We evaluate GFXP in matrix games and gridworld domains where GFXP achieves the lowest exploitabilities. We further conduct experiments in a challenging football game where GFXP defeats SOTA models with over 94% win rate.

NeurIPS Conference 2025 Conference Paper

PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models

  • Tianchen Zhao
  • Ke Hong
  • Xinhao Yang
  • Xuefeng Xiao
  • Huixia Li
  • Feng Ling
  • Ruiqi Xie
  • Siqi Chen

In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs, especially for the longer token sequences required in high-resolution image or multi-frame video generation. To address this, prior research has explored techniques such as sparsification and quantization. However, these techniques face significant challenges under low density and reduced bitwidths. Through systematic analysis, we identify that the core difficulty stems from the dispersed and irregular characteristics of visual attention patterns. Therefore, instead of introducing specialized sparsification and quantization designs to accommodate such patterns, we propose an alternative strategy: "reorganizing" the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel Pattern-Aware token ReOrdering (PARO) technique, which unifies the diverse attention patterns into a hardware-friendly block-wise pattern. This unification substantially simplifies and enhances both sparsification and quantization. We evaluate the performance-efficiency trade-offs of various design choices and finalize a methodology tailored for the unified pattern. Our approach, PAROAttention, achieves video and image generation with lossless metrics and nearly identical results to full-precision (FP) baselines, while operating at notably lower density (20%-30%) and bitwidth (INT8/INT4), achieving a 1.9-2.7x end-to-end latency speedup.

ICML Conference 2025 Conference Paper

PEINR: A Physics-enhanced Implicit Neural Representation for High-Fidelity Flow Field Reconstruction

  • Liming Shen
  • Liang Deng
  • Chongke Bi
  • Yu Wang
  • Xinhai Chen
  • Yueqing Wang
  • Jie Liu

Implicit neural representation (INR) has now been thrust into the limelight with its flexibility in high-fidelity flow field reconstruction tasks. However, the lack of standard benchmarking datasets and the grid independence assumption for INR-based methods hinder progress and adoption in real-world simulation scenarios. Moreover, naive adoptions of existing INR frameworks suffer from limited accuracy in capturing fine-scale structures and spatiotemporal dynamics. Tackling these issues, we first introduce HFR-Beach, a 5.4 TB public large-scale CFD dataset with 33,600 unsteady 2D and 3D vector fields for reconstructing high-fidelity flow fields. We further present PEINR, a physics-enhanced INR framework, to enrich the flow fields by concurrently enhancing numerical precision and grid resolution. Specifically, PEINR is mainly composed of physical encoding and a transformer-based spatiotemporal fuser (TransSTF). Physical encoding decouples temporal and spatial components, employing Gaussian coordinate encoding and localized encoding techniques to capture the nonlinear characteristics of spatiotemporal dynamics and the stencil discretization of spatial dimensions, respectively. TransSTF fuses both spatial and temporal information via transformer for capturing long-range temporal dependencies. Qualitative and quantitative experiments demonstrate that PEINR outperforms state-of-the-art INR-based methods in reconstruction quality.
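To give a flavor of what a Gaussian coordinate encoding can look like, here is a hypothetical 1-D version (the RBF form, centers, and width are assumptions for illustration, not the paper's design): a scalar coordinate is expanded into localized Gaussian activations, giving the INR a nonlinear, locality-aware input representation.

```python
import math

def gaussian_encode(coord, centers, sigma=0.25):
    """Expand a scalar coordinate into soft RBF activations over fixed
    centers; each output is near 1 when the coordinate is near that center
    and decays smoothly with distance."""
    return [math.exp(-((coord - c) ** 2) / (2 * sigma ** 2)) for c in centers]
```

A coordinate sitting exactly on a center activates that channel fully, so nearby coordinates produce similar but distinguishable encodings.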

TMLR Journal 2025 Journal Article

Personalization of Large Language Models: A Survey

  • Zhehao Zhang
  • Ryan A. Rossi
  • Branislav Kveton
  • Yijia Shao
  • Diyi Yang
  • Hamed Zamani
  • Franck Dernoncourt
  • Joe Barrow

Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners.

NeurIPS Conference 2025 Conference Paper

Point4Bit: Post Training 4-bit Quantization for Point Cloud 3D Detection

  • Jianyu Wang
  • Yu Wang
  • Shengjie Zhao
  • Sifan Zhou

Voxel-based 3D object detectors have achieved remarkable performance in point cloud perception, yet their high computational and memory demands pose significant challenges for deployment on resource-constrained edge devices. Post-training quantization (PTQ) provides a practical means to compress models and accelerate inference; however, existing PTQ methods for point cloud detection are typically limited to INT8 and lack support for lower-bit formats such as INT4, which restricts their deployment potential. In this paper, we present Point4Bit, the first general 4-bit PTQ framework tailored for voxel-based 3D object detectors. To tackle challenges in low-bit quantization, we propose two key techniques: (1) Foreground-aware Piecewise Activation Quantization (FA-PAQ), which leverages foreground structural cues to improve the quantization of sparse activations; and (2) Gradient-guided Key Weight Quantization (G-KWQ), which preserves task-critical weights through gradient-based analysis to reduce quantization-induced degradation. Extensive experiments demonstrate that Point4Bit achieves INT4 quantization with a minimal accuracy loss of less than 1.5\%. Moreover, we validate its generalization ability on point cloud classification and segmentation tasks, demonstrating broad applicability. Our method further advances the bit-width limitation of point cloud quantization to 4 bits, demonstrating strong potential for efficient deployment on resource-constrained edge devices.
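The gist of piecewise activation quantization can be sketched in isolation. This is a simplified stand-in, not FA-PAQ itself: the breakpoint `split` is a hypothetical fixed threshold (the paper derives its pieces from foreground cues), but it shows why giving small and large activations separate 4-bit scales reduces error on sparse large values.

```python
def quantize_piecewise(xs, bits=4, split=1.0):
    """Quantize each value with one of two symmetric scales, chosen by
    whether its magnitude falls below or above the breakpoint."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed INT4
    lo = [x for x in xs if abs(x) <= split] or [0.0]
    hi = [x for x in xs if abs(x) > split] or [split]
    s_lo = (max(abs(v) for v in lo) / qmax) or 1e-8  # avoid zero scale
    s_hi = max(abs(v) for v in hi) / qmax
    out = []
    for x in xs:
        s = s_lo if abs(x) <= split else s_hi
        q = max(-qmax - 1, min(qmax, round(x / s)))  # clamp to INT4 range
        out.append(q * s)  # dequantized value
    return out
```

With a single scale, one large outlier would stretch the quantization grid and wash out all the small activations; the two-piece scheme keeps both regions accurate.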

ICML Conference 2025 Conference Paper

Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

  • Weilun Feng
  • Chuanguang Yang
  • Haotong Qin
  • Xiangqi Li
  • Yu Wang
  • Zhulin An
  • Libo Huang
  • Boyu Diao

Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by 1.9$\times$.

NeurIPS Conference 2025 Conference Paper

R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

  • Tianyu Fu
  • Yi Ge
  • Yichen You
  • Enshu Liu
  • Zhihang Yuan
  • Guohao Dai
  • Shengen Yan
  • Huazhong Yang

Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing significant deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token router that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6×, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8× wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency.
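The routing loop described above can be sketched generically. All three callables below are hypothetical stand-ins (a real system would run neural models and a learned router); the point is the control flow: the SLM proposes every token, and only router-flagged positions are escalated to the LLM.

```python
def route_generate(prompt, slm_next, llm_next, router, max_new=8):
    """R2R-flavored decoding loop: slm_next/llm_next map a token list to the
    next token; router decides whether the SLM's proposal is path-divergent
    and should be regenerated by the LLM. Returns the sequence and the
    number of (expensive) LLM calls actually made."""
    tokens = list(prompt)
    llm_calls = 0
    for _ in range(max_new):
        proposal = slm_next(tokens)
        if router(tokens, proposal):  # divergent token: escalate to the LLM
            proposal = llm_next(tokens)
            llm_calls += 1
        tokens.append(proposal)
    return tokens, llm_calls
```

If the router fires rarely, the cost per token approaches the SLM's, which is the source of the reported speedups.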

ICRA Conference 2025 Conference Paper

Real-Time LiDAR Point Cloud Compression and Transmission for Resource-Constrained Robots

  • Yuhao Cao
  • Yu Wang
  • Haoyao Chen

LiDARs are widely used in autonomous robots due to their ability to provide accurate environment structural information. However, the large size of point clouds poses challenges in terms of data storage and transmission. In this paper, we propose a novel point cloud compression and transmission framework for resource-constrained robotic applications, called RCPCC. We iteratively fit the surface of point clouds with similar range values and eliminate redundancy through their spatial relationships. Then, we use Shape-adaptive DCT (SA-DCT) to transform the unfit points and reduce the data volume by quantizing the transformed coefficients. We design an adaptive bitrate control strategy with QoE as the optimization goal to control the quality of the transmitted point cloud. Experiments show that our framework achieves compression rates of 40× to 80× while maintaining high accuracy for downstream applications. Our method significantly outperforms other baselines in terms of accuracy when the compression rate exceeds 70×. Furthermore, in situations of reduced communication bandwidth, our adaptive bitrate control strategy demonstrates significant QoE improvements. The code will be available at https://github.com/HITSZ-NRSL/RCPCC.git.

NeurIPS Conference 2025 Conference Paper

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

  • Tonghe Zhang
  • Chao Yu
  • Sichang Su
  • Yu Wang

We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants stably, including Rectified Flow [34] and Shortcut Models [18], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies obtained an average net growth of 135.36% after fine-tuning in challenging legged locomotion tasks while saving denoising steps and 82.63% of wall time compared to the state-of-the-art diffusion RL fine-tuning method DPPO [42]. The success rate of Shortcut Model policies in state and visual manipulation tasks achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, with performance comparable to fine-tuned DDIM policies while saving an average of 23.20% of computation time. Code, models, and checkpoints are available on the project website: https://reinflow.github.io/
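The noise-injection idea has a simple 1-D illustration. This is a toy, not ReinFlow itself (the `drift` function and per-step `sigmas` are made-up stand-ins for the flow velocity field and the learned noise): wrapping each deterministic Euler step in a Gaussian turns the trajectory into a discrete-time Markov process whose log-likelihood is an exact sum of transition log-densities.

```python
import math

def gaussian_logpdf(x, mean, sigma):
    """Log-density of N(mean, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)

def trajectory_logprob(states, drift, sigmas, dt=0.1):
    """Exact log-likelihood of a noisy flow trajectory: each transition is
    x_{k+1} ~ N(x_k + drift(x_k) * dt, sigmas[k]^2), so the trajectory
    log-probability is the sum of Gaussian transition log-densities."""
    lp = 0.0
    for k in range(len(states) - 1):
        mean = states[k] + drift(states[k]) * dt
        lp += gaussian_logpdf(states[k + 1], mean, sigmas[k])
    return lp
```

This tractable likelihood is exactly what a policy-gradient update needs, which is why the conversion makes online RL fine-tuning of flow policies straightforward.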

TMLR Journal 2025 Journal Article

Reliable and Responsible Foundation Models

  • Xinyu Yang
  • Junlin Han
  • Rishi Bommasani
  • Jinqi Luo
  • Wenjie Qu
  • Wangchunshu Zhou
  • Adel Bibi
  • Xiyao Wang

Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e., Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.

IROS Conference 2025 Conference Paper

RGB-Thermal Visual Place Recognition via Vision Foundation Model

  • Minghao Ye
  • Xiao Liu
  • Yu Wang
  • Lu Liu 0002
  • Haoyao Chen

Visual place recognition is a critical component of robust simultaneous localization and mapping systems. Conventional approaches primarily rely on RGB imagery, but their performance degrades significantly in extreme environments with poor illumination or airborne particulate interference (e.g., smoke or fog). Furthermore, existing techniques often struggle with cross-scenario generalization. To overcome these limitations, we propose an RGB-thermal multimodal fusion framework for place recognition, specifically designed to enhance robustness in extreme environmental conditions. Our framework incorporates a dynamic RGB-thermal fusion module, coupled with dual fine-tuned vision foundation models as the feature extraction backbone. Experimental results on public datasets and our self-collected dataset demonstrate that our method significantly outperforms state-of-the-art RGB-based approaches, achieving generalizable and robust retrieval capabilities across day and night scenarios. The code is available at https://github.com/HITSZ-NRSL/RGB-Thermal-VPR.

IJCAI Conference 2025 Conference Paper

Run Like a Neural Network, Explain Like k-Nearest Neighbor

  • Xiaomeng Ye
  • David Leake
  • Yu Wang
  • David Crandall

Deep neural networks have achieved remarkable performance across a variety of applications. However, their decision-making processes are opaque. In contrast, k-nearest neighbor (k-NN) provides interpretable predictions by relying on similar cases, but it lacks important capabilities of neural networks. The neural network k-nearest neighbor (NN-kNN) model is designed to bridge this gap, combining the benefits of neural networks with the instance-based interpretability of k-NN. However, the initial formulation of NN-kNN had limitations including scalability issues, reliance on surface-level features, and an excessive number of parameters. This paper improves NN-kNN by enhancing its scalability, parameter efficiency, ease of integration with feature extractors, and training simplicity. An evaluation of the revised architecture for image and language classification tasks illustrates its promise as a flexible and interpretable method.

IJCAI Conference 2025 Conference Paper

Sanitizing Backdoored Graph Neural Networks: A Multidimensional Approach

  • Rong Zhao
  • Jilian Zhang
  • Yu Wang
  • Yinyan Zhang
  • Jian Weng

Graph Neural Networks (GNNs) are known to be prone to adversarial attacks, among which backdoor attack is a major security threat. By injecting backdoor triggers into a graph and assigning a target class label to nodes attached to the triggers, the attacker can mislead the GNN model trained on the poisoned graph to classify test nodes attached with a trigger to the target class. To defend against backdoor attacks, existing defense methods rely on anomaly detection in feature distribution or label transformation. However, these approaches are incapable of detecting in-distribution triggers or clean-label attacks that do not alter the class label of target nodes. To tackle these threats, we empirically analyze triggers from a multidimensional aspect, and our analysis shows that there are clear distinctions between trigger nodes and normal ones in terms of node feature values, node embeddings, and class prediction probabilities. Based on these findings, we propose a Multidimensional Anomaly Detection framework (MAD) that can effectively minimize the impact of triggers by pruning away anomalous nodes and edges. Extensive experiments show that at the cost of slight loss in clean classification accuracy, MAD achieves considerably lower attack success rate as compared to state-of-the-art backdoor defense methods.
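The multidimensional detection idea can be sketched with per-node summary statistics. This is a simplified stand-in for MAD (the three scalar inputs and the z-score rule are assumptions; the paper works on raw features, embeddings, and prediction probabilities): a node is flagged if it is an outlier in any of the three dimensions, so a trigger that hides in one dimension can still be caught in another.

```python
import statistics

def zscores(values):
    """Standardize a list of scalars; degenerate (constant) lists map to 0."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0
    return [(v - mu) / sd for v in values]

def flag_anomalies(feature_stat, embed_stat, prob_stat, thresh=2.0):
    """Flag node i if its z-score exceeds thresh in ANY dimension
    (feature values, embeddings, or class prediction probabilities)."""
    dims = [zscores(feature_stat), zscores(embed_stat), zscores(prob_stat)]
    return [any(abs(d[i]) > thresh for d in dims)
            for i in range(len(feature_stat))]
```

Flagged nodes (and their incident edges) would then be pruned before training, which is how the framework limits the triggers' influence.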

NeurIPS Conference 2025 Conference Paper

Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

  • Yao Teng
  • Fu-Yun Wang
  • Xian Liu
  • Zhekai Chen
  • Han Shi
  • Yu Wang
  • Zhenguo Li
  • Weiyang Liu

As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
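The Jacobi-iteration backbone of this family of methods can be shown in miniature. The sketch below omits the noise-perturbed embeddings and probabilistic acceptance of SJD2 and uses a deterministic toy model (`next_token` is a hypothetical greedy predictor): all positions are updated in parallel from the previous draft until the sequence reaches the autoregressive fixed point.

```python
def jacobi_decode(init, next_token, iters=10):
    """Jacobi-style parallel decoding: update every position i
    simultaneously as x_i <- f(x_<i) from the previous draft, instead of
    decoding sequentially. With a deterministic model the draft converges
    to the autoregressive answer in at most len(init) iterations."""
    seq = list(init)
    for _ in range(iters):
        new = [next_token(seq[:i]) for i in range(len(seq))]
        if new == seq:  # fixed point reached: matches sequential decoding
            return seq
        seq = new
    return seq
```

Each Jacobi sweep costs one parallel model pass but can settle several positions at once, which is where the reduction in forward passes comes from.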

NeurIPS Conference 2025 Conference Paper

StruDiCO: Structured Denoising Diffusion with Gradient-free Inference-stage Boosting for Memory and Time Efficient Combinatorial Optimization

  • Yu Wang
  • Yang Li
  • Junchi Yan
  • Yi Chang

Diffusion models have recently emerged as powerful neural solvers for combinatorial optimization (CO). However, existing approaches fail to reveal how variables are progressively determined during inference, making the final solution opaque until the last step. To address this limitation, we propose a structured denoising diffusion model, StruDiCO, which incrementally constructs solutions through step-wise variable selection. This is achieved via a variable-absorption noising model, wherein the forward process simulates gradual variable deactivation, converging to an empty solution, while the reverse process incrementally selects variables to reconstruct the final solution. This design induces structural continuity across intermediate states, enabling interpretable and trajectory-consistent partial solutions throughout inference. To further improve the reliability of reverse inference, we introduce a constrained consistency sampling strategy, which suppresses low-confidence variable selection at each step to stabilize the reverse process. Leveraging the structure-preserving reverse process, we further propose a lightweight, gradient-free, objective-aware refinement framework, which iteratively improves solution quality by applying structure-aware perturbations to the current solution, performing reverse inference through the constrained consistency model, and decoding with an objective-guided scoring scheme. Extensive experiments on two canonical CO tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), show that StruDiCO outperforms state-of-the-art diffusion-based solvers, achieving up to $3.5\times$ faster inference, 70\% lower GPU memory usage, and significantly improved solution quality, with up to 37.7\% drop reduction on TSP and an average 38.1\% improvement on MIS. The code is publicly available at https://github.com/yuuuuwang/StruDiCO.
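The variable-absorption forward process described above can be mimicked with a toy simulation: starting from a full solution, variables are gradually absorbed until the empty solution is reached. This is a hedged sketch, not the paper's noising model; the linear absorption schedule is an assumption:

```python
import random

def absorb_forward(solution, num_steps, rng=None):
    """Illustrative variable-absorption forward process.

    `solution` is a set of selected variable indices. At each step, every
    still-active variable is independently absorbed (deactivated); the keep
    probability decays linearly so the process reaches the empty solution by
    the final step. Returns the list of intermediate states, which form a
    nested chain of partial solutions.
    """
    rng = rng or random.Random(0)
    states = [set(solution)]
    for t in range(1, num_steps + 1):
        keep_prob = 1.0 - t / num_steps  # linear schedule; hits 0 at t = num_steps
        states.append({v for v in states[-1] if rng.random() < keep_prob})
    return states
```

Because each state is a subset of the previous one, the reverse of this chain reads as a step-wise variable selection, which is the structural-continuity property the abstract highlights.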

TMLR Journal 2025 Journal Article

Towards LifeSpan Cognitive Systems

  • Yu Wang
  • Chi Han
  • Tongtong Wu
  • Xiaoxin He
  • Wangchunshu Zhou
  • Nafis Sadeq
  • Xiusi Chen
  • Zexue He

Building a human-like system that continuously interacts with complex environments—whether simulated digital worlds or human society—presents several key challenges. Central to this is enabling continuous, high-frequency interactions, where the interactions are termed experiences. We refer to this envisioned system as the LifeSpan Cognitive System (LSCS). A critical feature of LSCS is its ability to engage in incremental and rapid updates while retaining and accurately recalling past experiences. In this paper, we focus on the domain of Large Language Models (LLMs), where we identify two major challenges: (1) Abstraction and Experience Merging, and (2) Long-term Retention with Accurate Recall. These properties are essential for storing new experiences, organizing past experiences, and responding to the environment in ways that leverage relevant historical data. Unlike language models with continual learning, which typically rely on large corpora for fine-tuning and focus on improving performance within specific domains or tasks, LSCS must rapidly and incrementally update with new information from its environment at a high frequency. Existing technologies with the potential to solve the above two major challenges can be classified into four classes based on a conceptual metric called Storage Complexity, which measures the relative space required to store past experiences. Each of these four classes of technologies has its own strengths and limitations, and we argue that none of them alone can achieve LSCS. To this end, we propose a potential instantiation for LSCS that can integrate all four classes of technologies. The new instantiation, serving as a conjecture, operates through two core processes: Absorbing Experiences and Generating Responses.

AAAI Conference 2025 Conference Paper

Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective

  • Bo Ni
  • Yu Wang
  • Lu Cheng
  • Erik Blasch
  • Tyler Derr

Recently, Knowledge Graphs (KGs) have been successfully coupled with Large Language Models (LLMs) to mitigate their hallucinations and enhance their reasoning capability, e.g., KG-based retrieval-augmented framework. However, current KG-LLM frameworks lack rigorous uncertainty estimation, limiting their reliable deployment in applications where the cost of errors is significant. Directly incorporating uncertainty quantification into KG-LLM frameworks presents a challenge due to their more complex architectures and the intricate interactions between the knowledge graph and language model components. To address this crucial gap, we propose a new trustworthy KG-LLM framework, UAG (Uncertainty Aware Knowledge-Graph Reasoning), which incorporates uncertainty quantification into the KG-LLM framework. We design an uncertainty-aware multi-step reasoning framework that leverages conformal prediction to provide a theoretical guarantee on the prediction set. To manage the error rate of the multi-step process, we additionally introduce an error rate control module to adjust the error rate within the individual components. Extensive experiments show that UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines.
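The conformal-prediction component that UAG builds on can be sketched via standard split conformal prediction. This is the generic textbook construction, not the paper's multi-step variant or its error rate control module:

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split conformal prediction: return the ceil((n+1)(1-alpha))-th order
    statistic of the calibration nonconformity scores (higher score = less
    conforming). Any answer scoring at or below this threshold enters the
    prediction set, which contains the true answer with marginal probability
    at least 1 - alpha."""
    n = len(cal_scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)  # 0-based index
    return sorted(cal_scores)[k]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose nonconformity score is within the
    calibrated threshold."""
    return {c for c, s in candidate_scores.items() if s <= threshold}
```

The appeal for KG-LLM reasoning is that the guarantee holds regardless of how opaque the underlying score function (here, the LLM's confidence) is, as long as calibration and test data are exchangeable.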

NeurIPS Conference 2025 Conference Paper

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

  • Zelai Xu
  • Ruize Zhang
  • Chao Yu
  • Huining Yuan
  • Xiangmin Yi
  • Shilong Ji
  • Chuqi Wang
  • Wenhao Tang

Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy RL methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves 69.5% win rate against the strongest baseline in the 3 vs 3 task, demonstrating its potential for tackling the complex interplay between low-level control and high-level strategy. To highlight VolleyBots’ sim-to-real potential, we further demonstrate the zero-shot deployment of a policy trained entirely in simulation on real-world drones.

NeurIPS Conference 2025 Conference Paper

What Can RL Bring to VLA Generalization? An Empirical Study

  • Jijia Liu
  • Feng Gao
  • Bingwen Wei
  • Xinlei Chen
  • Qingmin Liao
  • Yi Wu
  • Chao Yu
  • Yu Wang

Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io

AAAI Conference 2024 Conference Paper

Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning

  • Jiayu Chen
  • Zelai Xu
  • Yunfei Li
  • Chao Yu
  • Jiaming Song
  • Huazhong Yang
  • Fei Fang
  • Yu Wang

Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames, i.e., games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-neurips.
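The particle-based state sampler can be illustrated in miniature. The weighting function standing in for the paper's squared-distance-to-NE metric is a placeholder assumption for the example:

```python
import random

def sample_initial_states(particles, weight_fn, batch, rng=None):
    """Illustrative particle-based subgame sampler.

    `particles` is a buffer of previously visited states; `weight_fn` scores
    each state by the subgame-selection metric (in SACL, an approximation of
    the squared distance of its value estimate to the NE value), so subgames
    where learning can still make progress are reset to more often.
    """
    rng = rng or random.Random(0)
    weights = [weight_fn(state) for state in particles]
    return rng.choices(particles, weights=weights, k=batch)
```

States whose values have already converged get near-zero weight and effectively drop out of the curriculum, which is the mechanism behind the adaptive initial state distribution described above.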

AAAI Conference 2024 Conference Paper

Binding-Adaptive Diffusion Models for Structure-Based Drug Design

  • Zhilin Huang
  • Ling Yang
  • Zaixi Zhang
  • Xiangxin Zhou
  • Yu Bao
  • Xiawu Zheng
  • Yuwei Yang
  • Yu Wang

Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models including diffusion models have shown great promise for SBDD. However, it is complex to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract subcomplex, the essential part of binding sites responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex for augmenting the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with cross-hierarchy interaction node for adequately fusing global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Avg. Vina Score, while maintaining proper molecular properties. Our code is available at https://github.com/YangLing0818/BindDM

NeurIPS Conference 2024 Conference Paper

Can LLMs Learn by Teaching for Better Reasoning? A Preliminary Study

  • Xuefei Ning
  • Zifu Wang
  • Shiyao Li
  • Zinan Lin
  • Peiran Yao
  • Tianyu Fu
  • Matthew B. Blaschko
  • Guohao Dai

Teaching to improve student models (e.g., knowledge distillation) is an extensively studied methodology in LLMs. However, in human education, teaching enhances not only the students but also the teachers by fostering more rigorous and clearer reasoning, as well as deeper knowledge building. We ask: Can LLMs also learn by teaching (LbT) for better reasoning? If the answer is yes, we can potentially unlock the possibility of continuously advancing the models without solely relying on human-produced data or stronger models. In this paper, we provide a preliminary exploration of this question. We show that LbT ideas can be incorporated into existing LLM training/prompting pipelines and bring improvements. Specifically, we design three methods, each mimicking one of the three levels of LbT: observing students' feedback, learning from the feedback, and learning iteratively, with the goal of improving answer accuracy without training or improving models' inherent capability with fine-tuning. We reveal some findings: (1) Teaching materials that make it easier for students to learn (via in-context learning) have clearer and more accurate logic; (2) Weak-to-strong generalization: LbT might help improve strong models by teaching weak models; (3) Diversity in students might help: teaching multiple students could be better than teaching a single student or the teacher alone. We hope that our exploration can inspire future research on LbT and, more broadly, the adoption of advanced education techniques to improve LLMs. The code and website are at https://github.com/imagination-research/lbt and https://sites.google.com/view/llm-learning-by-teaching.

ICRA Conference 2024 Conference Paper

Continuous Robotic Tracking of Dynamic Targets in Complex Environments Based on Detectability

  • Zhihao Wang 0003
  • Shixing Huang
  • Minghang Li
  • Junyuan Ouyang
  • Yu Wang
  • Haoyao Chen

Target tracking is a fundamental task in the domain of robotics. The effectiveness of target tracking hinges upon various factors, such as tracking distance, occlusions, collision avoidance, etc. However, few existing works can simultaneously tackle these considerations of tracking single and multiple targets in complex environments. In this study, the interaction mechanism of target tracking between the robot, the environment and the targets is analyzed, and a general measure named detectability is introduced to correlate the tracking performance for guiding robotic motion planning. Based on the detectability measure, a robotic motion planning framework based on Model Predictive Control (MPC) is proposed to achieve continuous and robust tracking of single, two and three targets in complex environments. Simulations and experiments verify that the proposed method performs better than state-of-the-art methods.

TMLR Journal 2024 Journal Article

Contrastive Learning with Consistent Representations

  • Zihu Wang
  • Yu Wang
  • Zhuotong Chen
  • Hanbin Hu
  • Peng Li

Contrastive learning demonstrates great promise for representation learning. Data augmentations play a critical role in contrastive learning by providing informative views of the data without necessitating explicit labels. Nonetheless, the efficacy of current methodologies heavily hinges on the quality of employed data augmentation (DA) functions, often chosen manually from a limited set of options. While exploiting diverse data augmentations is appealing, the complexities inherent in both DAs and representation learning can lead to performance deterioration. Addressing this challenge and facilitating the systematic incorporation of diverse data augmentations, this paper proposes Contrastive Learning with Consistent Representations (CoCor). At the heart of CoCor is a novel consistency metric termed DA consistency. This metric governs the mapping of augmented input data to the representation space. Moreover, we propose to learn the optimal mapping locations as a function of DA. Experimental results demonstrate that CoCor notably enhances the generalizability and transferability of learned representations in comparison to baseline methods. The implementation of CoCor can be found at https://github.com/zihuwang97/CoCor.

AAAI Conference 2024 Conference Paper

Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation

  • Yu Wang
  • Zexue He
  • Zhankui He
  • Hao Xu
  • Julian McAuley

Understanding and accurately explaining compatibility relationships between fashion items is a challenging problem in the burgeoning domain of AI-driven outfit recommendations. Present models, while making strides in this area, still occasionally fall short, offering explanations that can be elementary and repetitive. This work aims to address these shortcomings by introducing the Pair Fashion Explanation (PFE) dataset, a unique resource that has been curated to illuminate these compatibility relationships. Furthermore, we propose an innovative two-stage pipeline model that leverages this dataset. Fine-tuning on this dataset allows the model to generate explanations that convey the compatibility relationships between items. Our experiments showcase the model's potential in crafting descriptions that are knowledgeable, aligned with ground-truth matching correlations, understandable, and informative, as assessed by both automatic metrics and human evaluation. Our code and data are released at https://github.com/wangyu-ustc/PairFashionExplanation.

NeurIPS Conference 2024 Conference Paper

DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

  • Haowei Zhu
  • Dehua Tang
  • Ji Liu
  • Mingjie Lu
  • Jintu Zheng
  • Jinzhang Peng
  • Dong Li
  • Yu Wang

Diffusion models have achieved remarkable progress in the field of image generation due to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been employed to optimize these models, the retraining process necessitates large-scale training datasets and extensive computational costs to maintain generalization ability, making it neither convenient nor efficient. Recent studies attempt to utilize the similarity of features across adjacent denoising stages to reduce computational costs through simple and static strategies. However, these strategies cannot fully harness the potential of the similar feature patterns across adjacent timesteps. In this work, we propose a novel pruning method that derives an efficient diffusion model via a more intelligent and differentiable pruner. At the core of our approach is casting the model pruning process into a SubNet search process. Specifically, we first introduce a SuperNet based on standard diffusion via adding some backup connections built upon the similar features. We then construct a plugin pruner network and design optimization losses to identify redundant computation. Finally, our method can identify an optimal SubNet through few-step gradient optimization and a simple post-processing procedure. We conduct extensive experiments on various diffusion models including the Stable Diffusion series and DiTs. Our DiP-GO approach achieves a 4.4× speedup for SD-1.5 without any loss of accuracy, significantly outperforming the previous state-of-the-art methods.

NeurIPS Conference 2024 Conference Paper

DiTFastAttn: Attention Compression for Diffusion Transformer Models

  • Zhihang Yuan
  • Hanling Zhang
  • Pu Lu
  • Xuefei Ning
  • Linfeng Zhang
  • Tianchen Zhao
  • Shengen Yan
  • Guohao Dai

Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) $\textit{Window Attention with Residual Sharing}$ to reduce spatial redundancy; (2) $\textit{Attention Sharing across Timesteps}$ to exploit the similarity between steps; (3) $\textit{Attention Sharing across CFG}$ to skip redundant computations during conditional generation.
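The "Attention Sharing across Timesteps" idea, deciding when a denoising step may reuse the cached attention output of its neighbor, can be sketched as a simple thresholding plan. This illustrates only the caching decision, not the DiTFastAttn implementation:

```python
def plan_attention_reuse(similarities, threshold):
    """Given similarities between the attention outputs of neighboring
    denoising steps, mark which steps can reuse the previously cached
    output. Step 0 always computes; step t reuses the cache iff
    similarities[t-1] >= threshold. Returns a per-step plan."""
    plan = ["compute"]
    for sim in similarities:
        plan.append("reuse" if sim >= threshold else "compute")
    return plan
```

Every "reuse" entry skips one full attention computation, which is where the method recovers the temporal redundancy identified in the abstract.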

AAAI Conference 2024 Conference Paper

Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning

  • Yan Fan
  • Yu Wang
  • Pengfei Zhu
  • Qinghua Hu

Continual learning (CL) has shown promising results and comparable performance to learning at once in a fully supervised manner. However, CL strategies typically require a large number of labeled samples, making their real-life deployment challenging. In this work, we focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories. We provide a comprehensive analysis of SSCL and demonstrate that unreliable distributions of unlabeled data lead to unstable training and refinement of the progressing stages. This problem severely impacts the performance of SSCL. To address the limitations, we propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning, which leverages both semantic and structural information to achieve more stable knowledge distillation on unlabeled data and exhibit robustness against distribution bias. Firstly, we formalize a general model of structural distillation and design a dynamic graph construction for the continual learning progress. Next, we define a structure distillation vector and design a dynamic sub-graph distillation algorithm, which enables end-to-end training and adaptability to scale up tasks. The entire proposed method is adaptable to various CL methods and supervision settings. Finally, experiments conducted on three datasets CIFAR10, CIFAR100, and ImageNet-100, with varying supervision ratios, demonstrate the effectiveness of our proposed approach in mitigating the catastrophic forgetting problem in semi-supervised continual learning scenarios. Our code is available: https://github.com/fanyan0411/DSGD.

AAAI Conference 2024 Conference Paper

EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection

  • Xin Mu
  • Yu Wang
  • Zhengan Huang
  • Junzuo Lai
  • Yehong Zhang
  • Hui Wang
  • Yue Yu

In the rapidly growing digital economy, protecting intellectual property (IP) associated with digital products has become increasingly important. Within this context, machine learning (ML) models, being highly valuable digital assets, have gained significant attention for IP protection. This paper introduces a practical encryption-based framework called EncryIP, which seamlessly integrates a public-key encryption scheme into the model learning process. This approach enables the protected model to generate randomized and confused labels, ensuring that only individuals with accurate secret keys, signifying authorized users, can decrypt and reveal authentic labels. Importantly, the proposed framework not only facilitates distributing the protected model to multiple authorized users without requiring repetitive training of the original ML model with IP protection methods, but also maintains the model's performance without compromising its accuracy. Compared to existing methods like watermark-based, trigger-based, and passport-based approaches, EncryIP demonstrates superior effectiveness in both training protected models and efficiently detecting the unauthorized spread of ML models.

AAAI Conference 2024 Conference Paper

Every Node Is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering

  • Pengfei Zhu
  • Qian Wang
  • Yu Wang
  • Jialu Li
  • Qinghua Hu

Attributed graph clustering is an unsupervised task that partitions nodes into different groups. Self-supervised learning (SSL) shows great potential in handling this task, and some recent studies simultaneously learn multiple SSL tasks to further boost performance. Currently, different SSL tasks are assigned the same set of weights for all graph nodes. However, we observe that some graph nodes whose neighbors are in different groups require significantly different emphases on SSL tasks. In this paper, we propose to dynamically learn the weights of SSL tasks for different nodes and fuse the embeddings learned from different SSL tasks to boost performance. We design an innovative graph clustering approach, namely Dynamically Fusing Self-Supervised Learning (DyFSS). Specifically, DyFSS fuses features extracted from diverse SSL tasks using distinct weights derived from a gating network. To effectively learn the gating network, we design a dual-level self-supervised strategy that incorporates pseudo labels and the graph structure. Extensive experiments on five datasets show that DyFSS outperforms the state-of-the-art multi-task SSL methods by up to 8.66% on the accuracy metric. The code of DyFSS is available at: https://github.com/q086/DyFSS.

AAAI Conference 2024 Conference Paper

Exploring Diverse Representations for Open Set Recognition

  • Yu Wang
  • Junxian Mu
  • Pengfei Zhu
  • Qinghua Hu

Open set recognition (OSR) requires the model to classify samples that belong to closed sets while rejecting unknown samples during test. Currently, generative models often perform better than discriminative models in OSR, but recent studies show that generative models may be computationally infeasible or unstable on complex tasks. In this paper, we provide insights into OSR and find that learning supplementary representations can theoretically reduce the open space risk. Based on the analysis, we propose a new model, namely Multi-Expert Diverse Attention Fusion (MEDAF), that learns diverse representations in a discriminative way. MEDAF consists of multiple experts that are learned with an attention diversity regularization term to ensure the attention maps are mutually different. The logits learned by each expert are adaptively fused and used to identify the unknowns through the score function. We show that the differences in attention maps can lead to diverse representations so that the fused representations can well handle the open space. Extensive experiments are conducted on standard and OSR large-scale benchmarks. Results show that the proposed discriminative method can outperform existing generative models by up to 9.5% on AUROC and achieve new state-of-the-art performance with little computational cost. Our method can also seamlessly integrate existing classification models. Code is available at https://github.com/Vanixxz/MEDAF.
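The attention diversity regularization term can be sketched as a mean pairwise cosine similarity over flattened attention maps. This is a plain-Python illustration; MEDAF's exact regularizer may differ in detail:

```python
import math

def cosine(u, v):
    """Cosine similarity of two flattened attention maps."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u)) or 1.0
    norm_v = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (norm_u * norm_v)

def attention_diversity_loss(attn_maps):
    """Mean pairwise cosine similarity between the experts' attention maps;
    minimizing this term pushes the experts to attend to mutually different
    regions, yielding the diverse representations the paper relies on."""
    n = len(attn_maps)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(attn_maps[i], attn_maps[j]) for i, j in pairs) / len(pairs)
```

The loss is 0 when the experts' maps are orthogonal (fully diverse) and 1 when they are identical, so adding it to the classification objective directly penalizes redundant experts.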

TMLR Journal 2024 Journal Article

Extreme Risk Mitigation in Reinforcement Learning using Extreme Value Theory

  • Karthik Somayaji NS
  • Yu Wang
  • Malachi Schram
  • Jan Drgona
  • Mahantesh M Halappanavar
  • Frank Liu
  • Peng Li

Risk-sensitive reinforcement learning (RL) has garnered significant attention in recent years due to the growing interest in deploying RL agents in real-world scenarios. A critical aspect of risk awareness involves modelling highly rare risk events (rewards) that could potentially lead to catastrophic outcomes. These infrequent occurrences present a formidable challenge for data-driven methods aiming to capture such risky events accurately. While risk-aware RL techniques do exist, they suffer from high variance estimation due to the inherent data scarcity. Our work proposes to enhance the resilience of RL agents when faced with very rare and risky events by focusing on refining the predictions of the extreme values predicted by the state-action value distribution. To achieve this, we formulate the extreme values of the state-action value function distribution as parameterized distributions, drawing inspiration from the principles of extreme value theory (EVT). We propose an extreme value theory based actor-critic approach, namely, Extreme Valued Actor-Critic (EVAC) which effectively addresses the issue of infrequent occurrence by leveraging EVT-based parameterization. Importantly, we theoretically demonstrate the advantages of employing these parameterized distributions in contrast to other risk-averse algorithms. Our evaluations show that the proposed method outperforms other risk averse RL algorithms on a diverse range of benchmark tasks, each encompassing distinct risk scenarios.

NeurIPS Conference 2024 Conference Paper

Geometry Awakening: Cross-Geometry Learning Exhibits Superiority over Individual Structures

  • Yadong Sun
  • Xiaofeng Cao
  • Yu Wang
  • Wei Ye
  • Jingcai Guo
  • Qing Guo

Recent research has underscored the efficacy of Graph Neural Networks (GNNs) in modeling diverse geometric structures within graph data. However, real-world graphs typically exhibit geometrically heterogeneous characteristics, rendering the confinement to a single geometric paradigm insufficient for capturing their intricate structural complexities. To address this limitation, we examine the performance of GNNs across various geometries through the lens of knowledge distillation (KD) and introduce a novel cross-geometric framework. This framework encodes graphs by integrating both Euclidean and hyperbolic geometries in a space-mixing fashion. Our approach employs multiple teacher models, each generating hint embeddings that encapsulate distinct geometric properties. We then implement a structure-wise knowledge transfer module that optimally leverages these embeddings within their respective geometric contexts, thereby enhancing the training efficacy of the student model. Additionally, our framework incorporates a geometric optimization network designed to bridge the distributional disparities among these embeddings. Experimental results demonstrate that our model-agnostic framework more effectively captures topological graph knowledge, resulting in superior performance of the student models when compared to traditional KD methodologies.

AAAI Conference 2024 Conference Paper

H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion

  • Yu Wang
  • Chao Tong

3D Semantic Scene Completion (SSC) has emerged as a novel task in vision-based holistic 3D scene understanding. Its objective is to densely predict the occupancy and category of each voxel in a 3D scene based on input from either LiDAR or images. Currently, many transformer-based semantic scene completion frameworks employ simple yet popular Cross-Attention and Self-Attention mechanisms to integrate and infer dense geometric and semantic information of voxels. However, they overlook the distinctions among voxels in the scene, especially in outdoor scenarios where the horizontal direction contains more variations. Furthermore, voxels located at object boundaries and within the interior of objects exhibit varying levels of positional significance. To address this issue, we propose a transformer-based SSC framework called H2GFormer that incorporates a horizontal-to-global approach. This framework takes into full consideration the variations of voxels in the horizontal direction and the characteristics of voxels on object boundaries. We introduce a horizontal window-to-global attention (W2G) module that effectively fuses semantic information by first diffusing it horizontally from reliably visible voxels and then propagating the semantic understanding to global voxels, ensuring a more reliable fusion of semantic-aware features. Moreover, an Internal-External Position Awareness Loss (IoE-PALoss) is utilized during network training to emphasize the critical positions within the transition regions between objects. The experiments conducted on the SemanticKITTI dataset demonstrate that H2GFormer exhibits superior performance in both geometric and semantic completion tasks. Our code is available on https://github.com/Ryanwy1/H2GFormer.

AAAI Conference 2024 Conference Paper

Knowledge Graph Prompting for Multi-Document Question Answering

  • Yu Wang
  • Nedim Lipka
  • Ryan A. Rossi
  • Alexa Siu
  • Ruiyi Zhang
  • Tyler Derr

The 'pre-train, prompt, predict' paradigm of large language models (LLMs) has achieved remarkable success in open-domain question answering (OD-QA). However, few works explore this paradigm in multi-document question answering (MD-QA), a task demanding a thorough understanding of the logical associations among the contents and structures of documents. To fill this crucial gap, we propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting LLMs for MD-QA, which consists of a graph construction module and a graph traversal module. For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables), and edges denoting the semantic/lexical similarity between passages or document structural relations. For graph traversal, we design an LLM-based graph traversal agent that navigates across nodes and gathers supporting passages assisting LLMs in MD-QA. The constructed graph serves as the global ruler that regulates the transitional space among passages and reduces retrieval latency. Concurrently, the graph traversal agent acts as a local navigator that gathers pertinent context to progressively approach the question and guarantee retrieval quality. Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design and retrieval augmented generation for LLMs. Our code: https://github.com/YuWVandy/KG-LLM-MDQA.

AAAI Conference 2024 Conference Paper

Leveraging Opposite Gender Interaction Ratio as a Path towards Fairness in Online Dating Recommendations Based on User Sexual Orientation

  • Yuying Zhao
  • Yu Wang
  • Yi Zhang
  • Pamela Wisniewski
  • Charu Aggarwal
  • Tyler Derr

Online dating platforms have gained widespread popularity as a means for individuals to seek potential romantic relationships. While recommender systems have been designed to improve the user experience in dating platforms by providing personalized recommendations, increasing concerns about fairness have encouraged the development of fairness-aware recommender systems from various perspectives (e.g., gender and race). However, sexual orientation, which plays a significant role in finding a satisfying relationship, is under-investigated. To fill this crucial gap, we propose a novel metric, Opposite Gender Interaction Ratio (OGIR), as a way to investigate potential unfairness for users with varying preferences towards the opposite gender. We empirically analyze a real online dating dataset and observe existing recommender algorithms could suffer from group unfairness according to OGIR. We further investigate the potential causes for such gaps in recommendation quality, which lead to the challenges of group quantity imbalance and group calibration imbalance. Ultimately, we propose a fair recommender system based on re-weighting and re-ranking strategies to respectively mitigate these associated imbalance challenges. Experimental results demonstrate both strategies improve fairness while their combination achieves the best performance towards maintaining model utility while improving fairness.

ICRA Conference 2024 Conference Paper

LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception

  • Zixiang Zhou
  • Dongqiangzi Ye
  • Weijia Chen
  • Yufei Xie
  • Yu Wang
  • Panqu Wang
  • Hassan Foroosh

There is a recent need in the LiDAR perception field for unifying multiple tasks in a single strong network with improved performance, as opposed to using separate networks for each task. In this paper, we introduce a new LiDAR multi-task learning paradigm based on the transformer. The proposed LiDARFormer utilizes cross-space global contextual feature information and exploits cross-task synergy to boost the performance of LiDAR perception tasks across multiple large-scale datasets and benchmarks. Our novel transformer-based framework includes a cross-space transformer module that learns attentive features between the 2D dense Bird’s Eye View (BEV) and 3D sparse voxel feature maps. Additionally, we propose a transformer decoder for the segmentation task to dynamically adjust the learned features by leveraging the categorical feature representations. Furthermore, we combine the segmentation and detection features in a shared transformer decoder with cross-task attention layers to enhance and integrate the object-level and class-level features. LiDARFormer is evaluated on the large-scale nuScenes and the Waymo Open datasets for both 3D detection and semantic segmentation tasks, and it achieves state-of-the-art performance on both tasks.

AAMAS Conference 2024 Conference Paper

LLM-Powered Hierarchical Language Agent for Real-time Human-AI Coordination

  • Jijia Liu
  • Chao Yu
  • Jiaxuan Gao
  • Yuqing Xie
  • Qingmin Liao
  • Yi Wu
  • Yu Wang

AI agents powered by Large Language Models (LLMs) have made significant advances, enabling them to assist humans in diverse complex tasks and leading to a revolution in human-AI coordination. LLM-powered agents typically require invoking LLM APIs and employing artificially designed complex prompts, which results in high inference latency. While this paradigm works well in scenarios with minimal interactive demands, such as code generation, it is unsuitable for highly interactive and real-time applications, such as gaming. Traditional gaming AI often employs small models or reactive policies, enabling fast inference but offering limited task completion and interaction abilities. In this work, we consider Overcooked as our testbed where players could communicate with natural language and cooperate to serve orders. We propose a Hierarchical Language Agent (HLA) for human-AI coordination that provides both strong reasoning abilities while keeping real-time execution. In particular, HLA adopts a hierarchical framework and comprises three modules: a proficient LLM, referred to as Slow Mind, for intention reasoning and language interaction, a lightweight LLM, referred to as Fast Mind, for generating macro actions, and a reactive policy, referred to as Executor, for transforming macro actions into atomic actions. Human studies show that HLA outperforms other baseline agents, including slow-mind-only agents and fast-mind-only agents, with stronger cooperation abilities, faster responses, and more consistent language communications.

ECAI Conference 2024 Conference Paper

Model Provenance via Model DNA

  • Xin Mu
  • Yu Wang
  • Yehong Zhang
  • Jiaqi Zhang
  • Hui Wang
  • Yang Xiang
  • Yue Yu

Understanding the life cycle of the machine learning (ML) model is an intriguing area of research (e.g., understanding where the model comes from, how it is trained, and how it is used). Our focus is on a novel problem within this domain, namely Model Provenance (MP). MP concerns the relationship between a target model and its pre-training model and aims to determine whether a source model serves as the provenance for a target model. In this paper, we formulate this new challenge as a learning problem, supplementing our exploration with empirical discussions on its connections to existing works. Following that, we introduce “Model DNA”, an interesting concept encoding the model’s training data and input-output information to create a compact machine-learning model representation. Capitalizing on this model DNA, we establish an efficient framework consisting of three key components: DNA generation, DNA similarity loss, and a provenance classifier, aimed at identifying model provenance. We conduct evaluations on both computer vision and natural language processing tasks using various models, datasets, and scenarios to demonstrate the effectiveness of our approach.

AAAI Conference 2024 Conference Paper

Open-Set Graph Domain Adaptation via Separate Domain Alignment

  • Yu Wang
  • Ronghang Zhu
  • Pengsheng Ji
  • Sheng Li

Domain adaptation has become an attractive learning paradigm, as it can leverage source domains with rich labels to deal with classification tasks in an unlabeled target domain. A few recent studies develop domain adaptation approaches for graph-structured data. In the case of node classification task, current domain adaptation methods only focus on the closed-set setting, where source and target domains share the same label space. A more practical assumption is that the target domain may contain new classes that are not included in the source domain. Therefore, in this paper, we introduce a novel and challenging problem for graphs, i.e., open-set domain adaptive node classification, and propose a new approach to solve it. Specifically, we develop an algorithm for efficient knowledge transfer from a labeled source graph to an unlabeled target graph under a separate domain alignment (SDA) strategy, in order to learn discriminative feature representations for the target graph. Our goal is to not only correctly classify target nodes into the known classes, but also classify unseen types of nodes into an unknown class. Experimental results on real-world datasets show that our method outperforms existing methods on graph domain adaptation.

NeurIPS Conference 2024 Conference Paper

Persistence Homology Distillation for Semi-supervised Continual Learning

  • Yan Fan
  • Yu Wang
  • Pengfei Zhu
  • Dongyue Chen
  • Qinghua Hu

Semi-supervised continual learning (SSCL) has attracted significant attention for addressing catastrophic forgetting in semi-supervised data. Knowledge distillation, which leverages data representation and pair-wise similarity, has shown significant potential in preserving information in SSCL. However, traditional distillation strategies often fail in unlabeled data with inaccurate or noisy information, limiting their efficiency in feature spaces undergoing substantial changes during continual learning. To address these limitations, we propose Persistence Homology Distillation (PsHD) to preserve intrinsic structural information that is insensitive to noise in semi-supervised continual learning. First, we capture the structural features using persistence homology by homological evolution across different scales in vision data, where the multi-scale characteristic established its stability under noise interference. Next, we propose a persistence homology distillation loss in SSCL and design an acceleration algorithm to reduce the computational cost of persistence homology in our module. Furthermore, we demonstrate the superior stability of PsHD compared to sample representation and pair-wise similarity distillation methods theoretically and experimentally. Finally, experimental results on three widely used datasets validate that the new PsHD outperforms state-of-the-art with 3.9% improvements on average, and also achieves 1.5% improvements while reducing 60% memory buffer size, highlighting the potential of utilizing unlabeled data in SSCL. Our code is available: https://github.com/fanyan0411/PsHD.

NeurIPS Conference 2024 Conference Paper

Rad-NeRF: Ray-decoupled Training of Neural Radiance Field

  • Lidong Guo
  • Xuefei Ning
  • Yonggan Fu
  • Tianchen Zhao
  • Zhuoliang Kang
  • Jincheng Yu
  • Yingyan (Celine) Lin
  • Yu Wang

Although the neural radiance field (NeRF) exhibits high-fidelity visualization on the rendering task, it still suffers from rendering defects, especially in complex scenes. In this paper, we delve into the reason for the unsatisfactory performance and conjecture that it comes from interference in the training process. Due to occlusions in complex scenes, a 3D point may be invisible to some rays. On such a point, training with those rays that do not contain valid information about the point might interfere with the NeRF training. Based on the above intuition, we decouple the training process of NeRF in the ray dimension softly and propose a Ray-decoupled Training Framework for neural rendering (Rad-NeRF). Specifically, we construct an ensemble of sub-NeRFs and train a soft gate module to assign the gating scores to these sub-NeRFs based on specific rays. The gate module is jointly optimized with the sub-NeRF ensemble to learn the preference of sub-NeRFs for different rays automatically. Furthermore, we introduce depth-based mutual learning to enhance the rendering consistency among multiple sub-NeRFs and mitigate the depth ambiguity. Experiments on five datasets demonstrate that Rad-NeRF can enhance the rendering performance across a wide range of scene types compared with existing single-NeRF and multi-NeRF methods. With only 0.2% extra parameters, Rad-NeRF improves rendering performance by up to 1.5dB. Code is available at https://github.com/thu-nics/Rad-NeRF.

AAAI Conference 2024 Conference Paper

Semi-supervised Learning of Dynamical Systems with Neural Ordinary Differential Equations: A Teacher-Student Model Approach

  • Yu Wang
  • Yuxuan Yin
  • Karthik Somayaji NS
  • Ján Drgoňa
  • Malachi Schram
  • Mahantesh Halappanavar
  • Frank Liu
  • Peng Li

Modeling dynamical systems is crucial for a wide range of tasks, but it remains challenging due to complex nonlinear dynamics, limited observations, or lack of prior knowledge. Recently, data-driven approaches such as Neural Ordinary Differential Equations (NODE) have shown promising results by leveraging the expressive power of neural networks to model unknown dynamics. However, these approaches often suffer from limited labeled training data, leading to poor generalization and suboptimal predictions. On the other hand, semi-supervised algorithms can utilize abundant unlabeled data and have demonstrated good performance in classification and regression tasks. We propose TS-NODE, the first semi-supervised approach to modeling dynamical systems with NODE. TS-NODE explores cheaply generated synthetic pseudo rollouts to broaden exploration in the state space and to tackle the challenges brought by lack of ground-truth system data under a teacher-student model. TS-NODE employs a unified optimization framework that corrects the teacher model based on the student's feedback while mitigating the potential false system dynamics present in pseudo rollouts. TS-NODE demonstrates significant performance improvements over a baseline Neural ODE model on multiple dynamical system modeling tasks.

AAAI Conference 2024 Conference Paper

Spatial-Logic-Aware Weakly Supervised Learning for Flood Mapping on Earth Imagery

  • Zelin Xu
  • Tingsong Xiao
  • Wenchong He
  • Yu Wang
  • Zhe Jiang
  • Shigang Chen
  • Yiqun Xie
  • Xiaowei Jia

Flood mapping on Earth imagery is crucial for disaster management, but its efficacy is hampered by the lack of high-quality training labels. Given high-resolution Earth imagery with coarse and noisy training labels, a base deep neural network model, and a spatial knowledge base with label constraints, our problem is to infer the true high-resolution labels while training neural network parameters. Traditional methods are largely based on specific physical properties and thus fall short of capturing the rich domain constraints expressed by symbolic logic. Neural-symbolic models can capture rich domain knowledge, but existing methods do not address the unique spatial challenges inherent in flood mapping on high-resolution imagery. To fill this gap, we propose a spatial-logic-aware weakly supervised learning framework. Our framework integrates symbolic spatial logic inference into probabilistic learning in a weakly supervised setting. To reduce the time costs of logic inference on vast high-resolution pixels, we propose a multi-resolution spatial reasoning algorithm to infer true labels while training neural network parameters. Evaluations of real-world flood datasets show that our model outperforms several baselines in prediction accuracy. The code is available at https://github.com/spatialdatasciencegroup/SLWSL.

NeurIPS Conference 2024 Conference Paper

SubgDiff: A Subgraph Diffusion Model to Improve Molecular Representation Learning

  • Jiying Zhang
  • Zijing Liu
  • Yu Wang
  • Bin Feng
  • Yu Li

Molecular representation learning has shown great success in advancing AI-based drug discovery. A key insight of many recent works is that the 3D geometric structure of molecules provides essential information about their physicochemical properties. Recently, denoising diffusion probabilistic models have achieved impressive performance in molecular 3D conformation generation. However, most existing molecular diffusion models treat each atom as an independent entity, overlooking the dependency among atoms within the substructures. This paper introduces a novel approach that enhances molecular representation learning by incorporating substructural information in the diffusion model framework. We propose a novel diffusion model termed SubgDiff for involving the molecular subgraph information in diffusion. Specifically, SubgDiff adopts three vital techniques: i) subgraph prediction, ii) expectation state, and iii) k-step same subgraph diffusion, to enhance the perception of molecular substructure in the denoising network. Experiments on extensive downstream tasks, especially the molecular force predictions, demonstrate the superior performance of our approach.

AAAI Conference 2024 Conference Paper

SuperJunction: Learning-Based Junction Detection for Retinal Image Registration

  • Yu Wang
  • Xiaoye Wang
  • Zaiwang Gu
  • Weide Liu
  • Wee Siong Ng
  • Weimin Huang
  • Jun Cheng

Keypoints-based approaches have been shown to be promising for retinal image registration, superimposing two or more images from different views based on keypoint detection and description. However, existing approaches suffer from ineffective keypoint detector and descriptor training. Meanwhile, the non-linear mapping from 3D retinal structure to 2D images is often neglected. In this paper, we propose a novel learning-based junction detection approach for retinal image registration, which enhances both the keypoint detector and descriptor training. To improve the keypoint detection, it uses a multi-task vessel detection to regularize the model training, which helps to learn more representative features and reduce the risk of over-fitting. To achieve effective training for keypoints description, a new constrained negative sampling approach is proposed to compute the descriptor loss. Moreover, we also consider the non-linearity between retinal images from different views during matching. Experimental results on the FIRE dataset show that our method achieves a mean area under curve of 0.850, which is 12.6% higher than the 0.755 achieved by the state-of-the-art method. All the codes are available at https://github.com/samjcheng/SuperJunction.

NeurIPS Conference 2024 Conference Paper

TAIA: Large Language Models are Out-of-Distribution Data Learners

  • Shuyang Jiang
  • Yusheng Liao
  • Ya Zhang
  • Yanfeng Wang
  • Yu Wang

Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA). We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data. Code is available at https://github.com/pixas/TAIA_LLM.

AAAI Conference 2024 Conference Paper

UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer

  • Ji Liu
  • Dehua Tang
  • Yuanxian Huang
  • Li Zhang
  • Xiaocheng Zeng
  • Dong Li
  • Mingjie Lu
  • Jinzhang Peng

Traditional channel-wise pruning methods by reducing network channels struggle to effectively prune efficient CNN models with depth-wise convolutional layers and certain efficient modules, such as popular inverted residual blocks. Prior depth pruning methods by reducing network depths are not suitable for pruning some efficient models due to the existence of some normalization layers. Moreover, finetuning subnet with directly removing activation layers would corrupt the original model weights, hindering the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models. Our approach proposes a novel block pruning strategy and progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. We obtained three pruned ConvNeXtV1 models with our method applying on ConvNeXtV1, which surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model.

AAAI Conference 2024 Conference Paper

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

  • Kun Su
  • Judith Yue Li
  • Qingqing Huang
  • Dima Kuzmin
  • Joonseok Lee
  • Chris Donahue
  • Fei Sha
  • Aren Jansen

Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.

NeurIPS Conference 2024 Conference Paper

WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

  • Yunchao Liu
  • Ha Dong
  • Xin Wang
  • Rocco Moretti
  • Yu Wang
  • Zhaoqian Su
  • Jiawei Gu
  • Bobby Bodenheimer

While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate. Specifically, our contributions are threefold: WelQrate dataset collection - we introduce a meticulously curated collection of 9 datasets spanning 5 therapeutic target classes. Our hierarchical curation pipelines, designed by drug discovery experts, go beyond the primary high-throughput screen by leveraging additional confirmatory and counter screens along with rigorous domain-driven preprocessing, such as Pan-Assay Interference Compounds (PAINS) filtering, to ensure the high-quality data in the datasets; WelQrate Evaluation Framework - we propose a standardized model evaluation framework considering high-quality datasets, featurization, 3D conformation generation, evaluation metrics, and data splits, which provides a reliable benchmarking for drug discovery experts conducting real-world virtual screening; Benchmarking - we evaluate model performance through various research questions using the WelQrate dataset collection, exploring the effects of different models, dataset quality, featurization methods, and data splitting strategies on the results. In summary, we recommend adopting our proposed WelQrate as the gold standard in small molecule drug discovery benchmarking. The WelQrate dataset collection, along with the curation codes, and experimental scripts are all publicly available at www.WelQrate.org.

NeurIPS Conference 2024 Conference Paper

What Matters in Graph Class Incremental Learning? An Information Preservation Perspective

  • Jialu Li
  • Yu Wang
  • Pengfei Zhu
  • Wanyu Lin
  • Qinghua Hu

Graph class incremental learning (GCIL) requires the model to classify emerging nodes of new classes while remembering old classes. Existing methods are designed to preserve effective information of old models or graph data to alleviate forgetting, but there is no clear theoretical understanding of what matters in information preservation. In this paper, we consider that present practice suffers from high semantic and structural shifts assessed by two devised shift metrics. We provide insights into information preservation in GCIL and find that maintaining graph information can preserve information of old models in theory to calibrate node semantic and graph structure shifts. We correspond graph information into low-frequency local-global information and high-frequency information in spatial domain. Based on the analysis, we propose a framework, Graph Spatial Information Preservation (GSIP). Specifically, for low-frequency information preservation, the old node representations obtained by inputting replayed nodes into the old model are aligned with the outputs of the node and its neighbors in the new model, and then old and new outputs are globally matched after pooling. For high-frequency information preservation, the new node representations are encouraged to imitate the near-neighbor pair similarity of old node representations. GSIP achieves a 10% increase in terms of the forgetting metric compared to prior methods on large-scale datasets. Our framework can also seamlessly integrate existing replay designs. The code is available through https://github.com/Jillian555/GSIP.

ECAI Conference 2023 Conference Paper

A Convolutional Neural Network Approach to General Game Playing

  • Yu Wang
  • Heng Zhang 0006
  • Guifei Jiang

General Game Playing (GGP), a research field aimed at developing agents that master different games in a unified way, is regarded as a necessary step towards creating artificial general intelligence. With the success of deep reinforcement learning (DRL) in games like Go, chess, and shogi, it has been recently introduced to GGP and is regarded as a promising technique to achieve the goal of GGP. However, the current work uses fully connected neural networks and is thus unable to efficiently exploit the topological structure of game states. In this paper, we propose an approach to applying general-purpose convolutional neural networks to GGP and implement a DRL-based GGP player. Experiments indicate that the built player not only outperforms the previous algorithm and UCT benchmark in a variety of games but also requires less training time.

AAMAS Conference 2023 Conference Paper

Asynchronous Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-Robot Cooperative Exploration

  • Chao Yu
  • Xinyi Yang
  • Jiaxuan Gao
  • Jiayu Chen
  • Yunfei Li
  • Jijia Liu
  • Yunfei Xiang
  • Ruixin Huang

We consider the problem of cooperative exploration where multiple robots need to cooperatively explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming all the agents are acting in a fully synchronous manner: i.e., every single agent produces an action simultaneously and every single action is executed instantaneously at each time step. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. It can be typical that different robots may take slightly different wall-clock times to accomplish an atomic action or even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization to enforce the learned policy to generalize better to varying action delays in the real world. Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient intra-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces over 10% actual exploration time compared with classical approaches. We also apply our framework to a high-fidelity visual-based environment, Habitat, achieving 28% improvement in exploration efficiency.

AAAI Conference 2023 Conference Paper

AutoNF: Automated Architecture Optimization of Normalizing Flows with Unconstrained Continuous Relaxation Admitting Optimal Discrete Solution

  • Yu Wang
  • Ján Drgoňa
  • Jiaxin Zhang
  • Karthik Somayaji Nanjangud Suryanarayana
  • Malachi Schram
  • Frank Liu
  • Peng Li

Normalizing flows (NF) build upon invertible neural networks and have wide applications in probabilistic modeling. Currently, building a powerful yet computationally efficient flow model relies on empirical fine-tuning over a large design space. While introducing neural architecture search (NAS) to NF is desirable, the invertibility constraint of NF brings new challenges to existing NAS methods whose application is limited to unstructured neural networks. Developing efficient NAS methods specifically for NF remains an open problem. We present AutoNF, the first automated NF architectural optimization framework. First, we present a new mixture distribution formulation that allows efficient differentiable architecture search of flow models without violating the invertibility constraint. Second, under the new formulation, we convert the original NP-hard combinatorial NF architectural optimization problem to an unconstrained continuous relaxation admitting the discrete optimal architectural solution, circumventing the loss of optimality due to binarization in architectural optimization. We evaluate AutoNF with various density estimation datasets and show its superior performance-cost trade-offs over a set of existing hand-crafted baselines.

ICML Conference 2023 Conference Paper

Boosting Graph Contrastive Learning via Graph Contrastive Saliency

  • Chunyu Wei
  • Yu Wang
  • Bing Bai
  • Kai Ni
  • David Brady
  • Lu Fang

Graph augmentation plays a crucial role in achieving good generalization for contrastive graph self-supervised learning. However, mainstream Graph Contrastive Learning (GCL) often favors random graph augmentations, relying on random node dropout or edge perturbation on graphs. Random augmentations may inevitably lead to semantic information corruption during training, and force the network to mistakenly focus on semantically irrelevant environmental background structures. To address these limitations and to improve generalization, we propose a novel self-supervised learning framework for GCL, which can adaptively screen the semantically related substructures in graphs by capitalizing on the proposed gradient-based Graph Contrastive Saliency (GCS). The goal is to identify the most semantically discriminative structures of a graph via contrastive learning, such that we can generate semantically meaningful augmentations by leveraging saliency. Empirical evidence on 16 benchmark datasets demonstrates the exclusive merits of the GCS-based framework. We also provide rigorous theoretical justification for GCS's robustness properties. Code is available at https://github.com/GCS2023/GCS.

NeurIPS Conference 2023 Conference Paper

Discover and Align Taxonomic Context Priors for Open-world Semi-Supervised Learning

  • Yu Wang
  • Zhun Zhong
  • Pengchong Qiao
  • Xuxin Cheng
  • Xiawu Zheng
  • Chang Liu
  • Nicu Sebe
  • Rongrong Ji

Open-world Semi-Supervised Learning (OSSL) is a realistic and challenging task, aiming to classify unlabeled samples from both seen and novel classes using partially labeled samples from the seen classes. Previous works typically explore the relationship of samples as priors on the pre-defined single-granularity labels to help novel class recognition. In fact, classes follow a taxonomy and samples can be classified at multiple levels of granularity, which contains more underlying relationships for supervision. We thus argue that learning with single-granularity labels results in sub-optimal representation learning and inaccurate pseudo labels, especially with unknown classes. In this paper, we take the initiative to explore and propose a unified framework, called Taxonomic context prIors Discovering and Aligning (TIDA), which exploits the relationship of samples at various granularities. It allows us to discover multi-granularity semantic concepts as taxonomic context priors (i.e., sub-class, target-class, and super-class), and then collaboratively leverage them to enhance representation learning and improve the quality of pseudo labels. Specifically, TIDA comprises two components: i) A taxonomic context discovery module that constructs a set of hierarchical prototypes in the latent space to discover the underlying taxonomic context priors; ii) A taxonomic context-based prediction alignment module that enforces consistency across hierarchical predictions to build reliable relationships between classes across various granularities and provide additional supervision. We demonstrate that these two components are mutually beneficial for an effective OSSL framework, which is theoretically explained from the perspective of the EM algorithm. Extensive experiments on seven commonly used datasets show that TIDA can significantly improve the performance and achieve a new state of the art. The source codes are publicly available at https://github.com/rain305f/TIDA.

AAAI Conference 2023 Conference Paper

Dynamic Ensemble of Low-Fidelity Experts: Mitigating NAS “Cold-Start”

  • Junbo Zhao
  • Xuefei Ning
  • Enshu Liu
  • Binxin Ru
  • Zixuan Zhou
  • Tianchen Zhao
  • Chen Chen
  • Jiajin Zhang

Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe "cold-start" problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. For example, our methods can improve the Kendall's Tau correlation coefficient between actual performance and predicted scores from 0.2549 to 0.7064 with only 25 actual architecture-performance data on NDS-ResNet. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
Our method will be implemented in Mindspore (Huawei 2020), and the example code is published at https://github.com/A-LinCui/DELE.
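The second step of the framework, a gating network that mixes low-fidelity experts with input-conditioned weights, reduces to a softmax-weighted sum at inference time. A dependency-free sketch (names and shapes are illustrative, not from the released code):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_prediction(expert_scores, gating_logits):
    """Weighted sum of per-expert performance predictions.

    `gating_logits` would be produced by a small network conditioned on
    the encoded input architecture; here they are simply given.
    """
    weights = softmax(gating_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))

# Two experts disagree; uniform gating averages their predictions.
pred = ensemble_prediction([0.9, 0.5], [0.0, 0.0])
```

Because the weights depend on the architecture being scored, different architectures can lean on different low-fidelity experts, which is what makes the ensemble "dynamic".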

AAAI Conference 2023 Conference Paper

Ensemble-in-One: Ensemble Learning within Random Gated Networks for Enhanced Adversarial Robustness

  • Yi Cai
  • Xuefei Ning
  • Huazhong Yang
  • Yu Wang

Adversarial attacks have threatened modern deep learning systems by crafting adversarial examples with small perturbations to fool the convolutional neural networks (CNNs). To alleviate that, ensemble training methods are proposed to facilitate better adversarial robustness by diversifying the vulnerabilities among the sub-models, simultaneously maintaining comparable natural accuracy as standard training. Previous practices also demonstrate that enlarging the ensemble can improve the robustness. However, conventional ensemble methods have poor scalability, owing to the rapidly increasing complexity when containing more sub-models in the ensemble. Moreover, it is usually infeasible to train or deploy an ensemble with substantial sub-models, owing to the tight hardware resource budget and latency requirement. In this work, we propose Ensemble-in-One (EIO), a simple but effective method to efficiently enlarge the ensemble with a random gated network (RGN). EIO augments a candidate model by replacing the parametrized layers with multi-path random gated blocks (RGBs) to construct an RGN. The scalability is significantly boosted because the number of paths increases exponentially with the RGN depth. Then by learning from the vulnerabilities of numerous other paths within the RGN, every path obtains better adversarial robustness. Our experiments demonstrate that EIO consistently outperforms previous ensemble training methods with smaller computational overheads, simultaneously achieving better accuracy-robustness trade-offs than adversarial training methods under black-box transfer attacks. Code is available at https://github.com/cai-y13/Ensemble-in-One.git
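The scalability claim, paths growing exponentially with depth, is easy to make concrete: if every parametrized layer is replaced by a gated block with k candidate branches, a depth-d network contains k**d distinct paths, and training samples one branch per block at a time. A toy sketch with hypothetical names:

```python
import random

def sample_path(depth, branches_per_block, rng):
    """Pick one branch per gated block, defining one sub-model of the RGN."""
    return [rng.randrange(branches_per_block) for _ in range(depth)]

def num_paths(depth, branches_per_block):
    """Total number of distinct sub-models realized by the gated network."""
    return branches_per_block ** depth

# With just 2 branches per block, 20 blocks already realize >1M paths.
path = sample_path(20, 2, random.Random(0))
```

This is why the ensemble can be "enlarged" far beyond what explicitly instantiating and training separate sub-models would allow.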

AAAI Conference 2023 Conference Paper

Fairness and Explainability: Bridging the Gap towards Fair Model Explanations

  • Yuying Zhao
  • Yu Wang
  • Tyler Derr

While machine learning models have achieved unprecedented success in real-world applications, they might make biased/unfair decisions for specific demographic groups and hence result in discriminative outcomes. Although research efforts have been devoted to measuring and mitigating bias, they mainly study bias from the result-oriented perspective while neglecting the bias encoded in the decision-making procedure. This results in their inability to capture procedure-oriented bias, which therefore limits the ability to develop a fully debiased method. Fortunately, with the rapid development of explainable machine learning, explanations for predictions are now available to gain insights into the procedure. In this work, we bridge the gap between fairness and explainability by presenting a novel perspective of procedure-oriented fairness based on explanations. We identify the procedure-based bias by measuring the gap of explanation quality between different groups with Ratio-based and Value-based Explanation Fairness. The new metrics further motivate us to design an optimization objective to mitigate the procedure-based bias, where we observe that it will also mitigate bias from the prediction. Based on our designed optimization objective, we propose a Comprehensive Fairness Algorithm (CFA), which simultaneously fulfills multiple objectives - improving traditional fairness, satisfying explanation fairness, and maintaining the utility performance. Extensive experiments on real-world datasets demonstrate the effectiveness of our proposed CFA and highlight the importance of considering fairness from the explainability perspective. Our code: https://github.com/YuyingZhao/FairExplanations-CFA.

AAMAS Conference 2023 Conference Paper

Fictitious Cross-Play: Learning Global Nash Equilibrium in Mixed Cooperative-Competitive Games

  • Zelai Xu
  • Yancheng Liang
  • Chao Yu
  • Yu Wang
  • Yi Wu

Self-play (SP) is a popular multi-agent reinforcement learning (MARL) framework for solving competitive games, where each agent optimizes policy by treating others as part of the environment. Despite the empirical successes, the theoretical properties of SP-based methods are limited to two-player zero-sum games. However, for mixed cooperative-competitive games where agents on the same team need to cooperate with each other, we can show a simple counterexample where SP-based methods cannot converge to a global Nash equilibrium (NE) with high probability. Alternatively, Policy-Space Response Oracles (PSRO) is an iterative framework for learning NE, where the best responses w.r.t. previous policies are learned in each iteration. PSRO can be directly extended to mixed cooperative-competitive settings by jointly learning team best responses with all convergence properties unchanged. However, PSRO requires repeatedly training joint policies from scratch till convergence, which makes it hard to scale to complex games. In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits from both frameworks. FXP simultaneously trains an SP-based main policy and a counter population of best response policies. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the best responses to the main policy's past versions. We validate our method in matrix games and show that FXP converges to global NEs while SP methods fail. We also conduct experiments in a gridworld domain, where FXP achieves higher Elo ratings and lower exploitabilities than baselines, and a more challenging football game, where FXP defeats SOTA models with over 94% win rate.

AAAI Conference 2023 Conference Paper

Interpretable Chirality-Aware Graph Neural Network for Quantitative Structure Activity Relationship Modeling in Drug Discovery

  • Yunchao (Lance) Liu
  • Yu Wang
  • Oanh Vu
  • Rocco Moretti
  • Bobby Bodenheimer
  • Jens Meiler
  • Tyler Derr

In computer-aided drug discovery, quantitative structure activity relation models are trained to predict biological activity from chemical structure. Despite the recent success of applying graph neural networks to this task, important chemical information such as molecular chirality is ignored. To fill this crucial gap, we propose Molecular-Kernel Graph Neural Network (MolKGNN) for molecular representation learning, which features SE(3)-/conformation invariance, chirality-awareness, and interpretability. For our MolKGNN, we first design a molecular graph convolution to capture the chemical pattern by comparing the atom's similarity with the learnable molecular kernels. Furthermore, we propagate the similarity score to capture the higher-order chemical pattern. To assess the method, we conduct a comprehensive evaluation with nine well-curated datasets spanning numerous important drug targets that feature realistic high class imbalance, demonstrating the superiority of MolKGNN over other graph neural networks in computer-aided drug discovery. Meanwhile, the learned kernels identify patterns that agree with domain knowledge, confirming the pragmatic interpretability of this approach. Our code and supplementary material are publicly available at https://github.com/meilerlab/MolKGNN.

AAMAS Conference 2023 Conference Paper

Learning Graph-Enhanced Commander-Executor for Multi-Agent Navigation

  • Xinyi Yang
  • Shiyu Huang
  • Yiwen Sun
  • Yuxiang Yang
  • Chao Yu
  • Wei-Wei Tu
  • Huazhong Yang
  • Yu Wang

This paper investigates the multi-agent navigation problem, which requires multiple agents to reach the target goals in a limited time. Multi-agent reinforcement learning (MARL) has shown promising results for solving this issue. However, it is inefficient for MARL to directly explore the (nearly) optimal policy in the large search space, which is exacerbated as the agent number increases (e.g., 10+ agents) or the environment is more complex (e.g., 3D simulator). Goal-conditioned hierarchical reinforcement learning (HRL) provides a promising direction to tackle this challenge by introducing a hierarchical structure to decompose the search space, where the low-level policy predicts primitive actions in the guidance of the goals derived from the high-level policy. In this paper, we propose Multi-Agent Graph-Enhanced Commander-EXecutor (MAGE-X), a graph-based goal-conditioned hierarchical method for multi-agent navigation tasks. MAGE-X comprises a high-level Goal Commander and a low-level Action Executor. The Goal Commander predicts the probability distribution of the goals and leverages them to assign the most appropriate final target to each agent. The Action Executor utilizes graph neural networks (GNN) to construct a subgraph for each agent that only contains its crucial partners to improve cooperation. Additionally, the Goal Encoder in the Action Executor captures the relationship between the agent and the designated goal to encourage the agent to reach the final target. The results show that MAGE-X outperforms the state-of-the-art MARL baselines with a 100% success rate with only 3 million training steps in multi-agent particle environments (MPE) with 50 agents, and at least a 12% higher success rate and 2× higher data efficiency in a more complicated quadrotor 3D navigation task.

AAAI Conference 2023 Conference Paper

LidarMultiNet: Towards a Unified Multi-Task Network for LiDAR Perception

  • Dongqiangzi Ye
  • Zixiang Zhou
  • Weijia Chen
  • Yufei Xie
  • Yu Wang
  • Panqu Wang
  • Hassan Foroosh

LiDAR-based 3D object detection, semantic segmentation, and panoptic segmentation are usually implemented in specialized networks with distinctive architectures that are difficult to adapt to each other. This paper presents LidarMultiNet, a LiDAR-based multi-task network that unifies these three major LiDAR perception tasks. Among its many benefits, a multi-task network can reduce the overall cost by sharing weights and computation among multiple tasks. However, it typically underperforms compared to independently combined single-task models. The proposed LidarMultiNet aims to bridge the performance gap between the multi-task network and multiple single-task networks. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder architecture with a Global Context Pooling (GCP) module extracting global contextual features from a LiDAR frame. Task-specific heads are added on top of the network to perform the three LiDAR perception tasks. More tasks can be implemented simply by adding new task-specific heads while introducing little additional cost. A second stage is also proposed to refine the first-stage segmentation and generate accurate panoptic segmentation results. LidarMultiNet is extensively tested on both Waymo Open Dataset and nuScenes dataset, demonstrating for the first time that major LiDAR perception tasks can be unified in a single strong network that is trained end-to-end and achieves state-of-the-art performance. Notably, LidarMultiNet reaches the official 1st place in the Waymo Open Dataset 3D semantic segmentation challenge 2022 with the highest mIoU and the best accuracy for most of the 22 classes on the test set, using only LiDAR points as input. It also sets the new state-of-the-art for a single model on the Waymo 3D object detection benchmark and three nuScenes benchmarks.

AAAI Conference 2023 Conference Paper

Memory-Oriented Structural Pruning for Efficient Image Restoration

  • Xiangsheng Shi
  • Xuefei Ning
  • Lidong Guo
  • Tianchen Zhao
  • Enshu Liu
  • Yi Cai
  • Yuhan Dong
  • Huazhong Yang

Deep learning (DL) based methods have significantly pushed forward the state-of-the-art for the image restoration (IR) task. Nevertheless, DL-based IR models are highly computation- and memory-intensive. The surging demands for processing higher-resolution images and multi-task parallelism in practical mobile usage further add to their computation and memory burdens. In this paper, we reveal the overlooked memory redundancy of the IR models and propose a Memory-Oriented Structural Pruning (MOSP) method. To properly compress the long-range skip connections (a major source of the memory burden), we introduce a compactor module onto each skip connection to decouple the pruning of the skip connections and the main branch. MOSP progressively prunes the original model layers and the compactors to cut down the peak memory while maintaining high IR quality. Experiments on real image denoising, image super-resolution and low-light image enhancement show that MOSP can yield models with higher memory efficiency while better preserving performance compared with baseline pruning methods.

AAAI Conference 2023 Conference Paper

Online Semi-supervised Learning with Mix-Typed Streaming Features

  • Di Wu
  • Shengda Zhuo
  • Yu Wang
  • Zhong Chen
  • Yi He

Online learning with feature spaces that are not fixed but can vary over time offers a flexible learning paradigm and has thus drawn much attention. Unfortunately, two restrictions prohibit a ubiquitous application of this learning paradigm in practice. First, whereas prior studies mainly assume a homogenous feature type, data streams generated from real applications can be heterogeneous, in which Boolean, ordinal, and continuous features co-exist. Existing methods that prescribe parametric distributions such as Gaussians would not suffice to model the correlation among such mix-typed features. Second, while full supervision seems to be a default setup, providing labels to all arriving data instances over a long time span is tangibly onerous, laborious, and economically unsustainable. Alas, a semi-supervised online learner that can deal with mix-typed, varying feature spaces is still missing. To fill the gap, this paper explores a novel problem, named Online Semi-supervised Learning with Mix-typed streaming Features (OSLMF), which strives to relax the restrictions on the feature type and supervision information. Our key idea to solve the new problem is to leverage a copula model to align the data instances with different feature spaces so as to make their distance measurable. A geometric structure underlying data instances is then established in an online fashion based on their distances, through which the limited labeling information is propagated, from the scarce labeled instances to their close neighbors. Experimental results are documented to demonstrate the viability and effectiveness of our proposed approach. Code is released at https://github.com/wudi1989/OSLMF.

JBHI Journal 2022 Journal Article

Mul-SNO: A Novel Prediction Tool for S-Nitrosylation Sites Based on Deep Learning Methods

  • Qian Zhao
  • Jiaqi Ma
  • Yu Wang
  • Fang Xie
  • Zhibin Lv
  • Yaoqun Xu
  • Hua Shi
  • Ke Han

Protein s-nitrosylation (SNO) is one of the most important post-translational modifications and is formed by the covalent modification of nitric oxide and cysteine residues. Extensive studies have shown that SNO plays a pivotal role in the plant immune response and in the treatment of various major human diseases. In recent years, SNO sites have become a hot research topic. Traditional biochemical methods for SNO site identification are time-consuming and costly. In this study, we developed an economical and efficient SNO site prediction tool named Mul-SNO. Mul-SNO ensembles the currently popular and powerful deep learning models bidirectional long short-term memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT). Compared with existing state-of-the-art methods, Mul-SNO obtained better ACC of 0.911 and 0.796 based on 10-fold cross-validation and independent data sets, respectively.

NeurIPS Conference 2022 Conference Paper

Out-of-Distribution Detection via Conditional Kernel Independence Model

  • Yu Wang
  • Jingjing Zou
  • Jingyang Lin
  • Qing Ling
  • Yingwei Pan
  • Ting Yao
  • Tao Mei

Recently, various methods have been introduced to address the OOD detection problem with training outlier exposure. These methods usually rely on a discriminative softmax metric or an energy-based method to screen OOD samples. In this paper, we probe an alternative hypothesis on OOD detection by constructing a novel latent variable model based on independent component analysis (ICA) techniques. This novel method, named Conditional-i, builds upon the probabilistic formulation and applies the Hilbert-Schmidt Independence Criterion, which offers a convenient solution for optimizing variable dependencies. Conditional-i exclusively encodes the useful class condition into the probabilistic model, which provides the desired convenience in delivering theoretical support for the OOD detection task. To facilitate the implementation of the Conditional-i model, we construct unique memory bank architectures that allow for convenient end-to-end training within a tractable budget. Empirical results demonstrate an evident performance boost on benchmarks against SOTA methods. We also provide valuable theoretical justifications that our training strategy is guaranteed to bound the error in the context of OOD detection. Code is available at: https://github.com/OODHSIC/conditional-i.
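The Hilbert-Schmidt Independence Criterion that Conditional-i builds on has a compact empirical form: doubly center one Gram matrix and take a normalized trace against the other. A minimal pure-Python estimate with a linear kernel, as a sketch only (the paper's kernels and class conditioning are more involved):

```python
def hsic(xs, ys, kernel=lambda a, b: a * b):
    """Biased empirical HSIC estimate: tr(K H L H) / (n - 1)**2.

    A value near zero suggests the two samples are independent under the
    chosen kernel; larger values indicate stronger dependence.
    """
    n = len(xs)
    K = [[kernel(a, b) for b in xs] for a in xs]
    L = [[kernel(a, b) for b in ys] for a in ys]
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    # Doubly center K; by trace cyclicity tr(K H L H) = tr(Kc L).
    Kc = [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]
    return sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n)) / (n - 1) ** 2
```

Minimizing such a dependence measure between chosen variables is what "optimizing variable dependencies" amounts to in practice.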

NeurIPS Conference 2022 Conference Paper

TA-GATES: An Encoding Scheme for Neural Network Architectures

  • Xuefei Ning
  • Zixuan Zhou
  • Junbo Zhao
  • Tianchen Zhao
  • Yiping Deng
  • Changcheng Tang
  • Shuang Liang
  • Huazhong Yang

Neural architecture search tries to shift the manual design of neural network (NN) architectures to algorithmic design. In these cases, the NN architecture itself can be viewed as data and needs to be modeled. A better modeling could help explore novel architectures automatically and open the black box of automated architecture design. To this end, this work proposes a new encoding scheme for neural architectures, the Training-Analogous Graph-based ArchiTecture Encoding Scheme (TA-GATES). TA-GATES encodes an NN architecture in a way that is analogous to its training. Extensive experiments demonstrate that the flexibility and discriminative power of TA-GATES lead to better modeling of NN architectures. We expect our methodology of explicitly modeling the NN training process to benefit broader automated deep learning systems. The code is available at https://github.com/walkerning/aw_nas.

NeurIPS Conference 2022 Conference Paper

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games

  • Chao Yu
  • Akash Velu
  • Eugene Vinitsky
  • Jiaxuan Gao
  • Yu Wang
  • Alexandre Bayen
  • Yi Wu

Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, the Hanabi challenge, and Google Research Football, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods are a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.
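At the heart of every PPO variant in the study is the clipped surrogate objective, which limits how far the updated policy can move from the one that collected the data. A single-sample sketch in plain Python (real implementations batch this over trajectories and add value and entropy terms):

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate objective for one (state, action) sample.

    `ratio` is pi_new(a|s) / pi_old(a|s); clipping keeps the update
    conservative when the ratio drifts outside [1 - eps, 1 + eps].
    """
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Taking the min makes the objective a pessimistic (lower) bound.
    return min(ratio * advantage, clipped_ratio * advantage)
```

In cooperative MARL variants such as MAPPO, the same per-sample objective is applied for each agent, typically with shared policy parameters and a centralized value function.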

TIST Journal 2021 Journal Article

A Comprehensive Survey of Grammatical Error Correction

  • Yu Wang
  • Yuelin Wang
  • Kai Dang
  • Jie Liu
  • Zhuo Liu

Grammatical error correction (GEC) is an important application of natural language processing techniques, and GEC systems are important intelligent systems that have long been explored in both academic and industrial communities. The past decade has witnessed significant progress in GEC owing to the increasing popularity of machine learning and deep learning. However, no survey has untangled the large amount of research work and progress in this field. We present the first survey of GEC for a comprehensive retrospective of the literature in this area. We first give the definition of the GEC task and introduce the public datasets and data annotation schema. After that, we discuss six kinds of basic approaches, six commonly applied performance-boosting techniques for GEC systems, and three data augmentation methods. Since GEC is typically viewed as a sister task of Machine Translation (MT), we put more emphasis on the statistical machine translation (SMT)-based and neural machine translation (NMT)-based approaches for the sake of their importance. Similarly, some performance-boosting techniques adapted from MT have been successfully combined with GEC systems to enhance final performance. More importantly, after introducing evaluation in GEC, we make an in-depth analysis based on empirical results in terms of GEC approaches and GEC systems for a clearer picture of progress in GEC, where error type analysis and system recapitulation are clearly presented. Finally, we discuss five prospective directions for future GEC research.

AAAI Conference 2021 Conference Paper

Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extraction

  • Shaowei Chen
  • Yu Wang
  • Jie Liu
  • Yuelin Wang

Aspect sentiment triplet extraction (ASTE), which aims to identify aspects from review sentences along with their corresponding opinion expressions and sentiments, is an emerging task in fine-grained opinion mining. Since ASTE consists of multiple subtasks, including opinion entity extraction, relation detection, and sentiment classification, it is critical and challenging to appropriately capture and utilize the associations among them. In this paper, we transform ASTE task into a multi-turn machine reading comprehension (MTMRC) task and propose a bidirectional MRC (BMRC) framework to address this challenge. Specifically, we devise three types of queries, including non-restrictive extraction queries, restrictive extraction queries and sentiment classification queries, to build the associations among different subtasks. Furthermore, considering that an aspect sentiment triplet can derive from either an aspect or an opinion expression, we design a bidirectional MRC structure. One direction sequentially recognizes aspects, opinion expressions, and sentiments to obtain triplets, while the other direction identifies opinion expressions first, then aspects, and at last sentiments. By making the two directions complement each other, our framework can identify triplets more comprehensively. To verify the effectiveness of our approach, we conduct extensive experiments on four benchmark datasets. The experimental results demonstrate that BMRC achieves state-of-the-art performances.

JBHI Journal 2021 Journal Article

Deep Semisupervised Multitask Learning Model and Its Interpretability for Survival Analysis

  • Shengqiang Chi
  • Yu Tian
  • Feng Wang
  • Yu Wang
  • Ming Chen
  • Jingsong Li

Survival analysis is a commonly used method in the medical field to analyze and predict the time of events. In medicine, this approach plays a key role in determining the course of treatment, developing new drugs, and improving hospital procedures. Most of the existing work in this area has addressed the problem by making strong assumptions about the underlying stochastic process. However, these assumptions are usually violated in real-world data. This paper proposes a semisupervised multitask learning (SSMTL) method based on deep learning for survival analysis with or without competing risks. SSMTL transforms the survival analysis problem into a multitask learning problem that includes semisupervised learning and multipoint survival probability prediction. The distribution of survival times and the relationship between covariates and outcomes are modeled directly without any assumptions. Semisupervised loss and ranking loss are used to deal with censored data and the prior knowledge of the nonincreasing trend of the survival probability. Additionally, the importance of prognostic factors is determined, and the time-dependent and nonlinear effects of these factors on survival outcomes are visualized. The prediction performance of SSMTL is better than that of previous models in settings with or without competing risks, and the effects of predictors are successfully described. This study is of great significance for the exploration and application of deep learning methods involving medical structured data and provides an effective deep-learning-based method for survival analysis with complex-structured clinical data.

NeurIPS Conference 2021 Conference Paper

Detecting Individual Decision-Making Style: Exploring Behavioral Stylometry in Chess

  • Reid McIlroy-Young
  • Yu Wang
  • Siddhartha Sen
  • Jon Kleinberg
  • Ashton Anderson

The advent of machine learning models that surpass human decision-making ability in complex domains has initiated a movement towards building AI systems that interact with humans. Many building blocks are essential for this activity, with a central one being the algorithmic characterization of human behavior. While much of the existing work focuses on aggregate human behavior, an important long-range goal is to develop behavioral models that specialize to individual people and can differentiate among them. To formalize this process, we study the problem of behavioral stylometry, in which the task is to identify a decision-maker from their decisions alone. We present a transformer-based approach to behavioral stylometry in the context of chess, where one attempts to identify the player who played a set of games. Our method operates in a few-shot classification framework, and can correctly identify a player from among thousands of candidate players with 98% accuracy given only 100 labeled games. Even when trained on amateur play, our method generalises to out-of-distribution samples of Grandmaster players, despite the dramatic differences between amateur and world-class players. Finally, we consider more broadly what our resulting embeddings reveal about human style in chess, as well as the potential ethical implications of powerful methods for identifying individuals from behavioral data.

NeurIPS Conference 2021 Conference Paper

Evaluating Efficient Performance Estimators of Neural Architectures

  • Xuefei Ning
  • Changcheng Tang
  • Wenshuo Li
  • Zixuan Zhou
  • Shuang Liang
  • Huazhong Yang
  • Yu Wang

Conducting efficient performance estimations of neural architectures is a major challenge in neural architecture search (NAS). To reduce the architecture training costs in NAS, one-shot estimators (OSEs) amortize the architecture training costs by sharing the parameters of one supernet between all architectures. Recently, zero-shot estimators (ZSEs) that involve no training have been proposed to further reduce the architecture evaluation cost. Despite the high efficiency of these estimators, the quality of such estimations has not been thoroughly studied. In this paper, we conduct an extensive and organized assessment of OSEs and ZSEs on five NAS benchmarks: NAS-Bench-101/201/301, and NDS ResNet/ResNeXt-A. Specifically, we employ a set of NAS-oriented criteria to study the behavior of OSEs and ZSEs, and reveal their biases and variances. After analyzing how and why the OSE estimations are unsatisfying, we explore how to mitigate the correlation gap of OSEs from three perspectives. Through our analysis, we give suggestions for the future application and development of efficient architecture performance estimators. Furthermore, the analysis framework proposed in our work could be utilized in future research to give a more comprehensive understanding of newly designed architecture performance estimators. The code is available at https://github.com/walkerning/aw_nas.

NeurIPS Conference 2021 Conference Paper

Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration

  • Yu Wang
  • Jingyang Lin
  • Jingjing Zou
  • Yingwei Pan
  • Ting Yao
  • Tao Mei

Our work reveals a structured shortcoming of the existing mainstream self-supervised learning methods. Whereas self-supervised learning frameworks usually take the prevailing perfect instance-level invariance hypothesis for granted, we carefully investigate the pitfalls behind it. Particularly, we argue that the existing augmentation pipeline for generating multiple positive views naturally introduces out-of-distribution (OOD) samples that undermine the learning of the downstream tasks. Generating diverse positive augmentations on the input does not always pay off in benefiting downstream tasks. To overcome this inherent deficiency, we introduce a lightweight latent variable model, UOTA, targeting the view sampling issue for self-supervised learning. UOTA adaptively searches for the most important sampling region to produce views, and provides a viable choice for outlier-robust self-supervised learning approaches. Our method directly generalizes to many mainstream self-supervised learning approaches, regardless of whether the loss is contrastive or not. We empirically show UOTA's advantage over the state-of-the-art self-supervised paradigms by an evident margin, which well justifies the existence of the OOD sample issue embedded in the existing approaches. Especially, we theoretically prove that the merits of the proposal boil down to guaranteed estimator variance and bias reduction. Code is available at: https://github.com/ssl-codelab/uota.

NeurIPS Conference 2021 Conference Paper

Meta-learning with an Adaptive Task Scheduler

  • Huaxiu Yao
  • Yu Wang
  • Ying Wei
  • Peilin Zhao
  • Mehrdad Mahdavi
  • Defu Lian
  • Chelsea Finn

To benefit the learning of a new task, meta-learning has been proposed to transfer a well-generalized meta-model learned from various meta-training tasks. Existing meta-learning algorithms randomly sample meta-training tasks with a uniform probability, under the assumption that tasks are of equal importance. However, some tasks may be detrimental, e.g., noisy or imbalanced, given a limited number of meta-training tasks. To prevent the meta-model from being corrupted by such detrimental tasks or dominated by tasks in the majority, in this paper, we propose an adaptive task scheduler (ATS) for the meta-training process. In ATS, for the first time, we design a neural scheduler to decide which meta-training tasks to use next by predicting the probability of being sampled for each candidate task, and train the scheduler to optimize the generalization capacity of the meta-model to unseen tasks. We identify two meta-model-related factors as the input of the neural scheduler, which characterize the difficulty of a candidate task to the meta-model. Theoretically, we show that a scheduler taking the two factors into account improves the meta-training loss and also the optimization landscape. Under the setting of meta-learning with noise and limited budgets, ATS improves the performance on both miniImageNet and a real-world drug discovery benchmark by up to 13% and 18%, respectively, compared to state-of-the-art task schedulers.

AAAI Conference 2021 System Paper

Mobile-based Clock Drawing Test for Detecting Early Signs of Dementia

  • Hongchao Jiang
  • Yanci Zhang
  • Zhiwei Zeng
  • Jun Ji
  • Yu Wang
  • Ying Chi
  • Chunyan Miao

Dementia is one of the major causes of disability and dependency among older people. Early detection is the key for preserving the quality of life of the patients and reducing caring costs. The Clock Drawing Test (CDT) is commonly used by clinicians to screen for early signs of dementia. We build an automated CDT that runs on mobile platforms, enabling convenient and frequent self-monitoring and testing at minimal costs. Our system combines both a spatial-temporal approach and a purely image-based deep learning approach to analyze and evaluate the hand-drawn clocks based on established clinical criteria. Our system produces scores that are highly correlated with expert human raters.

NeurIPS Conference 2021 Conference Paper

Variational Automatic Curriculum Learning for Sparse-Reward Cooperative Multi-Agent Problems

  • Jiayu Chen
  • Yuanxin Zhang
  • Yuanfan Xu
  • Huimin Ma
  • Huazhong Yang
  • Jiaming Song
  • Yu Wang
  • Yi Wu

We introduce an automatic curriculum algorithm, Variational Automatic Curriculum Learning (VACL), for solving challenging goal-conditioned cooperative multi-agent reinforcement learning problems. We motivate our curriculum learning paradigm through a variational perspective, where the learning objective can be decomposed into two terms: task learning on the current curriculum, and curriculum update to a new task distribution. Local optimization over the second term suggests that the curriculum should gradually expand the training tasks from easy to hard. Our VACL algorithm implements this variational paradigm with two practical components, task expansion and entity curriculum, which produces a series of training tasks over both the task configurations as well as the number of entities in the task. Experiment results show that VACL solves a collection of sparse-reward problems with a large number of agents. Particularly, using a single desktop machine, VACL achieves 98% coverage rate with 100 agents in the simple-spread benchmark and reproduces the ramp-use behavior originally shown in OpenAI’s hide-and-seek project.

AAAI Conference 2020 Conference Paper

Actor Critic Deep Reinforcement Learning for Neural Malware Control

  • Yu Wang
  • Jack Stokes
  • Mady Marinescu

In addition to using signatures, antimalware products also detect malicious attacks by evaluating unknown files in an emulated environment, i.e., a sandbox, prior to execution on a computer’s native operating system. During emulation, a file cannot be scanned indefinitely, and antimalware engines often set the number of instructions to be executed based on a set of heuristics. These heuristics only make the decision of when to halt emulation using partial information, leading to the execution of the file for either too many or too few instructions. Also, this method is vulnerable if the attackers learn this set of heuristics. Recent research uses a deep reinforcement learning (DRL) model employing a Deep Q-Network (DQN) to learn when to halt the emulation of a file. In this paper, we propose a new DRL-based system which instead employs a modified actor critic (AC) framework for the emulation halting task. This AC model dynamically predicts the best time to halt the file’s execution based on a sequence of system API calls. Compared to the earlier models, the new model is capable of handling adversarial attacks by simulating their behaviors using the critic model. The new AC model demonstrates much better performance than both the DQN model and the antimalware engine’s heuristics. In terms of execution speed (evaluated by the halting decision), the new model halts the execution of unknown files by up to 2.5% earlier than the DQN model and 93.6% earlier than the heuristics. For the task of detecting malicious files, the proposed AC model increases the true positive rate by 9.9% from 69.5% to 76.4% at a false positive rate of 1% compared to the DQN model, and by 83.4% from 41.2% to 76.4% at a false positive rate of 1% compared to a recently proposed LSTM model.

IJCAI Conference 2020 Conference Paper

CooBa: Cross-project Bug Localization via Adversarial Transfer Learning

  • Ziye Zhu
  • Yun Li
  • Hanghang Tong
  • Yu Wang

Bug localization plays an important role in software quality control. Many supervised machine learning models have been developed based on historical bug-fix information. Despite being successful, these methods often require sufficient historical data (i.e., labels), which is not always available, especially for newly developed software projects. In response, cross-project bug localization techniques have recently emerged, whose key idea is to transfer knowledge from a label-rich source project to locate bugs in the target project. However, a major limitation of these existing techniques lies in that they fail to capture the specificity of each individual project, and are thus prone to negative transfer. To address this issue, we propose an adversarial transfer learning bug localization approach, focusing on transferring only the common characteristics (i.e., public information) across projects. Specifically, our approach (CooBa) learns the indicative public information from cross-project bug reports through a shared encoder, and extracts the private information from code files by an individual feature extractor for each project. CooBa further incorporates an adversarial learning mechanism to ensure that public information shared between multiple projects can be effectively extracted. Extensive experiments on four large-scale real-world data sets demonstrate that the proposed CooBa significantly outperforms state-of-the-art techniques.

KR Conference 2020 System Paper

Explainable and Argumentation-based Decision Making with Qualitative Preferences for Diagnostics and Prognostics of Alzheimer's Disease

  • Zhiwei Zeng
  • Zhiqi Shen
  • Benny Toh Hsiang Tan
  • Jing Jih Chin
  • Cyril Leung
  • Yu Wang
  • Ying Chi
  • Chunyan Miao

Argumentation has recently gained traction as a formalism for making decisions more transparent and providing formal explanations. In this paper, we present an argumentation-based approach to decision making that can support modelling and automated reasoning about complex qualitative preferences and offer dialogical explanations for the decisions made. We first propose Qualitative Preference Decision Frameworks (QPDFs). In a QPDF, we use contextual priority to represent the relative importance of combinations of goals in different contexts and define associated strategies for deriving decision preferences based on prioritized goal combinations. To automate the decision computation, we map QPDFs to Assumption-based Argumentation (ABA) frameworks so that we can utilize existing ABA argumentative engines for our implementation. We implemented our approach for two tasks, diagnostics and prognostics of Alzheimer's Disease (AD), and evaluated it with real-world datasets. For each task, one of our models achieves the highest accuracy and good precision and recall for all classes compared to common machine learning models. Moreover, we study how to formalize argumentation dialogues that give contrastive, focused, and selected explanations for the most preferred decisions in given contexts.

AAAI Conference 2020 Conference Paper

Feature Variance Regularization: A Simple Way to Improve the Generalizability of Neural Networks

  • Ranran Huang
  • Hanbo Sun
  • Ji Liu
  • Lu Tian
  • Li Wang
  • Yi Shan
  • Yu Wang

To improve the generalization ability of neural networks, we propose a novel regularization method that regularizes the empirical risk using a penalty on the empirical variance of the features. Intuitively, our approach introduces confusion into feature extraction and prevents the models from learning features that may relate to specific training samples. According to our theoretical analysis, our method encourages models to generate closer feature distributions for the training set and unobservable true data and to minimize the expected risk as well, which allows the model to adapt to new samples better. We provide a thorough empirical justification of our approach and achieve a greater improvement than other regularization methods. The experimental results show the effectiveness of our method on multiple visual tasks, including classification (CIFAR100, ImageNet, fine-grained datasets) and semantic segmentation (Cityscapes).
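
The variance penalty this abstract describes is easy to state concretely. As a hedged sketch in plain Python (our own names and form, not the authors' code), the regularizer averages the per-dimension variance of the features over a batch and is added to the empirical risk with a weighting coefficient:

```python
def feature_variance_penalty(features):
    """Average over feature dimensions of the (biased) variance across the
    batch. Illustrative sketch of a feature-variance regularizer, not the
    paper's implementation."""
    n = len(features)
    dims = len(features[0])
    penalty = 0.0
    for d in range(dims):
        column = [f[d] for f in features]          # one feature dimension
        mean = sum(column) / n
        penalty += sum((x - mean) ** 2 for x in column) / n
    return penalty / dims
```

Identical feature vectors incur zero penalty; spread-out features are penalized, discouraging representations tied to individual training samples.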

NeurIPS Conference 2020 Conference Paper

Joint Contrastive Learning with Infinite Possibilities

  • Qi Cai
  • Yu Wang
  • Yingwei Pan
  • Ting Yao
  • Tao Mei

This paper explores useful modifications of the recent development in contrastive learning via novel probabilistic modeling. We derive a particular form of contrastive loss named Joint Contrastive Learning (JCL). JCL implicitly involves the simultaneous learning of an infinite number of query-key pairs, which poses tighter constraints when searching for invariant features. We derive an upper bound on this formulation that allows analytical solutions in an end-to-end training manner. While JCL is practically effective in numerous computer vision applications, we also theoretically unveil certain mechanisms that govern the behavior of JCL. We demonstrate that the proposed formulation harbors an innate agency that strongly favors similarity within each instance-specific class, and therefore remains advantageous when searching for discriminative features among distinct instances. We evaluate these proposals on multiple benchmarks, demonstrating considerable improvements over existing algorithms. Code is publicly available at: https://github.com/caiqi/Joint-Contrastive-Learning.

NeurIPS Conference 2020 Conference Paper

Learning to Adapt to Evolving Domains

  • Hong Liu
  • Mingsheng Long
  • Jianmin Wang
  • Yu Wang

Domain adaptation aims at knowledge transfer from a labeled source domain to an unlabeled target domain. Current domain adaptation methods have made substantial advances in adapting discrete domains. However, this can be unrealistic in real-world applications, where target data usually come in an online and continually evolving manner as small batches, posing challenges to the classic domain adaptation paradigm: (1) Mainstream domain adaptation methods are tailored to stationary target domains, and can fail in non-stationary environments. (2) Since the target data arrive online, the agent should also maintain competence on previous target domains, i.e., adapt without forgetting. To tackle these challenges, we propose a meta-adaptation framework which enables the learner to adapt to continually evolving target domains without catastrophic forgetting. Our framework comprises two components: a meta-objective of learning representations to adapt to evolving domains, enabling meta-learning for unsupervised domain adaptation; and a meta-adapter for learning to adapt without forgetting, preserving knowledge from previous target data. Experiments validate the effectiveness of our method on evolving target domains.

JMLR Journal 2020 Journal Article

Unique Sharp Local Minimum in L1-minimization Complete Dictionary Learning

  • Yu Wang
  • Siqi Wu
  • Bin Yu

We study the problem of globally recovering a dictionary from a set of signals via $\ell_1$-minimization. We assume that the signals are generated as i.i.d. random linear combinations of the $K$ atoms from a complete reference dictionary $D^*\in \mathbb R^{K\times K}$, where the linear combination coefficients are from either a Bernoulli type model or an exact sparse model. First, we obtain a necessary and sufficient norm condition for the reference dictionary $D^*$ to be a sharp local minimum of the expected $\ell_1$ objective function. Our result substantially extends that of Wu and Yu (2015) and allows the combination coefficients to be non-negative. Secondly, we obtain an explicit bound on the region within which the objective value of the reference dictionary is minimal. Thirdly, we show that the reference dictionary is the unique sharp local minimum, thus establishing the first known global property of $\ell_1$-minimization dictionary learning. Motivated by the theoretical results, we introduce a perturbation-based test to determine whether a dictionary is a sharp local minimum of the objective function. In addition, we also propose a new dictionary learning algorithm based on Block Coordinate Descent, called DL-BCD, which is guaranteed to decrease the objective function monotonically. Simulation studies show that DL-BCD has competitive performance in terms of recovery rate compared to other state-of-the-art dictionary learning algorithms when the reference dictionary is generated from random Gaussian matrices.

NeurIPS Conference 2019 Conference Paper

A Debiased MDI Feature Importance Measure for Random Forests

  • Xiao Li
  • Yu Wang
  • Sumanta Basu
  • Karl Kumbier
  • Bin Yu

Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, people routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high importance to noisy features, leading to systematic bias in feature selection. In this paper, we address the feature selection bias of MDI from both theoretical and methodological perspectives. Based on the original definition of MDI by Breiman et al. (1984) for a single tree, we derive a tight non-asymptotic bound on the expected bias of MDI importance of noisy features, showing that deep trees have higher (expected) feature selection bias than shallow ones. However, it is not clear how to reduce the bias of MDI using its existing analytical expression. We derive a new analytical expression for MDI, and based on this new expression, we are able to propose a debiased MDI feature importance measure using out-of-bag samples, called MDI-oob. For both the simulated data and a genomic ChIP dataset, MDI-oob achieves state-of-the-art performance in feature selection from Random Forests for both deep and shallow trees.

AAAI Conference 2019 Conference Paper

A Deep Reinforcement Learning Based Multi-Step Coarse to Fine Question Answering (MSCQA) System

  • Yu Wang
  • Hongxia Jin

In this paper, we present a multi-step coarse to fine question answering (MSCQA) system which can efficiently process documents with different lengths by choosing appropriate actions. The system is designed using an actor-critic based deep reinforcement learning model to achieve multi-step question answering. Compared to previous QA models targeting datasets mainly containing either short or long documents, our multi-step coarse to fine model takes the merits from multiple system modules, which can handle both short and long documents. The system hence obtains a much better accuracy and faster training speed compared to the current state-of-the-art models. We test our model on four QA datasets, WIKIREADING, WIKIREADING LONG, CNN and SQuAD, and demonstrate 1.3%-1.7% accuracy improvements with 1.5x-3.4x training speed-ups in comparison to the baselines using state-of-the-art models.

AAMAS Conference 2019 Conference Paper

A New Concept of Convex based Multiple Neural Networks Structure

  • Yu Wang
  • Yue Deng
  • Yilin Shen
  • Hongxia Jin

In this paper, a new concept of a convex based multiple neural networks structure is proposed. This new approach uses the collective information from multiple neural networks to train the model. Through both theoretical and experimental analysis, we demonstrate that the new approach gives a faster training speed of convergence with a similar or even better test accuracy, compared to a conventional neural network structure. Two experiments are conducted to demonstrate the performance of our new structure: the first one is a semantic frame parsing task for spoken language understanding (SLU) on the ATIS dataset, and the other is a handwritten digit recognition task on the MNIST dataset. We test this new structure using both recurrent and convolutional neural networks on these two tasks. The results of both experiments demonstrate a 4x-8x faster training speed with better or similar performance by using this new concept.

NeurIPS Conference 2019 Conference Paper

Unified Language Model Pre-training for Natural Language Understanding and Generation

  • Li Dong
  • Nan Yang
  • Wenhui Wang
  • Furu Wei
  • Xiaodong Liu
  • Yu Wang
  • Jianfeng Gao
  • Ming Zhou

This paper presents a new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UniLM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.

JMLR Journal 2018 Journal Article

Connections with Robust PCA and the Role of Emergent Sparsity in Variational Autoencoder Models

  • Bin Dai
  • Yu Wang
  • John Aston
  • Gang Hua
  • David Wipf

Variational autoencoders (VAE) represent a popular, flexible form of deep generative model that can be stochastically fit to samples from a given random process using an information-theoretic variational bound on the true underlying distribution. Once so-obtained, the model can be putatively used to generate new samples from this distribution, or to provide a low-dimensional latent representation of existing samples. While quite effective in numerous application domains, certain important mechanisms which govern the behavior of the VAE are obfuscated by the intractable integrals and resulting stochastic approximations involved. Moreover, as a highly non-convex model, it remains unclear exactly how minima of the underlying energy relate to original design purposes. We attempt to better quantify these issues by analyzing a series of tractable special cases of increasing complexity. In doing so, we unveil interesting connections with more traditional dimensionality reduction models, as well as an intrinsic yet underappreciated propensity for robustly dismissing sparse outliers when estimating latent manifolds. With respect to the latter, we demonstrate that the VAE can be viewed as the natural evolution of recent robust PCA models, capable of learning nonlinear manifolds of unknown dimension obscured by gross corruptions.

IJCAI Conference 2018 Conference Paper

Deep Propagation Based Image Matting

  • Yu Wang
  • Yi Niu
  • Peiyong Duan
  • Jianwei Lin
  • Yuanjie Zheng

In this paper, we propose a deep propagation based image matting framework by introducing deep learning into learning an alpha matte propagation principle. Our deep learning architecture is a concatenation of a deep feature extraction module, an affinity learning module and a matte propagation module. These three modules are all differentiable and can be optimized jointly via an end-to-end training process. Our framework results in a semantic-level pairwise similarity of pixels for propagation by learning deep image representations adapted to matte propagation. It combines the power of deep learning and matte propagation and can therefore surpass prior state-of-the-art matting techniques in terms of both accuracy and training complexity, as validated by our experimental results from 243K images created based on two benchmark matting databases.

AAAI Conference 2018 Conference Paper

Telepath: Understanding Users from a Human Vision Perspective in Large-Scale Recommender Systems

  • Yu Wang
  • Jixing Xu
  • Aohan Wu
  • Mantian Li
  • Yang He
  • Jinghe Hu
  • Weipeng Yan

Designing an e-commerce recommender system that serves hundreds of millions of active users is a daunting challenge. To the best of our knowledge, the complex brain activity mechanism behind human shopping activities has never been considered in existing recommender systems. From a human vision perspective, we found two key factors that affect users’ behaviors: items’ attractiveness and their matching degrees with users’ interests. This paper proposes Telepath, a vision-based bionic recommender system model, which simulates human brain activities in the decision making of shopping, thus understanding users from such a perspective. The core of Telepath is a complex deep neural network with multiple subnetworks. In practice, the Telepath model has been launched to JD’s recommender system and advertising system and outperformed the former state-of-the-art method. For one of the major item recommendation blocks on the JD app, click-through rate (CTR), gross merchandise value (GMV) and orders have been increased by 1.59%, 8.16% and 8.71% respectively by Telepath. For several major ad publishers of the JD demand-side platform, CTR, GMV and return on investment have been increased by 6.58%, 61.72% and 65.57% respectively by the first launch of Telepath, and further increased by 2.95%, 41.75% and 41.37% respectively by the second launch.

JBHI Journal 2017 Journal Article

A Shared Decision-Making System for Diabetes Medication Choice Utilizing Electronic Health Record Data

  • Yu Wang
  • Peng-Fei Li
  • Yu Tian
  • Jing-Jing Ren
  • Jing-Song Li

The use of a shared decision-making (SDM) process in antihyperglycemic medication strategy decisions is necessary due to the complexity of the conditions of diabetes patients. Knowledge of guidelines is used as decision aids in clinical situations, and during this process, no patient health conditions are considered. In this paper, we propose an SDM system framework for type-2 diabetes mellitus (T2DM) patients that not only contains knowledge abstracted from guidelines but also employs a multilabel classification model that uses class-imbalanced electronic health record (EHR) data and aims to provide a recommended list of available antihyperglycemic medications to help physicians and patients have an SDM conversation. The use of EHR data to serve as a decision-support component in decision aids helps physicians and patients reach a more intuitive understanding of current health conditions and allows the tailoring of the available knowledge to each patient, leading to a more effective SDM. Real-world data from 2542 T2DM inpatient EHRs were represented by 77 features and eight output labels, i.e., eight antihyperglycemic medications, and these data were utilized to build and validate the recommendation model. The multilabel recommendation model exhibited stable performance in every single-label classification and showed the ability to predict minority positive cases, with an average recall value of 0.9898 across the eight classes. As a whole multilabel classifier, the recommendation model demonstrated outstanding performance, with scores of 0.0941 for Hamming Loss, 0.7611 for Accuracy exam, 0.9664 for Recall exam, and 0.8269 for F exam.

IJCAI Conference 2017 Conference Paper

Fast Change Point Detection on Dynamic Social Networks

  • Yu Wang
  • Aniket Chakrabarti
  • David Sivakoff
  • Srinivasan Parthasarathy

A number of real-world problems in many domains (e.g., sociology, biology, political science and communication networks) can be modeled as dynamic networks, with nodes representing entities of interest and edges representing interactions among the entities at different points in time. A common representation for such models is the snapshot model, where a network is defined at logical time-stamps. An important problem under this model is change point detection. In this work we devise an effective and efficient three-step approach for detecting change points in dynamic networks under the snapshot model. Our algorithm achieves up to 9X speedup over the state-of-the-art while improving quality on both synthetic and real-world networks.
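
The paper's specific three-step algorithm is not reproduced here, but the snapshot model it builds on is easy to illustrate: represent each time-stamped network as an edge set and flag time steps where consecutive snapshots differ sharply. The following is a hypothetical Jaccard-distance baseline, not the authors' method; the threshold is an arbitrary illustrative choice:

```python
def jaccard_distance(edges_a, edges_b):
    """1 - |A ∩ B| / |A ∪ B| between two snapshot edge sets."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)

def change_points(snapshots, threshold=0.5):
    """Indices t where snapshot t differs sharply from snapshot t-1."""
    return [t for t in range(1, len(snapshots))
            if jaccard_distance(snapshots[t - 1], snapshots[t]) > threshold]
```

A real detector would replace the raw edge-set distance with a statistically calibrated test, but the input/output shape, a sequence of logical-time snapshots in, a list of change indices out, is the same.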

JBHI Journal 2013 Journal Article

A Suction Detection System for Rotary Blood Pumps Based on the Lagrangian Support Vector Machine Algorithm

  • Yu Wang
  • Marwan A. Simaan

The left ventricular assist device is a rotary mechanical pump that is implanted in patients with congestive heart failure to help the left ventricle in pumping blood in the circulatory system. However, using such a device may result in a very dangerous event, called ventricular suction, that can cause ventricular collapse due to overpumping of blood from the left ventricle when the rotational speed of the pump is high. Therefore, a reliable technique for detecting ventricular suction is crucial. This paper presents a new suction detection system that can precisely classify pump flow patterns, based on a Lagrangian support vector machine (LSVM) model that combines six suction indices extracted from the pump flow signal to make a decision about whether the pump is in suction, approaching suction, or not in suction. The proposed method has been tested using in vivo experimental data based on two different pumps. The simulation results show that the system can produce superior performance in terms of classification accuracy, stability, learning speed, and good robustness compared to three other existing suction detection methods and the original support vector machine (SVM) algorithm. The ability of the proposed algorithm to detect suction provides a reliable platform for the development of a feedback control system to control the speed of the pump while at the same time ensuring that suction is avoided.

IJCAI Conference 2013 Conference Paper

Towards Effective Prioritizing Water Pipe Replacement and Rehabilitation

  • Junchi Yan
  • Yu Wang
  • Ke Zhou
  • Jin Huang
  • Chunhua Tian
  • Hongyuan Zha
  • Weishan Dong

Water pipe failures can not only have a great impact on people’s daily life but also cause a significant waste of water, which is an essential and precious resource to human beings. As a result, preventative maintenance for water pipes, particularly in urban-scale networks, is of great importance for a sustainable society. To achieve effective replacement and rehabilitation, failure prediction, which aims to proactively find those ‘most-likely-to-fail’ pipes, becomes vital and has been attracting more attention from both academia and industry, especially from the civil engineering field. This paper presents an already-deployed industrial computational system for pipe failure prediction. As an alternative to risk matrix methods, which often depend on ad-hoc domain heuristics, learning based methods are adopted using attributes with respect to physical, environmental, and operational conditions, etc. A further challenge arises in practice when profile attributes are lacking. A dive into the failure records shows that the failure event sequences typically exhibit temporal clustering patterns, which motivates us to use a stochastic process to tackle the failure prediction task. Specifically, the failure sequence is formulated as a self-exciting stochastic process, which is, to our best knowledge, a novel formulation for pipe failure prediction. We show that it outperforms a baseline assuming the failure risk grows linearly with aging. Broad new problems and research points for the machine learning community are also introduced for future work.

AAAI Conference 2011 Conference Paper

Simulated Annealing Based Influence Maximization in Social Networks

  • Qingye Jiang
  • Guojie Song
  • Cong Gao
  • Yu Wang
  • Wenjun Si
  • Kunqing Xie

The problem of influence maximization, i.e., mining the top-k influential nodes from a social network such that the spread of influence in the network is maximized, is NP-hard. Most existing algorithms for the problem are based on the greedy algorithm. Although the greedy algorithm can achieve a good approximation, it is computationally expensive. In this paper, we propose a completely different approach based on Simulated Annealing (SA) for the influence maximization problem. This is the first SA-based algorithm for the problem. Additionally, we propose two heuristic methods to accelerate the convergence of SA, and a new method of computing influence to speed up the proposed algorithm. Experimental results on four real networks show that the proposed algorithms run faster than the state-of-the-art greedy algorithm by 2-3 orders of magnitude while also improving on its accuracy.
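The basic SA loop for this problem can be sketched as follows: keep a k-node seed set, propose a single-node swap, and accept worsening moves with probability exp(delta / T) under a cooling temperature. This is a generic sketch with a Monte-Carlo independent-cascade spread estimate, not the paper's accelerated variant or its heuristics; the toy graph and all parameters are hypothetical.

```python
import math, random

def spread_ic(graph, seeds, p=0.2, trials=200, rng=None):
    """Monte-Carlo estimate of influence spread under the
    independent cascade model with uniform edge probability p."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials

def sa_influence_max(graph, k, T0=1.0, cooling=0.95, steps=100, seed=0):
    """Simulated annealing over k-node seed sets: propose a single-node
    swap, accept worse sets with probability exp(delta / T)."""
    rng = random.Random(seed)
    nodes = list(graph)
    cur = rng.sample(nodes, k)
    cur_val = spread_ic(graph, cur, rng=rng)
    best, best_val, T = list(cur), cur_val, T0
    for _ in range(steps):
        cand = list(cur)
        cand[rng.randrange(k)] = rng.choice([n for n in nodes if n not in cur])
        cand_val = spread_ic(graph, cand, rng=rng)
        delta = cand_val - cur_val
        if delta > 0 or rng.random() < math.exp(delta / T):
            cur, cur_val = cand, cand_val
            if cur_val > best_val:
                best, best_val = list(cur), cur_val
        T *= cooling
    return best, best_val

# Toy directed graph: node 0 is a hub, so good seed sets tend to include it.
graph = {0: [1, 2, 3, 4, 5], 1: [2], 2: [3], 3: [4], 4: [5], 5: []}
seeds, val = sa_influence_max(graph, k=2)
print(seeds, round(val, 2))
```

Unlike the greedy algorithm, each SA step evaluates only one candidate set rather than every remaining node, which is where the order-of-magnitude speedups come from.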

TAAS Journal 2009 Journal Article

Self-organizing fault-tolerant topology control in large-scale three-dimensional wireless networks

  • Yu Wang
  • Lijuan Cao
  • Teresa A. Dahlberg
  • Fan Li
  • Xinghua Shi

Topology control protocols aim to efficiently adjust the network topology of wireless networks in a self-adaptive fashion to improve the performance and scalability of networks. This is especially essential for large-scale multihop wireless networks (e.g., wireless sensor networks). Fault-tolerant topology control has been studied recently. In order to achieve both sparseness (i.e., the number of links is linear in the number of nodes) and fault tolerance (i.e., the ability to survive a certain level of node/link failures), different geometric topologies have been proposed and used as the underlying network topologies for wireless networks. However, most existing topology control algorithms can only be applied to two-dimensional (2D) networks, where all nodes are distributed in a 2D plane. In practice, wireless networks may be deployed in three-dimensional (3D) space, such as underwater wireless sensor networks in the ocean or mobile ad hoc networks among space shuttles. This article investigates self-organizing fault-tolerant topology control protocols for large-scale 3D wireless networks. Our new protocols not only guarantee k-connectivity of the network, but also ensure bounded node degree and a constant power stretch factor even under k−1 node failures. All of our proposed protocols are localized algorithms, which use only one-hop neighbor information and a constant number of messages with small time complexity. Thus, it is easy to update the topology efficiently and self-adaptively for large-scale dynamic networks. Our simulations confirm our theoretical proofs for all proposed 3D topologies.
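To make the notions of a localized 3D topology and a survivability check concrete, here is a minimal sketch: each node keeps links to its nearest neighbors (a generic stand-in, not one of the paper's structures; k-nearest-neighbor graphs do not in general guarantee k-connectivity), followed by a BFS connectivity test under node removals.

```python
import math, random
from collections import deque

def nearest_neighbor_topology(pts, k):
    """Localized sketch: each node keeps bidirectional links to its
    k nearest neighbors in 3-D space."""
    adj = {i: set() for i in range(len(pts))}
    for i, p in enumerate(pts):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(pts) if j != i)
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)   # symmetrize so links are usable in both directions
    return adj

def connected(adj, removed=frozenset()):
    """BFS connectivity test on the surviving nodes."""
    alive = [v for v in adj if v not in removed]
    if not alive:
        return True
    seen, queue = {alive[0]}, deque([alive[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(alive)

rng = random.Random(1)
pts = [(rng.random(), rng.random(), rng.random()) for _ in range(60)]
adj = nearest_neighbor_topology(pts, k=6)
print("connected:", connected(adj))
print("survives one node failure:", connected(adj, removed={0}))
```

The decision in `nearest_neighbor_topology` uses only each node's own neighborhood, which is the "one-hop information" property the abstract emphasizes; the article's structures additionally prove bounded degree and power stretch, which this sketch does not.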

IROS Conference 2006 Conference Paper

Research on the Walking Modes Shifting Based on the Variable ZMP and 3-D.O.F Inverted Pendulum Model for a Humanoid and Gorilla Robot

  • Weiguo Wu
  • Yu Wang
  • Yunzhong Pan
  • Feng Liang

The walking-mode shifting of a gorilla robot is a kind of movement between the biped standing state and the quadruped landing state. In this paper, the robot mechanism is reduced to a 3-D.O.F inverted pendulum model with variable pendulum length, and the variable ZMP is defined as a function of the inverted pendulum angle. Based on dynamic balance theory, the trajectory equation of the robot's mass center during walking-mode shifting is deduced. Furthermore, through inverse kinematics analysis of the robot's mass center, the joint trajectories are obtained. Thus, a method of trajectory generation for walking-mode shifting of a humanoid and gorilla robot is proposed. To verify the correctness of the method, a calculation example of trajectory generation is provided, and a continuous-action simulation, including biped walking and quadruped landing, quadruped walking, and standing up, is successfully realized. On the basis of the above work, a continuous-action experiment, including biped walking and walking-mode transitions, has also been completed on "GoRoBoT", a humanoid and gorilla robot developed by us.
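For reference, the variable-ZMP definition builds on the standard zero-moment-point relation for a point-mass inverted pendulum; the paper's variable-length 3-D.O.F model generalizes this, and the planar constant-height form is only a baseline:

```latex
x_{\mathrm{zmp}} \;=\; x_c \;-\; \frac{z_c}{g}\,\ddot{x}_c
```

where $x_c$ and $z_c$ are the horizontal and vertical positions of the mass center and $g$ is gravitational acceleration. Keeping $x_{\mathrm{zmp}}$ inside the support polygon is the dynamic-balance condition that any generated mass-center trajectory must satisfy during the mode shift.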

ICRA Conference 1988 Conference Paper

On the inconsistency of rigid-body frictional planar mechanics

  • Matthew T. Mason
  • Yu Wang

The problem of a thin rigid rod sliding on a horizontal surface in the plane is considered. This problem is commonly cited as an example of the inconsistency of planar rigid-body Newtonian mechanics. The existence of a consistent solution, using Routh's analysis of rigid-body impact, is demonstrated.

ICRA Conference 1987 Conference Paper

Modeling impact dynamics for robotic operations

  • Yu Wang
  • Matthew T. Mason

The motion of an object to be manipulated is determined by the forces applied to the object. During a collision, impulsive forces may dominate all other forces, and determine the ultimate success or failure of a task. More effective planning and control of manipulators should be possible if the impact process, including the effects of friction and elasticity, is better understood. This paper explores the planar impact of two objects, and develops simple graphical methods for predicting the mode of contact, the total impulse, and the resultant motions of the objects. In the special case of a perfectly plastic collision, the fundamental motion of the object (whether an angular acceleration will occur, and if so in what direction) is the same as predicted in earlier work on quasi-static pushing.
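The perfectly plastic special case is easy to make concrete in the simplest setting. The sketch below handles only a frictionless, collinear impact of two point masses (restitution e = 0), a far simpler model than the paper's planar impacts with friction; it shows what "plastic" means for the impulse: both bodies leave the collision with one common velocity fixed by momentum conservation.

```python
def plastic_impact_1d(m1, v1, m2, v2):
    """Perfectly plastic (restitution e = 0) collinear impact of two
    point masses.  Both bodies share one post-impact velocity, found
    from conservation of linear momentum:
        v' = (m1*v1 + m2*v2) / (m1 + m2)
    and the impulse delivered to body 1 is J = m1 * (v' - v1).
    """
    v_common = (m1 * v1 + m2 * v2) / (m1 + m2)
    J = m1 * (v_common - v1)   # impulse on body 1 (equal and opposite on body 2)
    return v_common, J

# 2 kg body at 3 m/s strikes a 1 kg body at rest.
v, J = plastic_impact_1d(2.0, 3.0, 1.0, 0.0)
print(v, J)  # 2.0 m/s shared velocity, -2.0 N*s impulse on body 1
```

The paper's contribution is precisely what this sketch omits: with friction and rotation, the contact mode during impact (sticking vs. sliding) must itself be predicted, which is what the graphical methods address.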