Arrow Research search

Author name cluster

Wei Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

64 papers
2 author rows

Possible papers (64)

AAAI Conference 2026 Conference Paper

Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning

  • Wei Yang
  • Jesse Thomason

Multi-agent systems of large language models (LLMs) show promise for complex reasoning, but their effectiveness is often limited by fixed collaboration protocols. These frameworks typically focus on macro-level orchestration while overlooking agents’ internal deliberative capabilities. This critical meta-cognitive blind spot treats agents as passive executors unable to adapt their strategy based on internal cognitive states like uncertainty or confidence. We introduce the Meta-Policy Deliberation Framework (MPDF), where agents learn a decentralized policy over a set of high-level meta-cognitive actions: Persist, Refine, and Concede. To overcome the instability of traditional policy gradients in this setting, we develop SoftRankPO, a novel reinforcement learning algorithm. SoftRankPO stabilizes training by shaping advantages based on the rank of rewards mapped through smooth normal quantiles, making the learning process robust to reward variance. Experiments show that MPDF with SoftRankPO achieves a 4-5% absolute gain in average accuracy across six mathematical and general reasoning benchmarks compared to state-of-the-art heuristic and learning-based multi-agent reasoning algorithms. Our work presents a paradigm for learning adaptive, meta-cognitive policies for multi-agent LLM systems, shifting the focus from designing fixed protocols to learning dynamic, deliberative strategies.
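
The rank-shaping step described in the abstract has a compact form. Below is a minimal sketch of the idea, assuming a mid-rank quantile convention; the function name and details are illustrative readings of the description, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def softrank_advantages(rewards):
    """Sketch: shape advantages from the RANK of rewards, not their raw values.
    Ranks are mapped to mid-rank quantiles in (0, 1), then through the standard
    normal inverse CDF, so the result is insensitive to reward scale/variance."""
    r = np.asarray(rewards, dtype=float)
    ranks = r.argsort().argsort()          # 0 = worst reward, n-1 = best
    quantiles = (ranks + 0.5) / len(r)     # smooth mid-rank quantiles
    return norm.ppf(quantiles)             # normal scores used as advantages

print(softrank_advantages([0.1, 5.0, 2.0, -3.0]))  # ordering, not scale, matters
```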

TMLR Journal 2026 Journal Article

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

  • Wen Ye
  • Wei Yang
  • Defu Cao
  • Yizhou Zhang
  • Lumingyuan Tang
  • Jie Cai
  • Yan Liu

Time series analysis is crucial in real-world applications, yet traditional methods focus on isolated tasks only, and recent studies on time series reasoning remain limited either to single-step inference or to natural language answers. In this work, we introduce TS-Reasoner, a domain-specialized agent designed for multi-step time series inference. By integrating large language model (LLM) reasoning with domain-specific computational tools and an error feedback loop, TS-Reasoner enables domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. We assess the system’s capabilities along two axes: 1) fundamental time series understanding, assessed by TimeSeriesExam, and 2) complex, multi-step inference, evaluated by a newly proposed dataset designed to test both compositional reasoning and computational precision in time series analysis. Experiments show that our approach outperforms standalone general-purpose LLMs in both basic time series concept understanding and the multi-step time series inference task, highlighting the promise of domain-specialized agents for automating real-world time series reasoning and analysis.
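
The plan-execute-feedback loop the abstract describes can be illustrated as follows. This is a hypothetical sketch of such an agent loop; the `llm.plan` interface and the tool dictionary are assumed names for illustration, not TS-Reasoner's actual API.

```python
def run_time_series_agent(question, series, llm, tools, max_steps=8):
    """Hypothetical multi-step loop: an LLM plans tool calls over the series,
    numerical tools execute them, and failures are fed back for revision."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm.plan(history, tool_names=list(tools))   # next action (assumed API)
        if step["tool"] == "final_answer":
            return step["args"]["answer"]
        try:
            result = tools[step["tool"]](series, **step["args"])  # precise numeric call
            history.append(f"{step['tool']}({step['args']}) -> {result}")
        except Exception as err:   # error feedback loop: the model sees the failure
            history.append(f"{step['tool']} failed: {err}; please revise the plan")
    return None
```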

AAAI Conference 2025 Conference Paper

AAKR: Adversarial Attack-based Knowledge Retention for Continual Semantic Segmentation

  • Zhidong Yu
  • Xiaoman Liu
  • Jiajun Hu
  • Zhenbo Shi
  • Wei Yang

In the context of Continual Semantic Segmentation (CSS), replay-based methods tend to achieve better performance than knowledge distillation-based ones, as the former utilizes additional data to transfer old knowledge. However, this advantage comes at the cost of additional space for storing the generative model and extra time for continual training. To address this predicament, we propose a novel CSS framework, namely Adversarial Attack-based Knowledge Retention (AAKR). The AAKR framework generates specific adversarial samples by perturbing images, and uses them to retain old knowledge. Specifically, we leverage adversarial attacks to generate adversarial images for incremental samples. By imposing additional constraints within these attacks, we enhance the transfer of old knowledge, thereby reinforcing the understanding of previously learned information. Furthermore, we design an attack probability module that adjusts adversarial attack directions based on training feedback. This module effectively encourages the new model to learn old knowledge from poorly protected classes, significantly improving knowledge transfer effectiveness. Our comprehensive experiments demonstrate the efficacy of AAKR, and showcase that AAKR surpasses state-of-the-art competitors on benchmark datasets.

IROS Conference 2025 Conference Paper

Energy-Efficient Omnidirectional Locomotion for Wheeled Quadrupeds via Predictive Energy-Aware Nominal Gait Selection

  • Xu Yang 0044
  • Wei Yang
  • Kaibo He
  • Bo Yang 0064
  • Yanan Sui
  • Yilin Mo

Wheeled-legged robots combine the efficiency of wheels with the versatility of legs, but face significant energy optimization challenges when navigating diverse environments. In this work, we present a hierarchical control framework that integrates predictive power modeling with residual reinforcement learning to optimize omnidirectional locomotion efficiency for wheeled quadrupedal robots. Our approach employs a novel power prediction network that forecasts energy consumption across different gait patterns over a 1-second horizon, enabling intelligent selection of the most energy-efficient nominal gait. A reinforcement learning policy then generates residual adjustments to this nominal gait, fine-tuning the robot’s actions to balance energy efficiency with performance objectives. Comparative analysis shows our method reduces energy consumption by up to 35% compared to fixed-gait approaches while maintaining comparable velocity tracking performance. We validate our framework through extensive simulations and real-world experiments on a modified Unitree Go1 platform, demonstrating robust performance even under external disturbances. Videos and implementation details are available at https://sites.google.com/view/switching-wpg.
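
The nominal gait selection step reduces to scoring each candidate gait with the power predictor and taking the cheapest. A minimal sketch, where `power_net` and the per-gait encodings are stand-ins for the paper's power prediction network and gait library:

```python
import torch

@torch.no_grad()
def select_nominal_gait(power_net, state, velocity_cmd, gait_encodings):
    """Sketch: forecast energy over a 1 s horizon for each candidate gait
    and return the most energy-efficient one (names are assumptions)."""
    best_gait, best_cost = None, float("inf")
    for name, enc in gait_encodings.items():
        x = torch.cat([state, velocity_cmd, enc])   # features for the predictor
        cost = power_net(x).item()                  # predicted energy consumption
        if cost < best_cost:
            best_gait, best_cost = name, cost
    return best_gait   # an RL policy then adds residual adjustments to this gait
```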

JBHI Journal 2025 Journal Article

Frozen Large-Scale Pretrained Vision-Language Models are the Effective Foundational Backbone for Multimodal Breast Cancer Prediction

  • Hung Q. Vo
  • Lin Wang
  • Kelvin K. Wong
  • Chika F. Ezeana
  • Xiaohui Yu
  • Wei Yang
  • Jenny Chang
  • Hien V. Nguyen

Breast cancer is a pervasive global health concern among women. Leveraging multimodal data from enterprise patient databases—including Picture Archiving and Communication Systems (PACS) and Electronic Health Records (EHRs)—holds promise for improving prediction. This study introduces a multimodal deep-learning model leveraging mammogram datasets to evaluate breast cancer prediction. Our approach integrates frozen large-scale pretrained vision-language models, showcasing superior performance and stability compared to traditional image-tabular models across two public breast cancer datasets. The model consistently outperforms conventional full fine-tuning methods by using frozen pretrained vision-language models alongside a lightweight trainable classifier. The observed improvements are significant. In the CBIS-DDSM dataset, the Area Under the Curve (AUC) increases from 0.867 to 0.902 during validation and from 0.803 to 0.830 for the official test set. Within the EMBED dataset, AUC improves from 0.780 to 0.805 during validation. In scenarios with limited data, using Breast Imaging-Reporting and Data System category three (BI-RADS 3) cases, AUC improves from 0.91 to 0.96 on the official CBIS-DDSM test set and from 0.79 to 0.83 on a challenging validation set. This study underscores the benefits of vision-language models in jointly training diverse image-clinical datasets from multiple healthcare institutions, effectively addressing challenges related to non-aligned tabular features. Combining training data enhances breast cancer prediction on the EMBED dataset, outperforming all other experiments. In summary, our research emphasizes the efficacy of frozen large-scale pretrained vision-language models in multimodal breast cancer prediction, offering superior performance and stability over conventional methods, reinforcing their potential for breast cancer prediction.
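
The frozen-backbone design is simple to express. A minimal PyTorch sketch, assuming a generic pretrained image encoder; the head architecture here is illustrative, not the paper's exact classifier:

```python
import torch.nn as nn

class FrozenVLMClassifier(nn.Module):
    """Sketch: a frozen pretrained vision-language image encoder feeding a
    lightweight trainable classification head (only the head is updated)."""
    def __init__(self, vlm_image_encoder, embed_dim, n_classes=2):
        super().__init__()
        self.encoder = vlm_image_encoder
        for p in self.encoder.parameters():   # freeze the large backbone
            p.requires_grad = False
        self.head = nn.Sequential(            # small trainable classifier
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, images):
        feats = self.encoder(images)          # frozen features, no fine-tuning
        return self.head(feats)
```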

NeurIPS Conference 2025 Conference Paper

Leaving No OOD Instance Behind: Instance-Level OOD Fine-Tuning for Anomaly Segmentation

  • Yuxuan Zhang
  • Zhenbo Shi
  • han ye
  • Shuchang Wang
  • Zhidong Yu
  • Shaowei Wang
  • Wei Yang

Out-of-distribution (OOD) fine-tuning has emerged as a promising approach for anomaly segmentation. Current OOD fine-tuning strategies typically employ global-level objectives, aiming to guide segmentation models to accurately predict a large number of anomaly pixels. However, these strategies often perform poorly on small anomalies. To address this issue, we propose an instance-level OOD fine-tuning framework, dubbed LNOIB (Leaving No OOD Instance Behind). We start by theoretically analyzing why global-level objectives fail to segment small anomalies. Building on this analysis, we introduce a simple yet effective instance-level objective. Moreover, we propose a feature separation objective to explicitly constrain the representations of anomalies, which are prone to be smoothed by their in-distribution (ID) surroundings. LNOIB integrates these objectives to enhance the segmentation of small anomalies and serves as a paradigm adaptable to existing OOD fine-tuning strategies, without introducing additional inference cost. Experimental results show that integrating LNOIB into various OOD fine-tuning strategies yields significant improvements, particularly in component-level results, highlighting its strength in comprehensive anomaly segmentation.
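
An instance-level objective of the kind the abstract describes can be sketched by averaging a loss per OOD connected component, so small anomalies weigh as much as large ones. The loss form below is an assumption for illustration, not LNOIB's exact objective:

```python
import torch
from scipy import ndimage

def instance_level_ood_loss(anomaly_scores, ood_mask):
    """Sketch: split the OOD mask into connected components (instances) and
    average a per-instance loss, so a tiny anomaly counts as much as a large
    one. anomaly_scores: (H, W) in [0, 1]; ood_mask: (H, W) boolean."""
    labels, n = ndimage.label(ood_mask.cpu().numpy())   # instance decomposition
    losses = []
    for k in range(1, n + 1):
        inst = torch.from_numpy(labels == k).to(anomaly_scores.device)
        s = anomaly_scores[inst]
        losses.append(-torch.log(s.clamp_min(1e-6)).mean())  # raise scores per instance
    return torch.stack(losses).mean() if losses else anomaly_scores.sum() * 0.0
```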

IJCAI Conference 2025 Conference Paper

Optimized View and Geometry Distillation from Multi-view Diffuser

  • Youjia Zhang
  • Zikai Song
  • Junqing Yu
  • Yawei Luo
  • Wei Yang

Generating multi-view images from a single input view using image-conditioned diffusion models is a recent advancement and has shown considerable potential. However, issues such as the lack of consistency in synthesized views and over-smoothing in extracted geometry persist. Previous methods integrate multi-view consistency modules or impose additional supervision to enhance view consistency while compromising on the flexibility of camera positioning and limiting the versatility of view synthesis. In this study, we consider the radiance field optimized during geometry extraction as a more rigid consistency prior, compared to volume and ray aggregation used in previous works. We further identify and rectify a critical bias in the traditional radiance field optimization process through score distillation from a multi-view diffuser. We introduce an Unbiased Score Distillation (USD) that utilizes unconditioned noises from a 2D diffusion model, greatly refining the radiance field fidelity. We leverage the rendered views from the optimized radiance field as the basis and develop a two-step specialization process of a 2D diffusion model, which is adept at conducting object-specific denoising and generating high-quality multi-view images. Finally, we recover faithful geometry and texture directly from the refined multi-view images. Empirical evaluations demonstrate that our optimized geometry and view distillation technique generates comparable results to the state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning. Source code of our work is publicly available at: https://youjiazhang.github.io/USD/.

AAAI Conference 2025 Conference Paper

RP-PGD: Boosting Segmentation Robustness with a Region-and-Prototype Based Adversarial Attack

  • Yuxuan Zhang
  • Zhenbo Shi
  • Shuchang Wang
  • Wei Yang
  • Shaowei Wang
  • Yinxing Xue

Adversarial attack and defense have been extensively explored in classification tasks, but their study in semantic segmentation remains limited. Moreover, current attacks fail to act as strong underlying attacks for adversarial training (AT), making it difficult to achieve segmentation robustness against strong attacks. In this paper, we present RP-PGD, a novel Region-and-Prototype based Projected Gradient Descent attack tailored to fool segmentation models. In particular, we propose a region-based attack, which leverages a spatial-temporal scheme to separate the pixels into three disjoint regions, and highlights the attack on the crucial True Region and Boundary Region. Moreover, we introduce a prototype-based attack to disrupt the feature space, further enhancing the attack capability. To boost the robustness of segmentation models, we inject adversaries generated by RP-PGD into the clean data and perform AT. Extensive experiments on multiple datasets showcase that RP-PGD generates adversaries with faster convergence and stronger attack effectiveness, surpassing state-of-the-art attacks by a large margin. Consequently, RP-PGD serves as a strong underlying attack for segmentation models to perform AT, assisting them in defending against a variety of strong attacks without incurring additional computational costs during inference.

AAAI Conference 2025 Conference Paper

Stop Diverse OOD Attacks: Knowledge Ensemble for Reliable Defense

  • Zhenbo Shi
  • Xiaoman Liu
  • Yuxuan Zhang
  • Shuchang Wang
  • Rui Shu
  • Zhidong Yu
  • Wei Yang
  • Liusheng Huang

Enhancing defense through model ensemble is an emerging trend, where the challenge lies in how to use ensemble knowledge to counter Out-of-Distribution (OOD) attacks. In this paper, we propose the Reliable Defense Ensemble (REE) to address this issue. REE optimizes the ensemble knowledge of models through aggregation and enhances multidimensional robust performance through collaboration. It employs Dynamic Synergy Amplification for weight allocation and strategy adjustment. Furthermore, we design a new Kernel Anomaly Smoothing Detection Module, which detects anomalous attacks using a smoothing feature function based on Gaussian kernel mean embedding and a multi-layer feedback structure. In particular, we build a framework that uses reinforcement learning to iteratively fine-tune the parameters of inter-model communication and consensus. Extensive experimental results show that REE outperforms current state-of-the-art methods by a large margin in defending against OOD attacks.

NeurIPS Conference 2025 Conference Paper

Structured Spectral Reasoning for Frequency-Adaptive Multimodal Recommendation

  • Wei Yang
  • Rui Zhong
  • Yiqun Chen
  • Chi Lu
  • Peng Jiang

Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user–item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens to separate stable from noisy signals, most methods rely on static filtering or reweighting, lacking the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking with representation-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues using hyperspectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance.
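
Stage (i), decomposing graph-based signals into spectral bands, can be illustrated with the normalized graph Laplacian. The band thresholds and function below are assumptions for a small dense graph, not the paper's implementation:

```python
import numpy as np

def spectral_bands(adjacency, signal, band_edges=(0.5, 1.2)):
    """Sketch: project node features onto low/mid/high graph-frequency bands
    of the normalized Laplacian (thresholds illustrative).
    adjacency: (n, n) symmetric; signal: (n,) or (n, d) node features."""
    d = adjacency.sum(1)
    d_inv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(d)) - d_inv @ adjacency @ d_inv   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)             # graph frequencies
    coeffs = eigvecs.T @ signal                      # graph Fourier transform
    bands = []
    for lo, hi in zip((-np.inf,) + band_edges, band_edges + (np.inf,)):
        sel = (eigvals > lo) & (eigvals <= hi)
        bands.append(eigvecs[:, sel] @ coeffs[sel])  # band-limited component
    return bands                                     # [low, mid, high]
```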

AAAI Conference 2025 Conference Paper

Temporal Coherent Object Flow for Multi-Object Tracking

  • Zikai Song
  • Run Luo
  • Lintao Ma
  • Ying Tang
  • Yi-Ping Phoebe Chen
  • Junqing Yu
  • Wei Yang

Multi-object tracking is a challenging vision task that requires simultaneous reasoning about object detection and object association. Conventional solutions use the frame as the basic unit and typically rely on a motion predictor that exploits the appearance features to associate detected candidates, leading to insufficient adaptability to long-term associations. In this study, we propose a section-based multi-object tracking approach that integrates a temporal coherent Object Flow Tracker (OFTrack), capable of achieving simultaneous multi-frame tracking by treating multiple consecutive frames as the basic processing unit, denoted as a “section”. Our OFTrack boosts the optical flow to the object flow by employing object perception and section-based motion estimation strategies. Object perception adopts object-aware sampling and scale-aware correlation to enable precise target discrimination. Motion estimation models the correlation of different objects across multiple frames via specialized temporal-spatial attention to achieve robust association in very long videos. Additionally, to address the oscillation of unpredictable trajectories in multi-frame estimation, we design temporal coherence enhancements, including trajectory masking pre-training and a smoothing constraint on trajectory curves. Comprehensive experiments on several widely used benchmarks demonstrate the superior performance of our approach.

AAAI Conference 2025 Conference Paper

Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model

  • Hang Zhou
  • Jiale Cai
  • Yuteng Ye
  • Yonghui Feng
  • Chenxing Gao
  • Junqing Yu
  • Zikai Song
  • Wei Yang

A recent endeavor in one class of video anomaly detection is to leverage diffusion models and posit the task as a generation problem, where the diffusion model is trained to recover normal patterns exclusively, thus reporting abnormal patterns as outliers. Yet, existing attempts neglect the various forms of anomalies and predict normal samples at the feature level, despite the fact that abnormal objects in surveillance videos are often relatively small. To address this, a novel patch-based diffusion model is proposed, specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest themselves as deviations in both appearance and motion. Therefore, we argue that a comprehensive solution must consider both of these aspects simultaneously to achieve accurate frame prediction. To address this, we introduce innovative motion and appearance conditions that are seamlessly integrated into our patch diffusion model. These conditions are designed to guide the model in generating coherent and contextually appropriate predictions for both semantic content and motion relations. Experimental results on four challenging video anomaly detection datasets empirically substantiate the efficacy of our proposed approach, demonstrating that it consistently outperforms most existing methods in detecting abnormal behaviors.

AAAI Conference 2024 Conference Paper

AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion

  • Beibei Jing
  • Youjia Zhang
  • Zikai Song
  • Junqing Yu
  • Wei Yang

Generating realistic human motion sequences from text descriptions is a challenging task that requires capturing the rich expressiveness of both natural language and human motion. Recent advances in diffusion models have enabled significant progress in human motion synthesis. However, existing methods struggle to handle text inputs that describe complex or long motions. In this paper, we propose the Adaptable Motion Diffusion (AMD) model, which leverages a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts that correspond to the target motion. This process exploits the LLM’s ability to provide anatomical guidance for complex motion synthesis. We then devise a two-branch fusion scheme that balances the influence of the input text and the anatomical scripts on the inverse diffusion process, which adaptively ensures the semantic fidelity and diversity of the synthesized motion. Our method can effectively handle texts with complex or long motion descriptions, where existing methods often fail. Experiments on datasets with relatively more complex motions, such as CLCD1 and CLCD2, demonstrate that our AMD significantly outperforms existing state-of-the-art models.

AAAI Conference 2024 Conference Paper

Attacking Transformers with Feature Diversity Adversarial Perturbation

  • Chenxing Gao
  • Hang Zhou
  • Junqing Yu
  • Yuteng Ye
  • Jiale Cai
  • Junle Wang
  • Wei Yang

Understanding the mechanisms behind Vision Transformer (ViT), particularly its vulnerability to adversarial perturbations, is crucial for addressing challenges in its real-world applications. Existing ViT adversarial attackers rely on labels to calculate the gradient for perturbation, and exhibit low transferability to other structures and tasks. In this paper, we present a label-free white-box attack approach for ViT-based models that exhibits strong transferability to various black-box models, including most ViT variants, CNNs, and MLPs, even for models developed for other modalities. Our inspiration comes from the feature collapse phenomenon in ViTs, where the critical attention mechanism overly depends on the low-frequency component of features, causing the features in middle-to-end layers to become increasingly similar and eventually collapse. We propose the feature diversity attacker to naturally accelerate this process and achieve remarkable performance and transferability.
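
The attack direction suggested by this observation can be sketched as gradient descent on a token-diversity measure of mid-layer features, with no labels involved. The diversity loss and step sizes below are assumptions for illustration, not the paper's exact objective:

```python
import torch

def feature_diversity_attack(model_feats, x, eps=8/255, steps=10, alpha=2/255):
    """Sketch: perturb the input to REDUCE the diversity (token-wise variance)
    of mid-layer ViT features, accelerating the feature-collapse described
    above. `model_feats` maps an image batch to (tokens, dim) features."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        feats = model_feats(x + delta)           # mid-layer patch tokens
        diversity = feats.var(dim=0).mean()      # label-free diversity measure
        diversity.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend: collapse features
            delta.clamp_(-eps, eps)              # stay within the L-inf budget
            delta.grad = None
    return (x + delta).detach()
```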

AAAI Conference 2024 Conference Paper

Attacks on Continual Semantic Segmentation by Perturbing Incremental Samples

  • Zhidong Yu
  • Wei Yang
  • Xike Xie
  • Zhenbo Shi

As an essential computer vision task, Continual Semantic Segmentation (CSS) has received a lot of attention. However, security issues regarding this task have not been fully studied. To bridge this gap, we study the problem of attacks in CSS in this paper. We first propose a new task, namely, attacks on incremental samples in CSS, and reveal that the attacks on incremental samples corrupt the performance of CSS in both old and new classes. Moreover, we present an adversarial sample generation method based on class shift, namely Class Shift Attack (CS-Attack), which is an offline and easy-to-implement approach for CSS. CS-Attack is able to significantly degrade the performance of models on both old and new classes without knowledge of the incremental learning approach, which undermines the original purpose of the incremental learning, i.e., learning new classes while retaining old knowledge. Experiments show that on the popular datasets Pascal VOC, ADE20k, and Cityscapes, our approach easily degrades the performance of currently popular CSS methods, which reveals the importance of security in CSS.

NeurIPS Conference 2024 Conference Paper

Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model

  • Wenbing Li
  • Hang Zhou
  • Junqing Yu
  • Zikai Song
  • Wei Yang

The essence of multi-modal fusion lies in exploiting the complementary information inherent in diverse modalities. However, most prevalent fusion methods rely on traditional neural architectures and are inadequately equipped to capture the dynamics of interactions across modalities, particularly in the presence of complex intra- and inter-modality correlations. Recent advancements in State Space Models (SSMs), notably exemplified by the Mamba model, have emerged as promising contenders. In particular, its state evolving process implies a stronger modality fusion paradigm, making multi-modal fusion on SSMs an appealing direction. However, fusing multiple modalities is challenging for SSMs due to their hardware-aware parallelism designs. To this end, this paper proposes the Coupled SSM model for coupling state chains of multiple modalities while maintaining the independence of intra-modality state processes. Specifically, in our coupled scheme, we devise an inter-modal hidden states transition scheme, in which the current state is dependent on the states of its own chain and that of the neighbouring chains at the previous time-step. To fully comply with the hardware-aware parallelism, we obtain the global convolution kernel by deriving the state equation while introducing the historical state. Extensive experiments on CMU-MOSEI, CH-SIMS, and CH-SIMSV2 with multi-domain input verify the effectiveness of our model compared to current state-of-the-art methods, improving F1-score by 0.4%, 0.9%, and 2.3% on the three datasets respectively, with 49% faster inference and 83.7% GPU memory savings. The results demonstrate that the Coupled Mamba model is capable of enhanced multi-modal fusion.
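
The inter-modal state transition can be sketched in a few lines: each modality's state update mixes its own previous state with the neighbouring chains' states. The recurrent form below is illustrative; the paper derives a global convolution kernel to retain hardware-aware parallelism, which this sketch omits:

```python
import torch

def coupled_ssm_step(h_prev, x_t, A, B, C_couple):
    """One recurrent step of a coupled state space model (sketch).
    h_prev: (M, d) hidden states for M modality chains.
    x_t:    (M, p) current inputs per modality.
    A, B, C_couple: per-modality transition, input, and coupling matrices."""
    M = h_prev.shape[0]
    h_new = torch.empty_like(h_prev)
    for m in range(M):
        others = h_prev[[i for i in range(M) if i != m]].mean(dim=0)  # neighbours
        h_new[m] = A[m] @ h_prev[m] + C_couple[m] @ others + B[m] @ x_t[m]
    return h_new   # own-chain dynamics plus cross-chain coupling
```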

AAAI Conference 2024 Conference Paper

DiffusionTrack: Diffusion Model for Multi-Object Tracking

  • Run Luo
  • Zikai Song
  • Lintao Ma
  • Jinlin Wei
  • Wei Yang
  • Min Yang

Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods. Code is available at https://github.com/RainBowLuoCS/DiffusionTrack.

AAAI Conference 2024 Conference Paper

Dynamic Feature Pruning and Consolidation for Occluded Person Re-identification

  • Yuteng Ye
  • Hang Zhou
  • Jiale Cai
  • Chenxing Gao
  • Youjia Zhang
  • Junle Wang
  • Qiang Hu
  • Junqing Yu

Occluded person re-identification (ReID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior knowledge cues, such as human body key points and semantic segmentations, which easily fail in the presence of heavy occlusion and other humans as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework to circumvent explicit human structure parsing. The framework mainly consists of a sparse encoder, a multi-view feature matching module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens, mostly related to background noise and occluders, solely based on correlation within the class token attention. Subsequently, the matching stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors in the gallery by measuring the image and patch-level combined similarity. Finally, we use the feature consolidation module to compensate for pruned features using identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic Re-ID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset.
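
The sparse encoder's pruning rule can be illustrated directly: rank patch tokens by class-token attention and keep the top fraction. The keep ratio below is an assumption for illustration:

```python
import torch

def prune_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.7):
    """Sketch: keep the patch tokens most attended to by the class token,
    dropping tokens likely tied to background noise or occluders.
    tokens: (N, d) patch tokens; cls_attn: (N,) CLS-to-patch attention."""
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = torch.topk(cls_attn, k).indices    # most relevant tokens
    return tokens[idx], idx                  # preserved tokens for matching
```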

IJCAI Conference 2024 Conference Paper

GenSeg: On Generating Unified Adversary for Segmentation

  • Yuxuan Zhang
  • Zhenbo Shi
  • Wei Yang
  • Shuchang Wang
  • Shaowei Wang
  • Yinxing Xue

Great advancements in semantic, instance, and panoptic segmentation have been made in recent years, yet the top-performing models remain vulnerable to imperceptible adversarial perturbation. Current attacks on segmentation primarily focus on a single task, and these methods typically rely on iterative instance-specific strategies, resulting in limited attack transferability and low efficiency. In this paper, we propose GenSeg, a Generative paradigm that creates unified adversaries for Segmentation tasks. In particular, we propose an intermediate-level objective to enhance attack transferability, including a mutual agreement loss for feature deviation, and a prototype obfuscating loss to disrupt intra-class and inter-class relationships. Moreover, GenSeg crafts an adversary in a single forward pass, significantly boosting the attack efficiency. Besides, we unify multiple segmentation tasks into GenSeg in a novel category-and-mask view, which makes it possible to attack these segmentation tasks within this unified framework, and conduct cross-domain and cross-task attacks as well. Extensive experiments demonstrate the superiority of GenSeg in black-box attacks compared with state-of-the-art attacks. To the best of our knowledge, GenSeg is the first approach capable of conducting cross-domain and cross-task attacks on segmentation tasks, which are closer to real-world scenarios.

AAAI Conference 2024 Conference Paper

Progressive Text-to-Image Diffusion with Soft Latent Direction

  • Yuteng Ye
  • Jiale Cai
  • Hang Zhou
  • Guanwen Li
  • Youjia Zhang
  • Zikai Song
  • Chenxing Gao
  • Junqing Yu

In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations—namely insertion, editing, and erasing—we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards.

IJCAI Conference 2024 Conference Paper

PTDE: Personalized Training with Distilled Execution for Multi-Agent Reinforcement Learning

  • Yiqun Chen
  • Hangyu Mao
  • Jiaxin Mao
  • Shiguang Wu
  • Tianle Zhang
  • Bin Zhang
  • Wei Yang
  • Hongxing Chang

Centralized Training with Decentralized Execution (CTDE) has emerged as a widely adopted paradigm in multi-agent reinforcement learning, emphasizing the utilization of global information for learning an enhanced joint Q-function or centralized critic. In contrast, our investigation delves into harnessing global information to directly enhance individual Q-functions or individual actors. Notably, we discover that applying identical global information universally across all agents proves insufficient for optimal performance. Consequently, we advocate for the customization of global information tailored to each agent, creating agent-personalized global information to bolster overall performance. Furthermore, we introduce a novel paradigm named Personalized Training with Distilled Execution (PTDE), wherein agent-personalized global information is distilled into the agent's local information. This distilled information is then utilized during decentralized execution, resulting in minimal performance degradation. PTDE can be seamlessly integrated with state-of-the-art algorithms, leading to notable performance enhancements across diverse benchmarks, including the SMAC benchmark, Google Research Football (GRF) benchmark, and Learning to Rank (LTR) task.
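
The distillation step can be sketched as a simple regression from local information onto the personalized global information. Module names below are hypothetical stand-ins for the paper's networks:

```python
import torch.nn.functional as F

def ptde_distill_loss(global_info_net, local_net, state, agent_obs, agent_id):
    """Sketch: distill agent-personalized global information into a student
    network that sees only the agent's local observation, so the global
    signal is no longer needed at decentralized execution time."""
    target = global_info_net(state, agent_id).detach()  # personalized global info
    pred = local_net(agent_obs, agent_id)               # local-only student
    return F.mse_loss(pred, target)                     # training-time objective
```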

AAAI Conference 2024 Conference Paper

TIKP: Text-to-Image Knowledge Preservation for Continual Semantic Segmentation

  • Zhidong Yu
  • Wei Yang
  • Xike Xie
  • Zhenbo Shi

Continual Semantic Segmentation (CSS) is an emerging trend, where catastrophic forgetting has been a perplexing problem. In this paper, we propose a Text-to-Image Knowledge Preservation (TIKP) framework to address this issue. TIKP applies Text-to-Image techniques to CSS by automatically generating prompts and content adaptation. It extracts associations between the labels of seen data and constructs text-level prompts based on these associations, which are preserved and maintained at each incremental step. During training, these prompts generate correlated images to mitigate the catastrophic forgetting. Particularly, as the generated images may have different distributions from the original data, TIKP transfers the knowledge by a content adaption loss, which determines the role played by the generated images in incremental training based on the similarity. In addition, for the classifier, we use the previous model from a different perspective: misclassifying new classes into old objects instead of the background. We propose a knowledge distillation loss based on wrong labels, enabling us to attribute varying weights to individual objects during the distillation process. Extensive experiments conducted in the same setting show that TIKP outperforms state-of-the-art methods by a large margin on benchmark datasets.

AAAI Conference 2024 Conference Paper

Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

  • Zhenyu Xie
  • Yang Wu
  • Xuehao Gao
  • Zhongqian Sun
  • Wei Yang
  • Xiaodan Liang

Text-guided motion synthesis aims to generate 3D human motion that not only precisely reflects the textual description but also reveals motion details as much as possible. Pioneering methods explore the diffusion model for text-to-motion synthesis and obtain significant superiority. However, these methods conduct diffusion processes either on the raw data distribution or in a low-dimensional latent space, which typically suffer from modality inconsistency or detail scarcity. To tackle this problem, we propose a novel Basic-to-Advanced Hierarchical Diffusion Model, named B2A-HDM, to collaboratively exploit low-dimensional and high-dimensional diffusion models for high-quality, detailed motion synthesis. Specifically, the basic diffusion model in the low-dimensional latent space provides an intermediate denoising result that is consistent with the textual description, while the advanced diffusion model in the high-dimensional latent space focuses on the subsequent detail-enhancing denoising process. Besides, we introduce a multi-denoiser framework for the advanced diffusion model to ease the learning of the high-dimensional model and fully explore the generative potential of the diffusion model. Quantitative and qualitative experiment results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM can outperform existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity.

NeurIPS Conference 2023 Conference Paper

A Robust and Opponent-Aware League Training Method for StarCraft II

  • Ruozi Huang
  • Xipeng Wu
  • Hongsheng Yu
  • Zhong Fan
  • Haobo Fu
  • Qiang Fu
  • Wei Yang

It is extremely difficult to train a superhuman Artificial Intelligence (AI) for games of similar size to StarCraft II. AlphaStar is the first AI that beat human professionals in the full game of StarCraft II, using a league training framework that is inspired by a game-theoretic approach. In this paper, we improve AlphaStar's league training in two significant aspects. We train goal-conditioned exploiters, whose abilities of spotting weaknesses in the main agent and the entire league are greatly improved compared to the unconditioned exploiters in AlphaStar. In addition, we endow the agents in the league with the new ability of opponent modeling, which makes the agent more responsive to the opponent's real-time strategy. Based on these improvements, we train a better and superhuman AI with orders of magnitude fewer resources than AlphaStar (see Table 1 for a full comparison). Considering the iconic role of StarCraft II in game AI research, we believe our method and results on StarCraft II provide valuable design principles on how one would utilize the general league training framework for obtaining a least-exploitable strategy in various large-scale, real-world games.

NeurIPS Conference 2023 Conference Paper

Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs

  • Peng Jin
  • Yang Wu
  • Yanbo Fan
  • Zhongqian Sun
  • Wei Yang
  • Li Yuan

Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-trained weights are available at https://github.com/jpthu17/GraphMotion.

AAAI Conference 2023 Conference Paper

Compact Transformer Tracker with Correlative Masked Modeling

  • Zikai Song
  • Run Luo
  • Junqing Yu
  • Yi-Ping Phoebe Chen
  • Wei Yang

The Transformer framework has shown superior performance in visual object tracking for its great strength in information aggregation across the template and search image with the well-known attention mechanism. Most recent advances focus on exploring attention mechanism variants for better information aggregation. We find these schemes are equivalent to or even just a subset of the basic self-attention mechanism. In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation, and structural adaption is unnecessary. The key is not the attention structure, but how to extract the discriminative feature for tracking and enhance the communication between the target and search image. Based on this finding, we adopt the basic vision transformer (ViT) architecture as our main tracker and concatenate the template and search image for feature embedding. To guide the encoder to capture the invariant feature for tracking, we attach a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens. The correlative masked decoder serves as a plugin for the compact transformer tracker and is skipped in inference. Our compact tracker uses the simplest structure, consisting only of a ViT backbone and a box head, and can run at 40 fps. Extensive experiments show the proposed compact transformer tracker outperforms existing approaches, including advanced attention variants, and demonstrates the sufficiency of self-attention in tracking tasks. Our method achieves state-of-the-art performance on five challenging benchmarks: VOT2020, UAV123, LaSOT, TrackingNet, and GOT-10k. Our project is available at https://github.com/HUSTDML/CTTrack.

JBHI Journal 2023 Journal Article

Coupled Contour Regression for Efficient Delineation of Lumen and External Elastic Lamina in Intravascular Ultrasound Images

  • Yuan Yang
  • Wei Yu
  • Haiyan Du
  • Li Ling
  • Qianjin Feng
  • Shengxian Tu
  • Wei Yang

Automatic delineation of the lumen and vessel contours in intravascular ultrasound (IVUS) images is crucial for the subsequent IVUS-based analysis. Existing methods usually address this task through mask-based segmentation, which cannot effectively handle the anatomical plausibility of the lumen and external elastic lamina (EEL) contours and thus limits their performance. In this article, we propose a contour encoding based method called coupled contour regression network (CCRNet) to directly predict the lumen and EEL contour pairs. The lumen and EEL contours are resampled, coupled, and embedded into a low-dimensional space to learn a compact contour representation. Then, we employ a convolutional network backbone to predict the coupled contour signatures and reconstruct the signatures to the object contours by a linear decoder. Assisted by the implicit anatomical prior of the paired lumen and EEL contours in the signature space and contour decoder, CCRNet has the potential to avoid producing unreasonable results. We evaluated our proposed method on a large IVUS dataset consisting of 7204 cross-sectional frames from 185 pullbacks. The CCRNet can rapidly extract the contours at 100 fps. Without any post-processing, all produced contours are anatomically reasonable in the 19 test pullbacks. The mean Dice similarity coefficients of our CCRNet for the lumen and EEL are 0.940 and 0.958, which are comparable to the mask-based models. In terms of the contour metric Hausdorff distance, our CCRNet achieves 0.258 mm for the lumen and 0.268 mm for the EEL, which outperforms the mask-based models.
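
The signature-space construction can be approximated with a PCA-style linear embedding of concatenated lumen and EEL contour vectors; the backbone then regresses signatures and the linear decoder maps them back to coupled contours. A sketch, with the dimensionality chosen arbitrarily:

```python
import numpy as np

def fit_contour_signature_space(paired_contours, n_dims=24):
    """Sketch: paired_contours is (n_samples, D), each row the concatenated
    lumen+EEL contour point coordinates. PCA stands in for the paper's
    low-dimensional coupled embedding; the decoder is purely linear."""
    mean = paired_contours.mean(axis=0)
    X = paired_contours - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:n_dims]                     # rows span the signature space

    def encode(contours):                   # contours -> compact signatures
        return (contours - mean) @ basis.T

    def decode(signatures):                 # linear decoder back to contour pairs
        return signatures @ basis + mean

    return encode, decode
```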

AAAI Conference 2023 Conference Paper

Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection

  • Hang Zhou
  • Junqing Yu
  • Wei Yang

Learning discriminative features for effectively separating abnormal events from normality is crucial for weakly supervised video anomaly detection (WS-VAD) tasks. Existing approaches, both video and segment-level label oriented, mainly focus on extracting representations for anomaly data while neglecting the implication of normal data. We observe that such a scheme is sub-optimal, i.e., to better distinguish anomalies one needs to understand what constitutes a normal state, and such approaches may yield a higher false alarm rate. To address this issue, we propose an Uncertainty Regulated Dual Memory Units (UR-DMU) model to learn both the representations of normal data and discriminative features of abnormal data. To be specific, inspired by the traditional global and local structure on graph convolutional networks, we introduce a Global and Local Multi-Head Self Attention (GL-MHSA) module for the Transformer network to obtain more expressive embeddings for capturing associations in videos. Then, we use two memory banks, one additional abnormal memory for tackling hard samples, to store and separate abnormal and normal prototypes and maximize the margins between the two representations. Finally, we propose an uncertainty learning scheme to learn the normal data latent space, which is robust to noise from camera switching, object changing, scene transforming, etc. Extensive experiments on XD-Violence and UCF-Crime datasets demonstrate that our method outperforms the state-of-the-art methods by a sizable margin.

IJCAI Conference 2023 Conference Paper

FGNet: Towards Filling the Intra-class and Inter-class Gaps for Few-shot Segmentation

  • Yuxuan Zhang
  • Wei Yang
  • Shaowei Wang

Current few-shot segmentation (FSS) approaches have made tremendous achievements based on prototypical learning techniques. However, due to the scarcity of the support data provided, FSS methods still suffer from the intra-class and inter-class gaps. In this paper, we propose a uniform network to fill both the gaps, termed FGNet. It consists of the novel design of a Self-Adaptive Module (SAM) to emphasize the query feature to generate an enhanced prototype for self-alignment. Such a prototype caters to each query sample itself since it contains the underlying intra-instance information, which gets around the intra-class appearance gap. Moreover, we design an Inter-class Feature Separation Module (IFSM) to separate the feature space of the target class from other classes, which contributes to bridging the inter-class gap. In addition, we present several new losses and a method termed B-SLIC, which help to further enhance the separation performance of FGNet. Experimental results show that FGNet reduces both the gaps for FSS by SAM and IFSM respectively, and achieves state-of-the-art performances on both PASCAL-5i and COCO-20i datasets compared with previous top-performing approaches.

NeurIPS Conference 2023 Conference Paper

Hokoff: Real Game Dataset from Honor of Kings and its Offline Reinforcement Learning Benchmarks

  • Yun Qu
  • Boyuan Wang
  • Jianzhun Shao
  • Yuhang Jiang
  • Chen Chen
  • Zhenbin Ye
  • Liu Linc
  • Yang Feng

The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications. However, existing datasets often fall short, being overly simplistic and lacking realism. To address this gap, we propose Hokoff, a comprehensive set of pre-collected datasets that covers both offline RL and offline MARL, accompanied by a robust framework, to facilitate further research. This data is derived from Honor of Kings, a recognized Multiplayer Online Battle Arena (MOBA) game known for its intricate nature, closely resembling real-life situations. Utilizing this framework, we benchmark a variety of offline RL and offline MARL algorithms. We also introduce a novel baseline algorithm tailored for the inherent hierarchical action space of the game. We reveal the incompetency of current offline RL approaches in handling task complexity, generalization, and multi-task learning.

AAAI Conference 2023 Conference Paper

Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination

  • Rui Zhao
  • Jinming Song
  • Yufeng Yuan
  • Haifeng Hu
  • Yang Gao
  • Yi Wu
  • Zhongqian Sun
  • Wei Yang

We study the problem of training a Reinforcement Learning (RL) agent that is collaborative with humans without using human data. Although such agents can be obtained through self-play training, they can suffer significantly from distributional shift when paired with unencountered partners, such as humans. In this paper, we propose Maximum Entropy Population-based training (MEP) to mitigate such distributional shift. In MEP, agents in the population are trained with our derived Population Entropy bonus to promote the pairwise diversity between agents and the individual diversity of agents themselves. After obtaining this diversified population, a common best agent is trained by pairing with agents in this population via prioritized sampling, where the prioritization is dynamically adjusted based on the training progress. We demonstrate the effectiveness of our method MEP, in comparison to Self-Play PPO (SP), Population-Based Training (PBT), Trajectory Diversity (TrajeDi), and Fictitious Co-Play (FCP) in both matrix game and Overcooked game environments, with partners being human proxy models and real humans. A supplementary video showing experimental results is available at https://youtu.be/Xh-FKD0AAKE.
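
The population entropy idea can be sketched as the entropy of the population's mean policy at a state; maximizing it encourages both pairwise and individual diversity. This is an illustrative form, not the paper's derived bonus:

```python
import torch

def population_entropy_bonus(action_probs):
    """Sketch: entropy of the mean policy across a population at one state.
    action_probs: (n_agents, n_actions) per-agent action distributions.
    A high value means the agents, taken together, cover diverse actions."""
    mean_policy = action_probs.mean(dim=0)   # population's average policy
    return -(mean_policy * mean_policy.clamp_min(1e-12).log()).sum()
```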

NeurIPS Conference 2023 Conference Paper

Policy Space Diversity for Non-Transitive Games

  • Jian Yao
  • Weiming Liu
  • Haobo Fu
  • Yaodong Yang
  • Stephen McAleer
  • Qiang Fu
  • Wei Yang

Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have been trying to promote policy diversity in PSRO. A major weakness with existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we proved in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving of PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on single-state games, Leduc, and Goofspiel demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.

AAAI Conference 2023 Conference Paper

RLogist: Fast Observation Strategy on Whole-Slide Images with Deep Reinforcement Learning

  • Boxuan Zhao
  • Jun Zhang
  • Deheng Ye
  • Jian Cao
  • Xiao Han
  • Qiang Fu
  • Wei Yang

Whole-slide images (WSI) in computational pathology have high resolution with gigapixel size, but generally have sparse regions of interest, which leads to weak diagnostic relevance and data inefficiency for each area in the slide. Most of the existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification. The limitation is evident in the application stage as the heavy computation for extracting patch-level features is inevitable. In this paper, we develop RLogist, a benchmarking deep reinforcement learning (DRL) method for fast observation strategy on WSIs. Imitating the diagnostic logic of human pathologists, our RL agent learns how to find regions of observation value and obtain representative features across multiple resolution levels, without having to analyze each part of the WSI at high magnification. We benchmark our method on two whole-slide level classification tasks, including detection of metastases in WSIs of lymph node sections, and subtyping of lung cancer. Experimental results demonstrate that RLogist achieves competitive classification performance compared to typical multiple instance learning algorithms, while having a significantly shorter observation path. In addition, the observation path given by RLogist provides good decision-making interpretability, and its reading-path navigation ability can potentially be used by pathologists for educational/assistive purposes. Our code is available at: https://github.com/tencent-ailab/RLogist.

ICLR Conference 2023 Conference Paper

SYNC: Safety-Aware Neural Control for Stabilizing Stochastic Delay-Differential Equations

  • Jingdong Zhang 0001
  • Qunxi Zhu
  • Wei Yang
  • Wei Lin 0003

Stabilization of systems described by stochastic delay-differential equations (SDDEs) under preset conditions is a challenging task in the control community. Here, to achieve this task, we leverage neural networks to learn control policies using the information of the controlled systems in some prescribed regions. Specifically, two learned control policies, i.e., the neural deterministic controller (NDC) and the neural stochastic controller (NSC), work effectively in the learning procedures that rely on, respectively, the well-known LaSalle-type theorem and the newly-established theorem for guaranteeing stochastic stability in SDDEs. We theoretically investigate the performance of the proposed controllers in terms of convergence time and energy cost. More practically and significantly, we improve our learned control policies by considering the situation where the controlled trajectories only evolve in some specific safety set. The practical validity of such control policies restricted to the safety set is attributed to the theory that we further develop for safety and stability guarantees in SDDEs using the stochastic control barrier function and spatial discretization. We call this control SYNC (SafetY-aware Neural Control). The efficacy of all the articulated control policies, including SYNC, is demonstrated systematically by using representative control problems.

NeurIPS Conference 2022 Conference Paper

Honor of Kings Arena: an Environment for Generalization in Competitive Reinforcement Learning

  • Hua Wei
  • Jingxiao Chen
  • Xiyang Ji
  • Hongyang Qin
  • Minwen Deng
  • Siqin Li
  • Liang Wang
  • Weinan Zhang

This paper introduces Honor of Kings Arena, a reinforcement learning (RL) environment based on Honor of Kings, one of the world’s most popular games at present. Compared to other environments studied in most previous work, ours presents new generalization challenges for competitive reinforcement learning. It is a multi-agent problem with one agent competing against its opponent, and it requires generalization ability as it has diverse targets to control and diverse opponents to compete with. We describe the observation, action, and reward specifications for the Honor of Kings domain and provide an open-source Python-based interface for communicating with the game engine. We provide twenty target heroes with a variety of tasks in Honor of Kings Arena and present initial baseline results for RL-based methods with feasible computing resources. Finally, we showcase the generalization challenges imposed by Honor of Kings Arena and possible remedies to the challenges. All of the software, including the environment class, is publicly available.

IJCAI Conference 2022 Conference Paper

JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

  • Zichuan Lin
  • Junyou Li
  • Jianing Shi
  • Deheng Ye
  • Qiang Fu
  • Wei Yang

Learning rational behaviors in open-world games like Minecraft remains challenging for Reinforcement Learning (RL) research due to the compound challenge of partial observability, high-dimensional visual perception, and delayed reward. To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration. Specifically, our approach includes two levels of hierarchy, where the high-level controller learns a policy over options and the low-level workers learn to solve each sub-task. To boost the learning of sub-tasks, we propose a combination of techniques including 1) action-aware representation learning which captures underlying relations between action and representation, 2) discriminator-based self-imitation learning for efficient exploration, and 3) ensemble behavior cloning with consistency filtering for policy robustness. Extensive experiments show that JueWu-MC significantly improves sample efficiency and outperforms a set of baselines by a large margin. Notably, we won the championship of the NeurIPS MineRL 2021 research competition and achieved the highest performance score ever.

IJCAI Conference 2022 Conference Paper

Learn to Reverse DNNs from AI Programs Automatically

  • Simin Chen
  • Hamed Khanpour
  • Cong Liu
  • Wei Yang

With the private deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method that can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching for the most similar function in our database, NNReverse infers the layer type of a given function’s binary code. To represent assembly instruction semantics precisely, NNReverse proposes a more fine-grained embedding model to represent the textual and structural semantics of assembly functions.
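
The retrieval step described above can be pictured as nearest-neighbor search over function embeddings; the database, embedding vectors, and layer names below are toy stand-ins, not NNReverse's actual model.

```python
# Label a binary function with the layer type of its nearest neighbor
# (by cosine similarity) in a database of known-layer embeddings.
import numpy as np

database = {                     # embedding -> known layer type (illustrative)
    "conv2d":  np.array([0.9, 0.1, 0.0]),
    "dense":   np.array([0.1, 0.9, 0.2]),
    "softmax": np.array([0.0, 0.2, 0.9]),
}

def infer_layer_type(query_embedding):
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(database, key=lambda name: cos(query_embedding, database[name]))

print(infer_layer_type(np.array([0.8, 0.2, 0.1])))   # -> conv2d
```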

NeurIPS Conference 2022 Conference Paper

SCL-WC: Cross-Slide Contrastive Learning for Weakly-Supervised Whole-Slide Image Classification

  • Xiyue Wang
  • Jinxi Xiang
  • Jun Zhang
  • Sen Yang
  • Zhongyi Yang
  • Ming-Hui Wang
  • Jing Zhang
  • Wei Yang

Weakly-supervised whole-slide image (WSI) classification (WSWC) is a challenging task where a large number of unlabeled patches (instances) exist within each WSI (bag) while only a slide label is given. Despite recent progress in multiple instance learning (MIL)-based WSI analysis, the major limitation is that it usually focuses on the easy-to-distinguish diagnosis-positive regions while ignoring positives that occupy a small ratio of the entire WSI. To obtain more discriminative features, we propose a novel weakly-supervised classification method based on cross-slide contrastive learning (called SCL-WC), which depends on task-agnostic self-supervised feature pre-extraction and task-specific weakly-supervised feature refinement and aggregation for WSI-level prediction. To enable both intra-WSI and inter-WSI information interaction, we propose a positive-negative-aware module (PNM) and a weakly-supervised cross-slide contrastive learning (WSCL) module, respectively. The WSCL aims to pull WSIs with the same disease types closer and push different WSIs away. The PNM aims to facilitate the separation of tumor-like patches and normal ones within each WSI. Extensive experiments demonstrate state-of-the-art performance of our method in three different classification tasks (e.g., over 2% AUC on Camelyon16, 5% F1 score on BRACS, and 3% AUC on DiagSet). Our method also shows superior flexibility and scalability in weakly-supervised localization and semi-supervised classification experiments (e.g., first place in the BRIGHT challenge). Our code will be available at https://github.com/Xiyue-Wang/SCL-WC.
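
The WSCL module's pull-together/push-apart objective resembles a supervised contrastive loss over WSI-level features; the sketch below shows that generic form. The temperature, feature sizes, and exact loss are assumptions, not the authors' formulation.

```python
# Generic supervised contrastive loss over WSI-level features: same-label
# slides are positives, different labels are negatives.
import torch
import torch.nn.functional as F

def wsi_contrastive_loss(features, labels, temperature=0.1):
    z = F.normalize(features, dim=1)                 # (B, D) WSI features
    sim = z @ z.T / temperature                      # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    mask.fill_diagonal_(False)                       # exclude self-pairs
    logits = sim - torch.eye(len(z)) * 1e9           # mask self in softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob[mask]).mean()  # assumes every slide has >=1 positive

loss = wsi_contrastive_loss(torch.randn(8, 128), torch.randint(0, 2, (8,)))
```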

AAAI Conference 2022 Conference Paper

Shape Prior Guided Attack: Sparser Perturbations on 3D Point Clouds

  • Zhenbo Shi
  • Zhi Chen
  • Zhenbo Xu
  • Wei Yang
  • Zhidong Yu
  • Liusheng Huang

Deep neural networks are extremely vulnerable to malicious input data. As 3D data is increasingly used in vision tasks such as robotics, autonomous driving, and drones, the robustness of classification models for 3D point clouds has received widespread attention. In this paper, we propose a novel method named SPGA (Shape Prior Guided Attack) to generate adversarial point cloud examples. We use shape prior information to make perturbations sparser and thus achieve imperceptible attacks. In particular, we propose a Spatially Logical Block (SLB) to apply adversarial points by sliding them within the oriented bounding box. Moreover, we design an algorithm called FOFA for this task, which further refines the adversarial attack by breaking the complicated problem down into sub-problems. Compared with global-perturbation methods, our attack consumes significantly less computation, making it more efficient. Most importantly, SPGA can generate examples with a higher attack success rate (even against defenses), a smaller perturbation budget, and stronger transferability.
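
A generic skeleton of a sparse, box-constrained point-cloud perturbation, for intuition only: SPGA's shape priors, the SLB, and FOFA are not reproduced here, and the toy classifier and hyperparameters are assumptions.

```python
# Move only the k points with the largest gradients, then clamp the result
# into a small box around the original cloud (sparsity + bounded change).
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparse_attack_step(points, model, label, k=10, step=0.01, box=0.05):
    pts = points.clone().requires_grad_(True)
    loss = F.cross_entropy(model(pts.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    scores = pts.grad.norm(dim=1)                  # per-point gradient size
    idx = scores.topk(k).indices                   # perturb only k points
    delta = torch.zeros_like(points)
    delta[idx] = step * pts.grad[idx].sign()
    moved = points + delta
    return torch.min(torch.max(moved, points - box), points + box)

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 3, 4))   # toy classifier
adv = sparse_attack_step(torch.randn(64, 3), model, torch.tensor(0))
```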

IJCAI Conference 2021 Conference Paper

Boosting Offline Reinforcement Learning with Residual Generative Modeling

  • Hua Wei
  • Deheng Ye
  • Zhao Liu
  • Hao Wu
  • Bo Yuan
  • Qiang Fu
  • Wei Yang
  • Zhenhui Li

Offline reinforcement learning (RL) tries to learn a near-optimal policy from recorded offline experience without online exploration. Current offline RL research includes: 1) generative modeling, i.e., approximating a policy using fixed data; and 2) learning the state-action value function. While most research focuses on the value-function part by reducing the bootstrapping error in value function approximation induced by the distribution shift of the training data, the effects of error propagation in generative modeling have been neglected. In this paper, we analyze the error in generative modeling. We propose AQL (action-conditioned Q-learning), a residual generative model that reduces policy approximation error for offline RL. We show that our method can learn more accurate policy approximations on different benchmark datasets. In addition, we show that the proposed offline RL method can learn more competitive AI agents in complex control tasks in the multiplayer online battle arena (MOBA) game Honor of Kings.
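
One way to picture a "residual generative model" is a base policy network plus a residual head that corrects its output; the architecture below is an illustrative assumption, not the AQL implementation.

```python
# Base network imitates the dataset action; a residual head, conditioned on
# the state and the base action, corrects the approximation error.
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, a_dim))   # generative model
        self.res = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, a_dim))    # residual correction
    def forward(self, s):
        a0 = self.base(s)
        return a0 + self.res(torch.cat([s, a0], dim=-1))

pi = ResidualPolicy(s_dim=8, a_dim=2)
print(pi(torch.randn(4, 8)).shape)        # torch.Size([4, 2])
```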

IJCAI Conference 2021 Conference Paper

Hiding Numerical Vectors in Local Private and Shuffled Messages

  • Shaowei Wang
  • Jin Li
  • Yuqiu Qian
  • Jiachun Du
  • Wenqing Lin
  • Wei Yang

Numerical vector aggregation has numerous applications in privacy-sensitive scenarios, such as distributed gradient estimation in federated learning and statistical analysis on key-value data. Within the framework of local differential privacy, this work gives tight minimax error bounds of $O(ds/(n\epsilon^2))$, where $d$ is the dimension of the numerical vector and $s$ is the number of non-zero entries. An attainable mechanism is then designed to improve on existing approaches, which suffer error rates of $O(d^2/(n\epsilon^2))$ or $O(ds^2/(n\epsilon^2))$. To break the error barrier of local privacy, this work further considers privacy amplification in the shuffle model with anonymous channels, and shows the mechanism satisfies centralized $(\sqrt{14\ln(2/\delta)\,(se^{\epsilon}+2s-1)/(n-1)},\ \delta)$-differential privacy, which is domain independent and thus scales to federated learning of large models. We experimentally validate the mechanism, compare it with existing approaches, and demonstrate its significant error reduction.
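
The amplified guarantee quoted above can be evaluated directly; here is a small helper (the parameter values in the example are arbitrary).

```python
# eps_c = sqrt(14 * ln(2/delta) * (s * e^eps + 2s - 1) / (n - 1))
import math

def central_epsilon(local_eps, s, n, delta):
    return math.sqrt(14 * math.log(2 / delta)
                     * (s * math.exp(local_eps) + 2 * s - 1) / (n - 1))

# e.g., 100k users, 10 non-zero entries, local eps = 1:
print(round(central_epsilon(1.0, s=10, n=100_000, delta=1e-6), 3))  # ~0.31 < 1
```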

NeurIPS Conference 2021 Conference Paper

Learning Diverse Policies in MOBA Games via Macro-Goals

  • Yiming Gao
  • Bei Shi
  • Xueying Du
  • Liang Wang
  • Guangwei Chen
  • Zhenjie Lian
  • Fuhao Qiu
  • Guoan Han

Recently, many researchers have made successful progress in building AI systems for MOBA game playing with deep reinforcement learning, such as on Dota 2 and Honor of Kings. Even though these AI systems have achieved or even exceeded human-level performance, they still suffer from a lack of policy diversity. In this paper, we propose a novel Macro-Goals Guided framework, called MGG, to learn diverse policies in MOBA games. MGG abstracts strategies as macro-goals from human demonstrations and trains a Meta-Controller to predict these macro-goals. To enhance policy diversity, MGG samples macro-goals from the Meta-Controller prediction and guides the training process towards these goals. Experimental results on the typical MOBA game Honor of Kings demonstrate that MGG can execute diverse policies in different matches and lineups, and also outperforms the state-of-the-art methods over 102 heroes.
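
A rough sketch of macro-goal guidance, for intuition only: sample a macro-goal from the Meta-Controller's predicted distribution and shape the reward toward attaining it. The goal set, distribution, and bonus below are assumptions, not the MGG implementation.

```python
# Sample a macro-goal, then add an intrinsic bonus when the agent attains it.
import torch

def sample_macro_goal(meta_logits):
    return torch.distributions.Categorical(logits=meta_logits).sample()

def shaped_reward(env_reward, attained_goal, macro_goal, bonus=0.1):
    return env_reward + bonus * float(attained_goal == macro_goal)

meta_logits = torch.tensor([1.2, 0.3, -0.5])   # e.g., push-lane / jungle / defend
g = sample_macro_goal(meta_logits)
print(g.item(), shaped_reward(0.0, attained_goal=g.item(), macro_goal=g.item()))
```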

IJCAI Conference 2021 Conference Paper

MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks

  • Menghui Zhu
  • Minghuan Liu
  • Jian Shen
  • Zhicheng Zhang
  • Sheng Chen
  • Weinan Zhang
  • Deheng Ye
  • Yong Yu

In goal-oriented reinforcement learning, relabeling the raw goals in past experience to provide agents with hindsight ability is a major solution to the reward sparsity problem. In this paper, to enhance the diversity of relabeled goals, we develop FGI (Foresight Goal Inference), a new relabeling strategy that relabels goals by looking into the future with a learned dynamics model. Besides, to improve sample efficiency, we propose to use the dynamics model to generate simulated trajectories for policy training. By integrating these two improvements, we introduce the MapGo framework (Model-Assisted Policy optimization for Goal-oriented tasks). In our experiments, we first show the effectiveness of the FGI strategy compared with the hindsight one, and then show that the MapGo framework achieves higher sample efficiency when compared to model-free baselines on a set of complicated tasks.
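
The FGI idea can be sketched as rolling a learned dynamics model forward and relabeling with the predicted future achieved goal; the dynamics model, goal extractor, and horizon below are toy stand-ins.

```python
# Imagine the future with a learned model, then relabel the transition's goal
# with the goal the agent is predicted to achieve.
import numpy as np

def learned_dynamics(state, action):
    return state + 0.1 * action                  # stand-in for the model

def achieved_goal(state):
    return state                                 # toy goal extractor

def foresight_relabel(state, policy, horizon=5):
    s = state
    for _ in range(horizon):                     # roll the model forward
        s = learned_dynamics(s, policy(s))
    return achieved_goal(s)                      # relabeled goal

goal = foresight_relabel(np.zeros(2), policy=lambda s: np.ones(2))
print(goal)                                      # [0.5 0.5]
```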

JBHI Journal 2021 Journal Article

Quantifying Axial Spine Images Using Object-Specific Bi-Path Network

  • Liyan Lin
  • Xi Tao
  • Wei Yang
  • Shumao Pang
  • Zhihai Su
  • Hai Lu
  • Shuo Li
  • Qianjin Feng

Automatic estimation of indices from medical images is the main goal of computer-aided quantification (CADq), which speeds up diagnosis and lightens the workload of radiologists. Deep learning techniques are a good choice for implementing CADq. Usually, to acquire high-accuracy quantification, a specific network architecture needs to be designed for a given CADq task. In this study, considering that the target organs are the intervertebral disc and the dural sac, we propose an object-specific bi-path network (OSBP-Net) for axial spine image quantification. Each path of the OSBP-Net comprises a shallow feature extraction layer (SFE) and a deep feature extraction sub-network (DFE). The SFEs use different convolution strides because the two target organs have different anatomical sizes. The DFEs use average pooling for downsampling based on the observation that the target organs have lower intensity than the background. In addition, an inter-path dissimilarity constraint is proposed and applied to the output of the SFEs, taking into account that the activated regions in the feature maps of the two paths should theoretically be different. An inter-index correlation regularization is introduced and applied to the output of the DFEs based on the observation that the diameter and area of the same object have an approximately linear relation. The prediction results of OSBP-Net are compared to several state-of-the-art machine learning-based CADq methods. The comparison reveals that the proposed method substantially outperforms the competing methods, indicating its great potential for spine CADq.
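
A minimal sketch of the bi-path layout described above, with illustrative channel sizes and depths; which stride belongs to which organ, and the two-index output head, are assumptions.

```python
# Two paths share a layout but use different first-conv strides (SFE) to
# match organ sizes, with average pooling for downsampling (DFE).
import torch
import torch.nn as nn

def make_path(first_stride):
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, stride=first_stride, padding=1),  # SFE
        nn.ReLU(),
        nn.AvgPool2d(2),                                      # DFE downsampling
        nn.Conv2d(16, 32, 3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, 2),            # e.g., diameter and area indices
    )

path_a = make_path(first_stride=1)   # stride matched to one organ's size
path_b = make_path(first_stride=2)   # coarser stride for the other organ
x = torch.randn(1, 1, 128, 128)
print(path_a(x).shape, path_b(x).shape)
```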

JBHI Journal 2020 Journal Article

Flexible Prediction of CT Images From MRI Data Through Improved Neighborhood Anchored Regression for PET Attenuation Correction

  • Liming Zhong
  • Yanlin Chen
  • Xiao Zhang
  • Shupeng Liu
  • Yuankui Wu
  • Yunbi Liu
  • Liyan Lin
  • Qianjin Feng

Given the complicated relationship between magnetic resonance imaging (MRI) signals and attenuation values, attenuation correction in hybrid positron emission tomography (PET)/MRI systems remains a challenging task. Currently, existing methods are either time-consuming or require sufficient samples to train the models. In this paper, an efficient approach for predicting pseudo computed tomography (CT) images from T1- and T2-weighted MRI data with limited data is proposed. The proposed approach uses improved neighborhood anchored regression (INAR) as a baseline method to pre-calculate projection matrices to flexibly predict the pseudo CT patches. Techniques including augmentation of the MR/CT dataset, learning of nonlinear descriptors of MR images, hierarchical search for nearest neighbors, data-driven optimization, and a multi-regressor ensemble are adopted to improve the effectiveness of the proposed approach. In total, 22 healthy subjects were enrolled in the study. The pseudo CT images obtained using INAR with the multi-regressor ensemble yielded a mean absolute error (MAE) of 92.73 $\pm$ 14.86 HU, peak signal-to-noise ratio of 29.77 $\pm$ 1.63 dB, Pearson linear correlation coefficient of 0.82 $\pm$ 0.05, dice similarity coefficient of 0.81 $\pm$ 0.03, and a relative mean absolute error (rMAE) in PET attenuation correction of 1.30 $\pm$ 0.20% compared with true CT images. Moreover, our proposed INAR method, without any refinement strategies, can achieve considerable results with only seven subjects (MAE 106.89 $\pm$ 14.43 HU, rMAE 1.51 $\pm$ 0.21%). The experiments demonstrate the superior performance of the proposed method over six competing methods. Moreover, the proposed method can rapidly generate pseudo CT images that are suitable for PET attenuation correction.
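
Neighborhood anchored regression can be sketched as precomputing one ridge projection matrix per anchor offline, so that prediction reduces to an anchor lookup plus a matrix product; the data, anchor count, and regularizer below are toy assumptions, not the INAR pipeline.

```python
# Offline: per-anchor ridge solve over nearest neighbors.
# Online: find the nearest anchor, apply its precomputed projection.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))               # MR patch features (toy)
Y = X @ rng.normal(size=(16, 4))             # matching CT patch values (toy)
anchors = X[rng.choice(len(X), 8, replace=False)]

projections = []
for a in anchors:
    idx = np.argsort(((X - a) ** 2).sum(1))[:64]      # anchor's neighborhood
    Xa, Ya = X[idx], Y[idx]
    P = np.linalg.solve(Xa.T @ Xa + 0.1 * np.eye(16), Xa.T @ Ya)
    projections.append(P)

def predict(x):                              # cheap at test time
    k = int(np.argmin(((anchors - x) ** 2).sum(1)))
    return x @ projections[k]

print(predict(X[0]).shape)                   # (4,)
```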

AAAI Conference 2020 Conference Paper

Mastering Complex Control in MOBA Games with Deep Reinforcement Learning

  • Deheng Ye
  • Zhao Liu
  • Mingfei Sun
  • Bei Shi
  • Peilin Zhao
  • Hao Wu
  • Hongsheng Yu
  • Shaojie Yang

We study the reinforcement learning problem of complex action control in Multi-player Online Battle Arena (MOBA) 1v1 games. This problem involves far more complicated state and action spaces than those of traditional 1v1 games, such as Go and Atari, which makes it very difficult to search for policies with human-level performance. In this paper, we present a deep reinforcement learning framework to tackle this problem from the perspectives of both system and algorithm. Our system is of low coupling and high scalability, which enables efficient explorations at large scale. Our algorithm includes several novel strategies, including control dependency decoupling, action mask, target attention, and dual-clip PPO, with which our proposed actor-critic network can be effectively trained in our system. Tested on the MOBA game Honor of Kings, the trained AI agents can defeat top professional human players in full 1v1 games.
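
The dual-clip PPO objective mentioned above is well defined enough to sketch: standard PPO clipping, plus a second clip with a constant c > 1 that bounds the surrogate when the advantage is negative and the importance ratio explodes (c = 3 here is an arbitrary choice).

```python
# For A < 0: max(min(r*A, clip(r, 1-eps, 1+eps)*A), c*A); standard clip otherwise.
import torch

def dual_clip_ppo_loss(ratio, adv, eps=0.2, c=3.0):
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    ppo = torch.min(ratio * adv, clipped * adv)          # standard PPO surrogate
    dual = torch.max(ppo, c * adv)                       # extra floor, A < 0 only
    return -torch.where(adv < 0, dual, ppo).mean()

ratio = torch.tensor([0.5, 5.0, 1.1])
adv = torch.tensor([1.0, -1.0, 0.3])
print(dual_clip_ppo_loss(ratio, adv))
```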

NeurIPS Conference 2020 Conference Paper

Towards Playing Full MOBA Games with Deep Reinforcement Learning

  • Deheng Ye
  • Guibin Chen
  • Wen Zhang
  • Sheng Chen
  • Bo Yuan
  • Bo Liu
  • Jia Chen
  • Zhao Liu

MOBA games, e.g., Honor of Kings, League of Legends, and Dota 2, pose grand challenges to AI systems, such as multi-agent coordination, enormous state-action spaces, and complex action control. Developing AI for playing MOBA games has accordingly raised much attention. However, existing work falls short in handling the raw game complexity caused by the explosion of agent combinations, i.e., lineups, when expanding the hero pool; OpenAI's Dota AI, for instance, limits play to a pool of only 17 heroes. As a result, full MOBA games without restrictions are far from being mastered by any existing AI system. In this paper, we propose a MOBA AI learning paradigm that methodologically enables playing full MOBA games with deep reinforcement learning. Specifically, we develop a combination of novel and existing learning techniques, including off-policy adaption, multi-head value estimation, curriculum self-play learning, policy distillation, and Monte Carlo tree search, to train and play a large pool of heroes while addressing the scalability issue. Tested on Honor of Kings, a popular MOBA game, we show how to build superhuman AI agents that can defeat top esports players. The superiority of our AI is demonstrated by the first large-scale performance test of a MOBA AI agent in the literature.
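
Among the listed techniques, multi-head value estimation lends itself to a short sketch: separate value heads for different reward components share a trunk and are combined into a single baseline. The head names and sizes here are assumptions.

```python
# One trunk, several value heads (one per reward component), combined total.
import torch
import torch.nn as nn

class MultiHeadValue(nn.Module):
    def __init__(self, obs_dim, heads=("farming", "kda", "damage", "pushing")):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.heads = nn.ModuleDict({h: nn.Linear(128, 1) for h in heads})
    def forward(self, obs):
        z = self.trunk(obs)
        values = {h: head(z) for h, head in self.heads.items()}
        return sum(values.values()), values       # total baseline + per-head

v = MultiHeadValue(obs_dim=32)
total, per_head = v(torch.randn(4, 32))
print(total.shape, list(per_head))
```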

AAAI Conference 2020 Conference Paper

ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection

  • Zhenbo Xu
  • Wei Zhang
  • Xiaoqing Ye
  • Xiao Tan
  • Wei Yang
  • Shilei Wen
  • Errui Ding
  • Ajin Meng

3D object detection is an essential task in autonomous driving and robotics. Though great progress has been made, challenges remain in estimating the 3D pose of distant and occluded objects. In this paper, we present a novel framework named ZoomNet for stereo-imagery-based 3D detection. The pipeline of ZoomNet begins with an ordinary 2D object detection model that is used to obtain pairs of left-right bounding boxes. To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straightforward module, adaptive zooming, which simultaneously resizes 2D instance bounding boxes to a unified resolution and adjusts the camera intrinsic parameters accordingly. In this way, we are able to estimate higher-quality disparity maps from the resized box images and then construct dense point clouds for both nearby and distant objects. Moreover, we propose learning part locations as complementary features to improve resistance against occlusion, and we put forward a 3D fitting score to better estimate 3D detection quality. Extensive experiments on the popular KITTI 3D detection dataset indicate ZoomNet surpasses all previous state-of-the-art methods by large margins (improved by 9.4% on AP_bv (IoU=0.7) over pseudo-LiDAR). An ablation study also demonstrates that our adaptive zooming strategy brings an improvement of over 10% on AP_3d (IoU=0.7). In addition, since the official KITTI benchmark lacks fine-grained annotations like pixel-wise part locations, we also present our KFG dataset, which augments KITTI with detailed instance-wise annotations, including pixel-wise part location and pixel-wise disparity. Both the KFG dataset and our code will be publicly available at https://github.com/detectRecog/ZoomNet.
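
The adaptive-zooming bookkeeping has a simple geometric core: when a box crop is resized to a unified resolution, the focal lengths scale with the zoom factors and the principal point shifts into the crop frame. A worked sketch follows (matrix values and the output size are arbitrary).

```python
# Scale intrinsics to match a box crop resized to a unified resolution.
import numpy as np

def zoom_intrinsics(K, box, out_size=(224, 224)):
    x0, y0, x1, y1 = box
    sx = out_size[0] / (x1 - x0)                 # horizontal zoom factor
    sy = out_size[1] / (y1 - y0)                 # vertical zoom factor
    Kz = K.copy()
    Kz[0, 0] *= sx                               # focal length fx
    Kz[1, 1] *= sy                               # focal length fy
    Kz[0, 2] = (K[0, 2] - x0) * sx               # principal point in crop frame
    Kz[1, 2] = (K[1, 2] - y0) * sy
    return Kz

K = np.array([[720.0, 0, 620.0], [0, 720.0, 180.0], [0, 0, 1]])
print(zoom_intrinsics(K, box=(600, 150, 700, 250)))
```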

AAAI Conference 2019 Conference Paper

Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search

  • Jinfeng Rao
  • Wei Yang
  • Yuhao Zhang
  • Ferhan Ture
  • Jimmy Lin

Despite substantial interest in applications of neural networks to information retrieval, neural ranking models have mostly been applied to “standard” ad hoc retrieval tasks over web pages and newswire articles. This paper proposes MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network), a novel neural ranking model specifically designed for ranking short social media posts. We identify document length, informal language, and heterogeneous relevance signals as features that distinguish documents in our domain, and present a model specifically designed with these characteristics in mind. Our model uses hierarchical convolutional layers to learn latent semantic soft-match relevance signals at the character, word, and phrase levels. A pooling-based similarity measurement layer integrates evidence from multiple types of matches between the query and the social media post, as well as URLs contained in the post. Extensive experiments using Twitter data from the TREC Microblog Tracks 2011–2014 show that our model significantly outperforms prior feature-based as well as existing neural ranking models. To the best of our knowledge, this paper presents the first substantial work tackling search over social media posts using neural ranking models. Our code and data are publicly available.
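
The hierarchical matching idea can be sketched with stacked 1D convolutions that widen the receptive field from word level toward phrase level, with pooled features compared against the query; the dimensions and depth below are illustrative, not the MP-HCNN configuration.

```python
# Stacked 1D convs build word- then phrase-level features; pooled features
# at each level yield a per-level relevance signal against the query.
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(1000, 32)
conv1 = nn.Conv1d(32, 32, kernel_size=2, padding=1)   # word -> bigram level
conv2 = nn.Conv1d(32, 32, kernel_size=2, padding=1)   # bigram -> phrase level

def encode(token_ids):
    x = emb(token_ids).transpose(1, 2)                # (B, 32, L)
    levels = []
    for conv in (conv1, conv2):
        x = F.relu(conv(x))
        levels.append(x.max(dim=2).values)            # pooled match signal
    return torch.stack(levels, dim=1)                 # (B, levels, 32)

q = encode(torch.randint(0, 1000, (1, 8)))            # query
d = encode(torch.randint(0, 1000, (1, 30)))           # social media post
print(F.cosine_similarity(q, d, dim=2))               # per-level relevance
```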

JBHI Journal 2018 Journal Article

Lung Field Segmentation in Chest Radiographs From Boundary Maps by a Structured Edge Detector

  • Wei Yang
  • Yunbi Liu
  • Liyan Lin
  • Zhaoqiang Yun
  • Zhentai Lu
  • Qianjin Feng
  • Wufan Chen

Lung field segmentation in chest radiographs (CXRs) is an essential preprocessing step in automatically analyzing such images. We present a method for lung field segmentation that is built on a high-quality boundary map detected by an efficient modern boundary detector, namely a structured edge detector (SED). A SED is trained beforehand to detect lung boundaries in CXRs with manually outlined lung fields. Then, an ultrametric contour map (UCM) is transformed from the masked and marked boundary map. Finally, the contours with the highest confidence level in the UCM are extracted as lung contours. Our method is evaluated using the public Japanese Society of Radiological Technology database of scanned films. The average Jaccard index of our method is 95.2%, which is comparable with those of other state-of-the-art methods (95.4%). The computation time of our method is less than 0.1 s for a 256 × 256 CXR when executed on an ordinary laptop. Our method is also validated on CXRs acquired with different digital radiography units. The results demonstrate the generalization of the trained SED model and the usefulness of our method.
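
The final extraction step, picking contours out of a boundary map, can be illustrated with scikit-image on a toy map; the real pipeline's trained SED and UCM construction are not reproduced here.

```python
# Extract iso-contours from a (toy) boundary-confidence map and keep the
# largest one as the candidate lung contour.
import numpy as np
from skimage import measure

boundary = np.zeros((64, 64))
rr, cc = np.ogrid[:64, :64]
boundary[(rr - 32) ** 2 + (cc - 32) ** 2 < 400] = 0.9   # fake lung region

contours = measure.find_contours(boundary, level=0.5)
best = max(contours, key=len)            # largest closed contour
print(len(contours), best.shape)
```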

JBHI Journal 2017 Journal Article

Classification of Multiple Finger Motions During Dynamic Upper Limb Movements

  • Dapeng Yang
  • Wei Yang
  • Qi Huang
  • Hong Liu

To better restore human hand function, advanced hand prostheses should be able to deal with a variety of daily living conditions. In this paper, we addressed myoelectric signal variations introduced by different muscle contractions, dynamic arm movements, and outer interfering forces in the practice of pattern recognition-based myoelectric control schemes. We examined four different training paradigms (data-collection protocols) and quantified their effectiveness for obtaining a robust classification. We further characterized the classification accuracy across different arm/wrist motion primitives. Our results indicate that the training paradigm that collects myoelectric signals over dynamic arm postures and varying muscular contractions (DPDE) can largely mitigate the motion misclassification rate. The misclassification rate of finger motions appears to correlate strongly with wrist pronation and supination rather than with different arm positions. Combining proprioceptive information, such as the hand's orientation, with myoelectric signals for classification only slightly alleviates the misclassification rate.
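
A minimal sketch of the pattern-recognition pipeline such studies rely on: RMS features per EMG window feeding a standard classifier. The synthetic signals, channel count, and window size are assumptions, not the paper's experimental setup.

```python
# RMS feature per channel per window, then a linear discriminant classifier.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_windows, n_channels, win = 200, 8, 128
emg = rng.normal(size=(n_windows, n_channels, win))   # synthetic EMG windows
labels = rng.integers(0, 4, size=n_windows)           # 4 finger motions
emg += labels[:, None, None] * 0.3                    # inject class signal

rms = np.sqrt((emg ** 2).mean(axis=2))                # (windows, channels)
clf = LinearDiscriminantAnalysis().fit(rms[:150], labels[:150])
print("held-out accuracy:", clf.score(rms[150:], labels[150:]))
```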