Arrow Research search

Author name cluster

Chao Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

43 papers
2 author rows

Possible papers

43

AAAI Conference 2026 Conference Paper

AR-Nav Benchmark: Augmented Reality Navigation with Vision and Language

  • Liqi Yan
  • Yihao Wu
  • Chenyi Xu
  • Chao Yang
  • Jianhui Zhang
  • Pan Li

Augmented Reality (AR) navigation has emerged as a transformative tool for spatial intelligence, enabling users to interactively explore complex environments through wearable and mobile AR devices. However, current AR navigation systems struggle with low indoor localization accuracy, weak semantic understanding, and limited long-term memory, which severely limits their adaptability in dynamic, multi-floor, and large-scale real-world settings. To address these challenges, we present the AR-Nav benchmark, a novel dataset with a corresponding suite that leverages vision and language for AR navigation. First, to construct this benchmark, we propose an Augmented Reality Visual-Language Memory Model (AR‑VLM²), which generates structured, semantically rich, and temporally indexed representations for long-term AR navigation. Second, we design ARN‑Pilot, a lightweight navigation intent recommendation module with hierarchical topological reasoning and language-grounded path planning, enabling low-latency and personalized route selection. Third, we introduce a closed-loop AR interaction module that supports real-time multi-modal feedback, dynamic memory updates, and human-in-the-loop query refinement. Extensive experiments in indoor multi-floor and outdoor parking scenarios show that the AR-Nav suite significantly outperforms state-of-the-art AR navigation methods.

AAAI Conference 2026 Conference Paper

SHADOW: Dynamic-Aware Credit Assignment Against Long-Horizon Tasks

  • Yuze Liu
  • Chaochao Lu
  • Chao Yang

Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM) agents to solve complex, multi-step tasks through environmental interaction. A fundamental challenge in such long-horizon scenarios is credit assignment, as delayed rewards provide inadequate signals for evaluating individual action contributions. Existing methods typically neglect trajectory transition dynamics, which leads to coarse-grained or biased credit assignment. To address these limitations, we introduce SHADOW, a novel framework that systematically incorporates transition dynamics for improved credit assignment. Our framework makes two primary contributions: (i) a dynamics-aware state grouping mechanism that mitigates misleading action comparisons between dynamically inconsistent states, and (ii) a local dynamic advantage estimator that leverages Generalized Advantage Estimation (GAE) to precisely quantify individual action contributions through a fine-grained analysis of transition patterns. Comprehensive experiments conducted with the Qwen2.5-1.5/7B-Instruct agent model demonstrate that our method achieves success rate improvements of 9.4%/7.6% on the ALFworld benchmark and a performance gain of over 5% on WebShop.
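
For reference, the Generalized Advantage Estimation (GAE) that the local dynamic advantage estimator builds on reduces to a backward recursion over per-step TD errors; a standard textbook sketch (not the paper's code) is:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE: A_t = sum_k (gamma*lam)^k * delta_{t+k},
    with TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) (includes bootstrap value).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Toy long-horizon episode with a single delayed terminal reward:
rewards = [0.0] * 9 + [1.0]
values = [0.0] * 11          # untrained critic; bootstrap V(s_T) = 0
print(gae_advantages(rewards, values))  # credit decays toward earlier steps
```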

ICML Conference 2025 Conference Paper

C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation

  • Guoxin Chen
  • Minpeng Liao
  • Peiying Yu
  • Dingmin Wang
  • Zile Qiao
  • Chao Yang
  • Xin Zhao 0018
  • Kai Fan 0002

Retrieval-augmented generation (RAG) systems face a fundamental challenge in aligning independently developed retrievers and large language models (LLMs). Existing approaches typically involve modifying either component or introducing simple intermediate modules, resulting in practical limitations and sub-optimal performance. Inspired by human search behavior, which typically involves a back-and-forth process of proposing search queries and reviewing documents, we propose C-3PO, a proxy-centric framework that facilitates communication between retrievers and LLMs through a lightweight multi-agent system. Our framework implements three specialized agents that collaboratively optimize the entire RAG pipeline without altering the retriever and LLMs. These agents work together to assess the need for retrieval, generate effective queries, and select information suitable for the LLMs. To enable effective multi-agent coordination, we develop a tree-structured rollout approach for reward credit assignment in reinforcement learning. Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities.

IJCAI Conference 2025 Conference Paper

CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation

  • Guilin Su
  • Yuqing Huang
  • Chao Yang
  • Zhenyu He

Infrared-visible image fusion and semantic segmentation are pivotal tasks for robust scene understanding under challenging conditions such as low light. However, existing methods often struggle with high noise, modality inconsistencies, and inefficient cross-modal interactions, limiting fusion quality and segmentation accuracy. To this end, we propose CMFS, a unified framework that leverages CLIP-guided modality interaction to mitigate noise in multi-modal image fusion and segmentation. Our approach features a region-aware Modal Interaction Alignment module that combines a VMamba-based encoder with an additional shuffle layer to obtain more robust features and a CLIP-guided, regionally constrained multi-modal feature interaction block to emphasize foreground targets while suppressing low-light noise. Additionally, a Frequency-Spatial Collaboration module uses selective scanning and integrates wavelet-, spatial-, and Fourier-domain features to achieve adaptive denoising and balanced feature allocation. Furthermore, we employ a low-rank mixture-of-experts with dynamic routing to improve region-specific fusion and enhance pixel-level accuracy. Extensive experiments on several benchmarks show that, compared with state-of-the-art methods, the proposed approach demonstrates effectiveness in both image fusion quality and semantic segmentation accuracy, especially in complex environments. The source code will be released at IJCAI2025-CMFS.

AAAI Conference 2025 Conference Paper

Dynamic Spectral Graph Anomaly Detection

  • Jianbo Zheng
  • Chao Yang
  • Tairui Zhang
  • Longbing Cao
  • Bin Jiang
  • Xuhui Fan
  • Xiao-ming Wu
  • Xianxun Zhu

Graph anomaly detection is crucial for identifying anomalous nodes within graphs and addressing applications like financial fraud detection and social spam detection. Recent spectral graph neural network methods advance graph anomaly detection by focusing on anomalies that notably affect the distribution of graph spectral energy. Such spectrum-based methods rely on two steps: graph wavelet extraction and feature fusion. However, both steps are hand-designed, so the wavelet-specific features capture incomplete anomaly information and are fused inconsistently. To address these problems, we propose DSGAD, a dynamic spectral graph anomaly detection framework that adaptively captures comprehensive anomaly information and performs consistent feature fusion. DSGAD introduces dynamic wavelets, consisting of trainable wavelets that adaptively learn anomalous patterns and capture wavelet-specific features with comprehensive anomaly information. Furthermore, the wavelet-specific features are fused consistently and dynamically by combining energy-difference-based feature extraction with channel convolution fusion based on location correlation. Experimental results on four datasets substantiate the efficacy of our DSGAD method, surpassing state-of-the-art methods on both homogeneous and heterogeneous graphs.

ICML Conference 2025 Conference Paper

Evolving Minds: Logic-Informed Inference from Temporal Action Patterns

  • Chao Yang
  • Shuting Cui
  • Yang Yang
  • Shuang Li 0002

Understanding human mental states—such as intentions and desires—is crucial for natural AI-human collaboration. However, this is challenging because human actions occur irregularly over time, and the underlying mental states that drive these actions are unobserved. To tackle this, we propose a novel framework that combines a logic-informed temporal point process (TPP) with amortized variational Expectation-Maximization (EM). Our key innovation is integrating logic rules as priors to guide the TPP’s intensity function, allowing the model to capture the interplay between actions and mental events while reducing dependence on large datasets. To handle the intractability of mental state inference, we introduce a discrete-time renewal process to approximate the posterior. By jointly optimizing model parameters, logic rules, and inference networks, our approach infers entire mental event sequences and adaptively predicts future actions. Experiments on both synthetic and real-world datasets show that our method outperforms existing approaches in accurately inferring mental states and predicting actions, demonstrating its effectiveness in modeling human cognitive processes.

IS Journal 2025 Journal Article

Fine-Tuning Large Language Models With Behavioral Alignment for Depression Detection

  • Xifeng Ning
  • Hailu Sun
  • Dejun Yu
  • Chao Yang
  • Ruonan Fang
  • Lin Fan
  • Qika Lin
  • Yifan Zhu

Depression is a prevalent mental health issue, and early detection is crucial for effective intervention. In this article, we propose the depression large language model (DLLM), a novel two-stage fine-tuning framework designed to enhance the accuracy and robustness of depression detection using multimodal data from social media. In the first stage, we design specific prompts to incorporate various types of multimodal data and collect diverse instruction data. The DLLM is then fine-tuned for depression detection based on these data. In the second stage, we enhance the model’s robustness and generalization by performing behavioral alignment. This involves a deep understanding of user actions to improve behavior perception, enabling the policy model to distinguish between positive and negative behaviors for individual users. Experiments on the WU3D dataset show that DLLM outperforms state-of-the-art baselines (e.g., +6.2% accuracy over ALBERT, +2.4% F1 over EKG-MDDM) and demonstrates strong generalization in ablation studies.

UAI Conference 2025 Conference Paper

Flow-Based Delayed Hawkes Process

  • Chao Yang
  • Wendi Ren
  • Shuang Li 0002

Multivariate Hawkes processes are classic temporal point process models for event data. These models are simple and parametric in nature, offering interpretability by capturing the triggering effects between event types. However, these parametric models often struggle with low model capacity, limiting their expressive power to capture heterogeneous data patterns influenced by latent variables. In this paper, we propose a simple yet powerful extension: the Flow-based Delayed Hawkes Process, which integrates Normalizing Flows as a generative model to parameterize the Hawkes process. By generating all model parameters through the flow-based network, our approach significantly improves flexibility and expressiveness while preserving interpretability. We provide theoretical guarantees by proving the identifiability of the model parameters and the consistency of the maximum likelihood estimator under mild assumptions. Extensive experiments on both synthetic and real-world datasets show that our model outperforms existing baselines in capturing intricate and heterogeneous event dynamics.
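
For context, the parametric model being extended here is the classic multivariate Hawkes process; a textbook sketch of its intensity with exponential kernels (the flow-based parameterization itself is not shown):

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta):
    """Classic multivariate Hawkes intensity with exponential kernels:
    lambda_i(t) = mu_i + sum over past events (t_j, k_j):
                  alpha[i, k_j] * exp(-beta * (t - t_j)).

    history: list of (event_time, event_type) pairs with event_time < t.
    mu: base rates (D,); alpha: triggering matrix (D, D); beta: decay rate.
    """
    lam = mu.copy()
    for t_j, k_j in history:
        lam += alpha[:, k_j] * np.exp(-beta * (t - t_j))
    return lam

mu = np.array([0.2, 0.1])
alpha = np.array([[0.5, 0.1],
                  [0.3, 0.4]])     # alpha[i, k]: how type-k events excite type i
history = [(0.5, 0), (1.2, 1)]     # two past events
print(hawkes_intensity(2.0, history, mu, alpha, beta=1.0))
```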

NeurIPS Conference 2025 Conference Paper

Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty

  • Xu Wan
  • Chao Yang
  • Cheng Yang
  • Jie Song
  • Mingyang Sun

Safe Reinforcement Learning (RL) is crucial for achieving high performance while ensuring safety in real-world applications. However, the complex interplay of multiple uncertainty sources in real environments poses significant challenges for interpretable risk assessment and robust decision-making. To address these challenges, we propose Fuz-RL, a fuzzy measure-guided robust framework for safe RL. Specifically, our framework develops a novel fuzzy Bellman operator for estimating robust value functions using Choquet integrals. Theoretically, we prove that solving the Fuz-RL problem (in Constrained Markov Decision Process (CMDP) form) is equivalent to solving distributionally robust safe RL problems (in robust CMDP form), effectively reformulating the min-max optimization problem into a tractable CMDP with Choquet-integrated value functions. Empirical analyses on safe-control-gym and safety-gymnasium scenarios demonstrate that Fuz-RL effectively integrates with existing safe RL baselines in a model-free manner, significantly improving both safety and control performance under various types of uncertainties in observation, action, and dynamics.
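
The discrete Choquet integral at the heart of the fuzzy Bellman operator is standard; a minimal sketch follows, where the capacity g is a hypothetical distorted probability rather than the paper's learned fuzzy measure:

```python
import numpy as np

def choquet_integral(x, g):
    """Discrete Choquet integral of values x w.r.t. a set function g:
    sort values ascending x_(1) <= ... <= x_(n), then
    C = sum_i (x_(i) - x_(i-1)) * g({j : x_j >= x_(i)}), with x_(0) = 0.
    g maps a frozenset of indices to [0, 1], g(empty) = 0, g(all) = 1.
    """
    order = np.argsort(x)
    total, prev = 0.0, 0.0
    remaining = set(range(len(x)))
    for idx in order:
        total += (x[idx] - prev) * g(frozenset(remaining))
        prev = x[idx]
        remaining.remove(idx)
    return total

# Hypothetical capacity: a convexly distorted (pessimistic) probability.
probs = np.array([0.5, 0.3, 0.2])
g = lambda S: float(sum(probs[list(S)])) ** 2   # convex distortion -> risk-averse
print(choquet_integral(np.array([1.0, 4.0, 2.0]), g))  # below the plain expectation 2.1
```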

JBHI Journal 2025 Journal Article

GCNLA: Inferring Cell-Cell Interactions From Spatial Transcriptomics With Long Short-Term Memory and Graph Convolutional Networks

  • Chao Yang
  • Xiuhao Fu
  • Zhenjie Luo
  • Leyi Wei
  • Jingbing Li
  • Feifei Cui
  • Quan Zou
  • Qingchen Zhang

Spatial transcriptomics analysis methods offer an opportunity to investigate highly diverse biological tissues. Cell-cell communication is fundamental for maintaining physiological homeostasis in organisms and coordinating complex biological processes. Identifying cell-cell interactions is critical for understanding cellular activities. The interaction of a cell with other cells depends on several factors, and most existing methods are somewhat limited because they consider only the gene expression of neighbouring cells and spatial location information. In this paper, we propose GCNLA, a network architecture based on a graph convolutional network and a long short-term memory attention module, which contains a graph convolution layer, a long short-term memory network, an attention module, and residual connections. GCNLA not only learns the spatial structure of cells but also captures interaction information between distal cells, with the attention module further extracting and enhancing features related to cell-cell interactions. Finally, inner-product decoding computes cosine similarity, which is used to infer cell-cell interactions. In addition, GCNLA is capable of reconstructing the complete cell-cell interaction network. The experimental results on seqFISH and MERFISH demonstrate that the GCNLA network structure has better robustness and noise immunity. The potential features learned by GCNLA enable other downstream analyses, including spatially informed cell clustering at single-cell resolution that resolves cell heterogeneity.
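
The decoding step is simple enough to sketch: score each cell pair by the cosine similarity of its learned embeddings. A minimal sketch with random placeholder embeddings (not the GCNLA pipeline itself):

```python
import numpy as np

def cosine_decoder(Z, threshold=0.3):
    """Score every cell pair by cosine similarity of embeddings Z (n_cells, d)
    and keep pairs above a threshold as candidate interactions."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T                       # pairwise cosine similarity
    np.fill_diagonal(S, 0.0)            # ignore self-interactions
    return S, np.argwhere(np.triu(S > threshold))

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 16))            # placeholder for learned cell embeddings
S, pairs = cosine_decoder(Z)
print(pairs)                            # candidate interacting cell pairs (i, j)
```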

NeurIPS Conference 2025 Conference Paper

SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry

  • Peijie Wang
  • Chao Yang
  • Zhong-Zhi Li
  • Fei Yin
  • Dekang Ran
  • Mi Tian
  • Zhilong Ji
  • Jinfeng Bai

Geometry is a fundamental branch of mathematics and plays a crucial role in evaluating the reasoning capabilities of multimodal large language models (MLLMs). However, existing multimodal mathematics benchmarks mainly focus on plane geometry and largely ignore solid geometry, which requires spatial reasoning and is more challenging than plane geometry. To address this critical gap, we introduce SolidGeo, the first large-scale benchmark specifically designed to evaluate the performance of MLLMs on mathematical reasoning tasks in solid geometry. SolidGeo consists of 3,113 real-world K–12 and competition-level problems, each paired with visual context and annotated with difficulty levels and fine-grained solid geometry categories. Our benchmark covers a wide range of 3D reasoning subjects such as projection, unfolding, spatial measurement, and spatial vectors, offering a rigorous testbed for assessing solid geometry. Through extensive experiments, we observe that MLLMs encounter substantial challenges in solid geometry math tasks, with a considerable performance gap relative to human capabilities on SolidGeo. Moreover, we analyze the performance, inference efficiency, and error patterns of various models, offering insights into the solid geometric mathematical reasoning capabilities of MLLMs. We hope SolidGeo serves as a catalyst for advancing MLLMs toward deeper geometric reasoning and spatial intelligence. The dataset is released at https://huggingface.co/datasets/HarryYancy/SolidGeo/

AAAI Conference 2025 Conference Paper

SrSv: Integrating Sequential Rollouts with Sequential Value Estimation for Multi-agent Reinforcement Learning

  • Xu Wan
  • Chao Yang
  • Cheng Yang
  • Jie Song
  • Mingyang Sun

Although multi-agent reinforcement learning (MARL) has shown its success across diverse domains, extending its application to large-scale real-world systems still faces significant challenges. Primarily, the high complexity of real-world environments exacerbates the credit assignment problem, substantially reducing training efficiency. Moreover, the variability of agent populations in large-scale scenarios necessitates scalable decision-making mechanisms. To address these challenges, we propose a novel framework: Sequential rollout with Sequential value estimation (SrSv). This framework aims to capture agent interdependence and provide a scalable solution for cooperative MARL. Specifically, SrSv leverages the autoregressive property of the Transformer model to handle varying populations through sequential action rollout. Furthermore, to capture the interdependence of policy distributions and value functions among multiple agents, we introduce an innovative sequential value estimation methodology and integrate the value approximation into an attention-based sequential model. We evaluate SrSv on three benchmarks: Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, and DubinsCars. Experimental results demonstrate that SrSv significantly outperforms baseline methods in terms of training efficiency without compromising convergence performance. Moreover, when implemented in a large-scale DubinsCar system with 1,024 agents, our framework surpasses existing benchmarks, highlighting the excellent scalability of SrSv.

ICML Conference 2025 Conference Paper

Think Twice, Act Once: A Co-Evolution Framework of LLM and RL for Large-Scale Decision Making

  • Xu Wan 0001
  • Wenyue Xu
  • Chao Yang
  • Mingyang Sun

Recent advancements in Large Language Models (LLMs) and Reinforcement Learning (RL) have shown significant promise in decision-making tasks. Nevertheless, for large-scale industrial decision problems, both approaches face distinct challenges: LLMs lack real-time long-sequence decision-making capabilities, while RL struggles with sample efficiency in vast action spaces. To bridge this gap, we propose Agents Co-Evolution (ACE), a synergistic framework between LLMs and an RL agent for large-scale decision-making scenarios. ACE introduces a dual-role trajectory refinement mechanism where LLMs act as both Policy Actor and Value Critic during RL training: the Actor refines suboptimal actions via multi-step reasoning and environment validation, while the Critic performs temporal credit assignment through trajectory-level reward shaping. Concurrently, the RL agent enhances LLMs’ task-specific decision-making via prioritized experience replay. Through extensive experiments across multiple power grid operation challenges with action spaces exceeding 60K discrete actions, ACE demonstrates superior performance over existing RL methods and LLM-based methods.

AAAI Conference 2025 Conference Paper

TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning

  • Xiang Li
  • Yunshi Lan
  • Chao Yang

Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs), either by computing a holistic score or by employing another LLM as a judge. However, these approaches suffer from data leakage, due to the open access of the benchmarks, and from an inflexible evaluation process. To address this issue, we introduce TreeEval, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session, essentially avoiding data leakage. Moreover, this LLM acts as an examiner, raising a series of questions under a topic with a tree planning strategy that considers the current evaluation status to decide which question to generate next, ensuring the completeness and efficiency of the evaluation process. We evaluate 6 models of different parameter sizes, including 7B, 13B, and 34B, and ultimately achieve the highest correlation coefficient with AlpacaEval 2.0 using only around 45 questions. We also conduct further analysis to show the robustness and reliability of TreeEval.

NeurIPS Conference 2025 Conference Paper

VLMs can Aggregate Scattered Training Patches

  • Zhanhui Zhou
  • Lingjie Chen
  • Chao Yang
  • Chaochao Lu

One way to mitigate risks in vision-language models (VLMs) is to censor dangerous samples from their training data. However, data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$—the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID. We split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularities for finetuning, and we find that models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks.
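
The data construction behind the experiment, tiling each image into patches that all share one ID, is a few lines of Pillow; a sketch with a hypothetical file name and grid size:

```python
from PIL import Image

def split_into_patch_pairs(image_path, image_id, grid=4):
    """Tile an image into a grid x grid set of patches, each paired with the
    same textual ID, mirroring the {(patch, ID)} construction in the abstract."""
    img = Image.open(image_path)
    w, h = img.size
    pw, ph = w // grid, h // grid
    pairs = []
    for r in range(grid):
        for c in range(grid):
            patch = img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
            pairs.append((patch, image_id))
    return pairs

# Hypothetical example: 16 training samples that only jointly reveal the image.
pairs = split_into_patch_pairs("example.jpg", image_id="ID-0042", grid=4)
print(len(pairs))
```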

AAAI Conference 2024 Conference Paper

Critic-Guided Decision Transformer for Offline Reinforcement Learning

  • Yuanfu Wang
  • Chao Yang
  • Ying Wen
  • Yu Liu
  • Yu Qiao

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.

NeurIPS Conference 2024 Conference Paper

GLinSAT: The General Linear Satisfiability Neural Network Layer By Accelerated Gradient Descent

  • Hongtai Zeng
  • Chao Yang
  • Yanzhen Zhou
  • Cheng Yang
  • Qinglai Guo

Ensuring that the outputs of neural networks satisfy specific constraints is crucial for applying neural networks to real-life decision-making problems. In this paper, we consider making a batch of neural network outputs satisfy bounded and general linear constraints. We first reformulate the neural network output projection problem as an entropy-regularized linear programming problem. We show that such a problem can be equivalently transformed into an unconstrained convex optimization problem with Lipschitz continuous gradient according to the duality theorem. Then, based on an accelerated gradient descent algorithm with numerical performance enhancement, we present our architecture, GLinSAT, to solve the problem. To the best of our knowledge, this is the first general linear satisfiability layer in which all the operations are differentiable and matrix-factorization-free. Although backpropagation can be performed explicitly via the automatic differentiation mechanism, we also provide an alternative approach in GLinSAT that calculates the derivatives based on implicit differentiation of the optimality condition. Experimental results on constrained traveling salesman problems, partial graph matching with outliers, predictive portfolio allocation, and power system unit commitment demonstrate the advantages of GLinSAT over existing satisfiability layers. Our implementation is available at https://github.com/HunterTracer/GLinSAT.

ICML Conference 2024 Conference Paper

Latent Logic Tree Extraction for Event Sequence Explanation from LLMs

  • Zitao Song
  • Chao Yang
  • Chaojie Wang 0001
  • Bo An 0001
  • Shuang Li 0002

Modern high-stakes systems, such as healthcare or robotics, often generate vast streaming event sequences. Our goal is to design an efficient, plug-and-play tool to elicit logic tree-based explanations from Large Language Models (LLMs) to provide customized insights into each observed event sequence. Built on the temporal point process model for events, our method employs the likelihood function as a score to evaluate generated logic trees. We propose an amortized Expectation-Maximization (EM) learning framework and treat the logic tree as latent variables. In the E-step, we evaluate the posterior distribution over the latent logic trees using an LLM prior and the likelihood of the observed event sequences. The LLM provides a high-quality prior for the latent logic trees; however, since the posterior is built over a discrete combinatorial space, we cannot obtain a closed-form solution. We propose to generate logic tree samples from the posterior using a learnable GFlowNet, which is a diversity-seeking generator for structured discrete variables. The M-step employs the generated logic rules to approximate marginalization over the posterior, facilitating the learning of model parameters and refining the tunable LLM prior parameters. In the online setting, our locally built, lightweight model iteratively extracts the most relevant rules from LLMs for each sequence using only a few iterations. Empirical demonstrations showcase the promising performance and adaptability of our framework.

TMLR Journal 2024 Journal Article

MaskMA: Towards Zero-Shot Multi-Agent Decision Making with Mask-Based Collaborative Learning

  • Jie Liu
  • Yinmin Zhang
  • Chuming Li
  • Zhiyuan You
  • Zhanhui Zhou
  • Chao Yang
  • Yaodong Yang
  • Yu Liu

Building a single generalist agent with strong zero-shot capability has recently sparked significant advancements. However, extending this capability to multi-agent decision making scenarios presents challenges. Most current works struggle with zero-shot transfer, due to two challenges particular to the multi-agent settings: (a) a mismatch between centralized training and decentralized execution; and (b) difficulties in creating generalizable representations across diverse tasks due to varying agent numbers and action spaces. To overcome these challenges, we propose a Mask-Based collaborative learning framework for Multi-Agent decision making (MaskMA). Firstly, we randomly mask part of the units and collaboratively learn the policies of unmasked units to handle the mismatch. In addition, MaskMA integrates a generalizable action representation by dividing the action space into intrinsic actions solely related to the unit itself and interactive actions involving interactions with other units. This flexibility allows MaskMA to tackle tasks with varying agent numbers and thus different action spaces. Extensive experiments in SMAC reveal that MaskMA, with a single model trained on 11 training maps, can achieve an impressive 77.8% average zero-shot win rate on 60 unseen test maps by decentralized execution, while also performing effectively on other types of downstream tasks (e.g., varied policies collaboration, ally malfunction, and ad hoc team play).
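
The unit-masking step can be sketched in a few lines; a minimal illustration with placeholder shapes, where the mask probability and the zeroing of masked units' observations are assumptions rather than the paper's settings:

```python
import numpy as np

def mask_units(obs, mask_prob=0.3, rng=None):
    """Randomly mask a subset of units: masked units' observations are zeroed
    and excluded from the policy loss, so unmasked units must learn to act
    without them (one way to realize the random masking in the abstract)."""
    rng = rng or np.random.default_rng()
    keep = rng.random(len(obs)) >= mask_prob   # boolean keep-mask over units
    masked = obs * keep[:, None]               # zero out masked observations
    return masked, keep

obs = np.ones((5, 8))                           # 5 units, 8-dim observations
masked, keep = mask_units(obs, mask_prob=0.4, rng=np.random.default_rng(3))
print(keep)                                     # units contributing to the loss
```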

ICML Conference 2024 Conference Paper

Neuro-Symbolic Temporal Point Processes

  • Yang Yang
  • Chao Yang
  • Boyang Li
  • Yinghao Fu
  • Shuang Li 0002

Our goal is to $\textit{efficiently}$ discover a compact set of temporal logic rules to explain irregular events of interest. We introduce a neural-symbolic rule induction framework within the temporal point process model. The negative log-likelihood is the loss that guides the learning, where the explanatory logic rules and their weights are learned end-to-end in a $\textit{differentiable}$ way. Specifically, predicates and logic rules are represented as $\textit{vector embeddings}$, where the predicate embeddings are fixed and the rule embeddings are trained via gradient descent to obtain the most appropriate compositional representations of the predicate embeddings. To make the rule learning process more efficient and flexible, we adopt a $\textit{sequential covering algorithm}$, which progressively adds rules to the model and removes the event sequences that have been explained until all event sequences have been covered. All the found rules will be fed back to the models for a final rule embedding and weight refinement. Our approach showcases notable efficiency and accuracy across synthetic and real datasets, surpassing state-of-the-art baselines by a wide margin in terms of efficiency.

IJCAI Conference 2024 Conference Paper

Safety of Multimodal Large Language Models on Images and Text

  • Xin Liu
  • Yichen Zhu
  • Yunshi Lan
  • Chao Yang
  • Yu Qiao

Attracted by the impressive power of Multimodal Large Language Models (MLLMs), the public is increasingly utilizing them to improve the efficiency of daily work. Nonetheless, the vulnerabilities of MLLMs to unsafe instructions bring huge safety risks when these models are deployed in real-world scenarios. In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text. We begin by introducing an overview of MLLMs on images and text and our understanding of safety, which helps researchers grasp the detailed scope of our survey. Then, we review the evaluation datasets and metrics for measuring the safety of MLLMs. Next, we comprehensively present attack and defense techniques related to MLLMs' safety. Finally, we analyze several unsolved issues and discuss promising research directions. The relevant papers are collected at https://github.com/isXinLiu/Awesome-MLLM-Safety.

NeurIPS Conference 2024 Conference Paper

Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models

  • Zhanhui Zhou
  • Zhixuan Liu
  • Jie Liu
  • Zhichen Dong
  • Chao Yang
  • Yu Qiao

Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search to maximize the log-probability difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (1) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (2) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small models (e.g., $\texttt{zephyr-7b-beta}$ and its untuned version) can improve the length-controlled win rates of both white-box and black-box large models against $\texttt{gpt-4-turbo}$ (e.g., $34.4\% \rightarrow 37.9\%$ for $\texttt{Llama-3-70B-Instruct}$ and $16.0\% \rightarrow 20.1\%$ for $\texttt{gpt-3.5-turbo-instruct}$), despite the small models' low win rates $\approx 10.0\%$.
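
A minimal sketch of the search loop, assuming hypothetical stub interfaces for the three models; the paper's exact decoding procedure (beam width, chunk length, scoring details) may differ:

```python
def weak_to_strong_search(prompt, large_lm, small_tuned, small_untuned,
                          n_candidates=8, n_chunks=10, chunk_tokens=32):
    """Greedy chunk-level search: the frozen large model proposes candidate
    continuations; the small tuned/untuned pair scores each candidate by
    log p_tuned(chunk | text) - log p_untuned(chunk | text).

    Assumed (hypothetical) interfaces, for illustration only:
      large_lm.sample(text, k, n)      -> n candidate strings of ~k tokens
      small_*.logprob(text, chunk)     -> float log-probability of chunk
    """
    text = prompt
    for _ in range(n_chunks):
        candidates = large_lm.sample(text, chunk_tokens, n_candidates)
        best = max(candidates,
                   key=lambda c: small_tuned.logprob(text, c)
                                 - small_untuned.logprob(text, c))
        text += best                      # commit the best-scored chunk
    return text
```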

NeurIPS Conference 2023 Conference Paper

Discovering Intrinsic Spatial-Temporal Logic Rules to Explain Human Actions

  • Chengzhi Cao
  • Chao Yang
  • Ruimao Zhang
  • Shuang Li

We propose an interpretable model to uncover the behavioral patterns of human movements by analyzing their trajectories. Our approach is based on the belief that human actions are driven by intentions and are influenced by environmental factors such as spatial relationships with surrounding objects. To model this, we use a set of spatial-temporal logic rules that include intention variables as principles. These rules are automatically discovered and used to capture the dynamics of human actions. To learn the model parameters and rule content, we design an EM learning algorithm that treats the unknown rule content as a latent variable. In the E-step, we evaluate the posterior over the latent rule content, and in the M-step, we optimize the rule generator and model parameters by maximizing the expected log-likelihood. Our model has wide-ranging applications in areas such as sports analytics, robotics, and autonomous cars. We demonstrate the model's superior interpretability and prediction performance on both pedestrian and NBA basketball player datasets, achieving promising results.

JBHI Journal 2023 Journal Article

TARF: Technology-Agnostic RF Sensing for Human Activity Recognition

  • Chao Yang
  • Xuyu Wang
  • Shiwen Mao

With the rapid development towards smart Internet of Things (IoT), detection of human activity has become essential in a variety of applications. Various radio-frequency (RF) sensing technologies, such as WiFi, Radio-Frequency Identification (RFID), and Frequency-Modulated Continuous Wave (FMCW) radar, have been utilized for non-invasive human activity recognition (HAR). It will be highly desirable to develop a HAR solution that can work with different types of RF technologies, such that the cost and the barrier of wide deployment can both be greatly reduced, and more robust performance can be achieved by utilizing the complementary RF sensory data. In this paper, we propose a technology-agnostic approach for RF-based HAR, termed TARF, which works with several different RF sensing technologies. A novel data generalization technique is proposed to mitigate the disparity in measured data from different RF devices. A domain adversarial neural network is proposed to combat the interference from various RF sensing technologies. The performance of the proposed system is evaluated with experiments using four different RF sensing technologies. TARF is shown to outperform the state-of-the-art Convolutional Neural Network (CNN)-based solution with considerable gains.
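
The domain-adversarial component follows the standard gradient-reversal recipe; a minimal PyTorch sketch of that layer (a generic DANN construction, not TARF's exact network):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor learns features the domain
    classifier (here, an RF-technology classifier) cannot separate."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

features = torch.randn(8, 64, requires_grad=True)        # shared RF features
domain_logits = torch.nn.Linear(64, 3)(grad_reverse(features))  # 3 RF techs
domain_logits.sum().backward()   # gradients flowing into `features` are reversed
```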

AAAI Conference 2022 Conference Paper

SAS: Self-Augmentation Strategy for Language Model Pre-training

  • Yifei Xu
  • Jingqiao Zhang
  • Ru He
  • Liangzhu Ge
  • Chao Yang
  • Cheng Yang
  • Ying Nian Wu

The core of self-supervised learning for pre-training language models includes pre-training task design as well as appropriate data augmentation. Most data augmentations in language model pre-training are context-independent. A seminal contextualized augmentation was recently proposed in ELECTRA and achieved state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for the training of a main discrimination network (discriminator). This design, however, introduces the extra computation cost of the generator and the need to adjust the relative capability between the generator and the discriminator. In this paper, we propose a self-augmentation strategy (SAS) in which a single network is utilized both for regular pre-training and for contextualized data augmentation in later training epochs. Essentially, this strategy eliminates the separate generator and uses the single network to jointly conduct two pre-training tasks with MLM (Masked Language Modeling) and RTD (Replaced Token Detection) heads. It avoids the challenge of searching for an appropriate generator size, which is critical to performance, as evidenced in ELECTRA and its subsequent variant models. In addition, SAS is a general strategy that can be seamlessly combined with many new techniques emerging recently or in the future, such as the disentangled attention mechanism from DeBERTa. Our experiments show that SAS outperforms ELECTRA and other state-of-the-art models on the GLUE tasks with similar or less computation cost.
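
A minimal sketch of the joint objective, assuming a single network with both MLM and RTD heads; the RTD loss weight follows ELECTRA's convention and may differ from SAS's actual choice:

```python
import torch
import torch.nn.functional as F

def sas_losses(logits_mlm, mlm_labels, logits_rtd, replaced_labels,
               rtd_weight=50.0):
    """Joint loss for one network trained with an MLM head (on masked
    positions) and an RTD head (detecting tokens that the same network's
    MLM predictions replaced in an earlier augmentation step).

    logits_mlm: (batch, seq, vocab); mlm_labels: (batch, seq), -100 = ignore.
    logits_rtd: (batch, seq);        replaced_labels: (batch, seq) in {0, 1}.
    """
    loss_mlm = F.cross_entropy(logits_mlm.transpose(1, 2), mlm_labels,
                               ignore_index=-100)
    loss_rtd = F.binary_cross_entropy_with_logits(logits_rtd,
                                                  replaced_labels.float())
    return loss_mlm + rtd_weight * loss_rtd   # weighting per ELECTRA convention
```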

AAAI Conference 2022 Conference Paper

Sim2Real Object-Centric Keypoint Detection and Description

  • Chengliang Zhong
  • Chao Yang
  • Fuchun Sun
  • Jinshan Qi
  • Xiaodong Mu
  • Huaping Liu
  • Wenbing Huang

Keypoint detection and description play a central role in computer vision. Most existing methods are in the form of scene-level prediction, without returning the object classes of different keypoints. In this paper, we propose the object-centric formulation, which, beyond the conventional setting, requires further identifying which object each interest point belongs to. With such fine-grained information, our framework enables more downstream potentials, such as object-level matching and pose estimation in a cluttered environment. To get around the difficulty of label collection in the real world, we develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications. The novelties of our training method are three-fold: (i) we integrate the uncertainty into the learning framework to improve feature description of hard cases, e.g., less-textured or symmetric patches; (ii) we decouple the object descriptor into two output branches, intra-object salience and inter-object distinctness, resulting in a better pixel-wise description; (iii) we enforce cross-view semantic consistency for enhanced robustness in representation learning. Comprehensive experiments on image matching and 6D pose estimation verify the encouraging generalization ability of our method from simulation to reality. Particularly for 6D pose estimation, our method significantly outperforms typical unsupervised/sim2real methods, achieving a closer gap with the fully supervised counterpart.

NeurIPS Conference 2022 Conference Paper

TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training

  • Chang Chen
  • Min Li
  • Zhihua Wu
  • Dianhai Yu
  • Chao Yang

Sparsely gated Mixture-of-Expert (MoE) has demonstrated its effectiveness in scaling up deep neural networks to an extreme scale. Although numerous efforts have been made to improve the performance of MoE from the model design or system optimization perspective, existing MoE dispatch patterns are still not able to fully exploit the underlying heterogeneous network environments. In this paper, we propose TA-MoE, a topology-aware routing strategy for large-scale MoE training, from a model-system co-design perspective, which can dynamically adjust the MoE dispatch pattern according to the network topology. Based on communication modeling, we abstract the dispatch problem into an optimization objective and obtain the approximate dispatch pattern under different topologies. On top of that, we design a topology-aware auxiliary loss, which can adaptively route the data to fit in the underlying topology without sacrificing the model accuracy. Experiments show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations, with roughly 1.01x-1.61x, 1.01x-4.77x, and 1.25x-1.54x improvements over the popular DeepSpeed-MoE, FastMoE, and FasterMoE systems.

JBHI Journal 2022 Journal Article

VoxelHop: Successive Subspace Learning for ALS Disease Classification Using Structural MRI

  • Xiaofeng Liu
  • Fangxu Xing
  • Chao Yang
  • Chung-Chieh Jay Kuo
  • Suma Babu
  • Georges El Fakhri
  • Thomas Jenkins
  • Jonghye Woo

Deep learning has great potential for accurate detection and classification of diseases with medical imaging data, but the performance is often limited by the number of training datasets and memory requirements. In addition, many deep learning models are considered a “black-box,” thereby often limiting their adoption in clinical applications. To address this, we present a successive subspace learning model, termed VoxelHop, for accurate classification of Amyotrophic Lateral Sclerosis (ALS) using T2-weighted structural MRI data. Compared with popular convolutional neural network (CNN) architectures, VoxelHop has modular and transparent structures with fewer parameters without any backpropagation, so it is well-suited to small dataset size and 3D imaging data. Our VoxelHop has four key components, including (1) sequential expansion of near-to-far neighborhood for multi-channel 3D data; (2) subspace approximation for unsupervised dimension reduction; (3) label-assisted regression for supervised dimension reduction; and (4) concatenation of features and classification between controls and patients. Our experimental results demonstrate that our framework using a total of 20 controls and 26 patients achieves an accuracy of 93.48% and an AUC score of 0.9394 in differentiating patients from controls, even with a relatively small number of datasets, showing its robustness and effectiveness. Our thorough evaluations also show its validity and superiority to the state-of-the-art 3D CNN classification approaches. Our framework can easily be generalized to other classification tasks using different imaging modalities.
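
Component (2), subspace approximation for unsupervised dimension reduction, is essentially a PCA step; a numpy sketch of that step alone (the energy threshold is an assumption, and the full Saab/VoxelHop pipeline is not reproduced):

```python
import numpy as np

def subspace_approximation(X, energy=0.95):
    """Unsupervised dimension reduction by PCA: keep the leading principal
    components that explain a target fraction of the total variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), energy)) + 1
    return Xc @ Vt[:k].T, Vt[:k]          # reduced features and retained basis

rng = np.random.default_rng(1)
X = rng.normal(size=(46, 500))            # e.g., 46 subjects, flattened features
Z, basis = subspace_approximation(X, energy=0.95)
print(Z.shape)                            # (46, k) with k chosen by energy
```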

AAAI Conference 2020 Conference Paper

Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance

  • Mingxuan Jing
  • Xiaojian Ma
  • Wenbing Huang
  • Fuchun Sun
  • Chao Yang
  • Bin Fang
  • Huaping Liu

In this paper, we study Reinforcement Learning from Demonstrations (RLfD), which improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations. Most existing RLfD methods require demonstrations to be perfect and sufficient, which is unrealistic in practice. To work with imperfect demonstrations, we first define an imperfect expert setting for RLfD in a formal way, and then point out that previous methods suffer from two issues in terms of optimality and convergence, respectively. Building on our theoretical findings, we tackle these two issues by regarding the expert guidance as a soft constraint that regulates the policy exploration of the agent, which eventually leads to a constrained optimization problem. We further demonstrate that this problem can be addressed efficiently by performing a local linear search on its dual form. Considerable empirical evaluations on a comprehensive collection of benchmarks indicate that our method attains consistent improvement over other RLfD counterparts.

AAAI Conference 2020 Short Paper

VECA: A Method for Detecting Overfitting in Neural Networks (Student Abstract)

  • Liangzhu Ge
  • Yuexian Hou
  • Yaju Jiang
  • Shuai Yao
  • Chao Yang

Despite their widespread applications, deep neural networks often tend to overfit the training data. Here, we propose a measure called VECA (Variance of Eigenvalues of Covariance matrix of Activation matrix) and demonstrate that VECA is a good predictor of networks’ generalization performance during the training process. Experiments performed on fully-connected networks and convolutional neural networks trained on benchmark image datasets show a strong correlation between test loss and VECA, which suggests that we can calculate VECA to estimate generalization performance without sacrificing training data to be used as a validation set.
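
The measure follows directly from its name; a minimal numpy sketch computing the variance of the eigenvalues of the covariance matrix of an activation matrix (the activations below are random placeholders):

```python
import numpy as np

def veca(activations):
    """VECA: Variance of Eigenvalues of the Covariance matrix of an
    Activation matrix (n_samples, n_units), per the abstract's definition."""
    A = activations - activations.mean(axis=0)
    cov = (A.T @ A) / (len(A) - 1)       # sample covariance across units
    eig = np.linalg.eigvalsh(cov)        # covariance is symmetric PSD
    return eig.var()

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 64))        # hypothetical hidden-layer activations
print(veca(acts))                        # track this value across training epochs
```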

NeurIPS Conference 2019 Conference Paper

Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement

  • Chao Yang
  • Xiaojian Ma
  • Wenbing Huang
  • Fuchun Sun
  • Huaping Liu
  • Junzhou Huang
  • Chuang Gan

This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations. In contrast to Learning from Demonstration (LfD), which involves both action and state supervision, LfO is more practical in leveraging previously inapplicable resources (e.g., videos), yet more challenging due to the incomplete expert guidance. In this paper, we investigate LfO and its difference from LfD from both theoretical and practical perspectives. We first prove that the gap between LfD and LfO actually lies in the disagreement of inverse dynamics models between the imitator and the expert, if following the modeling approach of GAIL. More importantly, the upper bound of this gap is revealed by a negative causal entropy, which can be minimized in a model-free way. We term our method Inverse-Dynamics-Disagreement-Minimization (IDDM), which enhances the conventional LfO method by further bridging the gap to LfD. Considerable empirical results on challenging benchmarks indicate that our method attains consistent improvements over other LfO counterparts.

TIST Journal 2017 Journal Article

TensorBeat

  • Xuyu Wang
  • Chao Yang
  • Shiwen Mao

Breathing signal monitoring can provide important clues for health problems. Compared to existing techniques that require wearable devices and special equipment, a more desirable approach is to provide contact-free and long-term breathing rate monitoring by exploiting wireless signals. In this article, we propose TensorBeat, a system to employ channel state information (CSI) phase difference data to intelligently estimate breathing rates for multiple persons with commodity WiFi devices. The main idea is to leverage the tensor decomposition technique to handle the CSI phase difference data. The proposed TensorBeat scheme first obtains CSI phase difference data between pairs of antennas at the WiFi receiver to create CSI tensors. Then canonical polyadic (CP) decomposition is applied to obtain the desired breathing signals. A stable signal matching algorithm is developed to identify the decomposed signal pairs, and a peak detection method is applied to estimate the breathing rates for multiple persons. Our experimental study shows that TensorBeat can achieve high accuracy under different environments for multiperson breathing rate monitoring.
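
The pipeline's two core steps, CP decomposition of the CSI tensor and peak detection on the recovered signals, can be sketched on synthetic data (tensorly's parafac and scipy's find_peaks stand in for the paper's implementation, and the mixing model is a toy assumption):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from scipy.signal import find_peaks

fs, secs, n_persons = 20, 60, 2          # 20 Hz CSI sampling, 60 s window
t = np.arange(fs * secs) / fs
breathing = np.stack([np.sin(2 * np.pi * (15 / 60) * t),    # person 0: 15 bpm
                      np.sin(2 * np.pi * (24 / 60) * t)])   # person 1: 24 bpm

# Synthetic CSI phase-difference tensor: (antenna pairs, subcarriers, time).
rng = np.random.default_rng(0)
mix = rng.normal(size=(3, 30, n_persons))
tensor = tl.tensor(np.einsum('ask,kt->ast', mix, breathing)
                   + 0.1 * rng.normal(size=(3, 30, len(t))))

cp = parafac(tensor, rank=n_persons)     # CP factors along each tensor mode
time_factors = cp.factors[2]             # recovered per-person breathing signals
for k in range(n_persons):               # component order is arbitrary in CP
    peaks, _ = find_peaks(time_factors[:, k], distance=fs)  # >=1 s apart
    print(f"component {k}: ~{len(peaks) * 60 / secs:.0f} breaths/min")
```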