Arrow Research search

Author name cluster

Xiao Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

67 papers
2 author rows

Possible papers

67

JBHI Journal 2026 Journal Article

A Causal Learning-Based sEMG Disentanglement Framework for Multi-Posture Domain Generalization

  • Tanying Su
  • Xin Tan
  • Xiao Liu
  • Chenyun Dai

Surface electromyography (sEMG)-based human-computer interaction (HCI) systems achieve high accuracy in controlled environments, but their robustness in daily-life conditions remains a challenge. In real-world scenarios, variations in user posture introduce personalized biases that can significantly degrade model performance. A viable solution is to train a highly generalized network using existing data from various postures, enabling the model to become less sensitive to posture variations. In this work, we treat the original sEMG signals as a coupling of pattern and posture components, where each component can be considered as a causal signal specific to corresponding labels. We use the causal encoders to understand the generative relationships between data and labels, facilitating the disentanglement of components into different latent spaces and promoting clustering within each space. This enables the model to extract posture-invariant pattern components and train a robust pattern recognition model with strong generalization capabilities. We developed a high-density sEMG (HD-sEMG) dataset with 16 subjects performing in four common HCI postures, addressing the lack of posture variation samples in existing sEMG datasets. Our model achieved an average accuracy of 90.3% across four generalization tasks, outperforming other domain generalization models and demonstrating its superiority.

JBHI Journal 2026 Journal Article

A Multi-Scale Hybrid Efficient Deep Learning Model for COPD Detection Using Respiratory Sounds

  • Xingchen Dong
  • Xiaoyu Chen
  • Qiangqiang Chen
  • Xiao Liu
  • Hongyu Chen
  • Ronald M. Aarts
  • Bin Yin

Chronic obstructive pulmonary disease (COPD) is a prevalent respiratory disease, and early diagnosis is crucial for timely intervention and improved prognosis. Respiratory sound analysis, with its non-invasive nature and ability to reflect airway pathology, shows great potential as an auxiliary diagnostic tool. However, existing methods often focus on detecting specific abnormal sounds, such as wheezing and crackling, rather than diagnosing diseases directly. Additionally, most approaches rely on single features or architectures, which limits diagnostic accuracy. To address these issues, this paper proposes a multi-scale hybrid deep learning model that combines Convolutional Neural Network (CNN), Bidirectional Long Short-term Memory networks (BiLSTM), and Vision Transformer (ViT) to capture temporal, spatial, and global contextual features from both raw signals and multi-scale Mel spectrograms. A Multi-Scale Dynamic Fusion (MSDF) module further integrates these features to enhance representation, while achieving a balance between model complexity and performance. The model achieves accuracies of 99.23% on the ICBHI database and 98.48% on the KAUH/RespiratoryDatabase@TR hybrid database, demonstrating strong potential for effective clinical COPD diagnosis.

AAAI Conference 2026 Conference Paper

From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions

  • Jiayi Li
  • Xiao Liu
  • Yansong Feng

Large Language Model (LLM)-based multi-agent systems are increasingly used to simulate human interactions and solve collaborative tasks. A common practice is to assign agents with personas to encourage behavioral diversity. However, this raises a critical yet underexplored question: do personas introduce biases into multi-agent interactions? This paper presents a systematic investigation into persona-induced biases in multi-agent interactions, with a focus on social traits like trustworthiness (how an agent's opinion is received by others) and insistence (how strongly an agent advocates for its opinion). Through a series of controlled experiments in collaborative problem-solving and persuasion tasks, we reveal that (1) LLM-based agents exhibit biases in both trustworthiness and insistence, with personas from historically advantaged groups (e.g., men and White individuals) perceived as less trustworthy and demonstrating less insistence; and (2) agents exhibit significant in-group favoritism, showing a higher tendency to conform to others who share the same persona. These biases persist across various LLMs, group sizes, and numbers of interaction rounds, highlighting an urgent need for awareness and mitigation to ensure the fairness and reliability of multi-agent systems.

AAAI Conference 2026 Conference Paper

How Does Alignment Enhance LLMs’ Multilingual Capabilities? A Language Neurons Perspective

  • Shimao Zhang
  • Zhejian Lai
  • Xiang Liu
  • Shuaijie She
  • Xiao Liu
  • Yeyun Gong
  • Shujian Huang
  • Jiajun Chen

Multilingual Alignment is an effective and representative paradigm to enhance LLMs' multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, some research on language-specific neurons provides a new perspective to analyze and understand LLMs' mechanisms. However, we find that there are many neurons that are shared by multiple but not all languages and cannot be correctly classified. In this work, we propose a ternary classification methodology that categorizes neurons into three types: language-specific neurons, language-related neurons, and general neurons. We also propose a corresponding identification algorithm to distinguish these different types of neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights to better understand multilingual alignment and multilingual capabilities of LLMs.

AAAI Conference 2026 Conference Paper

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

  • Jiazheng Xu
  • Yu Huang
  • Jiale Cheng
  • Yuanming Yang
  • Jiajun Xu
  • Yuan Wang
  • Wenbo Duan
  • Shen Yang

Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and the unexpected biases that may result. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistent strategy when using VisionReward as a reward model during preference optimization for visual generation. Experiments show that VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models with VisionReward achieve a 31.6% higher pairwise win rate compared to the same models using VideoScore.

NeurIPS Conference 2025 Conference Paper

Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks

  • Chaoyue Liu
  • Han Bi
  • Like Hui
  • Xiao Liu

Nonlinear activation functions are widely recognized for enhancing the expressivity of neural networks, which is the primary reason for their widespread implementation. In this work, we focus on ReLU activation and reveal a novel and intriguing property of nonlinear activations. By comparing networks with the nonlinear activations enabled and disabled, we demonstrate their specific effects on wide neural networks: (a) better feature separation, i.e., a larger angle separation for similar data in the feature space of the model gradient, and (b) better NTK conditioning, i.e., a smaller condition number of the neural tangent kernel (NTK). Furthermore, we show that network depth (i.e., more nonlinear activation operations) further amplifies these effects; in addition, in the infinite-width-then-depth limit, all data are equally separated with a fixed angle in the model gradient feature space, regardless of how similar they are originally in the input space. Note that, without the nonlinear activation, i.e., in a linear neural network, the data separation remains the same as for the original inputs and the NTK condition number is equivalent to that of the Gram matrix, regardless of the network depth. Due to the close connection between the NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient-based methods.

AAAI Conference 2025 Conference Paper

Enhancing Large Language Model Performance with Gradient-Based Parameter Selection

  • Haoling Li
  • Xin Zhang
  • Xiao Liu
  • Yeyun Gong
  • Yifan Wang
  • Qi Chen
  • Peng Cheng

Large language models (LLMs) have revolutionized numerous fields of research, driving significant advancements in natural language processing, machine translation, and beyond. Although the extensive number of parameters contributes substantially to this success, existing studies indicate that not all model parameters hold equal importance, which further leads to redundancy during the parameter update process. Recent works for reducing redundant parameter updates for LLMs either lack task-specific data information, potentially leading to suboptimal model performance, or discard transformer components or insignificant parameters, limiting the model's scalability across different tasks and potentially compromising the LLM structure. To address these issues and further enhance the performance of LLMs, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters based on gradient information, which is specific to the target tasks. Specifically, after calculating gradients during backpropagation, we measure their absolute values and mask those with small absolute values. Our empirical results in various training paradigms like SFT and DPO for various domains of tasks demonstrate that GMT not only preserves the original network structure but also enhances the potential performance of LLMs. Further analysis indicates that GMT exhibits insensitivity to the mask ratio and possesses computational efficiency comparable to the vanilla training approach.

IROS Conference 2025 Conference Paper

eXplainable Intention Estimation in Teleoperated Manipulation Using Deep Dynamic Graph Neural Networks

  • Prakash Baskaran
  • Xiao Liu
  • Songpo Li
  • Soshi Iba

Shared autonomy can improve teleoperating robotic systems in complex manufacturing and assembly tasks by combining human decision-making and robotic capabilities. A key aspect of seamless collaboration and trust in shared autonomy is the robot’s ability to interpret human intentions in a consistent and explainable manner. To achieve this, a graph neural network-based intention estimation framework is introduced, which generates dynamic graphs that capture spatial relationships evolving over time. The framework predicts human intentions at two hierarchical levels: low-level actions and high-level tasks. Furthermore, we empirically and anecdotally verify the correctness and consistency of the predictions using explainability metrics. The algorithm is demonstrated by teleoperating a bi-manual robot to assemble various block structures in a virtual reality simulation environment.

IROS Conference 2025 Conference Paper

High-Precision and High-Efficiency Trajectory Tracking for Excavators Based on Closed-Loop Dynamics

  • Ziqing Zou
  • Cong Wang
  • Yue Hu
  • Xiao Liu
  • Bowen Xu
  • Rong Xiong
  • Changjie Fan
  • Yingfeng Chen

The complex nonlinear dynamics of hydraulic excavators, such as time delays and control coupling, pose significant challenges to achieving high-precision trajectory tracking. Traditional control methods often fall short in such applications due to their inability to effectively handle these nonlinearities, while commonly used learning-based methods require extensive interactions with the environment, leading to inefficiency. To address these issues, we introduce EfficientTrack, a trajectory tracking method that integrates model-based learning to manage nonlinear dynamics and leverages closed-loop dynamics to improve learning efficiency, ultimately minimizing tracking errors. We validate our method through comprehensive experiments both in simulation and on a real-world excavator. Comparative experiments in simulation demonstrate that our method outperforms existing learning-based approaches, achieving the highest tracking precision and smoothness with the fewest interactions. Real-world experiments further show that our method remains effective under load conditions and possesses the ability for continual learning, highlighting its practical applicability. For implementation details and source code, please refer to https://github.com/ZiqingZou/EfficientTrack.

AAAI Conference 2025 Conference Paper

Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning

  • Yiming Huang
  • Xiao Liu
  • Yeyun Gong
  • Zhibin Gou
  • Yelong Shen
  • Nan Duan
  • Weizhu Chen

Large language models have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality and reasoning-focused training datasets. Addressing this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar practices from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs. Utilizing KPMath and augmenting it with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. Our experiments demonstrate that this dataset can enhance the mathematical reasoning performance of models across various architectures and sizes. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0% accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 72B range and best commercial models like GPT-4 across multiple math reasoning datasets.

ICML Conference 2025 Conference Paper

Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

  • Haebin Shin
  • Lei Ji 0001
  • Xiao Liu
  • Yeyun Gong

Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the loss of the teacher model to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.

IROS Conference 2025 Conference Paper

RGB-Thermal Visual Place Recognition via Vision Foundation Model

  • Minghao Ye
  • Xiao Liu
  • Yu Wang
  • Lu Liu 0002
  • Haoyao Chen

Visual place recognition is a critical component of robust simultaneous localization and mapping systems. Conventional approaches primarily rely on RGB imagery, but their performance degrades significantly in extreme environments, such as those with poor illumination or airborne particulate interference (e.g., smoke or fog). Furthermore, existing techniques often struggle with cross-scenario generalization. To overcome these limitations, we propose an RGB-thermal multimodal fusion framework for place recognition, specifically designed to enhance robustness in extreme environmental conditions. Our framework incorporates a dynamic RGB-thermal fusion module, coupled with dual fine-tuned vision foundation models as the feature extraction backbone. Experimental results on public datasets and our self-collected dataset demonstrate that our method significantly outperforms state-of-the-art RGB-based approaches, achieving generalizable and robust retrieval capabilities across day and night scenarios. The code is available at https://github.com/HITSZ-NRSL/RGB-Thermal-VPR.

IJCAI Conference 2025 Conference Paper

RobustHAR: Multi-scale Spatial-temporal Masked Self-supervised Pre-training for Robust Human Activity Recognition

  • Xiao Liu
  • Guan Yuan
  • Yanmei Zhang
  • Shang Liu
  • Qiuyan Yan

Human activity recognition (HAR) is prone to performance degradation in real-world applications due to missing data in intra-sensor and inter-sensor channels. Masked modeling, as one mainstream paradigm of self-supervised pre-training, can learn robust representations across sensors in the data-missing scenario by reconstructing the masked content based on the unmasked part. However, the existing methods predominantly emphasize the temporal dynamics of human activities, which limits their ability to effectively capture the spatial interdependencies among multiple sensors. Besides, different human activities often span various spatial-temporal scales, so activity recognizers can fail to capture intricate spatial-temporal semantic information. To address these issues, we propose RobustHAR, a new HAR model with multi-scale spatial-temporal masked self-supervised pre-training designed to improve model performance in the data-missing context. RobustHAR involves three main steps: (1) RobustHAR constructs location-inspired spatial-temporal 3D-variation modeling to capture spatial-temporal correlated information in human activity data. (2) RobustHAR then designs multi-scale spatial-temporal masked self-supervised pre-training with semantic-consistent multi-scale feature co-learning for learning robust features at different scales. (3) Finally, RobustHAR fine-tunes the pre-trained model with adaptive multi-scale feature fusion for human activity recognition. Extensive experiments on three public multi-sensor datasets demonstrate that RobustHAR outperforms existing state-of-the-art methods.

IS Journal 2024 Journal Article

A Text-Enhanced Transformer Fusion Network for Multimodal Knowledge Graph Completion

  • Jingchao Wang
  • Xiao Liu
  • Weimin Li
  • Fangfang Liu
  • Xing Wu
  • Qun Jin

Multimodal knowledge graphs (MKGs) organize multimodal facts in the form of entities and relations, and have been successfully applied to several downstream tasks. As most MKGs are incomplete, the MKG completion task has been proposed to address this problem, which aims to complete missing entities in MKGs. Most of the previous works obtain reasoning ability by capturing the correlation between target triplets and related images, but they ignore contextual semantic information and the reasoning process is not easily explainable. To address these issues, we propose a novel text-enhanced transformer fusion network, which converts the context path between head and tail entities into natural language text and fuses multimodal features from both coarse and fine granularities through a multigranularity fuser. It not only effectively enhances text semantic information but also improves the interpretability of the model by introducing paths. Experimental results on benchmark datasets demonstrate the effectiveness of our model.

NeurIPS Conference 2024 Conference Paper

Adaptive Visual Scene Understanding: Incremental Scene Graph Generation

  • Naitik Khandelwal
  • Xiao Liu
  • Mengmi Zhang

Scene graph generation (SGG) analyzes images to extract meaningful information about objects and their relationships. In the dynamic visual world, it is crucial for AI systems to continuously detect new objects and establish their relationships with existing ones. Recently, numerous studies have focused on continual learning within the domains of object detection and image recognition. However, a limited amount of research focuses on a more challenging continual learning problem in SGG. This increased difficulty arises from the intricate interactions and dynamic relationships among objects, and their associated contexts. Thus, in continual learning, SGG models are often required to expand, modify, retain, and reason scene graphs within the process of adaptive visual scene understanding. To systematically explore Continual Scene Graph Generation (CSEGG), we present a comprehensive benchmark comprising three learning regimes: relationship incremental, scene incremental, and relationship generalization. Moreover, we introduce a "Replays via Analysis by Synthesis" method named RAS. This approach leverages the scene graphs, decomposes and re-composes them to represent different scenes, and replays the synthesized scenes based on these compositional scene graphs. The replayed synthesized scenes act as a means to practice and refine proficiency in SGG in known and unknown environments. Our experimental results not only highlight the challenges of directly combining existing continual learning methods with SGG backbones but also demonstrate the effectiveness of our proposed approach, enhancing CSEGG efficiency while simultaneously preserving privacy and memory usage. All data and source code will be made public.

NeurIPS Conference 2024 Conference Paper

Attack-Resilient Image Watermarking Using Stable Diffusion

  • Lijun Zhang
  • Xiao Liu
  • Antoni V. Martin
  • Cindy X. Bearfield
  • Yuriy Brun
  • Hui Guan

Watermarking images is critical for tracking image provenance and proving ownership. With the advent of generative models, such as stable diffusion, that can create fake but realistic images, watermarking has become particularly important to make human-created images reliably identifiable. Unfortunately, the very same stable diffusion technology can remove watermarks injected using existing methods. To address this problem, we present ZoDiac, which uses a pre-trained stable diffusion model to inject a watermark into the trainable latent space, resulting in watermarks that can be reliably detected in the latent vector even when attacked. We evaluate ZoDiac on three benchmarks, MS-COCO, DiffusionDB, and WikiArt, and find that ZoDiac is robust against state-of-the-art watermark attacks, with a watermark detection rate above 98% and a false positive rate below 6.4%, outperforming state-of-the-art watermarking methods. We hypothesize that the reciprocating denoising process in diffusion models may inherently enhance the robustness of the watermark when faced with strong attacks and validate the hypothesis. Our research demonstrates that stable diffusion is a promising approach to robust watermarking, able to withstand even stable-diffusion-based attack methods. ZoDiac is open-sourced and available at https://github.com/zhanglijun95/ZoDiac.

NeurIPS Conference 2024 Conference Paper

Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency

  • Yiran Liu
  • Ke Yang
  • Zehan Qi
  • Xiao Liu
  • Yang Yu
  • ChengXiang Zhai

We present a novel statistical framework for analyzing stereotypes in large language models (LLMs) by systematically estimating the bias and variation in their generation. Current evaluation metrics in the alignment literature often overlook the randomness of stereotypes caused by the inconsistent generative behavior of LLMs. For example, this inconsistency can result in LLMs displaying contradictory stereotypes, including those related to gender or race, for identical professions across varied contexts. Neglecting such inconsistency could lead to misleading conclusions in alignment evaluations and hinder the accurate assessment of the risk of LLM applications perpetuating or amplifying social stereotypes and unfairness. This work proposes a Bias-Volatility Framework (BVF) that estimates the probability distribution function of LLM stereotypes. Specifically, since the stereotype distribution fully captures an LLM's generation variation, BVF enables the assessment of both the likelihood and extent to which its outputs are against vulnerable groups, thereby allowing for the quantification of the LLM's aggregated discrimination risk. Furthermore, we introduce a mathematical framework to decompose an LLM’s aggregated discrimination risk into two components: bias risk and volatility risk, originating from the mean and variation of LLM’s stereotype distribution, respectively. We apply BVF to assess 12 commonly adopted LLMs and compare their risk levels. Our findings reveal that: i) Bias risk is the primary cause of discrimination risk in LLMs; ii) Most LLMs exhibit significant pro-male stereotypes for nearly all careers; iii) Alignment with reinforcement learning from human feedback lowers discrimination by reducing bias, but increases volatility; iv) Discrimination risk in LLMs correlates with key socio-economic factors like professional salaries. Finally, we emphasize that BVF can also be used to assess other dimensions of generation inconsistency's impact on LLM behavior beyond stereotypes, such as knowledge mastery.

ICML Conference 2024 Conference Paper

CaRiNG: Learning Temporal Causal Representation under Non-Invertible Generation Process

  • Guangyi Chen 0002
  • Yifan Shen 0004
  • Zhenhao Chen
  • Xiangchen Song
  • Yuewen Sun
  • Weiran Yao
  • Xiao Liu
  • Kun Zhang 0001

Identifying the underlying time-delayed latent causal processes in sequential data is vital for grasping temporal dynamics and making downstream reasoning. While some recent methods can robustly identify these latent causal variables, they rely on strict assumptions about the invertible generation process from latent variables to observed data. However, these assumptions are often hard to satisfy in real-world applications containing information loss. For instance, the visual perception process translates a 3D space into 2D images, or the phenomenon of persistence of vision incorporates historical data into current perceptions. To address this challenge, we establish an identifiability theory that allows for the recovery of independent latent components even when they come from a nonlinear and non-invertible mix. Using this theory as a foundation, we propose a principled approach, CaRiNG, to learn the Causal Representation of Non-invertible Generative temporal data with identifiability guarantees. Specifically, we utilize temporal context to recover lost latent information and apply the conditions in our theory to guide the training process. Through experiments conducted on synthetic datasets, we validate that our CaRiNG method reliably identifies the causal process, even when the generation process is non-invertible. Moreover, we demonstrate that our approach considerably improves temporal understanding and reasoning in practical applications.

ICRA Conference 2024 Conference Paper

Contrastive Learning-Based Attribute Extraction Method for Enhanced Terrain Classification

  • Xiao Liu
  • Hongjin Chen
  • Haoyao Chen

The outdoor environment has many uneven surfaces that put the robot at risk of sinking or tipping over. Recognizing the type of terrain can help the robot avoid risks and choose an appropriate gait. One of the critical problems is how to extract terrain-related knowledge from sensor data collected as the robot traverses the ground. Many existing vision-based approaches are limited in directly perceiving the intrinsic properties of various terrains. The intuitive approach entails directly analyzing data recorded by the robot’s proprioceptive sensors. However, it faces challenges in being specific to certain robot leg configurations or in the lack of interpretability of the extracted features. In this paper, a terrain attribute extraction algorithm is proposed based on contrastive learning. It leverages the haptic data generated from the interaction between the robot’s legs and terrain to automatically extract terrain attributes. The results demonstrate that the attributes extracted using this method strongly correlate with the actual softness of the terrain. Furthermore, these attributes played an important role in achieving high accuracy in terrain classification tasks.

IROS Conference 2024 Conference Paper

Diff-Control: A Stateful Diffusion-based Policy for Imitation Learning

  • Xiao Liu
  • Yifan Zhou
  • Fabian Clemens Weigend
  • Shubham D. Sonawani
  • Shuhei Ikemoto
  • Heni Ben Amor

While imitation learning provides a simple and effective framework for policy learning, acquiring consistent action during robot execution remains a challenging task. Existing approaches primarily focus on either modifying the action representation at data curation stage or altering the model itself, both of which do not fully address the scalability of consistent action generation. To overcome this limitation, we introduce the Diff-Control policy, which utilizes a diffusion-based model to learn action representation from a state-space modeling viewpoint. We demonstrate that diffusion-based policies can acquire statefulness through a Bayesian formulation facilitated by ControlNet, leading to improved robustness and success rates. Our experimental results demonstrate the significance of incorporating action statefulness in policy learning, where Diff-Control shows improved performance across various tasks. Specifically, Diff-Control achieves an average success rate of 72% and 84% on stateful and dynamic tasks, respectively. Notably, Diff-Control also shows consistent performance in the presence of perturbations, outperforming other state-of-the-art methods that falter under similar conditions. Project page: https://diff-control.github.io/

IJCAI Conference 2024 Conference Paper

FBLG: A Local Graph Based Approach for Handling Dual Skewed Non-IID Data in Federated Learning

  • Yi Xu
  • Ying Li
  • Haoyu Luo
  • Xiaoliang Fan
  • Xiao Liu

In real-world situations, federated learning often needs to process non-IID (non-independent and identically distributed) data with multiple skews, causing inadequate model performance. Existing federated learning methods mainly focus on addressing the problem with a single skew of non-IID data, and hence the performance of global models can be degraded when faced with dual skewed non-IID data caused by heterogeneous label distributions and sample sizes among clients. To address the problem with dual skewed non-IID data, in this paper, we propose a federated learning algorithm based on local graphs, named FBLG. Specifically, to address the label distribution skew, we first construct a local graph based on clients' local losses and Jensen-Shannon (JS) divergence, so that similar clients can be selected for aggregation to ensure a highly consistent global model. Afterwards, to address the sample size skew, we design the objective function to favor clients with more samples, as models trained with more samples tend to carry more useful information. Experiments on four datasets with dual skewed non-IID data demonstrate that FBLG outperforms nine baseline methods and achieves up to 9% improvement in accuracy. Simultaneously, both theoretical analysis and experiments show FBLG can converge quickly.

ICRA Conference 2024 Conference Paper

iRoCo: Intuitive Robot Control From Anywhere Using a Smartwatch

  • Fabian Clemens Weigend
  • Xiao Liu
  • Shubham D. Sonawani
  • Neelesh Kumar
  • Venugopal Vasudevan
  • Heni Ben Amor

This paper introduces iRoCo (intuitive Robot Control) – a framework for ubiquitous human-robot collaboration using a single smartwatch and smartphone. By integrating probabilistic differentiable filters, iRoCo optimizes a combination of precise robot control and unrestricted user movement from ubiquitous devices. We demonstrate and evaluate the effectiveness of iRoCo in practical teleoperation and drone piloting applications. Comparative analysis shows no significant difference between task performance with iRoCo and gold-standard control systems in teleoperation tasks. Additionally, iRoCo users complete drone piloting tasks 32% faster than with a traditional remote control and report less frustration in a subjective load index questionnaire. Our findings strongly suggest that iRoCo is a promising new approach for intuitive robot control through smartwatches and smartphones from anywhere, at any time. The code is available at www.github.com/wearable-motion-capture

ICLR Conference 2024 Conference Paper

LLCP: Learning Latent Causal Processes for Reasoning-based Video Question Answer

  • Guangyi Chen 0002
  • Yuke Li
  • Xiao Liu
  • Zijian Li 0001
  • Eman Al Suradi
  • Donglai Wei 0001
  • Kun Zhang 0001

Current approaches to Video Question Answering (VideoQA) primarily focus on cross-modality matching, which is limited by the requirement for extensive data annotations and an insufficient capacity for causal reasoning (e.g., attributing accidents). To address these challenges, we introduce a causal framework for video reasoning, termed Learning Latent Causal Processes (LLCP). At the heart of LLCP lies a multivariate generative model designed to analyze the spatial-temporal dynamics of objects within events. Leveraging the inherent modularity of causal mechanisms, we train the model through self-supervised local auto-regression, eliminating the need for annotated question-answer pairs. During inference, the model is applied to answer two types of reasoning questions: accident attribution, which infers the cause from observed effects, and counterfactual prediction, which predicts the effects of counterfactual conditions given the factual evidence. In the first scenario, we identify variables that deviate from the distribution established by the learned model, signifying the root cause of accidents. In the second scenario, we replace embeddings of previous variables with counterfactual ones, enabling us to forecast potential developments. Once we have identified these cause/effect variables, natural language answers are derived through a combination of grammatical parsing and a pre-trained vision-language model. We assess the efficacy of LLCP on both synthetic and real-world data, demonstrating performance comparable to supervised methods despite our framework using no paired textual annotations.

TIST Journal 2024 Journal Article

MGRR-Net: Multi-level Graph Relational Reasoning Network for Facial Action Unit Detection

  • Xuri Ge
  • Joemon M. Jose
  • Songpei Xu
  • Xiao Liu
  • Hu Han

The Facial Action Coding System (FACS) encodes the action units (AUs) in facial images, which has attracted extensive research attention due to its wide use in facial expression analysis. Many methods that perform well on automatic facial action unit (AU) detection primarily focus on modeling various AU relations between corresponding local muscle areas or mining global attention-aware facial features; however, they neglect the dynamic interactions among local-global features. We argue that encoding AU features from just one perspective may not capture the rich contextual information between regional and global face features, as well as the detailed variability across AUs, because of the diversity in expression and individual characteristics. In this article, we propose a novel Multi-level Graph Relational Reasoning Network (termed MGRR-Net) for facial AU detection. Each layer of MGRR-Net performs multi-level (i.e., region-level, pixel-wise, and channel-wise) feature learning. On the one hand, region-level feature learning from local face patch features via a graph neural network can encode the correlation across different AUs. On the other hand, pixel-wise and channel-wise feature learning via graph attention networks (GAT) enhances the discrimination ability of AU features by adaptively recalibrating feature responses of pixels and channels from global face features. A hierarchical fusion strategy combines features from the three levels with gated fusion cells to improve AU discriminative ability. Extensive experiments on the DISFA and BP4D AU datasets show that the proposed approach outperforms state-of-the-art methods.

NeurIPS Conference 2024 Conference Paper

Not All Tokens Are What You Need for Pretraining

  • Zhenghao Lin
  • Zhibin Gou
  • Yeyun Gong
  • Xiao Liu
  • Yelong Shen
  • Ruochen Xu
  • Chen Lin
  • Yujiu Yang

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that ''Not all tokens in a corpus are equally important for language model training''. Our initial analysis examines token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring training tokens using a reference model, and then training the language model with a focused loss on tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when continually pretrained on 80B general tokens, Rho-1 achieves a 6.8% average enhancement across 15 diverse tasks, increasing both the data efficiency and performance of language model pre-training.
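
The token-selection idea — score each token by how much the training loss exceeds a reference model's loss, then train only on the top-scoring tokens — can be sketched as follows. This is a simplified illustration; the `keep_ratio` value and the exact selection rule are our assumptions, not the paper's procedure:

```python
def select_tokens(train_losses, ref_losses, keep_ratio=0.6):
    """Selective-language-modeling sketch: keep the tokens whose training
    loss most exceeds a reference model's loss ('excess loss'), on the
    intuition that these tokens are the most useful to learn from."""
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    # indices of the k tokens with the largest excess loss
    keep = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)[:k]
    return sorted(keep)

# Tokens 1 and 3 show the largest gap to the reference model, so only
# they would receive the next-token prediction loss this step.
idx = select_tokens([2.0, 5.0, 1.0, 4.0], [1.9, 1.0, 1.1, 1.5], keep_ratio=0.5)
```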

NeurIPS Conference 2024 Conference Paper

To Learn or Not to Learn, That is the Question — A Feature-Task Dual Learning Model of Perceptual Learning

  • Xiao Liu
  • Muyang Lyu
  • Cong Yu
  • Si Wu

Perceptual learning refers to the practices through which participants learn to improve their performance in perceiving sensory stimuli. Two seemingly conflicting phenomena of specificity and transfer have been widely observed in perceptual learning. Here, we propose a dual-learning model to reconcile these two phenomena. The model consists of two learning processes. One is task-based learning, which is fast and enables the brain to adapt to a task rapidly by using existing feature representations. The other is feature-based learning, which is slow and enables the brain to improve feature representations to match the statistical change of the environment. Associated with different training paradigms, the interactions between these two learning processes induce the rich phenomena of perceptual learning. Specifically, in the training paradigm where the same stimulus condition is presented excessively, feature-based learning is triggered, which incurs specificity, while in the paradigm where the stimulus condition varies during the training, task-based learning dominates to induce the transfer effect. As the number of training sessions under the same stimulus condition increases, a transition from transfer to specificity occurs. We demonstrate that the dual-learning model can account for both the specificity and transfer phenomena observed in classical psychophysical experiments. We hope that this study gives us insight into understanding how the brain balances the accomplishment of a new task and the consumption of learning effort.

JBHI Journal 2023 Journal Article

3D Intracranial Aneurysm Classification and Segmentation via Unsupervised Dual-Branch Learning

  • Di Shao
  • Xuequan Lu
  • Xiao Liu

Intracranial aneurysms are common nowadays, and detecting them intelligently is of great significance in digital health. Whereas most existing deep learning research has focused on medical images in a supervised way, we introduce an unsupervised method for the detection of intracranial aneurysms based on 3D point cloud data. In particular, our method consists of two stages: unsupervised pre-training and downstream tasks. As for the former, the main idea is to pair each point cloud with its jittered counterpart and maximise their correspondence. Then we design a dual-branch contrastive network with an encoder for each branch and a subsequent common projection head. As for the latter, we design simple networks for supervised classification and segmentation training. Experiments on the public dataset (IntrA) show that our unsupervised method achieves comparable or even better performance than some state-of-the-art supervised techniques, and it is most prominent in the detection of aneurysmal vessels. Experiments on ModelNet-40 also show that our method achieves an accuracy of 90.79%, which outperforms existing state-of-the-art unsupervised models.

NeurIPS Conference 2023 Conference Paper

A Recurrent Neural Circuit Mechanism of Temporal-scaling Equivariant Representation

  • Junfeng Zuo
  • Xiao Liu
  • Ying Nian Wu
  • Si Wu
  • Wenhao Zhang

Time perception is critical in our daily life. An important feature of time perception is temporal scaling (TS): the ability to generate temporal sequences (e.g., motor actions) at different speeds. However, the mathematical principle underlying temporal scaling in the brain's recurrent circuits remains largely unknown. To shed light on this, the present study investigates temporal scaling from the Lie group point of view. We propose a canonical nonlinear recurrent circuit dynamics, modeled as a continuous attractor network, whose neuronal population responses embed a temporal sequence that is TS equivariant. Furthermore, we find that the TS group operators can be explicitly represented by a control input fed into the recurrent circuit, where the input gain determines the temporal scaling factor (group parameter), and the spatial offset between the control input and the network state gives rise to the generator. The neuronal responses in the recurrent circuit are also consistent with experimental findings. We illustrate that the recurrent circuit can drive a feedforward circuit to generate complex temporal sequences at different time scales, even in the case of negative time scaling (''time reversal''). Our work for the first time analytically links the abstract temporal scaling group and concrete neural circuit dynamics.

NeurIPS Conference 2023 Conference Paper

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation

  • Tong Wu
  • Zhihao Fan
  • Xiao Liu
  • Hai-Tao Zheng
  • Yeyun Gong
  • Yelong Shen
  • Jian Jiao
  • Juntao Li

Diffusion models have gained significant attention in the realm of image generation due to their exceptional performance. Their success has recently been expanded to text generation via generating all tokens within a sequence concurrently. However, natural language exhibits a far more pronounced sequential dependency in comparison to images, and the majority of existing language models are trained with a left-to-right auto-regressive approach. To account for the inherent sequential characteristic of natural language, we introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved through employing a dynamic number of denoising steps that vary based on token position. This results in tokens on the left undergoing fewer denoising steps than those on the right, thereby enabling them to generate earlier and subsequently influence the generation of tokens on the right. In a series of experiments on various text generation tasks, including text summarization, machine translation, and common sense generation, AR-Diffusion clearly demonstrates its superiority over existing diffusion language models, and it can be $100\times\sim600\times$ faster while achieving comparable results. Our code is available at https://github.com/microsoft/ProphetNet/tree/master/AR-diffusion.
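
The position-dependent denoising idea can be illustrated with a toy timestep assignment. This is a hedged sketch: the linear `speed` schedule and the clipping are illustrative choices of ours, not AR-Diffusion's exact schedule:

```python
def position_timesteps(global_step, n_tokens, total_steps, speed=2):
    """Toy position-dependent denoising schedule: at each global decoding
    step, tokens further left sit at lower (less noisy) timesteps, so they
    finish denoising earlier and can condition the tokens to their right."""
    ts = []
    for pos in range(n_tokens):
        t = total_steps - speed * global_step + pos  # left positions hit 0 first
        ts.append(max(0, min(total_steps, t)))       # clip to [0, total_steps]
    return ts

early = position_timesteps(global_step=0, n_tokens=4, total_steps=10)
late = position_timesteps(global_step=5, n_tokens=4, total_steps=10)
# later in decoding, the leftmost token is fully denoised (t=0) while
# tokens to its right are still at higher noise levels
```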

JBHI Journal 2023 Journal Article

Contrastive Learning for Prediction of Alzheimer's Disease Using Brain 18F-FDG PET

  • Yonglin Chen
  • Huabin Wang
  • Gong Zhang
  • Xiao Liu
  • Wei Huang
  • Xianjun Han
  • Xuejun Li
  • Melanie Martin

Brain 18F-FDG PET images are widely used for effectively predicting Alzheimer's disease (AD). However, the data volume of PET is usually insufficient, which makes it difficult to train an accurate AD prediction network. Furthermore, PET images are noisy with a low signal-to-noise ratio, and the feature (metabolic abnormality) used for predicting AD is not always obvious. Therefore, a contrastive learning method is proposed to address these challenges inherent to PET images. Firstly, the slices of a 3D PET image are amplified by cropping the image of anchors (i.e., an augmented version of the same image) to generate extended training data. Meanwhile, a contrastive loss is adopted to enlarge inter-class feature distances and reduce intra-class feature differences using subject fuzzy labels as supervised information. Secondly, we construct a double convolutional hybrid attention module to enhance the network's ability to learn different perceptual domains, where two convolutional layers with different convolutional kernels ($7\times 7$ and $5\times 5$) are constructed. Moreover, we recommend a diagnosis mechanism that analyzes the consistency of the predicted results for PET slices along with clinical neuropsychological assessment to achieve a better AD diagnosis. The experimental results show that the proposed method outperforms the state of the art on brain 18F-FDG PET images, demonstrating its advantage in effectively predicting AD.

IS Journal 2023 Journal Article

Effective Interpretable Policy Distillation via Critical Experience Point Identification

  • Xiao Liu
  • Shuyang Liu
  • Bo An
  • Yang Gao
  • Shangdong Yang
  • Wenbin Li

Interpretable policy distillation aims to imitate a deep reinforcement learning (DRL) policy with a self-explainable model. However, the distilled policy usually does not generalize well to complex tasks. To investigate this phenomenon, we examine the experience pools of DRL tasks and find that these interactive experience distributions are heavy-tailed. However, this critical issue is largely ignored by existing approaches, and, thus, they do not fully utilize the less frequent but very critical experience points. To address this issue, we propose characterizing decision boundaries via minimum experience retention to deal with the heavy-tailed experience distributions. Our method identifies critical experience points that are close to the model's decision boundaries, and such experience points are more critical because they portray the prerequisites for a model to take an action. As a result, our method distills the DRL policy into a self-explainable structure without a neural structure or ambiguous intermediate parameters. Through experiments on six games, we show that our method outperforms the state-of-the-art baselines in cumulative rewards, stability, and faithfulness.

ECAI Conference 2023 Conference Paper

Efficient Information Modulation Network for Image Super-Resolution

  • Xiao Liu
  • Xiangyu Liao
  • Xiuya Shi
  • Linbo Qing
  • Chao Ren 0002

Recent research has shown that the success of Transformers comes from their macro-level framework and advanced components, not just their self-attention (SA) mechanism. Comparable results can be obtained by replacing SA with spatial pooling, shifting, MLPs, the Fourier transform, or a constant matrix, all of which have spatial information encoding capability like SA. In light of these findings, this work focuses on combining efficient spatial information encoding technology with the superior macro architectures of Transformers. We rethink spatial convolution to achieve more efficient encoding of spatial features and dynamic modulation value representations via convolutional modulation techniques. Large-kernel convolution and the Hadamard product are utilized in the proposed Multi-orders Long-range convolutional modulation (MOLRCM) layer to imitate the implementation of SA. Moreover, the MOLRCM layer also achieves long-range correlations and self-adaptation behavior, similar to SA, with linear complexity. On the other hand, we also address the sub-optimality of vanilla feed-forward networks (FFN) by introducing spatial awareness and locality, improving feature diversity, and regulating information flow between layers in the proposed Spatial Awareness Dynamic Feature Flow Modulation (SADFFM) layer. Experimental results show that our proposed efficient information modulation network (EIMN) performs better both quantitatively and qualitatively. Codes and supplementary materials: https://github.com/liux520/EIMN.

IROS Conference 2023 Conference Paper

Enhancing State Estimation in Robots: A Data-Driven Approach with Differentiable Ensemble Kalman Filters

  • Xiao Liu
  • Geoffrey Clark
  • Joseph Campbell
  • Yifan Zhou
  • Heni Ben Amor

This paper introduces a novel state estimation framework for robots using differentiable ensemble Kalman filters (DEnKF). DEnKF is a reformulation of the traditional ensemble Kalman filter that employs stochastic neural networks to model the process noise implicitly. Our work is an extension of previous research on differentiable filters, which has provided a strong foundation for our modular and end-to-end differentiable framework. This framework enables each component of the system to function independently, leading to improved flexibility and versatility in implementation. Through a series of experiments, we demonstrate the flexibility of this model across a diverse set of real-world tracking tasks, including visual odometry and robot manipulation. Moreover, we show that our model effectively handles noisy observations, is robust in the absence of observations, and outperforms state-of-the-art differentiable filters in terms of error metrics. Specifically, we observe a significant improvement of at least 59% in translational error when using DEnKF with noisy observations. Our results underscore the potential of DEnKF in advancing state estimation for robotics. Code for DEnKF is available at https://github.com/ir-lab/DEnKF
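
For background, one correction step of a classical stochastic ensemble Kalman filter looks as follows; DEnKF differs by modeling the process and noise with learned neural networks end-to-end, so this is the textbook baseline it reformulates, not the paper's implementation:

```python
import numpy as np

def enkf_update(ensemble, H, z, R, rng):
    """One stochastic EnKF correction step.
    ensemble: (N, d) state ensemble, H: (m, d) observation matrix,
    z: (m,) observation, R: (m, m) observation-noise covariance."""
    N = ensemble.shape[0]
    X = ensemble - ensemble.mean(axis=0)       # state anomalies
    HX = ensemble @ H.T                        # predicted observations
    Y = HX - HX.mean(axis=0)                   # observation anomalies
    Pxy = X.T @ Y / (N - 1)                    # state-obs cross-covariance
    Pyy = Y.T @ Y / (N - 1) + R                # innovation covariance
    K = Pxy @ np.linalg.inv(Pyy)               # Kalman gain
    # perturb the observation independently for each ensemble member
    z_pert = z + rng.multivariate_normal(np.zeros(len(z)), R, size=N)
    return ensemble + (z_pert - HX) @ K.T

rng = np.random.default_rng(0)
ens = rng.normal(5.0, 1.0, size=(500, 1))      # prior ensemble around 5
updated = enkf_update(ens, np.eye(1), np.array([0.0]), 0.5 * np.eye(1), rng)
# the posterior ensemble mean moves toward the observation at 0,
# and the ensemble spread shrinks
```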

NeurIPS Conference 2023 Conference Paper

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

  • Jiazheng Xu
  • Xiao Liu
  • Yuchen Wu
  • Yuxuan Tong
  • Qinkai Li
  • Ming Ding
  • Jie Tang
  • Yuxiao Dong

We present a comprehensive solution to learn and improve text-to-image models from human preference feedback. To begin with, we build ImageReward---the first general-purpose text-to-image human preference reward model---to effectively encode human preferences. Its training is based on our systematic annotation pipeline including rating and ranking, which collects 137k expert comparisons to date. In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. On top of it, we propose Reward Feedback Learning (ReFL), a direct tuning algorithm to optimize diffusion models against a scorer. Both automatic and human evaluation support ReFL's advantages over compared methods. All code and datasets are provided at \url{https://github.com/THUDM/ImageReward}.
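
Reward models trained from ranked comparisons typically use a pairwise preference loss; a minimal sketch (Bradley-Terry style, which is the common choice — the exact objective ImageReward uses may differ in detail):

```python
import math

def ranking_loss(preferred_scores, rejected_scores):
    """Pairwise preference loss over (preferred, rejected) score pairs:
    -log sigmoid(r_preferred - r_rejected), averaged over pairs."""
    total = 0.0
    for rw, rl in zip(preferred_scores, rejected_scores):
        total += -math.log(1.0 / (1.0 + math.exp(-(rw - rl))))
    return total / len(preferred_scores)

# A reward model that scores the human-preferred image higher gets a
# low loss; one that inverts the ranking gets a high loss.
good = ranking_loss([2.0, 1.5], [0.0, -1.0])
bad = ranking_loss([0.0, -1.0], [2.0, 1.5])
```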

AAAI Conference 2023 Conference Paper

Imperceptible Adversarial Attack via Invertible Neural Networks

  • Zihan Chen
  • Ziyue Wang
  • Jun-Jie Huang
  • Wentao Zhao
  • Xiao Liu
  • Dejian Guan

Adding perturbations by utilizing auxiliary gradient information or discarding existing details of the benign images are two common approaches for generating adversarial examples. Though visual imperceptibility is the desired property of adversarial examples, conventional adversarial attacks still generate traceable adversarial perturbations. In this paper, we introduce a novel Adversarial Attack via Invertible Neural Networks (AdvINN) method to produce robust and imperceptible adversarial examples. Specifically, AdvINN fully takes advantage of the information preservation property of Invertible Neural Networks and thereby generates adversarial examples by simultaneously adding class-specific semantic information of the target class and dropping discriminant information of the original class. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that the proposed AdvINN method can produce less perceptible adversarial images than state-of-the-art methods, and AdvINN yields more robust adversarial examples with high confidence compared to other adversarial attacks. Code is available at https://github.com/jjhuangcs/AdvINN.

AAAI Conference 2023 Conference Paper

Learning Explicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning via Polarization Policy Gradient

  • Wubing Chen
  • Wenbin Li
  • Xiao Liu
  • Shangdong Yang
  • Yang Gao

Cooperative multi-agent policy gradient (MAPG) algorithms have recently attracted wide attention and are regarded as a general scheme for the multi-agent system. Credit assignment plays an important role in MAPG and can induce cooperation among multiple agents. However, most MAPG algorithms cannot achieve good credit assignment because of the game-theoretic pathology known as centralized-decentralized mismatch. To address this issue, this paper presents a novel method, Multi-Agent Polarization Policy Gradient (MAPPG). MAPPG takes a simple but efficient polarization function to transform the optimal consistency of joint and individual actions into easily realized constraints, thus enabling efficient credit assignment in MAPPG. Theoretically, we prove that individual policies of MAPPG can converge to the global optimum. Empirically, we evaluate MAPPG on the well-known matrix game and differential game, and verify that MAPPG can converge to the global optimum for both discrete and continuous action spaces. We also evaluate MAPPG on a set of StarCraft II micromanagement tasks and demonstrate that MAPPG outperforms the state-of-the-art MAPG algorithms.

IROS Conference 2023 Conference Paper

Learning Soft Robot Dynamics Using Differentiable Kalman Filters and Spatio-Temporal Embeddings

  • Xiao Liu
  • Shuhei Ikemoto
  • Yuhei Yoshimitsu
  • Heni Ben Amor

This paper introduces a novel approach for modeling the dynamics of soft robots, utilizing a differentiable filter architecture. The proposed approach enables end-to-end training to learn system dynamics, noise characteristics, and temporal behavior of the robot. A novel spatio-temporal embedding process is discussed to handle observations with varying sensor placements and sampling frequencies. The efficacy of this approach is demonstrated on a tensegrity robot arm by learning end-effector dynamics from demonstrations with complex bending motions. The model is proven to be robust against missing modalities, diverse sensor placement, and varying sampling rates. Additionally, the proposed framework is shown to identify physical interactions with humans during motion. The utilization of a differentiable filter presents a novel solution to the difficulties of modeling soft robot dynamics. Our approach shows substantial improvement in accuracy compared to state-of-the-art filtering methods, with at least a 24% reduction in mean absolute error (MAE) observed. Furthermore, the predicted end-effector positions show an average MAE of 25.77 mm from the ground truth, highlighting the advantage of our approach. The code is available at https://github.com/ir-lab/soft_robot_DEnKF.

NeurIPS Conference 2022 Conference Paper

AutoMTL: A Programming Framework for Automating Efficient Multi-Task Learning

  • Lijun Zhang
  • Xiao Liu
  • Hui Guan

Multi-task learning (MTL) jointly learns a set of tasks by sharing parameters among tasks. It is a promising approach for reducing storage costs while improving task accuracy for many computer vision tasks. The effective adoption of MTL faces two main challenges. The first challenge is to determine what parameters to share across tasks to optimize for both memory efficiency and task accuracy. The second challenge is to automatically apply MTL algorithms to an arbitrary CNN backbone without requiring time-consuming manual re-implementation and significant domain expertise. This paper addresses the challenges by developing the first programming framework AutoMTL that automates efficient MTL model development for vision tasks. AutoMTL takes as inputs an arbitrary backbone convolutional neural network (CNN) and a set of tasks to learn, and automatically produces a multi-task model that achieves high accuracy and small memory footprint simultaneously. Experiments on three popular MTL benchmarks (CityScapes, NYUv2, Tiny-Taskonomy) demonstrate the effectiveness of AutoMTL over state-of-the-art approaches as well as the generalizability of AutoMTL across CNNs. AutoMTL is open-sourced and available at https://github.com/zhanglijun95/AutoMTL.

NeurIPS Conference 2021 Conference Paper

Learning with Noisy Correspondence for Cross-modal Matching

  • Zhenyu Huang
  • Guocheng Niu
  • Xiao Liu
  • Wenbiao Ding
  • Xinyan Xiao
  • Hua Wu
  • Xi Peng

Cross-modal matching, which aims to establish the correspondence between two different modalities, is fundamental to a variety of tasks such as cross-modal retrieval and vision-and-language understanding. Although a huge number of cross-modal matching methods have been proposed and achieved remarkable progress in recent years, almost all of these methods implicitly assume that the multimodal training data are correctly aligned. In practice, however, such an assumption is extremely expensive or even impossible to satisfy. Based on this observation, we reveal and study a latent and challenging direction in cross-modal matching, named noisy correspondence, which could be regarded as a new paradigm of noisy labels. Different from traditional noisy labels, which mainly refer to errors in category labels, our noisy correspondence refers to mismatched paired samples. To solve this new problem, we propose a novel method for learning with noisy correspondence, named Noisy Correspondence Rectifier (NCR). In brief, NCR divides the data into clean and noisy partitions based on the memorization effect of neural networks and then rectifies the correspondence via an adaptive prediction model in a co-teaching manner. To verify the effectiveness of our method, we conduct experiments using image-text matching as a showcase. Extensive experiments on Flickr30K, MS-COCO, and Conceptual Captions verify the effectiveness of our method. The code can be accessed at www.pengxi.me.
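
The clean/noisy partition exploits the memorization effect: correctly matched pairs tend to show small loss early in training. A stdlib stand-in using a 1-D two-means split on per-sample losses (NCR itself fits a mixture model, so this is a simplified illustration):

```python
def split_by_loss(losses, iters=20):
    """Partition samples into 'clean' (small loss) vs. 'noisy' (large loss)
    via 1-D two-means clustering on per-sample losses - a simple stand-in
    for the mixture-model fit used with the memorization effect."""
    lo, hi = min(losses), max(losses)
    for _ in range(iters):
        clean = [x for x in losses if abs(x - lo) <= abs(x - hi)]
        noisy = [x for x in losses if abs(x - lo) > abs(x - hi)]
        lo = sum(clean) / len(clean)   # small-loss cluster center
        hi = sum(noisy) / len(noisy)   # large-loss cluster center
    thresh = (lo + hi) / 2
    return [i for i, x in enumerate(losses) if x <= thresh]

# Samples 0, 1, and 3 have small losses and land in the clean partition.
clean_idx = split_by_loss([0.1, 0.2, 2.5, 0.15, 3.0])
```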

JBHI Journal 2021 Journal Article

SCNET: A Novel UGI Cancer Screening Framework Based on Semantic-Level Multimodal Data Fusion

  • Shuai Ding
  • Hui Huang
  • Zhenmin Li
  • Xiao Liu
  • Shanlin Yang

Upper gastrointestinal (UGI) cancer has been identified as one of the ten most common causes of cancer deaths globally. UGI cancer screening is critical to improving the survival rate of UGI cancer patients. While many approaches to UGI cancer screening rely on single-modality data such as gastroscope imaging, limited studies have been dedicated to UGI cancer screening exploiting multisource and multimodal medical data, which could potentially lead to improved screening results. In this paper, we propose the semantic-level cancer-screening network (SCNET), a framework for UGI cancer screening based on semantic-level multimodal upper gastrointestinal data fusion. Specifically, the proposed SCNET consists of a gastrointestinal image recognition flow and a textual medical record processing flow. High-level features of upper gastrointestinal data are extracted by identifying effective feature channels according to the correlation between the textual features and the spatial structure of the image features. The final screening results are obtained after the data fusion step. The experimental results show that our approach improves over state-of-the-art ones by 4.01% on average. The source code of SCNET is available at https://github.com/netflymachine/SCNET.

AAAI Conference 2020 Short Paper

Bayesian Adversarial Attack on Graph Neural Networks (Student Abstract)

  • Xiao Liu
  • Jing Zhao
  • Shiliang Sun

Adversarial attack on graph neural networks (GNNs) is distinctive in that it often jointly trains the available nodes to generate a graph as an adversarial example. Existing attack approaches usually assume that the entire training set is available, which may be impractical. In this paper, we propose a novel Bayesian adversarial attack approach based on projected gradient descent optimization, called Bayesian PGD attack, which generates more general adversarial examples than deterministic attack approaches. The adversarial examples generated by our approach, using the same partial dataset as deterministic attack approaches, make the GNN have a higher misclassification rate on graph node classification. Specifically, in our approach, the edge perturbation Z used for generating adversarial examples is viewed as a random variable with a scale constraint, and the optimization target of the edge perturbation is to maximize the KL divergence between its true posterior distribution p(Z|D) and its approximate variational distribution qθ(Z). We experimentally find that the attack performance decreases with the reduction of available nodes, and the effect of attacks using different nodes varies greatly, especially when the number of nodes is small. Through experimental comparison with state-of-the-art attack approaches on GNNs, our approach is demonstrated to have better and more robust attack performance.

IJCAI Conference 2020 Conference Paper

Dialogue State Induction Using Neural Latent Variable Models

  • Qingkai Min
  • Libo Qin
  • Zhiyang Teng
  • Xiao Liu
  • Yue Zhang

Dialogue state modules are a useful component in a task-oriented dialogue system. Traditional methods find dialogue states by manually labeling training corpora, upon which neural models are trained. However, the labeling process can be costly, slow, error-prone, and more importantly, cannot cover the vast range of domains in real-world dialogues for customer service. We propose the task of dialogue state induction, building two neural latent variable models that mine dialogue states automatically from unlabeled customer service dialogue records. Results show that the models can effectively find meaningful dialogue states. In addition, equipped with induced dialogue states, a state-of-the-art dialogue system gives better performance compared with not using a dialogue state module.

AAAI Conference 2020 Conference Paper

Dynamic Instance Normalization for Arbitrary Style Transfer

  • Yongcheng Jing
  • Xiao Liu
  • Yukang Ding
  • Xinchao Wang
  • Errui Ding
  • Mingli Song
  • Shilei Wen

Prior normalization methods rely on affine transformations to produce arbitrary image style transfers, of which the parameters are computed in a pre-defined way. Such manually-defined nature eventually results in the high-cost and shared encoders for both style and content encoding, making style transfer systems cumbersome to be deployed in resource-constrained environments like on the mobile-terminal side. In this paper, we propose a new and generalized normalization module, termed as Dynamic Instance Normalization (DIN), that allows for flexible and more efficient arbitrary style transfers. Comprising an instance normalization and a dynamic convolution, DIN encodes a style image into learnable convolution parameters, upon which the content image is stylized. Unlike conventional methods that use shared complex encoders to encode content and style, the proposed DIN introduces a sophisticated style encoder, yet comes with a compact and lightweight content encoder for fast inference. Experimental results demonstrate that the proposed approach yields very encouraging results on challenging style patterns and, to our best knowledge, for the first time enables an arbitrary style transfer using MobileNet-based lightweight architecture, leading to a reduction factor of more than twenty in computational cost as compared to existing approaches. Furthermore, the proposed DIN provides flexible support for state-of-the-art convolutional operations, and thus triggers novel functionalities, such as uniform-stroke placement for non-natural images and automatic spatial-stroke control.
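
The core DIN operation — instance-normalize the content features, then apply a convolution whose parameters come from the style image — can be sketched with a 1x1 dynamic convolution. The shapes and the "predicted" weights here are illustrative assumptions; the paper's style encoder (not shown) would produce them:

```python
import numpy as np

def dynamic_instance_norm(content, weight, bias, eps=1e-5):
    """DIN sketch: per-channel instance normalization of content features,
    followed by a 1x1 convolution whose weight/bias were predicted from the
    style image. content: (C, H, W); weight: (C_out, C); bias: (C_out,)."""
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    normed = (content - mean) / (std + eps)          # instance normalization
    # a 1x1 dynamic convolution is channel mixing at every spatial position
    return np.einsum('oc,chw->ohw', weight, normed) + bias[:, None, None]

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8, 8))                 # content feature map
w, b = rng.normal(size=(4, 4)), rng.normal(size=4)   # stand-ins for style-predicted params
out = dynamic_instance_norm(content, w, b)
```

Because the convolution parameters change per style image, no retraining or shared heavyweight encoder is needed at stylization time.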

AAAI Conference 2020 Conference Paper

Layerwise Sparse Coding for Pruned Deep Neural Networks with Extreme Compression Ratio

  • Xiao Liu
  • Wenbin Li
  • Jing Huo
  • Lili Yao
  • Yang Gao

Deep neural network compression is important and increasingly developed, especially in resource-constrained environments such as autonomous drones and wearable devices. Basically, we can easily and largely reduce the number of weights of a trained deep model by adopting a widely used model compression technique, e.g., pruning. In this way, two kinds of data are usually preserved for the compressed model, i.e., non-zero weights and meta-data, where the meta-data is employed to help encode and decode the non-zero weights. Although we can obtain an ideally small number of non-zero weights through pruning, existing sparse matrix coding methods still need a much larger amount of meta-data (possibly several times larger than the non-zero weights), which becomes a severe bottleneck for deploying very deep models. To tackle this issue, we propose a layerwise sparse coding (LSC) method to maximize the compression ratio by drastically reducing the amount of meta-data. We first divide a sparse matrix into multiple small blocks and remove the zero blocks, and then propose a novel signed relative index (SRI) algorithm to encode the remaining non-zero blocks (with much less meta-data). In addition, the proposed LSC performs parallel matrix multiplication without full decoding, while traditional methods cannot. Through extensive experiments, we demonstrate that LSC achieves substantial gains in pruned DNN compression (e.g., a 51.03x compression ratio on ADMM-Lenet) and inference computation (i.e., time reduction and much lower memory bandwidth) over state-of-the-art baselines.
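The block-pruning-plus-relative-index idea can be illustrated with a toy round-trip codec. The paper's actual signed relative index (SRI) encoding is not reproduced here; this hypothetical `encode_blocks`/`decode_blocks` pair only shows the generic scheme of keeping non-zero blocks and storing their positions as deltas from the previous kept block, which is what keeps the meta-data small.

```python
import numpy as np

def encode_blocks(matrix, bs):
    # Split into bs x bs blocks, keep only non-zero blocks, and store each
    # kept block's linear index relative to the previous kept block
    rows, cols = matrix.shape
    blocks, deltas, prev = [], [], -1
    for idx in range((rows // bs) * (cols // bs)):
        r, c = divmod(idx, cols // bs)
        block = matrix[r*bs:(r+1)*bs, c*bs:(c+1)*bs]
        if np.any(block):
            deltas.append(idx - prev)    # relative index (meta-data)
            blocks.append(block.copy())  # non-zero weights
            prev = idx
    return deltas, blocks

def decode_blocks(deltas, blocks, shape, bs):
    # Rebuild the dense matrix by accumulating the deltas back into
    # absolute block indices
    out, idx = np.zeros(shape), -1
    for d, block in zip(deltas, blocks):
        idx += d
        r, c = divmod(idx, shape[1] // bs)
        out[r*bs:(r+1)*bs, c*bs:(c+1)*bs] = block
    return out
```

Deltas between consecutive non-zero blocks are usually small numbers, so they can be stored in far fewer bits than absolute coordinates — the general intuition behind reducing meta-data.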

TIST Journal 2019 Journal Article

A Trust Computing-based Security Routing Scheme for Cyber Physical Systems

  • Yuxin Liu
  • Xiao Liu
  • Anfeng Liu
  • Neal N. Xiong
  • Fang Liu

Security is a pivotal issue for the development of Cyber Physical Systems (CPS). Trusted computing in CPS comprises complete protection mechanisms, such as hardware, firmware, and software, whose combination is responsible for enforcing a system security policy. A Trust Detection-based Secured Routing (TDSR) scheme is proposed to establish secure routes from source nodes to the data center in a malicious environment and thereby ensure network security. In the TDSR scheme, sensor nodes along the routing path send detection routes to assess the trustworthiness of relay nodes, and data packets are then routed to the sink securely through trustworthy nodes. Detection routing is executed only on nodes with abundant energy, so the network lifetime is not affected. Performance evaluation through simulation is carried out for routing success ratio, compromised node detection ratio, and detection routing overhead. The experimental results show that the TDSR scheme improves performance compared with previous schemes.

AAAI Conference 2019 Conference Paper

DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing

  • Xiao Liu
  • Xiaoting Li
  • Rupesh Prajapati
  • Dinghao Wu

Compilers are among the most fundamental programming tools for building software. However, production compilers remain buggy. Fuzz testing is often leveraged with newly generated or mutated inputs in order to find new bugs or security vulnerabilities. In this paper, we propose a grammar-based fuzzing tool called DEEPFUZZ. Based on a generative Sequence-to-Sequence model, DEEPFUZZ automatically and continuously generates well-formed C programs. We use this set of new C programs to fuzz off-the-shelf C compilers, e.g., GCC and Clang/LLVM. We present a detailed case study to analyze the success rate and coverage improvement of the generated C programs for fuzz testing. We analyze the performance of DEEPFUZZ with three types of sampling methods as well as three types of generation strategies. Consequently, DEEPFUZZ improved the testing efficacy with regard to line, function, and branch coverage. In our preliminary study, we found and reported 8 bugs in GCC, all of which are actively being addressed by developers.

AAAI Conference 2019 Conference Paper

Distant Supervision for Relation Extraction with Linear Attenuation Simulation and Non-IID Relevance Embedding

  • Changsen Yuan
  • Heyan Huang
  • Chong Feng
  • Xiao Liu
  • Xiaochi Wei

Distant supervision for relation extraction is an efficient method to reduce labor costs and has been widely used to seek novel relational facts in large corpora; it can be cast as a multi-instance multi-label problem. However, existing distant supervision methods struggle to select the important words in a sentence and to extract the valid sentences in a bag. To address these problems, we propose a novel approach in this paper. Firstly, we propose a linear attenuation simulation to reflect the importance of words in a sentence with respect to the distances between entities and words. Secondly, we propose a non-independent and identically distributed (non-IID) relevance embedding to capture the relevance of sentences in the bag. Our method can not only capture complex information about words concerning hidden relations, but also express the mutual information of instances in the bag. Extensive experiments on a benchmark dataset have well validated the effectiveness of the proposed method.

NeurIPS Conference 2019 Conference Paper

Push-pull Feedback Implements Hierarchical Information Retrieval Efficiently

  • Xiao Liu
  • Xiaolong Zou
  • Zilong Ji
  • Gengshuo Tian
  • Yuanyuan Mi
  • Tiejun Huang
  • K. Y. Michael Wong
  • Si Wu

Experimental data has revealed that in addition to feedforward connections, there exist abundant feedback connections in a neural pathway. Although the importance of feedback in neural information processing has been widely recognized in the field, the detailed mechanism of how it works remains largely unknown. Here, we investigate the role of feedback in hierarchical information retrieval. Specifically, we consider a hierarchical network storing the hierarchical categorical information of objects, and information retrieval goes from rough to fine, aided by dynamical push-pull feedback from higher to lower layers. We elucidate that the push (positive) and pull (negative) feedbacks suppress the interferences due to neural correlations between different and the same categories, respectively, and their joint effect improves retrieval performance significantly. Our model agrees with the push-pull phenomenon observed in neural data and sheds light on our understanding of the role of feedback in neural information processing.

AAAI Conference 2019 Conference Paper

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

  • Dongliang He
  • Xiang Zhao
  • Jizhou Huang
  • Fu Li
  • Xiao Liu
  • Shilei Wen

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate the task as a sequential decision-making problem by learning an agent that regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, which shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.

IJCAI Conference 2019 Conference Paper

Relation-Aware Entity Alignment for Heterogeneous Knowledge Graphs

  • Yuting Wu
  • Xiao Liu
  • Yansong Feng
  • Zheng Wang
  • Rui Yan
  • Dongyan Zhao

Entity alignment is the task of linking entities with the same real-world identity from different knowledge graphs (KGs), which has been recently dominated by embedding-based methods. Such approaches work by learning KG representations so that entity alignment can be performed by measuring the similarities between entity embeddings. While promising, prior works in the field often fail to properly capture complex relation information that commonly exists in multi-relational KGs, leaving much room for improvement. In this paper, we propose a novel Relation-aware Dual-Graph Convolutional Network (RDGCN) to incorporate relation information via attentive interactions between the knowledge graph and its dual relation counterpart, and further capture neighboring structures to learn better entity representations. Experiments on three real-world cross-lingual datasets show that our approach delivers better and more robust results over the state-of-the-art alignment methods by learning better KG representations.

IJCAI Conference 2019 Conference Paper

Spatio-Temporal Attentive RNN for Node Classification in Temporal Attributed Graphs

  • Dongkuan Xu
  • Wei Cheng
  • Dongsheng Luo
  • Xiao Liu
  • Xiang Zhang

Node classification in graph-structured data aims to classify the nodes where labels are only available for a subset of nodes. This problem has attracted considerable research efforts in recent years. In real-world applications, both graph topology and node attributes evolve over time. Existing techniques, however, mainly focus on static graphs and lack the capability to simultaneously learn both temporal and spatial/structural features. Node classification in temporal attributed graphs is challenging for two major aspects. First, effectively modeling the spatio-temporal contextual information is hard. Second, as temporal and spatial dimensions are entangled, to learn the feature representation of one target node, it’s desirable and challenging to differentiate the relative importance of different factors, such as different neighbors and time periods. In this paper, we propose STAR, a spatio-temporal attentive recurrent network model, to deal with the above challenges. STAR extracts the vector representation of neighborhood by sampling and aggregating local neighbor nodes. It further feeds both the neighborhood representation and node attributes into a gated recurrent unit network to jointly learn the spatio-temporal contextual information. On top of that, we take advantage of the dual attention mechanism to perform a thorough analysis on the model interpretability. Extensive experiments on real datasets demonstrate the effectiveness of the STAR model.

AAAI Conference 2019 Conference Paper

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

  • Dongliang He
  • Zhichao Zhou
  • Chuang Gan
  • Fu Li
  • Xiao Liu
  • Yandong Li
  • Limin Wang
  • Shilei Wen

Despite the success of deep learning for static image understanding, it remains unclear what the most effective network architectures are for spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Particularly, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
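The super-image construction the abstract describes — stacking N consecutive RGB frames along the channel axis into a 3N-channel image — is simple enough to show directly. A minimal NumPy sketch follows; `make_super_image` is a hypothetical name, and the 2D convolution that would consume the result is omitted.

```python
import numpy as np

def make_super_image(frames):
    # frames: (N, 3, H, W) consecutive RGB frames. Stacking them along the
    # channel axis yields a single (3N, H, W) "super-image", so an ordinary
    # 2D convolution sees local spatial-temporal context in one pass.
    n, c, h, w = frames.shape
    return frames.reshape(n * c, h, w)
```

Because the reshape is row-major, channel `3k + j` of the super-image is channel `j` of frame `k`, preserving temporal order within the channel dimension.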

AAAI Conference 2018 Conference Paper

Multimodal Keyless Attention Fusion for Video Classification

  • Xiang Long
  • Chuang Gan
  • Gerard Melo
  • Xiao Liu
  • Yandong Li
  • Fu Li
  • Shilei Wen

The problem of video classification is inherently sequential and multimodal, and deep neural models hence need to capture and aggregate the most pertinent signals for a given input video. We propose Keyless Attention as an elegant and efficient means to more effectively account for the sequential nature of the data. Moreover, comparing a variety of multimodal fusion methods, we find that Multimodal Keyless Attention Fusion is the most successful at discerning interactions between modalities. We experiment on four highly heterogeneous datasets, UCF101, ActivityNet, Kinetics, and YouTube-8M, to validate our conclusion, and show that our approach achieves highly competitive results. Especially on large-scale data, our method has great advantages in efficiency and performance. Most remarkably, our best single model can achieve 77.0% in terms of the top-1 accuracy and 93.2% in terms of the top-5 accuracy on the Kinetics validation set, and achieve 82.2% in terms of GAP@20 on the official YouTube-8M test set.

AAAI Conference 2017 Conference Paper

Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition

  • Xiao Liu
  • Jiang Wang
  • Shilei Wen
  • Errui Ding
  • Yuanqing Lin

A key challenge in fine-grained recognition is how to find and represent discriminative local regions. Recent attention models are capable of learning discriminative region localizers from category labels alone with reinforcement learning. However, without utilizing any explicit part information, they are not able to accurately find multiple distinctive regions. In this work, we introduce an attribute-guided attention localization scheme where the local region localizers are learned under the guidance of part attribute descriptions. By designing a novel reward strategy, we are able to learn to locate regions that are spatially and semantically distinctive with a reinforcement learning algorithm. The attribute labeling required by the scheme is less demanding than the accurate part location annotations required by traditional part-based fine-grained recognition methods. Experimental results on the CUB-200-2011 dataset (Wah et al. 2011) demonstrate the superiority of the proposed scheme on both fine-grained recognition and attribute recognition.

IS Journal 2015 Journal Article

Identifying adverse drug events from patient social media: A case study for diabetes

  • Xiao Liu
  • Hsinchun Chen

Patient social media sites have emerged as major platforms for discussion of treatments and drug side effects, making them a promising source for listening to patients' voices in adverse drug event reporting. However, extracting patient reports from social media continues to be a challenge in health informatics research. In light of the need for more robust extraction methods, the authors developed a novel information extraction framework for identifying adverse drug events from patient social media. They also conducted a case study on a major diabetes patient social media platform to evaluate their framework's performance. Their approach achieves an f-measure of 86 percent in recognizing discussion of medical events and treatments, an f-measure of 69 percent for identifying adverse drug events, and an f-measure of 84 percent in patient report extraction. Their proposed methods significantly outperformed prior work in extracting patient reports of adverse drug events in health social media.