Arrow Research search

Author name cluster

Xiaoyu Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers

20

AAAI Conference 2026 Conference Paper

Efficient Plug-and-Play Weight Refinement for Sparse Large Models

  • Jingcheng Xie
  • Yinda Chen
  • Xiaoyu Liu
  • Yinglong Li
  • Haoyuan Shi
  • Zhiwei Xiong

One-shot pruning efficiently compresses Large Language Models but produces coarse sparse weights, causing significant performance degradation. Traditional fine-tuning approaches to refine these weights are prohibitively expensive for large models. This highlights the need for a training-free weight refinement method that works seamlessly with one-shot pruning and can efficiently recover the lost performance. To tackle this problem, we propose Efficient Iterative Weight Refinement (EIWR), a lightweight, plug-and-play, and training-free method that refines pruned weights through layer-wise iterative optimization. EIWR achieves efficient weight refinement via three key components: a Global Soft Constraint that eliminates costly row-wise Hessian inversions and expands the solution space; a Historical Momentum Strategy that leverages one-shot pruning priors to accelerate convergence and enhance final performance; and Neumann Series Extrapolation that significantly speeds up per-iteration computation. As a result, EIWR enables effective weight refinement with minimal time and memory overhead. Extensive experiments on LLaMA2/3 and Qwen under different pruning strategies and sparsity levels demonstrate that our method can efficiently refine sparse weights and mitigate performance degradation. For example, on LLaMA2-7B under 70 percent sparsity, EIWR reduces perplexity by 15 percent compared with SparseGPT on the WikiText2 benchmark, with only 1.81 additional minutes of computation and 1GB of additional memory.
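
To make the Neumann-series idea concrete, here is a minimal numerical sketch of approximating a matrix-inverse application with a truncated Neumann series; it illustrates the generic technique the abstract names, not EIWR's actual refinement step.

```python
import numpy as np

def neumann_apply_inverse(A, v, num_terms=8):
    """Approximate A^{-1} @ v with a truncated Neumann series.

    Assumes A is scaled so that M = I - A has spectral radius below 1,
    in which case A^{-1} = sum_k M^k. Each extra term costs one
    matrix-vector product instead of a full inversion.
    """
    M = np.eye(A.shape[0]) - A
    result, term = v.copy(), v.copy()
    for _ in range(num_terms - 1):
        term = M @ term          # next series term M^k @ v
        result += term
    return result
```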

AAAI Conference 2026 Conference Paper

Multi-Aspect Cross-modal Quantization for Generative Recommendation

  • Fuwei Zhang
  • Xiaoyu Liu
  • Dongbo Xi
  • Jishen Yin
  • Huan Chen
  • Peng Yan
  • Fuzhen Zhuang
  • Zhao Zhang

Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
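
A common recipe for hierarchical, low-conflict semantic IDs is residual quantization over stacked codebooks; the sketch below shows that generic recipe on (for example, fused multimodal) item embeddings. The shapes and the fusion step are assumptions for illustration, not MACRec's published design.

```python
import torch

def residual_quantize(x, codebooks):
    """x: [N, d] item embeddings; codebooks: list of [K, d] tensors,
    one per ID level. Returns one hierarchical semantic ID per item."""
    ids, residual = [], x
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # [N, K] distances to codewords
        idx = dists.argmin(dim=-1)          # nearest codeword per item
        ids.append(idx)
        residual = residual - cb[idx]       # next level quantizes the residual
    return torch.stack(ids, dim=-1), residual
```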

JBHI Journal 2025 Journal Article

An Optimization Strategy Allowing a Tactile Glove With Minimal Tactile Sensors for Soft Object Identification

  • Min Tang
  • Xiaoyu Liu
  • Xiaofeng Qiao
  • Yuanjie Zhu
  • Linyuan Fan
  • Songjun Du
  • Duo Chen
  • Jinghui Wang

Humans can easily perceive the shapes and textures of grasped objects due to high-density mechanoreceptor networks in the hand. However, replicating this capability in wearable devices with limited sensors remains challenging. Here, we designed a tactile glove equipped with easily accessible sensors, enabling accurate identification of soft objects during grasping. We propose an optimization strategy to eliminate redundant sensors and determine the minimal sensor configuration, which was then integrated into the tactile glove. The results indicate that the minimal sensor configuration (n = 7) attached to the hand achieved identification accuracy comparable to that obtained using the larger number of sensors (n = 22) distributed across the hand before elimination. Furthermore, we found that various machine learning classifiers achieved recognition accuracies of up to 90% for soft objects when using the tactile glove. Correlation analyses were conducted to characterize the individual contribution and mutual cooperativity of regional tactile forces on the hand during grasping, aiding in the interpretation of sensor selection or elimination in the optimization strategy. Thorough validation and analysis demonstrate that our strategy provides an easy-to-apply solution for identifying soft objects via a tactile glove with a minimal number of sensors, offering valuable insights for guiding the design of tactile sensor layouts in artificial limbs and robotic teleoperation systems.
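
One off-the-shelf way to realize such a sensor-elimination strategy is recursive feature elimination; the sketch below keeps 7 of 22 candidate "sensors" on synthetic data. It illustrates the general approach only; the paper's optimization strategy and data are its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 22))    # synthetic: 300 grasps x 22 candidate sensors
y = rng.integers(0, 6, size=300)  # synthetic labels for 6 soft-object classes

# Recursively drop the least informative sensor until 7 remain.
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=7)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the 7 retained sensors
```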

JBHI Journal 2025 Journal Article

BioSAM: Generating SAM Prompts From Superpixel Graph for Biological Instance Segmentation

  • Miaomiao Cai
  • Xiaoyu Liu
  • Zhiwei Xiong
  • Xuejin Chen

Proposal-free instance segmentation methods have significantly advanced the field of biological image analysis. Recently, the Segment Anything Model (SAM) has shown an extraordinary ability to handle challenging instance boundaries. However, directly applying SAM to biological images that contain instances with complex morphologies and dense distributions fails to yield satisfactory results. In this work, we propose BioSAM, a new biological instance segmentation framework generating SAM prompts from a superpixel graph. Specifically, to avoid over-merging, we first generate sufficient superpixels as graph nodes and construct an initialized graph. We then generate initial prompts from each superpixel and aggregate them through a graph neural network (GNN) by predicting the relationship of superpixels to avoid over-segmentation. We employ the SAM encoder embeddings and the SAM-assisted superpixel similarity as new features for the graph to enhance its discrimination capability. With the graph-based prompt aggregation, we utilize the aggregated prompts in SAM to refine the segmentation and generate more accurate instance boundaries. Comprehensive experiments on four representative biological datasets demonstrate that our proposed method outperforms state-of-the-art methods.
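
The superpixel-graph starting point can be reproduced with standard tooling; the sketch below over-segments an image with SLIC and builds a region adjacency graph, the structure on which a GNN such as BioSAM's would predict superpixel relationships. The sample image and parameters are placeholders.

```python
from skimage import data, graph   # in older scikit-image: skimage.future.graph
from skimage.segmentation import slic

image = data.coffee()             # stand-in for a biological image
labels = slic(image, n_segments=400, compactness=10)  # many superpixels
rag = graph.rag_mean_color(image, labels)  # nodes: superpixels; edges: adjacency
# A GNN would score each edge: merge the two superpixels if they share an instance.
print(rag.number_of_nodes(), rag.number_of_edges())
```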

NeurIPS Conference 2025 Conference Paper

ChatVLA-2: Vision-Language-Action Model with Open-World Reasoning

  • Zhongyi Zhou
  • Yichen Zhu
  • Xiaoyu Liu
  • Zhibin Tang
  • Junjie Wen
  • Yaxin Peng
  • Chaomin Shen
  • Yi Xu

Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world reasoning: the VLA should inherit the knowledge of the VLM, i.e., recognize anything the VLM can recognize, solve math problems, and possess visual-spatial intelligence; 2) Reasoning following: effectively translating open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-expert VLA model coupled with a specialized three-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and π0. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.
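
For readers unfamiliar with mixture-of-expert blocks, the toy layer below routes inputs between two experts with a learned softmax gate. It shows the mechanism only; ChatVLA-2's expert layout and routing are not specified at this level in the abstract.

```python
import torch
import torch.nn as nn

class TwoExpertMoE(nn.Module):
    """Toy mixture of two experts with a soft router (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.router = nn.Linear(dim, 2)

    def forward(self, x):                         # x: [batch, dim]
        weights = self.router(x).softmax(dim=-1)  # [batch, 2] routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # [batch, dim, 2]
        return (outs * weights.unsqueeze(1)).sum(dim=-1)
```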

ICML Conference 2025 Conference Paper

DiffusionVLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression

  • Junjie Wen
  • Yichen Zhu 0001
  • Minjie Zhu
  • Zhibin Tang
  • Jinming Li
  • Zhongyi Zhou
  • Xiaoyu Liu
  • Chaomin Shen 0001

In this paper, we present DiffusionVLA (DiVLA), a novel framework that integrates autoregressive reasoning with diffusion policies to address the limitations of existing methods: while autoregressive Vision-Language-Action (VLA) models lack precise and robust action generation, diffusion-based policies inherently lack reasoning capabilities. Central to our approach is autoregressive reasoning, a task decomposition and explanation process enabled by a pre-trained VLM, used to guide diffusion-based action policies. To tightly couple reasoning with action generation, we introduce a reasoning injection module that directly embeds self-generated reasoning phrases into the policy learning process. The framework is simple, flexible, and efficient, enabling seamless deployment across diverse robotic platforms. We conduct extensive experiments using multiple real robots to validate the effectiveness of DiVLA. Our tests include a challenging factory sorting task, where DiVLA successfully categorizes objects, including those not seen during training. The reasoning injection module enhances interpretability, enabling explicit failure diagnosis by visualizing the model's decision process. Additionally, we test DiVLA on a zero-shot bin-picking task, achieving 63.7% accuracy on 102 previously unseen objects. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiVLA can follow novel instructions and retain conversational ability. Notably, DiVLA is data-efficient and fast at inference; our smallest DiVLA-2B runs at 82 Hz on a single A6000 GPU. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size.
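
The abstract names a reasoning injection module without detailing its form; FiLM-style conditioning is one plausible way to embed reasoning phrases into a policy, sketched below purely for illustration.

```python
import torch
import torch.nn as nn

class ReasoningInjection(nn.Module):
    """FiLM-style sketch: modulate policy features with an embedding of
    self-generated reasoning text (a hypothetical stand-in, not DiVLA's
    actual module)."""
    def __init__(self, feat_dim, reason_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(reason_dim, 2 * feat_dim)

    def forward(self, policy_feats, reasoning_emb):
        scale, shift = self.to_scale_shift(reasoning_emb).chunk(2, dim=-1)
        return policy_feats * (1 + scale) + shift  # feature-wise modulation
```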

ICLR Conference 2025 Conference Paper

Prompting Fairness: Integrating Causality to Debias Large Language Models

  • Jingling Li
  • Zeyu Tang 0002
  • Xiaoyu Liu
  • Peter Spirtes
  • Kun Zhang 0001
  • Liu Leqi
  • Yang Liu 0018

Large language models (LLMs), despite their remarkable capabilities, are susceptible to generating biased and discriminatory responses. As LLMs increasingly influence high-stakes decision-making (e.g., hiring and healthcare), mitigating these biases becomes critical. In this work, we propose a causality-guided debiasing framework to tackle social biases, aiming to reduce the objectionable dependence between LLMs' decisions and the social information in the input. Our framework introduces a novel perspective to identify how social information can affect an LLM's decision through different causal pathways. Leveraging these causal insights, we outline principled prompting strategies that regulate these pathways through selection mechanisms. This framework not only unifies existing prompting-based debiasing techniques, but also opens up new directions for reducing bias by encouraging the model to prioritize fact-based reasoning over reliance on biased social cues. We validate our framework through extensive experiments on real-world datasets across multiple domains, demonstrating its effectiveness in debiasing LLM decisions, even with only black-box access to the model.
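
As a flavor of what a prompting-based selection mechanism might look like in practice, the template below steers the model toward fact-based reasoning and away from social cues. It is a hypothetical illustration in the spirit of the framework, not a prompt from the paper.

```python
def debias_prompt(task_description, profile):
    """Hypothetical selection-mechanism prompt for a hiring-style decision."""
    return (
        "Decide strictly on job-relevant qualifications. Demographic "
        "attributes (e.g., name, gender, age) must not influence the "
        "decision; if present, disregard them and justify the decision "
        "from the stated facts only.\n\n"
        f"Task: {task_description}\nProfile: {profile}\nDecision:"
    )
```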

NeurIPS Conference 2025 Conference Paper

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

  • Xiyao Wang
  • Zhengyuan Yang
  • Chao Feng
  • Yuhang Zhou
  • Xiaoyu Liu
  • Yongyuan Liang
  • Ming Li
  • Ziyi Zang

Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error, altering a few words describing objects, attributes, counts, or spatial relations, and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
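
The corrupt-and-verify loop is simple to sketch: perturb one word in a caption, then grant a binary reward for an exact match on the injected span. The toy swap table below stands in for the paper's richer perturbations over objects, attributes, counts, and spatial relations.

```python
import random

SWAPS = {"red": "blue", "left": "right", "two": "three"}  # toy perturbations

def corrupt_caption(caption, rng=random):
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    if not candidates:
        return caption, None
    i = rng.choice(candidates)
    words[i] = SWAPS[words[i].lower()]
    return " ".join(words), words[i]   # corrupted caption + injected span

def reward(model_answer, injected_span):
    return float(model_answer.strip() == injected_span)  # binary exact match
```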

NeurIPS Conference 2025 Conference Paper

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

  • Chaoyou Fu
  • Haojia Lin
  • Xiong Wang
  • Yifan Zhang
  • Yunhang Shen
  • Xiaoyu Liu
  • Haoyu Cao
  • Zuwei Long

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing against state-of-the-art counterparts across benchmarks for image, video, and speech, we demonstrate that our omni model is equipped with both strong visual and speech capabilities, enabling omni-modal understanding and interaction.

NeurIPS Conference 2023 Conference Paper

C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder

  • Xiaoyu Liu
  • Jiaxin Yuan
  • Bang An
  • Yuancheng Xu
  • Yifan Yang
  • Furong Huang

Representation learning assumes that real-world data is generated by a few semantically meaningful generative factors (i.e., sources of variation) and aims to discover them in the latent space. These factors are expected to be causally disentangled, meaning that distinct factors are encoded into separate latent variables, and changes in one factor will not affect the values of the others. Compared to statistical independence, causal disentanglement allows more controllable data generation, improved robustness, and better generalization. However, most existing work assumes unconfoundedness in the discovery process, i.e., that there are no common causes of the generative factors, and thus obtains only statistical independence. In this paper, we recognize the importance of modeling confounders in discovering causal generative factors. Unfortunately, such factors are not identifiable without a proper inductive bias. We fill the gap by introducing a framework entitled Confounded-Disentanglement (C-Disentanglement), the first framework that explicitly introduces an inductive bias on the confounder via labels from domain expertise. In addition, we accordingly propose an approach to sufficiently identify the causally disentangled factors under any given inductive bias of the confounder. We conduct extensive experiments on both synthetic and real-world datasets. Our method demonstrates competitive results compared to various SOTA baselines in obtaining causally disentangled features and on downstream tasks under domain shifts.

IJCAI Conference 2022 Conference Paper

Biological Instance Segmentation with a Superpixel-Guided Graph

  • Xiaoyu Liu
  • Wei Huang
  • Yueyi Zhang
  • Zhiwei Xiong

Recent advanced proposal-free instance segmentation methods have made significant progress in biological images. However, existing methods are vulnerable to local imaging artifacts and similar object appearances, resulting in over-merge and over-segmentation. To reduce these two kinds of errors, we propose a new biological instance segmentation framework based on a superpixel-guided graph, which consists of two stages, i.e., superpixel-guided graph construction and superpixel agglomeration. Specifically, the first stage generates enough superpixels as graph nodes to avoid over-merge, and extracts node and edge features to construct an initialized graph. The second stage agglomerates superpixels into instances based on the relationship of graph nodes predicted by a graph neural network (GNN). To solve over-segmentation and prevent introducing additional over-merge, we specially design two loss functions to supervise the GNN, i.e., a repulsion-attraction (RA) loss to better distinguish the relationship of nodes in the feature space, and a maximin agglomeration score (MAS) loss to pay more attention to crucial edge classification. Extensive experiments on three representative biological datasets demonstrate the superiority of our method over existing state-of-the-art methods. Code is available at https://github.com/liuxy1103/BISSG.
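
The repulsion-attraction idea resembles a standard contrastive objective on node embeddings; a minimal sketch is below, assuming per-edge endpoint embeddings and binary same-instance labels. The paper's exact RA and MAS losses may differ.

```python
import torch

def repulsion_attraction_loss(h_u, h_v, same_instance, margin=1.0):
    """h_u, h_v: [E, d] embeddings of each edge's two endpoints;
    same_instance: [E] floats, 1.0 if the endpoints share an instance.
    Pulls same-instance nodes together, pushes others apart up to a margin."""
    dist = (h_u - h_v).norm(dim=-1)
    attract = same_instance * dist.pow(2)
    repel = (1.0 - same_instance) * torch.clamp(margin - dist, min=0).pow(2)
    return (attract + repel).mean()
```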

JMLR Journal 2021 Journal Article

Histogram Transform Ensembles for Large-scale Regression

  • Hanyuan Hang
  • Zhouchen Lin
  • Xiaoyu Liu
  • Hongwei Wen

In this paper, we propose a novel algorithm for large-scale regression problems named Histogram Transform Ensembles (HTE), composed of random rotations, stretchings, and translations. Our HTE method first applies a histogram-transformed partition to the randomly affine-mapped data, then adaptively leverages constant functions or SVMs to obtain the individual regression estimates, and eventually builds the ensemble predictor through an averaging strategy. First, we investigate the theoretical properties of HTE when the regression function lies in the Hölder space $C^{k,\alpha}$, $k \in \mathbb{N}_0$, $\alpha \in (0,1]$. In the case that $k = 0, 1$, we adopt the constant regressors and develop the naïve histogram transforms (NHT). Within the space $C^{0,\alpha}$, although almost optimal convergence rates can be derived for both single and ensemble NHT, we fail to show the benefits of ensembles over single estimators theoretically. In contrast, in the subspace $C^{1,\alpha}$, we prove that if $d \geq 2(1+\alpha)/\alpha$, the lower bound of the convergence rates for single NHT turns out to be worse than the upper bound of the convergence rates for ensemble NHT. In the other case, when $k \geq 2$, NHT may no longer be appropriate for predicting smoother regression functions. Instead, we circumvent this issue by applying kernel histogram transforms (KHT) equipped with smoother regressors, such as support vector machines (SVMs). Accordingly, it turns out that both single and ensemble KHT enjoy almost optimal convergence rates. We then validate the above theoretical results with extensive numerical experiments. On the one hand, simulations are conducted to elucidate that ensemble NHT outperforms single NHT. On the other hand, the effects of bin sizes on the accuracy of both NHT and KHT are also in accord with the theoretical analysis. Last but not least, in the real-data experiments, comparisons between the ensemble KHT, equipped with adaptive histogram transforms, and other state-of-the-art large-scale regression estimators verify the effectiveness and precision of the proposed algorithm.
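
A single histogram transform is easy to write down: a random rotation, stretching, and translation, followed by regular binning of the mapped data. The sketch below follows that recipe at a high level, with the per-cell regressors and bin-width schedule simplified relative to the paper.

```python
import numpy as np

def histogram_transform_bins(X, bin_width, rng):
    """Map X (n x d) through a random affine transform, then assign each
    point an integer bin index per axis. An HTE member fits a constant
    (or an SVM) within each occupied cell; the ensemble averages members."""
    d = X.shape[1]
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation
    stretch = rng.uniform(0.5, 2.0, size=d)       # random per-axis stretching
    shift = rng.uniform(0.0, bin_width, size=d)   # random translation
    Z = (X @ Q) * stretch + shift
    return np.floor(Z / bin_width).astype(int)    # bin index per axis
```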

NeurIPS Conference 2021 Conference Paper

Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge

  • Jiyang Qi
  • Yan Gao
  • Yao Hu
  • Xinggang Wang
  • Xiaoyu Liu
  • Xiang Bai
  • Serge Belongie
  • Alan Yuille

Although deep learning methods have achieved advanced video object recognition performance in recent years, perceiving heavily occluded objects in a video is still a very challenging task. To promote the development of occlusion understanding, we collect a large-scale dataset called OVIS for video instance segmentation in the occluded scenario. OVIS consists of 296k high-quality instance masks and 901 occluded scenes. While the human visual system can perceive those occluded objects by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, all baseline methods encounter a significant performance degradation of about 80% in the heavily occluded object group, which demonstrates that there is still a long way to go in understanding obscured objects and videos in complex real-world scenarios. To facilitate research on new paradigms for video understanding systems, we launched a challenge based on the OVIS dataset. The submitted top-performing algorithms have achieved much higher performance than our baselines. In this paper, we introduce the OVIS dataset and further dissect it by analyzing the results of baselines and submitted methods. The OVIS dataset and challenge information can be found at http://songbai.site/ovis.

ICRA Conference 2020 Conference Paper

Self-Supervised Learning for Alignment of Objects and Sound

  • Xinzhu Liu
  • Xiaoyu Liu
  • Di Guo 0002
  • Huaping Liu 0001
  • Fuchun Sun 0001
  • Haibo Min

The sound source separation problem has many useful applications in the field of robotics, such as human-robot interaction and scene understanding. However, it remains a very challenging problem. In this paper, we utilize both the visual and audio information of videos to perform the sound source separation task. A self-supervised learning framework is proposed to implement the object detection and sound separation modules simultaneously. This approach is designed to better find the alignment between the detected objects and the separated sound components. Our experiments, conducted on both synthetic and real datasets, validate this approach and demonstrate the effectiveness of the proposed model in the task of object and sound alignment.

AAAI Conference 2018 Conference Paper

Adaptive Co-attention Network for Named Entity Recognition in Tweets

  • Qi Zhang
  • Jinlan Fu
  • Xiaoyu Liu
  • Xuanjing Huang

In this study, we investigate the problem of named entity recognition for tweets. Named entity recognition is an important task in natural language processing and has been carefully studied in recent decades. Previous named entity recognition methods usually used only the textual content when processing tweets. However, many tweets contain not only textual content but also images, and such visual information is also valuable in the named entity recognition task. To make full use of textual and visual information, this paper proposes a novel method to process tweets that contain multimodal information. We extend a bidirectional long short-term memory network with conditional random fields and an adaptive co-attention network to achieve this task. To evaluate the proposed methods, we constructed a large-scale labeled dataset that contains multimodal tweets. Experimental results demonstrate that the proposed method achieves better performance than previous methods in most cases.
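
A minimal co-attention block, in which text tokens attend over image regions, looks like the sketch below; the paper's adaptive gating over modalities is omitted, so treat this as only one half of the described architecture.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Text-to-image co-attention sketch (adaptive modality gate omitted)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, text_h, img_h):  # text_h: [B, T, d], img_h: [B, R, d]
        scores = self.q(text_h) @ self.k(img_h).transpose(1, 2)  # [B, T, R]
        attn = torch.softmax(scores / text_h.size(-1) ** 0.5, dim=-1)
        return attn @ self.v(img_h)    # image-aware text features, [B, T, d]
```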

AAAI Conference 2018 Conference Paper

Neural Networks Incorporating Dictionaries for Chinese Word Segmentation

  • Qi Zhang
  • Xiaoyu Liu
  • Jinlan Fu

In recent years, deep neural networks have achieved significant success in Chinese word segmentation and many other natural language processing tasks. Most of these algorithms are end-to-end trainable systems and can effectively process and learn from large-scale labeled datasets. However, these methods typically lack the capability of processing rare words and data whose domains differ from the training data. Previous statistical methods have demonstrated that human knowledge can provide valuable information for handling rare cases and domain-shifting problems. In this paper, we seek to address the problem of incorporating dictionaries into neural networks for the Chinese word segmentation task. Two different methods that extend the bidirectional long short-term memory neural network are proposed to perform the task. To evaluate the performance of the proposed methods, state-of-the-art supervised methods and domain adaptation approaches are compared with our methods on nine datasets from different domains. The experimental results demonstrate that the proposed methods achieve better performance than other state-of-the-art neural network methods and domain adaptation approaches in most cases.
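
One common way to expose a dictionary to a neural segmenter is per-character match features; the sketch below flags whether each character begins or ends a dictionary word of each length. This is a generic recipe for illustration; the paper's two BiLSTM extensions are not reproduced here.

```python
def dictionary_features(sentence, dictionary, max_len=4):
    """For each character, flag 'begins an L-char dictionary word' and
    'ends an L-char dictionary word' for L = 1..max_len.
    dictionary: a set of known words."""
    feats = [[0] * (2 * max_len) for _ in sentence]
    for i in range(len(sentence)):
        for L in range(1, max_len + 1):
            word = sentence[i:i + L]
            if len(word) == L and word in dictionary:
                feats[i][L - 1] = 1                    # begins an L-char word
                feats[i + L - 1][max_len + L - 1] = 1  # ends an L-char word
    return feats
```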

IJCAI Conference 2018 Conference Paper

Neural Networks Incorporating Unlabeled and Partially-labeled Data for Cross-domain Chinese Word Segmentation

  • Lujun Zhao
  • Qi Zhang
  • Peng Wang
  • Xiaoyu Liu

Most existing Chinese word segmentation (CWS) methods are supervised; hence, large-scale annotated domain-specific datasets are needed for training. In this paper, we seek to address the problem of CWS for resource-poor domains that lack annotated data. A novel neural network model is proposed to incorporate unlabeled and partially-labeled data. To make use of unlabeled data, we combine a bidirectional LSTM segmentation model with two character-level language models using a gate mechanism; these language models can capture co-occurrence information. To make use of partially-labeled data, we modify the original cross-entropy loss function of the RNN. Experimental results demonstrate that the method performs well on CWS tasks in a series of domains.
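
The gate mechanism that combines the segmentation model with the character-level language models could plausibly be a learned sigmoid blend; the abstract does not give the exact formulation, so the sketch below is an assumption-labeled illustration.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blend segmentation features with character-LM features via a learned
    gate (a plausible reading of the abstract, not the paper's exact form)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, seg_h, lm_h):  # both: [batch, seq, dim]
        g = torch.sigmoid(self.gate(torch.cat([seg_h, lm_h], dim=-1)))
        return g * seg_h + (1.0 - g) * lm_h
```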