Arrow Research search

Author name cluster

Rui Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

AAAI Conference 2026 Conference Paper

MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment

  • Huangbiao Xu
  • Huanqi Wu
  • Xiao Ke
  • Junyi Wu
  • Rui Xu
  • Jinglin Xu

Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, in real-world settings some modalities are frequently unavailable at inference time. The absence of any modality often renders existing multimodal models inoperable and triggers catastrophic performance degradation due to interrupted cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks.

AAAI Conference 2026 Conference Paper

Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

  • Rui Xu
  • Yunke Wang
  • Yong Luo
  • Bo Du

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a 2.7x reduction in inference latency and 6x in FLOPs, while retaining 95.71% of the original performance.

AAAI Conference 2025 Conference Paper

DanceFix: An Exploration in Group Dance Neatness Assessment Through Fixing Abnormal Challenges of Human Pose

  • Huangbiao Xu
  • Xiao Ke
  • Huanqi Wu
  • Rui Xu
  • Yuezhou Li
  • Peirong Xu
  • Wenzhong Guo

The fair and objective assessment of performances and competitions is a common pursuit and challenge in human society. The application of computer vision technology offers hope for this purpose, but it still faces obstacles such as occlusion and motion blur. To address these hindrances, our DanceFix proposes a bidirectional spatial-temporal context optical flow correction (BOFC) method. This approach leverages the consistency and complementarity of motion information between two modalities: optical flow, which excels at pixel capture, and lightweight skeleton data. It enables the extraction of pixel-level motion changes and the correction of abnormal skeleton data. Furthermore, we propose a part-level dance dataset (Dancer Parts) and part-level motion feature extraction based on task decoupling (PETD). This aims to decouple complex whole-body part tracking into fine-grained limb-level motion extraction, enhancing the confidence of temporal information and the accuracy of correction for abnormal data. Finally, we present the DNV dataset, which simulates fully neat group dance scenes and provides reliable labels and validation methods for the newly introduced group dance neatness assessment (GDNA). To the best of our knowledge, this is the first work to develop quantitative criteria for assessing limb and joint neatness in group dance. We conduct experiments on the DNV and video-based public JHMDB datasets. Our method effectively corrects abnormal skeleton points, embeds flexibly into existing pose estimation algorithms, and improves their accuracy.

JBHI Journal 2025 Journal Article

Feature Separation in Diffuse Lung Disease Image Classification by Using Evolutionary Algorithm-Based NAS

  • Qing Zhang
  • Dan Shao
  • Lin Lin
  • Guoliang Gong
  • Rui Xu
  • Shoji Kido
  • HongWei Cui

In the field of diagnosing lung diseases, the application of neural networks (NNs) in image classification exhibits significant potential. However, NNs are considered “black boxes,” making it difficult to discern their decision-making processes, thereby leading to skepticism and concern regarding NNs. This compromises model reliability and hampers the development of intelligent medicine. To tackle this issue, we introduce Evolutionary Neural Architecture Search (EvoNAS). In image classification tasks, EvoNAS initially utilizes an Evolutionary Algorithm to explore various Convolutional Neural Networks, ultimately yielding an optimized network that excels at separating redundant texture features from the most discriminative ones. Retaining the most discriminative features improves classification accuracy, particularly in distinguishing similar features. This approach illuminates the intrinsic mechanics of classification, thereby enhancing the accuracy of the results. Subsequently, we incorporate a Differential Evolution algorithm based on distribution estimation, significantly enhancing search efficiency. Employing visualization techniques, we demonstrate the effectiveness of EvoNAS, endowing the model with interpretability. Finally, we conduct experiments on the diffuse lung disease texture dataset using EvoNAS. Compared to the original network, the classification accuracy increases by 0.56%. Moreover, our EvoNAS approach demonstrates significant advantages over existing methods on the same dataset.

JBHI Journal 2025 Journal Article

GSAHermNet: A GraphSAGE-Based Neural Network with Hermite Interpolation for Individualized Gait Pattern Generation

  • Lin Meng
  • Shaochen Xu
  • Hongtao Dong
  • Juan Du
  • Uriel Martinez-Hernandez
  • Rui Xu
  • Dong Ming

Accurate generation of gait patterns is essential for advancing robotic gait rehabilitation. This study presents GSAHermNet, a novel two-stage framework that combines a GraphSAGE-based neural network for predicting key gait events with Hermite interpolation to reconstruct full joint trajectories. Unlike conventional methods that generate the entire gait cycle directly, GSAHermNet focuses on predicting key gait events using only seven body and walking parameters, thereby reducing overfitting and enhancing generalizability across diverse walking speeds and conditions. The model was trained on a public dataset of 42 healthy subjects using 5-fold cross-validation on 40 individuals, while the remaining two subjects were reserved for independent testing. Experimental results demonstrate that GSAHermNet achieves mean absolute deviations (MAD) below 4.58° and correlation coefficients (r) of 0.99 for hip and knee joints, and MAD below 3.69° with r = 0.85 for the ankle. Comparative analyses confirm that GSAHermNet outperforms conventional statistical and machine learning approaches in both accuracy and robustness. The proposed approach has great potential for real-world applications, such as adaptive control in functional electrical stimulation systems and personalized motion planning in lower-limb exoskeletons. In future work, an online framework for real-time gait trajectory generation will be established using wearable sensor inputs.

NeurIPS Conference 2025 Conference Paper

Lifelong Test-Time Adaptation via Online Learning in Tracked Low-Dimensional Subspace

  • Dexin Duan
  • Rui Xu
  • Peilin Liu
  • Fei Wen

Test-time adaptation (TTA) aims to adapt a source model to a target domain using only test data. Existing methods predominantly rely on unsupervised entropy minimization or its variants, which suffer from degeneration, leading to trivial solutions with low-entropy but inaccurate predictions. In this work, we identify entropy-deceptive (ED) samples, instances where the model makes highly confident yet incorrect predictions, as the underlying cause of degeneration. Further, we reveal that the gradients of entropy minimization in TTA have an intrinsic low-dimensional structure, driven primarily by entropy-truthful (ET) samples whose gradients are highly correlated. In contrast, ED samples have scattered, less correlated gradients. Leveraging this observation, we show that the detrimental impact of ED samples can be suppressed by constraining model updates within the principal subspace of backward gradients. Building on this insight, we propose LCoTTA, a lifelong continual TTA method that tracks the principal subspace of gradients online and utilizes their projections onto this subspace for adaptation. Further, we provide theoretical analysis to show that the proposed subspace-based method can enhance the robustness against detrimental ED samples. Extensive experiments demonstrate that LCoTTA effectively overcomes degeneration and significantly outperforms existing methods in long-term continual adaptation scenarios. Code is available online.

NeurIPS Conference 2025 Conference Paper

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

  • Rui Xu
  • Dakuan Lu
  • Zicheng Zhao
  • Xiaoyu Tan
  • Xintao Wang
  • Siyu Yuan
  • Jiangjie Chen
  • Yinghui Xu

Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models (MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity of MLLMs to handle mathematical constraints through origami tasks. The dataset contains 350 data instances, each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we provide an initial picture of the strengths and weaknesses of these models in handling complex spatial reasoning tasks.

ICLR Conference 2025 Conference Paper

Wasserstein-Regularized Conformal Prediction under General Distribution Shift

  • Rui Xu
  • Chao Chen
  • Yue Sun 0001
  • Parvathinathan Venkitasubramaniam
  • Sihong Xie

Conformal prediction yields a prediction set with guaranteed $1-\alpha$ coverage of the true target under the i.i.d. assumption, which can fail and lead to a gap between $1-\alpha$ and the actual coverage. Prior studies bound the gap using total variation distance, which cannot identify the gap changes under distribution shift at different $\alpha$, thus serving as a weak indicator of prediction set validity. Besides, existing methods are mostly limited to covariate shifts, while general joint distribution shifts are more common in practice but less researched. In response, we first propose a Wasserstein distance-based upper bound of the coverage gap and analyze the bound using probability measure pushforwards between the shifted joint data and conformal score distributions, enabling a separation of the effect of covariate and concept shifts over the coverage gap. We exploit the separation to design algorithms based on importance weighting and regularized representation learning (WR-CP) to reduce the Wasserstein bound with a finite-sample error bound. WR-CP achieves a controllable balance between conformal prediction accuracy and efficiency. Experiments on six datasets prove that WR-CP can reduce coverage gaps to 3.2% across different confidence levels and outputs prediction sets 37% smaller than the worst-case approach on average.

TMLR Journal 2024 Journal Article

From Persona to Personalization: A Survey on Role-Playing Language Agents

  • Jiangjie Chen
  • Xintao Wang
  • Rui Xu
  • Siyu Yuan
  • Yikai Zhang
  • Wei Shi
  • Jian Xie
  • Shuang Li

Recent advancements in large language models (LLMs) have significantly boosted the rise of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate assigned personas. By harnessing multiple advanced abilities of LLMs, including in-context learning, instruction following, and social intelligence, RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance. RPLAs can mimic a wide range of personas, ranging from historical figures and fictional characters to real-life individuals. Consequently, they have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots, and digital clones. In this paper, we conduct a comprehensive survey of this field, illustrating the evolution and recent progress in RPLAs integrating with cutting-edge LLM technologies. We categorize personas into three types: 1) Demographic Persona, which leverages statistical stereotypes; 2) Character Persona, focused on well-established figures; and 3) Individualized Persona, customized through ongoing user interactions for personalized services. We begin by presenting a comprehensive overview of current methodologies for RPLAs, followed by the details for each persona type, covering corresponding data sourcing, agent construction, and evaluation. Afterward, we discuss the fundamental risks, existing limitations, and prospects of RPLAs. Additionally, we provide a brief review of RPLAs in AI products in the market, which reflects practical user demands that shape and drive RPLA research. Through this survey, we aim to establish a clear taxonomy of RPLA research and applications, facilitate future research in this critical and ever-evolving field, and pave the way for a future where humans and RPLAs coexist in harmony.

AAAI Conference 2024 Conference Paper

Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

  • Zhouhong Gu
  • Xiaoxuan Zhu
  • Haoning Ye
  • Lin Zhang
  • Jianchen Wang
  • Yixin Zhu
  • Sihang Jiang
  • Zhuozhi Xiong

New Natural Language Processing (NLP) benchmarks are urgently needed to keep pace with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions across 516 diverse disciplines spanning 13 subjects, accompanied by Xiezhi-Specialty with 14,041 questions and Xiezhi-Interdiscipline with 10,746 questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed average human performance in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. All evaluation code and data are open-sourced at https://github.com/MikeGu721/XiezhiBenchmark

AAAI Conference 2023 Conference Paper

Converge to the Truth: Factual Error Correction via Iterative Constrained Editing

  • Jiangjie Chen
  • Rui Xu
  • Wenxuan Zeng
  • Changzhi Sun
  • Lei Li
  • Yanghua Xiao

Given a possibly false claim sentence, how can we automatically correct it with minimal editing? Existing methods either require a large number of pairs of false and corrected claims for supervised training or do not handle well errors spanning over multiple tokens within an utterance. In this paper, we propose VENCE, a novel method for factual error correction (FEC) with minimal edits. VENCE formulates the FEC problem as iterative sampling editing actions with respect to a target density function. We carefully design the target function with predicted truthfulness scores from an offline trained fact verification model. VENCE samples the most probable editing positions based on back-calculated gradients of the truthfulness score concerning input tokens and the editing actions using a distantly-supervised language model (T5). Experiments on a public dataset show that VENCE improves the well-adopted SARI metric by 5.3 (or a relative improvement of 11.8%) over the previous best distantly-supervised methods.

AAAI Conference 2022 Conference Paper

Domain Disentangled Generative Adversarial Network for Zero-Shot Sketch-Based 3D Shape Retrieval

  • Rui Xu
  • Zongyan Han
  • Le Hui
  • Jianjun Qian
  • Jin Xie

Sketch-based 3D shape retrieval is a challenging task due to the large domain discrepancy between sketches and 3D shapes. Since existing methods are trained and evaluated on the same categories, they cannot effectively recognize the categories that have not been used during training. In this paper, we propose a novel domain disentangled generative adversarial network (DD-GAN) for zero-shot sketch-based 3D retrieval, which can retrieve the unseen categories that are not accessed during training. Specifically, we first generate domain-invariant features and domain-specific features by disentangling the learned features of sketches and 3D shapes, where the domain-invariant features are used to align with the corresponding word embeddings. Then, we develop a generative adversarial network that combines the domain-specific features of the seen categories with the aligned domain-invariant features to synthesize samples, where the synthesized samples of the unseen categories are generated by using the corresponding word embeddings. Finally, we use the synthesized samples of the unseen categories combined with the real samples of the seen categories to train the network for retrieval, so that the unseen categories can be recognized. In order to reduce the domain shift problem, we utilize unlabeled unseen samples to enhance the discrimination ability of the discriminator. With the discriminator distinguishing the generated samples from the unlabeled unseen samples, the generator can generate more realistic unseen samples. Extensive experiments on the SHREC’13 and SHREC’14 datasets show that our method significantly improves the retrieval performance of the unseen categories.

JBHI Journal 2021 Journal Article

Joint Extraction of Retinal Vessels and Centerlines Based on Deep Semantics and Multi-Scaled Cross-Task Aggregation

  • Rui Xu
  • Tiantian Liu
  • Xinchen Ye
  • Fei Liu
  • Lin Lin
  • Liang Li
  • Satoshi Tanaka
  • Yen-Wei Chen

Retinal vessel segmentation and centerline extraction are crucial steps in building a computer-aided diagnosis system on retinal images. Previous works treat them as two isolated tasks, while ignoring their tight association. In this paper, we propose a deep semantics and multi-scaled cross-task aggregation network that takes advantage of the association to jointly improve their performances. Our network is featured by two sub-networks. The forepart is a deep semantics aggregation sub-network that aggregates strong semantic information to produce more powerful features for both tasks, and the tail is a multi-scaled cross-task aggregation sub-network that explores complementary information to refine the results. We evaluate the proposed method on three public databases, which are DRIVE, STARE and CHASE_DB1. Experimental results show that our method can not only simultaneously extract retinal vessels and their centerlines but also achieve the state-of-the-art performances on both tasks.

NeurIPS Conference 2020 Conference Paper

Discovering Symbolic Models from Deep Learning with Inductive Biases

  • Miles Cranmer
  • Alvaro Sanchez Gonzalez
  • Peter Battaglia
  • Rui Xu
  • Kyle Cranmer
  • David Spergel
  • Shirley Ho

We develop a general approach to distill symbolic representations of a learned deep model by introducing strong inductive biases. We focus on Graph Neural Networks (GNNs). The technique works as follows: we first encourage sparse latent representations when we train a GNN in a supervised setting, then we apply symbolic regression to components of the learned model to extract explicit physical relations. We find that the correct known equations, including force laws and Hamiltonians, can be extracted from the neural network. We then apply our method to a non-trivial cosmology example, a detailed dark matter simulation, and discover a new analytic formula which can predict the concentration of dark matter from the mass distribution of nearby cosmic structures. The symbolic expressions extracted from the GNN using our technique also generalize to out-of-distribution data better than the GNN itself. Our approach offers alternative directions for interpreting neural networks and discovering novel physical principles from the representations they learn.

AAAI Conference 2020 Conference Paper

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

  • Ya Zhao
  • Rui Xu
  • Xinchao Wang
  • Peng Hou
  • Haihong Tang
  • Mingli Song

Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, due to the ambiguous nature of its actuations that makes it challenging to extract discriminant features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted by speech recognizers may provide complementary and discriminant clues, which are difficult to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer’s prediction. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by margins of 7.66% and 2.75% in character error rate, respectively.

JBHI Journal 2020 Journal Article

Pulmonary Textures Classification via a Multi-Scale Attention Network

  • Rui Xu
  • Zhen Cong
  • Xinchen Ye
  • Yasushi Hirano
  • Shoji Kido
  • Tomoko Gyobu
  • Yutaka Kawata
  • Osamu Honda

Precise classification of pulmonary textures is crucial to developing a computer-aided diagnosis (CAD) system for diffuse lung diseases (DLDs). Although deep learning techniques have been applied to this task, the classification performance does not yet satisfy clinical requirements, since commonly used deep networks built by stacking convolutional blocks are not able to learn feature representations discriminative enough to distinguish complex pulmonary textures. To address this problem, we design a multi-scale attention network (MSAN) architecture composed of several stacked residual attention modules followed by a multi-scale fusion module. Our deep network can not only exploit powerful information on different scales but also automatically select optimal features for more discriminative feature representation. Besides, we develop visualization techniques to make the proposed deep model transparent for humans. The proposed method is evaluated on a large dataset. Experimental results show that our method achieves an average classification accuracy of 94.78% and an average F-value of 0.9475 in the classification of 7 categories of pulmonary textures. Moreover, visualization results intuitively explain the working behavior of the deep network. The proposed method achieves state-of-the-art performance in classifying pulmonary textures on high-resolution CT images.

IS Journal 2014 Journal Article

Optimal Design and Control of Smart Space Structures: A Memetic Evolution Approach

  • Dong-Xu Li
  • Rui Xu

To reach global optimal performance, here the authors propose a new design approach that offers concurrent design for smart space structures. In this approach, the quantity and placement of sensors/actuators and parameters of controllers are simultaneously optimized by a memetic evolution algorithm. A solar array smart structure has been used for computational experiments and the corresponding results indicate that the proposed concurrent design can obtain better performance than the sequential one.