Author name cluster

Ran Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

2 author rows

AAAI Conference 2026 Conference Paper

Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation

Rongyu Zhang
Aosong Cheng
Yulin Luo
Gaole Dai
Huanrui Yang
Jiaming Liu
Ran Xu
Li Du

Continual Test-Time Adaptation (CTTA), which aims to adapt the pre-trained model to ever-evolving target domains, emerges as an important task for vision models. As current vision models appear to be heavily biased towards texture, continuously adapting the model from one domain distribution to another can result in serious catastrophic forgetting. Drawing inspiration from the encoding characteristics of neuron activation in neural networks, we propose the Mixture-of-Activation-Sparsity-Experts (MoASE) for the CTTA task. Given the distinct reaction of neurons with low and high activation to domain-specific and agnostic features, MoASE decomposes the neural activation into high-activation and low-activation components in each expert with a Spatial Differentiable Dropout (SDD). Based on the decomposition, we devise a Domain-Aware Router (DAR) that utilizes domain information to adaptively weight experts that process the post-SDD sparse activations, and the Activation Sparsity Gate (ASG) that adaptively assigns feature selection thresholds of the SDD for different experts for more precise feature decomposition. Finally, we introduce a Homeostatic-Proximal (HP) loss to maintain update consistency between the teacher and student experts to prevent error accumulation. Extensive experiments substantiate that MoASE achieves state-of-the-art performance in both classification and segmentation tasks.

TCS Journal 2026 Journal Article

Leakage-resilient attribute-based encryption scheme with CCA security

Yanwei Zhou
Ran Xu
Zirui Qiao
Bo Yang

To ensure message security, including unforgeability and confidentiality, cryptographic techniques such as encryption, digital signatures, and security protocols are widely adopted. Conventional cryptographic schemes are typically proven secure under idealized models, where internal secrets remain hidden from external adversaries. However, in real-world implementations, side-channel attacks, such as electromagnetic analysis, power analysis, and timing attacks, can exploit physical leakage to compromise secret information. Consequently, in leakage-prone environments, cryptographic primitives previously deemed secure may no longer uphold their security guarantees. Recent research has thus focused on developing leakage-resilient cryptographic schemes to counteract such threats. While attribute-based encryption (ABE) schemes enable fine-grained access control, existing leakage-resilient ABE schemes only achieve chosen-plaintext attack (CPA) security and lack universality. To address this limitation, we present a generic construction for a leakage-resilient ABE scheme based on an attribute-based hash proof system (AB-HPS) and a one-time lossy filter (OT-LF), proving its chosen-ciphertext attack (CCA) security. Furthermore, we introduce a novel cryptographic primitive termed two-key encapsulated attribute-based hash proof system (T-AB-HPS), which facilitates a more efficient CCA-secure leakage-resilient ABE construction from a message authentication code (MAC). Specifically, our work makes two key contributions: (1) A generic framework for constructing a CCA-secure leakage-resilient ABE scheme by integrating AB-HPS with an OT-LF. (2) An optimized and generalizable approach for designing a CCA-secure leakage-resilient ABE scheme using T-AB-HPS in conjunction with a MAC.

TMLR Journal 2026 Journal Article

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Rui Meng
Ziyan Jiang
Ye Liu
Mingyi Su
Xinyi Yang
Yuepeng Fu
Can Qin
Raghuveer Thirukovalluru

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings such as VLM2Vec, E5-V, and GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, retrieval-augmented generation (RAG) systems, and recommendation. To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering, spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

IROS Conference 2025 Conference Paper

A Mole-inspired Incisor-Burrowing Robotic Platform for Planetary Exploration

Ran Xu
Jiabin Liu
Zhaofeng Liang
Hongmin Zheng
Kunquan Zheng
Zibiao Chen
Jiawei Chen
Tao Zhang 0064

Planetary exploration requires efficient methods for subsurface sampling, especially under extreme energy limitations. Traditional drilling methods are often energy intensive and require large platforms, limiting their applicability. Bio-inspired burrowing techniques, inspired by animals like moles, offer lightweight, low-power alternatives suitable for small robotic platforms. This paper presents a novel bio-inspired robotic platform, the Mole-like Incisor-Burrowing Robotic Platform (MIRP), designed to mimic the incisor-burrowing behavior of naked mole rats. The MIRP features an 11-DOF mechanism with a compact design (220 mm × 140 mm × 80 mm) and uses servomotors to achieve low energy consumption. The robot combines a quadrupedal locomotion mechanism with an incisor-burrowing mechanism, allowing it to navigate granular terrains and perform excavation tasks. Kinematic analysis, including inverse kinematics and closed-chain analysis, was conducted to optimize the robot’s motion strategy. A prototype was developed and tested in a simulated lunar regolith environment to evaluate its maneuverability and burrowing performance. The power consumption of the prototype is below 10 W. This work validates the feasibility of bio-inspired incisor-burrowing for planetary exploration, offering a cost-effective and efficient solution for future extraterrestrial missions.

NeurIPS Conference 2025 Conference Paper

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Ran Xu
Yuchen Zhuang
Zihan Dong
Ruiyu Wang
Yue Yu
Joyce Ho
Linjun Zhang
Haoyu Wang

Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the giant DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9× more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks.

NeurIPS Conference 2025 Conference Paper

DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs

Zhenhailong Wang
Senthil Purushwalkam
Caiming Xiong
Silvio Savarese
Heng Ji
Ran Xu

We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically determines token length based on the image content, not just resolution, and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures. Furthermore, qualitative analyses show that the adaptive token reduction from DToMe aligns well with human perception and enables users to better control computational costs through flexible integration with additional vision tools and models.
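
The idea of merging similar visual tokens so that simple images yield fewer tokens can be illustrated with a minimal sketch. The abstract does not specify how DToMe selects or merges tokens, so the greedy cosine-similarity rule, the `threshold` parameter, and the function names below are assumptions chosen for illustration only; Virtual Token Unmerging is not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_similar_tokens(tokens, threshold=0.9):
    """Hypothetical greedy merge: fold each token into the first kept token whose
    running mean is at least `threshold` similar; otherwise keep it as a new token.
    Returns the merged token vectors and how many originals each one absorbed."""
    merged = []  # each entry is [running_mean_vector, count]
    for tok in tokens:
        for entry in merged:
            mean, count = entry
            if cosine(mean, tok) >= threshold:
                entry[0] = [(m * count + t) / (count + 1) for m, t in zip(mean, tok)]
                entry[1] = count + 1
                break
        else:
            # No sufficiently similar token was found, so this one is kept as-is.
            merged.append([list(tok), 1])
    return [mean for mean, _ in merged], [count for _, count in merged]
```

Under this rule a uniform image (all tokens nearly identical) collapses to a single token, while an image with varied content keeps more tokens, which is the content-dependent behavior the abstract describes.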

AAAI Conference 2025 Conference Paper

Text2Data: Low-Resource Data Generation with Textual Control

Shiyu Wang
Yihao Feng
Tian Lan
Ning Yu
Yu Bai
Ran Xu
Huan Wang
Caiming Xiong

Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in text-to-data generation spanning image editing, audio synthesis, video creation, and beyond, low-resource areas characterized by expensive annotations or complex data structures, such as molecules, motion dynamics, and time series, often lack textual labels. This deficiency impedes supervised learning, thereby constraining the application of advanced generative models for text-to-data tasks. In response to these challenges in the low-resource scenario, we propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model. Subsequently, it undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting. Comprehensive experiments demonstrate that Text2Data is able to achieve enhanced performance regarding controllability across various modalities, including molecules, motions and time series, when compared to existing baselines.

NeurIPS Conference 2024 Conference Paper

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Anas Awadalla
Le Xue
Oscar Lo
Manli Shu
Hannah Lee
Etash Guha
Matt Jordan
Sheng Shen

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. We release our data at https://github.com/mlfoundations/MINT-1T.

AAAI Conference 2023 Conference Paper

Neighborhood-Regularized Self-Training for Learning with Few Labels

Ran Xu
Yue Yu
Hejie Cui
Xuan Kan
Yanqiao Zhu
Joyce Ho
Chao Zhang
Carl Yang

Training deep neural networks (DNNs) with limited supervision has been a popular research topic as it can significantly alleviate the annotation burden. Self-training has been successfully applied in semi-supervised learning tasks, but one drawback of self-training is that it is vulnerable to the label noise from incorrect pseudo labels. Inspired by the fact that samples with similar labels tend to share similar representations, we develop a neighborhood-based sample selection approach to tackle the issue of noisy pseudo labels. We further stabilize self-training via aggregating the predictions from different rounds during sample selection. Experiments on eight tasks show that our proposed method outperforms the strongest self-training baseline with 1.83% and 2.51% performance gain for text and graph datasets on average. Our further analysis demonstrates that our proposed data selection strategy reduces the noise of pseudo labels by 36.8% and saves 57.3% of the time when compared with the best baseline. Our code and appendices will be uploaded to: https://github.com/ritaranx/NeST.
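
The two ideas in this abstract, selecting pseudo-labeled samples by their neighborhood and aggregating predictions across rounds, can be sketched as follows. This is a simplified illustration and not the authors' released code: the k-nearest-neighbor agreement score, the `min_agreement` cutoff, plain averaging over rounds, and all function names are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_rounds(prob_rounds):
    """Average per-sample class probabilities over several self-training rounds.
    prob_rounds[r][i][c] is the probability of class c for sample i in round r."""
    n_rounds = len(prob_rounds)
    n_samples = len(prob_rounds[0])
    n_classes = len(prob_rounds[0][0])
    return [
        [sum(r[i][c] for r in prob_rounds) / n_rounds for c in range(n_classes)]
        for i in range(n_samples)
    ]


def select_by_neighborhood(unlabeled_emb, probs, labeled_emb, labels,
                           k=3, min_agreement=0.6):
    """Keep an unlabeled sample only if its pseudo label agrees with enough of
    its k nearest labeled neighbors in embedding space (assumed criterion)."""
    selected = []
    for i, (emb, p) in enumerate(zip(unlabeled_emb, probs)):
        pseudo = max(range(len(p)), key=p.__getitem__)  # argmax class
        nearest = sorted(
            range(len(labeled_emb)),
            key=lambda j: squared_distance(emb, labeled_emb[j]),
        )[:k]
        agreement = sum(labels[j] == pseudo for j in nearest) / len(nearest)
        if agreement >= min_agreement:
            selected.append((i, pseudo))
    return selected
```

A sample whose pseudo label contradicts the labels of its nearest labeled neighbors is treated as likely noise and left out of the next training round.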

NeurIPS Conference 2023 Conference Paper

Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting

Hejie Cui
Xinyu Fang
Zihan Zhang
Ran Xu
Xuan Kan
Xin Liu
Yue Yu
Manling Li

Images contain rich relational knowledge that can help machines understand the world. Existing methods on visual knowledge extraction often rely on the pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we take a first step toward a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector to detect regions potentially containing relational knowledge and a visual knowledge generator that generates format-free knowledge by prompting the large multimodality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the extracted open visual knowledge by OpenVik. Moreover, integrating our extracted knowledge across various visual reasoning applications shows consistent improvements, indicating the real-world applicability of OpenVik.

AAAI Conference 2023 Conference Paper

Tackling Data Heterogeneity in Federated Learning with Class Prototypes

Yutong Dai
Zeyuan Chen
Junnan Li
Shelby Heinecke
Lichao Sun
Ran Xu

Data heterogeneity across clients in federated learning (FL) settings is a widely acknowledged challenge. In response, personalized federated learning (PFL) emerged as a framework to curate local models for clients' tasks. In PFL, a common strategy is to develop local and global models jointly - the global model (for generalization) informs the local models, and the local models (for personalization) are aggregated to update the global model. A key observation is that if we can improve the generalization ability of local models, then we can improve the generalization of global models, which in turn builds better personalized models. In this work, we consider class imbalance, an overlooked type of data heterogeneity, in the classification setting. We propose FedNH, a novel method that improves the local models' performance for both personalization and generalization by combining the uniformity and semantics of class prototypes. FedNH initially distributes class prototypes uniformly in the latent space and smoothly infuses the class semantics into class prototypes. We show that imposing uniformity helps to combat prototype collapse while infusing class semantics improves local models. Extensive experiments were conducted on popular classification datasets under the cross-device setting. Our results demonstrate the effectiveness and stability of our method over recent works.
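
The two prototype steps described above, a uniform initialization followed by a smooth infusion of class semantics, can be pictured with a small sketch. The abstract gives no formulas, so everything here is an assumption for illustration: a 2-D latent space with prototypes at equal angles on the unit circle, a moving-average update with a hypothetical smoothing factor `rho`, and the function names.

```python
import math


def uniform_prototypes_2d(num_classes):
    """Place one unit-norm prototype per class at equal angles on the unit circle
    (an illustrative 2-D stand-in for a uniform spread in latent space)."""
    return [
        [math.cos(2 * math.pi * c / num_classes),
         math.sin(2 * math.pi * c / num_classes)]
        for c in range(num_classes)
    ]


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def smooth_update(prototypes, class_means, rho=0.9):
    """Blend each prototype with the normalized mean embedding of its class,
    then renormalize. A mean of None marks a class with no samples this round."""
    updated = []
    for proto, mean in zip(prototypes, class_means):
        if mean is None:
            updated.append(list(proto))  # nothing observed, keep the prototype
            continue
        mixed = [rho * p + (1 - rho) * m for p, m in zip(proto, normalize(mean))]
        updated.append(normalize(mixed))
    return updated
```

A `rho` close to 1 keeps the prototypes near their uniform starting positions, while smaller values let the observed class embeddings pull them more strongly.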

NeurIPS Conference 2023 Conference Paper

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Can Qin
Shu Zhang
Ning Yu
Yihao Feng
Xinyi Yang
Yingbo Zhou
Huan Wang
Juan Carlos Niebles

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.

IJCAI Conference 2022 Conference Paper

Visual Emotion Representation Learning via Emotion-Aware Pre-training

Yue Zhang
Wanying Ding
Ran Xu
Xiaohua Hu

Despite recent progress in deep learning, visual emotion recognition remains a challenging problem due to the ambiguity of emotion perception, the diverse concepts related to visual emotion, and the lack of large-scale annotated datasets. In this paper, we present a large-scale multimodal pre-training method to learn visual emotion representation by aligning (emotion, object, attribute) triplets with a contrastive loss. We conduct our pre-training on a large web dataset with noisy tags and fine-tune on visual emotion classification datasets. Our method achieves state-of-the-art performance for visual emotion classification.

JBHI Journal 2021 Journal Article

Identification of lncRNA Signature Associated With Pan-Cancer Prognosis

Guoqing Bao
Ran Xu
Xiuying Wang
Jianxiong Ji
Linlin Wang
Wenjie Li
Qing Zhang
Bin Huang

Long noncoding RNAs (lncRNAs) have emerged as potential prognostic markers in various human cancers as they participate in many malignant behaviors. However, the value of lncRNAs as prognostic markers among diverse human cancers is still under investigation, and a systematic signature based on these transcripts that relates to pan-cancer prognosis has yet to be reported. In this study, we proposed a framework to incorporate statistical power, biological rationale, and machine learning models for pan-cancer prognosis analysis. The framework identified a 5-lncRNA signature (ENSG00000206567, PCAT29, ENSG00000257989, LOC388282, and LINC00339) from TCGA training studies (n = 1,878). The identified lncRNAs are significantly associated (all P ≤ 1.48E-11) with overall survival (OS) of the TCGA cohort (n = 4,231). The signature stratified the cohort into low- and high-risk groups with significantly distinct survival outcomes (median OS of 9.84 years versus 4.37 years, log-rank P = 1.48E-38) and achieved a time-dependent ROC/AUC of 0.66 at 5 years. After routine clinical factors were included, the signature demonstrated better performance for long-term prognostic estimation (AUC of 0.72). Moreover, the signature was further evaluated on two independent external cohorts (TARGET, n = 1,122; CPTAC, n = 391; National Cancer Institute), which yielded similar prognostic values (AUC of 0.60 and 0.75; log-rank P = 8.6E-09 and P = 2.7E-06). An indexing system was developed to map the 5-lncRNA signature to prognoses of pan-cancer patients. In silico functional analysis indicated that the lncRNAs are associated with common biological processes driving human cancers. The five lncRNAs, especially ENSG00000206567, ENSG00000257989, and LOC388282, which have not been reported before, may serve as viable molecular targets common among diverse cancers.

AAAI Conference 2015 Conference Paper

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Ran Xu
Caiming Xiong
Wei Chen
Jason Corso

Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on exploring the language model on a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds a sentence into a continuous vector space, which preserves visually grounded meanings and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, we minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update these two models jointly. Based on these three parts, our system is able to accomplish three tasks: 1) natural language generation, 2) video retrieval, and 3) language retrieval. In the experiments, the results show our approach outperforms SVM, CRF and CCA baselines in predicting Subject-Verb-Object triplets and natural sentence generation, and is better than CCA in video retrieval and language retrieval tasks.
