Arrow Research search

Author name cluster

Haoran Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers (13)

AAAI 2026 · Conference Paper

Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

  • Sicheng Xie
  • Haidong Cao
  • Zejia Weng
  • Zhen Xing
  • Haoran Chen
  • Shiwei Shen
  • Jiaqi Leng
  • Zuxuan Wu

Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing methods, which often rely on coarsely-aligned video pairs, are typically constrained to learning global or task-level features. As a result, they tend to neglect the fine-grained frame-level dynamics required for complex manipulation and generalization to novel tasks. We posit that this limitation stems from a vicious circle of inadequate datasets and the methods they inspire. To break this cycle, we propose a paradigm shift that treats fine-grained human-robot alignment as a conditional video generation problem. To this end, we first introduce H&R, a novel third-person dataset containing 2,600 episodes of precisely synchronized human and robot motions, collected using a VR teleoperation system. We then present Human2Robot, a framework designed to leverage this data. Human2Robot employs a Video Prediction Model to learn a rich and implicit representation of robot dynamics by generating robot videos from human input, which in turn guides a decoupled action decoder. Our real-world experiments demonstrate that this approach not only achieves high performance on seen tasks but also exhibits significant one-shot generalization to novel positions, objects, instances, and even new task categories.
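
The abstract's decoupled design (a video model that distills robot-dynamics latents from human video, feeding a separate action decoder) can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in: the module names, layer sizes, and action-chunk format are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the decoupled design: a video model produces latent robot-dynamics
# features from human frames; a separate action decoder maps latents to actions.
# All modules and sizes are hypothetical, not the Human2Robot architecture.
import torch
import torch.nn as nn

class VideoPredictor(nn.Module):
    """Stand-in for the Video Prediction Model: human frames -> robot latents."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, latent_dim),
        )
    def forward(self, human_video):          # (B, 3, T, H, W)
        return self.backbone(human_video)    # (B, latent_dim)

class ActionDecoder(nn.Module):
    """Decoupled head: robot-dynamics latents -> a chunk of robot actions."""
    def __init__(self, latent_dim=256, action_dim=7, horizon=8):
        super().__init__()
        self.head = nn.Linear(latent_dim, action_dim * horizon)
        self.action_dim, self.horizon = action_dim, horizon
    def forward(self, z):
        return self.head(z).view(-1, self.horizon, self.action_dim)

video_model, decoder = VideoPredictor(), ActionDecoder()
human_clip = torch.randn(2, 3, 16, 64, 64)   # toy batch of human video
actions = decoder(video_model(human_clip))
print(actions.shape)                          # torch.Size([2, 8, 7])
```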

AAAI 2026 · Conference Paper

Learning to Cluster Rare Cell Types: Implicit Semantic Data Augmentation for Spatial Multi-modal Omics Analysis

  • Daixian Liu
  • Hau-Sing So
  • Haoran Chen
  • Jiao Li
  • Shanshan Wang
  • Mengzhu Wang
  • Jingcai Guo

Spatial multi-modal omics technologies have transformed biological research by enabling the simultaneous profiling of gene expression, protein abundance, and chromatin accessibility within their native spatial contexts. Despite these advances, accurately clustering rare cell types remains a major challenge due to data sparsity, high dimensionality, and limited annotated samples. While Graph Neural Networks (GNNs) have shown potential in modeling spatial omics data, their effectiveness is often constrained by the use of fixed K-nearest neighbor (KNN) graph structures, which fail to capture latent semantic relationships masked by sequencing noise. To overcome these limitations, we propose CRCT (Clustering Rare Cell Types): a novel framework that combines Implicit Semantic Data Augmentation (ISDA) with adaptive graph learning for spatial multi-modal omics analysis. Unlike traditional augmentation strategies that generate explicit synthetic samples, CRCT operates in the deep feature space by dynamically estimating intra-class covariance matrices and implicitly perturbing features along semantically meaningful directions. This enables effective augmentation for rare cell populations while preserving biological fidelity. Extensive experiments across four real-world datasets (HLN, MB, Stereo-CITE-seq, and SPOTS) and one synthetic benchmark demonstrate the state-of-the-art performance of CRCT, achieving improvements of up to +1.7 NMI and +7.8 ARI over strong baseline methods.
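
The implicit-augmentation idea described above (per-class covariance estimation plus feature-space perturbation along semantic directions) can be sketched as follows. This is a simplified illustration under assumed shapes, not the CRCT objective itself; `class_covariances`, `implicit_augment`, and the strength `lam` are hypothetical names and choices.

```python
# Hedged sketch of implicit semantic augmentation: estimate per-class feature
# covariance, then perturb each feature with noise drawn from its class
# covariance instead of synthesizing explicit samples. Illustrative only.
import torch

def class_covariances(features, labels, num_classes, eps=1e-3):
    """Per-class covariance matrices of a (N, D) feature batch."""
    d = features.size(1)
    covs = []
    for c in range(num_classes):
        fc = features[labels == c]
        if fc.size(0) < 2:                  # rare class: fall back to identity
            covs.append(torch.eye(d) * eps)
            continue
        centered = fc - fc.mean(dim=0, keepdim=True)
        covs.append(centered.T @ centered / (fc.size(0) - 1) + eps * torch.eye(d))
    return torch.stack(covs)                # (C, D, D)

def implicit_augment(features, labels, covs, lam=0.5):
    """Sample f + delta with delta ~ N(0, lam * Sigma_class)."""
    scale = torch.linalg.cholesky(lam * covs[labels])      # (N, D, D)
    noise = torch.randn(features.size(0), features.size(1), 1)
    return features + (scale @ noise).squeeze(-1)

feats = torch.randn(32, 16)
labels = torch.randint(0, 4, (32,))
aug = implicit_augment(feats, labels, class_covariances(feats, labels, 4))
print(aug.shape)  # torch.Size([32, 16])
```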

IROS 2025 · Conference Paper

Aerial Grasping via Maximizing Delta-Arm Workspace Utilization

  • Haoran Chen
  • Weiliang Deng
  • Biyu Ye
  • Yifan Xiong
  • Zongliang Pan
  • Ximin Lyu

Workspace limitations restrict the operational capabilities and range of motion for systems with robotic arms. Maximizing workspace utilization has the potential to provide better solutions for aerial manipulation tasks, increasing the system’s flexibility and operational efficiency. In this paper, we introduce a novel planning framework for aerial grasping that maximizes workspace utilization. We formulate an optimization problem to optimize the aerial manipulator’s trajectory, incorporating task constraints to achieve efficient manipulation. To address the challenge of incorporating the delta arm’s non-convex workspace into optimization constraints, we leverage a Multilayer Perceptron (MLP) to map the point positions to feasibility probabilities. Furthermore, we employ Reversible Residual Networks (RevNet) to approximate the complex forward kinematics of the delta arm, utilizing its efficient model gradients to further eliminate workspace constraints. We validate our methods in simulations and real-world experiments to demonstrate their effectiveness.
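
The workspace-feasibility trick lends itself to a compact sketch: a small MLP maps a 3-D end-effector position to a reachability probability, giving a differentiable stand-in for the non-convex workspace constraint. The architecture, the toy ball-shaped "workspace", and the log-barrier penalty below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: an MLP learns P(point is inside the workspace), then serves
# as a differentiable penalty inside a trajectory optimizer. Illustrative only.
import torch
import torch.nn as nn

feasibility_mlp = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),      # P(point is reachable)
)

# Toy training target: treat a ball of radius 0.3 m as the "workspace".
points = torch.rand(4096, 3) * 0.8 - 0.4
labels = (points.norm(dim=1) < 0.3).float().unsqueeze(1)

opt = torch.optim.Adam(feasibility_mlp.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy(feasibility_mlp(points), labels)
    loss.backward()
    opt.step()

# Differentiable penalty for a waypoint: push it toward feasible regions.
waypoint = torch.tensor([[0.25, 0.10, 0.05]], requires_grad=True)
penalty = -torch.log(feasibility_mlp(waypoint) + 1e-6).sum()
penalty.backward()
print(waypoint.grad)   # gradient usable inside trajectory optimization
```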

NeurIPS 2025 · Conference Paper

ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection

  • Zhihao Sun
  • Haoran Jiang
  • Haoran Chen
  • Yixin Cao
  • Xipeng Qiu
  • Zuxuan Wu
  • Yu-Gang Jiang

Multimodal large language models (M-LLMs) have unlocked new possibilities for various multimodal tasks. However, their potential in image manipulation detection (IMD) remains unexplored. When directly applied to the IMD task, M-LLMs often produce reasoning texts that suffer from hallucinations and overthinking. To address this, we propose ForgerySleuth, which leverages M-LLMs to perform comprehensive clue fusion and generate segmentation outputs indicating the specific regions that have been tampered with. Moreover, we construct the ForgeryAnalysis dataset through the Chain-of-Clues prompt, which includes analysis and reasoning text to upgrade the image manipulation detection task. A data engine is also introduced to build a larger-scale dataset for the pre-training phase. Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.
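
In very loose form, the pipeline couples a language-side reasoning embedding with image features to decode a pixel mask of tampered regions; a toy sketch follows. Every module and dimension here is a hypothetical stand-in; the paper's actual M-LLM and clue-fusion components are far richer.

```python
# Loose sketch: condition image features on a text/reasoning embedding, then
# decode a per-pixel tamper probability. Hypothetical stand-in modules only.
import torch
import torch.nn as nn

class TamperMaskDecoder(nn.Module):
    def __init__(self, text_dim=512, img_channels=64):
        super().__init__()
        self.img_enc = nn.Conv2d(3, img_channels, 3, padding=1)
        self.text_proj = nn.Linear(text_dim, img_channels)
        self.head = nn.Conv2d(img_channels, 1, 1)
    def forward(self, image, text_emb):
        feat = torch.relu(self.img_enc(image))           # (B, C, H, W)
        cond = self.text_proj(text_emb)[:, :, None, None]
        return torch.sigmoid(self.head(feat * cond))     # per-pixel tamper prob

model = TamperMaskDecoder()
mask = model(torch.randn(1, 3, 128, 128), torch.randn(1, 512))
print(mask.shape)  # torch.Size([1, 1, 128, 128])
```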

NeurIPS 2023 · Conference Paper

Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation

  • Haoran Chen
  • Xintong Han
  • Zuxuan Wu
  • Yu-Gang Jiang

Most existing methods for unsupervised domain adaptation (UDA) rely on a shared network to extract domain-invariant features. However, when facing multiple source domains, optimizing such a network involves updating the parameters of the entire network, making it both computationally expensive and challenging, particularly when coupled with min-max objectives. Inspired by recent advances in prompt learning that adapt high-capacity models to downstream tasks in a computationally economical way, we introduce Multi-Prompt Alignment (MPA), a simple yet efficient framework for multi-source UDA. Given a source and target domain pair, MPA first trains an individual prompt to minimize the domain gap through a contrastive loss. Then, MPA denoises the learned prompts through an auto-encoding process and aligns them by maximizing the agreement of all the reconstructed prompts. Moreover, we show that the resulting subspace acquired from the auto-encoding process can easily generalize to a streamlined set of target domains, making our method more efficient for practical usage. Extensive experiments show that MPA achieves state-of-the-art results on three popular datasets with an impressive average accuracy of 54.1% on DomainNet.
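
The alignment stage described above can be sketched compactly: per-domain-pair prompts pass through a shared auto-encoder, and the reconstructed prompts are pulled toward agreement. The prompt shape, bottleneck size, and the mean-based agreement measure below are illustrative assumptions, not the MPA objective verbatim.

```python
# Hedged sketch of prompt denoising + alignment: reconstruct each learned
# prompt with a shared auto-encoder and maximize agreement among the
# reconstructions. Dimensions and losses are illustrative choices.
import torch
import torch.nn as nn

n_pairs, prompt_len, dim, bottleneck = 3, 16, 512, 64
prompts = nn.Parameter(torch.randn(n_pairs, prompt_len * dim))

autoencoder = nn.Sequential(
    nn.Linear(prompt_len * dim, bottleneck), nn.ReLU(),
    nn.Linear(bottleneck, prompt_len * dim),
)

recon = autoencoder(prompts)                       # denoised prompts
recon_loss = (recon - prompts).pow(2).mean()

# Agreement: pull every reconstructed prompt toward the mean reconstruction.
agreement_loss = (recon - recon.mean(dim=0, keepdim=True)).pow(2).mean()
loss = recon_loss + agreement_loss
loss.backward()
print(float(loss))
```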

NeurIPS 2022 · Conference Paper

TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training

  • Yulong Liu
  • Guibo Zhu
  • Bin Zhu
  • Qi Song
  • Guojing Ge
  • Haoran Chen
  • GuanHui Qiao
  • Ru Peng

Vision-Language Pre-training (VLP) has been shown to be an efficient method to improve the performance of models on different vision-and-language downstream tasks. Substantial studies have shown that neural networks may be able to learn some general rules about language and visual concepts from a large-scale, weakly labeled image-text dataset. However, most of the public cross-modal datasets that contain more than 100M image-text pairs are in English; there is a lack of large-scale, high-quality Chinese VLP datasets. In this work, we propose a new framework for automatic dataset acquisition and cleaning, with which we construct a new large-scale, high-quality cross-modal dataset named TaiSu, containing 166 million images and 219 million Chinese captions. Compared with the recently released Wukong dataset, our dataset is built under much stricter restrictions on the semantic correlation of image-text pairs. We also propose to combine texts collected from the web with texts generated by a pre-trained image-captioning model. To the best of our knowledge, TaiSu is currently the largest publicly accessible Chinese cross-modal dataset. Furthermore, we test our dataset on several vision-language downstream tasks. TaiSu outperforms BriVL by a large margin on the zero-shot image-text retrieval task and zero-shot image classification task. TaiSu also shows better performance than Wukong on the image-retrieval task without using image augmentation for training. The results demonstrate that TaiSu can serve as a promising VLP dataset for both understanding and generative tasks. More information is available at https://github.com/ksOAn6g5/TaiSu.
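
The cleaning step's core idea, keeping a pair only if a cross-modal scorer rates its image-text semantic correlation above a threshold, can be sketched as below. The embeddings are random stand-ins and the 0.2 threshold is invented for illustration; TaiSu's actual pipeline and thresholds are not reproduced here.

```python
# Toy sketch of semantic-correlation filtering: score each image-text pair
# with cosine similarity of (pretrained) embeddings and keep high scorers.
# Embeddings and threshold below are illustrative stand-ins.
import torch

def semantic_score(image_emb, text_emb):
    """Cosine similarity between normalized embeddings."""
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(-1)

img_embs = torch.randn(1000, 256)   # stand-ins for image-encoder outputs
txt_embs = torch.randn(1000, 256)   # stand-ins for text-encoder outputs
keep = semantic_score(img_embs, txt_embs) > 0.2   # illustrative threshold
print(f"kept {int(keep.sum())} of 1000 pairs")
```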

AAAI 2020 · Conference Paper

TextScanner: Reading Characters in Order for Robust Scene Text Recognition

  • Zhaoyi Wan
  • Minghang He
  • Haoran Chen
  • Xiang Bai
  • Cong Yao

Driven by deep learning and a large volume of data, scene text recognition has evolved rapidly in recent years. Formerly, RNN-attention-based methods dominated this field, but they suffer from the problem of attention drift in certain situations. Lately, semantic-segmentation-based algorithms have proven effective at recognizing text of different forms (horizontal, oriented, and curved). However, these methods may produce spurious characters or miss genuine characters, as they rely heavily on a thresholding procedure operated on segmentation maps. To tackle these challenges, we propose an alternative approach, called TextScanner, for scene text recognition. TextScanner bears three characteristics: (1) basically, it belongs to the semantic segmentation family, as it generates pixel-wise, multi-channel segmentation maps for character class, position, and order; (2) meanwhile, akin to RNN-attention-based methods, it adopts an RNN for context modeling; (3) moreover, it performs parallel prediction of character position and class, and ensures that characters are transcribed in the correct order. Experiments on standard benchmark datasets demonstrate that TextScanner outperforms the state-of-the-art methods. Moreover, TextScanner shows its superiority in recognizing more difficult text, such as Chinese transcripts, and in aligning with target characters.
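
Characteristic (1) and the ordered transcription in (3) suggest a simple decoding sketch: weight a pixel-wise character-class map by one "order" heatmap per reading slot, then take the argmax per slot. The maps below are synthetic tensors with assumed shapes; in the real model a segmentation network predicts them.

```python
# Sketch of order-map decoding: for slot k, score(c) is the order-map-weighted
# sum of per-pixel class probabilities. Synthetic maps for illustration only.
import torch

H, W, n_classes, max_chars = 8, 32, 37, 5   # 36 characters + background
class_map = torch.softmax(torch.randn(n_classes, H, W), dim=0)
order_maps = torch.softmax(torch.randn(max_chars, H * W), dim=1).view(max_chars, H, W)

# score[k, c] = sum over pixels of order_k(pixel) * P(class = c | pixel)
scores = torch.einsum('khw,chw->kc', order_maps, class_map)
decoded = scores.argmax(dim=1)
print(decoded)   # five class indices, one per reading slot, in order
```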

IJCAI 2019 · Conference Paper

Triplet Enhanced AutoEncoder: Model-free Discriminative Network Embedding

  • Yao Yang
  • Haoran Chen
  • Junming Shao

Deep autoencoders are widely used in dimensionality reduction because of the expressive power of neural networks. They are therefore naturally suitable for embedding tasks, which essentially compress high-dimensional information into a low-dimensional latent space. In terms of network representation, autoencoder-based methods such as SDNE and DNGR have achieved results comparable to the state of the art. However, none of them leverages label information, so the resulting embeddings lack discriminative power. In this paper, we present Triplet Enhanced AutoEncoder (TEA), a new deep network embedding approach from the perspective of metric learning. Equipped with the triplet-loss constraint, the proposed approach not only captures the topological structure but also preserves discriminative information. Moreover, unlike existing discriminative embedding techniques, TEA is independent of any specific classifier, a property we call model-free. Extensive empirical results on three public datasets (i.e., Cora, Citeseer, and BlogCatalog) show that TEA is stable and achieves state-of-the-art performance compared with both supervised and unsupervised network embedding approaches across various percentages of labeled data. The source code can be obtained from https://github.com/yybeta/TEA.
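
The stated objective, reconstruction for topology plus a triplet constraint on latent codes for discrimination, reduces to a compact sketch. Layer sizes, the margin, and the use of PyTorch's built-in `TripletMarginLoss` are assumptions of this illustration, not details from the paper.

```python
# Hedged sketch of a triplet-enhanced autoencoder: reconstruction loss on the
# input plus a triplet margin loss on latent codes. Illustrative sizes only.
import torch
import torch.nn as nn

class TripletAE(nn.Module):
    def __init__(self, in_dim=128, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, in_dim))
    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

model = TripletAE()
triplet = nn.TripletMarginLoss(margin=1.0)

anchor, pos, neg = (torch.randn(16, 128) for _ in range(3))
za, xa = model(anchor)
zp, _ = model(pos)
zn, _ = model(neg)
loss = (xa - anchor).pow(2).mean() + triplet(za, zp, zn)
loss.backward()
print(float(loss))
```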

IJCAI 2017 · Conference Paper

Locality Preserving Projections for Grassmann Manifold

  • Boyue Wang
  • Yongli Hu
  • Junbin Gao
  • Yanfeng Sun
  • Haoran Chen
  • Muhammad Ali
  • Baocai Yin

Learning on the Grassmann manifold has become popular in many computer vision tasks, thanks to its strong capability to extract discriminative information from image sets and videos. However, such learning algorithms, particularly on high-dimensional Grassmann manifolds, involve significantly high computational cost, which seriously limits their applicability in wider areas. In this research, we propose an unsupervised dimensionality reduction algorithm on the Grassmann manifold based on the Locality Preserving Projections (LPP) criterion. LPP is a commonly used dimensionality reduction algorithm for vector-valued data, aiming to preserve the local structure of data in the dimension-reduced space. The strategy is to construct a mapping from a high-dimensional Grassmann manifold to a lower-dimensional one with more discriminative capability. The proposed method can be optimized as a basic eigenvalue problem. Its performance is assessed on several classification and clustering tasks, and the experimental results show its clear advantages over other Grassmann-based algorithms.
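
The "basic eigenvalue problem" the method reduces to is easiest to see in the classical vector-valued LPP it builds on, which solves the generalized eigenproblem X L X^T a = lambda X D X^T a. The sketch below shows that baseline with an invented full-graph heat-kernel affinity; the paper's Grassmann extension (projection-metric affinities between subspaces) is not reproduced.

```python
# Classical LPP as a generalized eigenproblem: build an affinity graph W,
# form the Laplacian L = D - W, and take the smallest generalized eigenvectors
# of (X L X^T, X D X^T). Affinity and data here are illustrative.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))            # (features, samples)

# Heat-kernel affinity over all pairs (a kNN graph is typical; full graph
# is kept here for brevity).
d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))
L = D - W                                     # graph Laplacian

# Smallest generalized eigenvectors give the locality-preserving projection.
A, B = X @ L @ X.T, X @ D @ X.T
eigvals, eigvecs = eigh(A, B)                 # ascending eigenvalues
projection = eigvecs[:, :5]                   # map to a 5-D subspace
print(projection.shape)                       # (20, 5)
```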