Arrow Research search

Author name cluster

Haoyuan Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers

11

JBHI Journal 2026 Journal Article

A Multi-Scale Neighbor Topology Guided Transformer and Kolmogorov-Arnold Network Enhanced Feature Learning Model for Disease-Related circRNA Prediction

  • Ping Xuan
  • Haoyuan Li
  • Hui Cui
  • Zelong Xu
  • Toshiya Nakaguchi
  • Tiangang Zhang

As circular non-coding RNA (circRNA) is closely associated with various human diseases, identifying disease-related circRNAs can provide a deeper understanding of the mechanisms underlying disease pathogenesis. Advanced circRNA-disease association prediction methods mainly focus on graph learning techniques such as graph convolutional networks. However, these methods do not fully encode the multi-scale neighbor topologies of each node or the dependencies among the pairwise attributes. We propose a multi-scale neighbor topology-guided transformer with Kolmogorov-Arnold network (KAN) enhanced feature learning for circRNA-disease association prediction, termed MKCD. First, MKCD incorporates an adaptive multi-scale neighbor topology embedding construction strategy (AMNE), which generates neighbor topologies covering varying scopes of neighbors by random walks. Second, we design a dynamic multi-scale neighbor topology-guided transformer (DMTT) that leverages the multi-scale neighbor topologies to guide the learning of relationships among circRNA, miRNA, and disease nodes. The multi-scale neighbor topology is dynamically evolved, providing adaptive guidance to the transformer’s learning process. Third, we establish a feature-gated network (FGN) to evaluate the importance of topological features and the original node attributes. Finally, we propose an adaptive joint convolutional neural network and KAN learning strategy (ACK) to learn the global and local dependencies of pairwise features. Comprehensive comparison experiments show that MKCD outperforms six state-of-the-art methods, improving AUC and AUPR by at least 14.1% and 7.6%, respectively. Ablation experiments further validate the effectiveness of the AMNE, DMTT, FGN, and ACK innovations. Case studies on three diseases further validate the application value of our method in discovering reliable circRNA candidates for the diseases.
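
The abstract states only that AMNE derives multi-scale neighbor topologies from random walks; the sketch below is a hedged, minimal illustration of that general idea (stacking k-step random-walk distributions per node), with the normalization and scale choices being assumptions rather than MKCD's actual implementation.

```python
# Hypothetical sketch: multi-scale neighbor topology features from random walks.
import numpy as np

def multi_scale_neighbor_topology(adj: np.ndarray, scales=(1, 2, 3)) -> np.ndarray:
    """Stack k-step random-walk distributions as per-node topology features."""
    # Row-normalize the adjacency matrix into a random-walk transition matrix.
    deg = adj.sum(axis=1, keepdims=True)
    trans = adj / np.maximum(deg, 1e-12)

    topologies = []
    walk = np.eye(adj.shape[0])
    step = 0
    for k in sorted(scales):
        # Advance the walk to k steps; row i then holds the probability of
        # landing on each node after exactly k hops starting from node i.
        while step < k:
            walk = walk @ trans
            step += 1
        topologies.append(walk.copy())

    # Concatenate the scales so each node gets one multi-scale topology vector.
    return np.concatenate(topologies, axis=1)

# Toy symmetric adjacency over a few circRNA/miRNA/disease nodes.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
print(multi_scale_neighbor_topology(adj).shape)  # (4, 12): 3 scales of 4-dim walks per node
```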

AAAI Conference 2026 Conference Paper

CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

  • Guanghao Zhang
  • Tao Zhong
  • Yan Xia
  • Mushui Liu
  • Zhelun Yu
  • Haoyuan Li
  • Wanggui He
  • Dong She

While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. In contrast, humans engaging in sophisticated multi-image analysis typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) The construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) The introduction of a test-time memory augmentation module that expands the model’s reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
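
The abstract describes a test-time memory that stores critical visual concepts across reasoning steps without adding trained parameters. As a hedged illustration only, the sketch below implements a toy key-value memory over region tokens with cosine-similarity retrieval; the interface and the retrieval rule are assumptions, not CMMCoT's actual module.

```python
# Hedged sketch: a minimal test-time memory for visual region tokens.
import torch
import torch.nn.functional as F

class RegionTokenMemory:
    def __init__(self):
        self._slots = []  # stored region-token tensors, one per write

    def write(self, region_token: torch.Tensor) -> None:
        # Store a critical visual region token produced at a reasoning step.
        self._slots.append(region_token.detach())

    def read(self, query: torch.Tensor, top_k: int = 2) -> torch.Tensor:
        # Retrieve the stored tokens most similar to the current query token.
        if not self._slots:
            return query.new_zeros(0, query.shape[-1])
        bank = torch.stack(self._slots)                           # (n, d)
        sims = F.cosine_similarity(bank, query.unsqueeze(0), dim=-1)
        idx = sims.topk(min(top_k, bank.shape[0])).indices
        return bank[idx]

memory = RegionTokenMemory()
memory.write(torch.randn(256))   # region token from a step grounded in image A
memory.write(torch.randn(256))   # region token from a step grounded in image B
print(memory.read(torch.randn(256)).shape)  # torch.Size([2, 256])
```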

AAAI Conference 2026 Conference Paper

MAU-GPT: Enhancing Multi-type Industrial Anomaly Understanding via Anomaly-aware and Generalist Experts Adaptation

  • Zhuonan Wang
  • Zhenxuan Fan
  • Siwen Tan
  • Yu Zhong
  • Yuqian Yuan
  • Haoyuan Li
  • Hao Jiang
  • Wenqiao Zhang

As industrial manufacturing scales, automating fine-grained product image analysis has become critical for quality control. However, existing approaches are hindered by limited dataset coverage and poor model generalization across diverse and complex anomaly patterns. To address these challenges, we introduce MAU-Set, a comprehensive dataset for Multi-type industrial Anomaly Understanding. It spans multiple industrial domains and features a hierarchical task structure, ranging from binary classification to complex reasoning. Alongside this dataset, we establish a rigorous evaluation protocol to facilitate fair and comprehensive model assessment. Building upon this foundation, we further present MAU-GPT, a domain-adapted multimodal large model specifically designed for industrial anomaly understanding. It incorporates a novel AMoE-LoRA mechanism that unifies anomaly-aware and generalist experts adaptation, enhancing both understanding and reasoning across diverse defect classes. Extensive experiments show that MAU-GPT consistently outperforms prior state-of-the-art methods across all domains, demonstrating strong potential for scalable and automated industrial inspection.
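
The abstract names an AMoE-LoRA mechanism that combines anomaly-aware and generalist experts but gives no architectural detail; the sketch below is a hedged, generic mixture-of-LoRA-experts linear layer, where the gating scheme, ranks, and expert roles are illustrative assumptions rather than MAU-GPT's design.

```python
# Hedged sketch: a gated mixture of LoRA experts on top of a frozen linear layer.
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    def __init__(self, dim: int, rank: int = 8, num_experts: int = 2):
        super().__init__()
        self.base = nn.Linear(dim, dim)  # stands in for a frozen pretrained weight
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.down = nn.ModuleList([nn.Linear(dim, rank, bias=False) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(rank, dim, bias=False) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # routes each token across experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                 # (..., E)
        delta = torch.stack([u(d(x)) for d, u in zip(self.down, self.up)], dim=-1)
        delta = (delta * weights.unsqueeze(-2)).sum(dim=-1)           # gated sum of experts
        return self.base(x) + delta

layer = MoELoRALinear(dim=64)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```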

IROS Conference 2025 Conference Paper

CLGA: A Collaborative LLM Framework for Dynamic Goal Assignment in Multi-Robot Systems

  • Xin Yu 0009
  • Haoyuan Li
  • Yandong Wang 0002
  • Simin Li
  • Rongye Shi
  • Gangzheng Ai
  • Zhiqiang Pu
  • Wenjun Wu 0001

Goal assignment is a critical challenge in multi-robot systems. The emergence of large language models (LLMs) has enabled the use of natural language commands for tackling goal assignment problems. However, applying LLMs directly to these tasks presents two limitations: 1) limited accuracy and 2) excessive decision delays due to their autoregressive nature, hindering adaptability to unexpected changes. To address these issues, inspired by dual-process theory, we propose a framework called Collaborative LLMs for dynamic Goal Assignment (CLGA). Specifically, we leverage LLMs for pre-planning tasks and invoke an external solver to generate an initial goal assignment solution, ensuring solution accuracy. During execution, small-scale models enable real-time adjustments to respond to dynamic environmental changes. This approach integrates the strengths of slow, precise pre-planning and fast, adaptive online adjustments, allowing agents to efficiently handle real-world challenges. Additionally, we introduce a benchmark dataset for NLP-based goal assignment to advance research in this domain. Simulation and real-world experiments demonstrate that CLGA significantly enhances task execution efficiency and flexibility in multi-robot systems. The prompt, experimental videos, and datasets associated with this work are available at https://sites.google.com/view/project-clga/.
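
The abstract says CLGA invokes an external solver for the initial goal assignment but does not name it. As a hedged illustration of that hand-off, the sketch below solves a toy robot-to-goal assignment with the Hungarian algorithm (SciPy's linear_sum_assignment), using Euclidean distance as an assumed cost.

```python
# Hedged sketch: an external assignment solver the pre-planning stage might call.
import numpy as np
from scipy.optimize import linear_sum_assignment

robots = np.array([[0.0, 0.0], [5.0, 1.0], [2.0, 4.0]])   # robot positions
goals = np.array([[4.0, 0.0], [0.0, 3.0], [5.0, 5.0]])    # goal positions

# Cost matrix: Euclidean distance from each robot to each goal.
cost = np.linalg.norm(robots[:, None, :] - goals[None, :, :], axis=-1)

robot_idx, goal_idx = linear_sum_assignment(cost)
for r, g in zip(robot_idx, goal_idx):
    print(f"robot {r} -> goal {g} (distance {cost[r, g]:.2f})")
```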

IJCAI Conference 2025 Conference Paper

CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection

  • Binjia Zhou
  • Hengrui Lou
  • Lizhe Chen
  • Haoyuan Li
  • Dawei Luo
  • Shuai Chen
  • Jie Lei
  • Zunlei Feng

With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake detection. Existing techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of hallucinations. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model's discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias reduction. Experimental results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.

AAAI Conference 2025 Conference Paper

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

  • Wenyi Xiao
  • Ziwei Huang
  • Leilei Gan
  • Wanggui He
  • Haoyuan Li
  • Zhelun Yu
  • Fangxun Shu
  • Hao Jiang

The rapidly developing Large Vision Language Models (LVLMs) still face the hallucination phenomenon, where generated responses do not align with the given contexts, significantly restricting their usage. Most previous work detects and mitigates hallucination at a coarse-grained level or requires expensive annotation (e.g., labeling by human experts or proprietary models). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is to generate a small sentence-level hallucination annotation dataset with proprietary models and use it to train a detection model that performs sentence-level hallucination detection. We then propose a detect-then-rewrite pipeline to automatically construct a preference dataset for hallucination mitigation training. Furthermore, we propose differentiating the severity of hallucinations and introduce Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO), which prioritizes the mitigation of critical hallucinations in LVLMs by incorporating hallucination severity into preference learning. Extensive experiments on hallucination detection and mitigation benchmarks demonstrate that our method sets a new state of the art in hallucination detection on MHaluBench, surpassing GPT-4V and Gemini, and reduces the hallucination rate by 36.1% on AMBER and 76.3% on Object HalBench compared to the base model.
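
The abstract says HSA-DPO folds hallucination severity into preference learning but does not give the objective. As a hedged sketch, the code below starts from the standard DPO loss (negative log-sigmoid of a scaled preference margin) and scales the margin by a per-pair severity weight; that weighting scheme is an assumption for illustration, not the paper's formulation.

```python
# Hedged sketch: a severity-weighted variant of the DPO objective.
import torch
import torch.nn.functional as F

def hsa_dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                 severity, beta=0.1):
    """Inputs are per-pair summed token log-probabilities; severity lies in [0, 1]."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Assumed weighting: pairs carrying more severe hallucinations contribute
    # a larger effective preference margin, so they dominate the update.
    return -F.logsigmoid(beta * (1.0 + severity) * margin).mean()

loss = hsa_dpo_loss(
    logp_chosen=torch.tensor([-42.0, -50.0]),
    logp_rejected=torch.tensor([-45.0, -48.0]),
    ref_logp_chosen=torch.tensor([-43.0, -49.0]),
    ref_logp_rejected=torch.tensor([-44.0, -49.0]),
    severity=torch.tensor([0.9, 0.2]),
)
print(loss)
```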

AAAI Conference 2025 Conference Paper

MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

  • Wanggui He
  • Siming Fu
  • Mushui Liu
  • Xierui Wang
  • Wenyi Xiao
  • Fangxun Shu
  • Yi Wang
  • Lei Zhang

Auto-regressive models have made significant progress in the realm of text-to-image synthesis, yet devising an appropriate model architecture and training strategy to achieve a satisfactory level remains an important avenue of exploration. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information—freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications.
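
SemVIE is described as freezing the textual component of the pretrained LLM while fine-tuning the visual component. The sketch below only illustrates that freeze/tune split in PyTorch; the module names ("text_expert", "visual_expert") are hypothetical stand-ins, not MARS's actual layout.

```python
# Hedged sketch: freeze the textual pathway, train only the visual pathway.
import torch.nn as nn

model = nn.ModuleDict({
    "text_expert": nn.Linear(512, 512),    # stand-in for the frozen language branch
    "visual_expert": nn.Linear(512, 512),  # stand-in for the tuned visual branch
})

# Freeze the textual pathway to preserve the LLM's language capabilities...
for p in model["text_expert"].parameters():
    p.requires_grad_(False)

# ...so only the visual pathway receives gradient updates.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['visual_expert.weight', 'visual_expert.bias']
```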

ICLR Conference 2025 Conference Paper

UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting

  • Haoyuan Li
  • Yanpeng Zhou
  • Tao Tang
  • Jifei Song
  • Yihan Zeng
  • Michael Kampffmeyer
  • Hang Xu 0004
  • Xiaodan Liang

Recent advancements in multi-modal 3D pre-training methods have shown promising efficacy in learning joint representations of text, images, and point clouds. However, adopting point clouds as the 3D representation fails to fully capture the intricacies of the 3D world and exhibits a noticeable gap between the discrete points and the dense 2D pixels of images. To tackle this issue, we propose UniGS, integrating 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We first rely on the 3DGS representation to model the 3D world as a collection of 3D Gaussians with color and opacity, incorporating all the information of the 3D scene while establishing a strong connection with 2D images. Then, to achieve Language-Image-3D pretraining, UniGS starts with a pretrained vision-language model to establish a shared visual and textual space through extensive real-world image-text pairs. Subsequently, UniGS employs a 3D encoder to align the optimized 3DGS with the Language-Image representations to learn unified multi-modal representations. To facilitate the extraction of global explicit 3D features by the 3D encoder and achieve better cross-modal alignment, we additionally introduce a novel Gaussian-Aware Guidance module that guides the learning of fine-grained representations of the 3D domain. Through extensive experiments across the Objaverse, ABO, MVImgNet and SUN RGBD datasets with zero-shot classification, text-driven retrieval and open-world understanding tasks, we demonstrate the effectiveness of UniGS in learning a more general and stronger aligned multi-modal representation. Specifically, UniGS achieves leading results across different 3D tasks with remarkable improvements over the previous SOTA, Uni3D, including on zero-shot classification (+9.36%), text-driven retrieval (+4.3%) and open-world understanding (+7.92%).
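
The abstract states that a 3D encoder is aligned with a frozen language-image space but does not spell out the objective; the sketch below shows one common way such alignment is trained, a symmetric InfoNCE loss between 3D embeddings and matched image (or text) embeddings. The loss form is an assumption for illustration, not UniGS's confirmed objective.

```python
# Hedged sketch: symmetric contrastive alignment between 3D and image features.
import torch
import torch.nn.functional as F

def alignment_loss(feat_3d, feat_img, temperature=0.07):
    z3d = F.normalize(feat_3d, dim=-1)
    zim = F.normalize(feat_img, dim=-1)
    logits = z3d @ zim.t() / temperature      # (B, B) pairwise similarities
    targets = torch.arange(z3d.shape[0])      # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(alignment_loss(torch.randn(8, 512), torch.randn(8, 512)))
```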

NeurIPS Conference 2022 Conference Paper

Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization

  • Yang Zhao
  • Chen Zhang
  • Haifeng Huang
  • Haoyuan Li
  • Zhou Zhao

Aiming to locate the object that emits a specified sound in complex scenes, the task of sounding object localization bridges two perception-oriented modalities of vision and acoustics, and brings enormous research value to the comprehensive perceptual understanding of machine intelligence. Although massive training data have been collected in this field, few of them contain accurate bounding box annotations, hindering the learning process and further application of proposed models. In order to address this problem, we explore an effective multi-modal knowledge transfer strategy to obtain precise knowledge from other similar tasks and transfer it through well-aligned multi-modal data to deal with this task in a zero-resource manner. Concretely, we design and propose a novel Two-stream Universal Referring localization Network (TURN), which is composed of a localization stream and an alignment stream that carry out different functions. The former is utilized to extract the knowledge related to referring object localization from the image grounding task, while the latter is devised to learn a universal semantic space shared between text and audio. Moreover, we further develop an adaptive sampling strategy to automatically identify the overlap between different data domains, thus boosting the performance and stability of our model. Extensive experiments on various publicly available benchmarks demonstrate that TURN can achieve competitive performance compared with state-of-the-art approaches without using any data in this field, which verifies the feasibility of our proposed mechanisms and strategies.

AAAI Conference 2020 Conference Paper

Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding

  • Zhecheng Wang
  • Haoyuan Li
  • Ram Rajagopal

Understanding intrinsic patterns and predicting spatiotemporal characteristics of cities require a comprehensive representation of urban neighborhoods. Existing works relied on either inter- or intra-region connectivities to generate neighborhood representations but failed to fully utilize the informative yet heterogeneous data within neighborhoods. In this work, we propose Urban2Vec, an unsupervised multimodal framework that incorporates both street view imagery and point-of-interest (POI) data to learn neighborhood embeddings. Specifically, we use a convolutional neural network to extract visual features from street view images while preserving geospatial similarity. Furthermore, we model each POI as a bag-of-words containing its category, rating, and review information. Analogous to document embedding in natural language processing, we establish the semantic similarity between a neighborhood (“document”) and the words from its surrounding POIs in the vector space. By jointly encoding visual, textual, and geospatial information into the neighborhood representation, Urban2Vec achieves performance better than baseline models and comparable to fully supervised methods in downstream prediction tasks. Extensive experiments on three U.S. metropolitan areas also demonstrate the model's interpretability, generalization capability, and value in neighborhood similarity analysis.
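
The abstract draws an analogy to document embedding, with each neighborhood as a "document" and POI attributes as its words. The sketch below is a hedged toy version of that idea: a neighborhood vector is pulled toward embeddings of its POI words and pushed away from randomly sampled negatives. The negative-sampling objective is an assumption; Urban2Vec's exact formulation is not given in the abstract.

```python
# Hedged sketch: neighborhood-as-document embedding with POI word negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_neighborhoods, vocab_size, dim = 100, 500, 64
hood_emb = nn.Embedding(num_neighborhoods, dim)   # one vector per neighborhood
word_emb = nn.Embedding(vocab_size, dim)          # POI category/rating/review words

def poi_word_loss(hood_id, pos_words, num_neg=5):
    h = hood_emb(hood_id)                                        # (d,)
    pos = word_emb(pos_words)                                    # (P, d)
    neg = word_emb(torch.randint(0, vocab_size, (num_neg,)))     # (N, d)
    pos_score = F.logsigmoid(pos @ h).mean()    # attract words of nearby POIs
    neg_score = F.logsigmoid(-(neg @ h)).mean() # repel randomly sampled words
    return -(pos_score + neg_score)

loss = poi_word_loss(torch.tensor(3), torch.tensor([10, 42, 7]))
print(loss)
```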

ICRA Conference 2016 Conference Paper

Control and experimental validation of robot-assisted automatic measurement system for Multi-Stud Tensioning Machine (MSTM)

  • Meng Li
  • Xingguang Duan
  • Haoyuan Li
  • Tengfei Cui
  • Liang Gao
  • Yue Zhan
  • Yan Xu

The Multi-Stud Tensioning Machine (MSTM) is specialized equipment used to open/seal the cover of the Reactor Pressure Vessel (RPV) during nuclear power plant maintenance. The tensioning residual values of the 58 studs are monitored for procedure evaluation. It is time-consuming for human operators to place the measurement meters into their working positions. In order to reduce labor intensity and eliminate radiation exposure time, we develop a robot-assisted automatic measurement system to achieve meter placement and real-time data monitoring. A Field Programmable Gate Array (FPGA)-based distributed control scheme realizes high-speed data acquisition and coordinated control of the 58 node robots. The control software performs data analysis and sends emergency stop signals to the MSTM control PLC. The proposed system is validated at the China Nuclear Power Technology Research Institute. Total operation time decreases from over 580 s to less than 120 s.