Arrow Research search

Author name cluster

Hong Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
2 author rows

Possible papers (32)

JBHI Journal 2026 Journal Article

PMSFINet: Progressive Multi-Scale Feature Interaction Network for Medical Image Segmentation

  • Yali Peng
  • Hong Li
  • Meiyun Wang
  • Le Qin
  • Yingkui Du
  • Yugen Yi

Recently, the Swin Transformer has demonstrated strong performance in dense prediction tasks such as image segmentation by employing a window-based multi-head self-attention mechanism, which effectively reduces computational complexity. However, it still encounters limitations in multi-scale feature fusion and boundary preservation, leading to suboptimal segmentation of complex or ambiguous structures commonly found in medical images. To address these challenges, we propose PMSFINet, a novel medical image segmentation network designed to enhance representation learning through progressive multi-scale feature interaction. The overall framework comprises three key components: (1) a Progressive Multi-Scale Feature Interactive (PMSFI) module that builds Dual-Scale Window Interactive Attention (DSWIA) blocks to enable efficient computation and cross-scale information exchange; (2) a Multi-Scale Super-Resolution Decoder (MSRD) that integrates super-resolution and spatial attention with a Local Similarity-Aware Sampler (LSAS) to refine structural details and enhance boundary clarity; and (3) a Cross-Attention Fusion (CAF) module that employs hybrid attention to dynamically fuse dual-branch features, improving feature complementarity and collaborative representation. Extensive experiments on the Synapse, ACDC, and ISIC2018 datasets yield Dice scores of 84.94%, 92.43%, and 90.79%, respectively, demonstrating the strong generalization and robustness of PMSFINet across diverse medical imaging tasks. Ablation studies further verify the individual effectiveness of each proposed component.
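
To make the dual-scale window idea concrete, here is a minimal PyTorch sketch of windowed self-attention run at two window sizes and fused across scales. The module name, window sizes, and the 1×1-convolution fusion are illustrative assumptions, not the authors' DSWIA implementation.

```python
# Illustrative sketch only: windowed self-attention at two scales, in the
# spirit of the DSWIA block described above. All specifics are assumptions.
import torch
import torch.nn as nn

def window_attention(x, ws, attn):
    # x: (B, C, H, W); partition into non-overlapping ws x ws windows,
    # attend within each window, then stitch the map back together.
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // ws, ws, W // ws, ws)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)  # (B*nWin, tokens, C)
    x, _ = attn(x, x, x)
    x = x.reshape(B, H // ws, W // ws, ws, ws, C)
    return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

class DualScaleWindowAttention(nn.Module):
    def __init__(self, dim, heads=4, small=4, large=8):
        super().__init__()
        self.small, self.large = small, large
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)  # cross-scale exchange

    def forward(self, x):
        fine = window_attention(x, self.small, self.attn_s)    # local detail
        coarse = window_attention(x, self.large, self.attn_l)  # wider context
        return self.fuse(torch.cat([fine, coarse], dim=1))

feat = torch.randn(1, 32, 64, 64)
print(DualScaleWindowAttention(32)(feat).shape)  # torch.Size([1, 32, 64, 64])
```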

ECAI Conference 2025 Conference Paper

SPOFormer: Enhancing Prior Learning for Multi-View 3D Occupancy Perception via Semantic-Aware Attention

  • Ruihang Li
  • Huangnan Zheng
  • Zhe Yin
  • Kaikai Xiao
  • Hong Li
  • Zhijie Pan

In recent years, vision-based 3D occupancy prediction has attracted significant interest in autonomous driving due to its detailed and comprehensive representation of the surrounding environment. Current research typically confines experiments to a single data domain, causing the unified feature map construction module to overfit severely to that domain’s visual parameters and limiting the models’ effectiveness when autonomous vehicles operate under diverse conditions (scenes and sensor suites). To address this, our paper introduces SPOFormer, a new pipeline developed through optimization of training strategies and model architectures. Our approach features a Semantic Attention module that employs a double-tiered supervision strategy. This module utilizes the attention mechanism’s query function to reconstruct semantic prediction maps, thus integrating semantic information into the 2D features. Additionally, we propose the Semantic-aware Multi-view Feature Fusion module, which processes regions of interest in 3D space using pre-trained depth and segmentation maps, allowing the network to operate independently of specific sensor configurations. Experiments conducted on the Occ3D-nuScenes and Occ3D-Waymo benchmarks demonstrate that SPOFormer not only achieves state-of-the-art perception performance but also attains up to 90% of the performance level of full training by fine-tuning the model with just 5% of target domain data. This efficiency is crucial for practical autonomous driving applications.
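
The semantic-attention mechanism can be pictured with a small PyTorch sketch: learnable per-class queries cross-attend to the 2D features, their attention weights double as coarse semantic maps that can receive auxiliary supervision, and the attended context is folded back into the features. The class count, dimensions, and injection step are assumptions for illustration, not the paper's design.

```python
# A minimal sketch of query-driven semantic attention, under the assumptions
# stated above. The semantic maps would be supervised against GT segmentation.
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    def __init__(self, dim=64, num_classes=10):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.inject = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (B, H*W, dim) flattened 2D features from the image backbone
        B = feats.size(0)
        q = self.class_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        ctx, weights = self.attn(q, feats, feats)  # weights: (B, K, H*W)
        sem_map = weights                          # coarse per-class maps
        # Fold class-level context back into the per-pixel features.
        enriched = feats + weights.transpose(1, 2) @ self.inject(ctx)
        return enriched, sem_map

feats = torch.randn(2, 32 * 32, 64)
enriched, sem_map = SemanticAttention()(feats)
print(enriched.shape, sem_map.shape)  # (2, 1024, 64) (2, 10, 1024)
```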

ICLR Conference 2025 Conference Paper

The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

  • Hong Li
  • Nanxi Li
  • Yuanjie Chen
  • Jianbin Zhu
  • Qinlu Guo
  • Cewu Lu
  • Yonglu Li 0001

Multi-modal Large Language Models (MLLMs) have exhibited impressive capability. However, many deficiencies of MLLMs compared to human intelligence have recently been found, e.g., hallucination. To drive the study of MLLMs, the community has dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: association, a human's basic capability to link observation and prior practice memory. To comprehensively investigate MLLM performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient annotation-free construction method that transforms the general dataset for our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous associations. Moreover, we conduct a comprehensive investigation into the MLLMs' zero-shot association capabilities, addressing multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability in our association tasks, and even the state-of-the-art GPT-4V(ision) has a significant gap compared to humans. We believe our benchmark will pave the way for future MLLM studies. Our data and code are available at: https://mvig-rhos.com/llm_inception.
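
The annotation-free construction can be illustrated with a short Python sketch that reuses concept tags from a general dataset to form single-step association pairs and greedy multi-step chains. The toy tags and the chaining rule are invented for the example and do not reproduce the paper's pipeline.

```python
# Toy sketch of association-task construction from existing concept tags.
from collections import defaultdict

# (item_id, concept_tags): tags reused from an attribute-labeled dataset
items = [
    ("img_001", {"red", "round"}),
    ("img_002", {"red", "soft"}),
    ("img_003", {"soft", "running"}),
    ("img_004", {"running", "round"}),
]
tags = dict(items)

by_concept = defaultdict(list)
for item_id, item_tags in items:
    for tag in item_tags:
        by_concept[tag].append(item_id)

def single_step_pairs():
    # Two items are associable in one step if they share a concept tag.
    for concept, members in by_concept.items():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                yield (a, b, concept)

def association_chain(start, steps):
    # Greedy multi-step association: hop to any unseen item sharing a concept.
    chain, seen = [start], {start}
    for _ in range(steps):
        cur = tags[chain[-1]]
        nxt = next((i for i, t in items if i not in seen and t & cur), None)
        if nxt is None:
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain

print(sum(1 for _ in single_step_pairs()))    # 4 shared-concept pairs
print(association_chain("img_001", steps=3))  # img_001 -> img_002 -> img_003 -> img_004
```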

NeurIPS Conference 2025 Conference Paper

UniTransfer: Video Concept Transfer via Progressive Spatio-Temporal Decomposition

  • Guojun Lei
  • Rong Zhang
  • Tianhang Liu
  • Hong Li
  • Zhiyuan Ma
  • Chi Wang
  • Weiwei Xu

Recent advancements in video generation models have enabled the creation of diverse and realistic videos, with promising applications in advertising and film production. However, as one of the essential tasks of video generation models, video concept transfer remains significantly challenging. Existing methods generally model the video as a whole, leading to limited flexibility and precision when editing only specific regions or concepts. To mitigate this dilemma, we propose UniTransfer, a novel architecture that introduces both spatial and diffusion-timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability.
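
A schematic Python sketch of the Chain-of-Prompt timestep decomposition: the denoising schedule is split into three stages of decreasing granularity, each guided by its own stage-specific instruction. The stage boundaries, prompt texts, and the placeholder denoiser below are assumptions, not the paper's DiT-based implementation.

```python
# Schematic only: stage-wise prompt switching over a toy denoising loop.
import torch

T = 30  # total denoising steps (assumed)
stages = [
    (range(0, 10),  "Stage 1: lay out the background and global motion"),
    (range(10, 20), "Stage 2: transfer the foreground subject's appearance"),
    (range(20, T),  "Stage 3: refine boundaries and fine texture"),
]

def denoise_step(x, t, prompt):
    # Placeholder for one prompt-conditioned denoising step of a DiT-style model.
    return x - 0.01 * torch.randn_like(x)

x = torch.randn(1, 3, 16, 64, 64)  # toy video latent: (batch, channels, frames, H, W)
for steps, prompt in stages:
    for t in steps:
        x = denoise_step(x, t, prompt)
print(x.shape)  # torch.Size([1, 3, 16, 64, 64])
```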

AAAI Conference 2024 Conference Paper

Hierarchical Aligned Multimodal Learning for NER on Tweet Posts

  • Peipei Liu
  • Hong Li
  • Yimo Ren
  • Jie Liu
  • Shuaizong Si
  • Hongsong Zhu
  • Limin Sun

Mining structured knowledge from tweets using named entity recognition (NER) can be beneficial for many downstream applications such as recommendation and intention understanding. With tweet posts tending to be multimodal, multimodal named entity recognition (MNER) has attracted more attention. In this paper, we propose a novel approach that can dynamically align the image and text sequence and achieve multi-level cross-modal learning to augment textual word representation for MNER improvement. To be specific, our framework can be split into three main stages: the first focuses on intra-modality representation learning to derive the implicit global and local knowledge of each modality; the second evaluates the relevance between the text and its accompanying image and integrates different-grained visual information based on that relevance; the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
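
The iterative cross-modal interaction stage can be sketched as a small co-attention loop in PyTorch, where words attend to image regions and regions attend back to words over a few rounds. The dimensions, round count, and gating are illustrative assumptions rather than the paper's exact design.

```python
# Minimal co-attention sketch under the assumptions stated above.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=128, heads=4, rounds=2):
        super().__init__()
        self.rounds = rounds
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text, image):
        # text: (B, L, dim) token features; image: (B, R, dim) region features
        for _ in range(self.rounds):
            t_ctx, _ = self.t2v(text, image, image)  # words attend to regions
            v_ctx, _ = self.v2t(image, text, text)   # regions attend to words
            text = self.gate(torch.cat([text, t_ctx], dim=-1))  # refine word reps
            image = image + v_ctx
        return text  # visually augmented word representations for the NER head

text = torch.randn(2, 20, 128)   # 20 tokens
image = torch.randn(2, 49, 128)  # 7x7 region grid
print(CoAttention()(text, image).shape)  # torch.Size([2, 20, 128])
```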

IROS Conference 2024 Conference Paper

SmartKit: User-Friendly Robot with Multiple Operating Systems

  • Guanyu Chen
  • Yiqun Zhou
  • Guoqing Yang
  • Hong Li
  • Pan Lv

Mobile robots have become extensively involved in human activities, taking on arduous tasks and providing significant assistance. Robot capabilities have been continuously enhanced, from simple chassis control to path planning and SLAM. Mixed criticality systems enable mobile robots to handle tasks of varying criticality by integrating multiple operating systems, allowing them to accomplish a wide range of tasks. However, beyond improving computing performance, robots are ultimately designed to serve humans, so reliability, usability, and affordability are all critical factors in robot design. We introduce SmartKit, a mixed criticality system (MCS) for mobile robots. Leveraging the hardware-utilization efficiency brought by virtualization, SmartKit can execute tasks of different criticality efficiently and securely. This paper presents the software and hardware architecture of SmartKit and provides performance and functionality validation of the robot system.
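
As a toy illustration of mixed-criticality dispatch, the Python sketch below routes tasks tagged with a criticality level to different OS domains sharing one machine. The domain names and routing rule are invented for the example; SmartKit's actual virtualization-based design is not reproduced here.

```python
# Toy criticality-based task routing across OS domains (assumed names/rules).
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    criticality: int  # higher = more safety-critical

DOMAINS = {
    "rtos": lambda t: t.criticality >= 2,   # hard real-time: motor control, safety stops
    "linux": lambda t: t.criticality < 2,   # best-effort: SLAM, planning, UI
}

def dispatch(tasks):
    placement = {name: [] for name in DOMAINS}
    for task in sorted(tasks, key=lambda t: -t.criticality):
        for domain, accepts in DOMAINS.items():
            if accepts(task):
                placement[domain].append(task.name)
                break
    return placement

tasks = [Task("emergency_stop", 3), Task("wheel_control", 2),
         Task("slam", 1), Task("web_ui", 0)]
print(dispatch(tasks))
# {'rtos': ['emergency_stop', 'wheel_control'], 'linux': ['slam', 'web_ui']}
```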

NeurIPS Conference 2022 Conference Paper

FNeVR: Neural Volume Rendering for Face Animation

  • Bohan Zeng
  • Boyu Liu
  • Hong Li
  • Xuhui Liu
  • Jianzhuang Liu
  • Dapeng Chen
  • Wei Peng
  • Baochang Zhang

Face animation, one of the hottest topics in computer vision, has achieved promising performance with the help of generative models. However, it remains a critical challenge to generate identity-preserving and photo-realistic images due to sophisticated motion deformation and complex facial detail modeling. To address these problems, we propose a Face Neural Volume Rendering (FNeVR) network to fully explore the potential of 2D motion warping and 3D volume rendering in a unified framework. In FNeVR, we design a 3D Face Volume Rendering (FVR) module to enhance the facial details for image rendering. Specifically, we first extract 3D information with a well-designed architecture, and then introduce an orthogonal adaptive ray-sampling module for efficient rendering. We also design a lightweight pose editor, enabling FNeVR to edit the facial pose in a simple yet effective way. Extensive experiments show that our FNeVR obtains the best overall quality and performance on widely used talking-head benchmarks.
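
The volume-rendering step underlying an FVR-style module can be sketched with standard alpha compositing along rays in PyTorch: densities and colors sampled along each ray are accumulated into pixel colors. The inputs here are synthetic, and the paper's orthogonal adaptive ray-sampling module is not reproduced.

```python
# Standard volume-rendering compositing, as a generic sketch of the rendering step.
import torch

def composite(density, color, deltas):
    # density: (R, S), color: (R, S, 3), deltas: (R, S) distances between samples
    alpha = 1.0 - torch.exp(-density * deltas)  # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]                          # transmittance up to each sample
    weights = alpha * trans
    return (weights.unsqueeze(-1) * color).sum(dim=1)  # (R, 3) rendered pixels

rays, samples = 4096, 32
density = torch.rand(rays, samples)
color = torch.rand(rays, samples, 3)
deltas = torch.full((rays, samples), 0.05)
print(composite(density, color, deltas).shape)  # torch.Size([4096, 3])
```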