Arrow Research search

Author name cluster

Yaoqi Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers

10

AAAI Conference 2026 · Conference Paper

2D-CrossScan Mamba: Enhancing State Space Models with Spatially Consistent Multi-Path 2D Information Propagation

  • Longlong Yu
  • Wenxi Li
  • Yaoqi Sun
  • Hang Xu
  • Chenggang Yan
  • Yuchen Guo

Despite recent progress in adapting State Space Models such as Mamba to vision tasks, their intrinsic 1D scanning mechanism imposes limitations when applied to inherently 2D-structured data like images. Existing adaptations, including VMamba and 2DMamba, either suffer from inconsistency between scanning order and spatial locality or restrict inter-patch communication to a single path, hindering effective information propagation. In this paper, we propose 2D-CrossScan, a novel 2D-compatible scan framework that enables spatially consistent, multi-path hidden state propagation by integrating modified state equations over two-dimensional neighborhoods. Furthermore, we mitigate redundant information accumulation due to overlapping paths via cross-directional subtraction. To fully align with the 2D spatial structure, we introduce a multi-directional scanning strategy that starts simultaneously from all four corners of the image, enabling diverse propagation paths and better feature integration. Our approach maintains efficiency, requiring only minimal architectural changes to existing Mamba variants. Experimental results demonstrate substantial improvements on multiple visual tasks, including object detection and semantic segmentation on the PANDA and COCO datasets. Compared to baseline SSM-based methods, 2D-CrossScan consistently yields better spatial representations, as confirmed by extensive effective receptive field visualizations and attention analyses. These results highlight the importance of geometry-aware state propagation and validate 2D-CrossScan as a simple yet powerful extension to SSMs for vision.
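
As a rough illustration of the scanning idea the abstract describes, the NumPy sketch below runs a toy 2D recurrence from all four image corners and subtracts the shared contribution once. The scalar recurrence, the flip-based corner scans, and the overlap term are hypothetical simplifications, not the paper's actual modified state equations or cross-directional subtraction.

```python
# Toy four-corner 2D scan with a crude cross-directional overlap
# correction. All names and the recurrence are illustrative assumptions.
import numpy as np

def corner_scan(x, a=0.5, b=0.5):
    """2D linear recurrence from the top-left corner:
    h[i, j] = x[i, j] + a * h[i-1, j] + b * h[i, j-1]."""
    H, W = x.shape
    h = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            up = h[i - 1, j] if i > 0 else 0.0
            left = h[i, j - 1] if j > 0 else 0.0
            h[i, j] = x[i, j] + a * up + b * left
    return h

def cross_scan_2d(x):
    """Run the recurrence from all four corners by flipping the input,
    then sum. Each scan adds the local term x[i, j] itself, so the sum
    counts it four times; subtracting 3x keeps it exactly once."""
    flips = [(), (0,), (1,), (0, 1)]          # which axes to flip per corner
    outs = []
    for ax in flips:
        xf = np.flip(x, axis=ax) if ax else x
        hf = corner_scan(xf)
        outs.append(np.flip(hf, axis=ax) if ax else hf)
    return sum(outs) - 3.0 * x

x = np.random.rand(8, 8).astype(np.float32)
print(cross_scan_2d(x).shape)                 # (8, 8)
```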

AAAI Conference 2026 · Conference Paper

Temporal Calibrating and Distilling for Scene-Text Aware Text-Video Retrieval

  • Zhiqian Zhao
  • Liang Li
  • Lei Shen
  • Xichun Sheng
  • Yaoqi Sun
  • Fang Kang
  • Chenggang Yan

Existing text-video retrieval methods mainly focus on single-modal video content (i.e., visual entities), often overlooking heterogeneous scene text that is ubiquitous in human environments. Although scene text in videos provides fine-grained semantics for cross-modal retrieval, effectively utilizing it presents two key challenges: (1) temporally dense scene text disrupts synchronization with sparse video frames, obstructing video understanding; (2) redundant scene text and irrelevant video frames hinder the learning of discriminative temporal clues for retrieval. To address them, we propose a temporal scene-text calibrating and distilling (TCD) network for text-video retrieval. Specifically, we first design a window-OCR captioner that aggregates dense scene text into OCR captions to facilitate feature interaction. Next, we devise a heterogeneous semantics calibration module that leverages scene text as a self-supervised signal to temporally align window-level OCR captions and frame-level video features. Further, we introduce a context-guided temporal clue distillation module to learn the complementary and relevant details between the scene text and video modalities, thereby obtaining discriminative temporal clues for retrieval. Extensive experiments show that our TCD achieves state-of-the-art performance on three scene-text-related benchmarks.
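
The window-OCR captioner is only described at a high level; the Python sketch below illustrates the kind of aggregation it implies, pooling densely timestamped scene-text strings into one caption per sampled frame. The function names, windowing rule, and data layout are all assumptions, not the paper's implementation.

```python
# Hypothetical window-level OCR aggregation: dense OCR hits are bucketed
# to the nearest sampled frame and joined into a per-window caption.
from collections import defaultdict

def window_ocr_captions(ocr_tokens, frame_times, window=1.0):
    """ocr_tokens: list of (timestamp_sec, text); frame_times: sorted
    timestamps of sampled frames. Returns one caption per frame."""
    buckets = defaultdict(list)
    for ts, text in ocr_tokens:
        # Assign each OCR hit to the nearest sampled frame within the window.
        nearest = min(frame_times, key=lambda f: abs(f - ts))
        if abs(nearest - ts) <= window / 2:
            buckets[nearest].append(text)
    # Deduplicate while preserving order, then join into a caption.
    captions = {}
    for f in frame_times:
        seen, words = set(), []
        for t in buckets.get(f, []):
            if t not in seen:
                seen.add(t)
                words.append(t)
        captions[f] = " ".join(words)
    return captions

ocr = [(0.2, "SALE"), (0.3, "SALE"), (1.1, "50% OFF"), (2.6, "EXIT")]
print(window_ocr_captions(ocr, frame_times=[0.0, 1.0, 2.0, 3.0]))
```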

AAAI Conference 2025 · Conference Paper

Heterogeneous Prompt-Guided Entity Inferring and Distilling for Scene-Text Aware Cross-Modal Retrieval

  • Zhiqian Zhao
  • Liang Li
  • Jiehua Zhang
  • Yaoqi Sun
  • Xichun Sheng
  • Haibing Yin
  • Shaowei Jiang

In cross-modal retrieval, comprehensive image understanding is vital, and the scene text in images can provide fine-grained information for understanding visual semantics. Current methods fail to make full use of scene text: they suffer from the semantic ambiguity of independent scene text and overlook the heterogeneous concepts in image-caption pairs. In this paper, we propose a heterogeneous prompt-guided entity inferring and distilling (HOPID) network to explore the natural connection between scene text in images and captions and to learn a property-centric scene text representation. Specifically, we propose to align scene text in images and captions via a heterogeneous prompt, which consists of a visual prompt and a text prompt. For the text prompt, we introduce a discriminative entity inferring module to reason about key scene-text words from captions, while the visual prompt highlights the corresponding scene text in images. Furthermore, to secure a robust scene text representation, we design a perceptive entity distilling module that distills the beneficial information of scene text at a fine-grained level. Extensive experiments show that the proposed method significantly outperforms existing approaches on two public cross-modal retrieval benchmarks.
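
The heterogeneous prompt is described only conceptually; as one possible reading, the sketch below pairs a visual prompt that up-weights detected scene-text regions with a text prompt that keeps caption tokens matching detected scene text. Shapes, names, and the literal-matching rule are illustrative assumptions, not the paper's modules.

```python
# Illustrative visual/text prompt pair for scene-text aware retrieval.
# Both functions are hypothetical stand-ins for the paper's modules.
import torch

def visual_prompt(image, ocr_boxes, boost=2.0):
    """image: (C, H, W); ocr_boxes: list of (x0, y0, x1, y1) pixel boxes.
    Returns the image with detected scene-text regions up-weighted."""
    mask = torch.ones_like(image[:1])
    for x0, y0, x1, y1 in ocr_boxes:
        mask[:, y0:y1, x0:x1] = boost
    return image * mask

def text_prompt(caption_tokens, ocr_words):
    """Keep caption tokens that literally match detected scene text,
    a crude stand-in for discriminative entity inferring."""
    ocr_set = {w.lower() for w in ocr_words}
    return [t for t in caption_tokens if t.lower() in ocr_set]

img = torch.rand(3, 64, 64)
print(visual_prompt(img, [(10, 10, 30, 20)]).shape)        # (3, 64, 64)
print(text_prompt(["the", "cafe", "sign", "says", "Open"], ["OPEN"]))
```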

AAAI Conference 2025 · Conference Paper

SdalsNet: Self-Distilled Attention Localization and Shift Network for Unsupervised Camouflaged Object Detection

  • Peiyao Shou
  • Yixiu Liu
  • Wei Wang
  • Yaoqi Sun
  • Zhigao Zheng
  • Shangdong Zhu
  • Chenggang Yan

Unsupervised camouflaged object detection (UCOD) poses significant challenges, primarily due to the absence of human labels. Existing UCOD methods that leverage attention mechanisms often struggle to localize camouflaged objects precisely. To overcome this limitation, we introduce a fully unsupervised algorithm for attention-guided camouflaged object localization, shift, and inference, termed the self-distilled attention localization and shift network (SdalsNet). We formulate an attention localization method aimed at accurately identifying the central coordinate of the camouflaged object. Furthermore, we propose four distinct loss functions tailored to refine the precision of attentional positioning. These loss functions constrain the distances between three types of class tokens, enabling attentional shifting across the input sample. Additionally, we design a prediction inference technique to reconstruct the binary output of an attention map, thereby providing a comprehensive understanding of the detected camouflaged objects. Experimental results on four challenging COD benchmark datasets corroborate the effectiveness of the proposed approach, demonstrating notable superiority over state-of-the-art methods.
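
The abstract names four loss functions over three types of class tokens without defining them; the sketch below shows one hypothetical margin loss of that general shape, pulling a shifted token toward a foreground token and away from a background token. The three-token setup, the margin form, and all names are assumptions, not the paper's losses.

```python
# Hypothetical distance constraint between class-token embeddings,
# in the spirit of "constrain the distances between three types of
# class tokens". Not the paper's actual loss.
import torch
import torch.nn.functional as F

def token_distance_loss(fg_tok, bg_tok, shift_tok, margin=1.0):
    """fg/bg/shift: (B, D) class-token embeddings. Pull the shifted token
    toward the foreground token and push it away from the background."""
    d_pos = F.pairwise_distance(shift_tok, fg_tok)
    d_neg = F.pairwise_distance(shift_tok, bg_tok)
    return F.relu(d_pos - d_neg + margin).mean()

B, D = 4, 256
loss = token_distance_loss(torch.randn(B, D), torch.randn(B, D),
                           torch.randn(B, D))
print(loss.item())
```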

NeurIPS Conference 2024 · Conference Paper

Upping the Game: How 2D U-Net Skip Connections Flip 3D Segmentation

  • Xingru Huang
  • Yihao Guo
  • Jian Huang
  • Tianyun Zhang
  • Hong He
  • Shaowei Jiang
  • Yaoqi Sun

In the present study, we introduce an innovative structure for 3D medical image segmentation that effectively integrates 2D U-Net-derived skip connections into the architecture of 3D convolutional neural networks (3D CNNs). Conventional 3D segmentation techniques predominantly depend on isotropic 3D convolutions for the extraction of volumetric features, which frequently engenders inefficiencies due to the varying information density across the three orthogonal axes in medical imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI). This disparity leads to a decline in axial-slice plane feature extraction efficiency, with slice plane features being comparatively underutilized relative to features along the time-axial direction. To address this issue, we introduce the U-shaped Connection (uC), utilizing a simplified 2D U-Net in place of standard skip connections to augment the extraction of axial-slice plane features while concurrently preserving the volumetric context afforded by 3D convolutions. Based on uC, we further present uC 3DU-Net, an enhanced 3D U-Net backbone that integrates the uC approach to facilitate optimal axial-slice plane feature utilization. Through rigorous experimental validation on five publicly accessible datasets (FLARE2021, OIMHS, FeTA2021, AbdomenCT-1K, and BTCV), the proposed method surpasses contemporary state-of-the-art models. Notably, this performance is achieved while reducing the number of parameters and computational complexity. This investigation underscores the efficacy of incorporating 2D convolutions within the framework of 3D CNNs to overcome the intrinsic limitations of volumetric segmentation, thereby potentially expanding the frontiers of medical image analysis. Our implementation is available at https://github.com/IMOP-lab/U-Shaped-Connection.
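
The uC idea is concrete enough to sketch: route a 3D skip feature through a small 2D network applied slice-by-slice along the axial dimension. In the minimal PyTorch sketch below, a two-layer 2D conv block stands in for the paper's simplified 2D U-Net, and the residual add and tensor layout are assumptions.

```python
# Minimal sketch of a U-shaped Connection: 2D convs over axial slices
# replace a plain 3D skip connection. The slice_net is a stand-in for
# the paper's simplified 2D U-Net.
import torch
import torch.nn as nn

class UShapedConnection(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.slice_net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):                  # x: (B, C, D, H, W)
        B, C, D, H, W = x.shape
        # Fold the depth axis into the batch so 2D convs see (H, W) slices.
        slices = x.permute(0, 2, 1, 3, 4).reshape(B * D, C, H, W)
        out = self.slice_net(slices)
        return out.reshape(B, D, C, H, W).permute(0, 2, 1, 3, 4) + x

feat = torch.rand(2, 16, 8, 32, 32)        # encoder feature to be skipped
print(UShapedConnection(16)(feat).shape)   # torch.Size([2, 16, 8, 32, 32])
```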

IJCAI Conference 2020 · Conference Paper

Real-World Automatic Makeup via Identity Preservation Makeup Net

  • Zhikun Huang
  • Zhedong Zheng
  • Chenggang Yan
  • Hongtao Xie
  • Yaoqi Sun
  • Jianzhong Wang
  • Jiyong Zhang

This paper focuses on the real-world automatic makeup problem. Given one non-makeup target image and one reference image, automatic makeup is to generate one face image that maintains the original identity while carrying the makeup style of the reference image. In real-world scenarios, the face makeup task demands a system that is robust to environmental variations. The two main challenges in real-world face makeup can be summarized as follows: first, the background in real-world images is complicated, and previous methods are prone to changing the style of the background as well; second, the foreground faces are also easily affected; for instance, "heavy" makeup may lose the discriminative information of the original identity. To address these two challenges, we introduce a new makeup model, called Identity Preservation Makeup Net (IPM-Net), which preserves not only the background but also the critical patterns of the original identity. Specifically, we disentangle face images into two different information codes, i.e., an identity content code and a makeup style code. At inference time, we only need to change the makeup style code to generate various makeup images of the target person. In the experiments, we show that the proposed method not only achieves better realism (FID) and diversity (LPIPS) on the test set, but also works well on real-world images collected from the Internet.
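
The disentanglement interface the abstract describes (encode an identity content code from the target, a makeup style code from the reference, and decode their combination) can be sketched as below. The encoder/decoder internals and the multiplicative fusion are placeholders; only the code-swapping usage follows the abstract.

```python
# Sketch of content/style disentanglement: identity content from the
# target face, makeup style from the reference. Internals are placeholders.
import torch
import torch.nn as nn

class MakeupSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.content_enc = nn.Sequential(nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU())
        self.style_enc = nn.Sequential(
            nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # style as a global code
        )
        self.dec = nn.ConvTranspose2d(dim, 3, 4, 2, 1)

    def forward(self, target, reference):
        content = self.content_enc(target)      # identity content code
        style = self.style_enc(reference)       # makeup style code
        # Swapping `reference` at inference changes makeup, not identity.
        return self.dec(content * (1 + style))

net = MakeupSketch()
out = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(out.shape)                                # torch.Size([1, 3, 64, 64])
```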