Arrow Research search

Author name cluster

Yuchen Guo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers
2 author rows

Possible papers (34)

AAAI Conference 2026 Conference Paper

2D-CrossScan Mamba: Enhancing State Space Models with Spatially Consistent Multi-Path 2D Information Propagation

  • Longlong Yu
  • Wenxi Li
  • Yaoqi Sun
  • Hang Xu
  • Chenggang Yan
  • Yuchen Guo

Despite recent progress in adapting State Space Models such as Mamba to vision tasks, their intrinsic 1D scanning mechanism imposes limitations when applied to inherently 2D-structured data like images. Existing adaptations, including VMamba and 2DMamba, either suffer from inconsistency between scanning order and spatial locality or restrict inter-patch communication to singular paths, hindering effective information propagation. In this paper, we propose 2D-CrossScan, a novel 2D-compatible scan framework that enables spatially consistent, multi-path hidden state propagation by integrating modified state equations over two-dimensional neighborhoods. Furthermore, we mitigate redundant information accumulation due to overlapping paths via cross-directional subtraction. To fully align with the 2D spatial structure, we introduce a multi-directional scanning strategy that starts simultaneously from all four corners of the image, enabling diverse propagation paths and better feature integration. Our approach maintains efficiency, requiring only minimal architectural changes to existing Mamba variants. Experimental results demonstrate substantial improvements in multiple visual tasks, including object detection and semantic segmentation on PANDA and COCO datasets. Compared to baseline SSM-based methods, 2D-CrossScan consistently yields better spatial representations, as confirmed by extensive effective receptive field visualizations and attention analyses. These results highlight the importance of geometry-aware state propagation and validate 2D-CrossScan as a simple yet powerful extension to SSMs for vision.

AAAI Conference 2026 Conference Paper

GigaMoE: Sparsity-Guided Mixture of Experts for Efficient Gigapixel Object Detection

  • Xiang Li
  • Wenxi Li
  • Yuetong Wang
  • Chenyang Lyu
  • Haozhe Lin
  • Guiguang Ding
  • Yuchen Guo

Object detection in High-Resolution Wide (HRW) shots, or gigapixel images, presents unique challenges due to extreme object sparsity and vast scale variations. State-of-the-art methods like SparseFormer have pioneered sparse processing by selectively focusing on important regions, yet they apply a uniform computational model to all selected regions, overlooking their intrinsic complexity differences. This leads to a suboptimal trade-off between performance and efficiency. In this paper, we introduce GigaMoE, a novel backbone architecture that pioneers adaptive computation for this domain by replacing the standard Feed-Forward Networks (FFNs) with a Mixture-of-Experts (MoE) module. Our architecture first employs a shared expert to provide a robust feature baseline for all selected regions. Upon this foundation, our core innovation, a novel Sparsity-Guided Routing mechanism, repurposes importance scores from the sparse backbone to provide a "computational bonus," dynamically engaging a variable number of specialized experts based on content complexity. The entire system is trained efficiently via a loss-free load-balancing technique, eliminating the need for cumbersome auxiliary losses. Extensive experiments show that GigaMoE sets a new state of the art on the PANDA benchmark, improving detection accuracy by 1.1% over SparseFormer while simultaneously reducing the computational cost (FLOPs) by a remarkable 32.3%.
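The sparsity-guided routing idea in the abstract can be sketched in a few lines: a hypothetical `route_experts` helper maps a region's backbone importance score to a variable expert count (the "computational bonus") and picks that many top-scoring experts. The function name, the linear score-to-count mapping, and the parameter choices are illustrative assumptions, not the paper's implementation.

```python
def route_experts(importance, expert_scores, k_min=1, k_max=4):
    """Sparsity-guided routing sketch: regions with higher backbone
    importance receive a larger 'computational bonus', i.e. more experts.
    `importance` is assumed to lie in [0, 1]; `expert_scores` are the
    router's per-expert logits for this region."""
    # Map importance to a variable expert count (the "bonus").
    k = k_min + round(importance * (k_max - k_min))
    # Engage the k highest-scoring specialized experts.
    ranked = sorted(range(len(expert_scores)),
                    key=lambda i: expert_scores[i], reverse=True)
    return sorted(ranked[:k])

# A simple background patch gets one expert; a dense crowd region gets four.
print(route_experts(0.0, [0.2, 0.9, 0.1, 0.5]))  # one expert engaged
print(route_experts(1.0, [0.2, 0.9, 0.1, 0.5]))  # all four engaged
```

The shared expert described in the abstract would run unconditionally alongside whichever specialized experts this router selects.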

JBHI Journal 2025 Journal Article

MAT: Mixing Attention Transfer from Multiple Transformers for Medical Tasks

  • Zi-Hao Bo
  • Yuchen Guo
  • Xiangru Chen
  • Jing Xie
  • Lishan Ye
  • Feng Xu

Transformers have been widely used for image analysis tasks, but in medicine they suffer from limited data availability. To overcome this challenge, we propose a novel approach, specially designed for transformers, that transfers knowledge from multiple sources to target medical tasks with limited data, named Mixing Attention Transfer (MAT). MAT aims to harness and merge knowledge from multiple source transformers at the token and layer levels to improve the performance of target medical tasks. The core component of MAT is the Mixing Attention layer, which encompasses: 1) token-level Routing and Fusion modules that allocate input images to adequate source modules; 2) a sequence-level Aligned-Attention module that adaptively aligns outputs produced by different source modules. To the best of our knowledge, this is the first multi-source transfer learning approach specifically designed for transformers. Through extensive evaluations, we demonstrate the effectiveness of MAT in three medical scenarios: noisy-labeled, class-imbalanced, and fine-grained tasks.

AAAI Conference 2024 Conference Paper

Debiased Novel Category Discovering and Localization

  • Juexiao Feng
  • Yuhong Yang
  • Yanchun Xie
  • Yaqian Li
  • Yandong Guo
  • Yuchen Guo
  • Yuwei He
  • Liuyu Xiang

In recent years, object detection in deep learning has experienced rapid development. However, most existing object detection models perform well only on closed-set datasets, ignoring a large number of potential objects whose categories are not defined in the training set. These objects are often identified as background or incorrectly classified as pre-defined categories by the detectors. In this paper, we focus on the challenging problem of Novel Class Discovery and Localization (NCDL), aiming to train detectors that can detect the categories present in the training data while also actively discovering, localizing, and clustering new categories. We analyze existing NCDL methods and identify the core issue: object detectors tend to be biased towards seen objects, which leads to the neglect of unseen targets. To address this issue, we first propose a Debiased Region Mining (DRM) approach that combines a class-agnostic Region Proposal Network (RPN) and a class-aware RPN in a complementary manner. Additionally, we propose improving the representation network through semi-supervised contrastive learning by leveraging unlabeled data. Finally, we adopt a simple and efficient mini-batch K-means clustering method for novel class discovery. We conduct extensive experiments on the NCDL benchmark, and the results demonstrate that the proposed DRM approach significantly outperforms previous methods, establishing a new state of the art.
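The final clustering step named in the abstract is standard mini-batch K-means. A minimal sketch in the spirit of Sculley's mini-batch variant, operating on plain tuples of region features; the batch size, iteration count, and per-center learning rate schedule are illustrative choices, not the paper's settings:

```python
import random

def minibatch_kmeans(points, k, batch_size=8, iters=50, seed=0):
    """Mini-batch K-means sketch: sample a small batch per iteration,
    assign each sample to its nearest center, and nudge that center
    toward the sample with a per-center decaying learning rate."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(iters):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            # Nearest center by squared Euclidean distance.
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center decaying learning rate
            centers[j] = [(1 - eta) * a + eta * b
                          for a, b in zip(centers[j], p)]
    return centers

# Two tight blobs of unlabeled region features -> two discovered clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
centers = minibatch_kmeans(pts, 2)
```

In the NCDL setting, `points` would be the learned embeddings of proposals not matched to any seen class, and each discovered center would stand for a candidate novel category.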

ECAI Conference 2024 Conference Paper

Detecting Objects as Cascade Corners

  • Chenglong Liu
  • Jintao Liu
  • Haoran Wei
  • Jinze Yang
  • Liangyu Xu
  • Yuchen Guo
  • Lu Fang 0001

The corner-based detection paradigm has the potential to produce high-quality boxes, but its development is constrained by three factors: 1) Hard-to-match corners: heuristic corner matching algorithms can produce incorrect boxes, especially when similar-looking objects co-occur. 2) Poor instance context: two separate corners preserve little instance semantics, so it is difficult to guarantee that both class-specific corners land on the same heatmap channel. 3) An unfriendly backbone: the training cost of the hourglass network is high. Accordingly, we build a novel corner-based framework named Corner2Net. To achieve a corner-matching-free manner, we devise a cascade corner pipeline which progressively predicts the associated corner pair in two steps instead of synchronously searching two independent corners via parallel heads. Corner2Net decouples corner localization from object classification. Both corners are class-agnostic, and the instance-specific bottom-right corner further simplifies its search space. Meanwhile, RoI features with rich semantics are extracted for classification. Popular backbones (e.g., ResNeXt) can be easily connected to Corner2Net. Experimental results on COCO show Corner2Net surpasses all existing corner-based detectors by a large margin in accuracy and speed.

AAAI Conference 2024 Conference Paper

GigaHumanDet: Exploring Full-Body Detection on Gigapixel-Level Images

  • Chenglong Liu
  • Haoran Wei
  • Jinze Yang
  • Jintao Liu
  • Wenxi Li
  • Yuchen Guo
  • Lu Fang

Performing person detection in super-high-resolution images is a challenging task. For such a task, modern detectors, which usually encode a box with a center and width/height, struggle with accuracy due to two factors: 1) Human characteristics: people come in various postures, and the center, with its high degree of freedom, makes it difficult to capture robust visual patterns; 2) Image characteristics: due to the vast scale diversity of the input (gigapixel-level), distance regression (for width and height) is hard to pinpoint, especially for a person of substantial scale who is near the camera. To address these challenges, we propose GigaHumanDet, an innovative solution aimed at further enhancing detection accuracy on gigapixel-level images. GigaHumanDet employs corner modeling to avoid the potential issues of a high degree of freedom in center pinpointing. To better distinguish similar-looking persons and enforce instance consistency of corner pairs, an instance-guided learning approach is designed to capture discriminative individual semantics. Further, we devise a reliable shape-aware bodyness measure equipped with a multi-precision strategy as the human corner matching guidance, appropriately adapted to single-view large scenes. Experimental results on the PANDA and STCrowd datasets show the superiority and strong applicability of our design. Notably, our model achieves 82.4% in terms of AP, outperforming the current state of the art by more than 10%.

AAMAS Conference 2024 Conference Paper

JDRec: Practical Actor-Critic Framework for Online Combinatorial Recommender System

  • Xin Zhao
  • Jiaxin Li
  • Zhiwei Fang
  • Yuchen Guo
  • Jinyuan Zhao
  • Jie He
  • Wenlong Chen
  • Changping Peng

In the realm of online recommendation systems, the Combinatorial Recommender (CR) system stands out for its unique approach. It presents users with a list of items on a result page, where user behavior is simultaneously influenced by contextual information and the items listed. Formulated as a combinatorial optimization problem, the objective of the CR system is to maximize the recommendation reward across the entire list of items. Despite the significant potential of CR systems, developing a practical and efficient model remains a substantial challenge. These challenges stem from the dynamic nature of online environments and the pressing need for personalized recommendations. To tackle them, we decompose the overarching problem into two sub-problems: list generation and list evaluation. We propose novel and pragmatic model architectures for each sub-problem, aiming to concurrently enhance both effectiveness and efficiency. To further adapt the CR system to online scenarios, we integrate a bootstrap algorithm into an actor-critic reinforcement framework. This approach, called the JD Recommender System (JDRec), is designed to continuously refine the recommendation model through sustained user interaction, ensuring the system's adaptability and relevance. The proposed JDRec framework, tested through rigorous offline and online experiments, has shown promising results. It has been successfully deployed in JD's online recommendation systems, yielding a notable improvement in click-through rate of 2.6% and augmenting the total value of the platform by 5.03%. We also release the large-scale dataset used in our work to facilitate further research.

ECAI Conference 2024 Conference Paper

SaccadeMOT: Enhancing Object Detection and Tracking in Gigapixel Images via Scale-Aware Density Estimation

  • Wenxi Li
  • Ruxin Zhang
  • Haozhe Lin
  • Yuchen Guo
  • Chao Ma 0004
  • Xiaokang Yang 0001

The proliferation of gigapixel imaging has ushered in unprecedented challenges in object detection and tracking due to the intense computational demands. Previous deep learning approaches, often tailored for megapixel images, fall short in addressing the unique complexities presented at the gigapixel level. To bridge this gap, we introduce SaccadeMOT, a novel architecture designed for efficient gigapixel-level multi-object tracking. Based on our observations of density map regression in crowd counting and small object detection in object detection tasks, we propose a novel gigapixel detection paradigm that combines the strengths of both approaches. Firstly, the “saccade” stage swiftly identifies regions likely containing objects, followed by the “gaze” stage that refines the detection within these areas. This strategic region selection is complemented by a robust tracking mechanism that combines head and body tracking, enhancing accuracy in environments with potential occlusions. Validated on the PANDA dataset, SaccadeMOT not only demonstrates a 13× speed improvement over the existing state-of-the-art tracker BotSORT but also exhibits promising applications in gigapixel-level pathology analysis, particularly in Whole Slide Imaging (WSI). This approach sets a new benchmark for handling super-high-resolution images, offering significant advancements in both the speed and precision of object tracking technologies.

ICLR Conference 2023 Conference Paper

Consolidator: Mergable Adapter with Group Connections for Visual Adaptation

  • Tianxiang Hao 0001
  • Hui Chen 0013
  • Yuchen Guo
  • Guiguang Ding

Recently, transformers have shown strong ability as visual feature extractors, surpassing traditional convolution-based models in various scenarios. However, the success of vision transformers owes largely to their capacity to accommodate numerous parameters. As a result, new challenges arise for adapting a well-trained transformer to downstream tasks. On the one hand, classic fine-tuning tunes all parameters in a huge model for every downstream task and thus easily overfits, leading to inferior performance. On the other hand, on resource-limited devices, fine-tuning stores a full copy of all parameters and is thus usually impracticable due to the shortage of storage space. However, few works have focused on how to efficiently and effectively transfer knowledge in a vision transformer. Existing methods did not dive into the properties of visual features, leading to inferior performance. Moreover, some of them incur heavy inference cost despite saving storage. To tackle these problems, we propose the consolidator to achieve efficient transfer learning for large vision models. Our consolidator modifies the pre-trained model with the addition of a small set of tunable parameters to temporarily store the task-specific knowledge while freezing the backbone model during adaptation. Motivated by the success of group-wise convolution, we adopt grouped connections across the features extracted by fully connected layers to construct the tunable parts in a consolidator. To further enhance the model's capacity to transfer knowledge under a constrained storage budget and keep inference efficient, we consolidate the parameters in two stages: 1) between adaptation and storage, and 2) between loading and inference. On a series of downstream visual tasks, our consolidator can reach up to 7.56 better accuracy than full fine-tuning with merely 0.35% of the parameters, and outperforms state-of-the-art parameter-efficient tuning methods by a clear margin.
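The consolidation step, the part that keeps inference cost unchanged, amounts to folding the trained adapter blocks back into the frozen weight matrix. A toy sketch: the `(row, col, block)` encoding of the grouped connections is an illustrative assumption, and the paper's two-stage consolidation (adaptation-to-storage, loading-to-inference) is collapsed into a single merge here.

```python
def consolidate(W, groups):
    """Fold trained group-connected adapter blocks back into the frozen
    weight, so the adapted model keeps the original shape and inference
    cost. `W` is a dense weight matrix (list of rows); `groups` is a list
    of (row_offset, col_offset, block) triples placing each tunable block."""
    merged = [row[:] for row in W]          # leave the frozen weight intact
    for r0, c0, block in groups:
        for i, row in enumerate(block):
            for j, v in enumerate(row):
                merged[r0 + i][c0 + j] += v  # additive merge of the adapter
    return merged

# Two 1x1 adapter blocks sitting on the diagonal of a 2x2 weight.
W = [[1.0, 0.0], [0.0, 1.0]]
merged = consolidate(W, [(0, 0, [[0.5]]), (1, 1, [[0.25]])])
```

Only the small blocks need to be stored per task; the dense backbone weight is shared, which is where the storage saving comes from.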
Code is available at github.

AAAI Conference 2022 Conference Paper

ReMoNet: Recurrent Multi-Output Network for Efficient Video Denoising

  • Liuyu Xiang
  • Jundong Zhou
  • Jirui Liu
  • Zerun Wang
  • Haidong Huang
  • Jie Hu
  • Jungong Han
  • Yuchen Guo

While deep neural network-based video denoising methods have achieved promising results, it is still hard to deploy them on mobile devices due to their high computational cost and memory demands. This paper aims to develop a lightweight deep video denoising method that is friendly to resource-constrained mobile devices. Inspired by the facts that 1) consecutive video frames usually contain redundant temporal coherency, and 2) neural networks are usually over-parameterized, we propose a multi-input multi-output (MIMO) paradigm to process consecutive video frames within one forward pass. The basic idea is concretized into a novel architecture termed Recurrent Multi-output Network (ReMoNet), which consists of recurrent temporal fusion and temporal aggregation blocks and is further reinforced by similarity-based mutual distillation. We conduct extensive experiments on an NVIDIA GPU and the Qualcomm Snapdragon 888 mobile platform with Gaussian noise and simulated Image Signal Processor (ISP) noise. The experimental results show that ReMoNet is both effective and efficient at video denoising. Moreover, we show that ReMoNet is more robust under higher noise levels.

AAAI Conference 2022 Conference Paper

SECRET: Self-Consistent Pseudo Label Refinement for Unsupervised Domain Adaptive Person Re-identification

  • Tao He
  • Leqi Shen
  • Yuchen Guo
  • Guiguang Ding
  • Zhenhua Guo

Unsupervised domain adaptive person re-identification aims at learning on an unlabeled target domain with only labeled data in the source domain. Currently, state-of-the-art methods usually solve this problem by pseudo-label-based clustering and fine-tuning in the target domain. However, the reason behind the noise in pseudo labels is not sufficiently explored, especially for the popular multi-branch models. We argue that the consistency between different feature spaces is the key to the pseudo labels' quality. We therefore propose a SElf-Consistent pseudo label RefinEmenT method, termed SECRET, to improve consistency by mutually refining the pseudo labels generated from different feature spaces. The proposed SECRET gradually encourages the improvement of pseudo labels' quality during the training process, which further leads to better cross-domain Re-ID performance. Extensive experiments on benchmark datasets show the superiority of our method. Specifically, our method outperforms the state of the art by 6.3% in terms of mAP on the challenging MSMT17 dataset. In the purely unsupervised setting, our method also surpasses existing works by a large margin. Code is available at https://github.com/LunarShen/SECRET.

AAAI Conference 2020 Conference Paper

Heterogeneous Transfer Learning with Weighted Instance-Correspondence Data

  • Yuwei He
  • Xiaoming Jin
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Jiyong Zhang
  • Sicheng Zhao

Instance-correspondence (IC) data are potent resources for heterogeneous transfer learning (HeTL) due to their capability of bridging the source and target domains at the instance level. To this end, people tend to use machine-generated IC data, because manually establishing IC data is expensive. However, existing IC data generators are not perfect and often produce data that are not of high quality, thus hampering the performance of domain adaptation. In this paper, instead of improving the IC data generator, which might not be an optimal way, we accept the fact that data quality variation does exist but find a better way to use the data. Specifically, we propose a novel heterogeneous transfer learning method named Transfer Learning with Weighted Correspondence (TLWC), which utilizes IC data to adapt the source domain to the target domain. Rather than treating IC data equally, TLWC assigns an appropriate weight to each IC data pair depending on its quality. We conduct extensive experiments on HeTL datasets, and the state-of-the-art results verify the effectiveness of TLWC.

ICML Conference 2019 Conference Paper

Approximated Oracle Filter Pruning for Destructive CNN Width Optimization

  • Xiaohan Ding
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Chenggang Yan 0001

It is not easy to design and run Convolutional Neural Networks (CNNs) due to: 1) finding the optimal number of filters (i.e., the width) at each layer is tricky, given an architecture; and 2) the computational intensity of CNNs impedes deployment on computationally limited devices. Oracle Pruning is designed to remove the unimportant filters from a well-trained CNN; it estimates the filters' importance by ablating them in turn and evaluating the model, thus delivering high accuracy, but it suffers from intolerable time complexity and requires a given resulting width rather than finding one automatically. To address these problems, we propose Approximated Oracle Filter Pruning (AOFP), which keeps searching for the least important filters in a binary search manner, makes pruning attempts by masking out filters randomly, accumulates the resulting errors, and fine-tunes the model via a multi-path framework. As AOFP enables simultaneous pruning on multiple layers, we can prune an existing very deep CNN with acceptable time cost, negligible accuracy drop, and no heuristic knowledge, or re-design a model which achieves higher accuracy and faster inference.

AAAI Conference 2019 Conference Paper

CycleEmotionGAN: Emotional Semantic Consistency Preserved CycleGAN for Adapting Image Emotions

  • Sicheng Zhao
  • Chuang Lin
  • Pengfei Xu
  • Sendong Zhao
  • Yuchen Guo
  • Ravi Krishna
  • Guiguang Ding
  • Kurt Keutzer

Deep neural networks excel at learning from large-scale labeled training data, but cannot generalize the learned knowledge well to new domains or datasets. Domain adaptation studies how to transfer models trained on one labeled source domain to another sparsely labeled or unlabeled target domain. In this paper, we investigate the unsupervised domain adaptation (UDA) problem in image emotion classification. Specifically, we develop a novel cycle-consistent adversarial model, termed CycleEmotionGAN, by enforcing emotional semantic consistency while adapting images cycle-consistently. By alternately optimizing the CycleGAN loss, the emotional semantic consistency loss, and the target classification loss, CycleEmotionGAN can adapt source domain images to have similar distributions to the target domain without using aligned image pairs. Simultaneously, the annotation information of the source images is preserved. Extensive experiments are conducted on the ArtPhoto and FI datasets, and the results demonstrate that CycleEmotionGAN significantly outperforms the state-of-the-art UDA approaches.

AAAI Conference 2019 Conference Paper

Dual-View Ranking with Hardness Assessment for Zero-Shot Learning

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Xiaohan Ding
  • Sicheng Zhao
  • Zheng Wang
  • Chenggang Yan
  • Qionghai Dai

Zero-shot learning (ZSL) is to build recognition models for previously unseen target classes which have no labeled data for training, by transferring knowledge from related auxiliary source classes with abundant labeled samples, with class attributes as the bridge. The key is to learn a similarity-based ranking function between samples and class labels using the labeled source classes, so that the proper (unseen) class label for a test sample can be identified by the function. To learn the function, a single-view ranking-based loss is widely used, which aims to rank the true label before the other labels for a training sample. However, we argue that the ranking can also be performed from the other view, which aims to place the images belonging to a label before the images from other classes. Motivated by this, we propose a novel DuAl-view RanKing (DARK) loss for zero-shot learning that simultaneously ranks labels for an image by a point-to-point metric and ranks images for a label by a point-to-set metric, which is capable of better modeling the relationship between images and classes. In addition, we notice that previous ZSL approaches mostly fail to fully exploit the hardness of training samples, either using only very hard ones or using all samples indiscriminately. In this work, we therefore introduce a sample hardness assessment method to ZSL which assigns different weights to training samples based on their hardness, leading to a more accurate and robust ZSL model. Experiments on benchmarks demonstrate that DARK outperforms state-of-the-art methods for (generalized) ZSL.

NeurIPS Conference 2019 Conference Paper

Global Sparse Momentum SGD for Pruning Very Deep Neural Networks

  • Xiaohan Ding
  • Guiguang Ding
  • Xiangxin Zhou
  • Yuchen Guo
  • Jungong Han
  • Ji Liu

Deep Neural Networks (DNNs) are powerful but computationally expensive and memory intensive, thus impeding their practical usage on resource-constrained front-end devices. DNN pruning is an approach for deep model compression, which aims at eliminating some parameters with tolerable performance degradation. In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning. Concretely, given a global compression ratio, we categorize all the parameters into two parts at each training iteration, which are updated using different rules. In this way, we gradually zero out the redundant parameters, as we update them using only the ordinary weight decay but no gradients derived from the objective function. As a departure from prior methods that require heavy human effort to tune the layer-wise sparsity ratios, prune by solving complicated non-differentiable problems, or fine-tune the model after pruning, our method is characterized by 1) global compression that automatically finds the appropriate per-layer sparsity ratios; 2) end-to-end training; 3) no need for a time-consuming re-training process after pruning; and 4) a superior capability to find better winning tickets which have won the initialization lottery.
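The two-part update described above can be sketched as a single optimizer step: pick the q globally most important parameters, give only those the objective gradient, and let the rest receive weight decay alone so they drift toward zero. The importance proxy |w·g| used here is a first-order Taylor term standing in for the paper's exact metric, and the hyperparameters are illustrative.

```python
def gsm_step(weights, grads, momentum, q, lr=0.1, wd=0.01, beta=0.9):
    """One global-sparse-momentum step (sketch) over flat parameter lists.
    Only the q most 'important' parameters get the objective gradient;
    the rest are updated with weight decay alone and gradually vanish."""
    # Importance proxy: |w * g| (an assumption, not the paper's exact rule).
    importance = [abs(w * g) for w, g in zip(weights, grads)]
    order = sorted(range(len(weights)), key=importance.__getitem__, reverse=True)
    active = set(order[:q])
    for i in range(len(weights)):
        g = grads[i] if i in active else 0.0   # redundant params: no gradient
        momentum[i] = beta * momentum[i] + g + wd * weights[i]
        weights[i] -= lr * momentum[i]
    return weights, momentum

# With q=1 only the more important weight moves with the gradient;
# the other just decays slightly under weight decay.
w, m = [1.0, 0.5], [0.0, 0.0]
gsm_step(w, [1.0, 1.0], m, q=1)
```

Because the active set is recomputed every iteration from a global ranking, the per-layer sparsity ratios emerge on their own, matching the "global compression" property the abstract claims.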

IJCAI Conference 2019 Conference Paper

Landmark Selection for Zero-shot Learning

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Chenggang Yan
  • Jiyong Zhang
  • Qionghai Dai

Zero-shot learning (ZSL) is an emerging research topic whose goal is to build recognition models for previously unseen classes. The basic idea of ZSL is heterogeneous feature matching, which learns a compatibility function between image and class features using seen classes. The function is constructed based on one-vs-all training, in which each class has only one class feature and many image features. Existing ZSL works mostly treat all image features equivalently. However, in this paper we argue that it is more reasonable to use some representative cross-domain data instead of all of them. Motivated by this idea, we propose a novel approach, termed Landmark Selection (LAST), for ZSL. LAST is able to identify representative cross-domain features which further lead to a better image-class compatibility function. Experiments on several ZSL datasets including ImageNet demonstrate the superiority of LAST over state-of-the-art methods.

IJCAI Conference 2019 Conference Paper

Zero-shot Learning with Many Classes by High-rank Deep Embedding Networks

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Hang Shao
  • Xin Lou
  • Qionghai Dai

Zero-shot learning (ZSL) is a recently emerging research topic which aims to build classification models for unseen classes with knowledge from auxiliary seen classes. Though many ZSL works have shown promising results on small-scale datasets by utilizing a bilinear compatibility function, the ZSL performance on large-scale datasets with many classes (say, ImageNet) is still unsatisfactory. We argue that the bilinear compatibility function is a low-rank approximation of the true compatibility function such that it is not expressive enough especially when there are a large number of classes because of the rank limitation. To address this issue, we propose a novel approach, termed as High-rank Deep Embedding Networks (GREEN), for ZSL with many classes. In particular, we propose a feature-dependent mixture of softmaxes as the image-class compatibility function, which is a simple extension of the bilinear compatibility function, but yields much better results. It utilizes a mixture of non-linear transformations with feature-dependent latent variables to approximate the true function in a high-rank way, which makes GREEN more expressive. Experiments on several datasets including ImageNet demonstrate GREEN significantly outperforms the state-of-the-art approaches.
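The compatibility function described above is easy to write down: a convex combination of K softmax distributions, each computed under its own transform of the image feature. A sketch with hand-picked toy weights; in the model the priors pi_k(x) would come from a small feature-dependent gating network, but here they are passed in precomputed, and all shapes are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mos_compatibility(x, class_embs, transforms, priors):
    """Mixture-of-softmaxes scoring (sketch): component k maps the image
    feature with its own linear transform, scores it against every class
    embedding by inner product, and the K softmax distributions are
    blended by the priors pi_k(x)."""
    out = [0.0] * len(class_embs)
    for T, pi in zip(transforms, priors):
        hx = [sum(t * xi for t, xi in zip(row, x)) for row in T]
        logits = [sum(h * c for h, c in zip(hx, cls)) for cls in class_embs]
        for j, p in enumerate(softmax(logits)):
            out[j] += pi * p
    return out

# Two components (identity map and a coordinate swap), equal priors.
x = [1.0, 0.0]
classes = [[1.0, 0.0], [0.0, 1.0]]
transforms = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
scores = mos_compatibility(x, classes, transforms, [0.5, 0.5])
```

Since each component is itself a distribution over classes, the mixture is too; yet, unlike a single softmax over a bilinear score, the blend is not limited to a low-rank logit matrix, which is the high-rank argument the abstract makes.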

IJCAI Conference 2018 Conference Paper

Grouping Attribute Recognition for Pedestrian with Joint Recurrent Learning

  • Xin Zhao
  • Liufang Sang
  • Guiguang Ding
  • Yuchen Guo
  • Xiaoming Jin

Pedestrian attribute recognition is to predict attribute labels of pedestrians from surveillance images, which is a very challenging task for computer vision due to poor imaging quality and small training datasets. It is observed that the semantic pedestrian attributes to be recognized tend to show semantic or visual spatial correlations. Attributes can be grouped by this correlation, yet previous works mostly ignore this phenomenon. Inspired by the Recurrent Neural Network (RNN)'s strong capability of learning context correlations, this paper proposes an end-to-end Grouping Recurrent Learning (GRL) model that takes advantage of intra-group mutual exclusion and inter-group correlation to improve the performance of pedestrian attribute recognition. Our GRL method starts with the detection of a precise body region via Body Region Proposal, followed by feature extraction from the detected regions. These features, along with the semantic groups, are fed into an RNN for recurrent grouping attribute recognition, where intra-group correlations can be learned. Extensive empirical evidence shows that our GRL model achieves state-of-the-art results on pedestrian attribute datasets, i.e., the standard PETA and RAP datasets.

IJCAI Conference 2018 Conference Paper

Implicit Non-linear Similarity Scoring for Recognizing Unseen Classes

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Sicheng Zhao
  • Bin Wang

Recognizing unseen classes is an important task for real-world applications because: 1) it is common that some classes in reality have no labeled image exemplars for training; and 2) novel classes emerge rapidly. Recently, to address this task many zero-shot learning (ZSL) approaches have been proposed in which explicit linear scores, like the inner product score, are employed to measure the similarity between a class and an image. We argue that explicit linear scoring (ELS) is too weak to capture complicated image-class correspondence. We propose a simple yet effective framework, called Implicit Non-linear Similarity Scoring (ICINESS). In particular, we train a scoring network which uses image and class features as input, fuses them by hidden layers, and outputs the similarity. Based on the universal approximation theorem, it can approximate the true similarity function between images and classes if a proper structure is used, in an implicit non-linear way, which is more flexible and powerful. With the ICINESS framework, we implement ZSL algorithms with shallow and deep networks, which yield consistently superior results.
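The scoring network the abstract describes reduces, in its smallest form, to an MLP over the concatenated image and class features: the network itself defines the similarity rather than a fixed bilinear form. All weights below are hand-picked toys, not trained parameters.

```python
import math

def iciness_score(img_feat, cls_feat, W1, b1, w2, b2):
    """Implicit non-linear scoring (sketch): concatenate image and class
    features, pass them through one hidden tanh layer, and read out a
    scalar similarity. The hidden layer is what makes the score non-linear
    in the (image, class) pair, unlike an inner-product score."""
    z = list(img_feat) + list(cls_feat)              # fuse by concatenation
    h = [math.tanh(sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(W1, b1)]                  # hidden non-linearity
    return sum(w * v for w, v in zip(w2, h)) + b2    # scalar similarity

# Toy weights: each hidden unit reads one coordinate of the fused vector.
W1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
b1 = [0.0, 0.0]
w2, b2 = [1.0, 1.0], 0.0
score = iciness_score([1.0, 0.0], [1.0, 0.0], W1, b1, w2, b2)
```

At test time, an unseen image would be assigned the class whose attribute vector maximizes this score, the same decision rule as with an explicit linear score.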

AAAI Conference 2018 Conference Paper

On Trivial Solution and High Correlation Problems in Deep Supervised Hashing

  • Yuchen Guo
  • Xin Zhao
  • Guiguang Ding
  • Jungong Han

Deep supervised hashing (DSH), which combines binary code learning and convolutional neural networks, has attracted considerable research interest and achieved promising performance for highly efficient image retrieval. In this paper, we show that the widely used loss functions, pair-wise loss and triplet loss, suffer from the trivial solution problem and usually lead to highly correlated bits in practice, limiting the performance of DSH. One important reason is that it is difficult to incorporate proper constraints into the loss functions under mini-batch-based optimization. To tackle these problems, we propose to adopt an ensemble learning strategy for deep model training. We find that this simple strategy is capable of effectively decorrelating different bits, making the hashcodes more informative. Moreover, it is very easy to parallelize the training and support incremental model learning, which are very useful for real-world applications but usually ignored by existing DSH approaches. Experiments on benchmarks demonstrate that the proposed ensemble-based DSH can improve the performance of DSH approaches significantly.

IJCAI Conference 2018 Conference Paper

Where to Prune: Using LSTM to Guide End-to-end Pruning

  • Jing Zhong
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Bin Wang

Recent years have witnessed the great success of convolutional neural networks (CNNs) in many related fields. However, their huge model size and computational complexity make it difficult to deploy CNNs in some scenarios, such as embedded systems with low computational power. To address this issue, many works have proposed pruning filters in CNNs to reduce computation, but they mainly focus on identifying which filters are unimportant within a layer and then prune filters layer by layer or globally. In this paper, we argue that the pruning order is also very significant for model pruning. We propose a novel approach to decide which layers should be pruned at each step. First, we utilize a long short-term memory (LSTM) network to learn the hierarchical characteristics of a network and generate a pruning decision for each layer, which is the main difference from previous works. Next, a channel-based method is adopted to evaluate the importance of filters in a to-be-pruned layer, followed by an accelerated recovery step. Experimental results demonstrate that our approach can reduce FLOPs by 70.1% for VGG and 47.5% for ResNet-56 with comparable accuracy. The learning results also appear to reveal the sensitivity of each network layer.
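The per-layer filter-evaluation step can be sketched in numpy. The L1-norm importance criterion below is an assumed stand-in; the paper's actual channel-based criterion, and the LSTM policy that decides which layer to prune, are not reproduced here.

```python
import numpy as np

def prune_layer(weights, ratio):
    """Rank filters in one conv layer and drop the least important
    fraction `ratio`. `weights` has shape (out_ch, in_ch, k, k);
    L1 norm per filter is an assumed importance proxy."""
    importance = np.abs(weights).sum(axis=(1, 2, 3))      # per-filter L1 norm
    n_keep = int(round(len(importance) * (1.0 - ratio)))
    keep = np.sort(np.argsort(importance)[::-1][:n_keep])  # keep top filters, in order
    return weights[keep], keep

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 32, 3, 3))    # a hypothetical 64-filter conv layer
W_pruned, kept = prune_layer(W, ratio=0.5)
```

In the full method, a decision like `ratio` per layer would come from the LSTM rather than being fixed by hand.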

AAAI Conference 2018 Conference Paper

Zero-Shot Learning With Attribute Selection

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Sheng Tang

Zero-shot learning (ZSL) is regarded as an effective way to construct classification models for target classes that have no labeled samples available. The basic framework is to transfer knowledge from (different) auxiliary source classes, which have sufficient labeled samples, using attributes shared by target and source classes as a bridge. Attributes play an important role in ZSL but have not gained sufficient attention in recent years. Previous works mostly assume attributes are perfect and treat each attribute equally. However, as shown in this paper, different attributes have different properties, such as their class distribution, variance, and entropy, which may have considerable impact on ZSL accuracy if all attributes are treated equally. Based on this observation, we propose to use a subset of attributes, instead of the whole set, for building ZSL models. The attribute selection is conducted by considering information amount and predictability under a novel joint optimization framework. To our knowledge, this is the first work that examines the influence of the attributes themselves and proposes to use a refined attribute set for ZSL. Since our approach focuses on selecting good attributes for ZSL, it can be combined with any attribute-based ZSL approach to augment its performance. Experiments on four ZSL benchmarks demonstrate that our approach can improve zero-shot classification accuracy and yield state-of-the-art results.

AAAI Conference 2017 Conference Paper

Active Learning with Cross-Class Similarity Transfer

  • Yuchen Guo
  • Guiguang Ding
  • Yue Gao
  • Jungong Han

How to save labeling effort for training supervised classifiers is an important research topic in the machine learning community. Active learning (AL) and transfer learning (TL) are two useful tools to achieve this goal, and their combination, i.e., transfer active learning (T-AL), has also attracted considerable research interest. However, existing T-AL approaches consider transferring knowledge from a source/auxiliary domain that has the same class labels as the target domain, ignoring the relationship among classes. In this paper, we investigate a more practical setting where the classes in the source domain are related/similar to, but different from, the target domain classes. Specifically, we propose a novel cross-class T-AL approach to simultaneously transfer knowledge from the source domain and actively annotate the most informative samples in the target domain, so that satisfactory classifiers can be trained with as few labeled samples as possible. In particular, based on class-class similarity and sample-sample similarity, we adopt similarity propagation to find the source-domain samples that best capture the characteristics of a target class and then transfer those samples as (pseudo-)labeled data for the target class. In turn, the labeled and transferred samples are used to train classifiers and actively select new samples for annotation. Extensive experiments on three datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art related approaches.
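One scoring step of the cross-class transfer idea can be sketched as follows. The class-class similarities, the seed sample, and the top-5 cutoff are all hypothetical illustration values; the paper's iterative similarity propagation is reduced here to a single combination of class-level and sample-level similarity.

```python
import numpy as np

rng = np.random.default_rng(6)
S_cls = np.array([0.9, 0.2])            # similarity of 2 source classes to one target class
src_labels = rng.integers(0, 2, 30)     # class of each source sample
X_src = rng.standard_normal((30, 5))    # source-domain sample features
x_target_seed = rng.standard_normal(5)  # a sample believed typical of the target class

# Combine class-level and sample-level similarity in one step.
samp_sim = X_src @ x_target_seed
score = S_cls[src_labels] * samp_sim

# Transfer the top-scoring source samples as pseudo-labeled data
# for the target class.
transfer_idx = np.argsort(score)[::-1][:5]
```

The transferred samples would then join the actively annotated ones to train the target-class classifier.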

AAAI Conference 2017 Short Paper

Problems in Large-Scale Image Classification

  • Yuchen Guo

The number of images has grown rapidly in recent years because of the development of the Internet, especially social networks like Facebook, and the popularization of portable image-capture devices like smartphones. Annotating them with semantically meaningful words to describe them, i.e., classification, is a useful way to manage these images. However, the huge numbers of images and classes bring several challenges to classification, two of which are: 1) how to measure similarity efficiently between large-scale collections of images (measuring similarity between samples is, for example, the building block for SVM and kNN classifiers); and 2) how to train supervised classification models for newly emerging classes with only a few, or even no, labeled samples, since new concepts, like Tesla's Model S, appear on the Web every day. My Ph.D. thesis focuses on these two problems in large-scale image classification. Formally, they are termed large-scale similarity search, which concerns the large scale of samples/images, and zero-shot/few-shot learning, which concerns the large scale of classes. Specifically, my research considers the following three aspects: 1) hashing-based large-scale similarity search, which adopts hashing to improve efficiency; 2) cross-class transfer active learning, which simultaneously transfers knowledge from the abundant labeled samples on the Web and selects the most informative samples for expert labeling, so that effective classifiers for novel classes can be constructed with only a few labeled samples; and 3) zero-shot learning, which utilizes no labeled samples for novel classes at all, building supervised classifiers for them by transferring knowledge from related classes.

IJCAI Conference 2017 Conference Paper

SitNet: Discrete Similarity Transfer Network for Zero-shot Hashing

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Yue Gao

Hashing has been widely utilized for fast image retrieval recently. With semantic information as supervision, hashing approaches perform much better, especially when combined with a deep convolutional neural network (CNN). However, in practice, new concepts emerge every day, making it infeasible to collect supervised information for re-training the hashing model. In this paper, we propose a novel zero-shot hashing approach, called Discrete Similarity Transfer Network (SitNet), to preserve the semantic similarity between images from both ``seen'' concepts and new ``unseen'' concepts. Motivated by zero-shot learning, the semantic vectors of concepts are adopted to capture the similarity structure among classes, so that a model trained on seen concepts generalizes well to unseen ones, benefiting from the transferability of the semantic vector space. We adopt a multi-task architecture to exploit the supervised information for seen concepts and the semantic vectors simultaneously. Moreover, a discrete hashing layer is integrated into the network for hashcode generation, avoiding the information loss caused by real-value relaxation in the training phase, which is a critical problem in existing works. Experiments on three benchmarks validate the superiority of SitNet over the state-of-the-art.
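The motivation for a discrete hashing layer can be illustrated by comparing a real-valued relaxation with direct sign binarization. This sketch only shows the quantization gap that relaxation leaves behind; it does not reproduce SitNet's training procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.standard_normal(16)                  # pre-binarization activations

relaxed = np.tanh(u)                         # real-valued relaxation (common in prior work)
discrete = np.where(u >= 0, 1.0, -1.0)       # output of a discrete (sign) hashing layer

# Quantization error that the relaxation would leave behind at test
# time, when codes must finally be binarized:
err = np.abs(relaxed - discrete).mean()
```

Training against the discrete codes directly, as SitNet does, removes this train/test mismatch.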

IJCAI Conference 2017 Conference Paper

Synthesizing Samples for Zero-shot Learning

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Yue Gao

Zero-shot learning (ZSL) constructs recognition models for unseen target classes that have no labeled samples for training. It utilizes class attributes or semantic vectors as side information and transfers supervision from related source classes with abundant labeled samples. Existing ZSL approaches adopt an intermediary embedding space to measure the similarity between a sample and the attributes of a target class to perform zero-shot classification. However, this approach may suffer from information loss caused by the embedding process, and the similarity measure cannot fully exploit the data distribution. In this paper, we propose a novel approach that turns the ZSL problem into a conventional supervised learning problem by synthesizing samples for the unseen classes. Firstly, the probability distribution of an unseen class is estimated using the knowledge from seen classes and the class attributes. Secondly, samples are synthesized from this distribution for the unseen class. Finally, any supervised classifier can be trained on the synthesized samples. Extensive experiments on benchmarks demonstrate the superiority of the proposed approach over state-of-the-art ZSL approaches.
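The three-step pipeline (estimate the unseen-class distribution, synthesize samples, train any supervised classifier) can be sketched with Gaussian class models. The attribute-derived weights, the shared covariance, and the two-class setup are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(4)

# Seen-class statistics (illustrative): per-class mean and a shared covariance.
seen_means = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
shared_cov = 0.1 * np.eye(2)

# Hypothetical attribute similarity of the unseen class to each seen class
# (in the paper these weights would come from class attribute vectors).
w = {"cat": 0.7, "dog": 0.3}

# Step 1: estimate the unseen-class distribution from the seen classes.
mu_unseen = sum(w[c] * seen_means[c] for c in seen_means)

# Step 2: synthesize labeled samples for the unseen class.
samples = rng.multivariate_normal(mu_unseen, shared_cov, size=100)

# Step 3: the pairs (samples, unseen-class label) can now train any
# conventional supervised classifier.
```

The key design choice is that classification happens in the original feature space, so no embedding step can lose information.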

IJCAI Conference 2017 Conference Paper

TUCH: Turning Cross-view Hashing into Single-view Hashing via Generative Adversarial Nets

  • Xin Zhao
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Yue Gao

Cross-view retrieval, which focuses on searching for images in response to text queries or vice versa, has received increasing attention recently. Cross-view hashing efficiently solves the cross-view retrieval problem with binary hash codes. Most existing works on cross-view hashing exploit multi-view embedding methods to tackle this problem, which inevitably causes information loss in both the image and text domains. Inspired by Generative Adversarial Nets (GANs), this paper presents a new model that is able to Turn Cross-view Hashing into single-view hashing (TUCH), enabling the image information to be preserved as much as possible. TUCH is a novel deep architecture that integrates a language model network T for text feature extraction, a generator network G to generate fake images from text features, and a hashing network H for learning hash functions to generate compact binary codes. Our architecture effectively unifies joint generative adversarial learning and cross-view hashing. Extensive empirical evidence shows that our TUCH approach achieves state-of-the-art results, especially on text-to-image retrieval, on image-sentence datasets, i.e., the standard IAPRTC-12 and the large-scale Microsoft COCO.

IJCAI Conference 2017 Conference Paper

Unsupervised Deep Video Hashing with Balanced Rotation

  • Gengshen Wu
  • Li Liu
  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Jialie Shen
  • Ling Shao

Recently, hashing video contents for fast retrieval has received increasing attention due to the enormous growth of online videos. As an extension of image hashing techniques, traditional video hashing methods mainly focus on seeking appropriate video features but pay little attention to how the video-specific features can be leveraged to achieve optimal binarization. In this paper, an end-to-end hashing framework, namely Unsupervised Deep Video Hashing (UDVH), is proposed, where feature extraction, balanced code learning and hash function learning are integrated and optimized in a self-taught manner. Particularly, distinguished from previous work, our framework enjoys two novelties: 1) an unsupervised hashing method that integrates feature clustering and feature binarization, enabling the neighborhood structure to be preserved in the binary space; 2) a smart rotation applied to the video-specific features, which are widely spread in the low-dimensional space, such that the variance of the dimensions can be balanced, thus generating more effective hash codes. Extensive experiments have been performed on two real-world datasets and the results demonstrate its superiority compared to the state-of-the-art video hashing methods. To bootstrap further developments, the source code will be made publicly available.
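The variance-balancing effect of a rotation can be shown in two dimensions, where a 45-degree rotation is the balanced choice for two decorrelated dimensions. This is only a toy illustration of why rotating features before binarization helps; UDVH learns its rotation jointly with the codes, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
# Features with very unbalanced per-dimension variance.
X = rng.standard_normal((1000, 2)) * np.array([3.0, 0.3])

theta = np.pi / 4                      # 45-degree rotation balances two dims
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Xr = X @ R

var_before = X.var(axis=0)             # one dominant dimension
var_after = Xr.var(axis=0)             # variance spread across both dims
```

With variance balanced, a per-dimension sign threshold yields bits that each carry comparable information.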

AAAI Conference 2017 Conference Paper

Zero-Shot Recognition via Direct Classifier Learning with Transferred Samples and Pseudo Labels

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Yue Gao

As an interesting and emerging topic, zero-shot recognition (ZSR) makes it possible to train a recognition model by specifying a category's attributes when no labeled exemplars are available. The fundamental idea of ZSR is to transfer knowledge from the abundant labeled data in different but related source classes via the class attributes. Conventional ZSR approaches adopt a two-step strategy at test time: samples are first projected into the attribute space, and recognition is then carried out based on the relationship between samples and classes in that space. Due to this intermediate transformation, information loss is unavoidable, degrading the performance of the overall system. Rather than following this two-step strategy, in this paper we propose a novel one-step approach that performs ZSR in the original feature space using directly trained classifiers. To tackle the problem that no labeled samples of target classes are available, we propose to assign pseudo labels to samples based on reliability and diversity, which in turn are used to train the classifiers. Moreover, we adopt a robust SVM that accounts for the unreliability of pseudo labels. Extensive experiments on four datasets demonstrate consistent performance gains of our approach over state-of-the-art two-step ZSR approaches.
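The reliability-based pseudo-labeling step can be sketched as a margin rule over attribute-derived class prototypes. The margin criterion and the 20% cutoff are illustrative assumptions; the paper additionally enforces diversity and trains a robust SVM, both omitted here.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))         # unlabeled target-class samples
protos = rng.standard_normal((3, 4))     # attribute-derived class prototypes

# Reliability: margin between the best and second-best class affinity.
aff = X @ protos.T
order = np.sort(aff, axis=1)
margin = order[:, -1] - order[:, -2]
labels = aff.argmax(axis=1)

# Keep only confidently pseudo-labeled samples (top 20% by margin);
# these would then train per-class classifiers in the original space.
thresh = np.quantile(margin, 0.8)
keep = margin >= thresh
pseudo_X, pseudo_y = X[keep], labels[keep]
```

Samples below the margin threshold stay unlabeled rather than risk polluting the classifier with wrong pseudo labels.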

AAAI Conference 2016 Conference Paper

Active Learning with Cross-Class Knowledge Transfer

  • Yuchen Guo
  • Guiguang Ding
  • Yuqi Wang
  • Xiaoming Jin

When there are insufficient labeled samples for training a supervised model, we can adopt active learning to select the most informative samples for human labeling, or transfer learning to transfer knowledge from a related labeled data source. Combining transfer learning with active learning has attracted much research interest in recent years. Most existing works follow the setting where the class labels in the source domain are the same as those in the target domain. In this paper, we focus on a more challenging cross-class setting where the class labels are totally different in the two domains but related to each other in an intermediary attribute space, which has barely been investigated before. We propose a novel and effective method that utilizes the attribute representation as seed parameters to generate the classification models for classes. We further propose a joint learning framework that takes into account both the knowledge from the related classes in the source domain and the information in the target domain. Besides, uncertainty sampling, a fundamental technique for active learning, is simple to perform within this framework. We conduct experiments on three benchmark datasets and the results demonstrate the efficacy of the proposed method.

IJCAI Conference 2016 Conference Paper

Semi-Supervised Active Learning with Cross-Class Sample Transfer

  • Yuchen Guo
  • Guiguang Ding
  • Yue Gao
  • Jianmin Wang

To save the labeling effort for training a classification model, we can simultaneously adopt Active Learning (AL) to select the most informative samples for human labeling, and Semi-supervised Learning (SSL) to construct effective classifiers using a few labeled samples and a large number of unlabeled samples. Recently, using Transfer Learning (TL) to enhance AL and SSL, i.e., T-SS-AL, has gained considerable attention. However, existing T-SS-AL methods mostly focus on the situation where the source domain and the target domain share the same classes. In this paper, we consider a more practical and challenging setting where the source domain and the target domain have different but related classes. We propose a novel cross-class sample transfer based T-SS-AL method, called CC-SS-AL, to exploit the information from the source domain. Our key idea is to select samples from the source domain that are very similar to the target domain classes and assign pseudo labels to them for classifier training. Extensive experiments on three datasets verify the efficacy of the proposed method.

AAAI Conference 2016 Conference Paper

Transductive Zero-Shot Recognition via Shared Model Space Learning

  • Yuchen Guo
  • Guiguang Ding
  • Xiaoming Jin
  • Jianmin Wang

Zero-shot recognition (ZSR) learns recognition models for novel classes without labeled data. It is a challenging task and has drawn considerable attention in recent years. The basic idea is to transfer knowledge from seen classes via shared attributes. This paper focuses on transductive ZSR, i.e., the setting where unlabeled data for the novel classes are available. Instead of learning models for seen and novel classes separately, as in existing works, we put forward a novel joint learning approach which learns a shared model space (SMS) such that knowledge can be effectively transferred between classes using the attributes. An effective algorithm is proposed for optimization. We conduct comprehensive experiments on three benchmark datasets for ZSR. The results demonstrate that the proposed SMS significantly outperforms the state-of-the-art related approaches, validating its efficacy for the ZSR task.

AAAI Conference 2015 Conference Paper

Learning Predictable and Discriminative Attributes for Visual Recognition

  • Yuchen Guo
  • Guiguang Ding
  • Xiaoming Jin
  • Jianmin Wang

Utilizing attributes for visual recognition has attracted increasing interest because attributes can effectively bridge the semantic gap between low-level visual features and high-level semantic labels. In this paper, we propose a novel method for learning predictable and discriminative attributes. Specifically, we require that the learned attributes can be reliably predicted from visual features and that they discover the inherent discriminative structure of the data. In addition, we propose to exploit the intra-category locality of data to overcome the intra-category variance in visual data. We conduct extensive experiments on the Animals with Attributes (AwA) and Caltech256 datasets, and the results demonstrate that the proposed method achieves state-of-the-art performance.