Arrow Research search

Author name cluster

Guo Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

AAAI Conference 2026 Conference Paper

Boosting Adversarial Transferability via Ensemble Non-Attention

  • Yipeng Zou
  • Qin Liu
  • Jie Wu
  • Yu Peng
  • Guo Chen
  • Hui Zhou
  • Guanghui Ye

Ensemble attacks integrate the outputs of surrogate models with diverse architectures and can be combined with various gradient-based attacks to improve adversarial transferability. However, previous work shows unsatisfactory attack performance when transferring across heterogeneous model architectures. The main reason is that the gradient update directions of heterogeneous surrogate models differ widely, making it hard to reduce the gradient variance of ensemble models while making the best of each individual model. To tackle this challenge, we design a novel ensemble attack, NAMEA, which for the first time integrates the gradients from the non-attention areas of ensemble models into the iterative gradient optimization process. Our design is inspired by the observation that the attention areas of heterogeneous models vary sharply; the non-attention areas of ViTs are thus likely to be the focus of CNNs, and vice versa. Therefore, we merge the gradients from the attention and non-attention areas of the ensemble models so as to fuse the transfer information of CNNs and ViTs. Specifically, we pioneer a new way of decoupling the gradients of non-attention areas from those of attention areas, while merging gradients by meta-learning. Empirical evaluations on the ImageNet dataset indicate that NAMEA outperforms the state-of-the-art ensemble attacks AdaEA and SMER by an average of 15.0% and 9.6%, respectively. This work is the first attempt to explore the power of ensemble non-attention in boosting cross-architecture transferability, providing new insights into launching ensemble attacks.
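
As a rough illustration of decoupling attention from non-attention gradients, the sketch below merges per-model gradients using a saliency mask. It is not the authors' NAMEA implementation: `attention_mask` is a hypothetical stand-in for a saliency method such as Grad-CAM, and a fixed weight replaces the meta-learned merge described in the abstract.

```python
# Minimal sketch, not the authors' NAMEA implementation.
import torch

def ensemble_grad(models, attention_mask, x, y, loss_fn, w_non=1.0):
    """Merge gradients from attention and non-attention regions of each surrogate."""
    merged = torch.zeros_like(x)
    for model in models:
        x_adv = x.clone().detach().requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        mask = attention_mask(model, x)            # hypothetical saliency mask in [0, 1]
        g_att, g_non = grad * mask, grad * (1.0 - mask)
        # Normalize the two components separately so the (usually weaker)
        # non-attention gradient still contributes; w_non stands in for the
        # meta-learned merge.
        merged = merged + (
            g_att / (g_att.abs().mean() + 1e-12)
            + w_non * g_non / (g_non.abs().mean() + 1e-12)
        )
    return merged / len(models)

# One FGSM-style step with the merged direction:
# x_adv = (x + eps * ensemble_grad(models, attention_mask, x, y, loss_fn).sign()).clamp(0, 1)
```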

AAAI Conference 2026 Conference Paper

LiR3AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

  • Guo Chen
  • Junjie Huang
  • Huaijin Xie
  • Fei Sun
  • Tao Jia

Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. Building on these findings, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR³AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR³AG reduces output-token overhead by 98% on average and inference time by 58.6%, while improving an 8B non-reasoning model's F1 performance by 6.2% to 22.5%, surpassing a 32B reasoning model in RAG and offering a practical and efficient path forward for RAG systems.
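
The sketch below is one hedged reading of the restructuring step: retrieved passages are reranked and laid out as an explicit evidence chain before prompting a non-reasoning generator. The `score` reranker and the prompt template are assumptions for illustration, not the paper's components.

```python
# Minimal sketch, not the LiR3AG implementation: rerank retrieved passages and
# present them as an ordered evidence chain to a non-reasoning model.
def build_evidence_chain(question, passages, score, top_k=5):
    # `score(question, passage)` is a hypothetical reranker, e.g. a cross-encoder.
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)[:top_k]
    steps = [f"Evidence {i + 1}: {p}" for i, p in enumerate(ranked)]
    return (
        "Answer the question by following the evidence chain below step by step.\n"
        + "\n".join(steps)
        + f"\nQuestion: {question}\nAnswer:"
    )
```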

AAAI Conference 2026 Conference Paper

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

  • Xu Yang
  • Jiapeng Zhang
  • Dongyang Zhao
  • Guo Chen
  • Zhuo Tang

The KV cache in self-attention has emerged as a major bottleneck in long-context and large-batch inference for LLMs. Existing approaches often treat sparsity prediction and compression as separate modules—relying on auxiliary index structures to select relevant tokens, and on complex quantization schemes to reduce memory usage. This fragmented design introduces redundant overhead and limits scalability. In this paper, we propose a novel paradigm: treating the compressed key representation not merely as storage, but as a self-indexing structure that directly enables efficient sparse attention. By designing a sign-based 1-bit vector quantization (VQ) scheme, our method unifies compression and retrieval in a single, hardware-friendly format. This approach eliminates the need for external indices or learning-based predictors, offering a lightweight yet robust solution for memory-constrained inference. All components are designed to be hardware-efficient and easy to implement. By implementing custom CUDA kernels, our method integrates seamlessly with FlashAttention, minimizing additional runtime and memory overhead. Experimental results demonstrate that our approach delivers both effectiveness and efficiency.
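
A minimal sketch of the core idea, assuming per-head key vectors cached as a [tokens, dim] tensor: the sign bits of each key serve both as its compressed form and as the index used to pick which tokens receive exact attention. This is a plain PyTorch illustration, not the paper's CUDA kernels or its full quantization scheme.

```python
# Minimal sketch, not the paper's kernels: sign-based 1-bit key compression
# reused as a self-index for sparse attention.
import torch
import torch.nn.functional as F

def compress_keys(K):
    # 1-bit vector quantization: keep only the sign of each key coordinate.
    return torch.sign(K)                              # [T, d], entries in {-1, 0, +1}

def sparse_attention(q, K, V, K_bits, top_k=64):
    proxy = K_bits @ q                                # cheap relevance scores from sign bits only
    idx = torch.topk(proxy, k=min(top_k, K.shape[0])).indices
    attn = F.softmax((K[idx] @ q) / K.shape[1] ** 0.5, dim=0)  # exact attention on selected tokens
    return attn @ V[idx]
```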

NeurIPS Conference 2025 Conference Paper

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

  • Guo Chen
  • Zhiqi Li
  • Shihao Wang
  • Jindong Jiang
  • Yicheng Liu
  • Lidong Lu
  • De-An Huang
  • Wonmin Byeon

We introduce Eagle 2.5, a frontier vision-language model (VLM) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model, Eagle 2.5-8B, achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.

IJCAI Conference 2025 Conference Paper

Egocentric Object-Interaction Anticipation with Retentive and Predictive Learning

  • Guo Chen
  • Yifei Huang
  • Yin-Dong Zheng
  • Yicheng Liu
  • Jiahao Wang
  • Tong Lu

Egocentric object-interaction anticipation is critical for applications like augmented reality and robotics, but existing methods struggle with misaligned egocentric encoding, insufficient supervision, and underutilized historical context. These limitations stem from a lack of focus on retention, i.e., retaining long-term object-centric interactions, and prediction, i.e., future-centric encoding and future uncertainty modeling. We introduce EgoAnticipator, a novel Retentive and Predictive Learning framework that addresses these challenges. Our approach combines retentive pre-training for domain-specific encoding, predictive pre-training for future uncertainty modeling, and mirror distillation to transfer future-informed knowledge. Additionally, we propose long-term memory prompting to integrate historical interaction cues. We evaluate the effectiveness of our framework using the Ego4D short-term object interaction anticipation benchmark, covering both STAv1 and STAv2. Extensive experiments demonstrate that our framework outperforms existing methods, while ablation studies highlight the effectiveness of each design inside our retentive and predictive learning framework.

NeurIPS Conference 2025 Conference Paper

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

  • Yuping He
  • Yifei Huang
  • Guo Chen
  • Baoqi Pei
  • Jilan Xu
  • Tong Lu
  • Jiangmiao Pang

Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.

NeurIPS Conference 2025 Conference Paper

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

  • Baoqi Pei
  • Yifei Huang
  • Jilan Xu
  • Yuping He
  • Guo Chen
  • Fei Wu
  • Jiangmiao Pang
  • Yu Qiao

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand–object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks.

ICLR Conference 2025 Conference Paper

SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios

  • Kai Li 0018
  • Wendi Sang
  • Chang Zeng
  • Runxuan Yang
  • Guo Chen
  • Xiaolin Hu

Systematic evaluation of speech separation and enhancement models under moving sound source conditions requires extensive and diverse data. However, real-world datasets often lack sufficient data for training and evaluation, and synthetic datasets, while larger, lack acoustic realism. Consequently, neither effectively meets practical needs. To address this issue, we introduce SonicSim, a synthetic toolkit based on the embodied AI simulation platform Habitat-sim, designed to generate highly customizable data for moving sound sources. SonicSim supports multi-level adjustments, including scene-level, microphone-level, and source-level, enabling the creation of more diverse synthetic data. Leveraging SonicSim, we constructed a benchmark dataset called SonicSet, utilizing LibriSpeech, Freesound Dataset 50k (FSD50K), Free Music Archive (FMA), and 90 scenes from Matterport3D to evaluate speech separation and enhancement models. Additionally, to investigate the differences between synthetic and real-world data, we selected 5 hours of raw, non-reverberant data from the SonicSet validation set and recorded a real-world speech separation dataset, providing a reference for comparing SonicSet with other synthetic datasets. For speech enhancement, we utilized the real-world dataset RealMAN to validate the acoustic gap between SonicSet and existing synthetic datasets. The results indicate that models trained on SonicSet generalize better to real-world scenarios compared to other synthetic datasets. Code is publicly available at https://cslikai.cn/SonicSim/.

ICLR Conference 2025 Conference Paper

TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

  • Mohan Xu
  • Kai Li
  • Guo Chen
  • Xiaolin Hu

In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: the Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features, while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results showed that models trained on EchoSet generalized better to data collected in the physical world than those trained on other datasets, validating the practical value of EchoSet. On EchoSet and real-world data, TIGER significantly reduces the number of parameters by 94.3% and the MACs by 95.3% while achieving performance surpassing the state-of-the-art (SOTA) model TF-GridNet.
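
To make the band-splitting idea concrete, here is a hedged sketch of compressing unequal frequency bands of an STFT into a common feature size. The band edges and layer sizes are illustrative assumptions, not TIGER's actual configuration.

```python
# Minimal sketch, not the TIGER implementation: band-split compression of a
# complex spectrogram, projecting unequal frequency bands to one feature size.
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    def __init__(self, band_edges=(0, 32, 64, 128, 257), dim=64):
        super().__init__()
        self.band_edges = band_edges
        self.proj = nn.ModuleList([
            nn.Linear(2 * (band_edges[i + 1] - band_edges[i]), dim)
            for i in range(len(band_edges) - 1)
        ])

    def forward(self, spec):
        # spec: complex STFT of shape [batch, freq, time]
        x = torch.view_as_real(spec)                           # [B, F, T, 2]
        bands = []
        for i, proj in enumerate(self.proj):
            lo, hi = self.band_edges[i], self.band_edges[i + 1]
            band = x[:, lo:hi].permute(0, 2, 1, 3).flatten(2)  # [B, T, 2*(hi-lo)]
            bands.append(proj(band))                           # [B, T, dim]
        return torch.stack(bands, dim=1)                       # [B, n_bands, T, dim]

# Example: spec = torch.stft(torch.randn(1, 16000), n_fft=512, return_complex=True)
#          feats = BandSplit()(spec)                           # [1, 4, frames, 64]
```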

AAAI Conference 2024 Conference Paper

AVSegFormer: Audio-Visual Segmentation with Transformer

  • Shengyi Gao
  • Zhe Chen
  • Guo Chen
  • Wenhai Wang
  • Tong Lu

Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a given video, which demands audio-driven pixel-level scene understanding. Existing methods cannot fully and dynamically process the fine-grained correlations between audio and visual cues across various situations. They also face challenges in adapting to complex scenarios, such as evolving audio, the coexistence of multiple objects, and more. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, it comprises a dense audio-visual mixer, which can dynamically adjust the visual features of interest, and a sparse audio-visual decoder, which implicitly separates audio sources and automatically matches optimal visual features. Combining both components provides a more robust bidirectional conditional multi-modal representation, improving segmentation performance in different scenarios. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
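
One plausible reading of the "dense audio-visual mixer" is a cross-attention block in which audio features re-weight flattened visual features. The sketch below illustrates only that pattern under assumed shapes; it is not the AVSegFormer module itself.

```python
# Minimal sketch, not the AVSegFormer implementation: audio-conditioned
# cross-attention that adjusts flattened visual features.
import torch
import torch.nn as nn

class AudioVisualMixer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual: [B, H*W, dim] flattened feature map; audio: [B, T_a, dim]
        mixed, _ = self.cross_attn(query=visual, key=audio, value=audio)
        return self.norm(visual + mixed)          # residual keeps the original features
```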

ICML Conference 2024 Conference Paper

NeuralIndicator: Implicit Surface Reconstruction from Neural Indicator Priors

  • Shi-Sheng Huang
  • Guo Chen
  • Chen Li Heng
  • Hua Huang 0001

The neural implicit surface reconstruction from unorganized points is still challenging, especially when the point clouds are incomplete and/or noisy with complex topology structures. Unlike previous approaches that perform neural implicit surface learning relying on local shape priors, this paper proposes to utilize global shape priors to regularize the neural implicit function learning for more reliable surface reconstruction. To this end, we first introduce a differentiable module to generate a smooth indicator function, which globally encodes both the indicative prior and the local SDFs of the entire input point cloud. Benefiting from this, we propose a new framework, called NeuralIndicator, to jointly learn both the smooth indicator function and the neural implicit function simultaneously, using the global shape prior encoded by the smooth indicator function to effectively regularize the neural implicit function learning, towards reliable and high-fidelity surface reconstruction from unorganized points without any normal information. Extensive evaluations on synthetic and real-scan datasets show that our approach consistently outperforms previous approaches, especially when point clouds are incomplete and/or noisy with complex topology structures.
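
As a rough illustration of using a global indicator to regularize implicit surface learning, the loss sketch below ties the sign of a learned SDF to a smooth inside/outside indicator. Here `sdf_net` and `indicator` are hypothetical callables, and the loss terms are assumptions rather than the paper's formulation.

```python
# Minimal sketch, not the NeuralIndicator implementation: regularize an implicit
# SDF network with a smooth indicator (close to 1 inside the surface, 0 outside).
import torch

def joint_loss(sdf_net, indicator, points, queries, lam=0.1):
    # Points sampled on the input cloud should lie on the zero level set.
    surface_loss = sdf_net(points).abs().mean()
    # Off-surface queries: the SDF sign should agree with the global indicator,
    # i.e. negative where the indicator says "inside" (value > 0.5).
    sdf_q = sdf_net(queries)
    inside = indicator(queries) - 0.5
    prior_loss = torch.relu(sdf_q * torch.sign(inside)).mean()
    return surface_loss + lam * prior_loss
```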

JBHI Journal 2023 Journal Article

Dual-Input Transformer: An End-to-End Model for Preoperative Assessment of Pathological Complete Response to Neoadjuvant Chemotherapy in Breast Cancer Ultrasonography

  • Tong Tong
  • Dongyang Li
  • Jionghui Gu
  • Guo Chen
  • Guotao Bai
  • Xin Yang
  • Kun Wang
  • Tianan Jiang

Neoadjuvant chemotherapy (NAC) is the primary method to reduce the burden of tumor and metastasis; in the treatment of breast cancer, it may provide additional opportunities for breast-conserving surgery. Preoperative assessment of pathological complete response (PCR) to NAC is important for developing individualized treatment approaches and predicting patient prognosis. Compared to magnetic resonance imaging (MRI) and mammography, ultrasonography (US) has the advantages of simplicity, flexibility, and real-time imaging. Moreover, it does not require radiation and can provide multi-time acquisition of the tumor during NAC treatment. Recently, deep learning radiomics models based on multi-time-point US images for the prediction of NAC effectiveness have been proposed. To further improve the prediction performance, we carefully designed four supporting modules for our proposed dual-input transformer (DiT): isolated tokens-to-token patch embedding module, shared position embedding, time embedding, and weighted average pooling feature representation modules. The design of each module considers the characteristics of the US images at multiple time points. We validated our model on our retrospective US dataset composed of 484 cases from two centers whose consistency is not sufficiently high. Patients were allocated to training (n = 297), validation (n = 99), and external test (n = 88) sets. The results show that our model can achieve better performance than the Siamese CNN and the standard tokens-to-token vision transformer without using multi-time-point images. The ablation study also proved the effectiveness of each module designed for DiT.
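
The module names in the abstract suggest a structure along the following lines: two time-point images share one position embedding, receive separate time embeddings, and are pooled with learned weights. The sketch below is an assumption-laden illustration (a plain convolutional patch embedding replaces the isolated tokens-to-token module, and 224x224 inputs are assumed), not the paper's DiT.

```python
# Minimal sketch, not the paper's dual-input transformer (DiT).
import torch
import torch.nn as nn

class DualInputTransformer(nn.Module):
    def __init__(self, dim=256, patches=196, depth=4, heads=8, num_classes=2):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 224x224 -> 196 patches
        self.pos_embed = nn.Parameter(torch.zeros(1, patches, dim))      # shared across time points
        self.time_embed = nn.Parameter(torch.zeros(2, 1, dim))           # pre-/post-treatment scans
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.pool_weight = nn.Linear(dim, 1)                             # weighted average pooling
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img_t0, img_t1):
        tokens = []
        for t, img in enumerate((img_t0, img_t1)):
            x = self.patch_embed(img).flatten(2).transpose(1, 2)         # [B, patches, dim]
            tokens.append(x + self.pos_embed + self.time_embed[t])
        x = self.encoder(torch.cat(tokens, dim=1))                       # joint encoding of both scans
        w = torch.softmax(self.pool_weight(x), dim=1)                    # [B, 2*patches, 1]
        return self.head((w * x).sum(dim=1))                             # PCR / non-PCR logits
```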

AAAI Conference 2022 Conference Paper

DCAN: Improving Temporal Action Detection via Dual Context Aggregation

  • Guo Chen
  • Yin-Dong Zheng
  • Limin Wang
  • Tong Lu

Temporal action detection aims to locate the boundaries of actions in a video. Current methods based on boundary matching enumerate and calculate all possible boundary matchings to generate proposals. However, these methods neglect long-range context aggregation in boundary prediction. At the same time, due to the similar semantics of adjacent matchings, local semantic aggregation of densely generated matchings cannot improve semantic richness and discrimination. In this paper, we propose an end-to-end proposal generation method named Dual Context Aggregation Network (DCAN), which aggregates context on two levels, namely the boundary level and the proposal level, to generate high-quality action proposals and thereby improve the performance of temporal action detection. Specifically, we design Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on the boundary level and precise evaluation of boundaries. For matching evaluation, Coarse-to-fine Matching (CFM) is designed to aggregate context on the proposal level and refine the matching map from coarse to fine. We conduct extensive experiments on ActivityNet v1.3 and THUMOS-14. DCAN obtains an average mAP of 35.39% on ActivityNet v1.3 and reaches an mAP of 54.14% at IoU@0.5 on THUMOS-14, demonstrating that DCAN can generate high-quality proposals and achieve state-of-the-art performance. We release the code at https://github.com/cg1177/DCAN.
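
For context on what enumerating all possible boundary matchings means, the sketch below builds a naive duration-by-start confidence map from per-frame boundary probabilities. It is a generic boundary-matching illustration under assumed inputs, not DCAN's MTCA or CFM modules.

```python
# Minimal sketch, not the DCAN implementation: enumerate start/end matchings
# into a proposal confidence map scored by boundary probabilities.
import torch

def matching_map(start_prob, end_prob, max_duration):
    # start_prob, end_prob: [T] per-frame boundary probabilities.
    T = start_prob.shape[0]
    scores = torch.zeros(max_duration, T)          # scores[d, s] = proposal (s, s + d + 1)
    for d in range(max_duration):
        for s in range(T - d - 1):
            scores[d, s] = start_prob[s] * end_prob[s + d + 1]
    return scores
```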