Author name cluster

Hang Guo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers

1 author row

AAAI Conference 2026 Conference Paper

Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

Ziran Qin
Youru Lv
Mingbao Lin
Hang Guo
Zeren Zhang
Danping Zou
Weiyao Lin

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality content generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale generation paradigm. We begin with a crucial observation: attention heads in VAR models can be divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads are responsible for preserving spatial coherence. This structural divergence causes existing one-size-fits-all compression methods to perform poorly on VAR models. To address this, we propose HACK, a training-free Head-Aware KV cache Compression frameworK. HACK utilizes an offline classification scheme to separate head types, enabling it to apply pattern-specific compression strategies with asymmetric cache budgets for each category. By doing so, HACK effectively constrains the average KV cache length within a fixed budget B, reducing the theoretical attention complexity from O(n^4) to O(Bn^2). Extensive experiments on multiple VAR models across text-to-image and class-conditional tasks validate the effectiveness and generalizability of HACK. It achieves up to 70% KV cache compression without degrading output quality, resulting in memory savings and faster inference. For example, HACK provides a 1.75× memory reduction and a 1.57× speedup on Infinity-8B.


AAAI Conference 2025 Conference Paper

CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning

Peiyuan Liu
Hang Guo
Tao Dai
Naiqi Li
Jigang Bao
Xudong Ren
Yong Jiang
Shu-Tao Xia

Deep learning (e.g., Transformer) has been widely and successfully used in multivariate time series forecasting (MTSF). Unlike existing methods that focus on training models from a single modal of time series input, large language models (LLMs) based MTSF methods with cross-modal text and time series input have recently shown great superiority, especially with limited temporal data. However, current LLM-based MTSF methods usually focus on adapting and fine-tuning LLMs, while neglecting the distribution discrepancy between textual and temporal input tokens, thus leading to sub-optimal performance. To address this issue, we propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF by reducing the distribution discrepancy between textual and temporal data, which mainly consists of the temporal target branch with temporal input and the textual source branch with aligned textual input. To reduce the distribution discrepancy, we develop the cross-modal match module to first align cross-modal input distributions. Additionally, to minimize the modality distribution gap in both feature and output spaces, feature regularization loss is developed to align the intermediate features between the two branches for better weight updates, while output consistency loss is introduced to allow the output representations of both branches to correspond effectively. Thanks to the modality alignment, CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks with low computational complexity, and exhibits favorable few-shot and zero-shot abilities similar to that in LLMs.


AAAI Conference 2025 Conference Paper

DCSF-KD: Dynamic Channel-wise Spatial Feature Knowledge Distillation for Object Detection

Tao Dai
Yang Lin
Hang Guo
Jinbao Wang
Zexuan Zhu

Knowledge distillation (KD) has recently gained great success in the field of object detection. By transferring the knowledge of the spatial or channel domain from the teacher model to the student model, it allows for a more compact representation with minimal performance loss. Despite this progress, existing KD methods typically treat knowledge from spatial or channel domains independently, ignoring the exploitation of the mutual relationship between these domains. In this work, we first explore the connection between spatial and channel domains and find there exists a strong correlation between them, i.e. the salient channels tend to contain significant object regions in the spatial domain. Motivated by this observation, we propose DCSF-KD, a novel Dynamic Channel-wise Spatial Feature Knowledge Distillation framework for object detection by fully exploiting both spatial and channel knowledge. Specifically, we introduce channel-wise spatial feature distillation and global channel attention distillation, using information from both domains to improve the accuracy of the student network. Experiments demonstrate that our DCSF-KD outperforms existing detection methods on both homogeneous and heterogeneous teacher-student network pairs. For example, when using the MaskRCNN-Swin detector as the teacher, and based on RetinaNet and FCOS with ResNet-50 on MS COCO, our DCSF-KD can achieve 41.9% and 44.1% mAP, respectively.


IJCAI Conference 2025 Conference Paper

DIIN: Diffusion Iterative Implicit Networks for Arbitrary-scale Super-resolution

Tao Dai
Song Wang
Hang Guo
Jianping Wang
Zexuan Zhu

Implicit neural representation (INR) aims to represent continuous domain signals via implicit neural functions and has achieved great success in arbitrary-scale image super-resolution (SR). However, most existing INR-based SR methods focus on learning implicit features from independent coordinates, while neglecting interactions of neighborhood coordinates, thus resulting in limited contextual awareness. In this paper, we rethink the forward process of implicit neural functions as a signal diffusion process and propose a novel Diffusion Iterative Implicit Network (DIIN) for arbitrary-scale SR to promote global signal flow with neighborhood interactions. The DIIN framework mainly consists of stacked Diffusion Iteration Layers with a dictionary cross-attention block to enrich the iterative update process with supplementary information. Besides, we develop the Position-Aware Embedding Block to strengthen spatial dependencies between consecutive input samples. Extensive experiments on public datasets demonstrate that our method achieves state-of-the-art or competitive performance, highlighting its effectiveness and efficiency for arbitrary-scale SR. Our code is available at https://github.com/Song-1205/DIIN.


IJCAI Conference 2025 Conference Paper

Point Cloud Mixture-of-Domain-Experts Model for 3D Self-supervised Learning

Yaohua Zha
Tao Dai
Hang Guo
Yanzi Wang
Bin Chen
Ke Chen
Shu-Tao Xia

Point clouds, as a primary representation of 3D data, can be categorized into scene domain point clouds and object domain point clouds. Point cloud self-supervised learning (SSL) has become a mainstream paradigm for learning 3D representations. However, existing point cloud SSL primarily focuses on learning domain-specific 3D representations within a single domain, neglecting the complementary nature of cross-domain knowledge, which limits the learning of 3D representations. In this paper, we propose to learn a comprehensive Point cloud Mixture-of-Domain-Experts model (Point-MoDE) via a block-to-scene pre-training strategy. Specifically, we first propose a mixture-of-domain-expert model consisting of scene domain experts and multiple shared object domain experts. Furthermore, we propose a block-to-scene pre-training strategy, which leverages the features of point blocks in the object domain to regress their initial positions in the scene domain through object-level block mask reconstruction and scene-level block position regression. By integrating the complementary knowledge between object and scene, this strategy simultaneously facilitates the learning of both object-domain and scene-domain representations, leading to a more comprehensive 3D representation. Extensive experiments in downstream tasks demonstrate the superiority of our model.


IJCAI Conference 2025 Conference Paper

SocialMP: Learning Social Aware Motion Patterns via Additive Fusion for Pedestrian Trajectory Prediction

Tianci Gao
Yuzhen Zhang
Hang Guo
Pei Lv

Accurately capturing social interaction in complex scenarios is essential for the pedestrian trajectory prediction task. The uncertainty in pedestrian interactions and the physical constraints imposed by the environment make this task challenging. To solve this problem, existing methods adopt dimensionality reduction algorithms to capture explainable human motions and behaviors. However, these approaches not only suffer from weak social awareness due to inadequate feature extraction, but also overlook physical constraints, leading to predicted trajectories that often cross unwalkable areas. To overcome these problems, we build an attention-based motion pattern representation, named SocialMP, which can effectively enhance the social awareness and environmental perception of motion patterns. Specifically, our method first characterizes the motion patterns through singular value decomposition and defines a visual field-based rule to model environmental social interaction. Then, an attention-based additive fusion mechanism is designed to enhance social awareness and environment perception of motion patterns. Therein, we integrate social interactions into motion patterns through a cross-attention mechanism to generate latent motion patterns, and feed them into our devised additive fusion structure with backward connection for multiple iterations. Lastly, we design a map loss function by applying an additional penalty to the average displacement error to prevent pedestrians from passing through unwalkable areas. Extensive experiments on ETH-UCY and SDD datasets demonstrate that our SocialMP can not only improve prediction accuracy but also generate plausible trajectories.


NeurIPS Conference 2024 Conference Paper

BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping

Taolin Zhang
Jinpeng Wang
Hang Guo
Tao Dai
Bin Chen
Shu-Tao Xia

Adaptation of pretrained vision-language models such as CLIP to various downstream tasks has attracted great interest in recent research. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TPT necessitate entropy minimization that involves large computational overhead, while training-free methods like TDA overlook the potential for information mining from the test samples themselves. In this paper, we break down the design of existing popular training-required and training-free TTA methods and bridge the gap between them within our framework. Specifically, we maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples. The historical samples are filtered from the testing data stream and serve to extract useful information from the target distribution, while the boosting samples are drawn from regional bootstrapping and capture the knowledge of the test sample itself. We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets, showcasing its applicability in real-world situations.


IJCAI Conference 2024 Conference Paper

FreqFormer: Frequency-aware Transformer for Lightweight Image Super-resolution

Tao Dai
Jianping Wang
Hang Guo
Jinmin Li
Jinbao Wang
Zexuan Zhu

Transformer-based models have been widely and successfully used in various low-vision visual tasks, and have achieved remarkable performance in single image super-resolution (SR). Despite the significant progress in SR, Transformer-based SR methods (e.g., SwinIR) still suffer from the problems of heavy computation cost and low-frequency preference, while ignoring the reconstruction of rich high-frequency information, hence hindering the representational power of Transformers. To address these issues, in this paper, we propose a novel Frequency-aware Transformer (FreqFormer) for lightweight image SR. Specifically, a Frequency Division Module (FDM) is first introduced to separately handle high- and low-frequency information in a divide-and-conquer manner. Moreover, we present a Frequency-aware Transformer Block (FTB) to extract both spatial frequency attention and channel transposed attention to recover high-frequency details. Extensive experimental results on public datasets demonstrate the superiority of our FreqFormer over state-of-the-art SR methods in terms of both quantitative metrics and visual quality. Code and models are available at https://github.com/JPWang-CS/FreqFormer.


NeurIPS Conference 2024 Conference Paper

LCM: Locally Constrained Compact Point Cloud Model for Masked Point Modeling

Yaohua Zha
Naiqi Li
Yanzi Wang
Tao Dai
Hang Guo
Bin Chen
Zhi Wang
Zhihao Ouyang

The pre-trained point cloud model based on Masked Point Modeling (MPM) has exhibited substantial improvements across various tasks. However, these models heavily rely on the Transformer, leading to quadratic complexity and limited decoder, hindering their practical application. To address this limitation, we first conduct a comprehensive analysis of existing Transformer-based MPM, emphasizing the idea that redundancy reduction is crucial for point cloud analysis. To this end, we propose a Locally constrained Compact point cloud Model (LCM) consisting of a locally constrained compact encoder and a locally constrained Mamba-based decoder. Our encoder replaces self-attention with our local aggregation layers to achieve an elegant balance between performance and efficiency. Considering the varying information density between masked and unmasked patches in the decoder inputs of MPM, we introduce a locally constrained Mamba-based decoder. This decoder ensures linear complexity while maximizing the perception of point cloud geometry information from unmasked patches with higher information density. Extensive experimental results show that our compact model significantly surpasses existing Transformer-based models in both performance and efficiency. In particular, our LCM-based Point-MAE model, compared to the Transformer-based model, achieved an improvement of 1.84%, 0.67%, and 0.60% in performance on the three variants of ScanObjectNN while reducing parameters by 88% and computation by 73%. The code is available at https://github.com/zyh16143998882/LCM.


NeurIPS Conference 2024 Conference Paper

Parameter Efficient Adaptation for Image Restoration with Heterogeneous Mixture-of-Experts

Hang Guo
Tao Dai
Yuanchao Bai
Bin Chen
Xudong Ren
Zexuan Zhu
Shu-Tao Xia

Designing single-task image restoration models for specific degradation has seen great success in recent years. To achieve generalized image restoration, all-in-one methods have recently been proposed and shown potential for multiple restoration tasks using one single model. Despite the promising results, the existing all-in-one paradigm still suffers from high computational costs as well as limited generalization on unseen degradations. In this work, we introduce an alternative solution to improve the generalization of image restoration models. Drawing inspiration from recent advancements in Parameter Efficient Transfer Learning (PETL), we aim to tune only a small number of parameters to adapt pre-trained restoration models to various tasks. However, current PETL methods fail to generalize across varied restoration tasks due to their homogeneous representation nature. To this end, we propose AdaptIR, a Mixture-of-Experts (MoE) with orthogonal multi-branch design to capture local spatial, global spatial, and channel representation bases, followed by adaptive base combination to obtain heterogeneous representation for different degradations. Extensive experiments demonstrate that our AdaptIR achieves stable performance on single-degradation tasks, and excels in hybrid-degradation tasks, while training only 0.6% of the parameters for 8 hours.


NeurIPS Conference 2024 Conference Paper

ReFIR: Grounding Large Restoration Models with Retrieval Augmentation

Hang Guo
Tao Dai
Zhihao Ouyang
Taolin Zhang
Yaohua Zha
Bin Chen
Shu-Tao Xia

Recent advances in diffusion-based Large Restoration Models (LRMs) have significantly improved photo-realistic image restoration by leveraging the internal knowledge embedded within model weights. However, existing LRMs often suffer from the hallucination dilemma, i.e., producing incorrect contents or textures when dealing with severe degradations, due to their heavy reliance on limited internal knowledge. In this paper, we propose an orthogonal solution called the Retrieval-augmented Framework for Image Restoration (ReFIR), which incorporates retrieved images as external knowledge to extend the knowledge boundary of existing LRMs in generating details faithful to the original scene. Specifically, we first introduce the nearest neighbor lookup to retrieve content-relevant high-quality images as reference, after which we propose the cross-image injection to modify existing LRMs to utilize high-quality textures from retrieved images. Thanks to the additional external knowledge, our ReFIR can handle the hallucination challenge well and facilitate faithful results. Extensive experiments demonstrate that ReFIR can achieve not only high-fidelity but also realistic restoration results. Importantly, our ReFIR requires no training and is adaptable to various LRMs.


IJCAI Conference 2023 Conference Paper

Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement

Hang Guo
Tao Dai
GuangHao Meng
Shu-Tao Xia

Scene text image super-resolution (STISR), aiming to improve image quality while boosting downstream scene text recognition accuracy, has recently achieved great success. However, most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process, and neglect the disturbance from the complex background, thus limiting the performance. To address these issues, in this paper, we propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution. To model the location of characters effectively, we propose the location enhancement module to extract character region features based on the attention map sequence. Besides, we propose the multi-modal alignment module to perform bidirectional visual-semantic alignment to generate high-quality prior guidance, which is then incorporated into the super-resolution branch in an adaptive manner using the proposed adaptive fusion module. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the superiority of our method over other state-of-the-art methods. Code is available at https://github.com/csguoh/LEMMA.


YNIMG Journal 2021 Journal Article

Age-dependent cross frequency coupling features from children to adults during general anesthesia

Zhenhu Liang
Na Ren
Xin Wen
Haiwen Li
Hang Guo
Yaqun Ma
Zheng Li
Xiaoli Li

BACKGROUND: The frequency coupling characteristics in electroencephalogram (EEG) induced by anesthetics have been well studied in adults, but the investigation of the age-dependent cross frequency coupling features from children to adults is still lacking. METHODS: We analyzed EEG signals recorded from pediatric to adult patients (n = 131), separated into six age groups: <1 year (n = 15), 1-3 years (n = 23), 3-6 years (n = 19), 6-12 years (n = 18), 12-18 years (n = 16), and 18-45 years (n = 40). Age related EEG power and cross frequency coupling analysis (phase amplitude coupling (PAC) and quadratic phase coupling) of data from maintenance of a surgical state of anesthesia (MOSSA) was conducted. Also, for patients of ages less than 6 years, we analyzed the performance of cross frequency coupling derived indices in distinguishing the states of wakefulness, MOSSA, and recovery of consciousness (ROC). RESULTS: (1) During MOSSA, EEG power substantially increased with age from infancy to 3-6 years then decreased with age in the theta-gamma frequency bands. The infant group (<1 year) had the highest slow oscillation (SO) power among all age groups. (2) The distinct PAC pattern is absent in patients less than 1 year of age both in SO-alpha and delta-alpha frequency band coupling during propofol induced unconsciousness. The modulation index between delta and alpha oscillations in MOSSA increased with age. (3) Wavelet bicoherence derived indices reach their peaks in the 3-6 years group and then decrease with age growth. (4) The Diag_En index (normalized entropy of the diagonal bicoherence entries of the bicoherence matrix) performed the best at distinguishing different states for ages less than 6 years (p<0.05). CONCLUSIONS: The combination of propofol induction and sevoflurane maintenance exhibited age-dependent EEG power spectra, PAC, and bicoherence, likely related to brain development. These observations suggest new rules for infant and child brain state monitoring during general anesthesia are needed.


YNICL Journal 2019 Journal Article

Disturbed neurovascular coupling in type 2 diabetes mellitus patients: Evidence from a comprehensive fMRI analysis

Bo Hu
Lin-Feng Yan
Qian Sun
Ying Yu
Jin Zhang
Yu-Jie Dai
Yang Yang
Yu-Chuan Hu

BACKGROUND: Previous studies presumed disturbed neurovascular coupling to be a critical risk factor of cognitive impairments in type 2 diabetes mellitus (T2DM), but distinct clinical manifestations were lacking. Consequently, we decided to investigate the neurovascular coupling in T2DM patients by exploring the MRI relationship between neuronal activity and the corresponding cerebral blood perfusion. METHODS: Degree centrality (DC) map and amplitude of low-frequency fluctuation (ALFF) map were used to represent neuronal activity. Cerebral blood flow (CBF) map was used to represent cerebral blood perfusion. Correlation coefficients were calculated to reflect the relationship between neuronal activity and cerebral blood perfusion. RESULTS: At the whole gray matter level, the manifestation of neurovascular coupling was investigated by using 4 neurovascular biomarkers. We compared these biomarkers and found no significant changes. However, at the brain region level, neurovascular biomarkers in T2DM patients were significantly decreased in 10 brain regions. ALFF-CBF in left hippocampus and fractional ALFF-CBF in left amygdala were positively associated with the executive function, while ALFF-CBF in right fusiform gyrus was negatively related to the executive function. The disease severity was negatively related to the memory and executive function. A longer duration of T2DM was related to milder depression, which suggests T2DM-related depression may not be a physiological condition but a psychological condition. CONCLUSION: Correlations between neuronal activity and cerebral perfusion maps may be a method for detecting neurovascular coupling abnormalities, which could be used for diagnosis in the future. Trial registry number: This study has been registered in ClinicalTrials.gov (NCT02420470) on April 2, 2015 and published on July 29, 2015.
