Arrow Research search

Author name cluster

Xiaoyan Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
1 author row

Possible papers

15

AAAI Conference 2026 Conference Paper

Seeing the Unseen: Zooming in the Dark with Event Cameras

  • Dachun Kai
  • Zeyu Xiao
  • Huyue Zhu
  • Jiaxiao Wang
  • Yueyi Zhang
  • Xiaoyan Sun

This paper addresses low-light video super-resolution (LVSR), aiming to restore high-resolution videos from low-light, low-resolution (LR) inputs. Existing LVSR methods often struggle to recover fine details due to limited contrast and insufficient high-frequency information. To overcome these challenges, we present RetinexEVSR, the first event-driven LVSR framework that leverages high-contrast event signals and Retinex-inspired priors to enhance video quality in low-light scenarios. Unlike previous approaches that directly fuse degraded signals, RetinexEVSR introduces a novel bidirectional cross-modal fusion strategy to extract and integrate meaningful cues from noisy event data and degraded RGB frames. Specifically, an illumination-guided event enhancement module is designed to progressively refine event features using illumination maps derived from the Retinex model, thereby suppressing low-light artifacts while preserving high-contrast details. Furthermore, we propose an event-guided reflectance enhancement module that utilizes the enhanced event features to dynamically recover reflectance details via a multi-scale fusion mechanism. Experimental results show that our RetinexEVSR achieves state-of-the-art performance on three datasets. Notably, on the SDSD benchmark, our method achieves up to a 2.95 dB gain while reducing runtime by 65% compared to prior event-based methods.
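The Retinex prior referenced above is a classical decomposition: an image is modeled as reflectance times illumination. As background only (the paper derives illumination maps inside a learned network, not with this closed-form estimate), a minimal per-pixel sketch is:

```python
# Classical Retinex decomposition: pixel = reflectance * illumination.
# Illustrative background sketch only, not RetinexEVSR's module.

def retinex_decompose(pixel, eps=1e-6):
    """Split one RGB pixel (values in [0, 1]) into illumination and reflectance.

    Illumination is estimated as the max channel value; reflectance is the
    pixel divided by that illumination, so pixel == reflectance * illumination.
    """
    illumination = max(pixel)
    reflectance = [c / (illumination + eps) for c in pixel]
    return illumination, reflectance

# A dark reddish pixel: low illumination, but reflectance keeps the color ratios,
# which is why enhancing illumination and reflectance separately is attractive.
L, R = retinex_decompose([0.2, 0.1, 0.05])
```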

AAAI Conference 2025 Conference Paper

Efficient Event-Based Semantic Segmentation via Exploiting Frame-Event Fusion: A Hybrid Neural Network Approach

  • Hebei Li
  • Yansong Peng
  • Jiahui Yuan
  • Peixi Wu
  • Jin Wang
  • Yueyi Zhang
  • Xiaoyan Sun

Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.
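Channel-selective fusion of two feature streams is commonly implemented as a per-channel sigmoid gate. A minimal sketch of that generic idea (hypothetical gating, not the paper's actual CSF module) is:

```python
import math

def channel_select_fuse(frame_feat, event_feat, gate_logits):
    """Fuse two per-channel feature vectors with a sigmoid gate:
    fused[c] = g[c] * frame[c] + (1 - g[c]) * event[c],
    so each channel chooses how much to trust each modality."""
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    return [g * f + (1.0 - g) * e
            for g, f, e in zip(gates, frame_feat, event_feat)]

# A large positive logit keeps the frame channel;
# a large negative logit keeps the event channel.
fused = channel_select_fuse([1.0, 1.0], [0.0, 0.0], [20.0, -20.0])
```

In a trained network the gate logits would themselves be produced from the features; here they are fixed constants for illustration.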

AAAI Conference 2025 Conference Paper

Event-Enhanced Blurry Video Super-Resolution

  • Dachun Kai
  • Yueyi Zhang
  • Jin Wang
  • Zeyu Xiao
  • Zhiwei Xiong
  • Xiaoyan Sun

In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is 2.59 dB more accurate and 7.28× faster than the recent best BVSR baseline FMA-Net.

AAAI Conference 2025 Conference Paper

Spiking Point Transformer for Point Cloud Classification

  • Peixi Wu
  • Bosong Chai
  • Hebei Li
  • Menghua Zheng
  • Yansong Peng
  • Zeyu Wang
  • Xuan Nie
  • Yueyi Zhang

Spiking Neural Networks (SNNs) offer an attractive and energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their sparse binary activation. When SNNs meet Transformers, they show great potential in 2D image processing; however, their application to 3D point clouds remains underexplored. To this end, we present Spiking Point Transformer (SPT), the first transformer-based SNN framework for point cloud classification. Specifically, we first design Queue-Driven Sampling Direct Encoding for point clouds to reduce computational costs while retaining the most effective support points at each time step. We introduce the Hybrid Dynamics Integrate-and-Fire Neuron (HD-IF), designed to simulate selective neuron activation and reduce over-reliance on specific artificial neurons. SPT attains state-of-the-art results on three benchmark datasets that span both real-world and synthetic datasets in the SNN domain. Meanwhile, the theoretical energy consumption of SPT is at least 6.4× less than its ANN counterpart.
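The HD-IF neuron builds on standard integrate-and-fire dynamics. As background, a minimal leaky integrate-and-fire neuron (textbook dynamics, not the paper's HD-IF) can be sketched as:

```python
def lif_neuron(inputs, tau=0.5, v_th=1.0):
    """Leaky integrate-and-fire: the membrane potential decays by a factor
    tau each step, accumulates the input current, and emits a binary spike
    (with hard reset) when it crosses the threshold v_th.
    Returns the binary spike train."""
    v, spikes = 0.0, []
    for current in inputs:
        v = tau * v + current          # leaky integration
        spike = 1 if v >= v_th else 0  # threshold crossing
        spikes.append(spike)
        if spike:
            v = 0.0                    # reset after firing
    return spikes

# A constant sub-threshold input fires only after enough accumulation;
# this sparse binary output is the source of SNNs' energy efficiency.
train = lif_neuron([0.6, 0.6, 0.6, 0.6])
```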

AAAI Conference 2024 Conference Paper

Image Captioning with Multi-Context Synthetic Data

  • Feipeng Ma
  • Yizhou Zhou
  • Fengyun Rao
  • Yueyi Zhang
  • Xiaoyan Sun

Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g. diffusion models and large language models) have excelled in producing high-quality images and text. This potential can be harnessed to create synthetic image-text pairs for training captioning models. Synthetic data can improve cost and time efficiency in data collection, allow for customization to specific domains, bootstrap generalization capability for zero-shot performance, and circumvent privacy concerns associated with real-world data. However, existing methods struggle to attain satisfactory performance solely through synthetic data. We identify the issue as generated images from simple descriptions mostly capture a solitary perspective with limited context, failing to align with the intricate scenes prevalent in real-world imagery. To tackle this, we present an innovative pipeline that introduces multi-context data generation. Beginning with an initial text corpus, our approach employs a large language model to extract multiple sentences portraying the same scene from diverse viewpoints. These sentences are then condensed into a single sentence with multiple contexts. Subsequently, we generate intricate images using the condensed captions through diffusion models. Our model is exclusively trained on synthetic image-text pairs crafted through this process. The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps.

AAAI Conference 2024 Conference Paper

TMFormer: Token Merging Transformer for Brain Tumor Segmentation with Missing Modalities

  • Zheyu Zhang
  • Gang Yang
  • Yueyi Zhang
  • Huanjing Yue
  • Aiping Liu
  • Yunwei Ou
  • Jian Gong
  • Xiaoyan Sun

Numerous techniques excel in brain tumor segmentation using multi-modal magnetic resonance imaging (MRI) sequences, delivering exceptional results. However, the prevalent absence of modalities in clinical scenarios hampers performance. Current approaches frequently resort to zero maps as substitutes for missing modalities, inadvertently introducing feature bias and redundant computations. To address these issues, we present the Token Merging transFormer (TMFormer) for robust brain tumor segmentation with missing modalities. TMFormer tackles these challenges by extracting and merging accessible modalities into more compact token sequences. The architecture comprises two core components: the Uni-modal Token Merging Block (UMB) and the Multi-modal Token Merging Block (MMB). The UMB enhances individual modality representation by adaptively consolidating spatially redundant tokens within and outside tumor-related regions, thereby refining token sequences for augmented representational capacity. Meanwhile, the MMB mitigates multi-modal feature fusion bias, exclusively leveraging tokens from present modalities and merging them into a unified multi-modal representation to accommodate varying modality combinations. Extensive experimental results on the BraTS 2018 and 2020 datasets demonstrate the superiority and efficacy of TMFormer compared to state-of-the-art methods when dealing with missing modalities.
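Token merging in general works by collapsing redundant tokens into fewer, more informative ones. A minimal sketch of the generic operation (average the most similar pair; not the paper's UMB/MMB blocks) is:

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_most_similar(tokens):
    """Shorten a token sequence by one: find the most similar pair by
    cosine similarity and replace it with its element-wise average."""
    best, pair = -2.0, (0, 1)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if s > best:
                best, pair = s, (i, j)
    i, j = pair
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in pair] + [merged]

# The two near-duplicate tokens are merged; the distinct one survives intact.
toks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out = merge_most_similar(toks)
```

Repeating this operation yields progressively shorter token sequences, which is where the computational savings come from.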

NeurIPS Conference 2024 Conference Paper

Visual Perception by Large Language Model’s Weights

  • Feipeng Ma
  • Hongwei Xue
  • Yizhou Zhou
  • Guangting Wang
  • Fengyun Rao
  • Shilin Yan
  • Yueyi Zhang
  • Siying Wu

Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs) and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. Code and models are released at https://github.com/FeipengMa6/VLoRA.
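The LoRA-style merge that this paradigm relies on can be written as W' = W + A·B, where A·B is a low-rank update. A toy sketch with plain lists (hypothetical shapes, not the released VLoRA code) is:

```python
def matmul(A, B):
    """Plain-list matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def merge_low_rank(W, A, B):
    """Merge a rank-r update into a weight matrix: W + A @ B.
    A is (d x r) and B is (r x d), so the visual information lives in the
    weights and no extra input tokens are needed."""
    delta = matmul(A, B)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen LLM weight matrix (toy d=2)
A = [[1.0], [2.0]]             # rank-1 factors, standing in for the
B = [[0.5, 0.5]]               # output of a perceptual weights generator
W_merged = merge_low_rank(W, A, B)
```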

AAAI Conference 2023 Conference Paper

Better and Faster: Adaptive Event Conversion for Event-Based Object Detection

  • Yansong Peng
  • Yueyi Zhang
  • Peilin Xiao
  • Xiaoyan Sun
  • Feng Wu

Event cameras are bio-inspired imaging sensors that asynchronously collect sparse event streams and offer advantages such as high temporal resolution. In this paper, we focus on building better and faster event-based object detectors. To this end, we first propose a computationally efficient event representation, Hyper Histogram, which adequately preserves both the polarity and temporal information of events. Then we devise an Adaptive Event Conversion module, which converts events into Hyper Histograms according to event density via an adaptive queue. Moreover, we introduce a novel event-based augmentation method, Shadow Mosaic, which significantly improves event sample diversity and enhances the generalization ability of detection models. We equip our proposed modules on three representative object detection models: YOLOv5, Deformable-DETR, and RetinaNet. Experimental results on three event-based detection datasets (1Mpx, Gen1, and MVSEC-NIGHTL21) demonstrate that our proposed approach outperforms other state-of-the-art methods by a large margin, while achieving a much faster running speed (< 14 ms and < 4 ms for 50 ms event data on the 1Mpx and Gen1 datasets).
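A basic event-to-tensor conversion (a plain polarity/time-bin histogram, simpler than the paper's Hyper Histogram but preserving the same two cues) can be sketched as:

```python
def event_histogram(events, n_bins, t_window, h, w):
    """Bin events (t, x, y, polarity) into a [2 * n_bins, h, w] count volume:
    one channel per (polarity, time-bin) pair, so both the sign and the
    coarse timing of each event survive the conversion."""
    hist = [[[0] * w for _ in range(h)] for _ in range(2 * n_bins)]
    for t, x, y, p in events:
        b = min(int(t / t_window * n_bins), n_bins - 1)  # time bin index
        c = b + (n_bins if p > 0 else 0)                 # polarity offset
        hist[c][y][x] += 1
    return hist

# Three toy events over a 1-second window on a 2x2 sensor.
events = [(0.0, 1, 1, 1), (0.6, 1, 1, -1), (0.9, 0, 0, 1)]
hist = event_histogram(events, n_bins=2, t_window=1.0, h=2, w=2)
```

A dense volume like this can then be fed to any frame-based detector backbone.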

NeurIPS Conference 2021 Conference Paper

Dual Progressive Prototype Network for Generalized Zero-Shot Learning

  • Chaoqun Wang
  • Shaobo Min
  • Xuejin Chen
  • Xiaoyan Sun
  • Houqiang Li

Generalized Zero-Shot Learning (GZSL) aims to recognize new categories with auxiliary semantic information, e.g., category attributes. In this paper, we address the critical domain shift problem, i.e., confusion between seen and unseen categories, by progressively improving cross-domain transferability and category discriminability of visual representations. Our approach, named Dual Progressive Prototype Network (DPPN), constructs two types of prototypes that record prototypical visual patterns for attributes and categories, respectively. With attribute prototypes, DPPN alternately searches attribute-related local regions and updates corresponding attribute prototypes to progressively explore accurate attribute-region correspondence. This enables DPPN to produce visual representations with accurate attribute localization ability, which benefits the semantic-visual alignment and representation transferability. Besides, along with progressive attribute localization, DPPN further projects category prototypes into multiple spaces to progressively repel visual representations from different categories, which boosts category discriminability. Both attribute and category prototypes are collaboratively learned in a unified framework, which makes visual representations of DPPN transferable and distinctive. Experiments on four benchmarks prove that DPPN effectively alleviates the domain shift problem in GZSL.

AAAI Conference 2021 Conference Paper

Task-Independent Knowledge Makes for Transferable Representations for Generalized Zero-Shot Learning

  • Chaoqun Wang
  • Xuejin Chen
  • Shaobo Min
  • Xiaoyan Sun
  • Houqiang Li

Generalized Zero-Shot Learning (GZSL) targets recognizing new categories by learning transferable image representations. Existing methods find that, by aligning image representations with corresponding semantic labels, the semantic-aligned representations can be transferred to unseen categories. However, supervised by only seen category labels, the learned semantic knowledge is highly task-specific, which makes image representations biased towards seen categories. In this paper, we propose a novel Dual-Contrastive Embedding Network (DCEN) that simultaneously learns task-specific and task-independent knowledge via semantic alignment and instance discrimination. First, DCEN leverages task labels to cluster representations of the same semantic category by cross-modal contrastive learning and exploring semantic-visual complementarity. Besides task-specific knowledge, DCEN then introduces task-independent knowledge by attracting representations of different views of the same image and repelling representations of different images. Compared to high-level seen category supervision, this instance discrimination supervision encourages DCEN to capture low-level visual knowledge, which is less biased toward seen categories and alleviates the representation bias. Consequently, the task-specific and task-independent knowledge jointly make for transferable representations of DCEN, which obtains an average 4.1% improvement on four public benchmarks.
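Instance-discrimination objectives of this kind are typically InfoNCE-style contrastive losses: attract two views of the same image, repel other images. A minimal generic sketch (not DCEN's exact loss) is:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: pull the anchor toward its positive (another view
    of the same image) and push it away from negatives (other images).
    Uses dot-product similarity and a softmax over [positive] + negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(anchor, positive) / temperature] + \
             [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)                                   # for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))         # -log softmax of positive

# When the positive really matches the anchor the loss is near zero;
# a mismatched positive yields a much larger loss.
low = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
high = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```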

AAAI Conference 2021 Conference Paper

Training Spiking Neural Networks with Accumulated Spiking Flow

  • Hao Wu
  • Yueyi Zhang
  • Wenming Weng
  • Yongting Zhang
  • Zhiwei Xiong
  • Zheng-Jun Zha
  • Xiaoyan Sun
  • Feng Wu

The fast development of neuromorphic hardware has made Spiking Neural Networks (SNNs) a thrilling research avenue. Current SNNs, though highly efficient, are less effective than leading Artificial Neural Networks (ANNs), especially in supervised learning tasks. Recent efforts further demonstrate the potential of SNNs in supervised learning by introducing approximated backpropagation (BP) methods. To deal with the non-differentiable spike function in SNNs, these BP methods utilize information from the spatio-temporal domain to adjust the model parameters. As the time window and network size grow, the computational complexity of spatio-temporal backpropagation increases dramatically. In this paper, we propose a new backpropagation method for SNNs based on the accumulated spiking flow (ASF), i.e., ASF-BP. In the proposed ASF-BP method, updating parameters does not rely on the spike trains of spiking neurons but leverages the accumulated inputs and outputs of spiking neurons over the time window, which reduces the BP complexity significantly. We further present an adaptive linear estimation model to statistically approximate the dynamic characteristics of spiking neurons. Experimental results demonstrate that with our proposed ASF-BP method, lightweight convolutional SNNs achieve superior performance compared with other spike-based BP methods on both non-neuromorphic (MNIST, CIFAR10) and neuromorphic (CIFAR10-DVS) datasets. The code is available at https://github.com/neural-lab/ASF-BP.
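The key quantities in this approach are window-level totals rather than per-timestep spike trains. A sketch of that accumulation for a single leaky integrate-and-fire neuron (an illustration of the idea, not the released ASF-BP code) is:

```python
def accumulated_flow(input_currents, tau=0.5, v_th=1.0):
    """Simulate a leaky integrate-and-fire neuron over a time window and
    return the accumulated input and the accumulated output spike count:
    the window-level totals an ASF-style backward pass works with, instead
    of backpropagating through every individual timestep."""
    v, total_in, total_out = 0.0, 0.0, 0
    for current in input_currents:
        total_in += current            # accumulate input over the window
        v = tau * v + current          # leaky integration
        if v >= v_th:
            total_out += 1             # accumulate output spikes
            v = 0.0                    # reset after firing
    return total_in, total_out

acc_in, acc_out = accumulated_flow([0.6, 0.6, 0.6, 0.6])
```

Because gradients depend only on these two scalars per neuron, the backward cost no longer scales with the length of the time window.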

AAAI Conference 2020 Conference Paper

Posterior-Guided Neural Architecture Search

  • Yizhou Zhou
  • Xiaoyan Sun
  • Chong Luo
  • Zheng-Jun Zha
  • Wenjun Zeng

The emergence of neural architecture search (NAS) has greatly advanced the research on network design. Recent proposals such as gradient-based methods or one-shot approaches significantly boost the efficiency of NAS. In this paper, we formulate the NAS problem from a Bayesian perspective. We propose explicitly estimating the joint posterior distribution over pairs of network architecture and weights. Accordingly, a hybrid network representation is presented which enables us to leverage Variational Dropout so that the approximation of the posterior distribution becomes fully gradient-based and highly efficient. A posterior-guided sampling method is then presented to sample architecture candidates and directly make evaluations. As a Bayesian approach, our posterior-guided NAS (PGNAS) avoids tuning a number of hyper-parameters and enables very effective architecture sampling in posterior probability space. Interestingly, it also leads to a deeper insight into the weight sharing used in one-shot NAS and naturally alleviates the mismatch between the sampled architecture and weights caused by weight sharing. We validate our PGNAS method on the fundamental image classification task. Results on CIFAR-10, CIFAR-100, and ImageNet show that PGNAS achieves a good trade-off between precision and speed of search among NAS methods. For example, it takes 11 GPU days to search a very competitive architecture with 1.98% and 14.28% test errors on CIFAR-10 and CIFAR-100, respectively.
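Posterior-guided sampling can be pictured as drawing one operation per layer from learned categorical probabilities. A toy sketch (hypothetical two-layer search space and probabilities, not PGNAS's actual posterior) is:

```python
import random

def sample_architecture(posterior, rng):
    """Draw one candidate architecture: for each layer, pick an operation
    according to its (approximate) posterior probability, so high-probability
    operations dominate the sampled candidates."""
    return [rng.choices(ops, weights=probs, k=1)[0]
            for ops, probs in posterior]

# Hypothetical search space: two layers, each with learned op probabilities.
posterior = [
    (["conv3x3", "conv5x5", "skip"], [0.7, 0.2, 0.1]),
    (["conv3x3", "maxpool"], [0.4, 0.6]),
]
rng = random.Random(0)   # seeded for reproducibility
arch = sample_architecture(posterior, rng)
```

Each sampled candidate can then be evaluated directly, with the best-scoring architectures kept.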

IJCAI Conference 2019 Conference Paper

Mutually Reinforced Spatio-Temporal Convolutional Tube for Human Action Recognition

  • Haoze Wu
  • Jiawei Liu
  • Zheng-Jun Zha
  • Zhenzhong Chen
  • Xiaoyan Sun

Recent works use 3D convolutional neural networks to explore spatio-temporal information for human action recognition. However, they either ignore the correlation between spatial and temporal features or suffer from the high computational cost of spatio-temporal feature extraction. In this work, we propose a novel and efficient Mutually Reinforced Spatio-Temporal Convolutional Tube (MRST) for human action recognition. It decomposes 3D inputs into spatial and temporal representations, mutually enhances both of them by exploiting the interaction of spatial and temporal information, and selectively emphasizes informative spatial appearance and temporal motion, while reducing structural complexity. Moreover, we design three types of MRSTs according to the order of spatial and temporal information enhancement, each of which contains a spatio-temporal decomposition unit, a mutually reinforced unit, and a spatio-temporal fusion unit. An end-to-end deep network, MRST-Net, is also proposed based on the MRSTs to better explore spatio-temporal information in human actions. Extensive experiments show that MRST-Net outperforms state-of-the-art approaches.
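The savings from decomposing a 3D convolution into spatial and temporal parts can be checked with a parameter count: a full t×d×d kernel versus a 1×d×d spatial kernel followed by a t×1×1 temporal kernel (toy arithmetic for a generic decomposition, not the MRST layer itself):

```python
def conv3d_params(c_in, c_out, t, d):
    """Parameter count of a full 3D convolution with a t x d x d kernel."""
    return c_in * c_out * t * d * d

def decomposed_params(c_in, c_out, t, d):
    """Spatial (1 x d x d) convolution followed by a temporal (t x 1 x 1)
    convolution, keeping c_out channels in between."""
    spatial = c_in * c_out * d * d
    temporal = c_out * c_out * t
    return spatial + temporal

# With 64 channels and a 3x3x3 kernel, the decomposition uses far fewer
# parameters than the full 3D convolution.
full = conv3d_params(64, 64, 3, 3)
split = decomposed_params(64, 64, 3, 3)
```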