Author name cluster

Sheng Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

AAAI 2025 Conference Paper

Enriching Multimodal Sentiment Analysis Through Textual Emotional Descriptions of Visual-Audio Content

  • Sheng Wu
  • Dongxiao He
  • Xiaobao Wang
  • Longbiao Wang
  • Jianwu Dang

Multimodal Sentiment Analysis (MSA) stands as a critical research frontier, seeking to comprehensively unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge, particularly when emotional polarities across various segments appear similar. In this paper, our objective is to spotlight emotion-relevant attributes of audio and visual modalities to facilitate multimodal fusion in the context of nuanced emotional shifts in visual-audio scenarios. To this end, we introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions aimed at accentuating emotional features of visual-audio content. DEVA employs an Emotional Description Generator (EDG) to transmute raw audio and visual data into textualized sentiment descriptions, thereby amplifying their emotional characteristics. These descriptions are then integrated with the source data to yield richer, enhanced features. Furthermore, DEVA incorporates the Text-guided Progressive Fusion Module (TPF), leveraging varying levels of text as a core modality guide. This module progressively fuses visual-audio minor modalities to alleviate disparities between text and visual-audio modalities. Experimental results on widely used sentiment analysis benchmark datasets, including MOSI, MOSEI, and CH-SIMS, underscore significant enhancements compared to state-of-the-art models. Moreover, fine-grained emotion experiments corroborate the robust sensitivity of DEVA to subtle emotional variations.
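
As a rough illustration of the text-guided progressive fusion idea, the sketch below uses text features as attention queries over the audio stream and then the visual stream. The module name, dimensions, and two-stage structure are hypothetical simplifications for intuition, not the authors' TPF implementation.

    import torch
    import torch.nn as nn

    class TextGuidedFusion(nn.Module):
        """Toy text-guided progressive fusion: text features attend to the
        audio stream, then the partially fused result attends to the visual
        stream. All names and dimensions are illustrative, not DEVA's code."""
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, text, audio, visual):
            fused, _ = self.attn_a(text, audio, audio)    # stage 1: text queries audio
            fused = self.norm1(text + fused)
            out, _ = self.attn_v(fused, visual, visual)   # stage 2: fused queries visual
            return self.norm2(fused + out)

    # Shapes (batch, sequence length, feature dim) are assumptions.
    t, a, v = (torch.randn(2, 10, 128) for _ in range(3))
    print(TextGuidedFusion()(t, a, v).shape)  # torch.Size([2, 10, 128])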

NeurIPS 2025 Conference Paper

EventMG: Efficient Multilevel Mamba-Graph Learning for Spatiotemporal Event Representation

  • Sheng Wu
  • Lin Jin
  • Hui Feng
  • Bo Hu

Event cameras offer unique advantages in scenarios involving high speed, low light, and high dynamic range, yet their asynchronous and sparse nature poses significant challenges to efficient spatiotemporal representation learning. Specifically, despite notable progress in the field, effectively modeling the full spatiotemporal context, selectively attending to salient dynamic regions, and robustly adapting to the variable density and dynamic nature of event data remain key challenges. Motivated by these challenges, this paper proposes EventMG, a lightweight, efficient, multilevel Mamba-Graph architecture designed for learning high-quality spatiotemporal event representations. EventMG employs a multilevel approach, jointly modeling information at the micro (single event) and macro (event cluster) levels to comprehensively capture the multi-scale characteristics of event data. At the micro level, it focuses on spatiotemporal details, employing the State Space Model (SSM)-based Mamba to precisely capture long-range dependencies among numerous event nodes. Concurrently, at the macro level, Component Graphs are introduced to efficiently encode the local semantics and global topology of dense event regions. Furthermore, to better accommodate the dynamic and sparse characteristics of event data, we propose the Spatiotemporal-aware Event Scanning Technology (SEST), integrating an Adaptive Perturbation Network (APN) and a Multidirectional Scanning Module (MSM), which substantially enhances the model's ability to perceive and focus on key spatiotemporal patterns. By employing this novel collaborative paradigm, EventMG effectively captures the multi-level spatiotemporal characteristics of event data while maintaining a low parameter count and linear computational complexity, suggesting a promising direction for event representation learning.
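
For intuition about the micro-level modelling, here is a minimal diagonal state-space scan of the kind that underlies Mamba-style models: a hidden state carried across the event sequence at cost linear in its length. It omits Mamba's selective (input-dependent) parameters and discretisation, and is not the EventMG implementation.

    import torch

    def ssm_scan(x, A, B, C):
        """Minimal diagonal state-space recurrence:
        h_t = A * h_{t-1} + B * x_t,  y_t = C * h_t  (elementwise, diagonal A).
        A didactic sketch of the generic SSM scan, not EventMG's Mamba block."""
        batch, length, dim = x.shape
        h = torch.zeros(batch, dim)
        ys = []
        for t in range(length):
            h = A * h + B * x[:, t]   # one step per event: linear in sequence length
            ys.append(C * h)
        return torch.stack(ys, dim=1)

    # Toy input: 4 events with 8-dimensional embeddings (hypothetical sizes).
    x = torch.randn(1, 4, 8)
    A, B, C = torch.rand(8) * 0.9, torch.ones(8), torch.ones(8)
    print(ssm_scan(x, A, B, C).shape)  # torch.Size([1, 4, 8])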

JBHI 2025 Journal Article

SVMB-Net: Local Global Fusion and Multi-Branch Cross-Feature Attention for Skin Lesion Segmentation

  • Yuan Zhao
  • Jinlai Zhang
  • Wujiao He
  • Sheng Wu

Accurate segmentation of skin lesions remains a key challenge for early cancer diagnosis due to complex morphological variations such as irregular shape, heterogeneous texture, and low contrast. To address these limitations, we propose SVMB-Net, a dual-architecture framework that integrates the Swin Transformer and a CNN with the following innovations: first, the Super ViT-CNN (SViT-C) hybrid encoder employs a special global restoration module for extracting high-level semantics, while the dual-branch fusion module combines the CNN's local feature extraction with the Swin Transformer's global context modelling. Second, our multi-branch deep cross-feature attention decoder introduces a multi-scale attention mechanism. Comprehensive evaluations on three clinical datasets show significant improvements: on ISIC2018, SVMB-Net improves $DSC$ by 7.67% to 93.88% and $ACC$ by 2.21% to 96.97% against the current state-of-the-art segmentation method DINOv2. Experiments on ISIC2017 and $PH^{2}$ show an $IoU$ of 83.45% and an $ACC$ of 97.08%, largely outperforming 16 existing methods such as SAM2-UNet and VM-UNet. The architecture provides a powerful solution for automated lesion analysis in real-world clinical settings. Our code will be open sourced at https://github.com/Sleepearlyy/SVMB-Net.git.
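
The dual-branch local-global idea can be sketched as a convolutional branch for local texture plus a self-attention branch for global context, merged by a 1x1 convolution. This is a generic illustration under assumed channel counts and shapes, not the SViT-C encoder itself.

    import torch
    import torch.nn as nn

    class DualBranchFusion(nn.Module):
        """Illustrative local-global fusion: a conv branch captures local
        texture, a self-attention branch captures global context, and a 1x1
        conv merges them. A sketch of the general pattern only."""
        def __init__(self, channels=32, heads=4):
            super().__init__()
            self.local = nn.Conv2d(channels, channels, 3, padding=1)
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
            self.merge = nn.Conv2d(2 * channels, channels, 1)

        def forward(self, x):
            b, c, h, w = x.shape
            local = self.local(x)                         # local branch
            tokens = x.flatten(2).transpose(1, 2)         # (b, h*w, c)
            glob, _ = self.attn(tokens, tokens, tokens)   # global branch
            glob = glob.transpose(1, 2).reshape(b, c, h, w)
            return self.merge(torch.cat([local, glob], dim=1))

    print(DualBranchFusion()(torch.randn(1, 32, 16, 16)).shape)  # (1, 32, 16, 16)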

NeurIPS 2024 Conference Paper

EGSST: Event-based Graph Spatiotemporal Sensitive Transformer for Object Detection

  • Sheng Wu
  • Hang Sheng
  • Hui Feng
  • Bo Hu

Event cameras provide exceptionally high temporal resolution in dynamic vision systems due to their unique event-driven mechanism. However, the sparse and asynchronous nature of event data makes frame-based visual processing methods unsuitable. This study proposes a novel framework, the Event-based Graph Spatiotemporal Sensitive Transformer (EGSST), to exploit the spatial and temporal properties of event data. Firstly, a well-designed graph structure is employed to model event data, which not only preserves the original temporal data but also captures spatial details. Furthermore, inspired by the observation that human eyes pay more attention to objects that produce significant dynamic changes, we design a Spatiotemporal Sensitivity Module (SSM) and an adaptive Temporal Activation Controller (TAC). Through these two modules, our framework mimics the response of the human eye in dynamic environments by selectively activating the temporal attention mechanism based on the relative dynamics of event data, thereby effectively conserving computational resources. In addition, the integration of a lightweight, multi-scale Linear Vision Transformer (LViT) markedly enhances processing efficiency. Our research proposes a fully event-driven approach that effectively exploits the temporal precision of event data and optimises the allocation of computational resources by intelligently distinguishing the dynamics within the event data. The framework provides a lightweight, fast, accurate, and fully event-based solution for object detection tasks in complex dynamic environments, demonstrating significant practicality and potential for application.
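
A common way to realise such a graph structure is a k-nearest-neighbour graph over events in (x, y, t) space, with a scale factor balancing spatial against temporal distance. The construction below is a generic sketch; EGSST's actual graph design (features, connectivity, weighting) follows the paper, not this code.

    import torch

    def event_knn_graph(events, k=4, time_scale=1.0):
        """Connect each event to its k nearest neighbours in (x, y, t) space.
        Generic illustration only; k and time_scale are assumed parameters."""
        coords = events.clone()
        coords[:, 2] *= time_scale            # weight temporal vs. spatial distance
        dists = torch.cdist(coords, coords)   # pairwise Euclidean distances
        dists.fill_diagonal_(float('inf'))    # exclude self-loops
        nbrs = dists.topk(k, largest=False).indices           # (N, k)
        src = torch.arange(len(events)).repeat_interleave(k)
        return torch.stack([src, nbrs.flatten()])             # (2, N*k) edge index

    # Toy events: N rows of (x, y, t); polarity omitted for brevity.
    print(event_knn_graph(torch.rand(10, 3)).shape)  # torch.Size([2, 40])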

ECAI 2023 Conference Paper

Online Privacy Preservation for Camera-Incremental Person Re-Identification

  • Sheng Wu
  • Wenhang Ge
  • Jiong Wu
  • Jingke Meng
  • Huang Zhang

Task-incremental person re-identification aims to train a model with consecutively available cross-camera annotated data in the current task and a small amount of saved data from preceding tasks, which may lead to individual privacy disclosure due to data storage and annotation. In this work, we investigate a more realistic online privacy preservation scenario for camera-incremental person re-identification, where data storage for preceding cameras is not allowed, while data in the current camera are intra-camera annotated online by a pedestrian tracking algorithm without cross-camera annotation. In this setup, the missing data of previous cameras not only result in catastrophic forgetting, as in task-incremental learning, but also make cross-camera association infeasible, which in turn prevents person matching across cameras due to the camera-wise domain gap. To solve these problems, we propose an Online Privacy Preservation (OPP) framework based on exemplars of previous cameras generated by DeepInversion, where the generated exemplars serve as supplements that alleviate forgetting and make cross-camera association feasible for mitigating the camera-wise domain shift, while further improving the cross-camera matching capability. Specifically, we propose to mine underlying cross-camera positive pairs between samples of the current camera and exemplars of previous cameras using similarity cues. Furthermore, we introduce a mixup learning strategy to handle the domain gap with mixed samples and labels. Finally, intra-camera incremental learning and cross-camera incremental learning are aggregated into the OPP framework. Extensive experiments on Re-ID benchmarks validate the superiority of the OPP framework compared with state-of-the-art methods.
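
The mixup learning strategy mentioned above is, in its standard form (Zhang et al., 2018), a convex combination of two samples and their labels. The sketch below applies that generic technique to a current-camera batch and generated exemplars; the paper's exact mixing scheme may differ.

    import torch

    def mixup(x1, y1, x2, y2, alpha=0.2):
        """Standard mixup: lam ~ Beta(alpha, alpha), then convex combinations
        of inputs and (one-hot) labels. Generic technique, not OPP's exact code."""
        lam = torch.distributions.Beta(alpha, alpha).sample()
        return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

    # Toy example: mix current-camera images with generated exemplars.
    x_cur, x_ex = torch.randn(8, 3, 64, 32), torch.randn(8, 3, 64, 32)
    y_cur = torch.eye(10)[torch.randint(10, (8,))]   # hypothetical one-hot IDs
    y_ex = torch.eye(10)[torch.randint(10, (8,))]
    x_mix, y_mix = mixup(x_cur, y_cur, x_ex, y_ex)
    print(x_mix.shape, y_mix.shape)  # torch.Size([8, 3, 64, 32]) torch.Size([8, 10])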