Arrow Research

Author name cluster

Shengeng Tang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI Conference 2026 Conference Paper

Accelerating Controllable Generation via Hybrid-grained Cache

  • Lin Liu
  • Huixia Ben
  • Shuo Wang
  • Jinda Lu
  • Junxiang Qiu
  • Shengeng Tang
  • Yanbin Hao

Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must bear the computational cost of both processing control conditions and generating content, which generally results in low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies of different granularities at different computational stages. Specifically, (1) we use a coarse-grained (block-level) cache based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between consecutive inference steps. (2) We design a fine-grained (prompt-level) cache that acts within a module: it reuses cross-attention maps across consecutive inference steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into every computational stage of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, HGC reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T) while keeping the loss of semantic fidelity (quantified performance degradation) within 1.5%.
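
The coarse-grained, block-level reuse described above can be illustrated with a short sketch. This is a minimal toy, assuming a simple iterative generator; the names (`cached_forward`, `tau`) and the relative-change reuse test are hypothetical stand-ins, not HGC's actual interface.

```python
import numpy as np

def cached_forward(blocks, x, cache, tau=0.05):
    """Run blocks in sequence, bypassing any block whose input barely
    changed since the previous inference step (coarse-grained cache)."""
    for i, block in enumerate(blocks):
        prev_in, prev_out = cache.get(i, (None, None))
        if prev_in is not None and \
           np.linalg.norm(x - prev_in) < tau * (np.linalg.norm(prev_in) + 1e-8):
            x = prev_out                        # reuse cached block output
        else:
            out = block(x)                      # recompute and refresh the cache
            cache[i] = (x.copy(), out.copy())
            x = out
    return x

# Toy usage: the latent drifts slowly between steps, so later steps hit the cache.
rng = np.random.default_rng(0)
blocks = [lambda v, W=rng.standard_normal((8, 8)) * 0.1: v + W @ v for _ in range(2)]
cache, z = {}, rng.standard_normal(8)
for step in range(4):
    out = cached_forward(blocks, z, cache)
    z = z + 1e-4 * rng.standard_normal(8)
```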

AAAI Conference 2026 Conference Paper

LinProVSR: Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition

  • Feng Xue
  • Baochao Zhu
  • Wei Jia
  • Shujie Li
  • Yu Li
  • Jinrui Zhang
  • Shengeng Tang
  • Dan Guo

Visual Speech Recognition (VSR), commonly known as lipreading, enables the recognition of spoken text by analyzing lip visual features. Due to the subtlety of lip movements, its recognition is much harder than other motion recognition tasks. Existing VSR models face the challenge of viseme ambiguity when processing phonemes with similar pronunciations: multiple phonemes share similar viseme features, leading to a notable drop in lipreading accuracy. To address this issue, this study proposes the Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition (LinProVSR) framework. First, an ambiguous sample set is constructed based on linguistic knowledge to provide supervisory signals for the model's training. Then, a Progressive Contrastive Disambiguation Network (PCDN) is designed, which progressively enhances the model's ability to capture the subtle viseme differences corresponding to similar phonemes through viseme-phoneme contrastive disambiguation in the encoding stage and text contrastive disambiguation in the decoding stage. Furthermore, we pioneer the Ambiguous Word Error Rate (AWER) metric, designed specifically to evaluate recognition of phonetically ambiguous text, and verify the effectiveness of the proposed method on multiple public datasets, with especially large gains in distinguishing visually similar phonemes.
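
The AWER idea lends itself to a small sketch: a word error rate whose numerator and denominator are restricted to a set of phonetically ambiguous words. The alignment and counting rules below are assumptions for illustration; the paper's exact AWER definition may differ.

```python
def awer(ref, hyp, ambiguous):
    """Levenshtein-align ref/hyp word lists; count errors only on
    reference words drawn from the ambiguous set."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    i, j, errs = n, m, 0
    total = sum(w in ambiguous for w in ref)     # ambiguous reference words
    while i > 0 and j > 0:
        if ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                  # match
        elif dp[i][j] == dp[i - 1][j - 1] + 1:   # substitution
            errs += ref[i - 1] in ambiguous
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:       # deletion
            errs += ref[i - 1] in ambiguous
            i -= 1
        else:                                    # insertion (not charged to ref)
            j -= 1
    while i > 0:                                 # leftover deletions
        errs += ref[i - 1] in ambiguous
        i -= 1
    return errs / max(total, 1)

print(awer("pat my back".split(), "bat my back".split(), {"pat", "bat", "mat"}))  # 1.0
```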

AAAI Conference 2026 Conference Paper

Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

  • Yu Fei
  • Quan Deng
  • Shengeng Tang
  • Li Yuehua
  • Lechao Cheng

Open-world 3D scene understanding is fundamentally challenging for vision and robotics, due to the constraints of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates vision-language models with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on 3DSSG and Replica benchmarks across four tasks—scene question answering, visual grounding, instance retrieval, and task planning—demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.
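
A minimal sketch of the retrieval-augmented step: scene-graph triplets are embedded into a vector index and matched to a text query by cosine similarity. The toy bag-of-words embedder below stands in for the vision-language encoder; all names are hypothetical.

```python
import numpy as np

VOCAB = {}
def embed(text, dim=64):
    """Toy bag-of-words embedding (collisions possible past `dim` tokens)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[VOCAB.setdefault(tok, len(VOCAB) % dim)] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

triplets = [("mug", "on", "desk"), ("chair", "next to", "desk"), ("lamp", "above", "sofa")]
index = np.stack([embed(" ".join(t)) for t in triplets])   # the "vector database"

scores = index @ embed("what is on the desk")              # cosine similarity
print(triplets[int(np.argmax(scores))])                    # ('mug', 'on', 'desk')
```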

AAAI Conference 2026 Conference Paper

Wi-CBR: Salient-aware Adaptive WiFi Sensing for Cross-domain Behavior Recognition

  • Ruobei Zhang
  • Shengeng Tang
  • Huan Yan
  • Xiang Zhang
  • Jiabao Guo

The challenge in WiFi-based cross-domain behavior recognition lies in the significant interference of domain-specific signals with gesture-induced variations. Previous methods alleviate this interference by mapping the phase from multiple domains into a common feature space. In contrast, using the Doppler Frequency Shift (DFS) signal to dynamically supplement the phase features achieves better generalization: the model can explore a wider feature space while avoiding degradation of gesture semantic information. Specifically, we propose a novel Salient-aware Adaptive WiFi Sensing framework for Cross-domain Behavior Recognition (Wi-CBR), which constructs a dual-branch self-attention module that captures temporal features from phase information reflecting dynamic path-length variations, while extracting kinematic features from DFS correlated with motion velocity. Moreover, we design a Saliency Guidance Module that employs group attention mechanisms to mine critical activity features and utilizes gating mechanisms to optimize information entropy, facilitating feature fusion and enabling effective interaction between salient and non-salient behavioral characteristics. Extensive experiments on two large-scale public datasets (Widar3.0 and XRF55) demonstrate the superior performance of our method in both in-domain and cross-domain scenarios.
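
The gating step can be sketched in a few lines: a sigmoid gate decides, per feature dimension, how much the DFS stream supplements the phase stream. This is an illustrative stand-in assuming single fused vectors; the paper's Saliency Guidance Module uses group attention and entropy-based gating instead.

```python
import numpy as np

def gated_fuse(phase_feat, dfs_feat, Wg, bg):
    """Sigmoid gate over the concatenated features chooses, per dimension,
    the mix between phase (temporal) and DFS (kinematic) information."""
    z = np.concatenate([phase_feat, dfs_feat]) @ Wg + bg
    g = 1.0 / (1.0 + np.exp(-z))
    return g * phase_feat + (1.0 - g) * dfs_feat

rng = np.random.default_rng(0)
d = 16
phase, dfs = rng.standard_normal(d), rng.standard_normal(d)
Wg, bg = rng.standard_normal((2 * d, d)) * 0.1, np.zeros(d)
fused = gated_fuse(phase, dfs, Wg, bg)
```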

TIST Journal 2025 Journal Article

Alleviating Confirmation Bias in Learning with Noisy Labels via Two-Network Collaboration

  • Chenglong Xu
  • Peipei Song
  • Shengeng Tang
  • Dan Guo
  • Xun Yang

Deep neural networks (DNNs) have achieved remarkable success in various computer vision tasks, e.g., image classification. However, most of the existing models depend heavily on annotated data, where label noise is inevitable. Training with such noisy data negatively impacts the generalization performance of DNNs. To this end, recent advances in learning with noisy labels (LNL) adopt the sample selection strategy that identifies clean samples from the noisy dataset to update DNNs, using semi-supervised learning where rejected samples are treated as unlabeled data. However, existing LNL methods often overlook the varying fitting difficulties of different classes, resulting in suboptimal sample selection and confirmation bias, and consequently, the errors accumulate during semi-supervised training. In this article, we propose a novel method, TNCollab, which aims at alleviating confirmation bias in both sample selection and semi-supervised training stages via two-network collaboration. Specifically, we introduce a class-adaptive threshold for sample selection to address the varying fitting difficulties across different classes. Additionally, we construct a hard set consisting of samples where the two networks disagree and introduce a noise-robust loss to extract potentially useful information while maintaining robustness against label noise. Furthermore, we propose a dual consistency loss to ensure consistent predictions between the networks across different augmented views of the same sample, facilitating mutual learning. Extensive experiments demonstrate that TNCollab achieves state-of-the-art performance on image classification and facial expression recognition tasks, particularly on CIFAR-10, CIFAR-100, WebVision, Clothing1M, Tiny-ImageNet, and RAF-DB datasets, showing improved visual understanding and generalization capabilities. Our codes are available at https://github.com/Delete12137/TNCollab.
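
The class-adaptive threshold is easy to illustrate: instead of one global confidence cutoff, each class keeps samples above its own quantile, so hard-to-fit classes are not starved of "clean" samples. The quantile rule below is an assumption for illustration, not TNCollab's exact criterion.

```python
import numpy as np

def select_clean(confidences, labels, num_classes, quantile=0.5):
    """Keep samples whose peer-network confidence exceeds their own
    class's quantile threshold (class-adaptive selection)."""
    keep = np.zeros(len(labels), dtype=bool)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            thr = np.quantile(confidences[mask], quantile)
            keep |= mask & (confidences >= thr)
    return keep

rng = np.random.default_rng(1)
conf = rng.uniform(size=10)
labels = rng.integers(0, 3, size=10)
print(select_clean(conf, labels, num_classes=3))
```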

AAAI Conference 2025 Conference Paper

Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

  • Ziheng Zhou
  • Jinxing Zhou
  • Wei Qian
  • Shengeng Tang
  • Xiaojun Chang
  • Dan Guo

In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.
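
As a rough illustration of the temporal consistency-gated branch, event-relevance scores from one modality can reweight the other modality's timeline. This toy sketch assumes precomputed features and scores; names and shapes are hypothetical.

```python
import numpy as np

def consistency_gate(feats_a, relevance_b):
    """Gate modality-A features (T, D) with modality-B relevance scores (T,):
    time steps the other modality deems event-relevant are emphasized."""
    g = 1.0 / (1.0 + np.exp(-relevance_b))   # sigmoid over time steps
    return feats_a * g[:, None]

rng = np.random.default_rng(2)
audio = rng.standard_normal((6, 4))          # (T, D) audio features
visual_relevance = rng.standard_normal(6)    # per-step scores from the visual stream
gated_audio = consistency_gate(audio, visual_relevance)
```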

ICML Conference 2025 Conference Paper

Knowledge Swapping via Learning and Unlearning

  • Mingyu Xing
  • Lechao Cheng
  • Shengeng Tang
  • Yaxiong Wang
  • Zhun Zhong
  • Meng Wang 0001

We introduce Knowledge Swapping, a novel task designed to selectively regulate the knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By analyzing the knock-on effects across the feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of Learning Before Forgetting. Comprehensive experiments on tasks such as image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at https://github.com/xingmingyu123456/KnowledgeSwapping.
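
The Learning Before Forgetting schedule can be caricatured with toy gradients: new knowledge is acquired first, then the forget-set loss is ascended while a retention loss anchors essential knowledge. Everything below is an illustrative stand-in for real task losses over a pretrained model.

```python
import numpy as np

lr = 0.1
w = np.zeros(3)                          # stand-in for model weights
g_new    = np.array([1.0, 0.0, 0.0])     # gradient of the new-task loss
g_forget = np.array([0.0, 1.0, 0.0])     # gradient of the forget-set loss
g_retain = np.array([0.0, 0.0, 1.0])     # gradient of the retention loss

for _ in range(3):                       # phase 1: learn new knowledge first
    w -= lr * g_new
for _ in range(3):                       # phase 2: forget while retaining
    w -= lr * (-g_forget + g_retain)     # ascend forget loss, descend retain loss
print(w)                                 # [-0.3  0.3 -0.3]
```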

ICML Conference 2025 Conference Paper

Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning

  • Fangwen Wu
  • Lechao Cheng
  • Shengeng Tang
  • Xiaofeng Zhu
  • Chaowei Fang
  • Dingwen Zhang
  • Meng Wang 0001

Class-incremental learning (CIL) seeks to enable a model to sequentially learn new classes while retaining knowledge of previously learned ones. Balancing flexibility and stability remains a significant challenge, particularly when the task ID is unknown. To address this, our study reveals that the gap in feature distribution between novel and existing tasks is primarily driven by differences in mean and covariance moments. Building on this insight, we propose a novel semantic drift calibration method that incorporates mean shift compensation and covariance calibration. Specifically, we calculate each class's mean by averaging its sample embeddings and estimate task shifts using weighted embedding changes based on their proximity to the previous mean, effectively capturing mean shifts for all learned classes with each new task. We also apply a Mahalanobis distance constraint for covariance calibration, aligning class-specific embedding covariances between the old and current networks to mitigate covariance shift. Additionally, we integrate a feature-level self-distillation approach to enhance generalization. Comprehensive experiments on commonly used datasets demonstrate the effectiveness of our approach. The source code is available at https://github.com/fwu11/MACIL.git.
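
Mean-shift compensation as described above translates almost directly into code: each stored class mean is moved by a weighted average of observed embedding changes, weighted by proximity to that class's previous mean. The Gaussian weighting below is an illustrative choice, not necessarily the paper's.

```python
import numpy as np

def compensate_means(old_means, emb_before, emb_after, sigma=1.0):
    """Shift each old class mean by the drift observed in nearby embeddings."""
    delta = emb_after - emb_before                 # (N, D) per-sample drift
    new_means = []
    for mu in old_means:
        d2 = ((emb_before - mu) ** 2).sum(axis=1)  # proximity to the old mean
        w = np.exp(-d2 / (2 * sigma ** 2))
        w = w / (w.sum() + 1e-8)
        new_means.append(mu + (w[:, None] * delta).sum(axis=0))
    return np.stack(new_means)

rng = np.random.default_rng(0)
before = rng.standard_normal((100, 8))
after = before + 0.3                               # every embedding drifts by +0.3
means = rng.standard_normal((5, 8))
print(np.allclose(compensate_means(means, before, after), means + 0.3))  # True
```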

AAAI Conference 2025 Conference Paper

Patch-level Sounding Object Tracking for Audio-Visual Question Answering

  • Zhangbin Li
  • Jinxing Zhou
  • Jing Zhang
  • Shengeng Tang
  • Kun Li
  • Dan Guo

Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual correspondence map. The M-KPT and S-KPT modules are performed in parallel for each temporal segment, allowing balanced tracking of salient and sounding objects. Based on the tracked patches, we further propose a Question-driven KPT (Q-KPT) module to retain patches highly relevant to the question, ensuring the model focuses on the most informative clues. The audio-visual-question features are updated during the processing of these modules, which are then aggregated for final answer prediction. Extensive experiments on standard datasets demonstrate the effectiveness of our method, achieving competitive performance even compared to recent large-scale pretraining-based approaches.
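
The patch-wise motion intensity map can be approximated very simply: mean absolute frame difference pooled over non-overlapping patches. This is a deliberately minimal stand-in for the paper's measure.

```python
import numpy as np

def motion_intensity(frame_prev, frame_next, patch=8):
    """Return an (H/patch, W/patch) map of mean |frame difference| per patch."""
    diff = np.abs(frame_next.astype(float) - frame_prev.astype(float))
    H, W = diff.shape[:2]
    h, w = H // patch, W // patch
    return diff[:h * patch, :w * patch].reshape(h, patch, w, patch, -1).mean(axis=(1, 3, 4))

rng = np.random.default_rng(0)
f0 = rng.integers(0, 255, (64, 64, 3))
f1 = f0.copy()
f1[0:8, 0:8] = 255                                  # motion confined to top-left patch
print(motion_intensity(f0, f1).argmax())            # 0, i.e. the top-left patch
```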

AAAI Conference 2025 Conference Paper

PhysDiff: Physiology-based Dynamicity Disentangled Diffusion Model for Remote Physiological Measurement

  • Wei Qian
  • Gaoji Su
  • Dan Guo
  • Jinxing Zhou
  • Xiaobai Li
  • Bin Hu
  • Shengeng Tang
  • Meng Wang

Recent works on remote PhotoPlethysmoGraphy (rPPG) estimation typically use techniques like CNNs and Transformers to encode implicit features from facial videos for prediction. These methods learn to directly map facial videos to the static values of rPPG signals, overlooking the inherent dynamic characteristics of the rPPG sequence. Moreover, the rPPG signal is extremely weak and highly susceptible to interference from various sources of noise, including illumination conditions, head movements, and variations in skin tone. To address these limitations, we propose a Physiology-based dynamicity disentangled diffusion (PhysDiff) model particularly designed for robust rPPG estimation. PhysDiff leverages the diffusion model to learn the distribution of the quasi-periodic rPPG signal and uses a dynamicity disentanglement strategy to capture two dynamic characteristics of the temporal rPPG signal, i.e., trend and amplitude. This disentanglement is motivated by the underlying dynamic physiological processes of vasodilation and vasoconstriction, ensuring a more precise representation of the rPPG signal. The disentangled components are then used as pivotal conditions in the proposed spatial-temporal hybrid denoiser for rPPG reconstruction. Besides, we introduce a periodicity-based multi-hypothesis selection strategy at inference, which compares the natural periodicity of multiple generated rPPG hypotheses and selects the most favorable one as the final prediction. Extensive experiments on four datasets demonstrate that PhysDiff significantly outperforms prior methods in both intra-dataset and cross-dataset testing.
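
The periodicity-based multi-hypothesis selection suggests a simple sketch: score each generated hypothesis by the height of its strongest non-zero-lag autocorrelation peak and keep the most periodic one. The scoring rule here is an assumption, not PhysDiff's exact criterion.

```python
import numpy as np

def periodicity_score(x, min_lag=10):
    """Strongest autocorrelation beyond a minimum lag (higher = more periodic)."""
    x = (x - x.mean()) / (x.std() + 1e-8)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)
    return ac[min_lag:].max()

t = np.linspace(0, 10, 300)
hypotheses = [
    np.sin(2 * np.pi * 1.2 * t),                                               # clean, periodic
    np.sin(2 * np.pi * 1.2 * t) + np.random.default_rng(0).normal(0, 1, 300),  # noisy
]
best = max(hypotheses, key=periodicity_score)       # selects the clean hypothesis
```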

IJCAI Conference 2025 Conference Paper

Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing

  • Mingce Guo
  • Jingxuan He
  • Yufei Yin
  • Zhangye Wang
  • Shengeng Tang
  • Lechao Cheng

Text-driven video editing powered by generative diffusion models holds significant promise for applications spanning film production, advertising, and beyond. However, the limited expressiveness of pre-trained word embeddings often restricts nuanced edits, especially when targeting novel concepts with specific attributes. In this work, we present a novel Concept-Augmented Textual Inversion (CATI) framework that flexibly integrates new object information from user-provided concept videos. By fine-tuning only the V (Value) projection in attention via Low-Rank Adaptation (LoRA), our approach preserves the original attention distribution of the diffusion model while efficiently incorporating external concept knowledge. To further stabilize editing results and mitigate the issue of attention dispersion when prompt keywords are modified, we introduce a Dual Prior Supervision (DPS) mechanism. DPS supervises cross-attention between the source and target prompts, preventing undesired changes to non-target areas and improving the fidelity of novel concepts. Extensive evaluations demonstrate that our plug-and-play solution not only maintains spatial and temporal consistency but also outperforms state-of-the-art methods in generating lifelike and stable edited videos. The source code is publicly available at https://guomc9.github.io/STIVE-PAGE/.
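
Fine-tuning only the value projection via LoRA is straightforward to sketch: the pretrained weight stays frozen and the trainable update is a low-rank product, zero-initialized so training starts exactly at the pretrained behavior. A minimal numpy illustration with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W_v = rng.standard_normal((d, d))        # frozen pretrained value projection
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-init: update starts as a no-op

def value_proj(x, alpha=1.0):
    """V projection with a LoRA update; only A and B would be trained."""
    return x @ (W_v + alpha * (B @ A)).T

x = rng.standard_normal((10, d))
print(np.allclose(value_proj(x), x @ W_v.T))   # True at initialization
```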

AAAI Conference 2025 Conference Paper

Sign-IDD: Iconicity Disentangled Diffusion for Sign Language Production

  • Shengeng Tang
  • Jiayi He
  • Dan Guo
  • Yanyan Wei
  • Feng Li
  • Richang Hong

Sign Language Production (SLP) aims to generate semantically consistent sign videos from textual statements, where the conversion from textual glosses to sign poses (G2P) is a crucial step. Existing G2P methods typically treat sign poses as discrete three-dimensional coordinates and directly fit them, which overlooks the relative positional relationships among joints. To this end, we provide a new perspective, constraining joint associations and gesture details by modeling the limb bones to improve the accuracy and naturalness of the generated poses. In this work, we propose a pioneering iconicity disentangled diffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD incorporates a novel Iconicity Disentanglement (ID) module to explicitly model the relative positional relationships among joints. The ID module disentangles the conventional 3D joint representation into a 4D bone representation, comprising a 3D spatial direction vector and a 1D spatial distance between adjacent joints. Additionally, an Attribute Controllable Diffusion (ACD) module is introduced to further constrain joint associations, in which an attribute separation layer separates the bone direction and length attributes, and an attribute control layer guides pose generation by leveraging these attributes. The ACD module uses the gloss embeddings as semantic conditions and finally generates sign poses from noise embeddings. Extensive experiments on the PHOENIX14T and USTC-CSL datasets validate the effectiveness of our method.
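
The 3D-joint-to-4D-bone conversion described above maps naturally to code: each bone between adjacent joints becomes a unit direction vector (3D) plus a length (1D). The skeleton edges below are hypothetical.

```python
import numpy as np

def joints_to_bones(joints, edges):
    """joints: (J, 3) coordinates; edges: (parent, child) index pairs.
    Returns one 4D vector per bone: [direction (3D), length (1D)]."""
    bones = []
    for p, c in edges:
        v = joints[c] - joints[p]
        length = np.linalg.norm(v)
        bones.append(np.concatenate([v / (length + 1e-8), [length]]))
    return np.stack(bones)

joints = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 1.5, 0.0]])
edges = [(0, 1), (1, 2)]
print(joints_to_bones(joints, edges))
```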

TIST Journal 2024 Journal Article

Intermediary-Generated Bridge Network for RGB-D Cross-Modal Re-Identification

  • Jingjing Wu
  • Richang Hong
  • Shengeng Tang

RGB-D cross-modal person re-identification (re-id) aims to retrieve the person of interest across RGB and depth image modalities. To cope with the modal discrepancy, some existing methods generate an auxiliary mode using either inherent properties of the input modes or extra deep networks. However, the useful intermediary role of the generated mode is often overlooked in these approaches, leading to insufficient exploitation of crucial bridge knowledge. By contrast, in this article, we propose a novel approach that constructs an intermediary mode through the constraints of self-supervised intermediary learning, which is free from modal prior knowledge and additional module parameters. We then design a bridge network to fully mine the intermediary role of the generated modality by carrying out multi-modal integration and decomposition. On one hand, this network leverages a multi-modal transformer to integrate the information of the three modes by fully exploiting their heterogeneous relations, with the intermediary mode as the bridge, and applies an identification consistency constraint to promote cross-modal associations. On the other hand, it employs circle contrastive learning to decompose the cross-modal constraint process into several subprocedures, which provides an intermediate relay while pulling the two original modalities closer. Experiments on two public datasets demonstrate that the proposed method outperforms the state of the art. The effectiveness of each component is verified through extensive ablation studies, and additional experiments demonstrate the generalization ability of the proposed method.
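
The bridge idea can be caricatured in a few lines: rather than aligning RGB and depth embeddings in one hard step, each is aligned with the generated intermediary, decomposing the cross-modal constraint into two easier hops. The plain distances below stand in for the paper's circle contrastive losses.

```python
import numpy as np

def bridged_alignment_loss(rgb, mid, depth):
    hop1 = np.linalg.norm(rgb - mid)      # RGB <-> intermediary
    hop2 = np.linalg.norm(mid - depth)    # intermediary <-> depth
    return hop1 + hop2

rgb = np.array([1.0, 1.0, 1.0, 1.0])
mid = np.array([0.5, 0.5, 0.5, 0.5])      # generated intermediary embedding
depth = np.zeros(4)
print(bridged_alignment_loss(rgb, mid, depth))   # 2.0: two relay hops of 1.0
```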

IJCAI Conference 2019 Conference Paper

Connectionist Temporal Modeling of Video and Language: a Joint Model for Translation and Sign Labeling

  • Dan Guo
  • Shengeng Tang
  • Meng Wang

Online sign interpretation suffers from challenges presented by hybrid semantics learning among sequential variations of visual representations, sign linguistics, and textual grammars. This paper proposes a Connectionist Temporal Modeling (CTM) network for sentence translation and sign labeling. To acquire short-term temporal correlations, a Temporal Convolution Pyramid (TCP) module is applied to 2D CNN features to realize (2D+1D) 'pseudo-3D' CNN features. CTM aligns the pseudo-3D features with the original 3D CNN clip features and fuses them. Next, we implement a connectionist decoding scheme for long-term sequential learning. Here, we embed dynamic programming into the decoding scheme, which learns the temporal mapping among features, sign labels, and the generated sentence directly. The dynamic-programming solution to sign labeling is treated as pseudo labels. Finally, we utilize these pseudo supervision cues in an end-to-end framework. A joint objective function is designed to measure feature correlation, entropy regularization on sign labeling, and probability maximization on sentence decoding. Experimental results on the RWTH-PHOENIX-Weather and USTC-CSL datasets demonstrate the effectiveness of the proposed approach.
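
The (2D+1D) pseudo-3D construction admits a compact sketch: per-frame 2D CNN features are passed through 1D temporal convolutions of several kernel sizes, a toy stand-in for the Temporal Convolution Pyramid. All names and kernel choices here are illustrative.

```python
import numpy as np

def temporal_conv(feats, kernel):
    """feats: (T, D); depthwise 1D convolution along time with edge padding."""
    k = len(kernel)
    pad = k // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)), mode="edge")
    return sum(kernel[i] * padded[i:i + len(feats)] for i in range(k))

rng = np.random.default_rng(0)
feats2d = rng.standard_normal((12, 32))                       # per-frame 2D features
pyramid = [temporal_conv(feats2d, np.ones(k) / k) for k in (3, 5, 7)]
pseudo3d = np.mean(pyramid, axis=0)                           # fused pyramid levels
```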