Author name cluster

Qing Gu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

1 author row

AAAI Conference 2026 Conference Paper

On Modality Weighting and Specificity for Multi-Modal Entity Alignment

Yu Xing
Qizhuo Xie
Yunhui Liu
Qing Gu
Tao Zheng
Bin Chong
Tieke He

Multi-modal entity alignment aims to identify equivalent entities across different multi-modal knowledge graphs (MMKGs). While prior work has achieved notable progress through improved multi-modal encoding and cross-modal fusion techniques, two critical challenges remain unresolved. First, due to the heterogeneous and often inconsistent sources from which MMKGs are constructed, the quality and informativeness of modalities vary significantly across entities, leading to the modality weighting problem. Second, existing cross-modal fusion mechanisms predominantly emphasize modality-shared information, often at the expense of modality-specific signals that are also essential for precise alignment. To address these issues, we propose HUMEA, a novel framework that integrates hierarchical Mixture-of-Experts (MoE) with unimodal distillation. HUMEA consists of: (1) A hierarchical MoE module comprising intra-modal and inter-modal experts, which adaptively modulates modality contributions by capturing entity representations at fine-to-coarse semantic granularities. In addition, we introduce a contrastive mutual information loss to enhance expert diversity and reduce redundancy. (2) A unimodal distillation strategy that preserves modality-specific information in the fused representations through single-modality alignment and distillation, achieving a balanced integration of shared and unique modality features. Extensive experiments on two benchmark datasets, FB15K-DB15K and FB15K-YAGO15K, demonstrate state-of-the-art performance, validating the effectiveness of our approach.


AAAI Conference 2026 Conference Paper

RegionMarker: A Region-Triggered Semantic Watermarking Framework for Embedding-as-a-Service Copyright Protection

Shufan Yang
Zifeng Cheng
Zhiwei Jiang
Yafeng Yin
Cong Wang
Shiping Ge
Yuchen Fu
Qing Gu

Embedding-as-a-Service (EaaS) is an effective and convenient deployment solution for addressing various NLP tasks. Nevertheless, recent research has shown that EaaS is vulnerable to model extraction attacks, which could lead to significant economic losses for model providers. For copyright protection, existing methods inject watermark embeddings into text embeddings and use them to detect copyright infringement. However, current watermarking methods often resist only a subset of attacks and fail to provide comprehensive protection. To this end, we present the region-triggered semantic watermarking framework called RegionMarker, which defines trigger regions within a low-dimensional space and injects watermarks into text embeddings associated with these regions. By utilizing a secret dimensionality reduction matrix to project onto this subspace and randomly selecting trigger regions, RegionMarker makes it difficult for watermark removal attacks to evade detection. Furthermore, by embedding watermarks across the entire trigger region and using the text embedding as the watermark, RegionMarker is resilient to both paraphrasing and dimension-perturbation attacks. Extensive experiments on various datasets show that RegionMarker is effective in resisting different attack methods, thereby protecting the copyright of EaaS.


NeurIPS Conference 2025 Conference Paper

Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Cong Wang
Zexuan Deng
Zhiwei Jiang
Yafeng Yin
Fei Shen
Zifeng Cheng
Shiping Ge
Shiwei Gan

Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
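The quantization step named in this abstract can be illustrated with a minimal sketch of finite scalar quantization in general, not SignViP's actual autoencoder. It assumes an odd number of levels per dimension and a tanh bound before rounding; the function names `fsq_quantize`, `codes_to_token`, and `token_to_codes` are hypothetical and do not come from the paper or its repository.

```python
import math

def fsq_quantize(z, levels):
    """Bound each coordinate with tanh, then round to one of `levels[i]` integers.

    Assumption: every level count is odd, so the codes are symmetric around 0.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = num_levels // 2
        codes.append(round(half * math.tanh(value)))
    return codes

def codes_to_token(codes, levels):
    """Pack the per-dimension codes into one integer token (mixed-radix)."""
    token = 0
    for code, num_levels in zip(codes, levels):
        token = token * num_levels + (code + num_levels // 2)
    return token

def token_to_codes(token, levels):
    """Invert codes_to_token, recovering the per-dimension codes."""
    codes = []
    for num_levels in reversed(levels):
        token, digit = divmod(token, num_levels)
        codes.append(digit - num_levels // 2)
    return codes[::-1]

levels = [5, 5, 3]  # hypothetical codebook shape: 5 * 5 * 3 = 75 tokens
codes = fsq_quantize([0.2, -3.0, 1.0], levels)
token = codes_to_token(codes, levels)
```

Because the codebook is implicit in the level counts, no learned codebook lookup is needed: a token decodes back to its codes with plain integer arithmetic.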


AAAI Conference 2025 Conference Paper

Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

Shiping Ge
Qiang Chen
Zhiwei Jiang
Yafeng Yin
Liu Qin
Ziyao Chen
Qing Gu

Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting poses a great challenge in accurately locating the temporal location of event, as the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm by complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates descriptive captions, while the mask generation module generates differentiable positive and negative masks for localizing the events. These masks enable the implicit alignment of event locations and captions by ensuring that captions generated from positively and negatively masked videos are complementary, thereby forming a complete video description. In this way, even under weak supervision, the event location and event caption can be aligned implicitly. Extensive experiments on the public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods.


EAAI Journal 2025 Journal Article

Reading comprehension powered semantic fusion network for identification of N-ary drug combinations

Hua Zhang
Peiqian Zhan
Cheng Yang
Yongjian Yan
Zijing Cai
Guogen Shan
Bo Jiang
Bi Chen

The concurrent use of multiple medications to treat one or more diseases is prevalent. Identifying N-ary drug combinations from biomedical texts aids in uncovering significant pharmacological effects triggered by drug-drug interactions. Previous methods for this emerging task have primarily concentrated on representing drug entities using pre-trained language models, overlooking the comprehensive extraction of contextual and task-specific semantic information. To address these limitations, we develop a semantic fusion method grounded in a machine reading comprehension (MRC) framework. Our model, termed Reading Comprehension powered semantic Fusion network for Identification of N-ary Drug combinations (RCFIND), first constructs relevant contexts and queries for each individual drug combination. Then, diverse information sources, including task-specific semantics, drug entity representations and contextual details, are fused by using a simplified Capsule network as well as incorporating contrastive learning. We assess RCFIND, achieving F1 scores ranging from 72.0% to 83.3% across four types of evaluations. Experimental results demonstrate significant performance enhancements over existing baselines, with at least a 5% F1 score improvement. Ablation studies and further analysis confirm the efficacy of the MRC framework and contrastive learning in accurately identifying N-ary drug combinations.


NeurIPS Conference 2025 Conference Paper

Steering When Necessary: Flexible Steering Large Language Models with Backtracking

Zifeng Cheng
Jinwei Gan
Zhiwei Jiang
Cong Wang
Yafeng Yin
Xiang Luo
Yuchen Fu
Qing Gu

Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically intervene indiscriminately in all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLMs during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose the backtracking mechanism to correct the deviated tokens and steer the LLMs toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.


NeurIPS Conference 2024 Conference Paper

AP-Adapter: Improving Generalization of Automatic Prompts on Unseen Text-to-Image Diffusion Models

Yuchen Fu
Zhiwei Jiang
Yuliang Liu
Cong Wang
Zexuan Deng
Zhaoling Chen
Qing Gu

Recent advancements in Automatic Prompt Optimization (APO) for text-to-image generation have streamlined user input while ensuring high-quality image output. However, most APO methods are trained assuming a fixed text-to-image model, which is impractical given the emergence of new models. To address this, we propose a novel task, model-generalized automatic prompt optimization (MGAPO), which trains APO methods on a set of known models to enable generalization to unseen models during testing. MGAPO presents significant challenges. First, we experimentally confirm the suboptimal performance of existing APO methods on unseen models. We then introduce a two-stage prompt optimization method, AP-Adapter. In the first stage, a large language model is used to rewrite the prompts. In the second stage, we propose a novel method to construct an enhanced representation space by leveraging inter-model differences. This space captures the characteristics of multiple domain models, storing them as domain prototypes. These prototypes serve as anchors to adjust prompt representations, enabling generalization to unseen models. The optimized prompt representations are subsequently used to generate conditional representations for controllable image generation. We curate a multi-modal, multi-model dataset that includes multiple diffusion models and their corresponding text-image data, and conduct experiments under a model generalization setting. The experimental results demonstrate the AP-Adapter's ability to enable the automatic prompts to generalize well to previously unseen diffusion models, generating high-quality images.


AAAI Conference 2023 Conference Paper

Controlling Class Layout for Deep Ordinal Classification via Constrained Proxies Learning

Cong Wang
Zhiwei Jiang
Yafeng Yin
Zifeng Cheng
Shiping Ge
Qing Gu

For deep ordinal classification, learning a well-structured feature space specific to ordinal classification is helpful to properly capture the ordinal nature among classes. Intuitively, when Euclidean distance metric is used, an ideal ordinal layout in feature space would be that the sample clusters are arranged in class order along a straight line in space. However, enforcing samples to conform to a specific layout in the feature space is a challenging problem. To address this problem, in this paper, we propose a novel Constrained Proxies Learning (CPL) method, which can learn a proxy for each ordinal class and then adjusts the global layout of classes by constraining these proxies. Specifically, we propose two kinds of strategies: hard layout constraint and soft layout constraint. The hard layout constraint is realized by directly controlling the generation of proxies to force them to be placed in a strict linear layout or semicircular layout (i.e., two instantiations of strict ordinal layout). The soft layout constraint is realized by constraining that the proxy layout should always produce unimodal proxy-to-proxies similarity distribution for each proxy (i.e., to be a relaxed ordinal layout). Experiments show that the proposed CPL method outperforms previous deep ordinal classification methods under the same setting of feature extractor.
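The two layout ideas in this abstract can be pictured with a small sketch. It is not the CPL training procedure: it only assumes fixed proxies spaced evenly along one direction (a strict linear layout), nearest-proxy prediction under Euclidean distance, and a plain check that a similarity profile is unimodal. The names `linear_proxies`, `predict`, and `is_unimodal` are hypothetical.

```python
import math

def linear_proxies(num_classes, direction, spacing=1.0):
    """Place one proxy per ordinal class at equal steps along a straight line."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [[k * spacing * u for u in unit] for k in range(num_classes)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(feature, proxies):
    """Assign the class whose proxy is nearest to the feature vector."""
    return min(range(len(proxies)), key=lambda k: sq_dist(feature, proxies[k]))

def is_unimodal(values):
    """True if the sequence rises to its first maximum and then never rises again."""
    peak = values.index(max(values))
    rising = all(values[i] <= values[i + 1] for i in range(peak))
    falling = all(values[i] >= values[i + 1] for i in range(peak, len(values) - 1))
    return rising and falling

proxies = linear_proxies(4, direction=[3.0, 4.0])
label = predict([1.3, 1.5], proxies)
# Similarity of proxy 1 to every proxy, using negative squared distance.
profile = [-sq_dist(proxies[1], p) for p in proxies]
```

In a strict linear layout every proxy's similarity profile is unimodal by construction, which is the property the soft constraint asks for without fixing the geometry.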


IJCAI Conference 2021 Conference Paper

Modelling General Properties of Nouns by Selectively Averaging Contextualised Embeddings

Na Li
Zied Bouraoui
Jose Camacho-Collados
Luis Espinosa-Anke
Qing Gu
Steven Schockaert

While the success of pre-trained language models has largely eliminated the need for high-quality static word vectors in many NLP applications, static word vectors continue to play an important role in tasks where word meaning needs to be modelled in the absence of linguistic context. In this paper, we explore how the contextualised embeddings predicted by BERT can be used to produce high-quality word vectors for such domains, in particular related to knowledge base completion, where our focus is on capturing the semantic properties of nouns. We find that a simple strategy of averaging the contextualised embeddings of masked word mentions leads to vectors that outperform the static word vectors learned by BERT, as well as those from standard word embedding models, in property induction tasks. We notice in particular that masking target words is critical to achieve this strong performance, as the resulting vectors focus less on idiosyncratic properties and more on general semantic properties. Inspired by this view, we propose a filtering strategy which is aimed at removing the most idiosyncratic mention vectors, allowing us to obtain further performance gains in property induction.
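The averaging-then-filtering idea can be sketched as follows, assuming the mention vectors for a noun have already been extracted from masked sentences. The abstract does not state the filtering criterion, so this sketch substitutes a simple stand-in: drop the mentions least similar to the centroid, then average the rest. The names `mean_vector`, `cosine`, and `filtered_average`, and the `keep_fraction` parameter, are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filtered_average(mention_vectors, keep_fraction=0.75):
    """Average the mentions closest to the centroid (stand-in filtering rule)."""
    centroid = mean_vector(mention_vectors)
    ranked = sorted(mention_vectors, key=lambda v: cosine(v, centroid), reverse=True)
    keep = max(1, int(round(len(ranked) * keep_fraction)))
    return mean_vector(ranked[:keep])

# Toy 2-d mention vectors: three agree, one points the other way.
mentions = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-1.0, 0.0]]
plain = mean_vector(mentions)
filtered = filtered_average(mentions, keep_fraction=0.75)
```

In this toy input the outlier pulls the plain average toward the origin, while the filtered average stays close to the three mentions that agree.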
