Arrow Research search

Author name cluster

Qianying Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
1 author row

Possible papers (13)

AAAI Conference 2026 · Conference Paper

ST-SAM: Multimodal Scene Text Segmentation with Dense Visual and Sparse Textual Prompts via SAM

  • Jin Wei
  • Yaqiang Wu
  • Jiayi Yan
  • Zeng Li
  • Zhen Xu
  • Yu Zhou
  • Lingling Zhang
  • Qianying Wang

Scene text segmentation is a critical preprocessing step in various text-based applications. Specialist text segmentation methods, often relying on a detect-then-segment paradigm, tend to exhibit reduced robustness and can lead to cascading errors. The introduction of the Segment Anything Model (SAM) has revolutionized general segmentation by leveraging vision foundation models. However, SAM still falls short when applied to domain-specific tasks such as scene text segmentation. To bridge this gap between SAM and specialized scene text segmentation approaches, we propose ST-SAM (Scene Text SAM), a parameter-efficient fine-tuning framework tailored to adapt SAM for high-quality scene text segmentation without relying on explicit text detection. ST-SAM incorporates a multimodal prompting mechanism: a lightweight visual encoder generates multi-scale spatial features to provide precise visual context, while textual prompts generated by a large language model offer high-level semantic guidance. We demonstrate the advantages of the proposed ST-SAM as follows: (1) ST-SAM achieves new state-of-the-art performance on multiple scene text segmentation benchmarks, including 85.30% fgIoU on Total-Text and 91.03% fgIoU on TextSeg, outperforming both specialist and generalist models. (2) ST-SAM enables effective domain adaptation by flexibly adapting the general SAM architecture to the domain of scene text. (3) By discarding the detect-then-segment pipeline, ST-SAM simplifies the inference process while still achieving robust performance on complex text cases.
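
A minimal sketch of the multimodal prompting idea described above, assuming hypothetical module names and dimensions (the paper's actual encoder and decoder designs are not given in this abstract):

```python
import torch
import torch.nn as nn

class MultimodalPromptEncoder(nn.Module):
    """Toy stand-in for ST-SAM's prompting: dense visual + sparse textual prompts."""
    def __init__(self, dim=256, text_dim=768):
        super().__init__()
        # Lightweight visual branch producing multi-scale-style spatial features.
        self.visual = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        # Project LLM-generated text embeddings into sparse prompt tokens.
        self.text_proj = nn.Linear(text_dim, dim)

    def forward(self, image, text_emb):
        dense = self.visual(image)         # (B, dim, H/4, W/4): dense visual prompt
        sparse = self.text_proj(text_emb)  # (B, T, dim): sparse textual prompt tokens
        return dense, sparse

enc = MultimodalPromptEncoder()
dense, sparse = enc(torch.randn(1, 3, 256, 256), torch.randn(1, 4, 768))
# Both prompt types would then be fed to a SAM-style mask decoder.
```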

NeurIPS Conference 2025 · Conference Paper

Boosting Knowledge Utilization in Multimodal Large Language Models via Adaptive Logits Fusion and Attention Reallocation

  • Wenbin An
  • Jiahao Nie
  • Feng Tian
  • Haonan Lin
  • Mingxiang Cai
  • Yaqiang Wu
  • Qianying Wang
  • Xiaoqin Zhang

Despite their recent progress, Multimodal Large Language Models (MLLMs) often struggle in knowledge-intensive tasks due to the limited and outdated parametric knowledge acquired during training. Multimodal Retrieval Augmented Generation addresses this issue by retrieving contextual knowledge from external databases, thereby enhancing MLLMs with expanded knowledge sources. However, existing MLLMs often fail to fully leverage the retrieved contextual knowledge for response generation. We examine representative MLLMs and identify two major causes, namely, attention bias toward different tokens and knowledge conflicts between parametric and contextual knowledge. To address these causes, we design Adaptive Logits Fusion and Attention Reallocation (ALFAR), a training-free and plug-and-play approach that improves MLLM responses by maximizing the utility of the retrieved knowledge. Specifically, ALFAR tackles the challenges from two perspectives. First, it alleviates attention bias by adaptively shifting attention from visual tokens to relevant context tokens according to query-context relevance. Second, it decouples and weights parametric and contextual knowledge at the output logits, mitigating conflicts between the two types of knowledge. As a plug-and-play method, ALFAR achieves superior performance across diverse datasets without requiring additional training or external tools. Extensive experiments over multiple MLLMs and benchmarks show that ALFAR consistently outperforms the state-of-the-art by large margins. Our code and data are available at https://github.com/Lackel/ALFAR.
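
A hedged sketch of the logits-fusion half of ALFAR as the abstract describes it: decode once with and once without the retrieved context, then weight the contextual contribution by query-context relevance. The weighting rule and alpha_max are illustrative, not the paper's exact formulation:

```python
import torch

def fuse_logits(logits_with_ctx, logits_no_ctx, relevance, alpha_max=1.0):
    """Blend contextual and parametric knowledge at the output logits.
    relevance in [0, 1]: how well the retrieved context matches the query."""
    alpha = alpha_max * relevance
    # Amplify what the context adds over parametric knowledge, scaled by relevance.
    return logits_with_ctx + alpha * (logits_with_ctx - logits_no_ctx)

vocab_size = 32000
fused = fuse_logits(torch.randn(vocab_size), torch.randn(vocab_size), relevance=0.8)
next_token = int(fused.argmax())  # greedy decoding on the fused distribution
```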

NeurIPS Conference 2025 · Conference Paper

Causal-R: A Causal-Reasoning Geometry Problem Solver for Optimized Solution Exploration

  • Wenjun Wu
  • Lingling Zhang
  • Bo Zhao
  • Muye Huang
  • Qianying Wang
  • Jun Liu

The task of geometry problem solving has been a long-standing focus in the automated mathematics community and draws growing attention due to its complexity for both symbolic and neural models. Although prior studies have explored various effective approaches for enhancing problem-solving performance, two fundamental challenges, both essential to application in practical scenarios, remain unaddressed. First, the multi-step reasoning gap between the initial geometric conditions and the ultimate problem goal leads to a large search space for solution exploration. Second, obtaining multiple interpretable and shorter solutions remains an open problem. In this work, we introduce the Causal-Reasoning Geometry Problem Solver to overcome these challenges. Specifically, we propose Causal Graph Reasoning to perform symbolic reasoning before problem solving. Several causal graphs are constructed according to a predefined rule base, where each graph is composed of primitive nodes, causal edges and prerequisite edges. By applying causal graph deduction from the initial conditions, the reachability status of nodes is iteratively propagated along causal edges until the target nodes are reached, yielding feasible causal deduction paths. In this way, the search space of solutions is compressed from the beginning, the end and the intermediate reasoning paths, while ensuring the interpretability and variety of solutions. To achieve this, we further propose Forward Matrix Deduction, which transforms the causal graphs into matrices and vectors and applies matrix operations to iteratively update the status values of reachable nodes. Finally, multiple solutions can be generated by tracing back from the target nodes after validation. Experiments demonstrate the effectiveness of our method in obtaining multiple shorter and interpretable solutions. Code will be available after acceptance.
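
A toy rendering of Forward Matrix Deduction as sketched in the abstract: causal edges become a matrix, and node reachability is propagated by repeated matrix-vector products until it stabilizes. Prerequisite (AND-style) edges are simplified away here, and the node layout is hypothetical:

```python
import numpy as np

# A[i, j] = 1 if node j causally enables node i (OR-semantics only in this toy).
A = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],   # node 1 follows from node 0
    [1, 0, 0, 0],   # node 2 follows from node 0
    [0, 1, 1, 0],   # node 3 (target) follows from node 1 or node 2
])

reach = np.array([1, 0, 0, 0])  # initial conditions: only node 0 is given
for _ in range(A.shape[0]):     # at most |nodes| rounds are needed
    new = ((reach + A @ reach) > 0).astype(int)
    if np.array_equal(new, reach):
        break
    reach = new
print(reach)  # [1 1 1 1]: the target is derivable; paths are recovered by backtracing
```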

AAAI Conference 2025 · Conference Paper

Unleashing the Potential of Model Bias for Generalized Category Discovery

  • Wenbin An
  • Haonan Lin
  • Jiahao Nie
  • Feng Tian
  • Wenkai Shi
  • Yaqiang Wu
  • Qianying Wang
  • Ping Chen

Generalized Category Discovery is a significant and complex task that aims to identify both known and undefined novel categories from a set of unlabeled data, leveraging another labeled dataset containing only known categories. The primary challenges stem from model bias induced by pre-training on only known categories and the lack of precise supervision for novel ones, leading to category bias towards known categories and category confusion among different novel categories, which hinders models' ability to identify novel categories effectively. To address these challenges, we propose a novel framework named Self-Debiasing Calibration (SDC). Unlike prior methods that regard model bias towards known categories as an obstacle to novel category identification, SDC provides a novel insight into unleashing the potential of the bias to facilitate novel category learning. Specifically, we utilize the biased pre-trained model to guide the subsequent learning process on unlabeled data. The output of the biased model serves two key purposes. First, it provides an accurate modeling of category bias, which can be utilized to measure the degree of bias and debias the output of the current training model. Second, it offers valuable insights for distinguishing different novel categories by transferring knowledge between similar categories. Based on these insights, SDC dynamically adjusts the output logits of the current training model using the output of the biased model. This approach produces less biased logits to effectively address the issue of category bias towards known categories, and generates more accurate pseudo labels for unlabeled data, thereby mitigating category confusion for novel categories. Experiments on three benchmark datasets show that SDC outperforms SOTA methods, especially in the identification of novel categories.
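
An illustrative sketch of the debiasing step: subtract a per-category bias estimate, derived from the frozen biased model's average output, from the current model's logits before pseudo-labeling. The exact calibration rule in SDC may differ:

```python
import torch

def debias_logits(logits, biased_mean_logits, strength=1.0):
    """logits: (B, C) from the current model; biased_mean_logits: (C,) average
    output of the frozen pre-trained model over unlabeled data."""
    bias = biased_mean_logits - biased_mean_logits.mean()  # relative per-category preference
    return logits - strength * bias

logits = torch.randn(8, 10)      # 8 unlabeled samples, 10 categories
bias_estimate = torch.randn(10)  # stands in for the biased model's statistics
pseudo_labels = debias_logits(logits, bias_estimate).argmax(dim=-1)
```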

AAAI Conference 2024 · Conference Paper

A Unified Knowledge Transfer Network for Generalized Category Discovery

  • Wenkai Shi
  • Wenbin An
  • Feng Tian
  • Yan Chen
  • Yaqiang Wu
  • Qianying Wang
  • Ping Chen

Generalized Category Discovery (GCD) aims to recognize both known and novel categories in an unlabeled dataset by leveraging another labeled dataset with only known categories. Without considering knowledge transfer from known to novel categories, current methods usually perform poorly on novel categories due to the lack of corresponding supervision. To mitigate this issue, we propose a unified Knowledge Transfer Network (KTN), which solves two obstacles to knowledge transfer in GCD. First, the mixture of known and novel categories in unlabeled data makes it difficult to identify transfer candidates (i.e., samples with novel categories). For this, we propose an entropy-based method that leverages knowledge in the pre-trained classifier to differentiate known and novel categories without requiring extra data or parameters. Second, the lack of prior knowledge of novel categories presents challenges in quantifying semantic relationships between categories to decide the transfer weights. For this, we model different categories with prototypes and treat inter-prototype similarities as transfer weights that quantify the semantic relationships between categories. On the basis of these two treatments, we transfer knowledge from known to novel categories by conducting pre-adjustment of logits and post-adjustment of labels for transfer candidates based on the transfer weights between different categories. With the weighted adjustment, KTN can generate more accurate pseudo-labels for unlabeled data, which helps to learn more discriminative features and boost model performance on novel categories. Extensive experiments show that our method outperforms state-of-the-art models on all evaluation metrics across multiple benchmark datasets. Furthermore, different from previous clustering-based methods that can only work offline with abundant data, KTN can be deployed online conveniently with faster inference speed. Code and data are available at https://github.com/yibai-shi/KTN.
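
A small sketch of the two ingredients the abstract names, under assumed thresholds and shapes: prediction entropy from the pre-trained classifier flags transfer candidates, and prototype similarities act as transfer weights:

```python
import torch
import torch.nn.functional as F

def transfer_candidates(logits, threshold=1.5):
    """High prediction entropy suggests a sample from a novel category."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return entropy > threshold

def transfer_weights(known_protos, novel_protos):
    """(K, d) and (N, d) prototypes -> (N, K) softmax-normalized similarities."""
    sims = F.normalize(novel_protos, dim=-1) @ F.normalize(known_protos, dim=-1).T
    return F.softmax(sims, dim=-1)

mask = transfer_candidates(torch.randn(16, 10))  # which samples to adjust
W = transfer_weights(torch.randn(10, 128), torch.randn(5, 128))
```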

NeurIPS Conference 2024 · Conference Paper

Flipped Classroom: Aligning Teacher Attention with Student in Generalized Category Discovery

  • Haonan Lin
  • Wenbin An
  • Jiahao Wang
  • Yan Chen
  • Feng Tian
  • Mengmeng Wang
  • Guang Dai
  • Qianying Wang

Recent advancements have shown promise in applying traditional Semi-Supervised Learning strategies to the task of Generalized Category Discovery (GCD). Typically, this involves a teacher-student framework in which the teacher imparts knowledge to the student to classify categories, even in the absence of explicit labels. Nevertheless, GCD presents unique challenges, particularly the absence of priors for new classes, which can lead to the teacher's misguidance and unsynchronized learning with the student, culminating in suboptimal outcomes. In our work, we delve into why traditional teacher-student designs falter in generalized category discovery as compared to their success in closed-world semi-supervised learning. We identify inconsistent pattern learning as the crux of this issue and introduce FlipClass—a method that dynamically updates the teacher to align with the student's attention, instead of maintaining a static teacher reference. Our teacher-attention-update strategy refines the teacher's focus based on student feedback, promoting consistent pattern recognition and synchronized learning across old and new classes. Extensive experiments on a spectrum of benchmarks affirm that FlipClass significantly surpasses contemporary GCD methods, establishing new standards for the field.
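
A loose sketch of the teacher-attention-update direction: a loss that, minimized with respect to the teacher, pulls its attention maps toward the student's. The actual FlipClass update rule is not specified in the abstract:

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(teacher_attn, student_attn):
    """teacher_attn, student_attn: (B, heads, T, T) attention maps.
    Gradients flow only into the teacher; the student acts as the reference."""
    return F.mse_loss(teacher_attn, student_attn.detach())

t = torch.randn(2, 8, 16, 16, requires_grad=True)  # stands in for teacher attention
s = torch.randn(2, 8, 16, 16)                      # stands in for student attention
attention_alignment_loss(t, s).backward()          # t.grad now drives the teacher update
```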

NeurIPS Conference 2024 · Conference Paper

Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing

  • Haonan Lin
  • Yan Chen
  • Jiahao Wang
  • Wenbin An
  • Mengmeng Wang
  • Feng Tian
  • Yong Liu
  • Guang Dai

Text-guided diffusion models have significantly advanced image editing, enabling high-quality and diverse modifications driven by text prompts. However, effective editing requires inverting the source image into a latent space, a process often hindered by prediction errors inherent in DDIM inversion. These errors accumulate during the diffusion process, resulting in inferior content preservation and edit fidelity, especially with conditional inputs. We address these challenges by investigating the primary contributors to error accumulation in DDIM inversion and identify the singularity problem in traditional noise schedules as a key issue. To resolve this, we introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing. This schedule reduces noise prediction errors, enabling more faithful editing that preserves the original content of the source image. Our approach requires no additional retraining and is compatible with various existing editing methods. Experiments across eight editing tasks demonstrate the Logistic Schedule's superior performance in content preservation and edit fidelity compared to traditional noise schedules, highlighting its adaptability and effectiveness. The project page is available at https://lonelvino.github.io/SYE/.
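
A sketch of a logistic-shaped noise schedule of the kind the abstract describes: the cumulative signal level follows a logistic curve in t, so it never hits the 0/1 endpoints whose singularities destabilize DDIM inversion. Steepness k and midpoint t0 are illustrative hyperparameters, not the paper's values:

```python
import numpy as np

def logistic_alpha_bar(T=1000, k=10.0, t0=0.5):
    """Cumulative alpha-bar decreasing along a logistic curve, bounded away from 0 and 1."""
    t = np.linspace(0.0, 1.0, T)
    return 1.0 / (1.0 + np.exp(k * (t - t0)))

alpha_bar = logistic_alpha_bar()
betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]  # per-step betas, all strictly in (0, 1)
```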

AAAI Conference 2024 · Conference Paper

Transfer and Alignment Network for Generalized Category Discovery

  • Wenbin An
  • Feng Tian
  • Wenkai Shi
  • Yan Chen
  • Yaqiang Wu
  • Qianying Wang
  • Ping Chen

Generalized Category Discovery (GCD) is a crucial real-world task that aims to recognize both known and novel categories from an unlabeled dataset by leveraging another labeled dataset with only known categories. Despite the improved performance on known categories, current methods perform poorly on novel categories. We attribute the poor performance to two reasons: biased knowledge transfer between labeled and unlabeled data and noisy representation learning on the unlabeled data. The former leads to unreliable estimation of learning targets for novel categories and the latter hinders models from learning discriminative features. To mitigate these two issues, we propose a Transfer and Alignment Network (TAN), which incorporates two knowledge transfer mechanisms to calibrate the biased knowledge and two feature alignment mechanisms to learn discriminative features. Specifically, we model different categories with prototypes and transfer the prototypes in labeled data to correct model bias towards known categories. On the one hand, we pull instances with known categories in unlabeled data closer to these prototypes to form more compact clusters and avoid boundary overlap between known and novel categories. On the other hand, we use these prototypes to calibrate noisy prototypes estimated from unlabeled data based on category similarities, which allows for more accurate estimation of prototypes for novel categories that can be used as reliable learning targets later. After knowledge transfer, we further propose two feature alignment mechanisms to acquire both instance- and category-level knowledge from unlabeled data by aligning instance features with both augmented features and the calibrated prototypes, which can boost model performance on both known and novel categories with less noise. Experiments on three benchmark datasets show that our model outperforms SOTA methods, especially on novel categories. Theoretical analysis is provided for an in-depth understanding of our model in general. Our code and data are available at https://github.com/Lackel/TAN.
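
A minimal sketch of the prototype calibration the abstract mentions, under an assumed mixing rule: noisy novel-category prototypes are blended with labeled prototypes in proportion to category similarity:

```python
import torch
import torch.nn.functional as F

def calibrate_prototypes(novel_protos, known_protos, temp=0.1, mix=0.5):
    """Pull each noisy novel prototype toward similar, reliable known prototypes."""
    sims = F.normalize(novel_protos, dim=-1) @ F.normalize(known_protos, dim=-1).T
    weights = F.softmax(sims / temp, dim=-1)  # (N, K) category similarities
    borrowed = weights @ known_protos         # knowledge transferred from known categories
    return (1 - mix) * novel_protos + mix * borrowed

calibrated = calibrate_prototypes(torch.randn(5, 128), torch.randn(10, 128))
```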

IJCAI Conference 2023 · Conference Paper

Diagram Visual Grounding: Learning to See with Gestalt-Perceptual Attention

  • Xin Hu
  • Lingling Zhang
  • Jun Liu
  • Xinyu Zhang
  • Wenjun Wu
  • Qianying Wang

Diagram visual grounding aims to capture the correlation between a language expression and local objects in the diagram, and plays an important role in applications such as textbook question answering and cross-modal retrieval. Most diagrams consist of a few colors and simple geometries. This results in sparse low-level visual features, which further aggravates the gap between low-level visual and high-level semantic features of diagrams. This phenomenon makes diagram visual grounding challenging. To solve these issues, we propose a gestalt-perceptual attention model to align diagram objects and language expressions. For low-level visual features, inspired by gestalt principles that simulate the human visual system, we build a gestalt-perception graph network to complement the features learned by the traditional backbone network. For high-level semantic features, we design a multi-modal context attention mechanism to facilitate the interaction between diagrams and language expressions, so as to enhance the semantics of diagrams. Finally, guided by diagram features and linguistic embeddings, the target query is gradually decoded to generate the coordinates of the referred object. By conducting comprehensive experiments on diagrams and natural images, we demonstrate that the proposed model achieves superior performance over the competitors. Our code will be released at https://github.com/AIProCode/GPA.
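
A minimal cross-attention sketch of the multi-modal context attention direction: diagram object features attend to language-expression tokens to enrich their semantics. Dimensions are illustrative, and the paper's module is more elaborate:

```python
import torch
import torch.nn as nn

dim = 256
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
diagram_feats = torch.randn(1, 50, dim)  # 50 diagram object/region features
text_feats = torch.randn(1, 12, dim)     # 12 language-expression tokens
# Each diagram feature gathers semantics from the expression it should ground.
enriched, attn_weights = cross_attn(query=diagram_feats, key=text_feats, value=text_feats)
```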

AAAI Conference 2023 · Conference Paper

Generalized Category Discovery with Decoupled Prototypical Network

  • Wenbin An
  • Feng Tian
  • Qinghua Zheng
  • Wei Ding
  • Qianying Wang
  • Ping Chen

Generalized Category Discovery (GCD) aims to recognize both known and novel categories from a set of unlabeled data, based on another dataset labeled with only known categories. Without considering differences between known and novel categories, current methods learn about them in a coupled manner, which can hurt the model's generalization and discriminative ability. Furthermore, the coupled training approach prevents these models from transferring category-specific knowledge explicitly from labeled data to unlabeled data, which can lose high-level semantic information and impair model performance. To mitigate the above limitations, we present a novel model called Decoupled Prototypical Network (DPN). By formulating a bipartite matching problem for category prototypes, DPN can not only decouple known and novel categories to achieve different training targets effectively, but also align known categories in labeled and unlabeled data to transfer category-specific knowledge explicitly and capture high-level semantics. Furthermore, DPN can learn more discriminative features for both known and novel categories through our proposed Semantic-aware Prototypical Learning (SPL). Besides capturing meaningful semantic information, SPL can also alleviate the noise of hard pseudo labels through semantic-weighted soft assignment. Extensive experiments show that DPN outperforms state-of-the-art models by a large margin on all evaluation metrics across multiple benchmark datasets. Code and data are available at https://github.com/Lackel/DPN.
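
The bipartite matching step lends itself to a concrete sketch: match labeled-data prototypes to unlabeled-data prototypes by maximum cosine similarity with the Hungarian algorithm. The similarity choice here is an assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_prototypes(labeled_protos, unlabeled_protos):
    """Align known categories across the labeled and unlabeled prototype sets."""
    a = labeled_protos / np.linalg.norm(labeled_protos, axis=1, keepdims=True)
    b = unlabeled_protos / np.linalg.norm(unlabeled_protos, axis=1, keepdims=True)
    cost = -(a @ b.T)  # maximizing similarity = minimizing negative similarity
    rows, cols = linear_sum_assignment(cost)  # handles rectangular cost matrices
    return list(zip(rows.tolist(), cols.tolist()))

pairs = match_prototypes(np.random.randn(5, 64), np.random.randn(8, 64))
```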

IJCAI Conference 2021 · Conference Paper

MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction

  • Guozhi Tang
  • Lele Xie
  • Lianwen Jin
  • Jiapeng Wang
  • Jingdong Chen
  • Zhen Xu
  • Qianying Wang
  • Yaqiang Wu

The Visual Information Extraction (VIE) task aims to extract key information from multifarious document images (e.g., invoices and purchase receipts). Most previous methods treat the VIE task simply as a sequence labeling or classification problem, which requires models to carefully identify each kind of semantics by introducing multimodal features such as font, color, and layout. But simply introducing multimodal features cannot work well when faced with numeric semantic categories or ambiguous texts. To address this issue, in this paper we propose a novel key-value matching model based on a graph neural network for VIE (MatchVIE). Through key-value matching based on relevancy evaluation, the proposed MatchVIE can bypass the recognition of various semantics and simply focus on the strong relevancy between entities. Besides, we introduce a simple but effective operation, Num2Vec, to tackle the instability of encoded values, which helps the model converge more smoothly. Comprehensive experiments demonstrate that the proposed MatchVIE can significantly outperform previous methods. Notably, to the best of our knowledge, MatchVIE may be the first attempt to tackle the VIE task by modeling the relevancy between keys and values, and it is a good complement to the existing methods.
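
A hedged guess at a Num2Vec-style encoding, based only on the abstract's description: represent a numeric string digit-by-digit in fixed positional slots so that nearby values get nearby embeddings. This is a plausible reading, not the paper's exact operation:

```python
import numpy as np

def num2vec(text, max_len=16):
    """Map the digits of a string to a fixed-length, smoothly varying vector."""
    vec = np.zeros(max_len)
    digits = [c for c in text if c.isdigit()][:max_len]
    for i, d in enumerate(digits):
        vec[i] = (int(d) + 1) / 10.0  # digits mapped into (0, 1]; 0.0 marks an empty slot
    return vec

print(num2vec("Total: $1,024.50"))  # encodes the digit sequence 102450
```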

AAAI Conference 2021 · Conference Paper

Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution

  • Jiapeng Wang
  • Chongyu Liu
  • Lianwen Jin
  • Guozhi Tang
  • Jiaxin Zhang
  • Shuaitao Zhang
  • Qianying Wang
  • Yaqiang Wu

Visual information extraction (VIE) has attracted considerable attention recently owing to its various advanced applications such as document understanding, automatic marking and intelligent education. Most existing works decouple this problem into several independent sub-tasks of text spotting (text detection and recognition) and information extraction, which completely ignores the high correlation among them during optimization. In this paper, we propose a robust visual information extraction system (VIES) towards real-world scenarios, which is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction, taking a single document image as input and outputting the structured information. Specifically, the information extraction branch collects abundant visual and semantic representations from text spotting for multimodal feature fusion and, conversely, provides higher-level semantic clues to contribute to the optimization of text spotting. Moreover, regarding the shortage of public benchmarks, we construct a fully-annotated dataset called EPHOIE (https://github.com/HCIILAB/EPHOIE), which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper heads with complex layouts and backgrounds, including a total of 15,771 Chinese handwritten or printed text instances. Compared with the state-of-the-art methods, our VIES shows significantly superior performance on the EPHOIE dataset and achieves a 9.01% F-score gain on the widely used SROIE dataset under the end-to-end scenario.
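
A very small sketch of the fusion the abstract gestures at: the information-extraction head consumes both visual features and recognized-text embeddings from the spotting branch, one fused vector per text instance. Shapes and the fusion operator are illustrative:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Classify each detected text instance from fused visual + semantic features."""
    def __init__(self, vis_dim=256, txt_dim=256, num_classes=10):
        super().__init__()
        self.classify = nn.Linear(vis_dim + txt_dim, num_classes)

    def forward(self, visual_feats, text_feats):
        return self.classify(torch.cat([visual_feats, text_feats], dim=-1))

head = FusionHead()
logits = head(torch.randn(15, 256), torch.randn(15, 256))  # 15 text instances
```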

AAAI Conference 2020 · Conference Paper

Decoupled Attention Network for Text Recognition

  • Tianwei Wang
  • Yuanzhi Zhu
  • Lianwen Jin
  • Canjie Luo
  • Xiaoxue Chen
  • Yaqiang Wu
  • Qianying Wang
  • Mingxiang Cai

Text recognition has attracted considerable research interest because of its various applications. Cutting-edge text recognition methods are based on attention mechanisms. However, most attention methods suffer from serious alignment problems due to their recurrent alignment operation, in which the alignment relies on historical decoding results. To remedy this issue, we propose a decoupled attention network (DAN), which decouples the alignment operation from historical decoding results. DAN is an effective, flexible and robust end-to-end text recognizer, which consists of three components: 1) a feature encoder that extracts visual features from the input image; 2) a convolutional alignment module that performs the alignment operation based on visual features from the encoder; and 3) a decoupled text decoder that makes the final prediction by jointly using the feature map and attention maps. Experimental results show that DAN achieves state-of-the-art performance on multiple text recognition tasks, including offline handwritten text recognition and regular/irregular scene text recognition. Code will be released.
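
A simplified sketch of the decoupling idea: attention maps come from a convolutional module applied to the visual features alone, with no feedback from previous decoding steps. Channel counts and kernel sizes are illustrative:

```python
import torch
import torch.nn as nn

class ConvAlignment(nn.Module):
    """One attention map per decoding step, computed purely from visual features."""
    def __init__(self, channels=512, max_steps=25):
        super().__init__()
        self.to_maps = nn.Conv2d(channels, max_steps, kernel_size=3, padding=1)

    def forward(self, feats):        # feats: (B, C, H, W)
        maps = self.to_maps(feats)   # (B, max_steps, H, W)
        B, T, H, W = maps.shape
        return torch.softmax(maps.view(B, T, -1), dim=-1).view(B, T, H, W)

align = ConvAlignment()
attn = align(torch.randn(1, 512, 8, 32))  # each of the 25 maps sums to 1 over H*W
# Step-t context: (attn[:, t:t+1] * feats).sum(dim=(2, 3)), consumed by the decoder.
```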