Arrow Research

Author name cluster

Zhineng Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
1 author row

Possible papers (12)

AAAI Conference 2026 Conference Paper

Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline

  • Weikang Bai
  • Yongkun Du
  • Yuchen Su
  • Yazhen Xie
  • Zhineng Chen

Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M, two large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond the LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.
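The abstract stays at the architecture level, so as a rough sketch of the encoder-decoder recipe it describes, here is a minimal PyTorch-style skeleton; every module choice and size below is an assumption for illustration, not CMERNet's actual design:

```python
import torch
import torch.nn as nn

class TinyMERModel(nn.Module):
    """Minimal encoder-decoder sketch for expression recognition.

    Purely illustrative: a patch encoder plus a Transformer decoder that
    emits expression tokens autoregressively. CMERNet's real tokenizer and
    Structured Mathematical Language are not reproduced here.
    """
    def __init__(self, vocab_size=512, d_model=256, nhead=8, depth=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), depth)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), depth)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 3, H, W); tgt_tokens: (B, T) shifted target sequence
        feats = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        memory = self.encoder(feats)
        tgt = self.tok_embed(tgt_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)  # (B, T, vocab_size) next-token logits
```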

AAAI Conference 2026 Conference Paper

MDiff4STR: Mask Diffusion Model for Scene Text Recognition

  • Yongkun Du
  • Miaomiao Zhao
  • Songlin Fan
  • Zhineng Chen
  • Caiyan Jia
  • Yu-Gang Jiang

Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that the vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: a noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.
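For readers unfamiliar with how MDMs decode in a few steps, below is a minimal sketch of MaskGIT-style parallel unmasking; `model` and its interface are hypothetical, and MDiff4STR's actual noising strategies and token-replacement mechanism are not reproduced:

```python
import torch

@torch.no_grad()
def mask_diffusion_decode(model, img_feats, seq_len, mask_id, steps=3):
    """Iterative parallel decoding: start fully masked, commit the most
    confident predictions each step, re-predict the rest."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(img_feats, tokens)            # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)      # (1, seq_len) each
        masked = tokens.eq(mask_id)
        n_left = int(masked.sum())
        if n_left == 0:
            break
        conf = conf.masked_fill(~masked, -1.0)       # only commit masked slots
        # commit all remaining tokens on the final step
        n_keep = n_left if step == steps - 1 else max(1, n_left // (steps - step))
        idx = conf.topk(n_keep, dim=-1).indices      # most confident positions
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```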

AAAI Conference 2026 Conference Paper

SSR-SAM: Retrieval-Style Segment Anything Model for Semi-Supervised Ultra-High-Resolution Image Segmentation

  • Shijie Li
  • Yiming Chen
  • Zhineng Chen
  • Kai Hu
  • Xieping Gao

Accurate segmentation of ultra-high-resolution (UHR) images, which often exceed tens of millions of pixels, is critically important in domains such as remote sensing and biomedical imaging. However, acquiring pixel-level annotations for such high-resolution images is prohibitively expensive and labor-intensive. While semi-supervised semantic segmentation can significantly reduce the annotation burden, extending it to UHR images must contend with the unique challenges posed by sparse supervision. To this end, we propose SSR-SAM, a retrieval-style semi-supervised segmentation framework tailored for UHR images. Leveraging the promptable paradigm of the Segment Anything Model (SAM), SSR-SAM treats locally annotated regions as prompts to retrieve semantically consistent pixels across the entire image. Building upon this retrieval-style segmentation paradigm, we further introduce prompt-level perturbation, a novel avenue for deploying consistency regularization in semi-supervised segmentation. It encourages the model to learn consistency across predictions guided by diverse visual-semantic prompts, thereby enhancing generalization on unlabeled data. We evaluate SSR-SAM on three UHR datasets: Inria Aerial, BCSS, and URUR. Experimental results show that SSR-SAM achieves clear performance gains over labeled-only supervision, with average mIoU improvements of 4.9%, 4.15%, and 2.5%, respectively. Additionally, SSR-SAM possesses zero-shot segmentation capability, exhibiting potential for general retrieval-style segmentation tasks.
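A minimal sketch of what prompt-level perturbation consistency could look like, assuming a hypothetical promptable segmenter `model(image, prompts)`; the real SSR-SAM interfaces and perturbation design are not given in the abstract:

```python
import torch
import torch.nn.functional as F

def prompt_consistency_loss(model, image, point_prompts, jitter=8.0):
    """Consistency regularization across perturbed prompts (illustrative).

    Two jittered copies of the same point prompts should produce
    consistent masks on an unlabeled image; a symmetric KL penalizes
    disagreement between the two prompt-conditioned predictions.
    """
    p1 = point_prompts + jitter * torch.randn_like(point_prompts)
    p2 = point_prompts + jitter * torch.randn_like(point_prompts)
    logits1 = model(image, p1)                    # (B, C, H, W) logits
    logits2 = model(image, p2)
    logp1 = F.log_softmax(logits1, dim=1)
    logp2 = F.log_softmax(logits2, dim=1)
    return 0.5 * (F.kl_div(logp1, logp2.exp(), reduction="batchmean")
                  + F.kl_div(logp2, logp1.exp(), reduction="batchmean"))
```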

NeurIPS Conference 2025 Conference Paper

Confusion-Driven Self-Supervised Progressively Weighted Ensemble Learning for Non-Exemplar Class Incremental Learning

  • Kai Hu
  • Zhang Yu
  • Yuan Zhang
  • Zhineng Chen
  • Xieping Gao

Non-exemplar class incremental learning (NECIL) aims to continuously assimilate new knowledge while retaining previously acquired knowledge in scenarios where prior examples are unavailable. A prevalent strategy within NECIL mitigates knowledge forgetting by freezing the feature extractor after training on the initial task. However, this freezing mechanism does not provide explicit training to differentiate between new and old classes, resulting in overlapping feature representations. To address this challenge, we propose a Confusion-driven seLf-supervised prOgressiVely weighted Ensemble leaRning (CLOVER) framework for NECIL. Firstly, we introduce a confusion-driven self-supervised learning approach that enhances representation extraction by guiding the model to distinguish between highly confusable classes, thereby reducing class representation overlap. Secondly, we develop a progressively weighted ensemble learning method that gradually adjusts weights to integrate diverse knowledge more effectively, further minimizing representation overlap. Finally, extensive experiments demonstrate that our proposed method achieves state-of-the-art results on the CIFAR100, TinyImageNet, and ImageNet-Subset NECIL benchmarks.
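As a loose illustration of a progressively weighted ensemble, here is a sketch in which per-head weights are annealed over training; the paper's actual weighting schedule is not specified in the abstract, so the linear schedule below is only a stand-in:

```python
import torch

def weighted_ensemble(logits_per_head, epoch, total_epochs):
    """Progressively weighted ensemble of classifier heads (illustrative).

    `logits_per_head` is a list of (B, C) logit tensors from ensemble
    members; weight mass is shifted linearly from uniform toward the most
    recent member as training progresses.
    """
    n = len(logits_per_head)
    t = epoch / max(total_epochs - 1, 1)           # progress in [0, 1]
    weights = torch.full((n,), (1.0 - t) / n)      # uniform share
    weights[-1] += t                               # extra mass on newest head
    return sum(w * l for w, l in zip(weights, logits_per_head))
```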

AAAI Conference 2025 Conference Paper

Distilling Knowledge from Heterogeneous Architectures for Semantic Segmentation

  • Yanglin Huang
  • Kai Hu
  • Yuan Zhang
  • Zhineng Chen
  • Xieping Gao

Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in architectures with different inductive biases, which is crucial for enabling the student to acquire a more precise and comprehensive understanding of the data during distillation. To this end, we propose for the first time a generic knowledge distillation method for semantic segmentation from a heterogeneous perspective, named HeteroAKD. Due to the substantial disparities between heterogeneous architectures, such as CNN and Transformer, directly transferring cross-architecture knowledge presents significant challenges. To eliminate the influence of architecture-specific information, the intermediate features of both the teacher and student are skillfully projected into an aligned logits space. Furthermore, to utilize diverse knowledge from heterogeneous architectures and deliver customized knowledge required by the student, a teacher-student knowledge mixing mechanism (KMM) and a teacher-student knowledge evaluation mechanism (KEM) are introduced. Both mechanisms operate by assessing the reliability of the heterogeneous teacher-student knowledge and the discrepancy between the two. Extensive experiments conducted on three mainstream benchmarks using various teacher-student pairs demonstrate that our HeteroAKD framework outperforms state-of-the-art KD methods in facilitating distillation between heterogeneous architectures.
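A minimal sketch of the aligned-logits idea, assuming generic 1x1-conv projection heads; HeteroAKD's actual KMM and KEM mechanisms are richer than this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitsSpaceKD(nn.Module):
    """Distill intermediate features via a shared logits space (sketch).

    Teacher and student features are projected to per-pixel class logits
    with 1x1 convs, removing architecture-specific channel layouts; a
    temperature-scaled KL then transfers the knowledge. Assumes the two
    feature maps share spatial size.
    """
    def __init__(self, c_teacher, c_student, num_classes, tau=4.0):
        super().__init__()
        self.proj_t = nn.Conv2d(c_teacher, num_classes, 1)
        self.proj_s = nn.Conv2d(c_student, num_classes, 1)
        self.tau = tau

    def forward(self, feat_t, feat_s):
        # feat_t: (B, Ct, H, W) teacher feature; feat_s: (B, Cs, H, W)
        zt = self.proj_t(feat_t.detach()) / self.tau
        zs = self.proj_s(feat_s) / self.tau
        return F.kl_div(zs.log_softmax(1), zt.softmax(1),
                        reduction="batchmean") * self.tau ** 2
```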

AAAI Conference 2025 Conference Paper

Explicit Relational Reasoning Network for Scene Text Detection

  • Yuchen Su
  • Zhineng Chen
  • Yongkun Du
  • Zhilong Ji
  • Kai Hu
  • Jinfeng Bai
  • Xieping Gao

The connected component (CC) is a text shape representation that aligns well with human reading intuition. However, CC-based text detection methods have recently hit a developmental bottleneck: their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships without post-processing. Concretely, we first represent each text instance as multiple ordered text components, and then treat these components as objects in sequential movement. In this way, scene text detection can be innovatively viewed as a tracking problem. From this perspective, we design an end-to-end tracking decoder to achieve a CC-based method that dispenses with post-processing entirely. Additionally, we observe an inconsistency between classification confidence and localization quality, so we propose a Polygon Monte-Carlo method to quickly and accurately evaluate the localization quality. Based on this, we introduce a position-supervised classification loss to guide the task-aligned learning of ERRNet. Experiments on challenging benchmarks demonstrate the effectiveness of our ERRNet. It consistently achieves state-of-the-art accuracy while holding highly competitive inference speed.
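The Polygon Monte-Carlo idea lends itself to a compact illustration: estimate polygon IoU by sampling points and testing membership. The sample count and estimator details of ERRNet's actual method may differ:

```python
import numpy as np
from matplotlib.path import Path

def mc_polygon_iou(poly_a, poly_b, n_samples=2000, rng=None):
    """Monte-Carlo IoU between two polygons, each an (n, 2) vertex array.

    Uniformly samples points in the joint bounding box and estimates IoU
    from the fraction of samples inside each polygon.
    """
    rng = rng or np.random.default_rng(0)
    pts = np.vstack([poly_a, poly_b])
    lo, hi = pts.min(0), pts.max(0)
    samples = rng.uniform(lo, hi, size=(n_samples, 2))
    in_a = Path(poly_a).contains_points(samples)
    in_b = Path(poly_b).contains_points(samples)
    inter = np.logical_and(in_a, in_b).sum()
    union = np.logical_or(in_a, in_b).sum()
    return inter / union if union else 0.0
```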

AAAI Conference 2025 Conference Paper

Out of Length Text Recognition with Sub-String Matching

  • Yongkun Du
  • Zhineng Chen
  • Caiyan Jia
  • Xieping Gao
  • Yu-Gang Jiang

Scene Text Recognition (STR) methods have demonstrated robust performance in word-level text recognition. However, in real applications a text image is sometimes long because several horizontal words are detected together. This raises the need to build long-text recognition models from readily available short (i.e., word-level) text datasets, a setting that has been little studied previously. In this paper, we term this task Out of Length (OOL) text recognition. We establish the first Long Text Benchmark (LTB) to facilitate the assessment of different methods in long text recognition. Meanwhile, we propose a novel method called OOL Text Recognition with sub-String Matching (SMTR). SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs the queries to attend to the image features, matching the sub-string and simultaneously recognizing its next and previous characters. SMTR can recognize text of arbitrary length by iterating this process. To avoid being trapped in recognizing highly similar sub-strings, we introduce a regularization training scheme that compels SMTR to effectively discover subtle differences between similar sub-strings for precise matching. In addition, we propose an inference augmentation strategy to alleviate confusion caused by identical sub-strings in the same text and to improve the overall recognition efficiency. Extensive experimental results reveal that SMTR, even when trained exclusively on short text, outperforms existing methods on public short-text benchmarks and exhibits a clear advantage on LTB.
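A conceptual sketch of the sliding sub-string decoding loop, with `model` as a hypothetical callable that returns the character following a given sub-string; SMTR's real next/previous dual queries and matching are more involved:

```python
def smtr_style_decode(model, img_feats, bos="<s>", eos="</s>", window=5,
                      max_len=200):
    """Recognize arbitrarily long text by repeatedly matching the most
    recent sub-string and appending its predicted next character."""
    text = []
    while len(text) < max_len:
        sub = "".join(text[-window:]) if text else bos  # current sub-string
        nxt = model(img_feats, sub)                     # next character
        if nxt == eos:
            break
        text.append(nxt)
    return "".join(text)
```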

IJCAI Conference 2025 Conference Paper

Stochasticity-aware No-Reference Point Cloud Quality Assessment

  • Songlin Fan
  • Wei Gao
  • Zhineng Chen
  • Ge Li
  • Guoqing Liu
  • Qicheng Wang

The evolution of point cloud processing algorithms necessitates an accurate assessment of their quality. Previous works consistently regard point cloud quality assessment (PCQA) as a MOS regression problem and devise a deterministic mapping, ignoring the stochasticity in generating MOS from subjective tests. This work presents the first probabilistic architecture for no-reference PCQA, motivated by the labeling process of existing datasets. The proposed method can model the quality-judging stochasticity of subjects through a tailored conditional variational autoencoder (CVAE) and produces multiple intermediate quality ratings. These intermediate ratings simulate the judgments from different subjects and are then integrated into an accurate quality prediction, mimicking the generation process of a ground-truth MOS. Specifically, our method incorporates a Prior Module, a Posterior Module, and a Quality Rating Generator, where the former two modules are introduced to model the judging stochasticity in subjective tests, while the latter is developed to generate diverse quality ratings. Extensive experiments indicate that our approach outperforms previous cutting-edge methods by a large margin and exhibits gratifying cross-dataset robustness. Codes are available at https://git.openi.org.cn/OpenPointCloud/nrpcqa.
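A minimal sketch of the sampling idea, with `prior_module` and `rating_generator` as hypothetical stand-ins for the paper's Prior Module and Quality Rating Generator:

```python
import torch

@torch.no_grad()
def predict_mos(prior_module, rating_generator, pc_feats, n_samples=32):
    """Stochastic quality prediction: sample several latent codes
    (simulated subjects), decode one rating per sample, and aggregate
    them into a MOS-like score, mimicking how a ground-truth MOS is
    averaged over subjects."""
    mu, logvar = prior_module(pc_feats)            # latent prior p(z | x)
    std = (0.5 * logvar).exp()
    ratings = []
    for _ in range(n_samples):
        z = mu + std * torch.randn_like(std)       # one simulated subject
        ratings.append(rating_generator(pc_feats, z))
    return torch.stack(ratings).mean(0)            # aggregate to MOS
```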

AAAI Conference 2024 Conference Paper

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network

  • Yuchen Su
  • Zhineng Chen
  • Zhiwen Shao
  • Yuning Du
  • Zhilong Ji
  • Jinfeng Bai
  • Yong Zhou
  • Yu-Gang Jiang

Recently, regression-based methods, which predict parameterized text shapes for text localization, have gained popularity in scene text detection. However, the existing parameterized text shape methods still have limitations in modeling arbitrary-shaped texts due to ignoring the utilization of text-specific shape information. Moreover, the time consumption of the entire pipeline has been largely overlooked, leading to a suboptimal overall inference speed. To address these issues, we first propose a novel parameterized text shape method based on low-rank approximation. Unlike other shape representation methods that employ data-irrelevant parameterization, our approach utilizes singular value decomposition and reconstructs the text shape using a few eigenvectors learned from labeled text contours. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. Next, we propose a dual assignment scheme for speed acceleration. It adopts a sparse assignment branch to accelerate the inference speed, and meanwhile, provides ample supervised signals for training through a dense assignment branch. Building upon these designs, we implement an accurate and efficient arbitrary-shaped text detector named LRANet. Extensive experiments are conducted on several challenging benchmarks, demonstrating the superior accuracy and efficiency of LRANet compared to state-of-the-art methods. Code is available at: https://github.com/ychensu/LRANet.git
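The low-rank contour representation is easy to illustrate with NumPy: stack labeled contours, run SVD, and keep the top eigenvectors as a shared basis. Array shapes and the rank are assumptions; LRANet regresses the coefficients rather than computing them directly:

```python
import numpy as np

def fit_contour_basis(contours, rank=8):
    """Learn a low-rank basis for text contours via SVD (illustrative).

    `contours` is (N, 2K): N labeled contours, each K (x, y) points
    flattened. Any contour is then encoded as `rank` coefficients and
    reconstructed from the shared eigenvectors.
    """
    mean = contours.mean(0)
    _, _, vt = np.linalg.svd(contours - mean, full_matrices=False)
    basis = vt[:rank]                       # (rank, 2K) top eigenvectors
    return mean, basis

def encode(contour, mean, basis):
    return basis @ (contour - mean)         # (rank,) shape coefficients

def decode(coeffs, mean, basis):
    return mean + basis.T @ coeffs          # approximate (2K,) contour
```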

AAAI Conference 2023 Conference Paper

Resolving Task Confusion in Dynamic Expansion Architectures for Class Incremental Learning

  • Bingchen Huang
  • Zhineng Chen
  • Peng Zhou
  • Jiayin Chen
  • Zuxuan Wu

The dynamic expansion architecture is becoming popular in class incremental learning, mainly due to its advantages in alleviating catastrophic forgetting. However, task confusion is not well assessed within this framework, e.g., the discrepancy between classes of different tasks is not well learned (i.e., inter-task confusion, ITC), and certain priority is still given to the latest class batch (i.e., old-new confusion, ONC). We empirically validate the side effects of these two types of confusion. Meanwhile, a novel solution called Task Correlated Incremental Learning (TCIL) is proposed to encourage discriminative and fair feature utilization across tasks. TCIL performs a multi-level knowledge distillation to propagate knowledge learned from old tasks to the new one. It establishes information flow paths at both feature and logit levels, enabling the learning to be aware of old classes. Besides, an attention mechanism and classifier re-scoring are applied to generate fairer classification scores. We conduct extensive experiments on the CIFAR100 and ImageNet100 datasets. The results demonstrate that TCIL consistently achieves state-of-the-art accuracy. It mitigates both ITC and ONC, while showing advantages in combating catastrophic forgetting even when no rehearsal memory is reserved. Source code: https://github.com/YellowPancake/TCIL.
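As a loose illustration of classifier re-scoring across incremental tasks, the sketch below rescales each task's logit block toward a common magnitude; TCIL's actual re-scoring rule is not detailed in the abstract:

```python
import torch

def rescore_logits(logits, task_sizes):
    """Rebalance per-task logit blocks (illustrative stand-in).

    `logits` is (B, total_classes) and `task_sizes` lists how many classes
    each task contributed. Each block is rescaled so its mean magnitude
    matches the average across tasks, reducing the bias toward the newest
    (typically largest-magnitude) head.
    """
    blocks, scales, start = [], [], 0
    for size in task_sizes:
        block = logits[:, start:start + size]
        blocks.append(block)
        scales.append(block.abs().mean())
        start += size
    ref = torch.stack(scales).mean()        # common target magnitude
    return torch.cat([b * (ref / s) for b, s in zip(blocks, scales)], dim=1)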

IJCAI Conference 2023 Conference Paper

TPS++: Attention-Enhanced Thin-Plate Spline for Scene Text Recognition

  • Tianlun Zheng
  • Zhineng Chen
  • Jinfeng Bai
  • Hongtao Xie
  • Yu-Gang Jiang

Text irregularities pose significant challenges to scene text recognizers. Thin-Plate Spline (TPS)-based rectification is widely regarded as an effective means to deal with them. Currently, the calculation of TPS transformation parameters depends purely on the quality of regressed text borders. It ignores the text content and often leads to unsatisfactory rectified results for severely distorted text. In this work, we introduce TPS++, an attention-enhanced TPS transformation that incorporates the attention mechanism into text rectification for the first time. TPS++ formulates the parameter calculation as a joint process of foreground control point regression and content-based attention score estimation, which is computed by a dedicatedly designed gated-attention block. TPS++ builds a more flexible content-aware rectifier, generating a natural text correction that is easier to read by the subsequent recognizer. Moreover, TPS++ shares part of the feature backbone with the recognizer and implements the rectification at feature level rather than image level, incurring only a small overhead in terms of parameters and inference time. Experiments on public benchmarks show that TPS++ consistently improves recognition and achieves state-of-the-art accuracy. Meanwhile, it generalizes well across different backbones and recognizers. Code is at https://github.com/simplify23/TPS_PP.
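The classic TPS fit that TPS++ builds on can be written compactly; this is the standard thin-plate-spline system, while TPS++'s gated-attention conditioning of the control points is not reproduced here:

```python
import numpy as np

def solve_tps(src, dst):
    """Solve 2-D thin-plate-spline parameters (standard formulation).

    `src`, `dst` are (n, 2) matched control points. Returns the non-affine
    weights and the affine part of the map src -> dst.
    """
    n = src.shape[0]
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)  # squared dists
    K = d2 * np.log(d2 + 1e-12)              # U(r) = r^2 log r^2, U(0) = 0
    P = np.hstack([np.ones((n, 1)), src])    # affine terms [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.vstack([dst, np.zeros((3, 2))])
    params = np.linalg.solve(A, b)
    return params[:n], params[n:]            # (n, 2) weights, (3, 2) affine

def tps_warp(pts, src, weights, affine):
    """Apply the fitted TPS map to arbitrary points `pts` of shape (m, 2)."""
    d2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    U = d2 * np.log(d2 + 1e-12)
    return np.hstack([np.ones((len(pts), 1)), pts]) @ affine + U @ weights
```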

IJCAI Conference 2022 Conference Paper

SVTR: Scene Text Recognition with a Single Visual Model

  • Yongkun Du
  • Zhineng Chen
  • Caiyan Jia
  • Xiaoting Yin
  • Tianlun Zheng
  • Chenxia Li
  • Yuning Du
  • Yu-Gang Jiang

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR.
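A simplified sketch of global versus local mixing over patch tokens, using windowed attention masks; the real SVTR blocks and their merging/combining operators differ in detail:

```python
import torch
import torch.nn as nn

class MixingBlock(nn.Module):
    """Global or local mixing over character-component tokens (sketch).

    Global mixing is plain self-attention across all tokens; local mixing
    restricts attention to a small spatial window via a boolean mask, so
    each token only sees its neighborhood (intra-character patterns).
    """
    def __init__(self, dim, nhead, hw, local=False, window=(7, 11)):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mask = None
        if local:
            h, w = hw
            ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w),
                                    indexing="ij")
            pos = torch.stack([ys.flatten(), xs.flatten()], 1)  # (h*w, 2)
            diff = (pos[:, None, :] - pos[None, :, :]).abs()
            near = (diff[..., 0] <= window[0] // 2) & \
                   (diff[..., 1] <= window[1] // 2)
            self.mask = ~near                 # True = attention blocked

    def forward(self, x):                     # x: (B, h*w, dim) tokens
        y, _ = self.attn(x, x, x, attn_mask=self.mask)
        return self.norm(x + y)
```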