Author name cluster

Jun Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

1 author row

AAAI Conference 2026 Conference Paper

Binary-Gaussian: Compact and Progressive Representation for 3D Gaussian Segmentation

An Yang
Chenyu Liu
Jun Du
Jianqing Gao
Jia Pan
Jinshui Hu
Baocai Yin
Bing Yin

3D Gaussian Splatting (3D-GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation. However, existing 3D-GS-based segmentation methods typically rely on high-dimensional category features, which introduce substantial memory overhead. Moreover, fine-grained segmentation remains challenging due to label space congestion and the lack of stable multi-granularity control mechanisms. To address these limitations, we propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via the binary-to-decimal mapping, drastically reducing memory usage. We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks, reducing inter-class conflicts and thereby enhancing fine-grained segmentation capability. Additionally, we fine-tune opacity during segmentation training to address the incompatibility between photometric rendering and semantic segmentation, which often leads to foreground-background confusion. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art segmentation performance while significantly reducing memory consumption and accelerating inference.
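As an illustration of the binary-to-decimal mapping mentioned in the abstract above, the following is a minimal sketch of packing a coarse-to-fine binary code into a single integer. The function names, the bit order (coarsest level in the most significant bit), and the prefix helper are assumptions made for this example; the abstract does not specify these details, and the paper's actual encoding may differ.

```python
def encode_bits(bits):
    """Pack a list of 0/1 values into one integer.

    Assumption: bits[0] is the coarsest level and becomes the most
    significant bit of the resulting code.
    """
    code = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("each bit must be 0 or 1")
        code = (code << 1) | b
    return code


def decode_bits(code, num_bits):
    """Unpack an integer back into num_bits binary values, coarsest first."""
    return [(code >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def coarse_prefix(code, num_bits, level):
    """Keep only the first `level` (coarsest) bits of a num_bits-wide code."""
    if not 0 <= level <= num_bits:
        raise ValueError("level must be between 0 and num_bits")
    return code >> (num_bits - level)
```

Under these assumptions, a three-level code [1, 0, 1] is stored as the integer 5, and dropping the finest bit yields the two-level label 2 (binary 10), so one stored integer can serve several granularities.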


AAAI Conference 2026 Conference Paper

READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation

Haotian Wang
Yuzhe Weng
Jun Du
Haoran Xu
Xiaoyan Wu
Shan He
Bing Yin
Cong Liu

The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, a real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed spatiotemporal video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.


AAAI Conference 2025 Conference Paper

DocMamba: Efficient Document Pre-training with State Space Model

Pengfei Hu
Zhenrong Zhang
Jiefeng Ma
Shuhang Liu
Jun Du
Jianshu Zhang

In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation.


IJCAI Conference 2025 Conference Paper

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Chang Li
Ruoyu Wang
Lijuan Liu
Jun Du
Yixuan Sun
Zilu Guo
Zhengrong Zhang
Yuan Jiang

Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets suffer from issues like low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (SOTA) performance on benchmark datasets including MusicCaps and the Song-Describer Dataset with both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/; code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.


AAAI Conference 2025 Conference Paper

RFL: Simplifying Chemical Structure Recognition with Ring-Free Language

Qikai Chang
Mingjun Chen
Changpeng Pi
Pengfei Hu
Zhenrong Zhang
Jiefeng Ma
Jun Du
Baocai Yin

The primary objective of Optical Chemical Structure Recognition is to convert chemical structure images into corresponding markup sequences. However, the complex two-dimensional structures of molecules, particularly those with rings and multiple branches, present significant challenges for current end-to-end methods to learn one-dimensional markup directly. To overcome this limitation, we propose a novel Ring-Free Language (RFL), which utilizes a divide-and-conquer strategy to describe chemical structures in a hierarchical form. RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness while enhancing readability. This approach significantly reduces the learning difficulty for recognition models. Leveraging RFL, we propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings, along with a branch classification module for predicting branch information. Experimental results demonstrate that the proposed RFL and MSD can be applied to various mainstream methods, achieving superior performance compared to state-of-the-art approaches in both printed and handwritten scenarios.


JBHI Journal 2025 Journal Article

Topological GCN Guided Improved Conformer for Detection of Hip Landmarks From Ultrasound Images

Tianxiang Huang
Jing Shi
Ge Jin
Juncheng Li
Jun Wang
Qian Wang
Jun Du
Jun Shi

B-mode ultrasound based computer-aided diagnosis (CAD) has shown its effectiveness for diagnosis of Developmental Dysplasia of the Hip (DDH) in infants within 6 months. Hip landmark detection is a feasible way for the CAD of DDH according to Graf's method. However, existing landmark detection algorithms mainly focus on designing special models to capture the features from hip ultrasound images, but generally ignore the important spatial relations among different landmarks. To this end, a novel weakly supervised learning-based algorithm, the Topological Graph Convolutional Network (TGCN) guided Improved Conformer (TGCN-ICF), is proposed for detecting landmarks from hip ultrasound images. The TGCN-ICF includes two subnetworks: an Improved Conformer (ICF) subnetwork to generate heatmaps and constraint vectors from ultrasound images, and a TGCN subnetwork to additionally explore topological relations among hip landmarks with the guidance of class labels for further refining and improving the detection accuracy. Moreover, a new Mutual Modulation Fusion (MMF) module is developed to fully exchange and fuse the extracted feature information from the convolutional neural network (CNN) and Transformer branches in ICF. Meanwhile, a novel Mutual Supervision Constraint (MSC) strategy is designed to provide a constraint for detection of each hip landmark. The experimental results on two real-world DDH datasets demonstrate that the TGCN-ICF outperforms all the compared algorithms, suggesting its potential applications.


JBHI Journal 2024 Journal Article

Channel Adaptive and Sparsity Personalized Federated Learning for Privacy Protection in Smart Healthcare Systems

Ziqi Chen
Jun Du
Xiangwang Hou
Keping Yu
Jintao Wang
Zhu Han

With the booming development of Smart Healthcare Systems (SHSs), employing federated learning (FL) in SHS devices has become a research hotspot. FL, as a distributed learning framework, can train models without sharing the original data among users, and thus protect user privacy. Existing research has proposed many methods to improve the security and efficiency of FL, but these may not fully consider the characteristics of SHSs. Specifically, the requirements of privacy protection and efficiency pose significant challenges to FL. Current studies have struggled to balance privacy security and efficiency, and the degradation of model training efficiency in SHSs can be critical to patient health. Therefore, to improve the privacy protection of healthcare data and ensure communication efficiency, this work proposes a novel personalized FL framework based on Communication quality and Adaptive Sparsification (pFedCAS). To achieve privacy protection, a control unit is introduced to adjust the sparsity of the local model adaptively. To further improve the training efficiency, a selection unit is added during global model aggregation to select suitable clients for parameter updates. Finally, we validate the proposed method on the HAM10000 dataset. Simulation results show that pFedCAS can not only improve privacy protection, but also gain an improvement of 15% in training accuracy and a reduction of 30% in training costs based on communication quality. The simulation results also demonstrate the robustness of pFedCAS to non-iid data.


JBHI Journal 2024 Journal Article

Involution Transformer Based U-Net for Landmark Detection in Ultrasound Images for Diagnosis of Infantile DDH

Tianxiang Huang
Jing Shi
Juncheng Li
Jun Wang
Jun Du
Jun Shi

B-mode ultrasound based computer-aided diagnosis (CAD) has demonstrated its effectiveness for diagnosis of Developmental Dysplasia of the Hip (DDH) in infants, which can apply Graf's method by detecting landmarks in hip ultrasound images. However, it is still necessary to explore more valuable information around these landmarks to enhance feature representation and improve detection performance. To this end, a novel Involution Transformer based U-Net (IT-UNet) network is proposed for hip landmark detection. The IT-UNet integrates the efficient involution operation into Transformer to develop an Involution Transformer module (ITM), which consists of an involution attention block and a squeeze-and-excitation involution block. The ITM can capture both the spatial-related information and long-range dependencies from hip ultrasound images to effectively improve feature representation. Moreover, an Involution Downsampling block (IDB) is developed to alleviate the issue of feature loss in the encoder modules, which combines involution and convolution for the purpose of downsampling. The experimental results on two DDH ultrasound datasets indicate that the proposed IT-UNet achieves the best landmark detection performance, indicating its potential applications.


IJCAI Conference 2024 Conference Paper

SEMv3: A Fast and Robust Approach to Table Separation Line Detection

Chunxia Qin
Zhenrong Zhang
Pengfei Hu
Chenyu Liu
Jiefeng Ma
Jun Du

Table structure recognition (TSR) aims to parse the inherent structure of a table from its input image. The "split-and-merge" paradigm is a pivotal approach to parse table structure, where the table separation line detection is crucial. However, challenges such as wireless and deformed tables make it demanding. In this paper, we adhere to the "split-and-merge" paradigm and propose SEMv3 (SEM: Split, Embed and Merge), a method that is both fast and robust for detecting table separation lines. During the split stage, we introduce a Keypoint Offset Regression (KOR) module, which effectively detects table separation lines by directly regressing the offset of each line relative to its keypoint proposals. Moreover, in the merge stage, we define a series of merge actions to efficiently describe the table structure based on table grids. Extensive ablation studies demonstrate that our proposed KOR module can detect table separation lines quickly and accurately. Furthermore, on public datasets (e.g., WTW, ICDAR-2019 cTDaR Historical and iFLYTAB), SEMv3 achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/Chunchunwumu/SEMv3.


NeurIPS Conference 2024 Conference Paper

SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding

Jiefeng Ma
Yan Wang
Chenyu Liu
Jun Du
Yu Hu
Zhenrong Zhang
Pengfei Hu
Qing Wang

Accurately identifying and organizing textual content is crucial for the automation of document processing in the field of form understanding. Existing datasets, such as FUNSD and XFUND, support entity classification and relationship prediction tasks but are typically limited to local and entity-level annotations. This limitation overlooks the hierarchically structured representation of documents, constraining comprehensive understanding of complex forms. To address this issue, we present SRFUND, a hierarchically structured multi-task form understanding benchmark. SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets, encompassing five tasks: (1) word to text-line merging, (2) text-line to entity merging, (3) entity category classification, (4) item table localization, and (5) entity-based full-document hierarchical structure recovery. We meticulously supplemented the original dataset with missing annotations at various levels of granularity and added detailed annotations for multi-item table regions within the forms. Additionally, we introduce global hierarchical structure dependencies for entity relation prediction tasks, surpassing traditional local key-value associations. The SRFUND dataset covers eight languages (English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese), making it a powerful tool for cross-lingual form understanding. Extensive experimental results demonstrate that the SRFUND dataset presents new challenges and significant opportunities in handling diverse layouts and global hierarchical structures of forms, thus providing deep insights into the field of form understanding. The original dataset and implementations of baseline methods are available at https://sprateam-ustc.github.io/SRFUND.


AAAI Conference 2023 Conference Paper

HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures

Jiefeng Ma
Jun Du
Pengfei Hu
Zhenrong Zhang
Jianshu Zhang
Huihui Zhu
Cong Liu

The problem of document structure reconstruction refers to converting digital or scanned documents into corresponding semantic structures. Most existing works mainly focus on splitting the boundary of each element in a single document page, neglecting the reconstruction of semantic structure in multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task suitable for NLP and CV fields. To better evaluate the system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations including categories and relations obtained from rule-based extractors and human annotators. Moreover, we proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.


JBHI Journal 2022 Journal Article

Diagnosis of Infantile Hip Dysplasia With B-Mode Ultrasound via Two-Stage Meta-Learning Based Deep Exclusivity Regularized Machine

Bangming Gong
Jing Shi
Xiangmin Han
Huan Zhang
Yuemin Huang
Liwei Hu
Jun Wang
Jun Du

B-mode ultrasound (BUS) based computer-aided diagnosis (CAD) has shown its effectiveness for developmental dysplasia of the hip (DDH) in infants. In this work, a two-stage meta-learning based deep exclusivity regularized machine (TML-DERM) is proposed for the BUS-based CAD of DDH. TML-DERM integrates a deep neural network (DNN) and an exclusivity regularized machine into a unified framework to simultaneously improve the feature representation and classification performance. Moreover, the first-stage meta-learning is mainly conducted on the DNN module to alleviate the overfitting issue caused by the significantly increased parameters in DNN, and a random sampling strategy is adopted to self-generate the meta-tasks; the second-stage meta-learning mainly learns the combination of multiple weak classifiers by a weight vector to improve the classification performance, and also optimizes the unified framework again. The experimental results on a DDH ultrasound dataset show that the proposed TML-DERM algorithm achieves superior classification performance, with a mean accuracy of 85.89%, sensitivity of 86.54%, and specificity of 85.23%.


AAAI Conference 2022 Conference Paper

TDv2: A Novel Tree-Structured Decoder for Offline Mathematical Expression Recognition

Changjie Wu
Jun Du
Yunqing Li
Jianshu Zhang
Chen Yang
Bo Ren
Yiqing Hu

In recent years, tree decoders have become more popular than LaTeX string decoders in the field of handwritten mathematical expression recognition (HMER), as they can capture the hierarchical tree structure of mathematical expressions. However, previous tree decoders converted the tree structure labels into a fixed and ordered sequence, which could not make full use of the diversified expression of tree labels. In this study, we propose a novel tree decoder (TDv2) to fully utilize the tree structure labels. Compared with previous tree decoders, this new model does not require a fixed priority for different branches of a node during training and inference, which can effectively improve the model generalization capability. The input and output of the model make full use of the tree structure label, so that there is no need to find the parent node in the decoding process, which simplifies the decoding process and adds a priori information to help predict the node. We verified the effectiveness of each part of the model through comprehensive ablation experiments and attention visualization analysis. On the authoritative CROHME 14/16/19 datasets, our method achieves state-of-the-art results.


AAAI Conference 2015 Conference Paper

Modelling Class Noise with Symmetric and Asymmetric Distributions

Jun Du
Zhihua Cai

In classification problems, we assume that the samples around the class boundary are more likely to be incorrectly annotated than others, and propose boundary-conditional class noise (BCN). Based on the BCN assumption, we use unnormalized Gaussian and Laplace distributions to directly model how class noise is generated, in symmetric and asymmetric cases. In addition, we demonstrate that Logistic regression and Probit regression can also be reinterpreted from this class noise perspective, and compare them with the proposed models. The empirical study shows that the proposed asymmetric models overall outperform the benchmark linear models, and the asymmetric Laplace-noise model achieves the best performance among all.


TIST Journal 2012 Journal Article

A Generic Approach for Systematic Analysis of Sports Videos

Ning Zhang
Ling-Yu Duan
Lingfang Li
Qingming Huang
Jun Du
Wen Gao
Ling Guan

Various innovative and original works have been applied and proposed in the field of sports video analysis. However, individual works have focused on sophisticated methodologies with particular sport types and there has been a lack of scalable and holistic frameworks in this field. This article proposes a solution and presents a systematic and generic approach which is experimented on a relatively large-scale sports consortia. The system aims at the event detection scenario of an input video with an orderly sequential process. Initially, domain knowledge-independent local descriptors are extracted homogeneously from the input video sequence. Then the video representation is created by adopting a bag-of-visual-words (BoW) model. The video’s genre is first identified by applying the k-nearest neighbor (k-NN) classifiers on the initially obtained video representation, and various dissimilarity measures are assessed and evaluated analytically. Subsequently, an unsupervised probabilistic latent semantic analysis (PLSA)-based approach is employed at the same histogram-based video representation, characterizing each frame of video sequence into one of four view groups, namely closed-up-view, mid-view, long-view, and outer-field-view. Finally, a hidden conditional random field (HCRF) structured prediction model is utilized for interesting event detection. From experimental results, k-NN classifier using KL-divergence measurement demonstrates the best accuracy at 82.16% for genre categorization. Supervised SVM and unsupervised PLSA have average classification accuracies at 82.86% and 68.13%, respectively. The HCRF model achieves 92.31% accuracy using the unsupervised PLSA based label input, which is comparable with the supervised SVM based input at an accuracy of 93.08%. In general, such a systematic approach can be widely applied in processing massive videos generically.
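The genre identification step described above (k-NN over bag-of-visual-words histograms with a KL-divergence measure) can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the function names, the additive smoothing constant, the direction of the divergence (query relative to training sample), and the toy histograms are all assumptions.

```python
import math
from collections import Counter


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two histograms of equal length.

    Assumption: a small eps is added to every bin before normalizing,
    so that empty bins do not produce log(0) or division by zero.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )


def knn_genre(query_hist, labeled_hists, k=3):
    """Majority vote over the k training histograms closest to the query.

    labeled_hists is a list of (histogram, genre_label) pairs.
    """
    ranked = sorted(labeled_hists, key=lambda item: kl_divergence(query_hist, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 3-bin visual-word histograms, for illustration only.
training = [
    ([8, 1, 1], "soccer"),
    ([7, 2, 1], "soccer"),
    ([1, 1, 8], "tennis"),
    ([1, 2, 7], "tennis"),
]
predicted = knn_genre([9, 1, 0], training, k=3)
```

Because KL divergence is asymmetric, swapping the argument order can change the ranking; the abstract does not state which direction was used.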
