EAAI Journal 2026 Journal Article
An efficient knowledge tracing model via Mamba Contextual Encoding and Dynamic Sparse Attention mechanism
- RuiJuan Zhang
- Feng Zhang
- Cong Liu
Author name cluster
Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
3D Gaussian Splatting (3D-GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation. However, existing 3D-GS-based segmentation methods typically rely on high-dimensional category features, which introduce substantial memory overhead. Moreover, fine-grained segmentation remains challenging due to label space congestion and the lack of stable multi-granularity control mechanisms. To address these limitations, we propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via the binary-to-decimal mapping, drastically reducing memory usage. We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks, reducing inter-class conflicts and thereby enhancing fine-grained segmentation capability. Additionally, we fine-tune opacity during segmentation training to address the incompatibility between photometric rendering and semantic segmentation, which often leads to foreground-background confusion. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art segmentation performance while significantly reducing memory consumption and accelerating inference.
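The integer compression at the core of the binary encoding scheme can be sketched in a few lines: a per-Gaussian binary category vector is packed into a single integer and recovered on demand (function names here are hypothetical illustrations, not the paper's actual implementation):

```python
def encode_category(bits):
    """Pack a binary category vector into one integer (MSB first)."""
    value = 0
    for b in bits:
        value = (value << 1) | int(b)
    return value

def decode_category(value, n_bits):
    """Recover the binary category vector from its integer code."""
    return [(value >> i) & 1 for i in range(n_bits - 1, -1, -1)]

# A k-bit category feature collapses to one integer instead of k floats.
code = encode_category([1, 0, 1, 1])   # code == 11
bits = decode_category(code, 4)        # bits == [1, 0, 1, 1]
```

Storing one small integer per Gaussian instead of a high-dimensional float feature is what yields the memory reduction the abstract claims.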
AAAI Conference 2026 Conference Paper
Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a key obstacle for diffusion transformers in addressing this problem is the mismatch between positional encodings seen at inference and those used during training. Existing strategies, such as positional encoding interpolation, extrapolation, or hybrids of the two, do not fully resolve this mismatch. In this paper, we propose RPE-2D, a novel two-dimensional randomized positional encoding that prioritizes the order of image patches rather than their absolute distances, enabling seamless high- and low-resolution generation without training on multiple resolutions. Concretely, RPE-2D independently samples positions along the horizontal and vertical axes over an expanded range during training, ensuring that the encodings used at inference lie within the training distribution and thereby improving resolution generalization. We further introduce a simple random resize-and-crop augmentation to strengthen order modeling and add micro-conditioning to indicate the applied cropping pattern. On the ImageNet dataset, RPE-2D achieves state-of-the-art resolution generalization performance, outperforming competitive methods when trained at 256^2 and evaluated at 384^2 and 512^2, and when trained at 512^2 and evaluated at 768^2 and 1024^2. RPE-2D also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
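The training-time position sampling can be illustrated with a toy sketch: positions along each axis are drawn from an expanded range and sorted, so patch order is preserved while absolute offsets are randomized (a simplified reading of the abstract; the actual sampling scheme may differ):

```python
import random

def sample_rpe_positions(n_patches, expanded_range):
    """Draw sorted, distinct axis positions from an expanded range,
    preserving patch order but randomizing absolute distances."""
    return sorted(random.sample(range(expanded_range), n_patches))

# Train on a 16x16 patch grid while sampling indices from a 32-wide
# range per axis, so larger inference grids still fall inside the
# distribution of positions seen during training.
rows = sample_rpe_positions(16, 32)
cols = sample_rpe_positions(16, 32)
```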
EAAI Journal 2026 Journal Article
JBHI Journal 2026 Journal Article
Congenital heart disease (CHD) is the leading cause of neonatal mortality worldwide, making early and accurate diagnosis crucial. In resource-constrained regions, standardized ultrasound screening remains difficult due to the shortage of specialized clinicians. Digital Twin (DT) technology, which constructs virtual AI diagnostic models, enables personalized assessments of heart structure and function, offering intelligent diagnostic support in primary or remote healthcare settings. However, the real-time updates required by DT systems place increased demands on the inference speed and computational efficiency of AI models. Existing methods often suffer from parameter redundancy and inference delays, making them inadequate for meeting the low-latency, large-scale needs of clinical applications. To overcome these challenges, we propose the Liquid-Sequencer, a lightweight model for the diagnosis of fetal CHD. The model first employs a convolutional network with DPSE (Depthwise Separable and Squeeze-and-Excitation) modules to efficiently extract spatial features by utilizing depthwise separable convolution and channel attention. These feature maps are then processed by a bidirectional liquid sequence module, where orthogonally scanned Liquid Neural Networks (LNNs) capture global context with linear complexity, offering a more efficient alternative to self-attention mechanisms. This integration of spatial and sequential learning is well-suited to the dynamic nature of fetal cardiac ultrasound and the demands of DT systems. Experimental results on 12 datasets demonstrate outstanding performance with only 0.30M parameters. Additionally, t-SNE visualizations reveal highly discriminative feature representations and clear inter-class separations, underscoring the model's potential as an advanced diagnostic tool.
TMLR Journal 2026 Journal Article
Integrating self-supervised learning (SSL) prior to supervised learning (SL) is a prevalent strategy for enhancing model performance, especially in scenarios with limited labeled data. Nonetheless, this approach inherently introduces a trade-off between computational efficiency and performance gains. Although SSL significantly improves representation learning, it necessitates an additional and often computationally expensive training phase, posing substantial overhead in resource-constrained environments. To mitigate these limitations, we propose MixTraining, a novel training framework designed to interleave multiple epochs of SSL and SL within a unified $\textit{mixtraining phase}$. This phase enables a seamless transition between self-supervised and supervised objectives, facilitating enhanced synergy and improved overall accuracy. Additionally, MixTraining consolidates shared computational steps, thereby reducing redundant computations and lowering overall training latency. Comprehensive experimental evaluations demonstrate that MixTraining provides a superior trade-off between computational efficiency and model performance compared to conventional training pipelines. Specifically, on the TinyImageNet dataset using the ViT-Tiny model, MixTraining achieves an absolute accuracy improvement of 8.81% (a relative gain of 18.89%) while concurrently accelerating training by 1.29$\times$.
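The interleaving idea can be sketched as a loop that alternates SSL and SL epochs inside one mixtraining phase, instead of running all SSL epochs first (a schematic with hypothetical epoch callables; the paper additionally consolidates shared forward computation between the two objectives, which this sketch omits):

```python
def mixtraining_phase(n_epochs, ssl_epoch, sl_epoch):
    """Alternate SSL and SL objectives epoch by epoch."""
    history = []
    for epoch in range(n_epochs):
        if epoch % 2 == 0:
            ssl_epoch()          # self-supervised objective
            history.append("ssl")
        else:
            sl_epoch()           # supervised objective
            history.append("sl")
    return history

# With dummy epoch functions, the schedule interleaves the objectives:
schedule = mixtraining_phase(4, lambda: None, lambda: None)
# schedule == ["ssl", "sl", "ssl", "sl"]
```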
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Agentic AI aims to create systems that set their own goals, adapt proactively to change, and refine behavior through continuous experience. Recent advances suggest that, when facing multiple and unforeseen tasks, agents could benefit from sharing machine-learned knowledge and reusing policies that have already been fully or partially learned by other agents. However, how to query, select, and retrieve policies from a pool of agents, and how to integrate such policies remains a largely unexplored area. This study explores how an agent decides what knowledge to select, from whom, and when and how to integrate it in its own policy in order to accelerate its own learning. The proposed algorithm, Modular Sharing and Composition in Collective Learning (MOSAIC), improves learning in agentic collectives by combining (1) knowledge selection using performance signals and cosine similarity on Wasserstein task embeddings, (2) modular and transferable neural representations via masks, and (3) policy integration, composition and fine-tuning. MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance, and in some cases solves tasks that isolated agents cannot. The results also demonstrate that selective, goal-driven reuse leads to less susceptibility to task interference. We also observe the emergence of self-organization, where agents solving simpler tasks accelerate the learning of harder ones through shared knowledge.
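MOSAIC's knowledge-selection step combines performance signals with similarity over task embeddings; a toy version using plain cosine similarity is sketched below (the paper uses Wasserstein task embeddings, and the peer record structure and scoring rule here are illustrative assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_source(my_embedding, peers):
    """Pick the peer whose task embedding is most similar, weighted by
    its performance signal (an illustrative scoring rule)."""
    return max(peers, key=lambda p: cosine(my_embedding, p["embedding"]) * p["performance"])

peers = [
    {"name": "a", "embedding": [1.0, 0.0], "performance": 0.9},
    {"name": "b", "embedding": [0.0, 1.0], "performance": 0.9},
]
chosen = select_source([0.9, 0.1], peers)   # chosen["name"] == "a"
```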
AAAI Conference 2026 Conference Paper
The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, a real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed spatiotemporal video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency across generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-duration generation.
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
Existing Transformer-based RGBT trackers achieve remarkable performance benefits by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and search-template correlation. Nevertheless, the independent search-template correlation calculations are prone to being affected by low-quality data, which might result in contradictory and ambiguous correlation weights. This not only limits the intra-modal feature representation, but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which innovatively integrates inter-modality interaction into the search-template correlation computation within a typical attention mechanism, for RGBT tracking. In particular, we first independently generate correlation maps for each modality and feed them into the designed correlation modulated enhancement module, which can modify inaccurate correlation weights by seeking the consensus between modalities. Such a design unifies self-attention and cross-attention schemes, which not only alleviates inaccurate attention weight computation in self-attention but also eliminates the redundant computation introduced by an extra cross-attention scheme. In addition, we design a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.
JBHI Journal 2025 Journal Article
Optical coherence tomography is a crucial imaging technique for the detection and analysis of retinal diseases. Precise classification of optical coherence tomography images helps ophthalmologists and healthcare providers design personalized treatment plans in clinical practice. In this paper, we focus on optical coherence tomography image classification for seven types of retinal diseases and the normal retina. Although existing deep neural networks could be applied to optical coherence tomography images for classification, their features are extracted within the same hyperspace, causing “feature congestion”. Moreover, the normal retina class is treated as a “type of retinal disease”, impeding the extraction of the true imaging structures of retinal diseases. To deal with these two issues, we design a deep neural network module to enhance imaging features so that the enhanced features are more distinct for classification. Consistent with medical findings, we propose two domains of retinal diseases and anchor imaging features onto these cross-domains. We tested and evaluated our model on two datasets for eight-class and four-class classification, respectively. Our experimental results demonstrate that the proposed module outperforms state-of-the-art methods. We also conducted ablation studies and sensitivity tests for a comprehensive evaluation of our method.
AAAI Conference 2025 Conference Paper
Patch deformation-based methods have recently exhibited substantial effectiveness in multi-view stereo, due to the incorporation of deformable and expandable perception to reconstruct textureless areas. However, such approaches typically focus on exploring correlative reliable pixels to alleviate match ambiguity during patch deformation, while ignoring the deformation instability caused by mistaken edge-skipping and visibility occlusion, leading to potential estimation deviation. To remedy the above issues, we propose DVP-MVS, which innovatively synergizes depth-edge-aligned and cross-view priors for robust and visibility-aware patch deformation. Specifically, to avoid unexpected edge-skipping, we first utilize Depth Anything V2 followed by the Roberts operator to initialize coarse depth and edge maps respectively, both of which are further aligned through an erosion-dilation strategy to generate fine-grained homogeneous boundaries for guiding patch deformation. In addition, we reform view selection weights as visibility maps and restore visible areas by cross-view depth reprojection, then regard them as a cross-view prior to facilitate visibility-aware patch deformation. Finally, we improve propagation and refinement with multi-view geometry consistency by introducing aggregated visible hemispherical normals based on view selection and local projection depth differences based on epipolar lines, respectively. Extensive evaluations on the ETH3D and Tanks & Temples benchmarks demonstrate that our method achieves state-of-the-art performance with excellent robustness and generalization.
AAAI Conference 2025 Conference Paper
Recently, patch deformation-based methods have demonstrated significant strength in multi-view stereo by adaptively expanding the receptive field of patches to help reconstruct textureless areas. However, such methods mainly concentrate on searching for pixels without matching ambiguity (i.e., reliable pixels) when constructing deformed patches, while neglecting the deformation instability caused by unexpected edge-skipping, resulting in potential matching distortions. Addressing this, we propose MSP-MVS, a method introducing multi-granularity segmentation prior for edge-confined patch deformation. Specifically, to avoid unexpected edge-skipping, we first aggregate and further refine multi-granularity depth edges gained from Semantic-SAM as a prior to guide patch deformation within depth-continuous (i.e., homogeneous) areas. Moreover, to address the attention imbalance caused by edge-confined patch deformation, we implement adaptive equidistribution and disassemble-clustering of correlative reliable pixels (i.e., anchors), thereby promoting attention-consistent patch deformation. Finally, to prevent deformed patches from falling into local-minimum matching costs caused by a fixed sampling pattern, we introduce disparity-sampling synergistic 3D optimization to help identify global-minimum matching costs. Evaluations on the ETH3D and Tanks & Temples benchmarks prove our method obtains state-of-the-art performance with remarkable generalization.
IJCAI Conference 2024 Conference Paper
Recently, heatmap regression methods based on 1D landmark representations have shown prominent performance on locating facial landmarks. However, previous methods have not deeply explored the potential of 1D landmark representations for sequential and structural modeling of multiple landmarks in facial landmark tracking. To address this limitation, we propose a Transformer architecture, namely 1DFormer, which learns informative 1D landmark representations by capturing the dynamic and the geometric patterns of landmarks via token communications in both temporal and spatial dimensions for facial landmark tracking. For temporal modeling, we propose a confidence-enhanced multi-head attention mechanism with a recurrent token mixing strategy to adaptively and robustly embed long-term landmark dynamics into their 1D representations; for structural modeling, we design intra-group and inter-group geometric encoding mechanisms to encode the component-level as well as global-level facial structure patterns as a refinement for the 1D representations of landmarks through token communications in the spatial dimension via 1D convolutional layers. Experimental results on the 300VW and the TF databases show that 1DFormer successfully models the long-range sequential patterns as well as the inherent facial structures to learn informative 1D representations of landmark sequences, and achieves state-of-the-art performance on facial landmark tracking. Codes of our model are available in the supplementary materials.
EAAI Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
ICLR Conference 2024 Conference Paper
We introduce Clifford Group Equivariant Simplicial Message Passing Networks, a method for steerable $\mathrm{E}(n)$-equivariant message passing on simplicial complexes. Our method integrates the expressivity of Clifford group-equivariant layers with simplicial message passing, which is topologically more intricate than regular graph message passing. Clifford algebras include higher-order objects such as bivectors and trivectors, which express geometric features (e.g., areas, volumes) derived from vectors. Using this knowledge, we represent simplex features through geometric products of their vertices. To achieve efficient simplicial message passing, we share the parameters of the message network across different dimensions. Additionally, we restrict the final message to an aggregation of the incoming messages from different dimensions, leading to what we term *shared* simplicial message passing. Experimental results show that our method is able to outperform both equivariant and simplicial graph neural networks on a variety of geometric tasks.
AAAI Conference 2024 Conference Paper
Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused either on extracting more robust visual features or on designing better language modeling. How to effectively and jointly model vision and language to mitigate heavy reliance on a single modality remains a problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). Firstly, revisiting the image as a language by balanced concatenation along the length dimension alleviates the issue of over-reliance on vision or language. Secondly, BUSNet learns an ensemble of unified external and internal vision-language models with shared weights by masked modality modeling (MMM). Thirdly, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which utilizes the vision-language prediction as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmark datasets and on more challenging datasets, for both synthetic and real training data, compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet.
NeurIPS Conference 2023 Conference Paper
The prosperity of deep neural networks (DNNs) has largely benefited from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful, given that they could introduce malicious misclassification behaviors into watermarked DNNs that are exploitable by adversaries. Instead, we design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some `hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a \emph{hardly-generalized domain} for the original dataset (as its \emph{domain watermark}). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, verifying the effectiveness of our method and its resistance to potential adaptive methods.
AAAI Conference 2023 Conference Paper
The problem of document structure reconstruction refers to converting digital or scanned documents into corresponding semantic structures. Most existing works mainly focus on splitting the boundary of each element in a single document page, neglecting the reconstruction of semantic structure in multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task suitable for the NLP and CV fields. To better evaluate system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations, including categories and relations obtained from rule-based extractors and human annotators. Moreover, we propose an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with a soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
IJCAI Conference 2022 Conference Paper
Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by each query attending to all keys/values, various methods have constrained the range of attention within local regions, where each query only attends to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore their input content, so it is likely that one query may attend to irrelevant keys/values. To address this issue, we propose a Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. Our DG-Attention can flexibly model more relevant dependencies without any spatial constraint that is used in hand-crafted window based attention. Built on the DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models can outperform the state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
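A toy sketch of the grouping idea: queries are assigned to groups, and each group attends only to its most relevant keys rather than a fixed window (the centroid choice and top-k scoring below are illustrative simplifications; DGT learns the grouping and selection end-to-end):

```python
import numpy as np

def dg_attention(Q, K, V, n_groups=2, top_k=2, seed=0):
    """Toy dynamic group attention: one-step query grouping, then
    per-group attention restricted to the top-k most relevant keys."""
    rng = np.random.default_rng(seed)
    # Assign each query to its most similar "centroid" query.
    centroids = Q[rng.choice(len(Q), size=n_groups, replace=False)]
    groups = np.argmax(Q @ centroids.T, axis=1)
    out = np.zeros((len(Q), V.shape[1]))
    for g in range(n_groups):
        idx = np.where(groups == g)[0]
        if idx.size == 0:
            continue
        # Each group selects its top-k keys by mean-query relevance.
        scores = Q[idx].mean(axis=0) @ K.T
        keep = np.argsort(scores)[-top_k:]
        attn = Q[idx] @ K[keep].T
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        out[idx] = attn @ V[keep]
    return out

Q = np.eye(4)
K = np.eye(4)
V = np.arange(16.0).reshape(4, 4)
out = dg_attention(Q, K, V)   # shape (4, 4)
```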
IJCAI Conference 2022 Conference Paper
With the privatization deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method that can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching for the most similar function in our database, NNReverse infers the layer type of a given function’s binary code. To represent assembly instruction semantics precisely, NNReverse proposes a more fine-grained embedding model to capture the textual and structural semantics of assembly functions.
IJCAI Conference 2021 Conference Paper
Deep neural networks, particularly face recognition models, have been shown to be vulnerable to both digital and physical adversarial examples. However, existing adversarial examples against face recognition systems either lack transferability to black-box models, or fail to be implemented in practice. In this paper, we propose a unified adversarial face generation method, Adv-Makeup, which can realize imperceptible and transferable attacks under the black-box setting. Adv-Makeup develops a task-driven makeup generation method with a blending module to synthesize imperceptible eye shadow over the orbital region of faces. To achieve transferability, Adv-Makeup implements a fine-grained meta-learning based adversarial attack strategy to learn more vulnerable or sensitive features from various models. Compared to existing techniques, sufficient visualization results demonstrate that Adv-Makeup is capable of generating much more imperceptible attacks under both digital and physical scenarios. Meanwhile, extensive quantitative experiments show that Adv-Makeup can significantly improve the attack success rate under the black-box setting, even when attacking commercial systems.
JBHI Journal 2021 Journal Article
The classification of six types of white blood cells (WBCs) is considered essential for leukemia diagnosis, while the classification is labor-intensive and demanding of clinical experience. To relieve this complicated process with an efficient and automatic method, we propose the Attention-aware Residual network based Manifold Learning model (ARML) to classify WBCs. The proposed ARML model leverages adaptive attention-aware residual learning to exploit the category-relevant image-level features and strengthen the first-order feature representation ability. To learn more discriminatory information than the first-order ones, second-order features are characterized. Afterwards, ARML encodes both the first- and second-order features with Gaussian embedding into the Riemannian manifold to learn the underlying non-linear structure of the features for classification. ARML can be trained in an end-to-end fashion, and the learnable parameters are iteratively optimized. 10,800 WBC images (1,800 images for each type) are collected; 9,000 images with five-fold cross-validation are used for training and validation of the model, while the additional 1,800 images are used for testing. The results show that ARML achieves an average classification accuracy of 0.953, outperforming other state-of-the-art methods with fewer trainable parameters. In the ablation study, ARML achieves improved accuracy against its three variants: without manifold learning (AR), without attention-aware learning (RML), and AR without attention-aware learning. The t-SNE results illustrate that ARML has learned more distinguishable features than the comparison methods, which benefits WBC classification. ARML provides a clinically feasible WBC classification solution for leukemia diagnosis in an efficient manner.
IROS Conference 2020 Conference Paper
Detecting loop closures in 3D Light Detection and Ranging (LiDAR) data is a challenging task since point-level methods always suffer from instability. This paper presents a semantic-level approach named GOSMatch to perform reliable place recognition. Our method leverages novel descriptors, which are generated from the spatial relationship between semantics, to perform frame description and data association. We also propose a coarse-to-fine strategy to efficiently search for loop closures. Besides, GOSMatch can give an accurate 6-DOF initial pose estimation once a loop closure is confirmed. Extensive experiments have been conducted on the KITTI odometry dataset and the results show that GOSMatch can achieve robust loop closure detection performance and outperform existing methods.
AAAI Conference 2014 Conference Paper
Hashing is a popular approximate nearest neighbor search approach for large-scale image retrieval. Supervised hashing, which incorporates similarity/dissimilarity information on entity pairs to improve the quality of hashing function learning, has recently received increasing attention. However, in the existing supervised hashing methods for images, an input image is usually encoded by a vector of hand-crafted visual features. Such hand-crafted feature vectors do not necessarily preserve the accurate semantic similarities of image pairs, which may often degrade the performance of hashing function learning. In this paper, we propose a supervised hashing method for image retrieval, in which we automatically learn a good image representation tailored to hashing as well as a set of hash functions. The proposed method has two stages. In the first stage, given the pairwise similarity matrix S over training images, we propose a scalable coordinate descent method to decompose S into a product of HH^T, where H is a matrix with each of its rows being the approximate hash code associated with a training image. In the second stage, we propose to simultaneously learn a good feature representation for the input images as well as a set of hash functions, via a deep convolutional network tailored to the learned hash codes in H and optionally the discrete class labels of the images. Extensive empirical evaluations on three benchmark datasets with different kinds of images show that the proposed method has superior performance gains over several state-of-the-art supervised and unsupervised hashing methods.
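The first stage's decomposition S ≈ HH^T can be illustrated on a tiny example; here plain gradient descent stands in for the paper's scalable coordinate descent, and sign-thresholding yields approximate hash codes (an illustrative sketch, not the paper's algorithm):

```python
import numpy as np

def learn_codes(S, k, lr=0.01, iters=500, seed=0):
    """Fit H so that S is approximated by H @ H.T, then binarize H."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((S.shape[0], k)) * 0.01
    for _ in range(iters):
        R = S - H @ H.T          # current residual
        H += lr * 4 * R @ H      # gradient step on ||R||_F^2
    return np.sign(H), H

# Items 0 and 1 are similar (+1); item 2 is dissimilar to both (-1).
S = np.array([[ 1.0,  1.0, -1.0],
              [ 1.0,  1.0, -1.0],
              [-1.0, -1.0,  1.0]])
codes, H = learn_codes(S, k=2)
err = np.linalg.norm(S - H @ H.T)   # small after fitting
```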
AAAI Conference 2013 Conference Paper
Rank aggregation, which combines multiple individual rank lists to obtain a better one, is a fundamental technique in various applications such as meta-search and recommendation systems. Most existing rank aggregation methods blindly combine multiple rank lists with possibly considerable noise, which often degrades their performance. In this paper, we propose a new model for robust rank aggregation (RRA) via matrix learning, which recovers a latent rank list from possibly incomplete and noisy input rank lists. In our model, we construct a pairwise comparison matrix to encode the order information in each input rank list. Based on our observations, each comparison matrix can be naturally decomposed into a shared low-rank matrix, combined with a deviation error matrix which is the sum of a column-sparse matrix and a row-sparse one. The latent rank list can be easily extracted from the learned low-rank matrix. The optimization formulation of RRA has an element-wise multiplication operator to handle missing values, a symmetric constraint on the noise structure, and a factorization trick to restrict the maximum rank of the low-rank matrix. To solve this challenging optimization problem, we propose a novel procedure based on the Augmented Lagrangian Multiplier scheme. We conduct extensive experiments on meta-search and collaborative filtering benchmark datasets. The results show that the proposed RRA has superior performance gains over several state-of-the-art algorithms for rank aggregation.
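The comparison-matrix construction can be sketched directly: each input rank list becomes a matrix with ±1 entries for observed orderings and zeros where items are missing (the ±1/0 convention is an illustrative choice, not necessarily the paper's exact encoding):

```python
def comparison_matrix(rank_list, n_items):
    """Encode a (possibly partial) rank list as a pairwise comparison
    matrix: C[i][j] = 1 if item i is ranked above j, -1 if below, and
    0 if either item is missing from the list."""
    pos = {item: r for r, item in enumerate(rank_list)}
    C = [[0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            if i != j and i in pos and j in pos:
                C[i][j] = 1 if pos[i] < pos[j] else -1
    return C

# Rank list [2, 0] over 3 items: item 2 beats item 0; item 1 unobserved,
# so its row and column stay zero, to be handled by the element-wise
# masking in the optimization.
C = comparison_matrix([2, 0], 3)
```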