EAAI Journal 2026 Journal Article
A novel deep learning-based model for fault diagnosis of complex systems with high uncertainty
- Tao Dai
- Yang Sui
- Jiasheng Yan
- Jiahao Zhu
- Xiaohan Li
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Deep learning-based 3D anomaly detection methods have demonstrated significant potential in industrial manufacturing. However, many approaches are specifically designed for anomaly detection tasks, which limits their generalizability to other 3D tasks. In contrast, self-supervised point cloud models aim for general representation learning, yet our investigation reveals that these classical models are suboptimal at anomaly detection under the unified fine-tuning paradigm. This motivates us to develop a more generalizable 3D model that can effectively detect anomalies without relying on task-specific designs. Interestingly, we find that using only the curvature of each point as its anomaly score already outperforms several classical self-supervised and dedicated anomaly detection models, highlighting the critical role of curvature in 3D anomaly detection. In this paper, we propose a Curvature-Augmented Self-supervised Learning (CASL) framework based on a reconstruction paradigm. Built upon the classical U-Net architecture, our approach introduces multi-scale curvature prompts to guide the decoder in predicting the coordinates of each point. Without relying on any dedicated anomaly detection mechanisms, it achieves leading detection performance through straightforward anomaly classification fine-tuning. Moreover, the learned representations generalize well to standard 3D understanding tasks such as point cloud classification.
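As a hedged illustration of the curvature observation above, the sketch below scores each point by its local "surface variation" computed from the PCA eigenvalues of its neighborhood covariance; the neighborhood size and this particular curvature proxy are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np
from scipy.spatial import cKDTree

def curvature_scores(points: np.ndarray, k: int = 16) -> np.ndarray:
    """points: (N, 3) point cloud; returns a per-point anomaly score."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)            # k nearest neighbors per point
    scores = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        nb = points[nbrs] - points[nbrs].mean(axis=0)
        cov = nb.T @ nb / k                     # local 3x3 covariance
        lam = np.linalg.eigvalsh(cov)           # ascending eigenvalues
        scores[i] = lam[0] / max(lam.sum(), 1e-12)  # surface variation in [0, 1/3]
    return scores
```

Points with high scores lie on locally non-flat regions, which is why this single geometric cue already provides a competitive anomaly signal.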
AAAI Conference 2026 Conference Paper
The generalization capability of deepfake detectors is crucial for real-world applications. Data augmentation to generate synthetic fake faces has served as an effective strategy to enhance generalization. Interestingly, current state-of-the-art (SoTA) methods rely on fixed augmentation strategies, raising a fundamental question: Can a single static augmentation approach suffice, or does the diversity of forgery features necessitate dynamic strategies? We argue that existing methods overlook the evolving complexity of real-world forgery patterns, such as facial warping, expression manipulation, and compression artifacts, which cannot be fully simulated by fixed policies. To bridge this gap, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework that guides the detector to progressively master multi-domain forgery features from simple to complex. CRDA synthesizes augmented samples using a configurable pool of forgery operations and dynamically generates adversarial samples tailored to the detector’s current learning state. Key to our approach is the integration of reinforcement learning (RL) and causal inference. To efficiently explore the vast augmentation space, an RL agent dynamically selects augmentation actions based on the detector’s performance, ensuring continuous adaptation to increasingly challenging forgeries. Simultaneously, the agent’s output is designed to introduce variations in action spaces, generating heterogeneous forgery patterns. These variations are guided by causal inference theory, which mitigates spurious correlations by suppressing task-irrelevant biases and forcing the model to focus on causally invariant features. This integration ensures robust generalization by decoupling synthetic augmentation patterns from the model’s learned representations. Extensive experiments demonstrate that the proposed method significantly improves the generalizability of the detector, achieving superior performance compared to state-of-the-art methods on multiple cross-domain datasets.
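A minimal sketch of the dynamic-selection idea: a bandit-style agent picks the next augmentation according to how much each operation currently challenges the detector, using the detector's loss as the reward. The epsilon-greedy policy and reward definition are simplifying assumptions; CRDA's actual RL agent and curriculum are more elaborate.

```python
import random

class AugmentationAgent:
    """Epsilon-greedy selection over a pool of forgery augmentation ops."""

    def __init__(self, ops, epsilon: float = 0.1, lr: float = 0.1):
        self.ops = list(ops)                    # augmentation callables
        self.values = {op: 0.0 for op in self.ops}
        self.epsilon, self.lr = epsilon, lr

    def select(self):
        if random.random() < self.epsilon:      # explore a random op
            return random.choice(self.ops)
        return max(self.ops, key=lambda op: self.values[op])  # exploit

    def update(self, op, detector_loss: float):
        # A higher detector loss means the op produced a harder sample,
        # so its estimated value increases (running-average update).
        self.values[op] += self.lr * (detector_loss - self.values[op])
```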
AAAI Conference 2025 Conference Paper
Transformer-based and MLP-based methods have emerged as leading approaches in time series forecasting (TSF). However, real-world time series often show different patterns at different scales, and future changes are shaped by the interplay of these overlapping scales, requiring high-capacity models. While Transformer-based methods excel in capturing long-range dependencies, they suffer from high computational complexity and tend to overfit. Conversely, MLP-based methods offer computational efficiency and adeptness in modeling temporal dynamics, but they struggle to effectively capture temporal patterns with complex scales. Based on the observation of the multi-scale entanglement effect in time series, we propose a novel MLP-based Adaptive Multi-Scale Decomposition (AMD) framework for TSF. Our framework decomposes time series into distinct temporal patterns at multiple scales, leveraging the Multi-Scale Decomposable Mixing (MDM) block to dissect and aggregate these patterns. Complemented by the Dual Dependency Interaction (DDI) block and the Adaptive Multi-predictor Synthesis (AMS) block, our approach effectively models both temporal and channel dependencies and utilizes autocorrelation to refine multi-scale data integration. Comprehensive experiments demonstrate that our AMD framework not only overcomes the limitations of existing methods but also consistently achieves state-of-the-art performance across various datasets.
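The decomposition step can be sketched as follows: moving averages at several window sizes peel off progressively coarser trends, leaving a detail component per scale. The window sizes and residual scheme are illustrative assumptions; AMD's MDM block additionally mixes these scales with learned weights.

```python
import torch
import torch.nn.functional as F

def multi_scale_decompose(x: torch.Tensor, scales=(4, 8, 16)):
    """x: (batch, channels, length) -> per-scale details plus coarsest trend."""
    details, residual = [], x
    for s in scales:
        trend = F.avg_pool1d(residual, kernel_size=s, stride=1,
                             padding=s // 2, count_include_pad=False)
        trend = trend[..., :x.size(-1)]        # crop the padding overshoot
        details.append(residual - trend)       # pattern specific to this scale
        residual = trend                       # pass the trend to coarser scales
    return details, residual
```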
AAAI Conference 2025 Conference Paper
Deep learning (e.g., Transformer) has been widely and successfully used in multivariate time series forecasting (MTSF). Unlike existing methods that focus on training models from a single modality of time series input, large language model (LLM) based MTSF methods with cross-modal text and time series input have recently shown great superiority, especially with limited temporal data. However, current LLM-based MTSF methods usually focus on adapting and fine-tuning LLMs, while neglecting the distribution discrepancy between textual and temporal input tokens, thus leading to sub-optimal performance. To address this issue, we propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF by reducing the distribution discrepancy between textual and temporal data, which mainly consists of the temporal target branch with temporal input and the textual source branch with aligned textual input. To reduce the distribution discrepancy, we develop the cross-modal match module to first align cross-modal input distributions. Additionally, to minimize the modality distribution gap in both feature and output spaces, a feature regularization loss is developed to align the intermediate features between the two branches for better weight updates, while an output consistency loss is introduced to allow the output representations of both branches to correspond effectively. Thanks to the modality alignment, CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks with low computational complexity, and exhibits favorable few-shot and zero-shot abilities similar to those of LLMs.
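A hedged sketch of the two alignment objectives: an L1 feature regularization between intermediate features of the textual and temporal branches, and an output consistency term between their predictions. The exact loss forms and weights here are assumptions rather than CALF's tuned formulation.

```python
import torch.nn.functional as F

def alignment_losses(text_feats, time_feats, text_out, time_out,
                     w_feat: float = 0.01, w_out: float = 1.0):
    """text_feats/time_feats: lists of intermediate features, matched by layer."""
    # Feature regularization: align intermediate features across branches.
    feat_loss = sum(F.l1_loss(tm, tx) for tm, tx in zip(time_feats, text_feats))
    # Output consistency: temporal predictions track the textual branch.
    out_loss = F.mse_loss(time_out, text_out.detach())
    return w_feat * feat_loss + w_out * out_loss
```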
AAAI Conference 2025 Conference Paper
Knowledge distillation (KD) has recently gained great success in the field of object detection. By transferring the knowledge of the spatial or channel domain from the teacher model to the student model, it allows for a more compact representation with minimal performance loss. Despite this progress, existing KD methods typically treat knowledge from spatial or channel domains independently, ignoring the exploitation of the mutual relationship between these domains. In this work, we first explore the connection between spatial and channel domains and find there exists a strong correlation between them, i.e., the salient channels tend to contain significant object regions in the spatial domain. Motivated by this observation, we propose DCSF-KD, a novel Dynamic Channel-wise Spatial Feature Knowledge Distillation framework for object detection by fully exploiting both spatial and channel knowledge. Specifically, we introduce channel-wise spatial feature distillation and global channel attention distillation, using information from both domains to improve the accuracy of the student network. Experiments demonstrate that our DCSF-KD outperforms existing detection methods on both homogeneous and heterogeneous teacher-student network pairs. For example, with the MaskRCNN-Swin detector as the teacher, our DCSF-KD achieves 41.9% and 44.1% mAP on MS COCO with RetinaNet and FCOS students using ResNet-50, respectively.
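The observed spatial-channel correlation suggests a distillation loss of roughly the following shape, where per-channel saliency from the teacher reweights a spatial imitation loss; the global-average-pooling saliency and softmax temperature are illustrative assumptions, not DCSF-KD's exact modules.

```python
import torch
import torch.nn.functional as F

def channel_weighted_spatial_kd(f_teacher: torch.Tensor,
                                f_student: torch.Tensor,
                                tau: float = 1.0) -> torch.Tensor:
    """f_*: (B, C, H, W) teacher/student features of matching shape."""
    saliency = f_teacher.mean(dim=(2, 3))            # (B, C) channel saliency
    weights = F.softmax(saliency / tau, dim=1)       # salient channels dominate
    per_channel_mse = ((f_teacher - f_student) ** 2).mean(dim=(2, 3))
    return (weights * per_channel_mse).sum(dim=1).mean()
```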
AAAI Conference 2025 Conference Paper
Diffusion models represent the state-of-the-art in generative modeling. Due to their high training costs, many works leverage pre-trained diffusion models' powerful representations for downstream tasks, such as face super-resolution (FSR), through fine-tuning or prior-based methods. However, relying solely on priors without supervised training makes it challenging to meet the pixel-level accuracy requirements of discrimination tasks. Although prior-based methods can achieve high fidelity and high-quality results, ensuring consistency remains a significant challenge. In this paper, we propose a masking strategy with strong and weak constraints and iterative refinement for real-world FSR, termed Diffusion Prior Interpolation (DPI). We introduce conditions and constraints on consistency by masking different sampling stages based on the structural characteristics of the face. Furthermore, we propose a condition Corrector (CRT) to establish a reciprocal posterior sampling process. DPI can balance consistency and diversity and can be seamlessly integrated into pre-trained models. In extensive experiments conducted on synthetic and real datasets, along with consistency validation in face recognition, DPI demonstrates superiority over SOTA FSR methods.
IJCAI Conference 2025 Conference Paper
Implicit neural representation (INR) aims to represent continuous-domain signals via implicit neural functions and has achieved great success in arbitrary-scale image super-resolution (SR). However, most existing INR-based SR methods focus on learning implicit features from independent coordinates, while neglecting interactions among neighboring coordinates, thus resulting in limited contextual awareness. In this paper, rethinking the forward process of implicit neural functions as a signal diffusion process, we propose a novel Diffusion Iterative Implicit Network (DIIN) for arbitrary-scale SR to promote global signal flow with neighborhood interactions. The DIIN framework mainly consists of stacked Diffusion Iteration Layers with a dictionary cross-attention block to enrich the iterative update process with supplementary information. Besides, we develop the Position-Aware Embedding Block to strengthen spatial dependencies between consecutive input samples. Extensive experiments on public datasets demonstrate that our method achieves state-of-the-art or competitive performance, highlighting its effectiveness and efficiency for arbitrary-scale SR. Our code is available at https://github.com/Song-1205/DIIN.
IJCAI Conference 2025 Conference Paper
Low-rank regularization (LRR) has been widely applied in various machine learning tasks, but the associated optimization is challenging. Directly optimizing the rank function under constraints is NP-hard in general. To overcome this difficulty, various relaxations of the rank function were studied. However, optimization of these relaxed LRRs typically depends on singular value decomposition, which is a time-consuming and nondifferentiable operator that cannot be optimized with gradient-based techniques. To address these challenges, in this paper we propose an efficient differentiable approximation of the generalized LRR. The considered LRR form subsumes many popular choices like the nuclear norm, the Schatten-p norm, and various nonconvex relaxations. Our method enables LRR terms to be appended to loss functions in a plug-and-play fashion, and the GPU-friendly operations enable efficient and convenient implementation. Furthermore, convergence analysis is presented, which rigorously shows that both the bias and the variance of our rank estimator rapidly reduce with increased sample size and iteration steps. In the experimental study, the proposed method is applied to various tasks, which demonstrates its versatility and efficiency. Code is available at https://github.com/naiqili/EDLRR.
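The plug-and-play usage pattern reads as below; for clarity the sketch uses a nuclear-norm term computed with torch.linalg.svdvals, which is exactly the slow SVD-based step that the paper's differentiable approximation is designed to replace.

```python
import torch

def loss_with_lrr(task_loss: torch.Tensor, W: torch.Tensor,
                  lam: float = 1e-3) -> torch.Tensor:
    # Nuclear norm (sum of singular values) as the low-rank penalty; the
    # paper's estimator would substitute an SVD-free, GPU-friendly term here.
    nuclear_norm = torch.linalg.svdvals(W).sum()
    return task_loss + lam * nuclear_norm

# Usage: loss = loss_with_lrr(criterion(pred, target), W); loss.backward()
```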
AAAI Conference 2025 Conference Paper
Vision-language retrieval (VLR) has attracted significant attention in both academia and industry, which involves using text (or images) as queries to retrieve corresponding images (or text). However, existing methods often neglect the rich visual semantics knowledge of entities, thus leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.
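A hedged sketch of query enrichment: an LLM is prompted for a short visual description of each entity, and the descriptions are appended to the raw query before text encoding. The prompt wording and the ask_llm helper are hypothetical stand-ins, and the rewriting step (EaRW) is omitted.

```python
def build_evd_query(query: str, entities: list[str], ask_llm) -> str:
    """ask_llm: hypothetical callable that sends a prompt to an LLM."""
    descriptions = []
    for entity in entities:
        prompt = (f"Describe the visual appearance of '{entity}' "
                  f"in one short phrase.")
        descriptions.append(f"{entity}: {ask_llm(prompt)}")
    # EVD-enhanced query: raw text plus a visual cue for each entity.
    return query + " | " + "; ".join(descriptions)
```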
AAAI Conference 2025 Conference Paper
The sampling strategy (e.g., fixed farthest point sampling) of point clouds has been an essential step for developing practical solutions in 3D computer vision tasks. Previous fixed sampling is simple, but suffers from suboptimal performance on downstream tasks. To adapt to target networks properly, adaptive sampling methods with trainable parameters have recently been developed to enhance performance. However, existing adaptive sampling methods still suffer from over-coupling with the target network, and thus become model-specific, which limits their practical applications. To address this issue, we propose a novel general cross-scale decoupled sampling method (GCD-sampling) for point clouds, which consists of original feature cache, cross-scale feature fusion and convex combination learning for better feature extraction. To reduce the coupling relationship with the target task network, our method only utilizes the point cloud coordinates as its own input and output. Besides, we introduce an arbitrary scale structure to enable parameter sharing across multi-scale sampling in point cloud networks. Extensive experiments on different architectures demonstrate the effectiveness of our method over other existing adaptive sampling methods.
IJCAI Conference 2025 Conference Paper
3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. This study presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the 3D AD task. MC3D-AD is evaluated on the two publicly available datasets Real3D-AD and Anomaly-ShapeNet, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvements in object-level AUROC on Real3D-AD and Anomaly-ShapeNet, respectively. The code is available at https://github.com/iCAN-SZU/MC3D-AD.
IJCAI Conference 2025 Conference Paper
Point clouds, as a primary representation of 3D data, can be categorized into scene domain point clouds and object domain point clouds. Point cloud self-supervised learning (SSL) has become a mainstream paradigm for learning 3D representations. However, existing point cloud SSL primarily focuses on learning domain-specific 3D representations within a single domain, neglecting the complementary nature of cross-domain knowledge, which limits the learning of 3D representations. In this paper, we propose to learn a comprehensive Point cloud Mixture-of-Domain-Experts model (Point-MoDE) via a block-to-scene pre-training strategy. Specifically, we first propose a mixture-of-domain-expert model consisting of scene domain experts and multiple shared object domain experts. Furthermore, we propose a block-to-scene pre-training strategy, which leverages the features of point blocks in the object domain to regress their initial positions in the scene domain through object-level block mask reconstruction and scene-level block position regression. By integrating the complementary knowledge between object and scene, this strategy simultaneously facilitates the learning of both object-domain and scene-domain representations, leading to a more comprehensive 3D representation. Extensive experiments in downstream tasks demonstrate the superiority of our model.
NeurIPS Conference 2024 Conference Paper
Adaptation of pretrained vision-language models such as CLIP to various downstream tasks has attracted great interest in recent research. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TPT necessitate entropy minimization that involves large computational overhead, while training-free methods like TDA overlook the potential for information mining from the test samples themselves. In this paper, we break down the design of existing popular training-required and training-free TTA methods and bridge the gap between them within our framework. Specifically, we maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples. The historical samples are filtered from the testing data stream and serve to extract useful information from the target distribution, while the boosting samples are drawn from regional bootstrapping and capture the knowledge of the test sample itself. We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets, showcasing its applicability in real-world situations.
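A minimal sketch of the key-value memory at the core of the framework: normalized features of past test samples are cached as keys with pseudo-labels as values, and retrieval-weighted votes adjust the zero-shot logits at test time, with no training. Capacity, temperature, and the blending weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class KVMemory:
    def __init__(self, num_classes: int, capacity: int = 64, beta: float = 5.0):
        self.keys, self.values = [], []
        self.num_classes, self.capacity, self.beta = num_classes, capacity, beta

    def add(self, feat: torch.Tensor, pseudo_label: torch.Tensor):
        """feat: (D,) image feature; pseudo_label: scalar long tensor."""
        if len(self.keys) >= self.capacity:         # evict the oldest entry
            self.keys.pop(0); self.values.pop(0)
        self.keys.append(F.normalize(feat, dim=-1))
        self.values.append(F.one_hot(pseudo_label, self.num_classes).float())

    def adjust(self, feat: torch.Tensor, zero_shot_logits: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
        if not self.keys:
            return zero_shot_logits
        K, V = torch.stack(self.keys), torch.stack(self.values)  # (M, D), (M, C)
        sim = F.normalize(feat, dim=-1) @ K.T                    # cosine sims
        return zero_shot_logits + alpha * ((self.beta * sim).softmax(-1) @ V)
```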
IJCAI Conference 2024 Conference Paper
Recently developed generative methods, including invertible rescaling network (IRN) based and generative adversarial network (GAN) based methods, have demonstrated exceptional performance in image rescaling. However, IRN-based methods tend to produce over-smoothed results, while GAN-based methods easily generate fake details, which hinders their real-world application. To address this issue, we propose Boundary-aware Decoupled Flow Networks (BDFlow) to generate realistic and visually pleasing results. Unlike previous methods that model high-frequency information as a standard Gaussian distribution directly, our BDFlow first decouples the high-frequency information into a semantic high-frequency part that adheres to a Boundary distribution and a non-semantic high-frequency counterpart that adheres to a Gaussian distribution. Specifically, to capture semantic high-frequency parts accurately, we use a Boundary-aware Mask (BAM) to constrain the model to produce rich textures, while the non-semantic high-frequency part is randomly sampled from a Gaussian distribution. Comprehensive experiments demonstrate that our BDFlow significantly outperforms other state-of-the-art methods while maintaining lower complexity. Notably, our BDFlow improves the PSNR by 4.4 dB and the SSIM by 0.1 on average over GRAIN, utilizing only 74% of the parameters and 20% of the computation. The code will be available at https://github.com/THU-Kingmin/BAFlow.
NeurIPS Conference 2024 Conference Paper
Deep neural networks (DNNs) have recently achieved remarkable advancements in time series forecasting (TSF) due to their powerful ability of sequence dependence modeling. To date, existing DNN-based TSF methods still suffer from unreliable predictions for real-world data due to its non-stationary nature, i.e., the data distribution varies quickly over time. To mitigate this issue, several normalization methods (e.g., SAN) have recently been specifically designed by normalization in a fixed period/window in the time domain. However, these methods still struggle to capture distribution variations, due to the complex time patterns of time series in the time domain. Based on the fact that wavelet transform can decompose time series into a linear combination of different frequencies, which exhibits distribution variations with time-varying periods, we propose a novel Dual-domain Dynamic Normalization (DDN) to dynamically capture distribution variations in both time and frequency domains. Specifically, our DDN tries to eliminate the non-stationarity of time series via both frequency and time domain normalization in a sliding-window way. Besides, our DDN can serve as a plug-and-play module, and thus can be easily incorporated into other forecasting models. Extensive experiments on public benchmark datasets under different forecasting models demonstrate the superiority of our DDN over other normalization methods. Code will be made available following the review process.
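A simplified sketch of normalizing in both domains over a window: the input is standardized in time, its rFFT magnitudes are rescaled, and the saved statistics de-normalize the forecast. DDN itself models how these statistics evolve rather than reusing the input window's statistics verbatim, so treat this only as the structural idea.

```python
import torch

def dual_domain_normalize(x: torch.Tensor, eps: float = 1e-5):
    """x: (batch, length, channels) -> normalized x and saved statistics."""
    mean, std = x.mean(1, keepdim=True), x.std(1, keepdim=True) + eps
    x_t = (x - mean) / std                          # time-domain normalization
    spec = torch.fft.rfft(x_t, dim=1)
    mag = spec.abs().mean(1, keepdim=True) + eps    # frequency magnitude scale
    x_n = torch.fft.irfft(spec / mag, n=x.size(1), dim=1)
    return x_n, (mean, std, mag)

def de_normalize(y: torch.Tensor, stats) -> torch.Tensor:
    mean, std, mag = stats                          # reapply saved statistics
    spec = torch.fft.rfft(y, dim=1) * mag
    return torch.fft.irfft(spec, n=y.size(1), dim=1) * std + mean
```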
IJCAI Conference 2024 Conference Paper
Transformer-based models have been widely and successfully used in various low-level vision tasks, and have achieved remarkable performance in single image super-resolution (SR). Despite the significant progress in SR, Transformer-based SR methods (e.g., SwinIR) still suffer from the problems of heavy computation cost and low-frequency preference, while ignoring the reconstruction of rich high-frequency information, hence hindering the representational power of Transformers. To address these issues, in this paper, we propose a novel Frequency-aware Transformer (FreqFormer) for lightweight image SR. Specifically, a Frequency Division Module (FDM) is first introduced to separately handle high- and low-frequency information in a divide-and-conquer manner. Moreover, we present a Frequency-aware Transformer Block (FTB) that extracts both spatial frequency attention and channel transposed attention to recover high-frequency details. Extensive experimental results on public datasets demonstrate the superiority of our FreqFormer over state-of-the-art SR methods in terms of both quantitative metrics and visual quality. Code and models are available at https://github.com/JPWang-CS/FreqFormer.
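The divide-and-conquer split can be pictured with a fixed FFT low-pass mask, as sketched below: the low-frequency part of a feature map is isolated in the spectrum and the high-frequency part is the residual. The cutoff ratio is an assumption, and FreqFormer's FDM is a learned module rather than this fixed mask.

```python
import torch

def frequency_divide(x: torch.Tensor, cutoff: float = 0.25):
    """x: (B, C, H, W) -> (low-frequency part, high-frequency part)."""
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    _, _, H, W = x.shape
    yy = torch.arange(H).view(-1, 1) - H // 2
    xx = torch.arange(W).view(1, -1) - W // 2
    radius = ((yy / H) ** 2 + (xx / W) ** 2).sqrt()   # normalized frequency
    mask = (radius <= cutoff).to(spec.dtype)          # centered low-pass mask
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return low, x - low                               # high frequency = residual
```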
IJCAI Conference 2024 Conference Paper
Traditional QR codes consist of a grid of black-and-white square modules, which lack aesthetic appeal and meaning for human perception. This has motivated recent research to beautify the visual appearance of QR codes. However, there exists a trade-off between the visual quality and scanning robustness of the image, so the outputs of previous works remain simple and of low quality in order to ensure scanning robustness. In this paper, we introduce a novel approach, GladCoder, to generate stylized QR codes that are personalized, natural, and text-driven. Its pipeline includes a Depth-guided Aesthetic QR code Generator (DAG) to improve the quality of the image foreground, and a GrayscaLe-Aware Denoising (GLAD) process to enhance scanning robustness. The overall pipeline is based on diffusion models, which allow users to create stylized QR images from a textual prompt that describes the image and a textual input to be encoded. Experiments demonstrate that our method can generate stylized QR codes with appealing perceptual details, while maintaining robust scanning reliability in real-world applications.
IJCAI Conference 2024 Conference Paper
Invertible Rescaling Networks (IRNs) and their variants have witnessed remarkable achievements in various image processing tasks like image rescaling. However, we observe that IRNs with deeper networks are difficult to train, thus hindering the representational ability of IRNs. To address this issue, we propose Invertible Residual Rescaling Models (IRRM) for image rescaling by learning a bijection between a high-resolution image and its low-resolution counterpart with a specific distribution. Specifically, we propose IRRM to build a deep network, which contains several Residual Downscaling Modules (RDMs) with long skip connections. Each RDM consists of several Invertible Residual Blocks (IRBs) with short connections. In this way, RDM allows rich low-frequency information to be bypassed by skip connections and forces models to focus on extracting high-frequency information from the image. Extensive experiments show that our IRRM performs significantly better than other state-of-the-art methods with much fewer parameters and complexity. Particularly, our IRRM achieves PSNR gains of at least 0.3 dB over HCFlow and IRN on ×4 rescaling while using only 60% of the parameters and 50% of the FLOPs. The code will be available at https://github.com/THU-Kingmin/IRRM.
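For intuition about the bijection being learned, the sketch below shows an additive coupling block, a standard invertible building unit whose inverse is exact and cheap; it illustrates the machinery behind invertible rescaling in general, not IRRM's specific Invertible Residual Block.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """y1 = x1, y2 = x2 + f(x1); invertible for any inner network f."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0
        half = channels // 2
        self.f = nn.Sequential(nn.Conv2d(half, half, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(half, half, 3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)
        return torch.cat([x1, x2 + self.f(x1)], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y.chunk(2, dim=1)               # exactly recovers the input
        return torch.cat([y1, y2 - self.f(y1)], dim=1)
```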
NeurIPS Conference 2024 Conference Paper
The pre-trained point cloud model based on Masked Point Modeling (MPM) has exhibited substantial improvements across various tasks. However, these models heavily rely on the Transformer, leading to quadratic complexity and a limited decoder, hindering their practical application. To address this limitation, we first conduct a comprehensive analysis of existing Transformer-based MPM, emphasizing the idea that redundancy reduction is crucial for point cloud analysis. To this end, we propose a Locally constrained Compact point cloud Model (LCM) consisting of a locally constrained compact encoder and a locally constrained Mamba-based decoder. Our encoder replaces self-attention with our local aggregation layers to achieve an elegant balance between performance and efficiency. Considering the varying information density between masked and unmasked patches in the decoder inputs of MPM, we introduce a locally constrained Mamba-based decoder. This decoder ensures linear complexity while maximizing the perception of point cloud geometry information from unmasked patches with higher information density. Extensive experimental results show that our compact model significantly surpasses existing Transformer-based models in both performance and efficiency; in particular, our LCM-based Point-MAE model improves performance over its Transformer-based counterpart by 1.84%, 0.67%, and 0.60% on the three variants of ScanObjectNN while reducing parameters by 88% and computation by 73%. The code is available at https://github.com/zyh16143998882/LCM.
NeurIPS Conference 2024 Conference Paper
Designing single-task image restoration models for specific degradation has seen great success in recent years. To achieve generalized image restoration, all-in-one methods have recently been proposed and shown potential for multiple restoration tasks using one single model. Despite the promising results, the existing all-in-one paradigm still suffers from high computational costs as well as limited generalization on unseen degradations. In this work, we introduce an alternative solution to improve the generalization of image restoration models. Drawing inspiration from recent advancements in Parameter Efficient Transfer Learning (PETL), we aim to tune only a small number of parameters to adapt pre-trained restoration models to various tasks. However, current PETL methods fail to generalize across varied restoration tasks due to their homogeneous representation nature. To this end, we propose AdaptIR, a Mixture-of-Experts (MoE) with orthogonal multi-branch design to capture local spatial, global spatial, and channel representation bases, followed by adaptive base combination to obtain heterogeneous representation for different degradations. Extensive experiments demonstrate that our AdaptIR achieves stable performance on single-degradation tasks, and excels in hybrid-degradation tasks, while training only 0.6% of the parameters for 8 hours.
AAAI Conference 2024 Conference Paper
Level generation is a central focus of Procedural Content Generation (PCG), yet deep learning-based approaches are limited by scarce training data, i.e., human-designed levels. Despite being a dominant framework, Generative Adversarial Networks (GANs) exhibit a substantial quality gap between generated and human-authored levels, alongside rising training costs, particularly with increasing token complexity. In this paper, we introduce a diffusion-based generative model that learns from just one example. Our approach involves two core components: 1) an efficient yet expressive level representation, and 2) a latent denoising network with constrained receptive fields. To start with, our method utilizes token semantic labels, similar to word embeddings, to provide dense representations. This strategy not only surpasses one-hot encoding in representing larger game levels but also improves stability and accelerates convergence in latent diffusion. In addition, we adapt the denoising network architecture to confine the receptive field to localized patches of the data, aiming to facilitate single-example learning. Extensive experiments demonstrate that our model is capable of generating stylistically congruent samples of arbitrary sizes compared to manually designed levels. It suits a wide range of level structures with fewer artifacts than GAN-based approaches. The source code is available at https://github.com/shiqi-dai/diffusioncraft.
NeurIPS Conference 2024 Conference Paper
Recent advances in diffusion-based Large Restoration Models (LRMs) have significantly improved photo-realistic image restoration by leveraging the internal knowledge embedded within model weights. However, existing LRMs often suffer from the hallucination dilemma, i.e., producing incorrect contents or textures when dealing with severe degradations, due to their heavy reliance on limited internal knowledge. In this paper, we propose an orthogonal solution called the Retrieval-augmented Framework for Image Restoration (ReFIR), which incorporates retrieved images as external knowledge to extend the knowledge boundary of existing LRMs in generating details faithful to the original scene. Specifically, we first introduce the nearest neighbor lookup to retrieve content-relevant high-quality images as reference, after which we propose the cross-image injection to modify existing LRMs to utilize high-quality textures from retrieved images. Thanks to the additional external knowledge, our ReFIR can well handle the hallucination challenge and facilitate faithful results. Extensive experiments demonstrate that ReFIR can achieve not only high-fidelity but also realistic restoration results. Importantly, our ReFIR requires no training and is adaptable to various LRMs.
AAAI Conference 2024 Conference Paper
Learning 3D representation plays a critical role in masked autoencoder (MAE) based pre-training methods for point cloud, including single-modal and cross-modal based MAE. Specifically, although cross-modal MAE methods learn strong 3D representations via auxiliary knowledge from other modalities, they often suffer from heavy computational burdens and heavily rely on massive cross-modal data pairs that are often unavailable, which hinders their applications in practice. Instead, single-modal methods with solely point clouds as input are preferred in real applications due to their simplicity and efficiency. However, such methods easily suffer from limited 3D representations with global random mask input. To learn compact 3D representations, we propose a simple yet effective Point Feature Enhancement Masked Autoencoders (Point-FEMAE), which mainly consists of a global branch and a local branch to capture latent semantic features. Specifically, to learn more compact features, a shared-parameter Transformer encoder is introduced to extract point features from the global and local unmasked patches obtained by global random and local block mask strategies, followed by a specific decoder for reconstruction. Meanwhile, to further enhance features in the local branch, we propose a Local Enhancement Module with local patch convolution to perceive fine-grained local context at larger scales. Our method significantly improves the pre-training efficiency compared to cross-modal alternatives, and extensive downstream experiments underscore the state-of-the-art effectiveness, particularly outperforming our baseline (Point-MAE) by 5.16%, 5.00%, and 5.04% on three variants of ScanObjectNN, respectively. Code is available at https://github.com/zyh16143998882/AAAI24-PointFEMAE.
AAAI Conference 2024 Conference Paper
In recent years, vision-language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvement on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models, and fail to extract universal 3D vision-language embeddings that generalize well. We carefully investigate three common tasks in semantic 3D scene understanding, and derive key insights into the development of a pre-training model. Motivated by these observations, we propose a vision-language pre-training framework 3DVLP (3D vision-language pre-training with object contrastive learning), which transfers flexibly on 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design the Object-level Cross-Contrastive alignment (OCC) task and the Object-level Self-Contrastive learning (OSC) task to align the objects with descriptions and distinguish different objects in the scene, respectively. Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding. Code is available at https://github.com/iridescentttt/3DVLP.
AAAI Conference 2023 Conference Paper
Deep neural networks (DNNs) have witnessed remarkable achievement in image super-resolution (SR), and plenty of DNN-based SR models with elaborated network designs have recently been proposed. However, existing methods usually require substantial computations by operating in the spatial domain. To address this issue, we propose a general frequency-oriented framework (FSR) to accelerate SR networks by considering data characteristics in the frequency domain. Our FSR mainly contains a dual feature aggregation module (DFAM) to extract informative features in both spatial and transform domains, followed by a four-path SR-Module with different capacities to super-resolve in the frequency domain. Specifically, DFAM further consists of a transform attention block (TABlock) and a spatial context block (SCBlock) to extract global spectral information and local spatial information, respectively, while SR-Module is a parallel network container that contains four to-be-accelerated branches. Furthermore, we propose an adaptive weight strategy for a trade-off between image detail recovery and visual quality. Extensive experiments show that our FSR can save FLOPs by almost 40% while reducing inference time by 50% for other SR methods (e.g., FSRCNN, CARN, SRResNet and RCAN). Code is available at https://github.com/THU-Kingmin/FSR.
AAAI Conference 2023 Conference Paper
Beyond achieving higher compression efficiency over classical image compression codecs, deep image compression is expected to be improved with additional side information, e.g., another image from a different perspective of the same scene. To better utilize the side information under the distributed compression scenario, the existing method only implements patch matching in the image domain to solve the parallax problem caused by the difference in viewing points. However, patch matching in the image domain is not robust to variations in scale, shape, and illumination caused by different viewing angles, and cannot make full use of the rich texture information of the side information image. To resolve this issue, we propose Multi-Scale Feature Domain Patch Matching (MSFDPM) to fully utilize side information at the decoder of the distributed image compression model. Specifically, MSFDPM consists of a side information feature extractor, a multi-scale feature domain patch matching module, and a multi-scale feature fusion network. Furthermore, we reuse inter-patch correlation from the shallow layer to accelerate the patch matching of the deep layer. Finally, we find that our patch matching in a multi-scale feature domain further improves the compression rate by about 20% compared with patch matching in the image domain.
AAAI Conference 2023 Conference Paper
Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge by a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, while ignoring the rich semantic information inherent in image-text pairs. Instead, recently developed open-vocabulary (OV) based methods succeed in exploiting such information of image-text pairs in object detection, and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To further enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets.
IJCAI Conference 2023 Conference Paper
Scene text image super-resolution (STISR), aiming to improve image quality while boosting downstream scene text recognition accuracy, has recently achieved great success. However, most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process, and neglect the disturbance from the complex background, thus limiting the performance. To address these issues, in this paper, we propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution. To model the location of characters effectively, we propose the location enhancement module to extract character region features based on the attention map sequence. Besides, we propose the multi-modal alignment module to perform bidirectional visual-semantic alignment to generate high-quality prior guidance, which is then incorporated into the super-resolution branch in an adaptive manner using the proposed adaptive fusion module. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the superiority of our method over other state-of-the-art methods. Code is available at https://github.com/csguoh/LEMMA.
AAAI Conference 2022 Conference Paper
The high efficiency in computation and storage makes hashing (including binary hashing and quantization) a common strategy in large-scale retrieval systems. To alleviate the reliance on expensive annotations, unsupervised deep hashing becomes an important research problem. This paper provides a novel solution to unsupervised deep quantization, namely Contrastive Quantization with Code Memory (MeCoQ). Different from existing reconstruction-based strategies, we learn unsupervised binary descriptors by contrastive learning, which can better capture discriminative visual semantics. Besides, we uncover that codeword diversity regularization is critical to prevent contrastive learning-based quantization from model degeneration. Moreover, we introduce a novel quantization code memory module that boosts contrastive learning with lower feature drift than conventional feature memories. Extensive experiments on benchmark datasets show that MeCoQ outperforms state-of-the-art methods. Code and configurations are publicly released.
AAAI Conference 2021 Conference Paper
Recently, a variety of regularization techniques have been widely applied in deep neural networks, which mainly focus on regularizing the weight parameters to encourage generalization. Label regularization techniques have also been proposed to soften the labels, but they neglect the relations between classes. Among them, the technique of knowledge distillation proposes to distill the soft label, which contains the knowledge of class relations. However, this technique needs to pre-train an extra cumbersome teacher model. In this paper, we propose a method called Knowledge Refinery (KR), which enables the neural network to learn the relations between classes on-the-fly without the teacher-student training strategy. We propose the definition of decoupled labels, which consist of the original hard label and the residual label. To exhibit the generalization of KR, we evaluate our method in both fields of computer vision and natural language processing. Our empirical results show consistent performance gains under all experimental settings.
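A hedged sketch of training with decoupled labels: the target mixes the original hard label with a learnable per-class "residual" distribution that can capture class relations on-the-fly, with no teacher network. The mixing weight and the parameterization of the residual are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledLabelLoss(nn.Module):
    def __init__(self, num_classes: int, alpha: float = 0.9):
        super().__init__()
        self.alpha = alpha
        # One learnable residual distribution per ground-truth class.
        self.residual = nn.Parameter(torch.zeros(num_classes, num_classes))

    def forward(self, logits: torch.Tensor, target: torch.Tensor):
        hard = F.one_hot(target, logits.size(1)).float()
        soft = F.softmax(self.residual[target], dim=1)   # class-relation part
        mixed = self.alpha * hard + (1 - self.alpha) * soft
        return -(mixed * F.log_softmax(logits, dim=1)).sum(1).mean()
```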
AAAI Conference 2020 Conference Paper
Deep product quantization network (DPQN) has recently received much attention in fast image retrieval tasks due to its efficiency in encoding high-dimensional visual features, especially when dealing with large-scale datasets. Recent studies show that deep neural networks (DNNs) are vulnerable to input with small and maliciously designed perturbations (a.k.a. adversarial examples). This phenomenon raises the concern of security issues for DPQN in the testing/deploying stage as well. However, little effort has been devoted to investigating how adversarial examples affect DPQN. To this end, we propose product quantization adversarial generation (PQ-AG), a simple yet effective method to generate adversarial examples for product quantization based retrieval systems. PQ-AG aims to generate imperceptible adversarial perturbations for query images to form adversarial queries, whose nearest neighbors from a targeted product quantization model are not semantically related to those from the original queries. Extensive experiments show that our PQ-AG successfully creates adversarial examples to mislead targeted product quantization retrieval models. Besides, we find that our PQ-AG significantly degrades retrieval performance in both white-box and black-box settings.
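The attack setting can be sketched with a generic PGD loop that pushes a query's feature away from its clean feature so that the retrieved neighbors change; PQ-AG's actual objective targets the product quantization codebooks, which this simplification omits.

```python
import torch

def adversarial_query(model, x, eps=8 / 255, alpha=2 / 255, steps=10):
    """model: feature extractor; x: (B, C, H, W) query images in [0, 1]."""
    with torch.no_grad():
        clean_feat = model(x)                  # feature of the clean query
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        feat = model(x + delta)
        # Ascend on the negative similarity => push away from the clean feature.
        loss = -torch.cosine_similarity(feat, clean_feat, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)            # keep the perturbation small
            delta.grad.zero_()
    return (x + delta).detach()
```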
IJCAI Conference 2019 Conference Paper
Grassland degradation estimation is essential to prevent global land desertification and sandstorms. Typically, the key to such estimation is to measure the coverage of indicator plants. However, traditional methods of estimation rely heavily on human eyes and manual labor, thus inevitably leading to subjective results and high labor costs. In contrast, deep learning-based image segmentation algorithms are potentially capable of automatic assessment of the coverage of indicator plants. Nevertheless, a suitable image dataset comprising grassland images is not publicly available. To this end, we build an original Automatic Grassland Degradation Estimation Dataset (AGDE-Dataset), with a large number of grassland images captured from the wild. Based on AGDE-Dataset, we are able to propose a brand new scheme to automatically estimate grassland degradation, which mainly consists of two components. 1) Semantic segmentation: we design a deep neural network with an improved encoder-decoder structure to implement semantic segmentation of grassland images. In addition, we propose a novel Focal-Hinge Loss to alleviate the class imbalance of semantics in the training stage. 2) Degradation estimation: we provide the estimation of grassland degradation based on the results of semantic segmentation. Experimental results show that the proposed method achieves satisfactory accuracy in grassland degradation estimation.
IJCAI Conference 2017 Conference Paper
Most existing survey aggregation methods assume that the sample data follow a Gaussian distribution. However, these methods are sensitive to outliers, due to the thin-tailed property of the Gaussian distribution. To address this issue, we propose a robust survey aggregation method based on the Student-t distribution and sparse representation. Specifically, we assume that the samples follow a Student-t distribution, instead of the common Gaussian distribution. Due to the Student-t distribution, our method is robust to outliers, which can be explained from both Bayesian and non-Bayesian points of view. In addition, inspired by the James-Stein estimator (JS) and Compressive Averaging (CAvg), we propose to sparsely represent the global mean vector by an adaptive basis comprising both a data-specific basis and combined generic bases. Theoretically, we prove that JS and CAvg are special cases of our method. Extensive experiments demonstrate that our proposed method achieves significant improvement over the state-of-the-art methods on both synthetic and real datasets.
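The robustness claim has a concrete mechanical reading: under an EM view of the Student-t model, each sample receives a weight that shrinks as its distance from the current mean grows, so outliers are down-weighted automatically. The sketch below shows this for a univariate location estimate with fixed degrees of freedom, an illustrative simplification of the paper's aggregation model.

```python
import numpy as np

def student_t_mean(x: np.ndarray, nu: float = 3.0, iters: int = 50) -> float:
    """x: (n,) samples; returns a robust Student-t location estimate."""
    mu, var = np.median(x), np.var(x) + 1e-12
    for _ in range(iters):
        w = (nu + 1.0) / (nu + (x - mu) ** 2 / var)  # E-step: outlier weights
        mu = np.sum(w * x) / np.sum(w)               # M-step: weighted mean
        var = np.sum(w * (x - mu) ** 2) / len(x) + 1e-12
    return mu

# e.g. student_t_mean(np.array([1.0, 1.2, 0.9, 1.1, 50.0])) stays near 1.0,
# while the plain mean is dragged to ~10.8 by the single outlier.
```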
IJCAI Conference 2017 Conference Paper
Gaussian Process Regression (GPR) is a powerful Bayesian method. However, the performance of GPR can be significantly degraded when the training data are contaminated by outliers, including target outliers and input outliers. Although there are some variants of GPR (e.g., GPR with Student-t likelihood (GPRT)) aiming to handle outliers, most of the variants focus on handling the target outliers while little effort has been done to deal with the input outliers. In contrast, in this work, we aim to handle both the target outliers and the input outliers at the same time. Specifically, we replace the Gaussian noise in GPR with independent Student-t noise to cope with the target outliers. Moreover, to enhance the robustness w.r.t. the input outliers, we use a Student-t Process prior instead of the common Gaussian Process prior, leading to Student-t Process Regression with Student-t Likelihood (TPRT). We theoretically show that TPRT is more robust to both input and target outliers than GPR and GPRT, and prove that both GPR and GPRT are special cases of TPRT. Various experiments demonstrate that TPRT outperforms GPR and its variants on both synthetic and real datasets.