Arrow Research search

Author name cluster

Xiaochun Cao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

90 papers
2 author rows

Possible papers

90

AAAI Conference 2026 Conference Paper

FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

  • Hongyang Wang
  • Yichen Shi
  • Zhuofu Tao
  • Yuhao Gao
  • Liepiao Zhang
  • Xun Lin
  • Jun Feng
  • Xiaochen Yuan

Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization.
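At its core, the PVTM strategy described above amounts to randomly dropping a subset of vision tokens during training. A minimal sketch of that core operation (the prompt-guided weighting is omitted, and the function name and zero-masking choice are illustrative, not taken from the paper):

```python
import numpy as np

def mask_vision_tokens(tokens: np.ndarray, mask_ratio: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Zero out a random subset of vision tokens (rows of `tokens`).

    tokens: (num_tokens, dim) array of vision-token embeddings.
    mask_ratio: fraction of tokens to mask.
    """
    num_tokens = tokens.shape[0]
    num_masked = int(num_tokens * mask_ratio)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    out = tokens.copy()
    out[masked_idx] = 0.0  # zeroed here; real systems may use a learned [MASK] embedding
    return out
```

Masking a different random subset on each training step forces the model not to over-rely on any single region of the image, which is the generalization effect the abstract attributes to PVTM.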

AAAI Conference 2026 Conference Paper

GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations

  • Xinwei Liu
  • Xiaojun Jia
  • Yuan Xun
  • Simeng Qin
  • Xiaochun Cao

Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users' locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical solution to escalating privacy concerns.

NeurIPS Conference 2025 Conference Paper

A Closer Look at Graph Transformers: Cross-Aggregation and Beyond

  • Jiaming Zhuo
  • Ziyi Ma
  • Yintong Lu
  • Yuwei Liu
  • Kun Fu
  • Di Jin
  • Chuan Wang
  • Wu Wenning

Graph Transformers (GTs), which effectively capture long-range dependencies and structural biases simultaneously, have recently emerged as promising alternatives to traditional Graph Neural Networks (GNNs). Advanced approaches for GTs to leverage topology information involve integrating GNN modules or modulating node attributes using positional encodings. Unfortunately, the underlying mechanism driving their effectiveness remains insufficiently understood. In this paper, we revisit these strategies and uncover a shared underlying mechanism—Cross Aggregation—that effectively captures the interaction between graph topology and node attributes. Building on this insight, we propose the Universal Graph Cross-attention Transformer (UGCFormer), a universal GT framework with linear computational complexity. The idea is to interactively learn the representations of graph topology and node attributes through a linearized Dual Cross-attention (DCA) module. In theory, this module can adaptively capture interactions between these two types of graph information, thereby achieving effective aggregation. To alleviate overfitting arising from the dual-channel design, we introduce a consistency constraint that enforces representational alignment. Extensive evaluations on multiple benchmark datasets demonstrate the effectiveness and efficiency of UGCFormer.
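The abstract does not spell out how the DCA module achieves linear complexity, but linearized attention of the kind it references conventionally replaces softmax attention with a positive feature map, so the key-value product can be computed once per sequence. A generic sketch under that assumption (the feature map `phi` is illustrative):

```python
import numpy as np

def linear_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Kernelized attention: with a positive feature map phi, attention
    becomes phi(Q) @ (phi(K).T @ V), costing O(n) in the number of nodes
    instead of the O(n^2) of softmax attention."""
    phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive feature map (illustrative)
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                  # (d, d_v): computed once, independent of n
    z = Qp @ Kp.sum(axis=0)        # per-query normalizer
    return (Qp @ kv) / z[:, None]
```

Each output row is a convex combination of the rows of `V`, mirroring softmax attention's weighted average while avoiding the quadratic attention matrix.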

NeurIPS Conference 2025 Conference Paper

Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

  • Haotian Luo
  • Haiying He
  • Yibo Wang
  • Jinluan Yang
  • Rui Liu
  • Naiqiang Tan
  • Xiaochun Cao
  • Dacheng Tao

Recently, long-thought reasoning models have achieved strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement—or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models.

NeurIPS Conference 2025 Conference Paper

CamEdit: Continuous Camera Parameter Control for Photorealistic Image Editing

  • Xinran Qin
  • Zhixin Wang
  • Fan Li
  • Haoyu Chen
  • Renjing Pei
  • Wenbo Li
  • Xiaochun Cao

Recent advances in diffusion models have substantially improved text-driven image editing. However, existing frameworks based on discrete textual tokens struggle to support continuous control over camera parameters and smooth transitions in visual effects. These limitations hinder their applications to realistic, camera-aware, and fine-grained editing tasks. In this paper, we present CamEdit, a diffusion-based framework for photorealistic image editing that enables continuous and semantically meaningful manipulation of common camera parameters such as aperture and shutter speed. CamEdit incorporates a continuous parameter prompting mechanism and a parameter-aware modulation module that guides the model in smoothly adjusting focal plane, aperture, and shutter speed, reflecting the effects of varying camera settings within the diffusion process. To support supervised learning in this setting, we introduce CamEdit50K, a dataset specifically designed for photorealistic image editing with continuous camera parameter settings. It contains over 50k image pairs combining real and synthetic data with dense camera parameter variations across diverse scenes. Extensive experiments demonstrate that CamEdit enables flexible, consistent, and high-fidelity image editing, achieving state-of-the-art performance in camera-aware visual manipulation and fine-grained photographic control.

NeurIPS Conference 2025 Conference Paper

Continual Model Merging without Data: Dual Projections for Balancing Stability and Plasticity

  • Enneng Yang
  • Anke Tang
  • Li Shen
  • Guibing Guo
  • Xingwei Wang
  • Xiaochun Cao
  • Jie Zhang

Model merging integrates multiple expert models with diverse capabilities into a unified framework, facilitating collaborative learning. However, most existing methods assume simultaneous access to all models, which is often impractical in real-world scenarios where models are received sequentially. While some studies have investigated continual model merging (CMM), which merges models sequentially as they arrive, the challenge of balancing prior knowledge (stability) and incorporating new tasks (plasticity) remains unresolved. This paper, for the first time, formally defines the stability and plasticity of CMM from the perspective of orthogonal projection. Subsequently, we analyze the relationships among the spaces spanned by task data, historical gradients, and accumulated gradients. Building on this, we propose a data-free Dual Orthogonal Projection (DOP) method, which eliminates data dependence and mitigates interference between the merged model and models for old and new tasks by projecting their parameter differences onto their respective approximate data spaces. Finally, to solve potential conflicts between stability and plasticity, we reformulate DOP as a multi-objective optimization problem and employ a multi-gradient descent algorithm to obtain a Pareto-optimal solution. Extensive experiments across multiple architectures and task configurations validate that our approach significantly outperforms state-of-the-art CMM methods.
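The projection at the heart of DOP can be illustrated with a plain orthogonal projection; the orthonormal basis `U` below is a stand-in for the approximate data space that the paper derives from accumulated gradients:

```python
import numpy as np

def project_onto_subspace(delta: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Orthogonally project a parameter-difference vector onto span(U).

    U: (d, k) matrix with orthonormal columns spanning the target subspace.
    """
    return U @ (U.T @ delta)

# The residual delta - P(delta) is orthogonal to span(U), which is the
# property such projection methods rely on to limit interference with
# previously merged tasks.
```

The projection is idempotent (projecting twice changes nothing), and the residual is orthogonal to the subspace, so updates confined to the residual leave directions important to earlier tasks untouched.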

AAAI Conference 2025 Conference Paper

Critical Forgetting-Based Multi-Scale Disentanglement for Deepfake Detection

  • Kai Li
  • Wenqi Ren
  • Jianshu Li
  • Wei Wang
  • Xiaochun Cao

Recent face forgery detection methods based on disentangled representation learning utilize paired images for cross-reconstruction, aiming to extract forgery-relevant attributes and forgery-irrelevant content. However, there still exist the following issues that may compromise detector performance: 1) using information-dense images as the decoupling targets increases the decoupling difficulty; 2) the extracted attribute features are reconstruction-irrelevant rather than forgery-relevant, and single-scale forgery representation decoupling cannot capture sufficient discriminative information; 3) the generalization performance of decoupled attribute features is poor as the detector focuses on learning specific artifact types in the training set. To address these issues, we propose a novel disentangled representation learning framework for deepfake detection. First, we extract features by partitioning the dense information within the image, focusing independently on texture, color, or edges. These features are then used as the decoupling targets rather than the images themselves, which could mitigate the decoupling difficulty. Second, we extend reconstruction loss from image-level to feature-level, thus extending the forgery representation decoupling from single-scale to multi-scale. Third, we propose a critical forgetting mechanism that forces the detector to forget the most salient features during training, which correspond to specific forgery artifact types in the training set. Extensive experimental results validate the efficacy of the proposed method.

ICLR Conference 2025 Conference Paper

Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs

  • Yuhan Chen 0007
  • Yihong Luo
  • Yifan Song 0006
  • Pengwen Dai
  • Jing Tang 0004
  • Xiaochun Cao

Despite extensive research efforts focused on Out-of-Distribution (OOD) detection on images, OOD detection on nodes in graph learning remains underexplored. The dependence among graph nodes hinders the trivial adaptation of existing approaches on images that assume inputs to be i.i.d. sampled, since many unique features and challenges specific to graphs are not considered, such as the heterophily issue. Recently, GNNSafe, which considers node dependence, adapted energy-based detection to the graph domain with state-of-the-art performance. However, it has two serious issues: 1) it derives node energy from classification logits without specifically tailored training for modeling data distribution, making it less effective at recognizing OOD data; 2) it highly relies on energy propagation, which is based on the homophily assumption and causes significant performance degradation on heterophilic graphs, where a node's distribution tends to differ from that of its neighbors. To address the above issues, we suggest training Energy-based Models (EBMs) by Maximum Likelihood Estimation (MLE) to enhance data distribution modeling and removing energy propagation to overcome the heterophily issues. However, training EBMs via MLE requires performing Markov Chain Monte Carlo (MCMC) sampling on both node feature and node neighbors, which is challenging due to the node interdependence and discrete graph topology. To tackle the sampling challenge, we introduce Decoupled Graph Energy-based Model (DeGEM), which decomposes the learning process into two parts—a graph encoder that leverages topology information for node representations and an energy head that operates in latent space. Additionally, we propose a Multi-Hop Graph encoder (MH) and Energy Readout (ERo) to enhance node representation learning, Conditional Energy (CE) for improved EBM training, and Recurrent Update for the graph encoder and energy head to promote each other. This approach avoids sampling adjacency matrices and removes the need for energy propagation to extract graph topology information. Extensive experiments validate that DeGEM, without OOD exposure during training, surpasses previous state-of-the-art methods, achieving an average AUROC improvement of 6.71% on *homophilic* graphs and 20.29% on *heterophilic* graphs, and even outperforms methods trained with OOD exposure. Our code is available at: [https://github.com/draym28/DeGEM](https://github.com/draym28/DeGEM).

ICML Conference 2025 Conference Paper

Disentangled Graph Spectral Domain Adaptation

  • Liang Yang 0002
  • Xin Chen
  • Jiaming Zhuo
  • Di Jin 0001
  • Chuan Wang 0002
  • Xiaochun Cao
  • Zhen Wang 0004
  • Yuanfang Guo

The distribution shifts and the scarcity of labels prevent graph learning methods, especially graph neural networks (GNNs), from generalizing across domains. Compared to Unsupervised Domain Adaptation (UDA) with embedding alignment, Unsupervised Graph Domain Adaptation (UGDA) becomes more challenging in light of the attribute and topology entanglement in the representation. Beyond embedding alignment, UGDA turns to topology alignment but is limited by the ability of the employed topology model and the estimation of pseudo labels. To alleviate these issues, this paper proposes Disentangled Graph Spectral Domain adaptation (DGSDA), which disentangles attribute and topology alignments and directly aligns flexible graph spectral filters beyond topology. Specifically, Bernstein polynomial approximation, which mimics the behavior of the function to be approximated to a remarkable degree, is employed to capture complicated topology characteristics and avoid the expensive eigenvalue decomposition. Theoretical analysis reveals the tight GDA bound of DGSDA and the rationality of polynomial coefficient regularization. Quantitative and qualitative experiments justify the superiority of the proposed DGSDA.

ICML Conference 2025 Conference Paper

Do We Really Need Message Passing in Brain Network Modeling?

  • Liang Yang 0002
  • Yuwei Liu
  • Jiaming Zhuo
  • Di Jin 0001
  • Chuan Wang 0002
  • Zhen Wang 0004
  • Xiaochun Cao

Brain network analysis plays a critical role in brain disease prediction and diagnosis. Graph mining tools have made remarkable progress. Graph neural networks (GNNs) and Transformers, which rely on the message-passing scheme, recently dominated this field due to their powerful expressive ability on graph data. Unfortunately, by considering brain network construction using pairwise Pearson’s coefficients between any pairs of ROIs, model analysis and experimental verification reveal that the message-passing under both GNNs and Transformers can NOT be fully explored and exploited. Surprisingly, this paper observes the significant performance and efficiency enhancements of the Hadamard product compared to the matrix product, which is the matrix form of message passing, in processing the brain network. Inspired by this finding, a novel Brain Quadratic Network (BQN) is proposed by incorporating quadratic networks, which possess better universal approximation properties. Moreover, theoretical analysis demonstrates that BQN implicitly performs community detection along with representation learning. Extensive evaluations verify the superiority of the proposed BQN compared to the message-passing-based brain network modeling. Source code is available at https://github.com/LYWJUN/BQN-demo.
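The contrast the paper draws between message passing (a matrix product with the connectivity matrix) and the Hadamard product can be made concrete. In the sketch below, `A` is a stand-in for a dense Pearson connectivity matrix over ROIs, and the quadratic term shown is schematic rather than the exact BQN architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
A = rng.standard_normal((n, n))   # stand-in for a Pearson connectivity matrix
X = rng.standard_normal((n, d))   # node (ROI) features

# Message passing: each node aggregates the features of all its neighbours,
# weighted by the connectivity matrix.
mp = A @ X                        # (n, d)

# Hadamard-style interaction: an elementwise product of two feature views,
# the kind of second-order term quadratic networks are built from.
hp = X * mp                       # (n, d), elementwise
```

Because Pearson connectivity matrices are dense, `A @ X` mixes every ROI with every other one, whereas the elementwise product keeps per-node feature interactions local, which is one way to read the efficiency gap the abstract reports.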

ICLR Conference 2025 Conference Paper

DUALFormer: Dual Graph Transformer

  • Jiaming Zhuo
  • Yuwei Liu
  • Yintong Lu
  • Ziyi Ma
  • Kun Fu
  • Chuan Wang 0002
  • Yuanfang Guo
  • Zhen Wang 0004

Graph Transformers (GTs), adept at capturing the locality and globality of graphs, have shown promising potential in node classification tasks. Most state-of-the-art GTs succeed through integrating local Graph Neural Networks (GNNs) with their global Self-Attention (SA) modules to enhance structural awareness. Nonetheless, this architecture faces limitations arising from scalability challenges and the trade-off between capturing local and global information. On the one hand, the quadratic complexity associated with the SA modules poses a significant challenge for many GTs, particularly when scaling them to large-scale graphs. Many GTs are therefore forced to sacrifice some expressivity for computational efficiency. On the other hand, GTs face challenges in maintaining detailed local structural information while capturing long-range dependencies. As a result, they typically require significant computational costs to balance the local and global expressivity. To address these limitations, this paper introduces a novel GT architecture, dubbed DUALFormer, featuring a dual-dimensional design of its GNN and SA modules. Leveraging approximation theory from Linearized Transformers and treating the query as the surrogate representation of node features, DUALFormer \emph{efficiently} performs the computationally intensive global SA module on feature dimensions. Furthermore, by such a separation of local and global modules into dual dimensions, DUALFormer achieves a natural balance between local and global expressivity. In theory, DUALFormer can reduce intra-class variance, thereby enhancing the discriminability of node representations. Extensive experiments on eleven real-world datasets demonstrate its effectiveness and efficiency over existing state-of-the-art GTs.

ICML Conference 2025 Conference Paper

Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models

  • Chao Huang 0008
  • Yushu Shi
  • Jie Wen 0001
  • Wei Wang 0169
  • Yong Xu 0001
  • Xiaochun Cao

With advancements in visual language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection, lacking anomaly explanations. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them to video-level explanations, enhancing the model’s explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model’s anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods.

ICML Conference 2025 Conference Paper

Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification

  • Sicong Li
  • Qianqian Xu 0001
  • Zhiyong Yang 0001
  • Zitai Wang
  • Linchao Zhang
  • Xiaochun Cao
  • Qingming Huang

Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations but at the cost of efficiency due to multiple backpropagations. Seeing this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations, thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM’s generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM.
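The SAM machinery these variants share computes an adversarial weight perturbation from the loss gradient, so the loss is evaluated at the worst nearby point in weight space. A one-step sketch of that shared core (Focal-SAM's class-wise focal penalties are a weighting on top of this step and are not reproduced here):

```python
import numpy as np

def sam_perturbation(grad: np.ndarray, rho: float, eps: float = 1e-12) -> np.ndarray:
    """SAM's first-order ascent step: perturb the weights by rho along the
    normalized loss gradient, i.e. the worst-case direction within an
    L2 ball of radius rho around the current weights."""
    return rho * grad / (np.linalg.norm(grad) + eps)
```

Training then takes a descent step using the gradient evaluated at the perturbed weights; the extra backpropagations CC-SAM needs come from computing such perturbations per class, which is the cost Focal-SAM is designed to avoid.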

AAAI Conference 2025 Conference Paper

Graph Contrastive Learning with Joint Spectral Augmentation of Attribute and Topology

  • Liang Yang
  • Zhenna Li
  • Jiaming Zhuo
  • Jing Liu
  • Ziyi Ma
  • Chuan Wang
  • Zhen Wang
  • Xiaochun Cao

As an essential technique for Graph Contrastive Learning (GCL), Graph Augmentation (GA) improves the generalization capability of the GCLs by introducing different forms of the same graph. To ensure information integrity, existing GA strategies have been designed to simultaneously process the two types of information available in graphs: node attributes and graph topology. Nonetheless, these strategies tend to augment the two types of graph information separately, ignoring their correlation, resulting in limited representation ability. To overcome this drawback, this paper proposes a novel GCL framework with a Joint spectrAl augMentation, named GCL-JAM. Motivated by the equivalence between the graph learning objective on an attribute graph and the spectral clustering objective on the attribute-interpolated graph, the node attributes are first abstracted as another type of node to harmonize the node attributes and graph topology. The newly constructed graph is then utilized to perform spectral augmentation to capture the correlation during augmentation. Theoretically, the proposed joint spectral augmentation is proved to perturb more inter-class edges and noise attributes compared to separate augmentation methods. Extensive experiments on homophily and heterophily graphs validate the effectiveness and universality of GCL-JAM.

ICLR Conference 2025 Conference Paper

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

  • Xiaojun Jia
  • Tianyu Pang
  • Chao Du
  • Yihao Huang 0001
  • Jindong Gu
  • Yang Liu 0003
  • Xiaochun Cao
  • Min Lin

Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack's success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspects, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialization. Then, we combine these improved techniques to develop an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate our I-GCG on a series of benchmarks (such as the NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve a nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.

NeurIPS Conference 2025 Conference Paper

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework

  • Xuan Wang
  • Siyuan Liang
  • Dongping Liao
  • Han Fang
  • Aishan Liu
  • Xiaochun Cao
  • Yu-liang Lu
  • Ee-Chien Chang

Institutions with limited data and computing resources often outsource model training to third-party providers in a semi-honest setting, assuming adherence to prescribed training protocols with a pre-defined learning paradigm (e.g., supervised or semi-supervised learning). However, this practice can introduce severe security risks, as adversaries may poison the training data to embed backdoors into the resulting model. Existing detection approaches predominantly rely on statistical analyses, which often fail to maintain accurate detection across different learning paradigms. To address this challenge, we propose a unified backdoor detection framework in the semi-honest setting that exploits cross-examination of model inconsistencies between two independent service providers. Specifically, we integrate centered kernel alignment to enable robust feature similarity measurements across different model architectures and learning paradigms, thereby facilitating precise recovery and identification of backdoor triggers. We further introduce backdoor fine-tuning sensitivity analysis to distinguish backdoor triggers from adversarial perturbations, substantially reducing false positives. Extensive experiments demonstrate that our method achieves superior detection performance, improving accuracy by 4.4%, 1.7%, and 10.6% over SoTA baselines across supervised, self-supervised, and autoregressive learning tasks, respectively. Notably, it is the first to effectively detect backdoors in multimodal large language models, further highlighting its broad applicability and advancing secure deep learning.

IJCAI Conference 2025 Conference Paper

MMGIA: Gradient Inversion Attack Against Multimodal Federated Learning via Intermodal Correlation

  • Lele Zheng
  • Yang Cao
  • Leo Yu Zhang
  • Wei Wang
  • Yulong Shen
  • Xiaochun Cao

Multimodal federated learning (MMFL) enables collaborative model training across multiple modalities, such as images and text, without requiring direct data sharing. However, the inherent correlations between modalities introduce new privacy vulnerabilities, making MMFL more susceptible to gradient inversion attacks. In this work, we propose MMGIA, an intermodal correlation-driven gradient inversion attack that systematically exploits multimodal correlation to enhance data reconstruction quality. MMGIA consists of a two-stage optimization framework: the first stage independently reconstructs each modality using traditional gradient inversion techniques, while the second stage refines these reconstructions through pre-trained feature extractors to align modalities in a shared latent space. To further improve reconstruction accuracy, we introduce a quality-weighted fusion strategy, which dynamically integrates multimodal embeddings into a global fused representation that serves as a guiding signal for refining each modality’s reconstruction. This ensures that high-quality reconstructions contribute more to the optimization process, preventing degradation in well-reconstructed modalities while enhancing weaker ones. We conduct extensive experiments on multiple multimodal scenarios, demonstrating that MMGIA outperforms both the only existing multimodal attack and state-of-the-art single-modal attacks, revealing the heightened privacy risks in MMFL.

ICML Conference 2025 Conference Paper

Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent

  • Yongxian Wei
  • Anke Tang
  • Li Shen 0008
  • Zixuan Hu
  • Chun Yuan 0003
  • Xiaochun Cao

Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental target of model merging: the merged model performs as closely as possible to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem (i.e., minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through data-free optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a shared subspace spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.

IJCAI Conference 2025 Conference Paper

Object-Level Backdoor Attacks in RGB-T Semantic Segmentation with Cross-Modality Trigger Optimization

  • Xianghao Jiao
  • Di Wang
  • Jiawei Liang
  • Jianjie Huang
  • Wei Wang
  • Xiaochun Cao

The escalating threat of backdoor risks in deep vision models is a pressing concern. Existing research on backdoor attacks is often confined to a single modality, neglecting the challenges posed by multi-modality scene perception. This work pioneers backdoor attacks in RGB-Thermal (RGB-T) semantic segmentation. We overcome the critical limitation of current segmentation backdoor attacks that indiscriminately compromise all objects of a victim class, failing to provide fine-grained control for selectively targeting specific objects as required by adversaries. To address this, we introduce a novel Object-level Backdoor Attack pipeline, termed OBA. The OBA first employs a precise data poisoning (PDP) to lock a specific victim object. Specifically, the PDP embeds the trigger into only the victim object and modifies its label’s pixels at the corresponding positions, thus enabling object-level attacks. In addition, the domain gap between static single-modality triggers and multi-modality scenarios limits the PDP. We therefore introduce a Cross-Modality Trigger Generation (CMTG) method. Through style designs of triggers and cross-modality trigger co-optimization, the target domain semantics and multi-modality model perception patterns are encoded into triggers, achieving high effectiveness, stealth, and physical feasibility of triggers. Extensive experiments show that the proposed OBA enables precise manipulation of the designated object within the specific class.

ICML Conference 2025 Conference Paper

One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework

  • Feiran Li
  • Qianqian Xu 0001
  • Shilong Bao
  • Zhiyong Yang 0001
  • Xiaochun Cao
  • Qingming Huang

Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods heavily rely on manually crafted text prompts, making it challenging to achieve a high erasure (efficacy) while minimizing the impact on other benign concepts (usability), as illustrated in Fig. 1. In this paper, we attribute the limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution by directly integrating visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generating probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing outperforms state-of-the-art erasure approaches significantly with a better trade-off between efficacy and usability.

IJCAI Conference 2025 Conference Paper

Physical Adversarial Camouflage Through Gradient Calibration and Regularization

  • Jiawei Liang
  • Siyuan Liang
  • Jianjie Huang
  • Chenxi Si
  • Ming Zhang
  • Xiaochun Cao

The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely to unsampled texture points, thereby expanding the attack's effective range. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles, and distances show that our method significantly surpasses the state-of-the-art, with an average attack success rate (ASR) increase of 13.46\% across distances and 11.03\% across angles. Furthermore, experiments in real-world settings confirm the method's threat potential, highlighting the urgent need for more robust autopilot systems less prone to spoofing.

NeurIPS Conference 2025 Conference Paper

Rethinking Joint Maximum Mean Discrepancy for Visual Domain Adaptation

  • Wei Wang
  • Haifeng Xia
  • Chao Huang
  • Zhengming Ding
  • Cong Wang
  • Haojie Li
  • Xiaochun Cao

In domain adaptation (DA), joint maximum mean discrepancy (JMMD), a well-known distribution-distance metric, aims to measure the joint probability distribution difference between the source domain and target domain. Yet it remains underexplored and is especially hard to apply in a subspace-learning framework, since its empirical estimation involves a tensor-product operator whose partial derivative is difficult to obtain. To solve this issue, we deduce a concise JMMD based on the Representer theorem that avoids the tensor-product operator and obtain two essential findings. First, we reveal the uniformity of JMMD by proving that previous marginal, class conditional, and weighted class conditional probability distribution distances are three special cases of JMMD with different label reproducing kernels. Second, inspired by graph embedding, we observe that the similarity weights, which strengthen the intra-class compactness in the graph of the Hilbert-Schmidt independence criterion (HSIC), take opposite signs in the graph of JMMD, revealing why JMMD degrades feature discrimination. This motivates us to propose a novel loss, JMMD-HSIC, which jointly considers JMMD and HSIC to promote the discrimination of JMMD. Extensive experiments on several cross-domain datasets demonstrate the validity of our revealed theoretical results and the effectiveness of our proposed JMMD-HSIC.
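
The building block behind JMMD, the empirical maximum mean discrepancy between two samples, can be sketched with a generic biased RBF-kernel estimator (an illustration of the marginal MMD only, not the paper's Representer-theorem formulation of the joint version; the `gamma` bandwidth and names are assumptions):

```python
import numpy as np

def mmd2(x, y, gamma=1.0):
    """Biased estimator of squared MMD between samples x and y with an RBF kernel."""
    def k(a, b):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 2))       # "source domain" sample
tgt_same = rng.normal(0.0, 1.0, size=(200, 2))  # same distribution
tgt_shift = rng.normal(3.0, 1.0, size=(200, 2)) # shifted distribution
```

A shifted target sample should yield a much larger discrepancy than a sample drawn from the same distribution, which is exactly the signal DA methods try to minimize.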

NeurIPS Conference 2025 Conference Paper

RoMa: A Robust Model Watermarking Scheme for Protecting IP in Diffusion Models

  • Yingsha Xie
  • Rui Min
  • Zeyu Qin
  • Fei Ma
  • Li Shen
  • Fei Yu
  • Xiaochun Cao

Preserving intellectual property (IP) within a pre-trained diffusion model is critical for protecting the model's copyright and preventing unauthorized model deployment. In this regard, model watermarking is a common practice for IP protection that embeds traceable information within models and allows for further verification. Nevertheless, existing watermarking schemes often face challenges due to their vulnerability to fine-tuning, limiting their practical application in general pre-training and fine-tuning paradigms. Inspired by the use of mode connectivity to analyze model performance between a pair of connected models, we investigate watermark vulnerability by leveraging Linear Mode Connectivity (LMC) as a proxy to analyze the fine-tuning dynamics of watermark performance. Our results show that existing watermarked models tend to converge to sharp minima in the loss landscape, thus making them vulnerable to fine-tuning. To tackle this challenge, we propose RoMa, a Robust Model watermarking scheme that improves the robustness of watermarks against fine-tuning. Specifically, RoMa decomposes watermarking into two components: Embedding Functionality, which preserves reliable watermark detection capability, and Path-specific Smoothness, which enhances the smoothness along the watermark-connected path to improve robustness. Extensive experiments on the benchmark datasets MS-COCO-2017 and CUB-200-2011 demonstrate that RoMa significantly improves watermark robustness against fine-tuning while maintaining generation quality, outperforming baselines. The code is available at https://github.com/xiekks/RoMa.

AAAI Conference 2025 Conference Paper

SUMI-IFL: An Information-Theoretic Framework for Image Forgery Localization with Sufficiency and Minimality Constraints

  • Ziqi Sheng
  • Wei Lu
  • Xiangyang Luo
  • Jiantao Zhou
  • Xiaochun Cao

Image forgery localization (IFL) is a crucial technique for preventing tampered image misuse and protecting social safety. However, due to the rapid development of image tampering technologies, extracting more comprehensive and accurate forgery clues remains an urgent challenge. To address this challenge, we introduce a novel information-theoretic IFL framework named SUMI-IFL that imposes sufficiency-view and minimality-view constraints on forgery feature representation. First, grounded in the theoretical analysis of mutual information, the sufficiency-view constraint is enforced on the feature extraction network to ensure that the latent forgery feature contains comprehensive forgery clues. Considering that forgery clues obtained from a single aspect alone may be incomplete, we construct the latent forgery feature by integrating several orthogonal individual image features. Second, based on the information bottleneck, the minimality-view constraint is imposed on the feature reasoning network to achieve an accurate and concise forgery feature representation that counters the interference of task-unrelated features. Extensive experiments show the superior performance of SUMI-IFL over existing state-of-the-art methods, not only on in-dataset comparisons but also on cross-dataset comparisons.

ICML Conference 2025 Conference Paper

Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision

  • Li Shen 0008
  • Anke Tang
  • Yong Luo 0002
  • Tao Sun 0005
  • Han Hu 0003
  • Xiaochun Cao

Pruning is a widely used technique for compressing large neural networks that eliminates weights that have minimal impact on the model’s performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. Nonetheless, these methods often create a gap between the original dense model and the pruned sparse model, potentially impairing performance. Especially when the sparsity ratio is high, the gap becomes more pronounced. To mitigate this issue, we introduce a method to bridge the gap left by pruning by utilizing a low-rank approximation of the difference between the dense and sparse matrices. Our method entails the iterative refinement of the sparse weight matrix augmented by a low-rank adjustment. This technique captures and retains the essential information often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate its effectiveness on large language models across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50% sparsity, it reduces perplexity by 53.9% compared to conventional magnitude pruning on LLaMa-7B. Furthermore, to achieve a specific performance target, our approach enables an 8.6% reduction in model parameters while maintaining a sparsity ratio of about 50%.
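
The core idea, patching the pruning gap with a truncated SVD of the dense-minus-sparse difference, can be sketched in a single step (a minimal illustration only; the paper's method refines iteratively, and the rank and names here are assumptions):

```python
import numpy as np

def low_rank_refine(w_dense, w_sparse, rank):
    """Add a rank-`rank` approximation of the pruning gap back onto the sparse weights."""
    gap = w_dense - w_sparse
    u, s, vt = np.linalg.svd(gap, full_matrices=False)
    low_rank_patch = (u[:, :rank] * s[:rank]) @ vt[:rank]  # best rank-r Frobenius fit
    return w_sparse + low_rank_patch

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_pruned = np.where(np.abs(w) > 0.5, w, 0.0)   # simple magnitude pruning
w_refined = low_rank_refine(w, w_pruned, rank=16)

# The low-rank patch should shrink the reconstruction error to the dense weights.
err_before = np.linalg.norm(w - w_pruned)
err_after = np.linalg.norm(w - w_refined)
```

By the Eckart-Young theorem the truncated SVD is the best rank-16 fit to the gap, so the refined matrix is never farther from the dense weights than the pruned one.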

NeurIPS Conference 2025 Conference Paper

Towards Irreversible Attack: Fooling Scene Text Recognition via Multi-Population Coevolution Search

  • Jingyu Li
  • Pengwen Dai
  • Mingqing Zhu
  • Chengwei Wang
  • Haolong Liu
  • Xiaochun Cao

Recent work has shown that scene text recognition (STR) models are vulnerable to adversarial examples. Different from non-sequential vision tasks, the output sequence of STR models contains rich information. However, existing adversarial attacks against STR models can only lead to a few incorrect characters in the predicted text. These attack results still carry partial information about the original prediction and could be easily corrected by an external dictionary or a language model. Therefore, we propose the Multi-Population Coevolution Search (MPCS) method to attack each character in the image. We first decompose the global optimization objective into sub-objectives to solve the attack pixel concentration problem existing in previous attack methods. While this distributed optimization paradigm brings a new joint perturbation shift problem, we propose a novel coevolution energy function to solve it. Experiments on recent STR models show the superiority of our method. The code is available at \url{https://github.com/Lee-Jingyu/MPCS}.

ICLR Conference 2025 Conference Paper

Understanding the Stability-based Generalization of Personalized Federated Learning

  • Yingqi Liu
  • Qinglun Li
  • Jie Tang 0001
  • Yifan Shi
  • Li Shen 0008
  • Xiaochun Cao

Despite great achievements in algorithm design for Personalized Federated Learning (PFL), research on the theoretical analysis of generalization is still in its early stages. Some theoretical results have investigated the generalization performance of personalized models under convex problem settings and hypotheses, which cannot reflect the real iteration performance during non-convex training. To further understand the real performance from a generalization perspective, we propose the first algorithm-dependent generalization analysis with uniform stability for the typical PFL method, Partial Model Personalization, on smooth and non-convex objectives. Specifically, we decompose the generalization errors into aggregation errors and fine-tuning errors, then creatively establish a generalization analysis framework corresponding to the gradient estimation process of the personalized training. This framework builds a bridge among PFL, FL, and Pure Local Training for personalized aims in heterogeneous scenarios, which clearly demonstrates the effectiveness of PFL from the generalization perspective. Moreover, we demonstrate the impact of trivial factors like learning steps, stepsizes, and communication topologies and obtain the excess risk analysis with optimization errors for PFL. Promising experiments on CIFAR datasets also corroborate our theoretical insights. Our code can be seen at https://github.com/YingqiLiu1999/Understanding-the-Stability-based-Generalization-of-Personalized-Federated-Learning.

NeurIPS Conference 2025 Conference Paper

Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training

  • Qinglun Li
  • Yingqi Liu
  • Miao Zhang
  • Xiaochun Cao
  • Quanjun Yin
  • Li Shen

Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing the experimental performance gap. However, the theoretical reasons for its effectiveness and whether this gap can be fully eliminated by MGS remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. 1) Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. 2) Gap to Centralization: Even as MGS approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{n m})$ in centralized and $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ in decentralized). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, promising experiments on CIFAR datasets support our theoretical findings.

NeurIPS Conference 2025 Conference Paper

Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought

  • Chao Huang
  • Benfeng Wang
  • Wei Wang
  • Jie Wen
  • Chengliang Liu
  • Li Shen
  • Xiaochun Cao

Recent advancements in the reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate their effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLMs to reason about anomalies step-by-step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks.

TMLR Journal 2024 Journal Article

A Survey on Transferability of Adversarial Examples Across Deep Neural Networks

  • Jindong Gu
  • Xiaojun Jia
  • Pau de Jorge
  • Wenqian Yu
  • Xinwei Liu
  • Avery Ma
  • Yuan Xun
  • Anjun Hu

The emergence of Deep Neural Networks (DNNs) has revolutionized various domains by enabling the resolution of complex tasks spanning image recognition, natural language processing, and scientific problem-solving. However, this progress has also brought to light a concerning vulnerability: adversarial examples. These crafted inputs, imperceptible to humans, can manipulate machine learning models into making erroneous predictions, raising concerns for safety-critical applications. An intriguing property of this phenomenon is the transferability of adversarial examples, where perturbations crafted for one model can deceive another, often with a different architecture. This property enables ``black-box'' attacks, which circumvent the need for detailed knowledge of the target model. This survey explores the landscape of the transferability of adversarial examples. We categorize existing methodologies to enhance adversarial transferability and discuss the fundamental principles guiding each approach. While the predominant body of research primarily concentrates on image classification, we also extend our discussion to encompass other vision tasks and beyond. Challenges and opportunities are discussed, highlighting the importance of fortifying DNNs against adversarial vulnerabilities in an evolving landscape.

AAAI Conference 2024 Conference Paper

Does Few-Shot Learning Suffer from Backdoor Attacks?

  • Xinwei Liu
  • Xiaojun Jia
  • Jindong Gu
  • Yuan Xun
  • Siyuan Liang
  • Xiaochun Cao

The field of few-shot learning (FSL) has shown promising results in scenarios where training data is limited, but its vulnerability to backdoor attacks remains largely unexplored. We explore this topic by first evaluating the performance of existing backdoor attack methods in few-shot learning scenarios. Unlike in standard supervised learning, existing backdoor attack methods fail to perform an effective attack in FSL due to two main issues. Firstly, the model tends to overfit to either benign features or trigger features, causing a tough trade-off between attack success rate and benign accuracy. Secondly, due to the small number of training samples, the dirty label or visible trigger in the support set can be easily detected by victims, which reduces the stealthiness of attacks. It might seem that FSL could survive backdoor attacks. However, in this paper, we propose the Few-shot Learning Backdoor Attack (FLBA) to show that FSL can still be vulnerable to backdoor attacks. Specifically, we first generate a trigger to maximize the gap between poisoned and benign features. It enables the model to learn both benign and trigger features, which solves the problem of overfitting. To make it more stealthy, we hide the trigger by optimizing two types of imperceptible perturbation, namely attractive and repulsive perturbation, instead of attaching the trigger directly. Once we obtain the perturbations, we can poison all samples in the benign support set into a hidden poisoned support set and fine-tune the model on it. Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms while preserving clean accuracy and maintaining stealthiness. This study reveals that few-shot learning still suffers from backdoor attacks, and its security should be given attention.

NeurIPS Conference 2024 Conference Paper

EnsIR: An Ensemble Algorithm for Image Restoration via Gaussian Mixture Models

  • Shangquan Sun
  • Wenqi Ren
  • Zikun Liu
  • Hyunhee Park
  • Rui Wang
  • Xiaochun Cao

Image restoration has experienced significant advancements due to the development of deep learning. Nevertheless, it encounters challenges related to ill-posed problems, resulting in deviations between single model predictions and ground-truths. Ensemble learning, as a powerful machine learning technique, aims to address these deviations by combining the predictions of multiple base models. Most existing works adopt ensemble learning during the design of restoration models, while only limited research focuses on the inference-stage ensemble of pre-trained restoration models. Regression-based methods fail to enable efficient inference, leading researchers in academia and industry to prefer averaging as their choice for post-training ensemble. To address this, we reformulate the ensemble problem of image restoration into Gaussian mixture models (GMMs) and employ an expectation maximization (EM)-based algorithm to estimate ensemble weights for aggregating prediction candidates. We estimate the range-wise ensemble weights on a reference set and store them in a lookup table (LUT) for efficient ensemble inference on the test set. Our algorithm is model-agnostic and training-free, allowing seamless integration and enhancement of various pre-trained image restoration models. It consistently outperforms regression-based methods and averaging ensemble approaches on 14 benchmarks across 3 image restoration tasks, including super-resolution, deblurring and deraining. The codes and all estimated weights have been released on GitHub.
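
A minimal EM sketch of estimating ensemble weights under this Gaussian mixture view (scalar pixels, a fixed shared variance, no range-wise LUT; all names and the `sigma` value are assumptions, not the paper's implementation):

```python
import numpy as np

def em_ensemble_weights(preds, target, sigma=0.1, n_iter=50):
    """Estimate mixture weights by EM, modelling each target value as drawn from
    a Gaussian mixture whose components are centred at the base models' predictions.

    preds: (K, N) predictions of K base models on N reference pixels.
    target: (N,) ground-truth values on the reference set.
    """
    k = preds.shape[0]
    w = np.full(k, 1.0 / k)                               # uniform initialisation
    for _ in range(n_iter):
        # E-step: responsibility of each model for each reference pixel.
        lik = np.exp(-0.5 * ((target - preds) / sigma) ** 2)  # (K, N)
        resp = w[:, None] * lik
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: new weights are the average responsibilities.
        w = resp.mean(axis=1)
    return w

# Toy check: the base model closest to the target should get the largest weight.
target = np.linspace(0.0, 1.0, 100)
preds = np.stack([target + 0.01, target + 0.3, target - 0.5])
w = em_ensemble_weights(preds, target)
```

The learned weights stay on the simplex (non-negative, summing to one), so they can be applied directly as a convex combination of prediction candidates at inference time.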

NeurIPS Conference 2024 Conference Paper

Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification

  • Yihong Luo
  • Yuhan Chen
  • Siya Qiu
  • Yiwei Wang
  • Chen Zhang
  • Yan Zhou
  • Xiaochun Cao
  • Jing Tang

Graph Neural Networks (GNNs) have shown superior performance in node classification. However, GNNs perform poorly in the Few-Shot Node Classification (FSNC) task that requires robust generalization to make accurate predictions for unseen classes with limited labels. To tackle the challenge, we propose the integration of Sharpness-Aware Minimization (SAM)--a technique designed to enhance model generalization by finding a flat minimum of the loss landscape--into GNN training. The standard SAM approach, however, consists of two forward-backward steps in each training iteration, doubling the computational cost compared to the base optimizer (e.g., Adam). To mitigate this drawback, we introduce a novel algorithm, Fast Graph Sharpness-Aware Minimization (FGSAM), that integrates the rapid training of Multi-Layer Perceptrons (MLPs) with the superior performance of GNNs. Specifically, we utilize GNNs for parameter perturbation while employing MLPs to minimize the perturbed loss so that we can find a flat minimum with good generalization more efficiently. Moreover, our method reutilizes the gradient from the perturbation phase to incorporate graph topology into the minimization process at almost zero additional cost. To further enhance training efficiency, we develop FGSAM+ that executes exact perturbations periodically. Extensive experiments demonstrate that our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks. In particular, our FGSAM+ as a SAM variant offers a faster optimization than the base optimizer in most cases. In addition to FSNC, our proposed methods also demonstrate competitive performance in the standard node classification task for heterophilic graphs, highlighting the broad applicability.
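
The standard SAM update this abstract builds on, two forward-backward passes per step, can be sketched on a toy problem (generic SAM, not FGSAM itself; the learning rate, radius `rho`, and names are assumptions):

```python
import numpy as np

def sam_step(params, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step (two gradient evaluations).

    grad_fn(p) returns the loss gradient at p; rho is the perturbation radius.
    """
    g = grad_fn(params)                          # 1st pass: gradient at params
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to a nearby "sharp" point
    g_perturbed = grad_fn(params + eps)          # 2nd pass: gradient at that point
    return params - lr * g_perturbed             # descend using the perturbed gradient

# Toy quadratic loss L(p) = 0.5 * ||p||^2, whose gradient is simply p.
p = np.array([1.0, -2.0])
for _ in range(100):
    p = sam_step(p, lambda x: x)
```

The two calls to `grad_fn` per step are exactly the doubled cost the paper targets; FGSAM's contribution is replacing part of that work with cheaper MLP passes.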

ICML Conference 2024 Conference Paper

Harnessing Hierarchical Label Distribution Variations in Test Agnostic Long-tail Recognition

  • Zhiyong Yang 0001
  • Qianqian Xu 0001
  • Zitai Wang
  • Sicong Li
  • Boyu Han
  • Shilong Bao
  • Xiaochun Cao
  • Qingming Huang

This paper explores test-agnostic long-tail recognition, a challenging long-tail task where the test label distributions are unknown and arbitrarily imbalanced. We argue that the variation in these distributions can be broken down hierarchically into global and local levels. The global ones reflect a broad range of diversity, while the local ones typically arise from milder changes, often focused on a particular neighborhood. Traditional methods predominantly use a Mixture-of-Expert (MoE) approach, targeting a few fixed test label distributions that exhibit substantial global variations. However, the local variations are left unconsidered. To address this issue, we propose a new MoE strategy, $\mathsf{DirMixE}$, which assigns experts to different Dirichlet meta-distributions of the label distribution, each targeting a specific aspect of local variations. Additionally, the diversity among these Dirichlet meta-distributions inherently captures global variations. This dual-level approach also leads to a more stable objective function, allowing us to better sample different test distributions to quantify the mean and variance of performance outcomes. Theoretically, we show that our proposed objective benefits from enhanced generalization by virtue of the variance-based regularization. Comprehensive experiments across multiple benchmarks confirm the effectiveness of $\mathsf{DirMixE}$.

ICLR Conference 2024 Conference Paper

Less is More: Fewer Interpretable Region via Submodular Subset Selection

  • Ruoyu Chen 0001
  • Hua Zhang 0008
  • Siyuan Liang 0004
  • Jingzhi Li 0002
  • Xiaochun Cao

Image attribution algorithms aim to identify important regions that are highly relevant to model decisions. Although existing attribution solutions can effectively assign importance to target elements, they still face the following challenges: 1) existing attribution methods generate inaccurate small regions thus misleading the direction of correct attribution, and 2) the model cannot produce good attribution results for samples with wrong predictions. To address the above challenges, this paper re-models the above image attribution problem as a submodular subset selection problem, aiming to enhance model interpretability using fewer regions. To address the lack of attention to local regions, we construct a novel submodular function to discover more accurate small interpretation regions. To enhance the attribution effect for all samples, we also impose four different constraints on the selection of sub-regions, i.e., confidence, effectiveness, consistency, and collaboration scores, to assess the importance of various subsets. Moreover, our theoretical analysis substantiates that the proposed function is in fact submodular. Extensive experiments show that the proposed method outperforms SOTA methods on two face datasets (Celeb-A and VGG-Face2) and one fine-grained dataset (CUB-200-2011). For correctly predicted samples, the proposed method improves the Deletion and Insertion scores with an average of 4.9\% and 2.5\% gain relative to HSIC-Attribution. For incorrectly predicted samples, our method achieves gains of 81.0\% and 18.4\% compared to the HSIC-Attribution algorithm in the average highest confidence and Insertion score respectively. The code is released at https://github.com/RuoyuChen10/SMDL-Attribution.

IJCAI Conference 2024 Conference Paper

Multi-Attention Based Visual-Semantic Interaction for Few-Shot Learning

  • Peng Zhao
  • Yin Wang
  • Wei Wang
  • Jie Mu
  • Huiting Liu
  • Cong Wang
  • Xiaochun Cao

Few-Shot Learning (FSL) aims to train a model that can generalize to recognize new classes, with each new class having only very limited training samples. Since extracting discriminative features for new classes with few samples is challenging, existing FSL methods leverage visual and semantic prior knowledge to guide discriminative feature learning. However, for meta-learning purposes, the semantic knowledge of the query set is unavailable, so their features lack discriminability. To address this problem, we propose a novel Multi-Attention based Visual-Semantic Interaction (MAVSI) approach for FSL. Specifically, we utilize spatial and channel attention mechanisms to effectively select discriminative visual features for the support set based on its ground-truth semantics while using all the support set semantics for each query set sample. Then, a relation module with class prototypes of the support set is employed to supervise and select discriminative visual features for the query set. To further enhance the discriminability of the support set, we introduce a visual-semantic contrastive learning module to promote the similarity between visual features and their corresponding semantic features. Extensive experiments on four benchmark datasets demonstrate that our proposed MAVSI could outperform existing state-of-the-art FSL methods.

NeurIPS Conference 2024 Conference Paper

Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features

  • Benyuan Meng
  • Qianqian Xu
  • Zitai Wang
  • Xiaochun Cao
  • Qingming Huang

Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. Considering the significant increase in activations, a full-scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at https://github.com/Darkbblue/generic-diffusion-feature.

AAAI Conference 2024 Conference Paper

Omnidirectional Image Super-resolution via Bi-projection Fusion

  • Jiangang Wang
  • Yuning Cui
  • Yawen Li
  • Wenqi Ren
  • Xiaochun Cao

With the rapid development of virtual reality, omnidirectional images (ODIs) have attracted much attention from both the industrial community and academia. However, due to storage and transmission limitations, the resolution of current ODIs is often insufficient to provide an immersive virtual reality experience. Previous approaches address this issue using conventional 2D super-resolution techniques on equirectangular projection without exploiting the unique geometric properties of ODIs. In particular, the equirectangular projection (ERP) provides a complete field-of-view but introduces significant distortion, while the cubemap projection (CMP) can reduce distortion yet has a limited field-of-view. In this paper, we present a novel Bi-Projection Omnidirectional Image Super-Resolution (BPOSR) network to take advantage of the geometric properties of the above two projections. Then, we design two tailored attention methods for these projections: Horizontal Striped Transformer Block (HSTB) for ERP and Perspective Shift Transformer Block (PSTB) for CMP. Furthermore, we propose a fusion module to make these projections complement each other. Extensive experiments demonstrate that BPOSR achieves state-of-the-art performance on omnidirectional image super-resolution. The code is available at https://github.com/W-JG/BPOSR.

IJCAI Conference 2024 Conference Paper

Optimal Graph Learning and Nuclear Norm Maximization for Deep Cross-Domain Robust Label Propagation

  • Wei Wang
  • Hanyang Li
  • Ke Shi
  • Chao Huang
  • Yang Cao
  • Cong Wang
  • Xiaochun Cao

Domain adaptation aims to transfer labels from a labeled source domain to an unlabeled target domain, where the two domains exhibit different distributions. Existing methods primarily concentrate on designing a feature extractor that learns better domain-invariant features, along with an effective classifier for reliable predictions. In this paper, we introduce optimal graph learning to generate a cross-domain graph that effectively connects the two domains, and two domain-specific graphs to capture domain-specific structures. On the one hand, we incorporate the three graphs into the label propagation (LP) classifier to enhance its robustness to distribution differences. On the other hand, we leverage the three graphs to introduce graph embedding losses, promoting the learning of locally discriminative and domain-invariant features. Furthermore, we maximize the nuclear norm of the LP predictions to enhance class diversity, thereby improving robustness to the class imbalance problem. Correspondingly, we develop an efficient algorithm to solve the associated optimization problem. Finally, we integrate the proposed LP and graph embedding losses into a deep neural network, resulting in our proposed deep cross-domain robust LP. Extensive experiments on three cross-domain benchmark datasets demonstrate that our approach outperforms existing state-of-the-art domain adaptation methods.
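As a rough illustration of why maximizing the nuclear norm of a prediction matrix encourages class diversity (a toy sketch of the general principle, not the paper's optimization algorithm): predictions spread across classes have a higher nuclear norm than predictions collapsed onto a single class.

```python
import numpy as np

def nuclear_norm(P):
    """Nuclear norm = sum of singular values of the prediction matrix P."""
    return float(np.linalg.svd(P, compute_uv=False).sum())

# Two samples, two classes: diverse predictions vs. all mass on class 0.
diverse = np.eye(2)
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])
```

Maximizing this quantity therefore pushes label-propagation predictions away from degenerate, single-class solutions.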

ICLR Conference 2024 Conference Paper

Poisoned Forgery Face: Towards Backdoor Attacks on Face Forgery Detection

  • Jiawei Liang
  • Siyuan Liang 0004
  • Aishan Liu
  • Xiaojun Jia
  • Junhao Kuang
  • Xiaochun Cao

The proliferation of face forgery techniques has raised significant concerns within society, motivating the development of face forgery detection methods. These methods aim to distinguish forged faces from genuine ones and have proven effective in practical applications. However, this paper introduces a novel and previously unrecognized threat in face forgery detection scenarios caused by backdoor attacks. By embedding backdoors into models and incorporating specific trigger patterns into the input, attackers can deceive detectors into producing erroneous predictions for forged faces. To achieve this goal, this paper proposes the \emph{Poisoned Forgery Face} framework, which enables clean-label backdoor attacks on face forgery detectors. Our approach involves constructing a scalable trigger generator and utilizing a novel convolving process to generate translation-sensitive trigger patterns. Moreover, we employ a relative embedding method based on landmark-based regions to enhance the stealthiness of the poisoned samples. Consequently, detectors trained on our poisoned samples are embedded with backdoors. Notably, our approach surpasses SoTA backdoor baselines with a significant improvement in attack success rate (+16.39\% BD-AUC) and reduction in visibility (-12.65\% $L_\infty$). Furthermore, our attack exhibits promising performance against backdoor defenses. We anticipate that this paper will draw greater attention to the potential threats posed by backdoor attacks in face forgery detection scenarios. Our code will be made available at \url{https://github.com/JWLiang007/PFF}.

AAAI Conference 2024 Conference Paper

PTUS: Photo-Realistic Talking Upper-Body Synthesis via 3D-Aware Motion Decomposition Warping

  • Luoyang Lin
  • Zutao Jiang
  • Xiaodan Liang
  • Liqian Ma
  • Michael C. Kampffmeyer
  • Xiaochun Cao

Talking upper-body synthesis is a promising task due to its versatile potential for video creation; it consists of animating the body and face from a source image with the motion from a given driving video. However, prior synthesis approaches fall short on this task: they are either limited to animating only the head of a target person, or animate the upper body while neglecting the synthesis of precise facial details. To tackle this task, we propose a Photo-realistic Talking Upper-body Synthesis method via 3D-aware motion decomposition warping, named PTUS, to precisely synthesize the upper body while also recovering facial details such as blinking and lip synchronization. In particular, the motion decomposition mechanism consists of a face-body motion decomposition, which decouples the 3D motion estimation of the face and body, and a local-global motion decomposition, which decomposes the 3D face motion into global and local motions to enable the transfer of facial expressions. The 3D-aware warping module transfers the large-scale and subtle 3D motions to the extracted 3D depth-aware features in a coarse-to-fine manner. Moreover, we present a new dataset, Talking-UB, which includes upper-body images with high-resolution faces, addressing the limitations of prior datasets that consist of either facial images only or upper-body images with blurry faces. Experimental results demonstrate that our proposed method synthesizes high-quality videos that preserve facial details, and achieves superior results compared to state-of-the-art cross-person motion transfer approaches. Code and the collected dataset are released at https://github.com/cooluoluo/PTUS.

AAAI Conference 2024 Conference Paper

SDGAN: Disentangling Semantic Manipulation for Facial Attribute Editing

  • Wenmin Huang
  • Weiqi Luo
  • Jiwu Huang
  • Xiaochun Cao

Facial attribute editing has garnered significant attention, yet prevailing methods struggle with achieving precise attribute manipulation while preserving irrelevant details and controlling attribute styles. This challenge primarily arises from the strong correlations between different attributes and the interplay between attributes and identity. In this paper, we propose Semantic Disentangled GAN (SDGAN), a novel method addressing this challenge. SDGAN introduces two key concepts: a semantic disentanglement generator that assigns facial representations to distinct attribute-specific editing modules, enabling the decoupling of the facial attribute editing process, and a semantic mask alignment strategy that confines attribute editing to appropriate regions, thereby avoiding undesired modifications. Leveraging these concepts, SDGAN demonstrates accurate attribute editing and achieves high-quality attribute style manipulation through both latent-guided and reference-guided manners. We extensively evaluate our method on the CelebA-HQ database, providing both qualitative and quantitative analyses. Our results establish that SDGAN significantly outperforms state-of-the-art techniques, showcasing the effectiveness of our approach. To foster reproducibility and further research, we will provide the code for our method.

ICML Conference 2024 Conference Paper

Size-invariance Matters: Rethinking Metrics and Losses for Imbalanced Multi-object Salient Object Detection

  • Feiran Li
  • Qianqian Xu 0001
  • Shilong Bao
  • Zhiyong Yang 0001
  • Runmin Cong
  • Xiaochun Cao
  • Qingming Huang

This paper explores the size-invariance of evaluation metrics in Salient Object Detection (SOD), especially when multiple targets of diverse sizes co-exist in the same image. We observe that current metrics are size-sensitive: larger objects dominate the score, while smaller ones tend to be ignored. We argue that evaluation should be size-invariant, because a size-based bias is unjustified without additional semantic information. In pursuit of this, we propose a generic approach that evaluates each salient object separately and then combines the results, effectively alleviating the imbalance. We further develop an optimization framework tailored to this goal, achieving considerable improvements in detecting objects of different sizes. Theoretically, we provide evidence supporting the validity of our new metrics and present a generalization analysis of SOD. Extensive experiments demonstrate the effectiveness of our method.
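The per-object evaluation idea can be illustrated with a toy recall computation (an assumed simplification, not the paper's actual metrics): a pixel-level score is dominated by large objects, while scoring each object separately and then averaging weights all objects equally.

```python
import numpy as np

def pixel_recall(pred, objects):
    """Pixel-level recall over the union of all objects: big objects dominate."""
    gt = np.logical_or.reduce(objects)
    return np.logical_and(pred, gt).sum() / gt.sum()

def size_invariant_recall(pred, objects):
    """Score each object separately, then average: every object counts equally."""
    return float(np.mean([np.logical_and(pred, o).sum() / o.sum() for o in objects]))

# One large object (100 px) detected perfectly, one small object (4 px) missed.
big = np.zeros((20, 20), bool); big[:10, :10] = True
small = np.zeros((20, 20), bool); small[15:17, 15:17] = True
pred = big.copy()
```

Here the pixel-level recall is near 1 despite the small object being missed entirely, whereas the size-invariant score drops to 0.5, reflecting the failure.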

NeurIPS Conference 2024 Conference Paper

Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques

  • Benyuan Meng
  • Qianqian Xu
  • Zitai Wang
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion features. We discover that diffusion features have been hindered by a hidden yet universal phenomenon that we call content shift. Specifically, there are content differences between the features and the input image, such as the exact shape of a certain object. We trace the cause of content shift to an inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion features. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing to the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique, and provide an implementation of our methodology. Despite its simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at https://github.com/Darkbblue/diffusion-content-shift.

NeurIPS Conference 2024 Conference Paper

Unified Graph Augmentations for Generalized Contrastive Learning on Graphs

  • Jiaming Zhuo
  • Yintong Lu
  • Hui Ning
  • Kun Fu
  • Bingxin Niu
  • Dongxiao He
  • Chuan Wang
  • Yuanfang Guo

In real-world scenarios, networks (graphs) and their tasks possess unique characteristics, requiring the development of a versatile graph augmentation (GA) to meet the varied demands of network analysis. Unfortunately, most Graph Contrastive Learning (GCL) frameworks are hampered by the specificity, complexity, and incompleteness of their GA techniques. Firstly, GAs designed for specific scenarios may compromise the universality of models if mishandled. Secondly, the process of identifying and generating optimal augmentations generally involves substantial computational overhead. Thirdly, the effectiveness of the GCL, even the learnable ones, is constrained by the finite selection of GAs available. To overcome the above limitations, this paper introduces a novel unified GA module dubbed UGA after reinterpreting the mechanism of GAs in GCLs from a message-passing perspective. Theoretically, this module is capable of unifying any explicit GAs, including node, edge, attribute, and subgraph augmentations. Based on the proposed UGA, a novel generalized GCL framework dubbed Graph cOntrastive UnifieD Augmentations (GOUDA) is proposed. It seamlessly integrates widely adopted contrastive losses and an introduced independence loss to fulfill the common requirements of consistency and diversity of augmentation across diverse scenarios. Evaluations across various datasets and tasks demonstrate the generality and efficiency of the proposed GOUDA over existing state-of-the-art GCLs.

NeurIPS Conference 2023 Conference Paper

A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning

  • Zitai Wang
  • Qianqian Xu
  • Zhiyong Yang
  • Yuan He
  • Xiaochun Cao
  • Qingming Huang

Real-world datasets are typically imbalanced in the sense that only a few classes have numerous samples, while many classes are associated with only a few samples. As a result, a naive ERM learning process will be biased towards the majority classes, making it difficult to generalize to the minority classes. To address this issue, one simple but effective approach is to modify the loss function to emphasize the learning on minority classes, such as re-weighting the losses or adjusting the logits via class-dependent terms. However, existing generalization analysis of such losses is still coarse-grained and fragmented, failing to explain some empirical results. To bridge this gap between theory and practice, we propose a novel technique named data-dependent contraction to capture how these modified losses handle different classes. On top of this technique, a fine-grained generalization bound is established for imbalanced learning, which helps reveal the mystery of re-weighting and logit-adjustment in a unified manner. Furthermore, a principled learning algorithm is developed based on the theoretical insights. Finally, the empirical results on benchmark datasets not only validate the theoretical results but also demonstrate the effectiveness of the proposed method.
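For context, a common member of the logit-adjustment family analyzed here shifts each logit by a scaled log class prior before the softmax, so minority classes need a smaller margin to win. The sketch below uses hypothetical names and is a generic illustration of that mechanism, not the paper's proposed algorithm.

```python
import numpy as np

def la_log_softmax(logits, class_priors, tau=1.0):
    """Logit adjustment: add tau * log(prior) to each logit, then log-softmax."""
    adj = logits + tau * np.log(class_priors)
    adj = adj - adj.max(axis=-1, keepdims=True)          # numerical stability
    return adj - np.log(np.exp(adj).sum(axis=-1, keepdims=True))

def la_loss(logits, y, class_priors, tau=1.0):
    """Logit-adjusted cross-entropy, averaged over the batch."""
    logp = la_log_softmax(logits, class_priors, tau)
    return -logp[np.arange(len(y)), y].mean()

logits = np.zeros((2, 2))                 # an undecided classifier
uniform = np.array([0.5, 0.5])
skewed = np.array([0.9, 0.1])             # class 1 is the minority
```

With uniform priors the loss reduces to plain cross-entropy; with skewed priors, an undecided prediction is penalized far more on the minority class, which is what drives the emphasis on rare classes.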

NeurIPS Conference 2023 Conference Paper

DRAUC: An Instance-wise Distributionally Robust AUC Optimization Framework

  • Siran Dai
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

The Area Under the ROC Curve (AUC) is a widely employed metric in long-tailed classification scenarios. Nevertheless, most existing methods primarily assume that training and testing examples are drawn i.i.d. from the same distribution, which is often unachievable in practice. Distributionally Robust Optimization (DRO) enhances model performance by optimizing it for the local worst-case scenario, but directly integrating AUC optimization with DRO results in an intractable optimization problem. To tackle this challenge, we propose an instance-wise surrogate loss of Distributionally Robust AUC (DRAUC) and build our optimization framework on top of it. Moreover, we highlight that conventional DRAUC may induce label bias, and hence introduce distribution-aware DRAUC as a more suitable metric for robust AUC learning. Theoretically, we show that the generalization gap between the training loss and testing error diminishes if the training set is sufficiently large. Empirically, experiments on corrupted benchmark datasets demonstrate the effectiveness of our proposed method. Code is available at: https://github.com/EldercatSAM/DRAUC.

AAAI Conference 2023 Conference Paper

Generating Transferable 3D Adversarial Point Cloud via Random Perturbation Factorization

  • Bangyan He
  • Jian Liu
  • Yiming Li
  • Siyuan Liang
  • Jingzhi Li
  • Xiaojun Jia
  • Xiaochun Cao

Recent studies have demonstrated that existing deep neural networks (DNNs) on 3D point clouds are vulnerable to adversarial examples, especially under white-box settings where the adversaries have access to model parameters. However, adversarial 3D point clouds generated by existing white-box methods have limited transferability across different DNN architectures. They therefore pose only minor threats in real-world scenarios under black-box settings, where the adversaries can only query the deployed victim model. In this paper, we revisit the transferability of adversarial 3D point clouds. We observe that an adversarial perturbation can be randomly factorized into two sub-perturbations, which are also likely to be adversarial perturbations. This motivates us to consider the effects of the perturbation and its sub-perturbations simultaneously to increase transferability, since the sub-perturbations also contain helpful information. Accordingly, we propose a simple yet effective attack method to generate more transferable adversarial 3D point clouds. Specifically, rather than simply optimizing the loss of the perturbation alone, we combine it with that of its random factorization. We conduct experiments on a benchmark dataset, verifying our method's effectiveness in increasing transferability while preserving high efficiency.

ICML Conference 2023 Conference Paper

IRNeXt: Rethinking Convolutional Network Design for Image Restoration

  • Yuning Cui 0001
  • Wenqi Ren
  • Sining Yang
  • Xiaochun Cao
  • Alois C. Knoll

We present IRNeXt, a simple yet effective convolutional network architecture for image restoration. Recently, Transformer models have dominated the field of image restoration due to their powerful ability to model long-range pixel interactions. In this paper, we excavate the potential of the convolutional neural network (CNN) and show that our CNN-based model can achieve comparable or better performance than Transformer models with low computational overhead on several image restoration tasks. By re-examining the characteristics of advanced image restoration algorithms, we identify several key factors behind the performance improvement of restoration models. This motivates us to develop a novel network for image restoration based on cheap convolution operators. Comprehensive experiments demonstrate that IRNeXt delivers state-of-the-art performance on numerous datasets across a range of image restoration tasks with low computational complexity, including image dehazing, single-image defocus/motion deblurring, image deraining, and image desnowing. The code is available at https://github.com/c-yn/IRNeXt.

IJCAI Conference 2023 Conference Paper

LSGNN: Towards General Graph Neural Network in Node Classification by Local Similarity

  • Yuhan Chen
  • Yihong Luo
  • Jing Tang
  • Liang Yang
  • Siya Qiu
  • Chuan Wang
  • Xiaochun Cao

Heterophily has been considered an issue that hurts the performance of Graph Neural Networks (GNNs). To address it, some existing work uses a graph-level weighted fusion of the information of multi-hop neighbors to include more nodes with homophily. However, heterophily may differ among nodes, which requires considering the local topology. Motivated by this, we propose to use local similarity (LocalSim) to learn node-level weighted fusion, which can also serve as a plug-and-play module. For better fusion, we propose a novel and efficient Initial Residual Difference Connection (IRDC) to extract more informative multi-hop information. Moreover, we provide a theoretical analysis of the effectiveness of LocalSim in representing node homophily on synthetic graphs. Extensive evaluations on real benchmark datasets show that our proposed method, namely Local Similarity Graph Neural Network (LSGNN), offers comparable or superior state-of-the-art performance on both homophilic and heterophilic graphs. Meanwhile, the plug-and-play module can significantly boost the performance of existing GNNs.

NeurIPS Conference 2023 Conference Paper

Punctuation-level Attack: Single-shot and Single Punctuation Can Fool Text Models

  • Wenqiang Wang
  • Chongyang Du
  • Tao Wang
  • Kaihao Zhang
  • Wenhan Luo
  • Lin Ma
  • Wei Liu
  • Xiaochun Cao

Adversarial attacks have attracted increasing attention in various fields, including natural language processing. Current textual attack models primarily focus on fooling models by adding character-/word-/sentence-level perturbations, ignoring their influence on human perception. In this paper, for the first time in the community, we propose a novel mode of textual attack, the punctuation-level attack. With various types of perturbations, including insertion, displacement, deletion, and replacement, the punctuation-level attack achieves promising fooling rates against SOTA models on typical textual tasks, while maintaining minimal influence on human perception and understanding of the text by perturbing only a single punctuation mark in a single shot. Furthermore, we propose a search method named Text Position Punctuation Embedding and Paraphrase (TPPEP) to accelerate the search for the optimal position at which to deploy the attack, without exhaustive enumeration, and we present a mathematical interpretation of TPPEP. Thanks to the integrated Text Position Punctuation Embedding (TPPE), the punctuation attack can be applied at a constant time cost. Experimental results on public datasets and SOTA models demonstrate the effectiveness of the punctuation attack and the proposed TPPE. We additionally apply the single-punctuation attack to summarization, semantic-similarity-scoring, and text-to-image tasks, and achieve encouraging results.

ICLR Conference 2023 Conference Paper

Selective Frequency Network for Image Restoration

  • Yuning Cui 0001
  • Yi Tao
  • Zhenshan Bing
  • Wenqi Ren
  • Xinwei Gao
  • Xiaochun Cao
  • Kai Huang 0001
  • Alois C. Knoll

Image restoration aims to reconstruct the latent sharp image from its corrupted counterpart. Besides dealing with this long-standing task in the spatial domain, a few approaches seek solutions in the frequency domain in consideration of the large discrepancy between spectra of sharp/degraded image pairs. However, these works commonly utilize transformation tools, e.g., wavelet transform, to split features into several frequency parts, which is not flexible enough to select the most informative frequency component to recover. In this paper, we exploit a multi-branch and content-aware module to decompose features into separate frequency subbands dynamically and locally, and then accentuate the useful ones via channel-wise attention weights. In addition, to handle large-scale degradation blurs, we propose an extremely simple decoupling and modulation module to enlarge the receptive field via global and window-based average pooling. Integrating two developed modules into a U-Net backbone, the proposed Selective Frequency Network (SFNet) performs favorably against state-of-the-art algorithms on five image restoration tasks, including single-image defocus deblurring, image dehazing, image motion deblurring, image desnowing, and image deraining.

NeurIPS Conference 2023 Conference Paper

Self-supervised Graph Neural Networks via Low-Rank Decomposition

  • Liang Yang
  • Runjie Shi
  • Qiuliang Zhang
  • Bingxin Niu
  • Zhen Wang
  • Xiaochun Cao
  • Chuan Wang

Self-supervised learning is introduced to train graph neural networks (GNNs) by employing propagation-based GNNs designed for semi-supervised learning tasks. Unfortunately, this common choice tends to cause two serious issues. Firstly, global parameters cause the model to lack the ability to capture local properties. Secondly, it is difficult to handle networks beyond homophily without label information. This paper seeks to break through the common choice of employing propagation-based GNNs, which aggregate representations of nodes belonging to different classes and tend to lose discriminative information. If the propagation in each ego-network occurs only between nodes from the same class, the obtained representation matrix should follow a low-rank characteristic. To meet this requirement, this paper proposes Low-Rank Decomposition-based GNNs (LRD-GNN-Matrix) by applying low-rank decomposition to the attribute matrix. Furthermore, to incorporate long-distance information, a Low-Rank Tensor Decomposition-based GNN (LRD-GNN-Tensor) is proposed by constructing the node attribute tensor from selected similar ego-networks and performing low-rank tensor decomposition. The employed tensor nuclear norm facilitates the capture of long-distance relationships between the original and selected similar ego-networks. Extensive experiments demonstrate the superior performance and robustness of LRD-GNNs.

AAAI Conference 2022 Conference Paper

Defending against Model Stealing via Verifying Embedded External Features

  • Yiming Li
  • Linghui Zhu
  • Xiaojun Jia
  • Yong Jiang
  • Shu-Tao Xia
  • Xiaochun Cao

Obtaining a well-trained model involves expensive data collection and training procedures, making the model a valuable intellectual property. Recent studies revealed that adversaries can ‘steal’ deployed models even when they have no training samples and cannot access the model parameters or structures. Some defense methods have been proposed to alleviate this threat, mostly by increasing the cost of model stealing. In this paper, we explore the defense from another angle: verifying whether a suspicious model contains the knowledge of defender-specified external features. Specifically, we embed the external features by tampering with a few training samples via style transfer. We then train a meta-classifier to determine whether a model is stolen from the victim. This approach is inspired by the understanding that stolen models should contain the knowledge of features learned by the victim model. We examine our method on both the CIFAR-10 and ImageNet datasets. Experimental results demonstrate that our method is effective in detecting different types of model stealing simultaneously, even if the stolen model is obtained via a multi-stage stealing process. The code for reproducing the main results is available on GitHub (https://github.com/zlh-thu/StealingVerification).

AAAI Conference 2022 Conference Paper

Geometry Interaction Knowledge Graph Embeddings

  • Zongsheng Cao
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

Knowledge graph (KG) embeddings have shown great power in learning representations of entities and relations for link prediction tasks. Previous work usually embeds KGs into a single geometric space, such as Euclidean space (zero curvature), hyperbolic space (negative curvature), or hyperspherical space (positive curvature), to maintain their specific geometric structures (e.g., chain, hierarchy, and ring structures). However, the topological structure of KGs appears to be complicated, since it may contain multiple types of geometric structures simultaneously. Therefore, embedding KGs in a single space, whether Euclidean, hyperbolic, or hyperspherical, cannot capture the complex structures of KGs accurately. To overcome this challenge, we propose Geometry Interaction knowledge graph Embeddings (GIE), which learns spatial structures interactively between the Euclidean, hyperbolic, and hyperspherical spaces. Theoretically, our proposed GIE can capture a richer set of relational information, model key inference patterns, and enable expressive semantic matching across entities. Experimental results on three well-established knowledge graph completion benchmarks show that our GIE achieves state-of-the-art performance with fewer parameters.

NeurIPS Conference 2022 Conference Paper

OPEN: Orthogonal Propagation with Ego-Network Modeling

  • Liang Yang
  • Lina Kang
  • Qiuliang Zhang
  • Mengzhe Li
  • Bingxin Niu
  • Dongxiao He
  • Zhen Wang
  • Chuan Wang

To alleviate the unfavorable effect of noisy topology in Graph Neural Networks (GNNs), some efforts perform local topology refinement through pairwise propagation weight learning and multi-channel extension. Unfortunately, most of them suffer from a common and fatal drawback: irrelevance both among the propagations to one node and among the propagations across channels. These two kinds of irrelevance leave the multi-channel propagation weights free to be determined by the labeled data, and thus expose the GNNs to overfitting. To tackle this issue, a novel Orthogonal Propagation with Ego-Network modeling (OPEN) is proposed by modeling the relevance between propagations. Specifically, the relevance between propagations to one node is modeled by whole ego-network modeling, while the relevance between propagations across channels is modeled via a diversity requirement. By interpreting the propagations to one node from the perspective of dimension reduction, propagation weights are inferred from the principal components of the ego-network, which are orthogonal to each other. Theoretical analysis and experimental evaluations reveal four attractive characteristics of OPEN: modeling high-order relationships beyond pairwise ones, preventing overfitting, robustness, and high efficiency.

NeurIPS Conference 2022 Conference Paper

OpenAUC: Towards AUC-Oriented Open-Set Recognition

  • Zitai Wang
  • Qianqian Xu
  • Zhiyong Yang
  • Yuan He
  • Xiaochun Cao
  • Qingming Huang

Traditional machine learning follows a close-set assumption that the training and test sets share the same label space. However, in many practical scenarios, it is inevitable that some test samples belong to unknown classes (open-set). To address this issue, Open-Set Recognition (OSR), whose goal is to make correct predictions on both close-set and open-set samples, has attracted rising attention. In this direction, the vast majority of the literature focuses on the pattern of open-set samples. However, how to evaluate model performance in this challenging task remains unsolved. In this paper, a systematic analysis reveals that most existing metrics are essentially inconsistent with the aforementioned goal of OSR: (1) for metrics extended from close-set classification, such as the open-set F-score, Youden's index, and Normalized Accuracy, a poor open-set prediction can escape a low performance score when paired with a superior close-set prediction; (2) novelty detection AUC, which measures the ranking performance between close-set and open-set samples, ignores close-set performance. To fix these issues, we propose a novel metric named OpenAUC. Compared with existing metrics, OpenAUC enjoys a concise pairwise formulation that evaluates open-set performance and close-set performance in a coupled manner. Further analysis shows that OpenAUC is free from the aforementioned inconsistencies. Finally, an end-to-end learning method is proposed to minimize the OpenAUC risk, and experimental results on popular benchmark datasets speak to its effectiveness.

NeurIPS Conference 2022 Conference Paper

OTKGE: Multi-modal Knowledge Graph Embeddings via Optimal Transport

  • Zongsheng Cao
  • Qianqian Xu
  • Zhiyong Yang
  • Yuan He
  • Xiaochun Cao
  • Qingming Huang

Multi-modal knowledge graph embeddings (KGE) have attracted increasing attention in learning representations of entities and relations for link prediction tasks. Different from previous uni-modal KGE approaches, multi-modal KGE can leverage expressive knowledge from a wealth of modalities (image, text, etc.), leading to more comprehensive representations of real-world entities. However, the critical challenge along this course lies in the fact that the multi-modal embedding spaces are usually heterogeneous. In this sense, direct fusion will destroy the inherent spatial structure of the different modal embeddings. To overcome this challenge, we revisit multi-modal KGE from a distributional alignment perspective and propose optimal transport knowledge graph embeddings (OTKGE). Specifically, we model the multi-modal fusion procedure as a transport plan that moves different modal embeddings to a unified space by minimizing the Wasserstein distance between the multi-modal distributions. Theoretically, we show that by minimizing the Wasserstein distance between the individual modalities and the unified embedding space, the final results are guaranteed to maintain consistency and comprehensiveness. Moreover, experimental results on well-established multi-modal knowledge graph completion benchmarks show that our OTKGE achieves state-of-the-art performance.
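For intuition about the quantity being minimized during fusion, the Wasserstein-1 distance has a closed form in one dimension: the optimal transport plan simply matches sorted samples. The sketch below is a generic illustration of that fact, not the paper's higher-dimensional solver.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples:
    sort both and average the absolute differences (sorted matching is the
    optimal transport plan in one dimension)."""
    return float(np.abs(np.sort(a) - np.sort(b)).mean())
```

Shifting one sample set by a constant shifts the distance by exactly that constant, while reordering a sample set leaves the distance unchanged, which is why the distance respects distributional rather than element-wise structure.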

NeurIPS Conference 2022 Conference Paper

Rethinking Image Restoration for Object Detection

  • Shangquan Sun
  • Wenqi Ren
  • Tao Wang
  • Xiaochun Cao

Although image restoration has achieved significant progress, its potential to assist object detectors in adverse imaging conditions has received insufficient attention. It has been reported that existing image restoration methods cannot improve object detection performance and sometimes even reduce it. To address this issue, we propose a targeted adversarial attack in the restoration procedure to boost object detection performance after restoration. Specifically, we present an ADAM-like adversarial attack to generate pseudo ground truth for restoration training. The resulting restored images are close to the original sharp images and, at the same time, lead to better object detection results. We conduct extensive experiments on image dehazing and low-light enhancement and show the superiority of our method over conventional training and other domain adaptation and multi-task methods. The proposed pipeline can be applied to all restoration methods and to both one- and two-stage detectors.

AAAI Conference 2022 Conference Paper

Self-Supervised Graph Neural Networks via Diverse and Interactive Message Passing

  • Liang Yang
  • Cheng Chen
  • Weixun Li
  • Bingxin Niu
  • Junhua Gu
  • Chuan Wang
  • Dongxiao He
  • Yuanfang Guo

By interpreting Graph Neural Networks (GNNs) as message passing from the spatial perspective, their success is attributed to Laplacian smoothing. However, this also leads to a serious over-smoothing issue when many layers are stacked. Recently, much effort has been devoted to overcoming this issue in semi-supervised learning. Unfortunately, it is more serious in the unsupervised node representation learning task due to the lack of supervision information. Thus, most unsupervised or self-supervised GNNs employ a one-layer GCN as the encoder. Essentially, the over-smoothing issue is caused by the over-simplification of existing message passing, which possesses two intrinsic limits: blind messages and uniform passing. In this paper, a novel Diverse and Interactive Message Passing (DIMP) is proposed for self-supervised learning by overcoming these limits. Firstly, to prevent the message from being blind and make it interactive between two connected nodes, the message is determined by both connected nodes instead of the attributes of one node alone. Secondly, to prevent the passing from being uniform and make it diverse over different attribute channels, different propagation weights are assigned to different elements in the message. To this end, a natural implementation of the message in DIMP is the element-wise product of the representations of two connected nodes. From the perspective of numerical optimization, the proposed DIMP is equivalent to performing overlapping community detection via expectation-maximization (EM). Both the objective function of the community detection and the convergence of the EM algorithm guarantee that DIMP can prevent the over-smoothing issue. Extensive evaluations on node-level and graph-level tasks demonstrate the superiority of DIMP in improving performance and overcoming the over-smoothing issue.
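
The element-wise-product message with per-channel propagation weights can be sketched in a few lines. The toy graph, the shared per-channel weight vector, and the mean aggregation are illustrative assumptions of this sketch, not the paper's exact layer:

```python
import numpy as np

def dimp_layer(H, A, w):
    """One DIMP-style pass: message on edge (i, j) = w * (H[i] * H[j]).

    H: (n, d) node representations; A: (n, n) symmetric adjacency;
    w: (d,) per-channel propagation weights (diverse passing).
    """
    n, d = H.shape
    out = np.zeros_like(H)
    for i in range(n):
        nbrs = np.nonzero(A[i])[0]
        for j in nbrs:
            out[i] += w * (H[i] * H[j])  # interactive: depends on both endpoints
        if len(nbrs):
            out[i] /= len(nbrs)          # mean aggregation (assumed)
    return out
```

Because the message is a product of both endpoints, it is no longer "blind" to the receiver, and the per-channel weights make the passing non-uniform across attribute dimensions.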

NeurIPS Conference 2022 Conference Paper

The Minority Matters: A Diversity-Promoting Collaborative Metric Learning Algorithm

  • Shilong Bao
  • Qianqian Xu
  • Zhiyong Yang
  • Yuan He
  • Xiaochun Cao
  • Qingming Huang

Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and Collaborative Filtering. Following the convention of RS, existing methods exploit a unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests. Under this setting, we argue that the unique user representation might induce preference bias, especially when the item category distribution is imbalanced. To address this issue, we propose a novel method called Diversity-Promoting Collaborative Metric Learning (DPCML), with the hope of considering the commonly ignored minority interests of the user. The key idea behind DPCML is to maintain multiple representations for each user in the system. Based on this embedding paradigm, user preference toward an item is aggregated from different embeddings by taking the minimum item-user distance among the user embedding set. Furthermore, we observe that the diversity of the embeddings for the same user also plays an essential role in the model. To this end, we propose a diversity control regularization term to better accommodate the multi-vector representation strategy. Theoretically, we show that DPCML generalizes well to unseen test data by tackling the challenge introduced by the minimum operation. Experiments over a range of benchmark datasets speak to the efficacy of DPCML.
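
The min-distance aggregation over a user's embedding set can be sketched as follows; the embedding sizes and the hinge-style form of the diversity term are illustrative assumptions, not the paper's exact regularizer:

```python
import numpy as np

def dpcml_distance(user_embs, item_emb):
    """Preference distance: minimum over the user's multiple embeddings."""
    return np.linalg.norm(user_embs - item_emb[None, :], axis=1).min()

def diversity_control(user_embs, margin=1.0):
    """Hinge penalty pushing a user's embeddings apart (assumed form)."""
    C = len(user_embs)
    pen = 0.0
    for i in range(C):
        for j in range(i + 1, C):
            pen += max(0.0, margin - np.linalg.norm(user_embs[i] - user_embs[j]))
    return pen
```

The min operation lets whichever embedding sits closest to an item "claim" it, so different embeddings can specialize to different interest categories, while the penalty discourages them from collapsing onto one point.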

AAAI Conference 2021 Conference Paper

Deep Partial Rank Aggregation for Personalized Attributes

  • Qianqian Xu
  • Zhiyong Yang
  • Zuyao Chen
  • Yangbangyan Jiang
  • Xiaochun Cao
  • Yuan Yao
  • Qingming Huang

In this paper, we study the problem of how to aggregate pairwise personalized attribute (PA) annotations (e.g., Shoes A is more comfortable than B) from different annotators on crowdsourcing platforms, an emerging topic that has gained increasing attention in recent years. Given the crowdsourced annotations, the majority of the traditional literature assumes that all the pairs in the collected dataset are distinguishable. However, this assumption is incompatible with how humans perceive attributes, since indistinguishable pairs are ubiquitous for annotators due to the limitations of human perception. To address this problem, we propose a novel deep prediction model that simultaneously detects indistinguishable pairs and aggregates ranking results for distinguishable pairs. First, we represent the pairwise annotations as a multi-graph. Based on this data structure, we propose an end-to-end partial ranking model which consists of a deep backbone architecture and a probabilistic model that captures the generative process of the partial rank annotations. Specifically, to recognize indistinguishable pairs, the proposed probabilistic model is equipped with an adaptive perception threshold: a pair is automatically flagged as indistinguishable when the absolute value of the score difference falls below the learned threshold. In our empirical studies, we perform a series of experiments on three real-world datasets: LFW-10, Shoes, and Sun. The corresponding results consistently show the superiority of our proposed model.
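
The perception-threshold rule described above reduces to a simple three-way decision; a minimal sketch (here the threshold is passed in as a fixed value, whereas the paper learns it):

```python
def partial_rank(score_a, score_b, tau):
    """Return +1 (A preferred), -1 (B preferred), or 0 (indistinguishable)
    under a perception threshold tau."""
    diff = score_a - score_b
    if abs(diff) < tau:
        return 0               # score gap below threshold: ignore the pair's order
    return 1 if diff > 0 else -1
```

A larger `tau` declares more pairs indistinguishable; learning it adaptively lets the model match each dataset's annotation noise.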

NeurIPS Conference 2021 Conference Paper

Diverse Message Passing for Attribute with Heterophily

  • Liang Yang
  • Mengzhe Li
  • Liyang Liu
  • Bingxin Niu
  • Chuan Wang
  • Xiaochun Cao
  • Yuanfang Guo

Most existing GNNs can be modeled via the Uniform Message Passing framework. This framework considers all the attributes of each node in their entirety, shares uniform propagation weights along each edge, and focuses on uniform weight learning. The design of this framework possesses two prerequisites: the simplification of homophily and heterophily to a node-level property, and the ignorance of attribute differences. Unfortunately, different attributes possess diverse characteristics. In this paper, the network homophily rate, defined with respect to the node labels, is extended to an attribute homophily rate by taking the attributes as weak labels. Based on this attribute homophily rate, we propose a Diverse Message Passing (DMP) framework, which specifies a propagation weight for every attribute on each edge. Besides, we propose two specific strategies to significantly reduce the computational complexity of DMP and prevent overfitting. An investigation of the spectral characteristics shows that existing spectral GNNs are actually equivalent to a degenerated version of DMP. From the perspective of numerical optimization, we provide a theoretical analysis demonstrating DMP's powerful representation ability and its ability to alleviate the over-smoothing issue. Evaluations on various real networks demonstrate the superiority of our DMP in handling networks with heterophily and alleviating the over-smoothing issue, compared to existing state-of-the-art methods.
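
Treating each binary attribute as a weak label, a per-attribute homophily rate can be computed as the fraction of edges whose endpoints agree on that attribute. This is a hedged sketch of the idea, not the paper's exact definition:

```python
import numpy as np

def attribute_homophily(A, X):
    """Per-attribute homophily rate with binary attributes as weak labels.

    A: (n, n) symmetric adjacency; X: (n, d) binary attribute matrix.
    Returns a (d,) vector: fraction of edges whose endpoints agree per attribute.
    """
    ii, jj = np.nonzero(np.triu(A, k=1))  # each undirected edge counted once
    agree = (X[ii] == X[jj])              # (n_edges, d) agreement indicators
    return agree.mean(axis=0)
```

Different attribute channels can score very differently on the same graph, which is exactly the observation that motivates channel-specific propagation weights.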

AAAI Conference 2021 Conference Paper

Dual Quaternion Knowledge Graph Embeddings

  • Zongsheng Cao
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

In this paper, we study the problem of learning representations of entities and relations in the knowledge graph for the link prediction task. Our idea is based on the observation that the vast majority of related work models the relation only as a single geometric operation such as translation or rotation, which limits the representation power of the underlying models and makes it harder to match the complicated relations that exist in real-world datasets. To embrace a richer set of relational information, we propose a new method called dual quaternion knowledge graph embeddings (DualE), which introduces dual quaternions into knowledge graph embeddings. Specifically, a dual quaternion behaves like a “complex quaternion” with its real and imaginary parts both being quaternions. At the core of DualE lies a specific design of dual-quaternion-based multiplication, which universally models relations as compositions of a series of translation and rotation operations. The major merits of DualE are three-fold: 1) it is the first unified framework embracing both rotation-based and translation-based models in 3D space, 2) it expands the embedding space to the dual quaternion space with a more intuitive physical and geometric interpretation, 3) it satisfies the key patterns and the multiple-relations pattern of relational representation learning. Experimental results on four real-world datasets demonstrate the effectiveness of our DualE method.
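
A dual quaternion multiplies like a "complex quaternion" over the dual unit ε with ε² = 0, so the product of r₁ + d₁ε and r₂ + d₂ε is r₁r₂ + (r₁d₂ + d₁r₂)ε. A minimal sketch of that algebra (the (w, x, y, z) storage layout is an assumption of this illustration, and this is only the multiplication, not the DualE scoring function):

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def dq_mul(p, q):
    """Dual quaternion product: (r1 + d1*eps)(r2 + d2*eps)
    = r1*r2 + (r1*d2 + d1*r2)*eps, since eps^2 = 0."""
    (r1, d1), (r2, d2) = p, q
    return qmul(r1, r2), qmul(r1, d2) + qmul(d1, r2)
```

The real part carries the rotation while the dual part carries the translation, which is why a single dual-quaternion multiplication composes both operations at once.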

IJCAI Conference 2021 Conference Paper

Heterogeneous Graph Information Bottleneck

  • Liang Yang
  • Fan Wu
  • Zichen Zheng
  • Bingxin Niu
  • Junhua Gu
  • Chuan Wang
  • Xiaochun Cao
  • Yuanfang Guo

Most attempts to extend Graph Neural Networks (GNNs) to Heterogeneous Information Networks (HINs) implicitly make the direct assumption that the multiple homogeneous attributed networks induced by different meta-paths are complementary. Doubts about this complementarity hypothesis motivate an alternative assumption of consensus. That is, the aggregated node attributes shared by multiple homogeneous attributed networks are essential for node representations, while the specific ones in each homogeneous attributed network should be discarded. In this paper, a novel Heterogeneous Graph Information Bottleneck (HGIB) is proposed to implement the consensus hypothesis in an unsupervised manner. To this end, the information bottleneck (IB) is extended to unsupervised representation learning by leveraging a self-supervision strategy. Specifically, HGIB simultaneously maximizes the mutual information between one homogeneous network and the representation learned from another homogeneous network, while minimizing the mutual information between the specific information contained in one homogeneous network and the representation learned from that network. Model analysis reveals that the two extreme cases of HGIB correspond to the supervised heterogeneous GNN and the infomax on a homogeneous graph, respectively. Extensive experiments on real datasets demonstrate that the consensus-based unsupervised HGIB significantly outperforms most semi-supervised SOTA methods based on the complementarity assumption.

AAAI Conference 2021 Conference Paper

What to Select: Pursuing Consistent Motion Segmentation from Multiple Geometric Models

  • Yangbangyan Jiang
  • Qianqian Xu
  • Ke Ma
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

Motion segmentation aims at separating the motions of different moving objects in a video sequence. Facing complicated real-world scenes, recent studies reveal that combining multiple geometric models is a more effective way than employing just a single one. This motivates a new wave of model-fusion based motion segmentation methods. However, the vast majority of models of this kind merely seek consensus in spectral embeddings. We argue that a simple consensus might be insufficient to filter out harmful information that is either unreliable or semantically unrelated to the segmentation task. Therefore, how to automatically select valuable patterns across multiple models should be regarded as a key challenge here. In this paper, we present a novel geometric-model-fusion framework for motion segmentation, which aims at constructing a consistent affinity matrix across all the geometric models. Specifically, it incorporates the structural information shared by the affinity matrices to select semantically consistent entries. Meanwhile, a multiplicative decomposition scheme is adopted to ensure structural consistency among multiple affinities. To solve this problem, an alternating optimization scheme is proposed, together with a proof of its global convergence. Experiments on four real-world benchmarks show the superiority of the proposed method.

ICML Conference 2021 Conference Paper

When All We Need is a Piece of the Pie: A Generic Framework for Optimizing Two-way Partial AUC

  • Zhiyong Yang
  • Qianqian Xu
  • Shilong Bao
  • Yuan He
  • Xiaochun Cao
  • Qingming Huang

The Area Under the ROC Curve (AUC) is a crucial metric for machine learning, which evaluates the average performance over all possible True Positive Rates (TPRs) and False Positive Rates (FPRs). Based on the knowledge that a skillful classifier should simultaneously embrace a high TPR and a low FPR, we turn to study a more general variant called Two-way Partial AUC (TPAUC), where only the region with $\mathsf{TPR} \ge \alpha, \mathsf{FPR} \le \beta$ is included in the area. Moreover, recent work shows that the TPAUC is essentially inconsistent with the existing Partial AUC metrics, where only the FPR range is restricted, opening a new problem of seeking solutions to leverage high TPAUC. Motivated by this, we present in this paper the first attempt to optimize this new metric. The critical challenge along this course lies in the difficulty of performing gradient-based optimization with end-to-end stochastic training, even with a proper choice of surrogate loss. To address this issue, we propose a generic framework to construct surrogate optimization problems, which supports efficient end-to-end training with deep learning. Moreover, our theoretical analyses show that: 1) the objective function of the surrogate problems will achieve an upper bound of the original problem under mild conditions, and 2) optimizing the surrogate problems leads to good generalization performance in terms of TPAUC with high probability. Finally, empirical studies over several benchmark datasets speak to the efficacy of our framework.
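
One way to see the metric itself (not the authors' surrogate objective) is to clip the empirical ROC staircase to the region TPR ≥ α, FPR ≤ β and integrate the excess area; the un-normalized form and the toy scores below are assumptions of this sketch:

```python
import numpy as np

def tpauc(y, s, alpha=0.5, beta=0.5):
    """Un-normalized empirical two-way partial AUC: area under the ROC
    curve restricted to TPR >= alpha and FPR <= beta."""
    order = np.argsort(-np.asarray(s))
    y = np.asarray(y)[order]
    P, N = y.sum(), (1 - y).sum()
    tpr = np.concatenate(([0.0], np.cumsum(y) / P))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / N))
    area = 0.0
    for i in range(1, len(tpr)):
        df = min(fpr[i], beta) - min(fpr[i - 1], beta)  # FPR step clipped at beta
        area += df * max(tpr[i] - alpha, 0.0)           # height above the TPR floor
    return area
```

For a perfect ranking the clipped region is the full (1 − α)·β rectangle, which is why the metric rewards classifiers that are strong in the "hard" corner of the ROC plane.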

AAAI Conference 2021 Conference Paper

Why Do Attributes Propagate in Graph Convolutional Neural Networks?

  • Liang Yang
  • Chuan Wang
  • Junhua Gu
  • Xiaochun Cao
  • Bingxin Niu

Much effort has been devoted to enhancing Graph Convolutional Networks from the perspective of propagation, under the philosophy that “propagation is the essence of GCNNs”. Unfortunately, its adverse effect is over-smoothing, which makes performance drop dramatically. To prevent over-smoothing, many variants have been presented. However, the propagation perspective cannot provide an intuitive and unified interpretation of their effect on preventing over-smoothing. In this paper, we aim to provide a novel answer to the question “Why do attributes propagate in GCNNs?”, which not only captures the essence of over-smoothing, but also illustrates why GCN extensions, including multi-scale GCN and GCN with initial residual, can improve performance. To this end, an intuitive Graph Representation Learning (GRL) framework is presented. GRL simply constrains the node representation to be similar to the original attributes, and encourages connected nodes to possess similar representations (pairwise constraint). Based on the proposed GRL, existing GCN and its extensions can be shown to be different numerical optimization algorithms, such as gradient descent, for our proposed GRL framework. Inspired by the superiority of conjugate gradient descent over common gradient descent, a novel Graph Conjugate Convolutional (GCC) network is presented to approximate the solution to GRL with fast convergence. Specifically, GCC adopts the information obtained in the last layer, which can be represented as the difference between the input and output of the last layer, as the input to the next layer. Extensive experiments demonstrate the superior performance of GCC.
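
A GRL-style objective and its gradient-descent reading can be sketched in a few lines: minimize attribute fidelity plus a Laplacian pairwise-smoothness term, so each gradient step mixes the pull toward the original attributes with neighbor averaging. The Laplacian form, step size, and toy graph below are assumptions of this illustration:

```python
import numpy as np

def grl_gradient_descent(X, A, lam=1.0, eta=0.1, steps=200):
    """Minimize ||H - X||_F^2 + lam * tr(H^T L H) by gradient descent,
    where L = D - A is the unnormalized graph Laplacian.

    Each step H <- H - eta * (2(H - X) + 2*lam*L@H) resembles one round
    of attribute propagation with an attribute-fidelity anchor.
    """
    L = np.diag(A.sum(1)) - A
    H = X.copy()
    for _ in range(steps):
        grad = 2 * (H - X) + 2 * lam * (L @ H)
        H = H - eta * grad
    return H
```

The fixed point solves (I + λL)H = X: the fidelity term stops representations from collapsing to a constant, which is the over-smoothing failure mode of pure propagation.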

IJCAI Conference 2020 Conference Paper

JANE: Jointly Adversarial Network Embedding

  • Liang Yang
  • Yuexue Wang
  • Junhua Gu
  • Chuan Wang
  • Xiaochun Cao
  • Yuanfang Guo

Motivated by the capability of Generative Adversarial Networks to explore the latent semantic space and capture semantic variations in the data distribution, adversarial learning has been adopted in network embedding to improve robustness. However, this important ability is lost in existing adversarially regularized network embedding methods, because their embedding results are directly compared to samples drawn from a perturbation (Gaussian) distribution without any rectification from real data. To overcome this vital issue, a novel Jointly Adversarial Network Embedding (JANE) framework is proposed to jointly distinguish the real and fake combinations of the embeddings, topology information and node features. JANE contains three pluggable components: an Embedding module, a Generator module and a Discriminator module. The overall objective function of JANE is defined in a min-max form, which can be optimized via alternating stochastic gradient descent. Extensive experiments demonstrate the remarkable superiority of the proposed JANE on link prediction (3% gains in both AUC and AP) and node clustering (5% gain in F1 score).

AAAI Conference 2020 Conference Paper

Who Likes What? — SplitLBI in Exploring Preferential Diversity of Ratings

  • Qianqian Xu
  • Jiechao Xiong
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang
  • Yuan Yao

In recent years, learning user preferences has received significant attention. A shortcoming of existing learning-to-rank work is that it does not take into account the multilevel hierarchies from social choice to individuals. In this paper, we propose a multi-level model which learns both the common preference or utility function over the population, based on features of the alternatives to be compared, and preferential diversity functions conditioned on user categories. Such a multi-level model enables us to simultaneously learn a coarse-grained social preference function together with fine-grained personalized diversity. It provides prediction power for the choices of new users on new alternatives. The key algorithm in this paper is based on the Split Linearized Bregman Iteration (SplitLBI) algorithm, which generates a dynamic path from the common utility to personalized preferential diversity at different levels of sparsity on personalization. A synchronized parallel version of SplitLBI is proposed to meet the needs of fast analysis of large-scale data. The validity of the methodology is supported by experiments with both simulated and real-world datasets, such as movie and dining restaurant ratings, which provide coarse-to-fine-grained preference learning.

NeurIPS Conference 2019 Conference Paper

DM2C: Deep Mixed-Modal Clustering

  • Yangbangyan Jiang
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang

Data exhibited with multiple modalities are ubiquitous in real-world clustering tasks. Most existing methods, however, pose the strong assumption that pairing information across modalities is available for all instances. In this paper, we consider a more challenging task where each instance is represented in only one modality, which we call mixed-modal data. Without any extra pairing supervision across modalities, it is difficult to find a universal semantic space for all of them. To tackle this problem, we present an adversarial learning framework for clustering with mixed-modal data. Instead of transforming all the samples into a joint modality-independent space, our framework learns the mappings across individual modal spaces by virtue of cycle-consistency. Through these mappings, we can easily unify all the samples into a single modal space and perform the clustering. Evaluations on several real-world mixed-modal datasets demonstrate the superiority of our proposed framework.

NeurIPS Conference 2019 Conference Paper

Generalized Block-Diagonal Structure Pursuit: Learning Soft Latent Task Assignment against Negative Transfer

  • Zhiyong Yang
  • Qianqian Xu
  • Yangbangyan Jiang
  • Xiaochun Cao
  • Qingming Huang

In multi-task learning, a major challenge springs from a notorious issue known as negative transfer, which refers to the phenomenon that sharing knowledge with dissimilar and hard tasks often results in worsened performance. To circumvent this issue, we propose a novel multi-task learning method which simultaneously learns latent task representations and a block-diagonal Latent Task Assignment Matrix (LTAM). Different from most previous work, pursuing the block-diagonal structure of LTAM (assigning latent tasks to output tasks) alleviates negative transfer by collaboratively grouping latent tasks and output tasks such that inter-group knowledge transfer and sharing are suppressed. This goal is challenging, since 1) our notion of the block-diagonal property extends the traditional notion for square matrices, where the $i$-th row and the $i$-th column represent the same concept; 2) marginal constraints on rows and columns are also required to avoid isolated latent/output tasks. Facing such challenges, we propose a novel regularizer by means of an equivalent spectral condition realizing this generalized block-diagonal property. Practically, we provide a relaxation scheme which improves the flexibility of the model. With the objective function given, we then propose an alternating optimization method, which not only shows how negative transfer is alleviated in our method but also reveals an interesting connection between our method and the optimal transport problem. Finally, the method is demonstrated on a simulated dataset and three real-world benchmark datasets, and further applied to personalized attribute predictions.

NeurIPS Conference 2019 Conference Paper

iSplit LBI: Individualized Partial Ranking with Ties via Split LBI

  • Qianqian Xu
  • Xinwei Sun
  • Zhiyong Yang
  • Xiaochun Cao
  • Qingming Huang
  • Yuan Yao

Due to the inherent uncertainty of data, the problem of predicting partial rankings from pairwise comparison data with ties has attracted increasing interest in recent years. However, in real-world scenarios, different individuals often hold distinct preferences, thus it might be misleading to merely look at a global partial ranking while ignoring personal diversity. In this paper, instead of learning a global ranking that agrees with the consensus, we pursue the tie-aware partial ranking from an individualized perspective. Particularly, we formulate a unified framework which not only can be used for individualized partial ranking prediction, but can also be helpful for abnormal user selection. This is realized by a variable-splitting-based algorithm called iSplit LBI. Specifically, our algorithm generates a sequence of estimations along a regularization path, where both the hyperparameters and model parameters are updated. At each step of the path, the parameters can be decomposed into three orthogonal parts, namely, abnormal signals, personalized signals and random noise. The abnormal signals serve the purpose of abnormal user selection, while the abnormal signals and personalized signals together are mainly responsible for user partial ranking prediction. Extensive experiments on simulated and real-world datasets demonstrate that our new approach significantly outperforms state-of-the-art alternatives.

AAAI Conference 2019 Conference Paper

Learning Personalized Attribute Preference via Multi-Task AUC Optimization

  • Zhiyong Yang
  • Qianqian Xu
  • Xiaochun Cao
  • Qingming Huang

Traditionally, most existing attribute learning methods are trained based on the consensus of annotations aggregated from a limited number of annotators. However, the consensus might fail, especially when a wide spectrum of annotators with different interests and comprehension of the attribute words is involved. In this paper, we develop a novel multi-task method to understand and predict personalized attribute annotations. Regarding the attribute preference learning for each annotator as a specific task, we first propose a multi-level task parameter decomposition to capture the evolution from a highly popular opinion of the mass to highly personalized choices that are special to each person. Meanwhile, for personalized learning methods, ranking prediction is much more important than accurate classification. This motivates us to employ an Area Under the ROC Curve (AUC) based loss function to improve our model. On top of the AUC-based loss, we propose an efficient method to evaluate the loss and gradients. Theoretically, we derive a novel closed-form solution for one of our non-convex subproblems, which leads to provable convergence behavior. Furthermore, we also provide a generalization bound to guarantee reasonable performance. Finally, empirical analysis consistently speaks to the efficacy of our proposed method.

AAAI Conference 2019 Conference Paper

Less but Better: Generalization Enhancement of Ordinal Embedding via Distributional Margin

  • Ke Ma
  • Qianqian Xu
  • Zhiyong Yang
  • Xiaochun Cao

In the absence of prior knowledge, ordinal embedding methods obtain new representations for items in a low-dimensional Euclidean space via a set of quadruple-wise comparisons. These ordinal comparisons often come from human annotators, and sufficient comparisons induce the success of classical approaches. However, collecting a large number of labeled comparisons is known to be a hard task, and most of the existing work pays little attention to the generalization ability with insufficient samples. Meanwhile, recent progress in large margin theory discloses that, rather than just maximizing the minimum margin, both the margin mean and variance, which characterize the margin distribution, are more crucial to the overall generalization performance. To address the issue of insufficient training samples, we propose a margin distribution learning paradigm for ordinal embedding, entitled Distributional Margin based Ordinal Embedding (DMOE). Precisely, we first define the margin for the ordinal embedding problem. Secondly, we formulate a concise objective function which avoids maximizing the margin mean and minimizing the margin variance directly but exhibits a similar effect. Moreover, an Augmented Lagrange Multiplier based algorithm is customized to seek the optimal solution of DMOE effectively. Experimental studies on both simulated and real-world datasets are provided to show the effectiveness of the proposed algorithm.

AAAI Conference 2019 Conference Paper

Robust Ordinal Embedding from Contaminated Relative Comparisons

  • Ke Ma
  • Qianqian Xu
  • Xiaochun Cao

Existing ordinal embedding methods usually follow a two-stage routine: outlier detection is first employed to pick out the inconsistent comparisons; then an embedding is learned from the clean data. However, learning in a multi-stage manner is well known to suffer from sub-optimal solutions. In this paper, we propose a unified framework to jointly identify the contaminated comparisons and derive reliable embeddings. The merits of our method are three-fold: (1) by virtue of the proposed unified framework, the sub-optimality of traditional methods is largely alleviated; (2) the proposed method is aware of global inconsistency by minimizing a corresponding cost, while traditional methods only involve local inconsistency; (3) instead of considering the nuclear norm heuristic, we adopt an exact solution for the rank equality constraint. Our studies are supported by experiments with both simulated examples and real-world data. The proposed framework provides a promising tool for robust ordinal embedding from contaminated comparisons.

IJCAI Conference 2019 Conference Paper

Topology Optimization based Graph Convolutional Network

  • Liang Yang
  • Zesheng Kang
  • Xiaochun Cao
  • Di Jin
  • Bo Yang
  • Yuanfang Guo

In the past few years, semi-supervised node classification in attributed networks has developed rapidly. Inspired by the success of deep learning, researchers have adopted convolutional neural networks to develop Graph Convolutional Networks (GCN), which achieve surprising classification accuracy by considering topological information and employing a fully connected network (FCN). However, the given network topology may also induce performance degradation if it is directly employed in classification, because it may possess high sparsity and certain noise. Besides, the lack of learnable filters in GCN also limits the performance. In this paper, we propose a novel Topology Optimization based Graph Convolutional Network (TO-GCN) to fully utilize the potential information by jointly refining the network topology and learning the parameters of the FCN. According to our derivations, TO-GCN is more flexible than GCN, in which the filters are fixed and only the classifier can be updated during the learning process. Extensive experiments on real attributed networks demonstrate the superiority of the proposed TO-GCN against state-of-the-art approaches.

IJCAI Conference 2019 Conference Paper

Transferable Adversarial Attacks for Image and Video Object Detection

  • Xingxing Wei
  • Siyuan Liang
  • Ning Chen
  • Xiaochun Cao

Identifying adversarial examples is beneficial for understanding deep networks and developing robust models. However, existing attack methods for image object detection have two limitations: weak transferability, i.e., the generated adversarial examples often have a low success rate in attacking other kinds of detection methods, and high computation cost, i.e., they need much time to deal with video data, where many frames need to be polluted. To address these issues, we present a generative method to obtain adversarial images and videos, thereby significantly reducing the processing time. To enhance transferability, we manipulate the feature maps extracted by a feature network, which usually constitutes the basis of object detectors. Our method is based on the Generative Adversarial Network (GAN) framework, where we combine a high-level class loss and a low-level feature loss to jointly train the adversarial example generator. Experimental results on the PASCAL VOC and ImageNet VID datasets show that our method efficiently generates image and video adversarial examples, and more importantly, these adversarial examples have better transferability, and are therefore able to simultaneously attack two kinds of representative object detection models: proposal-based models like Faster-RCNN and regression-based models like SSD.

IJCAI Conference 2018 Conference Paper

3-in-1 Correlated Embedding via Adaptive Exploration of the Structure and Semantic Subspaces

  • Liang Yang
  • Yuanfang Guo
  • Di Jin
  • Huazhu Fu
  • Xiaochun Cao

Combinational network embedding, which learns the node representation by exploring both topological and non-topological information, has become popular due to the fact that the two types of information complement each other. Most of the existing methods either consider the topological and non-topological information as being aligned or possess predetermined preferences during the embedding process. Unfortunately, previous methods fail to either explicitly describe the correlations between topological and non-topological information or adaptively weight their impacts. To address these issues, three new assumptions are proposed to better describe the embedding space and its properties. With the proposed assumptions, nodes, communities and topics are mapped into one embedding space. A novel generative model is proposed to formulate the generation process of the network and content from the embeddings, with respect to the Bayesian framework. The proposed model automatically leans toward the more discriminative information. The embedding result can be obtained by maximizing the posterior distribution using variational inference and the reparameterization trick. Experimental results indicate that the proposed method gives superior performance compared to state-of-the-art methods when a variety of real-world networks are analyzed.

AAAI Conference 2018 Conference Paper

Audio Visual Attribute Discovery for Fine-Grained Object Recognition

  • Hua Zhang
  • Xiaochun Cao
  • Rui Wang

Current progress on fine-grained recognition mainly focuses on learning discriminative feature representations by introducing visual supervision, e.g., part labels. However, obtaining accurate annotations is time-consuming and requires professional knowledge. Different from these existing methods based on visual supervision, in this paper we introduce a novel feature, named audio visual attributes, obtained by discovering the correlations between visual and audio representations. Specifically, our unified framework is trained with video-level category labels and consists of two important modules, the encoder module and the attribute discovery module, which encode the image and audio into vectors and learn the correlations between audio and images, respectively. In the encoder module, we present two types of feed-forward convolutional neural networks for the image and audio modalities, while an attention-driven framework based on a recurrent neural network is developed to generate the audio visual attribute representation. Thus, our proposed architecture can run end-to-end at inference. We apply our models to the problem of fine-grained bird recognition on the CUB200-2011 benchmark. The experimental results demonstrate that, with the help of audio visual attributes, we achieve performance superior or comparable to that of strongly supervised approaches on bird recognition.
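
The attention-driven discovery step can be caricatured as image-conditioned attention over audio frames (purely illustrative names and shapes; the paper's module is recurrent, which this one-shot sketch omits):

```python
import numpy as np

def attention_fuse(audio_frames, image_vec):
    # Score each audio frame against the image feature, softmax the scores,
    # and return the attention-weighted sum of the audio frames.
    scores = audio_frames @ image_vec
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ audio_frames
```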

AAAI Conference 2018 Conference Paper

Consistent and Specific Multi-View Subspace Clustering

  • Shirui Luo
  • Changqing Zhang
  • Wei Zhang
  • Xiaochun Cao

Multi-view clustering has attracted intensive attention due to the effectiveness of exploiting multiple views of data. However, most existing multi-view clustering methods only aim to explore the consistency or to enhance the diversity of different views. In this paper, we propose a novel multi-view subspace clustering method (CSMSC), where consistency and specificity are jointly exploited for subspace representation learning. We formulate the multi-view self-representation property using a shared consistent representation and a set of specific representations, which better fits real-world datasets. Specifically, consistency models the common properties among all views, while specificity captures the inherent difference in each view. In addition, to optimize the non-convex problem, we introduce a convex relaxation and develop an alternating optimization algorithm to recover the corresponding data representations. Experimental evaluations on four benchmark datasets demonstrate that the proposed approach achieves better performance than several state-of-the-art methods.
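
The shared-plus-specific self-representation can be sketched as a reconstruction-error check (a simplified view with illustrative names; the regularizers and the clustering step are omitted):

```python
import numpy as np

def csmsc_reconstruction_error(views, C, specifics):
    # Each view X_v is self-represented as X_v @ (C + S_v): C models the
    # consistency shared by all views, S_v the specificity of view v.
    return sum(np.linalg.norm(X - X @ (C + S))**2
               for X, S in zip(views, specifics))
```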

NeurIPS Conference 2018 Conference Paper

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation

  • Wenqi Ren
  • Jiawei Zhang
  • Lin Ma
  • Jinshan Pan
  • Xiaochun Cao
  • Wangmeng Zuo
  • Wei Liu
  • Ming-Hsuan Yang

In this paper, we present a deep convolutional neural network to capture the inherent properties of image degradation, which can handle different kernels and saturated pixels in a unified framework. The proposed neural network is motivated by the low-rank property of pseudo-inverse kernels. We first compute a generalized low-rank approximation for a large number of blur kernels, and then use separable filters to initialize the convolutional parameters in the network. Our analysis shows that the estimated decomposed matrices contain the most essential information of the input kernel, which enables the proposed network to handle various blurs in a unified framework and generate high-quality deblurring results. Experimental results on benchmark datasets with noise and saturated pixels demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
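
The separable-filter initialization rests on a low-rank decomposition of the kernel; a generic SVD-based sketch (with an illustrative kernel, not the paper's pseudo-inverse computation):

```python
import numpy as np

def separable_filters(kernel, rank):
    # A rank-r SVD writes the 2-D kernel as a sum of r outer products,
    # i.e., r pairs of 1-D (separable) column/row filters.
    U, s, Vt = np.linalg.svd(kernel)
    cols = U[:, :rank] * np.sqrt(s[:rank])           # 1-D column filters
    rows = np.sqrt(s[:rank])[:, None] * Vt[:rank]    # 1-D row filters
    return cols, rows, cols @ rows                   # factors and approximation
```

For a truly separable kernel (e.g., a Gaussian), rank 1 already reconstructs it exactly.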

AAAI Conference 2018 Conference Paper

From Common to Special: When Multi-Attribute Learning Meets Personalized Opinions

  • Zhiyong Yang
  • Qianqian Xu
  • Xiaochun Cao
  • Qingming Huang

Visual attributes, which refer to human-labeled semantic annotations, have gained increasing popularity in a wide range of real-world applications. Generally, the existing attribute learning methods fall into two categories: one focuses on learning user-specific labels separately for different attributes, while the other focuses on learning crowd-sourced global labels jointly for multiple attributes. However, both categories ignore the joint effect of the two mentioned factors: the personal diversity with respect to the global consensus, and the intrinsic correlation among multiple attributes. To overcome this challenge, we propose a novel model to learn user-specific predictors across multiple attributes. In our proposed model, the diversity of personalized opinions and the intrinsic relationship among multiple attributes are unified in a common-to-special manner. To this end, we adopt a three-component decomposition. Specifically, our model integrates a common cognition factor, an attribute-specific bias factor and a user-specific bias factor. Meanwhile, Lasso and group Lasso penalties are adopted to enable efficient feature selection. Furthermore, theoretical analysis is conducted to show that our proposed method reaches reasonable performance. Finally, the empirical study carried out in this paper demonstrates the effectiveness of our proposed method.
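
The three-component decomposition reads, for user u and attribute a, as a predictor built from a common factor plus two biases; a tiny sketch (names illustrative, Lasso/group-Lasso penalties omitted):

```python
import numpy as np

def personalized_predictor(common, attr_bias, user_bias, a, u):
    # Common cognition factor + attribute-specific bias + user-specific bias.
    return common + attr_bias[a] + user_bias[u]

def predict(x, common, attr_bias, user_bias, a, u):
    # Linear score of features x under the composed predictor.
    return float(x @ personalized_predictor(common, attr_bias, user_bias, a, u))
```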

AAAI Conference 2018 Conference Paper

Multi-Facet Network Embedding: Beyond the General Solution of Detection and Representation

  • Liang Yang
  • Yuanfang Guo
  • Xiaochun Cao

In network analysis, community detection and network embedding are two important topics. Community detection tends to obtain the most noticeable partition, while network embedding aims at seeking node representations which contain as many diverse properties as possible. We observe that the current community detection and network embedding problems are being resolved by a general solution, i.e., "maximizing the consistency between similar nodes while maximizing the distance between the dissimilar nodes". This general solution only exploits the most noticeable structure (facet) of the network, which effectively satisfies the demands of community detection. Unfortunately, most of the specific embedding algorithms, which are developed from the general solution, cannot achieve the goal of network embedding by exploring only one facet of the network. To improve the general solution for better modeling the real network, we propose a novel network embedding method, Multi-facet Network Embedding (MNE), to capture the multiple facets of the network. MNE learns multiple embeddings simultaneously, with the Hilbert-Schmidt Independence Criterion (HSIC) serving as a diversity constraint. To efficiently solve the optimization problem, we propose a Binary HSIC with linear complexity and solve the MNE objective function by adopting the Augmented Lagrange Multiplier (ALM) method. The overall complexity is linear in the scale of the network. Extensive results demonstrate that MNE is efficient and outperforms the state-of-the-art network embedding methods.
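
The HSIC diversity term can be illustrated with the standard biased empirical estimator (linear kernels for brevity; the Binary HSIC variant proposed in the paper differs):

```python
import numpy as np

def hsic(X, Y):
    # Biased empirical HSIC with linear kernels: near zero when one argument
    # carries no (centered) variation, positive when X and Y co-vary.
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(X @ X.T @ H @ Y @ Y.T @ H) / (n - 1)**2
```

Penalizing HSIC between pairs of embeddings pushes them toward statistically independent, hence diverse, facets.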

AAAI Conference 2018 Conference Paper

Stochastic Non-Convex Ordinal Embedding With Stabilized Barzilai-Borwein Step Size

  • Ke Ma
  • Jinshan Zeng
  • Jiechao Xiong
  • Qianqian Xu
  • Xiaochun Cao
  • Wei Liu
  • Yuan Yao

Learning representation from relative similarity comparisons, often called ordinal embedding, has gained increasing attention in recent years. Most of the existing methods are batch methods designed mainly on the basis of convex optimization, e.g., the projected gradient descent method. However, they are generally time-consuming because the singular value decomposition (SVD) is commonly required during the update, especially when the data size is very large. To overcome this challenge, we propose a stochastic algorithm called SVRG-SBB, which has the following features: (a) it is SVD-free via dropping convexity, with good scalability through the use of a stochastic algorithm, i.e., stochastic variance reduced gradient (SVRG), and (b) it has an adaptive step size choice via a new stabilized Barzilai-Borwein (SBB) method, since the original version designed for convex problems might fail for the considered stochastic non-convex optimization problem. Moreover, we show that the proposed algorithm converges to a stationary point at a rate O(1/T) in our setting, where T is the number of total iterations. Numerous simulations and real-world data experiments demonstrate the effectiveness of the proposed algorithm in comparison with state-of-the-art methods; in particular, it attains much lower computational cost with good prediction performance.
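
One plausible form of a stabilized BB step is sketched below (the exact stabilization used in the paper may differ; the `eps` term here simply keeps the step bounded when the curvature estimate s'y vanishes or turns negative, which is the failure mode in non-convex problems):

```python
import numpy as np

def sbb_step(x_prev, x_curr, g_prev, g_curr, eps=1e-8):
    s = x_curr - x_prev          # iterate difference
    y = g_curr - g_prev          # gradient difference
    # Classic BB1 step is s's / s'y; taking |s'y| and adding eps*s's to the
    # denominator keeps the step positive and bounded.
    ss = float(np.dot(s, s))
    return ss / (abs(float(np.dot(s, y))) + eps * ss)
```

On a quadratic with unit curvature the stabilized step reduces to the usual BB step of 1, which the test below checks.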

ICML Conference 2016 Conference Paper

False Discovery Rate Control and Statistical Quality Assessment of Annotators in Crowdsourced Ranking

  • Qianqian Xu 0001
  • Jiechao Xiong
  • Xiaochun Cao
  • Yuan Yao 0011

With the rapid growth of crowdsourcing platforms, it has become easy and relatively inexpensive to collect a dataset labeled by multiple annotators in a short time. However, due to the lack of control over the quality of the annotators, some abnormal annotators may be affected by position bias, which can potentially degrade the quality of the final consensus labels. In this paper we introduce a statistical framework to model and detect annotators' position bias while controlling the false discovery rate (FDR), i.e., the expected fraction of false discoveries among all discoveries, without prior knowledge of the number of biased annotators, so as to assure that most of the discoveries are indeed true and replicable. The key technical development relies on new knockoff filters adapted to our problem and new algorithms based on the Inverse Scale Space dynamics, whose discretization is potentially suitable for large-scale crowdsourcing data analysis. Our studies are supported by experiments with both simulated examples and real-world data. The proposed framework provides a useful tool for quantitatively studying annotators' abnormal behavior in crowdsourcing.

IJCAI Conference 2016 Conference Paper

Makeup Like a Superstar: Deep Localized Makeup Transfer Network

  • Si Liu
  • Xinyu Ou
  • Ruihe Qian
  • Wei Wang
  • Xiaochun Cao

In this paper, we propose a novel Deep Localized Makeup Transfer Network to automatically recommend the most suitable makeup for a female face and synthesize the makeup on it. Given a before-makeup face, the most suitable makeup is determined automatically. Then, both the before-makeup face and the reference face are fed into the proposed Deep Transfer Network to generate the after-makeup face. Our end-to-end makeup transfer network has several nice properties: (1) complete functionality, including foundation, lip gloss, and eye shadow transfer; (2) cosmetic-specific: different cosmetics are transferred in different manners; (3) localized: different cosmetics are applied to different facial regions; (4) natural-looking results without obvious artifacts; (5) controllable makeup lightness: various results from light to heavy makeup can be generated. Qualitative and quantitative experiments show that our network performs much better than the method of [Guo and Sim, 2009] and two variants of NeuralStyle [Gatys et al., 2015a].

IJCAI Conference 2016 Conference Paper

Modularity Based Community Detection with Deep Learning

  • Liang Yang
  • Xiaochun Cao
  • Dongxiao He
  • Chuan Wang
  • Xiao Wang
  • Weixiong Zhang

Identification of module or community structures is important for characterizing and understanding complex systems. While designed with different objectives, i.e., stochastic models for regeneration and modularity maximization models for discrimination, both types of models look for a low-rank embedding to best represent and reconstruct the network topology. However, the mapping through such an embedding is linear, whereas real networks have various nonlinear features, making these models less effective in practice. Inspired by the strong representation power of deep neural networks, we propose a novel nonlinear reconstruction method by adopting deep neural networks for representation. We then extend the method to a semi-supervised community detection algorithm by incorporating pairwise constraints among graph nodes. Extensive experimental results on synthetic and real networks show that the new methods are effective, outperforming most state-of-the-art methods for community detection.
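
In the modularity-maximization case, the quantity being reconstructed is Newman's modularity matrix, which is a one-liner to compute (the nonlinear autoencoder that reconstructs it is omitted here):

```python
import numpy as np

def modularity_matrix(A):
    # B = A - k k^T / (2m), where k holds node degrees and 2m = total degree.
    k = A.sum(axis=1)
    return A - np.outer(k, k) / k.sum()
```

A quick sanity property: every row of B sums to zero by construction.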

AAAI Conference 2016 Conference Paper

Semantic Community Identification in Large Attribute Networks

  • Xiao Wang
  • Di Jin
  • Xiaochun Cao
  • Liang Yang
  • Weixiong Zhang

Identification of modular or community structures of a network is a key to understanding the semantics and functions of the network. While many network community detection methods have been developed, which primarily explore network topologies, they provide little semantic information about the communities discovered. Although structures and semantics are closely related, little effort has been made to discover and analyze these two essential network properties together. By integrating network topology and semantic information on nodes, e.g., node attributes, we study the problems of detecting communities and inferring their semantics simultaneously. We propose a novel nonnegative matrix factorization (NMF) model with two sets of parameters, the community membership matrix and the community attribute matrix, and present efficient updating rules to evaluate the parameters with a convergence guarantee. The use of node attributes improves community detection and provides a semantic interpretation of the resultant network communities. Extensive experimental results on synthetic and real-world networks not only show the superior performance of the new method over the state-of-the-art approaches, but also demonstrate its ability to semantically annotate the communities.
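
As a generic baseline for the factorization involved, plain multiplicative-update NMF looks like this (the paper's model couples a membership matrix with an attribute matrix and uses different update rules; this sketch shows only the vanilla V ≈ WH updates):

```python
import numpy as np

def nmf(V, r, iters=300, seed=0):
    # Lee-Seung multiplicative updates for the Frobenius objective ||V - WH||^2.
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + 1e-3
    H = rng.random((r, V.shape[1])) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H
```

In the community-detection reading, each row of W gives a node's (nonnegative) community memberships.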

IJCAI Conference 2013 Conference Paper

Robust Tensor Clustering with Non-Greedy Maximization

  • Xiaochun Cao
  • Xingxing Wei
  • Yahong Han
  • Yi Yang
  • Dongdai Lin

Tensors are increasingly common in several areas such as data mining, computer graphics, and computer vision. Tensor clustering is a fundamental tool for data analysis and pattern discovery. However, there usually exist outlying data points in real-world datasets, which reduce the performance of clustering. This motivates us to develop a tensor clustering algorithm that is robust to outliers. In this paper, we propose an algorithm for Robust Tensor Clustering (RTC). RTC first finds a lower-rank approximation of the original tensor data using an L1-norm optimization function. Because the L1 norm does not exaggerate the effect of outliers compared with the L2 norm, minimizing the L1-norm approximation function makes RTC robust to outliers. We then compute the HOSVD decomposition of this approximate tensor to obtain the final clustering results. Unlike traditional algorithms that solve the approximation function with a greedy strategy, we utilize a non-greedy strategy to obtain a better solution. Experiments demonstrate that RTC performs better than the state-of-the-art algorithms and is more robust to outliers.
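
The final HOSVD step can be sketched with plain numpy (generic HOSVD via mode-wise SVDs; the robust L1 approximation stage and the cluster read-out are omitted):

```python
import numpy as np

def hosvd(T):
    # Factor matrix per mode = left singular vectors of that mode's unfolding.
    factors = []
    for n in range(T.ndim):
        unfolding = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U)
    # Core tensor: multiply every mode by the transposed factor matrix.
    core = T
    for n, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, n, 0), axes=1), 0, n)
    return core, factors
```

Multiplying the core back by the (orthogonal) factors reconstructs the tensor, which the test below verifies.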