Arrow Research search

Author name cluster

Yang Cao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

49 papers
2 author rows

Possible papers (49)

AAAI Conference 2026 Conference Paper

Differentially Private Subspace Fine-Tuning for Large Language Models

  • Lele Zheng
  • Xiang Wang
  • Tao Zhang
  • Yang Cao
  • Ke Cheng
  • Yulong Shen

Fine-tuning large language models on downstream tasks is crucial for realizing their cross-domain potential but often relies on sensitive data, raising privacy concerns. Differential privacy (DP) offers rigorous privacy guarantees and has been widely adopted in fine-tuning; however, naively injecting noise across the high-dimensional parameter space creates perturbations with large norms, degrading performance and destabilizing training. To address this issue, we propose DP-SFT, a two-stage subspace fine-tuning method that substantially reduces noise magnitude while preserving formal DP guarantees. Our intuition is that, during fine-tuning, significant parameter updates lie within a low-dimensional, task-specific subspace, while other directions change minimally. Hence, we only inject DP noise into this subspace to protect privacy without perturbing irrelevant parameters. In phase one, we identify the subspace by analyzing principal gradient directions to capture task-specific update signals. In phase two, we project full gradients onto this subspace, add DP noise, and map the perturbed gradients back to the original parameter space for model updates, markedly lowering noise impact. Experiments on multiple datasets demonstrate that DP-SFT enhances accuracy and stability under rigorous DP constraints, accelerates convergence, and achieves substantial gains over DP fine-tuning baselines.
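The project-perturb-lift mechanism described in this abstract can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the orthonormal `basis` stands in for the principal gradient directions identified in phase one, and real DP noise would have to be calibrated to the gradient sensitivity and the (ε, δ) budget.

```python
import random

def project(g, basis):
    """Coefficients of gradient g in an orthonormal subspace basis."""
    return [sum(gi * bi for gi, bi in zip(g, b)) for b in basis]

def lift(coeffs, basis, dim):
    """Map subspace coefficients back to the full parameter space."""
    out = [0.0] * dim
    for c, b in zip(coeffs, basis):
        for i in range(dim):
            out[i] += c * b[i]
    return out

def dp_subspace_gradient(g, basis, sigma, rng):
    """Perturb only the low-dimensional coefficients, then lift back."""
    noisy = [c + rng.gauss(0.0, sigma) for c in project(g, basis)]
    return lift(noisy, basis, len(g))

# Toy 2-D subspace of a 4-D parameter space.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
g = [0.5, -0.3, 0.9, 0.1]
noisy_g = dp_subspace_gradient(g, basis, sigma=0.1, rng=random.Random(0))
```

Noise is injected into two coefficients instead of four parameters here, which is the abstract's point: the perturbation norm scales with the subspace dimension rather than the full model size.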

AAAI Conference 2026 Conference Paper

E-MaT: Event-oriented Mamba for Egocentric Point Tracking

  • Han Han
  • Wei Zhai
  • Baocai Yin
  • Yang Cao
  • Bin Li
  • Zheng-Jun Zha

Egocentric point tracking aims to localize points on object surfaces from a first-person perspective and serves as a critical step toward embodied intelligence. Recent methods rely on video input, tracking query points through feature matching across consecutive frames. However, these methods struggle in highly dynamic settings—a common challenge in first-person perspectives, where the head-mounted camera undergoes frequent and abrupt rotations, resulting in high angular velocities, motion blur, and large inter-frame displacements. In contrast, event cameras capture motion at microsecond temporal resolution, naturally avoiding blur and delivering low-latency, high-fidelity cues crucial for egocentric point tracking. Moreover, rapid egocentric motion disrupts local smoothness, breaking the assumption that spatially adjacent regions share similar motion. Event dynamics expose global motion trends, guiding coherent modeling and consistent feature flow. Therefore, this paper proposes a Mamba-based tracking framework that constructs feature modeling paths aligned with the dominant motion trend extracted from events, and modulates feature propagation along these paths based on local motion intensity, enhancing stability by suppressing unreliable signals and emphasizing consistent cues. Additionally, a motion-adaptive suppression module enhances temporal robustness by adaptively suppressing correlation features based on motion intensity variations, mitigating the effects of intensity fluctuations and partial observability. To facilitate research in this domain, a multimodal dataset named DVS-EgoPoints with both events and videos for egocentric point tracking is collected. Experiments on the DVS-EgoPoints dataset and a simulation benchmark demonstrate superior performance over state-of-the-art methods, especially under challenging motion and occlusion conditions.

AAAI Conference 2026 Conference Paper

IdeFN: Identifying Unclicked Space False Negatives via Relaxed Partial Optimal Transport for Conversion Rate Prediction

  • Weiyi Zhong
  • Weiming Liu
  • Lianyong Qi
  • Xiaoran Zhao
  • Xiaolong Xu
  • Haolong Xiang
  • Yang Cao
  • Shichao Pei

Accurate conversion rate (CVR) prediction is critical for recommender systems to capture user conversion intent and increase platform revenues. Traditional CVR models commonly suffer from sample selection bias (SSB) and data sparsity (DS), which have led to the adoption of click-through & conversion rate (CTCVR) multi-task learning frameworks to alleviate these issues. However, existing methods implicitly mislabel some unclicked samples with genuine conversion potential as negatives, thereby exacerbating the false negative sample (FNS) problem. To address this, we propose IdeFN, a multi-task CVR framework that identifies false negatives in the unclicked space to enable CVR prediction across the entire exposure space and leverages CTR as an auxiliary task for shared-parameter learning. Specifically, IdeFN consists of two main components, i.e., a relaxed partial optimal transport (RPOT) module and a sample relabeling mechanism (SRM). The former estimates the soft matching strengths between unclicked samples and positive samples under a relaxed partial optimal transport formulation, establishing corresponding relationships between these samples. The latter adaptively re-labels the unclicked samples according to the derived matching strengths, without relying on static or heuristic thresholds, thus enhancing the reliability of the generated pseudo-labels. Experimental results demonstrate that IdeFN effectively mitigates the FNS problem, achieving substantial improvements in CVR prediction accuracy.
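To make the "soft matching strengths" idea concrete, here is a plain entropic (Sinkhorn) optimal-transport sketch between unclicked and positive embeddings. Note that this is the classical balanced formulation, not the paper's relaxed partial OT, which relaxes the marginal constraints; the function names and toy embeddings are invented for illustration.

```python
import math

def sinkhorn_plan(cost, reg=0.1, iters=300):
    """Entropic OT with uniform marginals; returns the coupling matrix."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy embeddings: two unclicked samples vs. two known positives.
unclicked = [[0.0, 0.0], [3.0, 3.0]]
positives = [[0.1, 0.0], [3.0, 2.9]]
cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in positives]
        for x in unclicked]
plan = sinkhorn_plan(cost)
# Matching strength of each unclicked sample: its strongest coupling,
# rescaled so that a perfectly concentrated row scores 1.0.
strength = [max(row) * len(plan) for row in plan]
```

A relabeling mechanism in the spirit of SRM would then weight pseudo-positive labels by these strengths directly, rather than thresholding them.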

AAAI Conference 2026 Conference Paper

Oblivionis: A Lightweight Learning and Unlearning Framework for Federated Large Language Models

  • Fuyao Zhang
  • Xinyu Yan
  • Tiantong Wu
  • Wenjie Li
  • Tianxiang Chen
  • Yang Cao
  • Ran Yan
  • Longtao Huang

Large Language Models (LLMs) increasingly leverage Federated Learning (FL) to utilize private, task-specific datasets for fine-tuning while preserving data privacy. However, while federated LLM frameworks effectively enable collaborative training without raw data sharing, they critically lack built-in mechanisms for regulatory compliance like GDPR’s right to be forgotten. Integrating private data heightens concerns over data quality and long-term governance, yet existing distributed training frameworks offer no principled way to selectively remove specific client contributions post-training. Due to distributed data silos, stringent privacy constraints, and the intricacies of interdependent model aggregation, federated LLM unlearning is significantly more complex than centralized LLM unlearning. To address this gap, we introduce Oblivionis, a lightweight learning and unlearning framework that enables clients to selectively remove specific private data during federated LLM training, enhancing trustworthiness and regulatory compliance. By unifying FL and unlearning as a dual optimization objective, we incorporate 6 FL and 5 unlearning algorithms for comprehensive evaluation and comparative analysis, establishing a robust pipeline for federated LLM unlearning. Extensive experiments demonstrate that Oblivionis outperforms local training, achieving a robust balance between forgetting efficacy and model utility, with cross-algorithm comparisons providing clear directions for future LLM development.

AAAI Conference 2026 Conference Paper

Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

  • Wenbo Huang
  • Jinghui Zhang
  • Zhenghao Chen
  • Guang Li
  • Lei Zhang
  • Yang Cao
  • Fang Dong
  • Takahiro Ogawa

Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interactions between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relations degraded by frames with similar backgrounds are difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing temporal relations to be better reconstructed. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.

AAAI Conference 2026 Conference Paper

Privacy on the Fly: A Predictive Adversarial Transformation Network for Mobile Sensor Data

  • Tianle Song
  • Chenhao Lin
  • Yang Cao
  • Zhengyu Zhao
  • Jiahao Sun
  • Chong Zhang
  • Le Yang
  • Chao Shen

Mobile motion sensors such as accelerometers and gyroscopes are now ubiquitously accessible by third-party apps via standard APIs. While enabling rich functionalities like activity recognition and step counting, this openness has also enabled unregulated inference of sensitive user traits, such as gender, age, and even identity, without user consent. Existing privacy-preserving techniques, such as GAN-based obfuscation or differential privacy, typically require access to the full input sequence, introducing latency that is incompatible with real-time scenarios. Worse, they tend to distort temporal and semantic patterns, degrading the utility of the data for benign tasks like activity recognition. To address these limitations, we propose the Predictive Adversarial Transformation Network (PATN), a real-time privacy-preserving framework that leverages historical signals to generate adversarial perturbations proactively. The perturbations are applied immediately upon data acquisition, enabling continuous protection without disrupting application functionality. Experiments on two datasets demonstrate that PATN substantially degrades the performance of privacy inference models, achieving Attack Success Rates (ASR) of 40.11% and 44.65% (reducing inference accuracy to near-random) and increasing the Equal Error Rate (EER) from 8.30% and 7.56% to 41.65% and 46.22%. On ASR, PATN outperforms baseline methods by 16.16% and 31.96%, respectively.

AAAI Conference 2026 Conference Paper

Retrieval-driven Reasoning for Deliberative Visual Classification

  • Jianye Xie
  • Lianyong Qi
  • Fan Wang
  • Anqi Wang
  • Wenjuan Gong
  • Danxin Wang
  • Wanchun Dou
  • Yang Cao

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in visual classification tasks. Existing methods for enhancing VLMs on this task often rely heavily on direct category-to-image matching, which limits generalization and results in suboptimal performance. In addition, these methods provide no understanding of why a specific category is chosen. To address these limitations, we introduce a new deliberative visual classification task that decomposes the classification process into multiple deliberative steps and leverages Large Language Models (LLMs) to perform explicit reasoning before the final decision. Specifically, we propose a Retrieval-driven Reasoning model (RdR) with two components, i.e., retrieval database construction and deliberative category prediction. The first component leverages LLMs to extract category-relevant descriptors and constructs a retrieval database for effective image–descriptor matching. The second component facilitates multiple deliberative steps and performs explicit reasoning based on the retrieved descriptors to augment the category prediction. Extensive experiments on multiple datasets demonstrate that RdR consistently outperforms strong baselines, highlighting its robustness and generalization ability.

NeurIPS Conference 2025 Conference Paper

AegisGuard: RL-Guided Adapter Tuning for TEE-Based Efficient & Secure On-Device Inference

  • Che Wang
  • Ziqi Zhang
  • Yinggui Wang
  • Tiantong Wang
  • Yurong Hao
  • Jianbo Gao
  • Tao Wei
  • Yang Cao

On-device large models (LMs) reduce cloud dependency but expose proprietary model weights to the end-user, making them vulnerable to white-box model stealing (MS) attacks. A common defense is TEE-Shielded DNN Partition (TSDP), which places all trainable LoRA adapters (fine-tuned on private data) inside a trusted execution environment (TEE). However, this design suffers from excessive host-to-TEE communication latency. We propose AegisGuard, a fine-tuning and deployment framework that selectively shields the MS-sensitive adapters while offloading the rest to the GPU, balancing security and efficiency. AegisGuard integrates two key components: (i) RL-based Sensitivity Measurement (RSM), which injects Gaussian noise during training and applies lightweight reinforcement learning to rank adapters based on their impact on model stealing; and (ii) Shielded-Adapter Compression (SAC), which structurally prunes the selected adapters to reduce both parameter size and intermediate feature maps, further lowering TEE computation and data transfer costs. Extensive experiments demonstrate that AegisGuard achieves black-box level MS resilience (surrogate accuracy around 39%, matching fully shielded baselines), while reducing end-to-end inference latency by 2–3× and cutting TEE memory usage by 4× compared to state-of-the-art TSDP methods.

IJCAI Conference 2025 Conference Paper

Balancing User-Item Structure and Interaction with Large Language Models and Optimal Transport for Multimedia Recommendation

  • Haodong Li
  • Lianyong Qi
  • Weiming Liu
  • Xiaolong Xu
  • Wanchun Dou
  • Yang Cao
  • Xuyun Zhang
  • Amin Beheshti

The rapid growth of multimedia content has driven the development of recommender systems. Most previous work focuses on uncovering latent relationships among items to learn better representations. However, this approach does not sufficiently account for user affinities, potentially leading to an imbalance in the structure modeling of users and items. Moreover, the sparsity and imbalance of user-item interactions further hinder effective representation learning. To address these challenges, we propose a framework called BLAST, which balances structures and interactions via large language models and optimal transport for multimodal recommendation. Specifically, we utilize large language models to summarize side information and generate user profiles. Based on these profiles, we design an intra- and inter-entity structure balancing module to capture item-item and user-user relationships, integrating these affinities into the final representations. Furthermore, we impose constraints on negative sample selection and augment the training data with false negative items via the optimal transport algorithm, thereby leading to smoother interactions. We evaluate BLAST on three real-world datasets, and the results demonstrate that our method significantly outperforms state-of-the-art baselines, which validates the superiority and effectiveness of BLAST.

AAAI Conference 2025 Conference Paper

Boosting Image De-Raining via Central-Surrounding Synergistic Convolution

  • Long Peng
  • Yang Wang
  • Xin Di
  • Peizhe Xia
  • Xueyang Fu
  • Yang Cao
  • Zheng-Jun Zha

Rainy images suffer from quality degradation due to the synergistic effect of rain streaks and accumulation. The rain streaks are anisotropic and show a specific directional arrangement, while the rain accumulation is isotropic and shows a consistent concentration distribution in local regions. This distribution difference makes unified representation learning for rain streaks and accumulation challenging, which may lead to structure distortion and contrast degradation in the deraining results. To address this problem, a Synergistic Convolution (SC), inspired by the central-surrounding mechanism, is proposed to extract rain streak and accumulation features simultaneously. Specifically, the SC consists of two novel parallel convolutions: Central-Surrounding Difference Convolution (CSD) and Central-Surrounding Addition Convolution (CSA). In CSD, the difference operation between central and surrounding pixels is injected into the feature extraction process of convolution to perceive the directional distribution of rain streaks. In CSA, the addition operation between central and surrounding pixels is injected into the feature extraction process of convolution to facilitate the modeling of rain accumulation properties. The SC can be used as a general unit to substitute Vanilla Convolution (VC) in current de-raining networks to boost performance. To reduce computational costs, CSA and CSD in SC are merged into a single VC kernel by our parameter equivalent transformation before inference. Evaluations of twelve de-raining methods on nine public datasets demonstrate that our proposed SC comprehensively improves their performance under various rainy conditions without changing the original network structure or introducing extra computational costs. Even for the current SOTA methods, SC can further achieve SOTA++ performance. The source codes will be publicly available.
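The parameter equivalent transformation mentioned in this abstract follows from the linearity of convolution: a difference branch and an addition branch over the same 3×3 window can each be rewritten as an ordinary kernel, and the two kernels then add into one. A single-channel sketch (the exact weight layouts of CSD/CSA are this sketch's assumption, not the paper's learned branches):

```python
def conv2d_valid(img, k):
    """3x3 valid cross-correlation on a 2-D list-of-lists image."""
    out = []
    for i in range(len(img) - 2):
        out.append([sum(k[a][b] * img[i + a][j + b]
                        for a in range(3) for b in range(3))
                    for j in range(len(img[0]) - 2)])
    return out

def csd_kernel(w):
    """Difference branch sum_s w_s*(center - surround) as one kernel (the
    center entry of w is unused)."""
    k = [[-w[a][b] for b in range(3)] for a in range(3)]
    k[1][1] = sum(w[a][b] for a in range(3) for b in range(3) if (a, b) != (1, 1))
    return k

def csa_kernel(w):
    """Addition branch sum_s w_s*(center + surround) as one kernel."""
    k = [[w[a][b] for b in range(3)] for a in range(3)]
    k[1][1] = sum(w[a][b] for a in range(3) for b in range(3) if (a, b) != (1, 1))
    return k

def merge(k1, k2):
    """Fold both branch kernels into a single vanilla-convolution kernel."""
    return [[k1[a][b] + k2[a][b] for b in range(3)] for a in range(3)]

w_d = [[0.1, 0.2, 0.3], [0.4, 0.0, 0.5], [0.6, 0.7, 0.8]]
w_a = [[0.5, -0.1, 0.2], [0.3, 0.0, -0.4], [0.1, 0.2, 0.3]]
img = [[float(3 * i + j) for j in range(4)] for i in range(4)]
branch_sum = [[x + y for x, y in zip(r1, r2)]
              for r1, r2 in zip(conv2d_valid(img, csd_kernel(w_d)),
                                conv2d_valid(img, csa_kernel(w_a)))]
merged_out = conv2d_valid(img, merge(csd_kernel(w_d), csa_kernel(w_a)))
```

Since the merged kernel produces the same output as running both branches and adding, inference pays for one convolution instead of two.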

IJCAI Conference 2025 Conference Paper

Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration

  • Long Peng
  • Xin Di
  • ZhanFeng Feng
  • Wenbo Li
  • Renjing Pei
  • Yang Wang
  • Xueyang Fu
  • Yang Cao

Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.

NeurIPS Conference 2025 Conference Paper

EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting

  • Bohao Liao
  • Wei Zhai
  • Zengyu Wan
  • Zhixin Cheng
  • Wenfei Yang
  • Yang Cao
  • Tianzhu Zhang
  • Zheng-Jun Zha

Scene reconstruction from casually captured videos has wide real-world applications. Despite recent progress, existing methods relying on traditional cameras tend to fail in high-speed scenarios due to insufficient observations and inaccurate pose estimation. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution and low latency, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce event cameras to aid scene reconstruction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, enabling continuous supervision between discrete frames. Second, we extract motion information through Contrast Maximization (CMax) of warped events, which calibrates camera poses and provides gradient-domain constraints for 3DGS. Third, to address the absence of color information in events, we combine photometric bundle adjustment (PBA) with a Fixed-GS training strategy that separates structure and color optimization, effectively ensuring color consistency across different views. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our method achieves up to 3 dB higher PSNR and 40% lower Absolute Trajectory Error (ATE) compared to state-of-the-art methods under challenging high-speed scenarios.

NeurIPS Conference 2025 Conference Paper

Efficient $k$-Sparse Band-Limited Interpolation with Improved Approximation Ratio

  • Yang Cao
  • Xiaoyu Li
  • Zhao Song
  • Chiwun Yang

We consider the task of interpolating a $k$-sparse band-limited signal from a small collection of noisy time-domain samples. Exploiting a new analytic framework for hierarchical frequency decomposition that performs systematic noise cancellation, we give the first polynomial-time algorithm with a provable $(3+\sqrt{2}+\epsilon)$-approximation guarantee for continuous interpolation. Our method breaks the long-standing $C > 100$ barrier set by the best previous algorithms, sharply reducing the gap to optimal recovery and establishing a new state of the art for high-accuracy band-limited interpolation. We also give a refined ``shrinking-range'' variant that achieves a $(\sqrt{2}+\epsilon+c)$-approximation on any sub-interval $(1-c)T$ for some $c \in (0, 1)$, which gives even higher interpolation accuracy.
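For orientation, the continuous model behind this line of work is usually stated as follows (standard notation from the Fourier-sparse interpolation literature; the error norm below is this note's assumption, not a quote of the paper's theorem):

$$x^*(t) = \sum_{j=1}^{k} v_j e^{2\pi i f_j t}, \qquad y(t) = x^*(t) + g(t), \quad t \in [0, T],$$

and a $C$-approximate interpolant $\widehat{x}$ satisfies $\|\widehat{x} - x^*\|_T \le C\,\|g\|_T$, where $\|h\|_T^2 = \frac{1}{T}\int_0^T |h(t)|^2\,dt$. In these terms, the abstract's contribution is lowering the achievable constant from $C > 100$ to $C = 3+\sqrt{2}+\epsilon$.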

NeurIPS Conference 2025 Conference Paper

Faster Algorithms for Structured John Ellipsoid Computation

  • Yang Cao
  • Xiaoyu Li
  • Zhao Song
  • Xin Yang
  • Tianyi Zhou

The famous theorem of Fritz John states that any convex body has a unique maximal volume inscribed ellipsoid, known as the John Ellipsoid. Computing the John Ellipsoid is a fundamental problem in convex optimization. In this paper, we focus on approximating the John Ellipsoid inscribed in a convex and centrally symmetric polytope defined by $P := \{ x \in \mathbb{R}^d : -\mathbf{1}_n \leq A x \leq \mathbf{1}_n \},$ where $A \in \mathbb{R}^{n \times d}$ is a rank-$d$ matrix and $\mathbf{1}_n \in \mathbb{R}^n$ is the all-ones vector. We develop two efficient algorithms for approximating the John Ellipsoid. The first is a sketching-based algorithm that runs in nearly input-sparsity time $\widetilde{O}(\mathrm{nnz}(A) + d^\omega)$, where $\mathrm{nnz}(A)$ denotes the number of nonzero entries in the matrix $A$ and $\omega \approx 2.37$ is the current matrix multiplication exponent. The second is a treewidth-based algorithm that runs in time $\widetilde{O}(n \tau^2)$, where $\tau$ is the treewidth of the dual graph of the matrix $A$. Our algorithms significantly improve upon the state-of-the-art running time of $\widetilde{O}(n d^2)$ achieved by [Cohen, Cousins, Lee, and Yang, COLT 2019].
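The classical route to such approximations is a fixed-point iteration on row weights, w_i ← w_i · a_i^T (A^T W A)^{-1} a_i, whose fixed point yields the John ellipsoid {x : x^T (A^T W A) x ≤ 1}. Below is a toy version hard-coded for d = 2; it illustrates only the basic iteration, not the paper's sketching or treewidth accelerations.

```python
def john_weights(A, iters=50):
    """Fixed-point iteration w_i <- w_i * a_i^T (A^T W A)^{-1} a_i, d = 2."""
    n = len(A)
    w = [2.0 / n] * n  # uniform start, summing to d = 2
    for _ in range(iters):
        # M = A^T diag(w) A, a 2x2 positive-definite matrix.
        m00 = sum(wi * a[0] * a[0] for wi, a in zip(w, A))
        m01 = sum(wi * a[0] * a[1] for wi, a in zip(w, A))
        m11 = sum(wi * a[1] * a[1] for wi, a in zip(w, A))
        det = m00 * m11 - m01 * m01
        i00, i01, i11 = m11 / det, -m01 / det, m00 / det
        # Multiply each weight by its leverage-like score a_i^T M^{-1} a_i.
        w = [wi * (a[0] * (i00 * a[0] + i01 * a[1])
                   + a[1] * (i01 * a[0] + i11 * a[1]))
             for wi, a in zip(w, A)]
    return w

# Constraints -1 <= A x <= 1: the unit square cut by two diagonal slabs.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w = john_weights(A)
```

A trace identity keeps the weights summing to d at every step; the per-iteration linear algebra is exactly what the paper's nnz-time and treewidth-based algorithms accelerate.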

AAAI Conference 2025 Conference Paper

Federated Graph Condensation with Information Bottleneck Principles

  • Bo Yan
  • Sihao He
  • Cheng Yang
  • Shang Liu
  • Yang Cao
  • Chuan Shi

Graph condensation (GC), which reduces the size of a large-scale graph by synthesizing a small-scale condensed graph as its substitution, has benefited various graph learning tasks. However, existing GC methods rely on centralized data storage, which is unfeasible for real-world decentralized data distribution, and overlook data holders' privacy-preserving requirements. To bridge this gap, we propose and study the novel problem of federated graph condensation (FGC) for graph neural networks (GNNs). Specifically, we first propose a general framework for FGC, where we decouple the typical gradient matching process for GC into client-side gradient calculation and server-side gradient matching, integrating knowledge from multiple clients' subgraphs into one smaller condensed graph. Nevertheless, our empirical studies show that under the federated setting, the condensed graph will consistently leak data membership privacy, i.e., the condensed graph during federated training can be utilized to steal training data under the membership inference attack (MIA). To tackle this issue, we innovatively incorporate information bottleneck principles into the FGC, which only needs to extract partial node features in one local pre-training step and utilize the features during federated training. Theoretical and experimental analyses demonstrate that our framework consistently protects membership privacy during training. Meanwhile, it can achieve comparable and even superior performance against existing centralized GC and federated graph learning (FGL) methods.

AAAI Conference 2025 Conference Paper

Hierarchical Gradient-Based Genetic Sampling for Accurate Prediction of Biological Oscillations

  • Heng Rao
  • Yu Gu
  • Jason Zipeng Zhang
  • Ge Yu
  • Yang Cao
  • Minghan Chen

Biological oscillations are periodic changes in various signaling processes crucial for the proper functioning of living organisms. These oscillations are modeled by ordinary differential equations, with coefficient variations leading to diverse periodic behaviors, typically measured by oscillatory frequencies. This paper explores sampling techniques for neural networks to model the relationship between system coefficients and oscillatory frequency. However, the scarcity of oscillations in the vast coefficient space results in many samples exhibiting non-periodic behaviors, and small coefficient changes near oscillation boundaries can significantly alter oscillatory properties. This leads to non-oscillatory bias and boundary sensitivity, making accurate predictions difficult. While existing importance and uncertainty sampling approaches partially mitigate these challenges, they either fail to resolve the sensitivity problem or result in redundant sampling. To address these limitations, we propose the Hierarchical Gradient-based Genetic Sampling (HGGS) framework, which improves the accuracy of neural network predictions for biological oscillations. The first layer, Gradient-based Filtering, extracts sensitive oscillation boundaries and removes redundant non-oscillatory samples, creating a balanced coarse dataset. The second layer, Multi-grid Genetic Sampling, utilizes residual information to refine these boundaries and explore new high-residual regions, increasing data diversity for model training. Experimental results demonstrate that HGGS outperforms seven comparative sampling methods across four biological systems, highlighting its effectiveness in enhancing sampling and prediction accuracy.
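The first HGGS layer can be caricatured with a finite-difference sensitivity filter; this is a toy stand-in, since the real method works on ODE coefficient spaces and additionally balances the retained non-oscillatory samples, and all names here are illustrative.

```python
def sensitivity(f, x, eps=1e-3):
    """Largest finite-difference partial derivative magnitude of f at x."""
    fx = f(x)
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        grads.append(abs(f(xp) - fx) / eps)
    return max(grads)

def gradient_filter(samples, f, keep_frac=0.5):
    """Rank coefficient samples by boundary sensitivity and keep only the
    most sensitive fraction."""
    ranked = sorted(samples, key=lambda x: sensitivity(f, x), reverse=True)
    return ranked[: max(1, int(keep_frac * len(samples)))]

# Toy 'oscillation frequency': zero below the boundary x = 1, one above,
# so only the sample sitting near the boundary has nonzero sensitivity.
def freq(x):
    return 1.0 if x[0] > 1.0 else 0.0

samples = [[0.5], [0.9995], [2.0], [1.5]]
kept = gradient_filter(samples, freq)
```

Samples near the oscillation boundary dominate the kept set, which is the abstract's point: those are exactly the regions where small coefficient changes flip the system's behavior.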

IJCAI Conference 2025 Conference Paper

INFP: INdustrial Video Anomaly Detection via Frequency Prioritization

  • Qianzi Yu
  • Kai Zhu
  • Yang Cao
  • Yu Kang

Industrial video anomaly detection aims to perform real-time analysis of video streams from industrial production lines and provide anomaly alerts. Conventional video anomaly detection methods focus more on the overall image, as they aim to identify anomalies among multiple normal samples appearing simultaneously. However, industrial scenarios, where the primary focus is on a single type of product, require attention to local areas to capture fine-grained details and specific patterns. Directly applying conventional methods to industrial scenarios can result in an inability to focus on products moving along fixed trajectories, ineffective utilization of their equidistant periodicity, and greater susceptibility to lighting variations. To address these issues, we propose FreqNet, an encoder-decoder framework that learns frequency-domain features from videos to capture periodic and dynamic characteristics, enhancing the model's robustness. Specifically, a trajectory filter is proposed that takes advantage of the significant difference between moving objects and static backgrounds in the frequency domain by assigning higher weights to fixed moving trajectories. Moreover, a multi-feature fusion module is proposed, in which the frequency domain features of the video are first extracted to leverage the unique equidistant periodicity information of videos from industrial production lines. The extracted frequency domain features are subsequently fused with spatio-temporal features, and contextual information is further integrated from the fused representation, effectively mitigating the impact of lighting variations on production lines. Extensive experiments on the benchmark IPAD dataset demonstrate the superiority of our proposed method over the state-of-the-art.

IJCAI Conference 2025 Conference Paper

MMGIA: Gradient Inversion Attack Against Multimodal Federated Learning via Intermodal Correlation

  • Lele Zheng
  • Yang Cao
  • Leo Yu Zhang
  • Wei Wang
  • Yulong Shen
  • Xiaochun Cao

Multimodal federated learning (MMFL) enables collaborative model training across multiple modalities, such as images and text, without requiring direct data sharing. However, the inherent correlations between modalities introduce new privacy vulnerabilities, making MMFL more susceptible to gradient inversion attacks. In this work, we propose MMGIA, an intermodal correlation-driven gradient inversion attack that systematically exploits multimodal correlation to enhance data reconstruction quality. MMGIA consists of a two-stage optimization framework: the first stage independently reconstructs each modality using traditional gradient inversion techniques, while the second stage refines these reconstructions through pre-trained feature extractors to align modalities in a shared latent space. To further improve reconstruction accuracy, we introduce a quality-weighted fusion strategy, which dynamically integrates multimodal embeddings into a global fused representation that serves as a guiding signal for refining each modality’s reconstruction. This ensures that high-quality reconstructions contribute more to the optimization process, preventing degradation in well-reconstructed modalities while enhancing weaker ones. We conduct extensive experiments on multiple multimodal scenarios, demonstrating that MMGIA outperforms both the only existing multimodal attack and state-of-the-art single-modal attacks, revealing the heightened privacy risks in MMFL.

NeurIPS Conference 2025 Conference Paper

PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement

  • ZhanFeng Feng
  • Long Peng
  • Xin Di
  • Yong Guo
  • Wenbo Li
  • Yulun Zhang
  • Renjing Pei
  • Yang Wang

Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models' capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks and benchmarks. The code will be made publicly available.
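As one concrete piece of the pipeline above, BMFQ's percentile-based initialization of clipping bounds can be sketched as below, paired with a plain symmetric fake-quantizer. The iterative search with pruning and backtracking is omitted, and all names and the toy activation list are illustrative.

```python
def percentile_clip(values, pct=0.99):
    """Symmetric clipping bound at the given percentile of |values|."""
    mags = sorted(abs(v) for v in values)
    return mags[min(len(mags) - 1, int(pct * (len(mags) - 1)))]

def fake_quant(values, bound, bits=4):
    """Uniform symmetric quantize-dequantize with 2^(bits-1)-1 levels."""
    levels = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        clipped = max(-bound, min(bound, v))
        out.append(round(clipped / bound * levels) / levels * bound)
    return out

acts = [-1.0, -0.5, 0.0, 0.3, 0.5, 10.0]  # one large outlier
bound = percentile_clip(acts)              # the outlier does not blow up the range
deq = fake_quant(acts, bound)
```

Clipping at a percentile rather than the maximum keeps the quantization step small in the presence of activation outliers, which is why it makes a robust starting point for the backtracking search.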

NeurIPS Conference 2025 Conference Paper

Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

  • Fan Wang
  • Pengtao Shao
  • Yiming Zhang
  • Bo Yu
  • Shaoshan Liu
  • Ning Ding
  • Yang Cao
  • Yu Kang

In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose AnyMDP, a family of procedurally generated tabular Markov Decision Processes. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.

NeurIPS Conference 2025 Conference Paper

ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models

  • Zixun Fang
  • Kai Zhu
  • Zhiheng Liu
  • Yu Liu
  • Wei Zhai
  • Yang Cao
  • Zheng-Jun Zha

Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.

AAAI Conference 2024 Conference Paper

A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility

  • E Chen
  • Yang Cao
  • Yifei Ge

The shuffle model of local differential privacy is an advanced method of privacy amplification designed to enhance privacy protection with high utility. It achieves this by randomly shuffling sensitive data, making it more challenging to link individual data points to specific individuals. However, most existing studies have focused on the shuffle model based on (ε0,0)-Locally Differentially Private (LDP) randomizers, with limited consideration for complex scenarios such as (ε0,δ0)-LDP or personalized LDP (PLDP). This hinders a comprehensive understanding of the shuffle model's potential and limits its application in various settings. To bridge this research gap, we propose a generalized shuffle framework that can be applied to the PLDP setting. This generalization allows for a broader exploration of the privacy-utility trade-off and facilitates the design of privacy-preserving analyses in diverse contexts. We prove that the shuffled PLDP process approximately preserves μ-Gaussian Differential Privacy with μ = O(1/√n). To strengthen the privacy guarantee, we improve the lower bound by utilizing hypothesis testing instead of relying on rough estimations like the Chernoff bound or Hoeffding's inequality; this allows us to avoid the limitations and potential inaccuracies associated with such inequality estimations. Furthermore, extensive comparative evaluations clearly show that our approach outperforms existing methods in achieving strong central privacy guarantees while preserving the utility of the global model. We have also carefully designed corresponding algorithms for the average function, frequency estimation, and stochastic gradient descent.

JAIR Journal 2024 Journal Article

Detecting Change Intervals with Isolation Distributional Kernel

  • Yang Cao
  • Ye Zhu
  • Kai Ming Ting
  • Flora D. Salim
  • Hong Xian Li
  • Luxing Yang
  • Gang Li

Detecting abrupt changes in data distribution is one of the most significant tasks in streaming data analysis. Although many unsupervised Change-Point Detection (CPD) methods have been proposed recently to identify those changes, they still suffer from missing subtle changes, poor scalability, or/and sensitivity to outliers. To meet these challenges, we are the first to generalise the CPD problem as a special case of the Change-Interval Detection (CID) problem. Then we propose a CID method, named iCID, based on a recent Isolation Distributional Kernel (IDK). iCID identifies the change interval if there is a high dissimilarity score between two non-homogeneous temporal adjacent intervals. The data-dependent property and finite feature map of IDK enabled iCID to efficiently identify various types of change-points in data streams with the tolerance of outliers. Moreover, the proposed online and offline versions of iCID have the ability to optimise key parameter settings. The effectiveness and efficiency of iCID have been systematically verified on both synthetic and real-world datasets.

IJCAI Conference 2024 Conference Paper

Detecting Change Intervals with Isolation Distributional Kernel (Abstract Reprint)

  • Yang Cao
  • Ye Zhu
  • Kai Ming Ting
  • Flora D. Salim
  • Hong Xian Li
  • Luxing Yang
  • Gang Li

Detecting abrupt changes in data distribution is one of the most significant tasks in streaming data analysis. Although many unsupervised Change-Point Detection (CPD) methods have been proposed recently to identify those changes, they still suffer from missing subtle changes, poor scalability, or/and sensitivity to outliers. To meet these challenges, we are the first to generalise the CPD problem as a special case of the Change-Interval Detection (CID) problem. Then we propose a CID method, named iCID, based on a recent Isolation Distributional Kernel (IDK). iCID identifies the change interval if there is a high dissimilarity score between two non-homogeneous temporal adjacent intervals. The data-dependent property and finite feature map of IDK enabled iCID to efficiently identify various types of change-points in data streams with the tolerance of outliers. Moreover, the proposed online and offline versions of iCID have the ability to optimise key parameter settings. The effectiveness and efficiency of iCID have been systematically verified on both synthetic and real-world datasets.

NeurIPS Conference 2024 Conference Paper

EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views

  • Yuhang Yang
  • Wei Zhai
  • Chengfeng Wang
  • Chengjun Yu
  • Yang Cao
  • Zheng-Jun Zha

Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception, facilitating applications like AR/VR and embodied AI. For egocentric HOI, in addition to perceiving semantics, e.g., ''what'' interaction is occurring, capturing ''where'' the interaction specifically manifests in 3D space is also crucial, which links perception and operation. Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view. However, incomplete observations of interacting parties in the egocentric view introduce ambiguity between visual observations and interaction contents, impairing their efficacy. From the egocentric view, humans integrate the visual cortex, cerebellum, and brain to internalize their intentions and interaction concepts of objects, allowing for the pre-formulation of interactions and enacting behaviors even when interaction regions are out of sight. In light of this, we propose harmonizing the visual appearance, head motion, and 3D object to excavate the object interaction concept and subject intention, jointly inferring 3D human contact and object affordance from egocentric videos. To achieve this, we present EgoChoir, which links object structures with interaction contexts inherent in appearance and head motion to reveal object affordance, further utilizing it to model human contact. Additionally, a gradient modulation is employed to adopt appropriate clues for capturing interaction regions across various egocentric scenarios. Moreover, 3D contact and affordance are annotated for egocentric videos collected from Ego-Exo4D and GIMO to support the task. Extensive experiments on them demonstrate the effectiveness and superiority of EgoChoir.

AAAI Conference 2024 Conference Paper

Hypercorrelation Evolution for Video Class-Incremental Learning

  • Sen Liang
  • Kai Zhu
  • Wei Zhai
  • Zhiheng Liu
  • Yang Cao

Video class-incremental learning aims to recognize new actions while restricting the catastrophic forgetting of old ones, whose representative samples can only be saved in limited memory. Semantically variable subactions are susceptible to class confusion due to data imbalance. While existing methods address the problem by estimating and distilling the spatio-temporal knowledge, we further explore how the refinement of hierarchical correlations is crucial for the alignment of spatio-temporal features. To enhance adaptability on evolved actions, we propose a hierarchical aggregation strategy, in which hierarchical matching matrices are combined and jointly optimized to selectively store and retrieve relevant features from previous tasks. Meanwhile, a correlation refinement mechanism is presented to reinforce the bias on informative exemplars according to the online hypercorrelation distribution. Experimental results demonstrate the effectiveness of the proposed method on three standard video class-incremental learning benchmarks, outperforming state-of-the-art methods. Code is available at: https://github.com/Lsen991031/HCE

IJCAI Conference 2024 Conference Paper

Optimal Graph Learning and Nuclear Norm Maximization for Deep Cross-Domain Robust Label Propagation

  • Wei Wang
  • Hanyang Li
  • Ke Shi
  • Chao Huang
  • Yang Cao
  • Cong Wang
  • Xiaochun Cao

Domain adaptation aims to achieve label transfer from a labeled source domain to an unlabeled target domain, where the two domains exhibit different distributions. Existing methods primarily concentrate on designing a feature extractor to learn better domain-invariant features, along with developing an effective classifier for reliable predictions. In this paper, we introduce optimal graph learning to generate a cross-domain graph that effectively connects the two domains, and two domain-specific graphs to capture domain-specific structures. On the one hand, we incorporate the three graphs into the label propagation (LP) classifier to enhance its robustness to distribution difference. On the other hand, we leverage the three graphs to introduce graph embedding losses, promoting the learning of locally discriminative and domain-invariant features. Furthermore, we maximize the nuclear norm of predictions in LP to enhance class diversity, thereby improving its robustness to class imbalance problem. Correspondingly, we develop an efficient algorithm to solve the associated optimization problem. Finally, we integrate the proposed LP and graph embedding losses into a deep neural network, resulting in our proposed deep cross-domain robust LP. Extensive experiments conducted on three cross-domain benchmark datasets demonstrate that our proposed approach could outperform existing state-of-the-art domain adaptation methods.

NeurIPS Conference 2024 Conference Paper

UNIT: Unifying Image and Text Recognition in One Vision Encoder

  • Yi Zhu
  • Yanpeng Zhou
  • Chunwei Wang
  • Yang Cao
  • Jianhua Han
  • Lu Hou
  • Hang Xu

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a novel training framework aimed at UNifying Image and Text recognition within a single model. Starting with a vision encoder pre-trained with image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities. The training process comprises two stages: intra-scale pretraining and inter-scale finetuning. During intra-scale pretraining, UNIT learns unified representations from multi-scale inputs, where images and documents are at their commonly used resolution, to enable fundamental recognition capability. In the inter-scale finetuning stage, the model introduces scale-exchanged data, featuring images and documents at resolutions different from the most commonly used ones, to enhance its scale robustness. Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment. Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e. g. , OCR and DocQA) while maintaining the performances on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities.

NeurIPS Conference 2023 Conference Paper

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection

  • Yang Cao
  • Zeng Yihan
  • Hang Xu
  • Dan Xu

Open-vocabulary 3D Object Detection (OV-3DDet) aims to detect objects from an arbitrary list of categories within a 3D scene, which remains seldom explored in the literature. There are primarily two fundamental problems in OV-3DDet, i.e., localizing and classifying novel objects. This paper aims at addressing the two problems simultaneously via a unified framework, under the condition of limited base categories. To localize novel 3D objects, we propose an effective 3D Novel Object Discovery strategy, which utilizes both the 3D box geometry priors and 2D semantic open-vocabulary priors to generate pseudo box labels of the novel objects. To classify novel object boxes, we further develop a cross-modal alignment module based on discovered novel boxes, to align feature spaces between 3D point cloud and image/text modalities. Specifically, the alignment process contains a class-agnostic and a class-discriminative alignment, incorporating not only the base objects with annotations but also the increasingly discovered novel objects, resulting in an iteratively enhanced alignment. The novel box discovery and cross-modal alignment are jointly learned to collaboratively benefit each other. The novel object discovery can directly impact the cross-modal alignment, while a better feature alignment can, in turn, boost the localization capability, leading to a unified OV-3DDet framework, named CoDA, for simultaneous novel object localization and classification. Extensive experiments on two challenging datasets (i.e., SUN-RGBD and ScanNet) demonstrate the effectiveness of our method and also show a significant mAP improvement upon the best-performing alternative method by 80%. Codes and pre-trained models are released on the project page.

NeurIPS Conference 2023 Conference Paper

Customizable Image Synthesis with Multiple Subjects

  • Zhiheng Liu
  • Yifei Zhang
  • Yujun Shen
  • Kecheng Zheng
  • Kai Zhu
  • Ruili Feng
  • Yu Liu
  • Deli Zhao

Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single-subject customization, existing algorithms suffer from high training cost and a low success rate as the number of subjects increases. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the locations of different subjects in the image. Using the cross-attention map as the intermediary, we could strengthen the signal of target subjects and weaken the signal of irrelevant subjects within a certain region, significantly alleviating the interference across subjects. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

JBHI Journal 2023 Journal Article

Trajectory-Aware Adaptive Imaging Clue Analysis for Guidewire Artifact Removal in Intravascular Optical Coherence Tomography

  • Gongning Luo
  • Xinghua Ma
  • Jinwen Guo
  • Mingye Zou
  • Wei Wang
  • Yang Cao
  • Kuanquan Wang
  • Shuo Li

Guidewire Artifact Removal (GAR) involves restoring missing imaging signals in areas of IntraVascular Optical Coherence Tomography (IVOCT) videos affected by guidewire artifacts. GAR helps overcome imaging defects and minimizes the impact of missing signals on the diagnosis of CardioVascular Diseases (CVDs). To restore the actual vascular and lesion information within the artifact area, we propose a reliable Trajectory-aware Adaptive imaging Clue analysis Network (TAC-Net) that includes two innovative designs: (i) Adaptive clue aggregation, which considers both texture-focused original (ORI) videos and structure-focused relative total variation (RTV) videos, and suppresses texture-structure imbalance with an active weight-adaptation mechanism; (ii) Trajectory-aware Transformer, which uses a novel attention calculation to perceive the attention distribution of artifact trajectories and avoid the interference of irregular and non-uniform artifacts. We provide a detailed formulation for the procedure and evaluation of the GAR task and conduct comprehensive quantitative and qualitative experiments. The experimental results demonstrate that TAC-Net reliably restores the texture and structure of guidewire artifact areas as expected by experienced physicians (e.g., SSIM: 97.23%). We also discuss the value and potential of the GAR task for clinical applications and computer-aided diagnosis of CVDs.

TMLR Journal 2022 Journal Article

Evolving Decomposed Plasticity Rules for Information-Bottlenecked Meta-Learning

  • Fan Wang
  • Hao Tian
  • Haoyi Xiong
  • Hua Wu
  • Jie Fu
  • Yang Cao
  • Yu Kang
  • Haifeng Wang

Artificial neural networks (ANNs) are typically confined to accomplishing pre-defined tasks by learning a set of static parameters. In contrast, biological neural networks (BNNs) can adapt to various new tasks by continually updating the neural connections based on the inputs, which is aligned with the paradigm of learning effective learning rules in addition to static parameters, e.g., meta-learning. Among various biologically inspired learning rules, Hebbian plasticity updates the neural network weights using local signals without the guide of an explicit target function, thus enabling an agent to learn automatically without human efforts. However, typical plastic ANNs using a large amount of meta-parameters violate the nature of the genomics bottleneck and potentially deteriorate the generalization capacity. This work proposes a new learning paradigm decomposing those connection-dependent plasticity rules into neuron-dependent rules, thus accommodating Θ(n^2) learnable parameters with only Θ(n) meta-parameters. We also thoroughly study the effect of different neural modulation on plasticity. Our algorithms are tested in challenging random 2D maze environments, where the agents have to use their past experiences to shape the neural connections and improve their performance in the future. The results of our experiments validate the following: 1. Plasticity can be adopted to continually update a randomly initialized RNN to surpass pre-trained, more sophisticated recurrent models, especially when it comes to long-term memorization. 2. Following the genomics bottleneck, the proposed decomposed plasticity can be comparable to or even more effective than canonical plasticity rules in some instances.

NeurIPS Conference 2022 Conference Paper

Exploring Figure-Ground Assignment Mechanism in Perceptual Organization

  • Wei Zhai
  • Yang Cao
  • Jing Zhang
  • Zheng-Jun Zha

Perceptual organization is a challenging visual task that aims to perceive and group the individual visual elements so that it is easy to understand the meaning of the scene as a whole. Most recent methods build upon advanced Convolutional Neural Networks (CNNs), learning discriminative representations and modeling context hierarchically. However, when the visual appearance difference between foreground and background is obscure, the performance of existing methods degrades significantly due to the visual ambiguity in the discrimination process. In this paper, we argue that the figure-ground assignment mechanism, which conforms to human vision cognitive theory, can be explored to empower CNNs to achieve a robust perceptual organization despite visual ambiguity. Specifically, we present a novel Figure-Ground-Aided (FGA) module to learn the configural statistics of the visual scene and leverage them for the reduction of visual ambiguity. Particularly, we demonstrate the benefit of using stronger supervisory signals by teaching the FGA module to perceive configural cues, i.e., convexity and lower region, that humans deem important for perceptual organization. Furthermore, an Interactive Enhancement Module (IEM) is devised to leverage such configural priors to assist representation learning, thereby achieving robust perceptual organization with complex visual ambiguities. In addition, a well-founded visual segregation test is designed to validate the capability of the proposed FGA mechanism explicitly. Comprehensive evaluation results demonstrate that our proposed FGA mechanism can effectively enhance the capability of perceptual organization of various baseline models. Moreover, the model augmented via our proposed FGA mechanism also outperforms state-of-the-art approaches on four challenging real-world applications.

AAAI Conference 2022 Conference Paper

ProgressiveMotionSeg: Mutually Reinforced Framework for Event-Based Motion Segmentation

  • Jinze Chen
  • Yang Wang
  • Yang Cao
  • Feng Wu
  • Zheng-Jun Zha

Dynamic Vision Sensor (DVS) can asynchronously output the events reflecting apparent motion of objects with microsecond resolution, and shows great application potential in monitoring and other fields. However, the output event stream of existing DVS inevitably contains background activity noise (BA noise) due to dark current and junction leakage current, which will affect the temporal correlation of objects, resulting in deteriorated motion estimation performance. Particularly, the existing filter-based denoising methods cannot be directly applied to suppress the noise in event stream, since there is no spatial correlation. To address this issue, this paper presents a novel progressive framework, in which a Motion Estimation (ME) module and an Event Denoising (ED) module are jointly optimized in a mutually reinforced manner. Specifically, based on the maximum sharpness criterion, ME module divides the input event into several segments by adaptive clustering in a motion compensating warp field, and captures the temporal correlation of event stream according to the clustered motion parameters. Taking temporal correlation as guidance, ED module calculates the confidence that each event belongs to real activity events, and transmits it to ME module to update energy function of motion segmentation for noise suppression. The two steps are iteratively updated until stable motion segmentation results are obtained. Extensive experimental results on both synthetic and real datasets demonstrate the superiority of our proposed approaches against the State-Of-The-Art (SOTA) methods.

NeurIPS Conference 2022 Conference Paper

Uncertainty-Aware Hierarchical Refinement for Incremental Implicitly-Refined Classification

  • Jian Yang
  • Kai Zhu
  • Kecheng Zheng
  • Yang Cao

Incremental implicitly-refined classification task aims at assigning hierarchical labels to each sample encountered at different phases. Existing methods tend to fail in generating hierarchy-invariant descriptors when the novel classes are inherited from the old ones. To address the issue, this paper, which explores the inheritance relations in the process of multi-level semantic increment, proposes an Uncertainty-Aware Hierarchical Refinement (UAHR) scheme. Specifically, our proposed scheme consists of a global representation extension strategy that enhances the discrimination of incremental representation by widening the corresponding margin distance, and a hierarchical distribution alignment strategy that refines the distillation process by explicitly determining the inheritance relationship of the incremental class. Particularly, the shifting subclasses are corrected under the guidance of hierarchical uncertainty, ensuring the consistency of the homogeneous features. Extensive experiments on widely used benchmarks (i.e., IIRC-CIFAR, IIRC-ImageNet-lite, IIRC-ImageNet-Subset, and IIRC-ImageNet-full) demonstrate the superiority of our proposed method over the state-of-the-art approaches.

AAAI Conference 2021 Conference Paper

FLAME: Differentially Private Federated Learning in the Shuffle Model

  • Ruixuan Liu
  • Yang Cao
  • Hong Chen
  • Ruoyang Guo
  • Masatoshi Yoshikawa

Federated Learning (FL) is a promising machine learning paradigm that enables the analyzer to train a model without collecting users' raw data. To ensure users' privacy, differentially private federated learning has been intensively studied. The existing works are mainly based on the curator model or local model of differential privacy. However, both of them have pros and cons. The curator model allows greater accuracy but requires a trusted analyzer. In the local model, where users randomize local data before sending them to the analyzer, a trusted analyzer is not required but the accuracy is limited. In this work, by leveraging the privacy amplification effect in the recently proposed shuffle model of differential privacy, we achieve the best of both worlds, i.e., accuracy in the curator model and strong privacy without relying on any trusted party. We first propose an FL framework in the shuffle model and a simple protocol (SS-Simple) extended from existing work. We find that SS-Simple only provides an insufficient privacy amplification effect in FL since the dimension of the model parameter is quite large. To solve this challenge, we propose an enhanced protocol (SS-Double) to increase the privacy amplification effect by subsampling. Furthermore, for boosting the utility when the model size is greater than the user population, we propose an advanced protocol (SS-Topk) with gradient sparsification techniques. We also provide theoretical analysis and numerical evaluations of the privacy amplification of the proposed protocols. Experiments on a real-world dataset validate that SS-Topk improves the testing accuracy by 60.7% over the local model based FL. We highlight an observation that SS-Topk improves the accuracy by 33.94% over the curator model based FL without any trusted party. Compared with non-private FL, our protocol SS-Topk only loses 1.48% accuracy under (2.348, 5e−6)-DP per epoch.

IJCAI Conference 2021 Conference Paper

One-Shot Affordance Detection

  • Hongchen Luo
  • Wei Zhai
  • Jing Zhang
  • Yang Cao
  • Dacheng Tao

Affordance detection refers to identifying the potential action possibilities of objects in an image, which is an important ability for robot perception and manipulation. To empower robots with this ability in unseen scenarios, we consider the challenging one-shot affordance detection problem in this paper, i.e., given a support image that depicts the action purpose, all objects in a scene with the common affordance should be detected. To this end, we devise a One-Shot Affordance Detection (OS-AD) network that first estimates the purpose and then transfers it to help detect the common affordance from all candidate images. Through collaborative learning, OS-AD can capture the common characteristics between objects having the same underlying affordance and learn a good adaptation capability for perceiving unseen affordances. Besides, we build a Purpose-driven Affordance Dataset (PAD) by collecting and labeling 4k images from 31 affordance and 72 object categories. Experimental results demonstrate the superiority of our model over previous representative ones in terms of both objective metrics and visual quality. The benchmark suite is at ProjectPage.

IJCAI Conference 2020 Conference Paper

Adaptively Multi-Objective Adversarial Training for Dialogue Generation

  • Xuemiao Zhang
  • Zhouxing Tan
  • Xiaoning Zhang
  • Yang Cao
  • Rui Yan

Naive neural dialogue generation models tend to produce repetitive and dull utterances. The promising adversarial models train the generator against a well-designed discriminator to push it to improve towards the expected direction. However, assessing dialogues requires consideration of many aspects of linguistics, which are difficult to be fully covered by a single discriminator. To address this, we reframe the dialogue generation task as a multi-objective optimization problem and propose a novel adversarial dialogue generation framework with multiple discriminators that excel in different objectives for multiple linguistic aspects, called AMPGAN, whose feasibility is proved by theoretical derivations. Moreover, we design an adaptively adjusted sampling distribution to balance the discriminators and promote the overall improvement of the generator by continuing to focus on those objectives on which the generator performs relatively poorly. Experimental results on two real-world datasets show a significant improvement over the baselines.

IJCAI Conference 2020 Conference Paper

Self-Supervised Tuning for Few-Shot Segmentation

  • Kai Zhu
  • Wei Zhai
  • Yang Cao

Few-shot segmentation aims at assigning a category label to each image pixel with few annotated samples. It is a challenging task since the dense prediction can only be achieved under the guidance of latent features defined by sparse annotations. Existing meta-learning based methods tend to fail in generating category-specific discriminative descriptors when the visual features extracted from support images are marginalized in the embedding space. To address this issue, this paper presents an adaptive tuning framework, in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme, augmenting category-specific descriptors for label prediction. Specifically, a novel self-supervised inner loop is first devised as the base learner to extract the underlying semantic features from the support image. Then, gradient maps are calculated by back-propagating the self-supervised loss through the obtained features, and leveraged as guidance for augmenting the corresponding elements in the embedding space. Finally, with the ability to continuously learn from different episodes, an optimization-based meta-learner is adopted as the outer loop of our proposed framework to gradually refine the segmentation results. Extensive experiments on the benchmark PASCAL-5i and COCO-20i datasets demonstrate the superiority of our proposed method over the state-of-the-art.

IJCAI Conference 2019 Conference Paper

One-Shot Texture Retrieval with Global Context Metric

  • Kai Zhu
  • Wei Zhai
  • Zheng-Jun Zha
  • Yang Cao

In this paper, we tackle one-shot texture retrieval: given an example of a new reference texture, detect and segment all the pixels of the same texture category within an arbitrary image. To address this problem, we present an OS-TR network that encodes both the reference patch and the query image, achieving texture segmentation with respect to the reference category. Unlike existing texture encoding methods that integrate a CNN with orderless pooling, we propose a directionality-aware network that captures texture variations along each direction, resulting in a spatially invariant representation. To segment new categories given only a few examples, we incorporate a self-gating mechanism into a relation network to exploit global context information for adjusting the per-channel modulation weights of local relation features. Extensive experiments on benchmark texture datasets and real scenarios demonstrate the above-par segmentation performance and robust cross-domain generalization of the proposed method.

IJCAI Conference 2019 Conference Paper

VAEGAN: A Collaborative Filtering Framework based on Adversarial Variational Autoencoders

  • Xianwen Yu
  • Xiaoning Zhang
  • Yang Cao
  • Min Xia

Recently, Variational Autoencoders (VAEs) have been successfully applied to collaborative filtering for implicit feedback. However, the performance of the resulting model depends heavily on the expressiveness of the inference model, and the latent representation is often too constrained to capture the true posterior distribution. In this paper, a novel framework named VAEGAN is proposed to address this issue. In VAEGAN, we first introduce Adversarial Variational Bayes (AVB) to train Variational Autoencoders with an arbitrarily expressive inference model. By utilizing Generative Adversarial Networks (GANs) for implicit variational inference, the inference model provides a better approximation to the posterior and maximum-likelihood assignment. The performance of the model is then further improved by introducing an auxiliary discriminative network, trained adversarially, to achieve high recommendation accuracy. Furthermore, a contractive loss is added to the classical reconstruction cost function as a penalty term to yield robust features and improve generalization. Finally, we show that the proposed VAEGAN significantly outperforms state-of-the-art baselines on several real-world datasets.

IJCAI Conference 2018 Conference Paper

Enhanced-alignment Measure for Binary Foreground Map Evaluation

  • Deng-Ping Fan
  • Cheng Gong
  • Yang Cao
  • Bo Ren
  • Ming-Ming Cheng
  • Ali Borji

The existing binary foreground map (FM) measures address various types of errors in either pixel-wise or structural ways. These measures consider pixel-level matches or image-level information independently, while cognitive vision studies have shown that human vision is highly sensitive to both global information and local details in scenes. In this paper, we take a detailed look at current binary FM evaluation measures and propose a novel and effective E-measure (Enhanced-alignment measure). Our measure combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information. We demonstrate the superiority of our measure over the available measures on 4 popular datasets via 5 meta-measures, including ranking models for applications, demoting generic and random Gaussian noise maps, ground-truth switching, and human judgments. We find large improvements in almost all the meta-measures. For instance, in terms of application ranking, we observe improvements ranging from 9.08% to 19.65% compared with other popular measures.
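The "one term" combining local pixel values with image-level means can be sketched as follows. This is an illustrative reconstruction from the abstract's description, not the authors' released code, and the quadratic enhancement function and the `eps` stabilizer are assumptions:

```python
import numpy as np

def enhanced_alignment_measure(fm: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Sketch of an enhanced-alignment score between a binary foreground
    map `fm` and ground truth `gt` (same shape, values in {0, 1})."""
    fm = fm.astype(float)
    gt = gt.astype(float)
    # Bias each map by its global mean: the image-level statistic.
    d_fm = fm - fm.mean()
    d_gt = gt - gt.mean()
    # Alignment matrix: near +1 where local deviations agree in sign,
    # near -1 where they disagree -- the joint local/global term.
    align = 2.0 * d_gt * d_fm / (d_gt ** 2 + d_fm ** 2 + eps)
    # Enhanced alignment: map [-1, 1] into [0, 1] with a quadratic.
    enhanced = (align + 1.0) ** 2 / 4.0
    return float(enhanced.mean())
```

Under this sketch, a map identical to the ground truth scores close to 1, while its complement scores close to 0.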

IJCAI Conference 2016 Conference Paper

PARecommender: A Pattern-Based System for Route Recommendation

  • Feiyi Tang
  • Jia Zhu
  • Yang Cao
  • Sanli Ma
  • Yulong Chen
  • Jing He
  • Changqin Huang
  • Gansen Zhao

Wide adoption of GPS-enabled devices generates massive trajectory data every minute, from which meaningful traffic patterns can be mined. In this demo, we present a system called PARecommender, which predicts traffic conditions and provides route recommendations based on the mined traffic patterns. We first introduce the technical details of PARecommender, and then show several real cases of how it works.

IROS Conference 2015 Conference Paper

Generation of dynamically feasible and collision free trajectory by applying six-order Bezier curve and local optimal reshaping

  • Liang Yang
  • Dalei Song
  • Jizhong Xiao
  • Jianda Han
  • Liying Yang 0002
  • Yang Cao

This paper considers the problem of generating dynamically feasible and collision-free trajectories for unmanned aerial vehicles (UAVs) in cluttered environments. General random-based searching algorithms output piecewise linear paths, which cause large discrepancies when used as navigation references for high-speed UAVs; disturbances may also lead the UAVs into danger. To obtain agile autonomy without potential dangers, this paper introduces a three-step method to generate a feasible reference. In the first step, a six-order Bezier curve, which uses Tuning Rotation to decrease the curvature, is introduced to smooth the output of the path planner. Then a forward simulation is implemented to find the potentially dangerous regions. Finally, the path is reshaped by a local optimal reshaping planner to eliminate residual dangers. The three steps form a loop: the reshaped path is sent back to the first step to re-check dynamic feasibility and safety. To our knowledge, this is the first method to combine a six-order Bezier curve, Tuning Rotation, and local optimal reshaping, where Tuning Rotation meets various curvature requirements without violating the previous path, and local optimal reshaping performs both temporal and spatial reshaping with high time efficiency. The method accounts for the system dynamics to achieve agile autonomy, providing the geometric reference as well as the low-level control. The effectiveness of the proposed method is demonstrated by simulations.
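The smoothing step relies on degree-six Bezier curves (seven control points). A minimal evaluation routine via the Bernstein basis is sketched below; this is standard Bezier mathematics, not the authors' implementation, which additionally applies Tuning Rotation and local optimal reshaping:

```python
from math import comb

def bezier6(control_points, t):
    """Evaluate a six-order (degree-6) Bezier curve at parameter t in [0, 1].

    `control_points` is a sequence of seven (x, y) pairs. Each point is
    weighted by the Bernstein polynomial B_{i,6}(t) = C(6, i) t^i (1-t)^(6-i).
    """
    n = 6
    assert len(control_points) == n + 1, "a degree-6 curve needs 7 control points"
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        b = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        x += b * px
        y += b * py
    return x, y
```

A Bezier curve interpolates its first and last control points (at t = 0 and t = 1) and stays inside the convex hull of the others, which is what makes it convenient for smoothing a piecewise linear planner output.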

IS Journal 2014 Journal Article

Self-Organizing Networks: From Bio-Inspired to Social-Driven

  • Dongliang Duan
  • Liuqing Yang
  • Yang Cao
  • Jiaolong Wei
  • Xiang Cheng

In today's complicated world of wireless networking, rapid changes and steep challenges could lead to too many users flocking to one or only a few networks, thereby leaving some potentially useful service providers out of the picture. To avoid such negative effects, it might be worthwhile to transition from the idea of bio-inspired to social-driven networking. Here, the authors outline a few rules to help with the transition.