Arrow Research search

Author name cluster

Shitong Shao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

7

ICLR Conference 2025 Conference Paper

IV-mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

  • Shitong Shao
  • Zikai Zhou
  • Bai Lichen
  • Haoyi Xiong
  • Zeke Xie

Exploring suitable solutions to improve performance by increasing the computational cost of inference in visual diffusion models is a highly promising direction. Ample prior work has demonstrated that correctly scaling up computation in the sampling process can successfully lead to improved generation quality, enhanced image editing, and compositional generalization. While there have been rapid advancements in developing inference-heavy algorithms for improved image generation, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing approaches yield only minimal performance gains perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-Mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments have demonstrated that IV-Mixed Sampler achieves state-of-the-art performance on 4 benchmarks including UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150/1649, and VBench. For example, the open-source Animatediff with IV-Mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, close to the 223.1 achieved by the closed-source Pika-2.0.
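
The alternating idea described in this abstract can be illustrated with a toy sketch: within each sampling step, an image model refines each frame independently, then a video model denoises the whole clip to restore temporal coherence. The two "models" below are trivial numeric stand-ins invented for illustration, not the paper's actual diffusion networks or its real update rule.

```python
# Toy stand-in for a per-frame image-model (IDM) refinement step.
def idm_step(frame, t):
    return 0.9 * frame

# Toy stand-in for a video-model (VDM) step: pull frames toward their
# mean, mimicking how a VDM enforces cross-frame consistency.
def vdm_step(video, t):
    mean = sum(video) / len(video)
    return [0.5 * f + 0.5 * mean for f in video]

def iv_mixed_sample(video, timesteps):
    for t in timesteps:
        video = [idm_step(f, t) for f in video]  # per-frame quality
        video = vdm_step(video, t)               # temporal coherence
    return video

# Two "frames" that start far apart are pulled closer step by step.
out = iv_mixed_sample([1.0, 3.0], timesteps=[2, 1, 0])
```

The sketch only shows the interleaving structure (IDM pass, then VDM pass, per step); any real implementation would operate on latent tensors with proper noise schedules.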

ICLR Conference 2025 Conference Paper

Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection

  • Lichen Bai
  • Shitong Shao
  • Zikai Zhou
  • Zipeng Qi
  • Zhiqiang Xu 0003
  • Haoyi Xiong
  • Zeke Xie

Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain high image quality and high prompt-image alignment for those challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we make three main contributions in this paper. First, we propose **diffusion self-reflection** that alternately performs denoising and inversion and demonstrate that such diffusion self-reflection can leverage the guidance gap between denoising and inversion to capture prompt-related semantic information with theoretical and empirical evidence. Second, motivated by theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel self-reflection-based diffusion sampling method that leverages the guidance gap between denoising and inversion to accumulate semantic information step by step along the sampling path, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be generally applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding and computational costs. Third, our extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. For example, DreamShaper with Z-Sampling can self-improve with the HPSv2 winning rate up to **94%** over the original results. Moreover, Z-Sampling can further enhance existing diffusion models combined with other orthogonal methods, including Diffusion-DPO. The code is publicly available at [github.com/xie-lab-ml/Zigzag-Diffusion-Sampling](https://github.com/xie-lab-ml/Zigzag-Diffusion-Sampling).
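
The zigzag schedule in this abstract (denoise, invert back, denoise again, so the guidance gap accumulates semantic information) can be sketched with toy stand-in step functions. The `denoise`/`invert` functions and guidance values below are placeholders chosen for illustration, not the paper's implementation.

```python
# Toy DDIM-like steps: denoising moves the latent forward under guidance,
# inversion moves it back under (typically weaker) guidance. The asymmetry
# between the two guidance scales is what lets each zigzag accumulate a
# net semantic gain.
def denoise(x, t, guidance):
    return x + guidance / (t + 1)   # placeholder x_t -> x_{t-1} step

def invert(x, t, guidance):
    return x - guidance / (t + 1)   # placeholder x_{t-1} -> x_t step

def z_sampling(x, timesteps, g_denoise=7.5, g_invert=1.0):
    trace = []
    for t in timesteps:
        x = denoise(x, t, g_denoise)   # forward: strong guidance
        trace.append(("denoise", t))
        x = invert(x, t, g_invert)     # backward: weak guidance
        trace.append(("invert", t))
        x = denoise(x, t, g_denoise)   # forward again: net gain
        trace.append(("denoise", t))
    return x, trace

x, trace = z_sampling(0.0, timesteps=[3, 2, 1, 0])
```

Per step the toy latent gains `(2 * g_denoise - g_invert) / (t + 1)`, mirroring how the guidance gap, not the individual steps, drives the accumulated improvement.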

NeurIPS Conference 2024 Conference Paper

Diffusion Models are Certifiably Robust Classifiers

  • Huanran Chen
  • Yinpeng Dong
  • Shitong Shao
  • Zhongkai Hao
  • Xiao Yang
  • Hang Su
  • Jun Zhu

Generative learning, recognized for its effective modeling of data distributions, offers inherent advantages in handling out-of-distribution instances, especially for enhancing robustness to adversarial attacks. Among these, diffusion classifiers, utilizing powerful diffusion models, have demonstrated superior empirical robustness. However, a comprehensive theoretical understanding of their robustness is still lacking, raising concerns about their vulnerability to stronger future attacks. In this study, we prove that diffusion classifiers possess $O(1)$ Lipschitzness, and establish their certified robustness, demonstrating their inherent resilience. To achieve non-constant Lipschitzness, thereby obtaining much tighter certified robustness, we generalize diffusion classifiers to classify Gaussian-corrupted data. This involves deriving the evidence lower bounds (ELBOs) for these distributions, approximating the likelihood using the ELBO, and calculating classification probabilities via Bayes' theorem. Experimental results show the superior certified robustness of these Noised Diffusion Classifiers (NDCs). Notably, we achieve over 80% and 70% certified robustness on CIFAR-10 under adversarial perturbations with $\ell_2$ norms less than 0.25 and 0.5, respectively, using a single off-the-shelf diffusion model without any additional data.
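
The final Bayes step this abstract describes, turning per-class ELBO likelihood approximations into class probabilities, reduces to a softmax over the ELBOs under a uniform class prior. The ELBO values below are made-up numbers for illustration, not model outputs.

```python
import math

def classify_from_elbos(elbos):
    # log p(x | y) is approximated by ELBO_y, so under a uniform prior
    # Bayes' theorem gives p(y | x) ∝ exp(ELBO_y).
    m = max(elbos)                         # subtract max for stability
    w = [math.exp(e - m) for e in elbos]
    z = sum(w)
    return [v / z for v in w]

# Hypothetical ELBOs for three classes; class 1 has the highest ELBO
# and therefore the highest posterior probability.
probs = classify_from_elbos([-3.2, -1.1, -4.0])
```

A real diffusion classifier would estimate each ELBO by averaging denoising losses over timesteps and noise draws; only the Bayes conversion is shown here.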

NeurIPS Conference 2024 Conference Paper

Elucidating the Design Space of Dataset Condensation

  • Shitong Shao
  • Zikai Zhou
  • Huanran Chen
  • Zhiqiang Shen

Dataset condensation, a concept within $\textit{data-centric learning}$, aims to efficiently transfer critical attributes from an original dataset to a synthetic version, while maintaining both diversity and realism of the syntheses. This approach can significantly improve model training efficiency and is also adaptable for multiple application areas. Previous methods in dataset condensation have faced several challenges: some incur high computational costs which limit scalability to larger datasets ($\textit{e.g.,}$ MTT, DREAM, and TESLA), while others are restricted to less optimal design spaces, which could hinder potential improvements, especially in smaller datasets ($\textit{e.g.,}$ SRe$^2$L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design-centric framework that includes specific, effective strategies like implementing soft category-aware matching, adjusting the learning rate schedule, and applying a small batch size. These strategies are grounded in both empirical evidence and theoretical backing. Our resulting approach, $\textbf{E}$lucidate $\textbf{D}$ataset $\textbf{C}$ondensation ($\textbf{EDC}$), establishes a benchmark for both small and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%. This performance surpasses those of SRe$^2$L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively. Code is available at: https://github.com/shaoshitong/EDC.

IJCAI Conference 2024 Conference Paper

Rethinking Centered Kernel Alignment in Knowledge Distillation

  • Zikai Zhou
  • Yunhang Shen
  • Shitong Shao
  • Linrui Gong
  • Shaohui Lin

Knowledge distillation has emerged as a highly effective method for bridging the representation discrepancy between large-scale models and lightweight models. Prevalent approaches involve leveraging appropriate metrics to minimize the divergence or distance between the knowledge extracted from the teacher model and the knowledge learned by the student model. Centered Kernel Alignment (CKA) is widely used to measure representation similarity and has been applied in several knowledge distillation methods. However, these methods are complex and fail to uncover the essence of CKA, thus not answering the question of how to use CKA properly to achieve simple and effective distillation. This paper first provides a theoretical perspective to illustrate the effectiveness of CKA, which decomposes CKA into the upper bound of Maximum Mean Discrepancy (MMD) and a constant term. Drawing from this, we propose a novel Relation-Centered Kernel Alignment (RCKA) framework, which practically establishes a connection between CKA and MMD. Furthermore, we dynamically customize the application of CKA based on the characteristics of each task, with lower computational cost yet performance comparable to previous methods. Extensive experiments on CIFAR-100, ImageNet-1k, and MS-COCO demonstrate that our method achieves state-of-the-art performance on almost all teacher-student pairs for image classification and object detection, validating the effectiveness of our approaches. Our code is available at https://github.com/Klayand/PCKA.
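
For reference, the similarity measure this abstract builds on, linear CKA between two feature matrices for the same inputs, can be computed in a few lines. This is the standard linear-CKA formula, not the paper's RCKA method or its dynamic per-task customization.

```python
import numpy as np

def linear_cka(X, Y):
    # X: (n, d1), Y: (n, d2) — features of two models on the same n inputs.
    X = X - X.mean(axis=0)                          # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2      # unnormalized HSIC
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(X := X, "fro") if False else np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

Linear CKA lies in [0, 1], equals 1 for identical representations, and is invariant to isotropic scaling and orthogonal transforms of either feature space, which is why it is a popular representation-similarity metric for distillation analysis.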

JBHI Journal 2023 Journal Article

MS-FRAN: A Novel Multi-Source Domain Adaptation Method for EEG-Based Emotion Recognition

  • Wei Li
  • Wei Huan
  • Shitong Shao
  • Bowen Hou
  • Aiguo Song

Electroencephalogram (EEG)-based emotion recognition has gradually become a research hotspot. However, the large distribution differences of EEG signals across subjects leave current research stuck in a dilemma. To resolve this problem, in this article, we propose a novel and effective method, the Multi-Source Feature Representation and Alignment Network (MS-FRAN). The effectiveness of the proposed method mainly comes from three new modules: a Wide Feature Extractor (WFE) for feature learning, a Random Matching Operation (RMO) for model training, and Top-$\mathit{h}$ ranked domain classifier selection (TOP) for emotion classification. MS-FRAN is not only effective in aligning the distributions of each pair of source and target domains, but also capable of reducing the distributional differences among the multiple source domains. Experimental results on the public benchmark datasets SEED and DEAP have demonstrated the advantage of our method over related competitive approaches for cross-subject EEG-based emotion recognition.

IJCAI Conference 2023 Conference Paper

Teaching What You Should Teach: A Data-Based Distillation Method

  • Shitong Shao
  • Huanran Chen
  • Zhen Huang
  • Linrui Gong
  • Shuai Wang
  • Xinxiao Wu

In real teaching scenarios, an excellent teacher always teaches what they are good at but the student is not, which best helps the student make up for their weaknesses and become well-rounded. Inspired by this, we introduce the "Teaching what you Should Teach" strategy into a knowledge distillation framework, and propose a data-based distillation method named "TST" that searches for desirable augmented samples to assist in distilling more efficiently and rationally. To be specific, we design a neural network-based data augmentation module with a priori bias that finds out what meets the teacher's strengths but the student's weaknesses, by learning magnitudes and probabilities to generate suitable data samples. By training the data augmentation module and the generalized distillation paradigm alternately, a student model is learned with excellent generalization ability. To verify the effectiveness of our method, we conducted extensive comparative experiments on object recognition, detection, and segmentation tasks. The results on the CIFAR-100, ImageNet-1k, MS-COCO, and Cityscapes datasets demonstrate that our method achieves state-of-the-art performance on almost all teacher-student pairs. Furthermore, we conduct visualization studies to explore what magnitudes and probabilities are needed for the distillation process.