Author name cluster

Guang Dai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Privacy Leaks by Adversaries: Adversarial Iterations for Membership Inference Attack

  • Jing Xue
  • Zhishen Sun
  • Haishan Ye
  • Luo Luo
  • Xiangyu Chang
  • Guang Dai

Membership inference attack (MIA) has become one of the most widely used and effective methods for evaluating the privacy risks of machine learning models. This attack aims to determine whether a specific sample is part of the model's training set by analyzing the model's output. While traditional membership inference attacks focus on leveraging the model’s posterior output, such as confidence on the target sample, we propose IMIA, a novel attack strategy that utilizes the process of generating adversarial samples to infer membership. We propose to infer the member properties of the target sample using the number of iterations required to generate its adversarial sample. We conduct experiments across multiple models and datasets, and our results demonstrate that the number of iterations for generating an adversarial sample is a reliable feature for membership inference, achieving strong performance both in black-box and white-box attack scenarios. This work provides a new perspective for evaluating model privacy and highlights the potential of adversarial example-based features for privacy leakage assessment.
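The iteration-count idea above can be illustrated with a minimal sketch. It is not the IMIA implementation: the toy logistic model, the signed-gradient step, the step size, and the membership threshold are all hypothetical. In particular, the rule that a sample needing more iterations is guessed to be a member is an assumption made here for illustration only.

```python
import math

def predict_prob(w, b, x):
    """Probability of class 1 under a toy logistic model (hypothetical target model)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def iterations_to_flip(w, b, x, label, step=0.05, max_iters=1000):
    """Count signed-gradient steps taken before the predicted label changes."""
    x_adv = list(x)
    for it in range(max_iters + 1):
        p = predict_prob(w, b, x_adv)
        pred = 1 if p >= 0.5 else 0
        if pred != label:
            return it
        # For a logistic model, the loss gradient w.r.t. x is (p - label) * w.
        # Moving along its sign increases the loss on the true label.
        direction = [(p - label) * wi for wi in w]
        x_adv = [xi + step * (1 if d > 0 else -1 if d < 0 else 0)
                 for xi, d in zip(x_adv, direction)]
    return max_iters

w, b = [1.0, -1.0], 0.0
far_sample = [2.0, -2.0]    # far from the decision boundary
near_sample = [0.2, -0.2]   # close to the decision boundary
far_iters = iterations_to_flip(w, b, far_sample, label=1)
near_iters = iterations_to_flip(w, b, near_sample, label=1)

# Hypothetical decision rule: treat a high iteration count as a membership signal.
is_member_guess = far_iters > 20
```

The sample far from the boundary needs roughly ten times as many steps as the one near it, which is the kind of per-sample feature the abstract describes.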

ICML Conference 2025 Conference Paper

DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making

  • Ziru Wang
  • Mengmeng Wang 0005
  • Jade Dai
  • Teli Ma
  • Guo-Jun Qi
  • Yong Liu 0007
  • Guang Dai
  • Jingdong Wang 0001

Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive results demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios.
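As a rough illustration of score-based frame filtering, the sketch below combines two per-frame signals into an importance score and keeps the top-scoring frames in temporal order. The fixed weighted sum, the `alpha` weight, and the input scores are hypothetical. The abstract describes FrameScorer as adaptive, so this stands in for it only conceptually.

```python
def select_frames(semantic, saliency, keep, alpha=0.5):
    """Keep the `keep` highest-scoring frames, returned in temporal order."""
    # Hypothetical importance score: a fixed convex combination of the two signals.
    scores = [alpha * s + (1 - alpha) * v for s, v in zip(semantic, saliency)]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top), scores

# Made-up per-frame scores for a five-frame clip.
semantic = [0.9, 0.2, 0.8, 0.1, 0.7]
saliency = [0.8, 0.1, 0.9, 0.2, 0.3]
kept, scores = select_frames(semantic, saliency, keep=3)
```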

IJCAI Conference 2025 Conference Paper

Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

  • Yuanyuan Chang
  • Yinghua Yao
  • Tao Qin
  • Mengmeng Wang
  • Ivor Tsang
  • Guang Dai

Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data. Code is available at https://github.com/Chang-yuanyuan/CASO.

ICLR Conference 2025 Conference Paper

Manifold Constraint Reduces Exposure Bias in Accelerated Diffusion Sampling

  • Yuzhe Yao
  • Jun Chen 0023
  • Zeyi Huang
  • Haonan Lin
  • Mengmeng Wang 0005
  • Guang Dai
  • Jingdong Wang 0001

Diffusion models have demonstrated significant potential for generating high-quality images, audio, and videos. However, their iterative inference process entails substantial computational costs, limiting practical applications. Recently, researchers have introduced accelerated sampling methods that enable diffusion models to generate samples with far fewer timesteps than those used during training. Nonetheless, as the number of sampling steps decreases, the prediction errors significantly degrade the quality of generated outputs. Additionally, the exposure bias in diffusion models further amplifies these errors. To address these challenges, we leverage a manifold hypothesis to explore the exposure bias problem in depth. Based on this geometric perspective, we propose a manifold constraint that effectively reduces exposure bias during accelerated sampling of diffusion models. Notably, our method involves no additional training and requires only minimal hyperparameter tuning. Extensive experiments demonstrate the effectiveness of our approach, achieving a FID score of 15.60 with 10-step SDXL on MS-COCO, surpassing the baseline by a reduction of 2.57 in FID.

NeurIPS Conference 2025 Conference Paper

MonoLift: Learning 3D Manipulation Policies from Monocular RGB via Distillation

  • Ziru Wang
  • Mengmeng Wang
  • Guang Dai
  • Yongliu Long
  • Jingdong Wang

Although learning 3D manipulation policies from monocular RGB images is lightweight and deployment-friendly, the lack of structural information often leads to inaccurate action estimation. While explicit 3D inputs can mitigate this issue, they typically require additional sensors and introduce data acquisition overhead. An intuitive alternative is to incorporate a pre-trained depth estimator; however, this often incurs substantial inference-time cost. To address this, we propose MonoLift, a tri-level knowledge distillation framework that transfers spatial, temporal, and action-level knowledge from a depth-guided teacher to a monocular RGB student. By jointly distilling geometry-aware features, temporal dynamics, and policy behaviors during training, MonoLift enables the student model to perform 3D-aware reasoning and precise control at deployment using only monocular RGB input. Extensive experiments on both simulated and real-world manipulation tasks show that MonoLift not only outperforms existing monocular approaches but even surpasses several methods that rely on explicit 3D input, offering a resource-efficient and effective solution for vision-based robotic control. The video demonstration is available on our project page: https://robotasy.github.io/MonoLift/.

ICLR Conference 2025 Conference Paper

ProAdvPrompter: A Two-Stage Journey to Effective Adversarial Prompting for LLMs

  • Hao Di
  • Tong He
  • Haishan Ye
  • Yinghui Huang 0001
  • Xiangyu Chang
  • Guang Dai
  • Ivor W. Tsang

As large language models (LLMs) are increasingly being integrated into various real-world applications, the identification of their vulnerabilities to jailbreaking attacks becomes an essential component of ensuring the safety and reliability of LLMs. Previous studies have developed LLM assistants, known as the adversarial prompter, to automatically generate suffixes that manipulate target LLMs into generating harmful and undesirable outputs. However, these approaches often suffer from low performance or generate semantically meaningless prompts, which can be easily identified by perplexity-based defenses. In this paper, we introduce a novel two-stage method, $\texttt{ProAdvPrompter}$, that significantly improves the performance of adversarial prompters. In $\texttt{ProAdvPrompter}$, the first stage (Exploration) utilizes the loss information to guide the adversarial prompter in generating suffixes that are more likely to elicit harmful responses. Then the second stage (Exploitation) iteratively fine-tunes the prompter using high-quality generated adversarial suffixes to further boost performance. Additionally, we incorporate the prompt template to aid in the Exploration stage and propose a filtering mechanism to accelerate the training process in the Exploitation stage. We evaluate $\texttt{ProAdvPrompter}$ against the well-aligned LLMs (i.e., Llama2-Chat-7B and Llama3-chat-8B), achieving attack success rates of 99.68% and 97.12% respectively after 10 trials on the AdvBench dataset, thereby enhancing performance by $\sim 2$ times compared to previous works. Moreover, $\texttt{ProAdvPrompter}$ reduces training time by 20% on Llama3-Instruct-8B, generates more generalized adversarial suffixes, and demonstrates resilience against the perplexity defense. An ablation study further evaluates the effects of key components in $\texttt{ProAdvPrompter}$ (the prompt template and the filtering mechanism).

ICLR Conference 2025 Conference Paper

Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

  • Yanjun Zhao 0001
  • Sizhe Dang
  • Haishan Ye
  • Guang Dai
  • Yi Qian 0004
  • Ivor W. Tsang

Fine-tuning large language models (LLMs) is necessary for specific downstream tasks, but classic first-order optimizers entail prohibitive GPU memory because of backpropagation. Recent works such as MeZO have turned to zeroth-order optimizers for fine-tuning, which substantially reduce memory use by relying on two forward passes. However, heterogeneous curvatures across different parameter dimensions in LLMs often cause model convergence instability or even failure. In this work, we propose HiZOO, a diagonal Hessian informed Zeroth-Order Optimizer, which is the first work to leverage the diagonal Hessian to enhance ZOO for fine-tuning LLMs. We provide theoretical proof for HiZOO and visualize the optimization trajectories on test functions to illustrate how it improves convergence in handling heterogeneous curvatures. Extensive experiments on various models (RoBERTa, OPT, Phi-2 and Llama3, with 350M$\sim$66B parameters) indicate that HiZOO significantly reduces training steps and enhances model accuracy, while keeping the memory advantage of ZOO. For example, on the SST2 task HiZOO achieves $8\times$ speedup and better accuracy over MeZO across different models. We also propose HiZOO-L, which reduces the Hessian memory cost to 10% of the MeZO, while maintaining almost the same performance. Compared with ZO-Adam, HiZOO-L achieves a 4.3% improvement, using just 50% of the GPU memory. Code is available at https://anonymous.4open.science/r/HiZOO-27F8.
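For context, the two-forward-pass estimate that zeroth-order fine-tuning builds on can be sketched as follows. This is the plain two-point estimator with a Gaussian perturbation, not HiZOO's Hessian-informed update. The toy quadratic loss, step size, perturbation scale, and iteration count are hypothetical choices. The loss deliberately has a 100:1 curvature ratio between coordinates, the kind of heterogeneity the abstract refers to.

```python
import random

def loss(theta):
    """Toy quadratic whose two coordinates have very different curvature."""
    return 0.5 * (100.0 * theta[0] ** 2 + 1.0 * theta[1] ** 2)

def zo_gradient(f, theta, eps, rng):
    """Two-point estimate ((f(t + eps*z) - f(t - eps*z)) / (2*eps)) * z, z ~ N(0, I)."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = f([t + eps * zi for t, zi in zip(theta, z)])    # first forward pass
    minus = f([t - eps * zi for t, zi in zip(theta, z)])   # second forward pass
    scale = (plus - minus) / (2.0 * eps)
    return [scale * zi for zi in z]

rng = random.Random(0)
theta = [1.0, 1.0]
start_loss = loss(theta)
for _ in range(2000):
    g = zo_gradient(loss, theta, eps=1e-3, rng=rng)
    theta = [t - 1e-3 * gi for t, gi in zip(theta, g)]
final_loss = loss(theta)
```

No gradient is ever computed by backpropagation here. Only function values are used, which is where the memory saving comes from. The single step size must stay small enough for the steep coordinate, which slows progress along the flat one.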

AAAI Conference 2025 Conference Paper

SpotActor: Training-Free Layout-Controlled Consistent Image Generation

  • Jiahao Wang
  • Caixia Yan
  • Weizhan Zhang
  • Haonan Lin
  • Mengmeng Wang
  • Guang Dai
  • Tieliang Gong
  • Hao Sun

Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. For these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and thus propose a training-free pipeline, SpotActor, which features a layout-conditioned optimizing stage and a consistent sampling stage. In the optimizing stage, we innovate a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. While in the sampling stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate the performance, we present ActorBench, a specified benchmark with hundreds of reasonable prompt-box pairs stemming from object detection datasets. Comprehensive experiments are conducted to demonstrate the effectiveness of our method. The results prove that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity and background diversity.

IJCAI Conference 2025 Conference Paper

VidEvo: Evolving Video Editing through Exhaustive Temporal Modeling

  • Sizhe Dang
  • Huan Liu
  • Mengmeng Wang
  • Xin Lai
  • Guang Dai
  • Jingdong Wang

Text-guided video editing (TGVE) has become a recent hotspot due to its entertainment value and practical applications. To reduce overhead, existing methods primarily extend from text-to-image diffusion models and typically involve reconstruction and editing phases. However, challenges persist, particularly in enhancing temporal consistency of a video while adhering to textual alignment requirements. A crucial factor leading to the aforementioned issue is the inadequate and implicit tuning of the attention module within existing methods, which is specifically designed to capture temporal information. In light of this, we introduce VidEvo, a novel one-shot video editing method that leverages explicit cues derived from the original video to enhance temporal modeling. By integrating null-video embedding (NVE) and window-frame attention (WFA) components, VidEvo facilitates the smooth and coherent generation of videos from global and local perspectives simultaneously. To be specific, NVE learns a set of multi-scale temporal embeddings within the visual space during the reconstruction phase. These embeddings are subsequently directly injected into the attention module of the editing phase, explicitly augmenting the temporal consistency of the entire video. On the other hand, WFA enhances local temporal modeling by dynamically optimizing attention mechanisms between adjacent frames, which improves temporal coherence with reduced computational costs. Experimental evaluations show that VidEvo enhances frame-to-frame temporal consistency. Ablation studies confirm NVE and WFA’s effectiveness and their plug-and-play capability with other methods.

AAAI Conference 2024 Conference Paper

A Multimodal, Multi-Task Adapting Framework for Video Action Recognition

  • Mengmeng Wang
  • Jiazheng Xing
  • Boyuan Jiang
  • Jun Chen
  • Jianbiao Mei
  • Xingxing Zuo
  • Guang Dai
  • Jingdong Wang

Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals, including the original contrastive learning head, a cross-modal classification head, a cross-modal masked language modeling head, and a visual classification head. This multi-task decoder adeptly satisfies the need for strong supervised performance within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.

ICML Conference 2024 Conference Paper

Can Gaussian Sketching Converge Faster on a Preconditioned Landscape?

  • Yilong Wang
  • Haishan Ye
  • Guang Dai
  • Ivor W. Tsang

This paper focuses on large-scale optimization, which is very popular in the big data era. Gradient sketching is an important technique in large-scale optimization. Specifically, the random coordinate descent algorithm is a kind of gradient sketching method with the random sampling matrix as the sketching matrix. In this paper, we propose a novel gradient sketching method called GSGD (Gaussian Sketched Gradient Descent). Compared with classical gradient sketching methods such as random coordinate descent and SEGA (Hanzely et al., 2018), our GSGD does not require importance sampling but can achieve a fast convergence rate matching those of these methods with importance sampling. Furthermore, if the objective function has a non-smooth regularization term, our GSGD can also exploit the implicit structure information of the regularization term to achieve a fast convergence rate. Finally, our experimental results substantiate the effectiveness and efficiency of our algorithm.

ICLR Conference 2024 Conference Paper

Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold

  • Jun Chen 0023
  • Haishan Ye
  • Mengmeng Wang 0005
  • Tianxin Huang
  • Guang Dai
  • Ivor W. Tsang
  • Yong Liu 0007

The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than that of second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there is little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.

ICML Conference 2024 Conference Paper

Double Stochasticity Gazes Faster: Snap-Shot Decentralized Stochastic Gradient Tracking Methods

  • Hao Di
  • Haishan Ye
  • Xiangyu Chang
  • Guang Dai
  • Ivor W. Tsang

In decentralized optimization, $m$ agents form a network and only communicate with their neighbors, which gives advantages in data ownership, privacy, and scalability. At the same time, decentralized stochastic gradient descent ($\texttt{SGD}$) methods, as popular decentralized algorithms for training large-scale machine learning models, have shown their superiority over centralized counterparts. Distributed stochastic gradient tracking $\texttt{DSGT}$ has been recognized as the popular and state-of-the-art decentralized $\texttt{SGD}$ method due to its proper theoretical guarantees. However, the theoretical analysis of $\texttt{DSGT}$ shows that its iteration complexity is $\tilde{\mathcal{O}} \left(\frac{\bar{\sigma}^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bar{\sigma}}{\mu(1 - \lambda_2(W))^{1/2} C_W \sqrt{\varepsilon} }\right)$, where the doubly stochastic matrix $W$ represents the network topology and $ C_W $ is a parameter that depends on $W$. Thus, it indicates that the convergence property of $\texttt{DSGT}$ is heavily affected by the topology of the communication network. To overcome the weakness of $\texttt{DSGT}$, we resort to the snap-shot gradient tracking skill and propose two novel algorithms, snap-shot $\texttt{DSGT}$ ($\texttt{SS-DSGT}$) and accelerated snap-shot $\texttt{DSGT}$ ($\texttt{ASS-DSGT}$). We further justify that $\texttt{SS-DSGT}$ exhibits a lower iteration complexity compared to $\texttt{DSGT}$ in the general communication network topology. Additionally, $\texttt{ASS-DSGT}$ matches $\texttt{DSGT}$’s iteration complexity $\mathcal{O}\left( \frac{\bar{\sigma}^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bar{\sigma}}{\mu (1 - \lambda_2(W))^{1/2}\sqrt{\varepsilon}} \right)$ under the same conditions as $\texttt{DSGT}$. 
Numerical experiments validate $\texttt{SS-DSGT}$’s superior performance in the general communication network topology and exhibit better practical performance of $\texttt{ASS-DSGT}$ on the specified $W$ compared to $\texttt{DSGT}$.
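To make the gradient-tracking setting concrete, here is a minimal sketch of the standard gradient-tracking recursion with exact (non-stochastic) gradients on three agents. It is neither SS-DSGT nor ASS-DSGT, whose update rules are not given in the abstract. The scalar local objectives, the mixing matrix $W$, the step size, and the iteration count are hypothetical.

```python
def mix(W, v):
    """One communication round: each agent averages its neighbors' values via W."""
    n = len(v)
    return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

# Agent i holds f_i(x) = 0.5 * (x - targets[i])**2; the global minimizer is the mean.
targets = [1.0, 2.0, 6.0]

# Doubly stochastic mixing matrix for a fully connected 3-agent network (assumed).
W = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]

def local_grads(x):
    """Gradient of each agent's local objective at its own iterate."""
    return [xi - ai for xi, ai in zip(x, targets)]

eta = 0.1
x = [0.0, 0.0, 0.0]
g = local_grads(x)
y = list(g)  # tracker starts at the local gradients
for _ in range(300):
    x_new = [wx - eta * yi for wx, yi in zip(mix(W, x), y)]
    g_new = local_grads(x_new)
    # Tracker update: mix, then add the change in the local gradient.
    y = [wy + gn - go for wy, gn, go in zip(mix(W, y), g_new, g)]
    x, g = x_new, g_new
```

Every agent's iterate approaches the minimizer of the average objective, here the mean of `targets`, although each agent only sees its own gradient and its neighbors' values.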

ICML Conference 2024 Conference Paper

Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient

  • Hao Di
  • Haishan Ye
  • Yueling Zhang
  • Xiangyu Chang
  • Guang Dai
  • Ivor W. Tsang

Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods. However, in composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random gradient estimation. To reduce this variance, prior works require estimating all partial derivatives, essentially approximating FO information. This approach demands $\mathcal{O}(d)$ function evaluations ($d$ is the dimension size), which incurs substantial computational costs and is prohibitive in high-dimensional scenarios. This paper proposes the Zeroth-order Proximal Double Variance Reduction ($\texttt{ZPDVR}$) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances. Compared to prior methods, $\texttt{ZPDVR}$ relies solely on random gradient estimates, calls the stochastic zeroth-order oracle (SZO) in expectation $\mathcal{O}(1)$ times per iteration, and achieves the optimal $\mathcal{O}(d(n + \kappa)\log (\frac{1}{\epsilon}))$ SZO query complexity in the strongly convex and smooth setting, where $\kappa$ represents the condition number and $\epsilon$ is the desired accuracy. Empirical results validate $\texttt{ZPDVR}$’s linear convergence and demonstrate its superior performance over other related methods.

NeurIPS Conference 2024 Conference Paper

Flipped Classroom: Aligning Teacher Attention with Student in Generalized Category Discovery

  • Haonan Lin
  • Wenbin An
  • Jiahao Wang
  • Yan Chen
  • Feng Tian
  • Mengmeng Wang
  • Guang Dai
  • Qianying Wang

Recent advancements have shown promise in applying traditional Semi-Supervised Learning strategies to the task of Generalized Category Discovery (GCD). Typically, this involves a teacher-student framework in which the teacher imparts knowledge to the student to classify categories, even in the absence of explicit labels. Nevertheless, GCD presents unique challenges, particularly the absence of priors for new classes, which can lead to the teacher's misguidance and unsynchronized learning with the student, culminating in suboptimal outcomes. In our work, we delve into why traditional teacher-student designs falter in generalized category discovery as compared to their success in closed-world semi-supervised learning. We identify inconsistent pattern learning as the crux of this issue and introduce FlipClass—a method that dynamically updates the teacher to align with the student's attention, instead of maintaining a static teacher reference. Our teacher-attention-update strategy refines the teacher's focus based on student feedback, promoting consistent pattern recognition and synchronized learning across old and new classes. Extensive experiments on a spectrum of benchmarks affirm that FlipClass significantly surpasses contemporary GCD methods, establishing new standards for the field.

JMLR Journal 2024 Journal Article

Learning Discretized Neural Networks under Ricci Flow

  • Jun Chen
  • Hanwen Chen
  • Mengmeng Wang
  • Guang Dai
  • Ivor W. Tsang
  • Yong Liu

In this paper, we study Discretized Neural Networks (DNNs) composed of low-precision weights and activations, which suffer from either infinite or zero gradients due to the non-differentiable discrete function during training. Most training-based DNNs in such scenarios employ the standard Straight-Through Estimator (STE) to approximate the gradient w.r.t. discrete values. However, the use of STE introduces the problem of gradient mismatch, arising from perturbations in the approximated gradient. To address this problem, this paper reveals that this mismatch can be interpreted as a metric perturbation in a Riemannian manifold, viewed through the lens of duality theory. Building on information geometry, we construct the Linearly Nearly Euclidean (LNE) manifold for DNNs, providing a background for addressing perturbations. By introducing a partial differential equation on metrics, i.e., the Ricci flow, we establish the dynamical stability and convergence of the LNE metric with the $L^2$-norm perturbation. In contrast to previous perturbation theories with convergence rates in fractional powers, the metric perturbation under the Ricci flow exhibits exponential decay in the LNE manifold. Experimental results across various datasets demonstrate that our method achieves superior and more stable performance for DNNs compared to other representative training-based methods.
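The Straight-Through Estimator mentioned above can be sketched for a single 1-bit weight. This shows only the baseline STE that the abstract takes as its starting point, not the Ricci flow method. The sign quantizer, squared loss, data point, and learning rate are hypothetical.

```python
def quantize(w):
    """1-bit weight: +1.0 for non-negative w, otherwise -1.0."""
    return 1.0 if w >= 0 else -1.0

def ste_step(w, x, y, lr):
    """One SGD step on (quantize(w) * x - y)**2 using the straight-through estimator."""
    q = quantize(w)
    grad_q = 2.0 * (q * x - y) * x  # exact gradient w.r.t. the quantized weight
    grad_w = grad_q                 # STE: pretend d quantize / d w equals 1
    return w - lr * grad_w

w = 0.3  # latent full-precision weight
for _ in range(5):
    w = ste_step(w, x=1.0, y=-1.0, lr=0.1)
final_q = quantize(w)
```

The true derivative of `quantize` is zero almost everywhere, so without the substitution in `grad_w` the latent weight would never move. That substituted gradient is the approximation behind the gradient mismatch discussed in the abstract.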

AAAI Conference 2024 Conference Paper

Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

  • Zhuohang Dang
  • Minnan Luo
  • Chengyou Jia
  • Guang Dai
  • Xiaojun Chang
  • Jingdong Wang

Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. However, this inevitably includes mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, which overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as classification tasks within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity of selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.

NeurIPS Conference 2024 Conference Paper

OneActor: Consistent Subject Generation via Cluster-Conditioned Guidance

  • Jiahao Wang
  • Caixia Yan
  • Haonan Lin
  • Weizhan Zhang
  • Mengmeng Wang
  • Tieliang Gong
  • Guang Dai
  • Hao Sun

Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature hinders artists from creating consistent images of the same subject. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external restricted data or require expensive tuning of the diffusion model. For this issue, we propose a novel one-shot tuning paradigm, termed OneActor. It efficiently performs consistent subject generation solely driven by prompts via a learned semantic guidance to bypass the laborious backbone tuning. We lead the way to formalize the objective of consistent subject generation from a clustering perspective, and thus design a cluster-conditioned model. To mitigate the overfitting challenge shared by one-shot tuning pipelines, we augment the tuning with auxiliary samples and devise two inference strategies: semantic interpolation and cluster guidance. These techniques are later verified to significantly improve the generation quality. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory subject consistency, superior prompt conformity as well as high image quality. Our method is capable of multi-subject generation and compatible with popular diffusion extensions. Besides, we achieve a $4\times$ faster tuning speed than tuning-based baselines and, if desired, avoid increasing the inference time. Furthermore, our method can be naturally utilized to pre-train a consistent subject generation network from scratch, which will implement this research task into more practical applications. (Project page: https://johnneywang.github.io/OneActor-webpage/)

NeurIPS Conference 2024 Conference Paper

Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing

  • Haonan Lin
  • Yan Chen
  • Jiahao Wang
  • Wenbin An
  • Mengmeng Wang
  • Feng Tian
  • Yong Liu
  • Guang Dai

Text-guided diffusion models have significantly advanced image editing, enabling high-quality and diverse modifications driven by text prompts. However, effective editing requires inverting the source image into a latent space, a process often hindered by prediction errors inherent in DDIM inversion. These errors accumulate during the diffusion process, resulting in inferior content preservation and edit fidelity, especially with conditional inputs. We address these challenges by investigating the primary contributors to error accumulation in DDIM inversion and identify the singularity problem in traditional noise schedules as a key issue. To resolve this, we introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing. This schedule reduces noise prediction errors, enabling more faithful editing that preserves the original content of the source image. Our approach requires no additional retraining and is compatible with various existing editing methods. Experiments across eight editing tasks demonstrate the Logistic Schedule's superior performance in content preservation and edit fidelity compared to traditional noise schedules, highlighting its adaptability and effectiveness. The project page is available at https://lonelvino.github.io/SYE/.

AAAI Conference 2024 Conference Paper

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation

  • Chengyou Jia
  • Minnan Luo
  • Zhuohang Dang
  • Guang Dai
  • Xiaojun Chang
  • Mengmeng Wang
  • Jingdong Wang

Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability.

NeurIPS Conference 2023 Conference Paper

SUBP: Soft Uniform Block Pruning for 1×N Sparse CNNs Multithreading Acceleration

  • Jingyang Xiang
  • Siqi Li
  • Jun Chen
  • Guang Dai
  • Shipeng Bai
  • Yukai Ma
  • Yong Liu

The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1×N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saving by a Block Sparse Row matrix. 2) Excellent performance at a high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1×N sparse weights based on dense pre-trained weights, leading to problems such as expensive training cost and memory access, sub-optimal model quality, and unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel Soft Uniform Block Pruning (SUBP) approach to train a uniform 1×N sparse structured network from scratch. Specifically, our approach repeatedly allows pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process. It not only makes the model less dependent on pre-training and reduces the model redundancy and the risk of pruning important blocks permanently, but also achieves a balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1×N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at https://github.com/JingyangXiang/SUBP.
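
The uniform 1×N pattern described in this abstract can be illustrated with a small sketch. This is not the SUBP training procedure (no regrowth, angular redundancy, or importance sampling); it is a plain magnitude-based mask, and the function name `uniform_1xn_mask`, the L1 block score, and the `keep` parameter are assumptions made for illustration. Each block is N consecutive output channels at one input index, and every group of N output channels keeps the same number of blocks, so each group carries the same workload.

```python
def uniform_1xn_mask(weights, n, keep):
    """Return a 0/1 mask with a uniform 1xN block-sparse pattern.

    weights: list of C_out rows, each a list of C_in floats.
    n:       block height (consecutive output channels per block).
    keep:    blocks kept in every group of n output channels (hypothetical knob).
    """
    c_out, c_in = len(weights), len(weights[0])
    if c_out % n != 0:
        raise ValueError("number of output channels must be divisible by n")
    if not 0 <= keep <= c_in:
        raise ValueError("keep must be between 0 and the number of input channels")

    mask = [[0] * c_in for _ in range(c_out)]
    for g in range(0, c_out, n):
        # Score each block by its L1 norm (an illustrative choice, not SUBP's criterion).
        scores = [sum(abs(weights[g + r][j]) for r in range(n)) for j in range(c_in)]
        # Keep the same number of top-scoring blocks in every group: uniform sparsity.
        kept = sorted(range(c_in), key=lambda j: scores[j], reverse=True)[:keep]
        for j in kept:
            for r in range(n):
                mask[g + r][j] = 1
    return mask


if __name__ == "__main__":
    w = [
        [0.1, 2.0, 0.3],
        [0.2, 1.0, 0.1],
        [5.0, 0.1, 0.2],
        [4.0, 0.3, 0.1],
    ]
    for row in uniform_1xn_mask(w, n=2, keep=1):
        print(row)
```

Because every group keeps exactly `keep` blocks, threads that each process one group do the same amount of work, which is the balance property the abstract contrasts with non-uniform sparsity across output channels.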

JMLR Journal 2012 Journal Article

Coherence Functions with Applications in Large-Margin Classification Methods

  • Zhihua Zhang
  • Dehua Liu
  • Guang Dai
  • Michael I. Jordan

Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs cannot directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression. The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error of the hinge loss. We refer to the use of the coherence function in large-margin classification as "C-learning," and we present efficient coordinate descent algorithms for the training of regularized C-learning models.
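
The bridging behaviour described in this abstract can be checked numerically with a short sketch. The abstract does not give the formula, so the softplus-style form ρ·log(1 + exp((u − z)/ρ)) used below is an assumption, as are the names `coherence`, `rho`, and `u`. Under that assumed form, the function tends to the hinge loss as the temperature ρ goes to zero (with u = 1) and equals the logistic-regression loss when ρ = 1 and u = 0.

```python
import math


def hinge(z: float) -> float:
    """Hinge loss on the margin z = y * f(x)."""
    return max(0.0, 1.0 - z)


def logit_loss(z: float) -> float:
    """Logistic-regression loss log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))


def coherence(z: float, rho: float, u: float = 1.0) -> float:
    """Assumed form: rho * log(1 + exp((u - z) / rho)), with rho > 0."""
    a = (u - z) / rho
    # Numerically stable softplus: log(1 + e^a) = max(a, 0) + log1p(exp(-|a|)).
    return rho * (max(a, 0.0) + math.log1p(math.exp(-abs(a))))


if __name__ == "__main__":
    margins = [z / 10 for z in range(-30, 31)]
    for rho in (1.0, 0.1, 0.01):
        gap = max(coherence(z, rho) - hinge(z) for z in margins)
        print(f"rho={rho}: largest gap to hinge = {gap:.4f}")
```

With this form the gap to the hinge loss is at most ρ·log 2 (attained at z = u), so it shrinks linearly with the temperature while the function stays smooth for any ρ > 0.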

JMLR Journal 2011 Journal Article

Bayesian Generalized Kernel Mixed Models

  • Zhihua Zhang
  • Guang Dai
  • Michael I. Jordan

We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman's g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Loève expansion of a Gaussian process (GP). Thus, our sparse modeling framework leads to a flexible approximation method for GPs.

JMLR Journal 2010 Journal Article

Regularized Discriminant Analysis, Ridge Regression and Beyond

  • Zhihua Zhang
  • Guang Dai
  • Congfu Xu
  • Michael I. Jordan

Fisher linear discriminant analysis (FDA) and its kernel extension, kernel discriminant analysis (KDA), are well-known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator.

NeurIPS Conference 2009 Conference Paper

Optimal Scoring for Unsupervised Learning

  • Zhihua Zhang
  • Guang Dai

We are often interested in casting classification and clustering problems in a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering (ODC). We relate our algorithm to existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Thus, our work shows that optimal scoring provides a new approach to the implementation of unsupervised learning. This approach facilitates the development of new unsupervised learning algorithms.

IJCAI Conference 2007 Conference Paper

  • Guang Dai
  • Dit-Yan Yeung

Kernel discriminant analysis (KDA) is one of the most effective nonlinear techniques for dimensionality reduction and feature extraction. It can be applied to a wide range of applications involving high-dimensional data, including images, gene expressions, and text data. This paper develops a new algorithm to further improve the overall performance of KDA by effectively integrating the boosting and KDA techniques. The proposed method, called boosting kernel discriminant analysis (BKDA), possesses several appealing properties. First, like all kernel methods, it handles nonlinearity in a disciplined manner that is also computationally attractive; second, by introducing pairwise class discriminant information into the discriminant criterion and simultaneously employing boosting to robustly adjust the information, it further improves the classification accuracy; third, by calculating the significant discriminant information in the null space of the within-class scatter operator, it also effectively deals with the small sample size problem which is widely encountered in real-world applications for KDA; fourth, by taking advantage of the boosting and KDA techniques, it constitutes a strong ensemble-based KDA framework. Experimental results on gene expression data demonstrate the promising performance of the proposed methodology.

IJCAI Conference 2007 Conference Paper

  • Dit-Yan Yeung
  • Hong Chang
  • Guang Dai

In recent years, metric learning in the semi-supervised setting has attracted considerable research interest. One type of semi-supervised metric learning utilizes supervisory information in the form of pairwise similarity or dissimilarity constraints. However, most methods proposed so far are either limited to linear metric learning or unable to scale up well with the data set size. In this paper, we propose a nonlinear metric learning method based on the kernel approach. By applying low-rank approximation to the kernel matrix, our method can handle significantly larger data sets. Moreover, our low-rank approximation scheme can naturally lead to out-of-sample generalization. Experiments performed on both artificial and real-world data show very promising results.

AAAI Conference 2006 Conference Paper

Tensor Embedding Methods

  • Guang Dai

Over the past few years, some embedding methods have been proposed for feature extraction and dimensionality reduction in various machine learning and pattern classification tasks. Among the methods proposed are Neighborhood Preserving Embedding (NPE), Locality Preserving Projection (LPP) and Local Discriminant Embedding (LDE), which have been used in such applications as face recognition and image/video retrieval. However, although the data in these applications are more naturally represented as higher-order tensors, the embedding methods can only work with vectorized data representations, which may not capture well some useful information in the original data. Moreover, high-dimensional vectorized representations also suffer from the curse of dimensionality and the high computational demand. In this paper, we propose some novel tensor embedding methods which, unlike previous methods, take data directly in the form of tensors of arbitrary order as input. These methods allow the relationships between dimensions of a tensor representation to be efficiently characterized. Moreover, they also allow the intrinsic local geometric and topological properties of the manifold embedded in a tensor space to be naturally estimated. Furthermore, they do not suffer from the curse of dimensionality and the high computational demand. We demonstrate the effectiveness of the proposed tensor embedding methods on a face recognition application and compare them with some previous methods. Extensive experiments show that our methods are not only more effective but also more efficient.