Arrow Research

Author name cluster

Yimu Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers (10)

NeurIPS 2025 · Conference Paper

HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

  • Yimu Wang
  • Mozhgan Nasr Azadani
  • Sean Sedwards
  • Krzysztof Czarnecki

Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and to switch between teacher-specific knowledge, we propose teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router, rather than a fixed set of adapters shared across all teachers. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to adaptively emphasize the most informative tokens from each teacher. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII over popular open-source VLMs.
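
The routed, teacher-specific LoRA design described above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed shapes, with a simple softmax router; the names (LoRAAdapter, RoutedTeacherAdapters, rank) are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAAdapter(nn.Module):
    """Low-rank residual adapter: x -> x + B(A(x)) with rank r << dim."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # A: project down
        self.up = nn.Linear(rank, dim, bias=False)    # B: project back up
        nn.init.zeros_(self.up.weight)  # zero-init so training starts at identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))

class RoutedTeacherAdapters(nn.Module):
    """One LoRA adapter per visual teacher, mixed by a learned router."""
    def __init__(self, dim: int, num_teachers: int, rank: int = 8):
        super().__init__()
        self.adapters = nn.ModuleList(LoRAAdapter(dim, rank) for _ in range(num_teachers))
        self.router = nn.Linear(dim, num_teachers)  # one score per teacher

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); route on the mean-pooled representation
        weights = F.softmax(self.router(tokens.mean(dim=1)), dim=-1)   # (batch, teachers)
        outs = torch.stack([a(tokens) for a in self.adapters], dim=1)  # (batch, teachers, seq, dim)
        return torch.einsum("bt,btsd->bsd", weights, outs)
```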

JMLR 2025 · Journal Article

Lexicographic Lipschitz Bandits: New Algorithms and a Lower Bound

  • Bo Xue
  • Ji Cheng
  • Fei Liu
  • Yimu Wang
  • Lijun Zhang
  • Qingfu Zhang

This paper studies a multiobjective bandit problem under lexicographic ordering, wherein the learner aims to maximize $m$ objectives, each with different levels of importance. First, we introduce the local trade-off, $\lambda_*$, which depicts the trade-off between different objectives. For the case when an upper bound of $\lambda_*$ is known, i.e., $\lambda\geq\lambda_*$, we develop an algorithm that achieves a general regret bound of $\widetilde{O}(\Lambda^i(\lambda)T^{(d_z^i+1)/(d_z^i+2)})$ for the $i$-th objective, where $i\in\{1,2,\ldots,m\}$, $\Lambda^i(\lambda)=1+\lambda+\cdots+\lambda^{i-1}$, $d_z^i$ is the zooming dimension for the $i$-th objective, and $T$ is the time horizon. Next, we provide a matching lower bound for the lexicographic Lipschitz bandit problem, proving that our algorithm is optimal in terms of $\lambda_*$ and $T$. Finally, for the case where $m=2$, we remove the dependence on the knowledge about $\lambda_*$, albeit at the cost of increasing the regret bound to $\widetilde{O}(\Lambda^i(\lambda_*)T^{(3d_z^i+4)/(3d_z^i+6)})$, which remains optimal in terms of $\lambda_*$. Compared to existing work on lexicographic multi-armed bandits, our approach improves the current regret bound of $\widetilde{O}(T^{2/3})$ and extends the number of arms to infinity. Numerical experiments confirm the effectiveness of our algorithms.
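
A quick check on the bound's shape: the factor $\Lambda^i(\lambda)$ defined above is a finite geometric sum, so for the first objective it equals 1 and the bound collapses to the familiar zooming-dimension rate for single-objective Lipschitz bandits (writing $R_T^1$ for the regret on the first objective, a notation introduced here for illustration).

```latex
\Lambda^i(\lambda) = 1 + \lambda + \cdots + \lambda^{i-1}
= \begin{cases} i, & \lambda = 1,\\[2pt] \dfrac{\lambda^{i}-1}{\lambda-1}, & \lambda \neq 1, \end{cases}
\qquad
\Lambda^1(\lambda) = 1
\;\Longrightarrow\;
R_T^1 = \widetilde{O}\!\left(T^{(d_z^1+1)/(d_z^1+2)}\right).
```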

TMLR 2025 · Journal Article

Rethinking Spectral Augmentation for Contrast-based Graph Self-Supervised Learning

  • Xiangru Jian
  • Xinjian Zhao
  • Wei Pang
  • Chaolong Ying
  • Yimu Wang
  • Yaoyao Xu
  • Tianshu Yu

The recent surge in contrast-based graph self-supervised learning has prominently featured an intensified exploration of spectral cues. Spectral augmentation, which involves modifying a graph's spectral properties such as eigenvalues or eigenvectors, is widely believed to enhance model performance. However, an intriguing paradox emerges, as methods grounded in seemingly conflicting assumptions regarding the spectral domain demonstrate notable enhancements in learning performance. Through extensive empirical studies, we find that simple edge perturbations - random edge dropping for node-level and random edge adding for graph-level self-supervised learning - consistently yield comparable or superior performance while being significantly more computationally efficient. This suggests that the computational overhead of sophisticated spectral augmentations may not justify their practical benefits. Our theoretical analysis of the InfoNCE loss bounds for shallow GNNs further supports this observation. The proposed insights represent a significant leap forward in the field, potentially refining the understanding and implementation of graph self-supervised learning.
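
The simple perturbations the study finds competitive are easy to reproduce. A minimal sketch, assuming an edge list in (2, E) COO format as used by common graph libraries; this is illustrative code, not the paper's implementation.

```python
import torch

def drop_edges(edge_index: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Random edge dropping (the node-level augmentation): keep each edge w.p. 1 - p.

    edge_index: (2, E) COO edge list.
    """
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

def add_edges(edge_index: torch.Tensor, num_nodes: int, p: float = 0.2) -> torch.Tensor:
    """Random edge adding (the graph-level augmentation): append ~p * E uniform edges."""
    num_new = int(p * edge_index.size(1))
    new_edges = torch.randint(0, num_nodes, (2, num_new))
    return torch.cat([edge_index, new_edges], dim=1)
```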

TMLR 2025 · Journal Article

Survey of Video Diffusion Models: Foundations, Implementations, and Applications

  • Yimu Wang
  • Xuye Liu
  • Wei Pang
  • Li Ma
  • Shuai Yuan
  • Paul Debevec
  • Ning Yu

Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional approaches based on generative adversarial networks. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c), which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more up-to-date, and more fine-grained perspective on diffusion-based approaches, with a dedicated section on evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field.

AAAI 2024 · Conference Paper

Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains

  • Yimu Wang
  • Yihan Wu
  • Hongyang Zhang

We show a hardness result for the number of training domains required to achieve a small population error in the test domain. Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence that the out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and random guessing on domain generalization benchmarks such as DomainBed. In this work, we analyze the cause of this phenomenon and attribute the lost domain generalization to the lack of training domains. We show, in a minimax lower bound fashion, that any learning algorithm that outputs a classifier with an ε excess error over the Bayes optimal classifier requires at least poly(1/ε) training domains, even when the number of training data sampled from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy increases monotonically with the number of training domains. Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms on datasets with a sufficient number of training domains.

AAAI 2024 · Conference Paper

Multiobjective Lipschitz Bandits under Lexicographic Ordering

  • Bo Xue
  • Ji Cheng
  • Fei Liu
  • Yimu Wang
  • Qingfu Zhang

This paper studies the multiobjective bandit problem under lexicographic ordering, wherein the learner aims to maximize $m$ objectives hierarchically. The only existing algorithm for this problem considers the multi-armed bandit model, and its regret bound is $O((KT)^{2/3})$ under a metric called priority-based regret. However, this bound is suboptimal, as the lower bound for single-objective multi-armed bandits is $\Omega(K\log T)$. Moreover, this bound becomes vacuous when the number of arms $K$ is infinite. To address these limitations, we investigate the multiobjective Lipschitz bandit model, which allows for an infinite arm set. Utilizing a newly designed multi-stage decision-making strategy, we develop an improved algorithm that achieves a general regret bound of $O(T^{(d_z^i+1)/(d_z^i+2)})$ for the $i$-th objective, where $d_z^i$ is the zooming dimension for the $i$-th objective and $i\in\{1,2,\ldots,m\}$. This bound matches the lower bound of the single-objective Lipschitz bandit problem in terms of $T$, indicating that our algorithm is almost optimal. Numerical experiments confirm the effectiveness of our algorithm.
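
To see the improvement concretely, evaluating the new exponent at a small zooming dimension already improves on the prior bound's dependence on $T$, while the rate degrades gracefully as $d_z^i$ grows:

```latex
% d_z^i = 0: the new rate beats the prior (KT)^{2/3} bound in its T-dependence
T^{(d_z^i+1)/(d_z^i+2)}\Big|_{d_z^i=0} = T^{1/2} \ll T^{2/3},
\qquad
\frac{d_z^i+1}{d_z^i+2} \nearrow 1 \ \text{as}\ d_z^i \to \infty .
```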

NeurIPS 2023 · Conference Paper

Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards

  • Bo Xue
  • Yimu Wang
  • Yuanyu Wan
  • Jinfeng Yi
  • Lijun Zhang

This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon\in (0, 1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well-suited for many real-world scenarios, such as financial markets and web-advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon=1$. Numerical experimental results confirm the merits of our algorithms.
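
The truncation idea is easy to illustrate with a toy estimator. The threshold scaling below is a standard order-of-magnitude choice under a $(1+\epsilon)$-th moment bound, not the paper's exact tuning, and the function name is hypothetical.

```python
import torch

def truncated_mean(rewards: torch.Tensor, eps: float, t: int) -> torch.Tensor:
    """Clip rewards whose magnitude exceeds a growing threshold, then average.

    With only a bounded (1+eps)-th moment, a threshold of order t**(1/(1+eps))
    balances the bias introduced by clipping against the variance of the mean.
    """
    b = float(t) ** (1.0 / (1.0 + eps))  # truncation level (illustrative scaling)
    return rewards.clamp(min=-b, max=b).mean()
```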

ICLR 2023 · Conference Paper

Multimodal Federated Learning via Contrastive Representation Ensemble

  • Qiying Yu
  • Yang Liu 0165
  • Yimu Wang
  • Ke Xu
  • Jingjing Liu

With the increasing amount of multimedia data on modern mobile systems and IoT infrastructures, harnessing these rich multimodal data without breaching user privacy becomes a critical issue. Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning. However, existing FL methods extended to multimodal data all rely on model aggregation at the single-modality level, which restrains the server and clients to identical model architectures for each modality. This limits the global model in terms of both model complexity and data capacity, not to mention task diversity. In this work, we propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL), a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities, while only communicating knowledge on a public dataset. To achieve better multimodal representation fusion, we design a global-local cross-modal ensemble strategy to aggregate client representations. To mitigate local model drift caused by two unprecedented heterogeneous factors stemming from multimodal discrepancy (modality gap and task gap), we further propose inter-modal and intra-modal contrasts to regularize local training, which complement the information of the absent modality for uni-modal clients and steer local clients toward the global consensus. Thorough evaluations and ablation studies on image-text retrieval and visual question answering tasks showcase the superiority of CreamFL over state-of-the-art FL methods and its practical value.
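
The representation-level contrast at the core of this design can be sketched as an InfoNCE-style loss over public-dataset representations, pulling each local representation toward the global representation of the same sample. Names and shapes are illustrative assumptions, not CreamFL's released code.

```python
import torch
import torch.nn.functional as F

def representation_contrast(local_reps: torch.Tensor,
                            global_reps: torch.Tensor,
                            tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrast between per-sample local and global representations.

    local_reps, global_reps: (N, d) representations of the same N public samples.
    """
    local_reps = F.normalize(local_reps, dim=-1)
    global_reps = F.normalize(global_reps, dim=-1)
    logits = local_reps @ global_reps.t() / tau                   # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```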

IJCAI 2021 · Conference Paper

Deep Unified Cross-Modality Hashing by Pairwise Data Alignment

  • Yimu Wang
  • Bo Xue
  • Quan Cheng
  • Yuhui Chen
  • Lijun Zhang

With the increasing amount of multimedia data, cross-modality hashing has made great progress, as it achieves sub-linear search time and low memory usage. However, due to the huge discrepancy between modalities, most existing cross-modality hashing methods cannot learn unified hash codes and functions for both modalities at the same time. The gap between separated hash codes and functions further leads to poor search performance. In this paper, to address these issues, we propose a novel end-to-end Deep Unified Cross-Modality Hashing method named DUCMH, which jointly learns unified hash codes and unified hash functions via alternate learning and data alignment. Specifically, to reduce the discrepancy between the image and text modalities, DUCMH utilizes data alignment to learn an auxiliary image-to-text mapping under the supervision of image-text pairs. For text data, hash codes are obtained by the unified hash functions, while for image data, DUCMH first maps images to texts by the auxiliary mapping and then uses the mapped texts to obtain hash codes. DUCMH utilizes alternate learning to update the unified hash codes and functions. Extensive experiments on three representative image-text datasets demonstrate the superiority of DUCMH over several state-of-the-art cross-modality hashing methods.
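
The pipeline described above, where images are first mapped into the text space and then share a single hash function, can be sketched as follows; module and method names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UnifiedHasher(nn.Module):
    """Unified hash function on the text side; images reach it via an
    auxiliary image-to-text mapping, so both modalities share one code space."""
    def __init__(self, text_dim: int, image_dim: int, n_bits: int = 64):
        super().__init__()
        self.img2txt = nn.Linear(image_dim, text_dim)  # auxiliary image-to-text mapping
        self.hash_fn = nn.Linear(text_dim, n_bits)     # unified hash function

    def hash_text(self, t: torch.Tensor) -> torch.Tensor:
        return torch.sign(self.hash_fn(t))  # binary codes in {-1, +1}

    def hash_image(self, x: torch.Tensor) -> torch.Tensor:
        return self.hash_text(self.img2txt(x))  # map to text space, then hash
```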

IJCAI 2020 · Conference Paper

Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

  • Bo Xue
  • Guanghui Wang
  • Yimu Wang
  • Lijun Zhang

In this paper, we study the problem of stochastic linear bandits with finite action sets. Most existing work assumes the payoffs are bounded or sub-Gaussian, which may be violated in some scenarios, such as financial markets. To address this issue, we analyze linear bandits with heavy-tailed payoffs, where the payoffs admit finite $(1+\epsilon)$-th moments for some $\epsilon\in(0, 1]$. Through median of means and dynamic truncation, we propose two novel algorithms that enjoy a sublinear regret bound of $\widetilde{O}(d^{1/2}T^{1/(1+\epsilon)})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Meanwhile, we provide an $\Omega(d^{\epsilon/(1+\epsilon)}T^{1/(1+\epsilon)})$ lower bound, which implies that our upper bound matches the lower bound up to polylogarithmic factors in $d$ and $T$ when $\epsilon=1$. Finally, we conduct numerical experiments to demonstrate the effectiveness of our algorithms, and the empirical results strongly support our theoretical guarantees.
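
The median-of-means estimator used here is simple to sketch. This toy version assumes a 1-D tensor of i.i.d. rewards and is not the paper's exact construction.

```python
import torch

def median_of_means(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """Split n samples into k groups, average each group, return the median
    of the group means. Robust when only the (1+eps)-th moment is finite."""
    n = (rewards.numel() // k) * k     # drop the remainder so groups are equal-sized
    groups = rewards[:n].view(k, -1)   # (k, n // k)
    return groups.mean(dim=1).median()
```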