Arrow Research search

Author name cluster

Yifan Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

AAAI Conference 2025 Conference Paper

Adaptive Multi-Scale Decomposition Framework for Time Series Forecasting

  • Yifan Hu
  • Peiyuan Liu
  • Peng Zhu
  • Dawei Cheng
  • Tao Dai

Transformer-based and MLP-based methods have emerged as leading approaches in time series forecasting (TSF). However, real-world time series often show different patterns at different scales, and future changes are shaped by the interplay of these overlapping scales, requiring high-capacity models. While Transformer-based methods excel at capturing long-range dependencies, they suffer from high computational complexity and tend to overfit. Conversely, MLP-based methods offer computational efficiency and adeptness in modeling temporal dynamics, but they struggle to capture temporal patterns at complex scales effectively. Based on the observation of a multi-scale entanglement effect in time series, we propose a novel MLP-based Adaptive Multi-Scale Decomposition (AMD) framework for TSF. Our framework decomposes time series into distinct temporal patterns at multiple scales, leveraging the Multi-Scale Decomposable Mixing (MDM) block to dissect and aggregate these patterns. Complemented by the Dual Dependency Interaction (DDI) block and the Adaptive Multi-predictor Synthesis (AMS) block, our approach effectively models both temporal and channel dependencies and utilizes autocorrelation to refine multi-scale data integration. Comprehensive experiments demonstrate that our AMD framework not only overcomes the limitations of existing methods but also consistently achieves state-of-the-art performance across various datasets.
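
The multi-scale decomposition idea can be sketched with plain moving averages: each pass peels off a progressively coarser trend and keeps the residual as the pattern at that scale. This is only an illustrative sketch of the general technique, not the paper's MDM block; `multi_scale_decompose` and its kernel sizes are hypothetical choices.

```python
import numpy as np

def multi_scale_decompose(x, kernel_sizes=(4, 8, 16)):
    """Decompose a 1-D series into patterns at several temporal scales.

    Illustrative only: successive moving averages peel off progressively
    coarser trends; the sum of all components reconstructs the input.
    """
    components = []
    residual = x.astype(float)
    for k in kernel_sizes:
        # moving average with edge padding keeps the length unchanged
        padded = np.pad(residual, (k // 2, k - 1 - k // 2), mode="edge")
        trend = np.convolve(padded, np.ones(k) / k, mode="valid")
        components.append(residual - trend)  # detail at this scale
        residual = trend
    components.append(residual)  # coarsest remaining trend
    return components
```

Summing the returned components reconstructs the input exactly, which is the property any scheme that dissects and re-aggregates scales relies on.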

ICML Conference 2025 Conference Paper

Efficiently Serving Large Multimodal Models Using EPD Disaggregation

  • Gursimran Singh
  • Xinglu Wang
  • Yifan Hu
  • Timothy Tin Long Yu
  • Linzi Xing
  • Wei Jiang
  • Zhefeng Wang
  • Xiaolong Bai

Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15$\times$ lower peak memory utilization), batch sizes (up to 22$\times$ larger), 10$\times$ more images per request, and 2.2$\times$ larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90–100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.

ICML Conference 2025 Conference Paper

MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

  • Tianze Wang
  • Dongnan Gui
  • Yifan Hu
  • Shuhang Lin
  • Linjun Zhang

Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a single reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one, with the weight of each policy computed via batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
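
The log-linear combination at the heart of this kind of post-processing has a compact per-token form: a weighted sum of policy logits is, after softmax, a normalized geometric mixture of the policies (the per-policy normalizers fold into the softmax constant). A minimal sketch, illustrative only; `log_linear_mix` is a hypothetical name, and in MPO the weights would come from the mirror-descent procedure:

```python
import numpy as np

def log_linear_mix(logit_list, weights):
    """Log-linearly combine per-token policies.

    Each policy pi_i is softmax(logits_i); the mixture
    pi_mix ∝ prod_i pi_i^{w_i} is the softmax of the weighted logit sum.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must lie on the simplex"
    mixed_logits = sum(w * l for w, l in zip(weights, logit_list))
    z = mixed_logits - mixed_logits.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

With weights (1, 0) the mixture collapses to the first policy; equal weights give the normalized geometric mean of the two.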

AAAI Conference 2025 Conference Paper

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

  • Rui Liu
  • Shuwei He
  • Yifan Hu
  • Haizhou Li

Visual Text-to-Speech (VTTS) aims to take an environmental image as the prompt to synthesize reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, local and depth image information are crucial for understanding the spatial environment, which previous works have ignored. To address these issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal component takes both the RGB and depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale component models local and global spatial knowledge simultaneously. Specifically, we first split the RGB and depth images into patches and adopt Gemini-generated environment captions to guide local spatial understanding. After that, the multi-modal and multi-scale features are integrated by local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation.

JBHI Journal 2025 Journal Article

SiamFSA: Optical Flow-driven Structural-aware Siamese Network for Ultrasound Videos Landmark Tracking

  • Guang-Quan Zhou
  • Yifan Hu
  • Qinghan Yang
  • Ruo-Li Wang
  • Yang Chen

Accurate anatomical landmark tracking within ultrasound video is a crucial analysis task with many clinical applications. However, the non-rigid deformations caused by motion and probe extrusion lead to intra-object semantic and scale variations, resulting in inaccurate landmark tracking. Additionally, the inevitable intrinsic speckle noise and imaging artifacts exacerbate the dissimilarity of targets, further complicating the landmark tracking. In this study, we propose a novel Optical Flow-driven Structural-aware Siamese Network, SiamFSA, for landmark tracking in continuous ultrasound images. This approach implicitly incorporates structure and motion priors into the Siamese model to compensate for the influence of intra-object variations caused by protean tissue deformation. Specifically, we imposed an auxiliary fine-grained heatmap regression branch into the Siamese-based backbone to couple anatomical landmarks with position-sensitive templates for better localization. Moreover, we propose a structural drift correction mechanism to align relative positions across continuous frames guided by the optical flow from the template. This mechanism ensures intra-object semantic similarity of anatomical deformations and facilitates better salient feature extraction under landmark-centered alignment constraints. Meanwhile, a structural prior affine transformation module is designed to optimize the template view for landmark tracking by exploring intra-object scale variations during motion, thereby enhancing foreground semantic perception. Extensive experiments on both public and in-house ultrasound datasets demonstrate that our SiamFSA handles protean anatomical landmark tracking more effectively than other state-of-the-art methods, showing its potential in clinical analysis tasks.

NeurIPS Conference 2025 Conference Paper

Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2

  • Ziqi Zhou
  • Yifan Hu
  • Yufei Song
  • Zijing Li
  • Shengshan Hu
  • Leo Yu Zhang
  • Dezhong Yao
  • Long Zheng

Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.

NeurIPS Conference 2024 Conference Paper

Contextual Bilevel Reinforcement Learning for Incentive Alignment

  • Vinzenz Thoma
  • Barna Pasztor
  • Andreas Krause
  • Giorgia Ramponi
  • Yifan Hu

The optimal policy in various real-world strategic decision-making problems depends both on the environmental configuration and on exogenous events. For these settings, we introduce Contextual Bilevel Reinforcement Learning (CB-RL), a stochastic bilevel decision-making model, where the lower level consists of solving a contextual Markov Decision Process (CMDP). CB-RL can be viewed as a Stackelberg game in which the leader and a random context beyond the leader’s control together decide the setup of many MDPs, to which potentially multiple followers best respond. This framework extends beyond traditional bilevel optimization and finds relevance in diverse fields such as RLHF, tax design, reward shaping, contract theory, and mechanism design. We propose a stochastic Hyper Policy Gradient Descent (HPGD) algorithm to solve CB-RL and demonstrate its convergence. Notably, HPGD uses stochastic hypergradient estimates based on observations of the followers’ trajectories. It therefore allows followers to use any training procedure and the leader to be agnostic of the specific algorithm, which aligns with various real-world scenarios. We further consider the setting in which the leader can influence the training of followers and propose an accelerated algorithm. We empirically demonstrate the performance of our algorithm for reward shaping and tax design.

AAAI Conference 2024 Conference Paper

Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling

  • Rui Liu
  • Yifan Hu
  • Yi Ren
  • Xiang Yin
  • Haizhou Li

Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognising the significance of the CSS task, prior studies have not thoroughly investigated the problem of emotional expressiveness, due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn emotion cues from the context; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the target utterance. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity, and annotate additional emotional information on an existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. These evaluations also underscore the importance of comprehensive emotional annotations. Code and audio samples can be found at: https://github.com/walker-hyf/ECSS.

NeurIPS Conference 2024 Conference Paper

Group Robust Preference Optimization in Reward-free RLHF

  • Shyam Sundhar Ramesh
  • Yifan Hu
  • Iason Chaimalas
  • Viraj Mehta
  • Pier Giuseppe Sessa
  • Haitham Bou Ammar
  • Ilija Bogunovic

Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, and are thus not robust to the unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
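
The adaptive group weighting can be pictured as multiplicative weights on the probability simplex, a standard instance of mirror descent with the entropy mirror map. This is an illustrative sketch of that generic update, not GRPO's exact procedure; the function name and step size are hypothetical:

```python
import numpy as np

def update_group_weights(weights, group_losses, step_size=0.1):
    """One mirror-descent (multiplicative-weights) step on the simplex.

    Groups with larger current loss receive exponentially more weight,
    so a subsequent policy update prioritizes the worst-off groups.
    """
    w = np.asarray(weights, float) * np.exp(step_size * np.asarray(group_losses, float))
    return w / w.sum()  # project back onto the simplex by renormalizing
```

Iterated against the current group losses, the weights concentrate on whichever groups are doing worst, which is what drives a worst-case (max-min) objective.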

NeurIPS Conference 2024 Conference Paper

Stochastic Optimization Algorithms for Instrumental Variable Regression with Streaming Data

  • Xuxing Chen
  • Abhishek Roy
  • Yifan Hu
  • Krishnakumar Balasubramanian

We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms neither require matrix inversions nor mini-batches, thereby providing a fully online approach for performing instrumental variable regression with streaming data. When the true model is linear, we derive rates of convergence in expectation that are of order $\mathcal{O}(\log T/T)$ and $\mathcal{O}(1/T^{1-\epsilon})$ for any $\epsilon>0$, under the availability of two-sample and one-sample oracles, respectively. Importantly, under the availability of the two-sample oracle, the aforementioned rate is actually agnostic to the relationship between the confounder and the instrumental variable, demonstrating the flexibility of the proposed approach in alleviating the need for the explicit model assumptions required in recent works based on reformulating the problem as min-max optimization problems. Experimental validation is provided to demonstrate the advantages of the proposed algorithms over classical approaches such as the 2SLS method.
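
Under a two-sample oracle, the key trick is that two independent draws sharing the same instrument value decouple the regressor from the residual. A schematic single update under that setup (hypothetical function and variable names, not the paper's exact algorithm):

```python
import numpy as np

def two_sample_iv_step(theta, x1, y2, x2, lr):
    """One streaming update for least-squares IV regression.

    Per instrument value z, the loss is (E[y|z] - E[x|z]^T theta)^2.
    Given two independent samples (x1, .) and (x2, y2) drawn for the
    same z, x1 and the residual (x2 @ theta - y2) are conditionally
    independent given z, so their product is an unbiased gradient
    estimate (up to a constant factor) -- no matrix inversion needed.
    """
    grad = x1 * (x2 @ theta - y2)
    return theta - lr * grad
```

Iterating this over a stream of paired samples drives theta toward the IV solution even when ordinary least squares would be biased by confounding.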

NeurIPS Conference 2023 Conference Paper

Contextual Stochastic Bilevel Optimization

  • Yifan Hu
  • Jie Wang
  • Yao Xie
  • Andreas Krause
  • Daniel Kuhn

We introduce contextual stochastic bilevel optimization (CSBO) -- a stochastic bilevel optimization framework with the lower-level problem minimizing an expectation conditioned on some contextual information and the upper-level decision variable. This framework extends classical stochastic bilevel optimization to settings where the lower-level decision maker responds optimally not only to the decision of the upper-level decision maker but also to some side information, and where there are multiple or even infinitely many followers. It captures important applications such as meta-learning, personalized federated learning, end-to-end learning, and Wasserstein distributionally robust optimization with side information (WDRO-SI). Due to the presence of contextual information, existing single-loop methods for classical stochastic bilevel optimization are unable to converge. To overcome this challenge, we introduce an efficient double-loop gradient method based on the Multilevel Monte-Carlo (MLMC) technique and establish its sample and computational complexities. When specialized to stochastic nonconvex optimization, our method matches existing lower bounds. For meta-learning, the complexity of our method does not depend on the number of tasks. Numerical experiments further validate our theoretical results.

AAAI Conference 2021 Conference Paper

Going Deeper With Directly-Trained Larger Spiking Neural Networks

  • Hanle Zheng
  • YuJie Wu
  • Lei Deng
  • Yifan Hu
  • Guoqi Li

Spiking neural networks (SNNs) are promising for bio-plausible coding of spatio-temporal information and event-driven signal processing, which is well suited for energy-efficient implementation in neuromorphic hardware. However, the unique working mode of SNNs makes them more difficult to train than traditional networks. Currently, there are two main routes to explore the training of deep SNNs with high performance. The first is to convert a pre-trained ANN model to its SNN version, which usually requires a long coding window for convergence and cannot exploit the spatio-temporal features during training for solving temporal tasks. The other is to directly train SNNs in the spatio-temporal domain. But due to the binary spike activity of the firing function and the problem of gradient vanishing or explosion, current methods are restricted to shallow architectures and thus struggle to harness large-scale datasets (e.g., ImageNet). To this end, we propose a threshold-dependent batch normalization (tdBN) method based on the emerging spatio-temporal backpropagation, termed “STBP-tdBN”, enabling direct training of a very deep SNN and the efficient implementation of its inference on neuromorphic hardware. With the proposed method and elaborated shortcut connections, we significantly extend directly-trained SNNs from a shallow structure (fewer than 10 layers) to a very deep structure (50 layers). Furthermore, we theoretically analyze the effectiveness of our method based on “Block Dynamical Isometry” theory. Finally, we report superior accuracy results including 93.15% on CIFAR-10, 67.8% on DVS-CIFAR10, and 67.05% on ImageNet with very few timesteps. To the best of our knowledge, this is the first work to explore directly-trained deep SNNs with high performance on ImageNet. We believe this work will pave the way for fully exploiting the advantages of SNNs and attract more researchers to contribute to this field.
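
The core of threshold-dependent batch normalization is ordinary batch normalization computed jointly over the batch and time dimensions, rescaled relative to the firing threshold. A schematic version under those assumptions, with hypothetical parameter names:

```python
import numpy as np

def tdbn(x, v_th, alpha=1.0, gamma=1.0, beta=0.0, eps=1e-5):
    """Threshold-dependent batch normalization (schematic sketch).

    Pre-activations gathered over batch and time steps are normalized,
    then rescaled relative to the firing threshold v_th, so inputs to
    the spiking neurons stay on a scale commensurate with the threshold
    regardless of network depth.  x has shape (time, batch, features).
    """
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return alpha * v_th * gamma * (x - mean) / np.sqrt(var + eps) + beta
```

With gamma = 1 and beta = 0, the normalized pre-activations have zero mean and standard deviation approximately alpha * v_th per feature.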

NeurIPS Conference 2021 Conference Paper

On the Bias-Variance-Cost Tradeoff of Stochastic Optimization

  • Yifan Hu
  • Xin Chen
  • Niao He

We consider stochastic optimization when one only has access to biased stochastic oracles of the objective, and obtaining stochastic gradients with low biases comes at high costs. This setting captures a variety of optimization paradigms widely used in machine learning, such as conditional stochastic optimization, bilevel optimization, and distributionally robust optimization. We examine a family of multi-level Monte Carlo (MLMC) gradient methods that exploit a delicate trade-off among the bias, the variance, and the oracle cost. We provide a systematic study of their convergences and total computation complexities for strongly convex, convex, and nonconvex objectives, and demonstrate their superiority over the naive biased stochastic gradient method. Moreover, when applied to conditional stochastic optimization, the MLMC gradient methods significantly improve the best-known sample complexity in the literature.
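
A randomized MLMC gradient estimator can be sketched generically: given an oracle whose bias shrinks as its inner sample size grows, sampling a random level and reweighting the telescoping difference yields an estimator whose expectation matches the finest level at a much lower expected cost. Illustrative only; the names and the geometric level distribution are hypothetical choices:

```python
import numpy as np

def mlmc_estimate(oracle, max_level, rng):
    """Randomized multilevel Monte Carlo estimator (schematic).

    oracle(n) returns an estimate whose bias shrinks as the number of
    inner samples n = 2^level grows.  The telescoping correction at a
    randomly drawn level, reweighted by its sampling probability, makes
    the expectation equal that of the finest level while the expected
    cost stays close to the coarsest level's.
    """
    probs = np.array([2.0 ** -l for l in range(max_level + 1)])
    probs /= probs.sum()
    level = rng.choice(max_level + 1, p=probs)
    est = oracle(1)  # coarsest level, always included
    if level > 0:
        est += (oracle(2 ** level) - oracle(2 ** (level - 1))) / probs[level]
    return est
```

For a tight variance the two oracle calls at adjacent levels should share randomness (be coupled); this sketch only demonstrates the debiasing mechanics of the level-randomized telescoping sum.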

NeurIPS Conference 2020 Conference Paper

Biased Stochastic First-Order Methods for Conditional Stochastic Optimization and Applications in Meta Learning

  • Yifan Hu
  • Siqi Zhang
  • Xin Chen
  • Niao He

Conditional stochastic optimization covers a variety of applications ranging from invariant learning and causal inference to meta-learning. However, constructing unbiased gradient estimators for such problems is challenging due to the composition structure. As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives under smooth and non-smooth conditions. Our lower bound analysis shows that the sample complexities of BSGD cannot be improved for general convex objectives and nonconvex objectives except for smooth nonconvex objectives with Lipschitz continuous gradient estimator. For this special setting, we propose an accelerated algorithm called biased SpiderBoost (BSpiderBoost) that matches the lower bound complexity. We further conduct numerical experiments on invariant logistic regression and model-agnostic meta-learning to illustrate the performance of BSGD and BSpiderBoost.
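
The biased estimator for $\min_x \mathbb{E}_\xi\,f_\xi(\mathbb{E}_\eta[g_\eta(x)\mid\xi])$ plugs an inner mini-batch average into the outer gradient via the chain rule. A minimal sketch under that setup; all function and argument names here are hypothetical, and the bias decays as the inner batch size m grows:

```python
import numpy as np

def bsgd_gradient(x, sample_outer, sample_inner, grad_f, g, grad_g, m, rng):
    """Biased gradient estimate for min_x E_xi[ f_xi(E_eta[g(x, eta) | xi]) ].

    Plugging a size-m inner average into grad_f gives a biased chain-rule
    estimator; the bias decays with m, trading sample cost for accuracy.
    """
    xi = sample_outer(rng)
    etas = [sample_inner(xi, rng) for _ in range(m)]
    g_bar = np.mean([g(x, eta) for eta in etas], axis=0)        # inner average
    jac_bar = np.mean([grad_g(x, eta) for eta in etas], axis=0)  # avg Jacobian
    return jac_bar.T @ grad_f(xi, g_bar)
```

For example, with f(u) = u^2 and g(x, eta) = x + eta for zero-mean eta, the exact gradient at x is 2x, and the estimate above approaches it as m grows.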

AAAI Conference 2018 Conference Paper

HARP: Hierarchical Representation Learning for Networks

  • Haochen Chen
  • Bryan Perozzi
  • Yifan Hu
  • Steven Skiena

We present HARP, a novel method for learning low-dimensional embeddings of a graph’s nodes which preserves higher-order structural features. Our proposed method achieves this by compressing the input graph prior to embedding it, effectively avoiding troublesome embedding configurations (i.e., local minima) which can pose problems to non-convex optimization. HARP works by finding a smaller graph which approximates the global structure of its input. This simplified graph is used to learn a set of initial representations, which serve as good initializations for learning representations in the original, detailed graph. We inductively extend this idea by decomposing a graph into a series of levels, and then embed the hierarchy of graphs from the coarsest one to the original graph. HARP is a general meta-strategy to improve all of the state-of-the-art neural algorithms for embedding graphs, including DeepWalk, LINE, and Node2vec. Indeed, we demonstrate that applying HARP’s hierarchical paradigm yields improved implementations for all three of these methods, as evaluated on classification tasks on real-world graphs such as DBLP, BlogCatalog, and CiteSeer, where we achieve a performance gain over the original implementations of up to 14% Macro F1.
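
One level of the coarsening this kind of method relies on can be sketched as greedy edge collapsing: pair each unmatched node with an unmatched neighbor and merge the pair into a supernode. A simplified illustration (HARP itself combines edge collapsing with star collapsing); the names are hypothetical:

```python
def collapse_edges(adj):
    """One level of edge-collapsing graph coarsening (schematic).

    Greedily pairs each unmatched node with an unmatched neighbor and
    merges the pair into a supernode, roughly halving the graph while
    preserving its global structure.  adj maps node -> set of neighbors.
    """
    merged = {}  # original node -> supernode id
    for u in sorted(adj):
        if u in merged:
            continue
        partner = next((v for v in sorted(adj[u]) if v not in merged and v != u), None)
        merged[u] = u
        if partner is not None:
            merged[partner] = u
    # rebuild adjacency between supernodes, dropping internal edges
    coarse = {}
    for u, nbrs in adj.items():
        cu = merged[u]
        coarse.setdefault(cu, set())
        for v in nbrs:
            cv = merged[v]
            if cv != cu:
                coarse[cu].add(cv)
                coarse.setdefault(cv, set()).add(cu)
    return coarse, merged
```

Applied recursively, this produces the hierarchy of progressively smaller graphs that is then embedded from coarsest to finest, with each level's embedding initializing the next.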