Arrow Research search

Author name cluster

Bei Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers (12)

AAAI Conference 2026 Conference Paper

GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

  • Chenglong Wang
  • Yongyu Mu
  • Hang Zhou
  • Yifu Huo
  • Ziming Zhu
  • Jiali Zeng
  • Murun Yang
  • Bei Li

Major progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs to generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short in instilling explicit reasoning capabilities into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to scale up reward reasoning in reward models. Based on this approach, we develop GRAM-R², a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R² can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning, supporting downstream applications such as policy optimization and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R² consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
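
The self-training recipe above is described only at a high level; the toy sketch below shows one plausible shape of such a loop, where `generate_rationale` and `fine_tune` are hypothetical stubs and the self-consistency filter is our assumption rather than a detail from the paper.

```python
from dataclasses import dataclass
import random

@dataclass
class Pair:
    prompt: str
    response_a: str
    response_b: str

def generate_rationale(model, pair, seed):
    """Hypothetical stand-in: the model writes a rationale plus an A/B label."""
    random.seed(hash((pair.prompt, seed)) % (2**32))
    label = random.choice(["A", "B"])
    return f"Rationale for '{pair.prompt}' ...", label

def fine_tune(model, dataset):
    """Hypothetical trainer stub; a real run would update model weights here."""
    return model

def self_train(model, unlabeled, rounds=2, votes=4):
    for _ in range(rounds):
        dataset = []
        for pair in unlabeled:
            outcomes = [generate_rationale(model, pair, s) for s in range(votes)]
            labels = [label for _, label in outcomes]
            # Keep only pairs whose sampled rationales agree (self-consistency),
            # then retrain on these pseudo-labeled rationales.
            if len(set(labels)) == 1:
                dataset.append((pair, outcomes[0][0], labels[0]))
        model = fine_tune(model, dataset)
    return model

pairs = [Pair("q1", "resp A", "resp B"), Pair("q2", "resp A", "resp B")]
self_train(model=None, unlabeled=pairs)
```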

AAAI Conference 2026 Conference Paper

Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

  • Chenglong Wang
  • Yifu Huo
  • Yang Gan
  • Yongyu Mu
  • Qiaozhi He
  • Murun Yang
  • Bei Li
  • Chunliang Zhang

Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not report performance on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions, designed to favor reward models that better capture preferences across dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during reward prediction and enhances the interpretability of reward models. Through extensive experiments, we find that MRMBench strongly correlates with LLM alignment performance, supporting it as a reliable reference for developing advanced reward models. By analyzing the evaluation results on MRMBench, we reveal that reward models struggle to capture preferences across multiple dimensions simultaneously, highlighting the potential of multi-objective optimization in reward modeling. Furthermore, our results demonstrate that the proposed inference-time probing method provides a reliable metric for assessing the confidence of reward predictions, leading to improved alignment of large language models.
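
Probing preference representations generally means training a lightweight classifier on frozen hidden states; the snippet below illustrates that recipe on synthetic features. The pooled activations and the single preference-dimension label are stand-ins, not MRMBench's actual tasks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for frozen reward-model hidden states (e.g., pooled last-layer
# activations of preference pairs) and binary labels for one dimension.
H = rng.normal(size=(1000, 64))
w_true = rng.normal(size=64)
y = (H @ w_true + 0.5 * rng.normal(size=1000) > 0).astype(int)

# A linear probe: if it separates the labels well, the representation
# encodes that preference dimension.
H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print(f"probe accuracy on held-out pairs: {probe.score(H_te, y_te):.3f}")
```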

AAAI Conference 2026 Conference Paper

SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

  • Yuan Ge
  • Junxiang Zhang
  • Xiaoqian Liu
  • Bei Li
  • Xiangnan Ma
  • Chenglong Wang
  • Kaiyang Ye
  • Yangfan Du

Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

AAAI Conference 2026 Conference Paper

Tuning Medical Foundation Models for Inner Ear Temporal CT Analysis with Plug-and-play Domain Knowledge Aggregator

  • Weixun Wan
  • Xinyang Jiang
  • Zilong Wang
  • Bei Li
  • Cairong Zhao

High-resolution computed tomography (CT) is essential for diagnosing hearing loss and planning interventions such as cochlear implantation, as it provides detailed visualization of inner-ear anatomy. This paper focuses on advancing AI-based analysis of inner-ear CT scans to support clinical decision-making. However, a major challenge lies in the scarcity of annotated data, which limits the applicability of conventional supervised learning techniques. To address this, we present the first publicly available Children's Inner Ear CT Dataset (CIED), comprising 722 CT scans labeled for structural anomaly detection, postoperative hearing outcome prediction, and anatomical segmentation. In addition, we explore the use of medical foundation models to improve generalization in data-scarce scenarios. Existing parameter-efficient adaptation methods often fall short in two ways: they lack a unified mechanism to adapt across diverse foundation model architectures, and they are not specifically designed to incorporate domain expert knowledge of inner-ear anatomy and pathology. To overcome these limitations, we propose Domain Knowledge Guided Tuning (DKGT), a plug-and-play framework that introduces a unified adapter, the Domain Knowledge Aggregator (DKA), to inject radiomics-based anatomical features into foundation models via cross-attention. DKA supports various backbone types and preserves the pretrained representations of the foundation model while enabling multi-layer integration of expert knowledge. Extensive experiments across multiple tasks demonstrate that DKGT consistently outperforms state-of-the-art classification methods, achieving superior performance and generalizability on inner-ear CT analysis.
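
To make the cross-attention injection concrete, here is a minimal PyTorch sketch of a DKA-style adapter in which backbone tokens attend to external knowledge tokens; the dimensions, placement, and random stand-ins for radiomics features are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class KnowledgeAggregator(nn.Module):
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, backbone_tokens, knowledge_tokens):
        # Queries come from the (frozen) backbone; keys/values from the
        # projected domain-knowledge features.
        injected, _ = self.attn(backbone_tokens, knowledge_tokens, knowledge_tokens)
        # Residual connection keeps the pretrained representations intact.
        return self.norm(backbone_tokens + injected)

tokens = torch.randn(2, 196, 256)     # e.g., ViT patch tokens
radiomics = torch.randn(2, 16, 256)   # stand-in for projected radiomics features
print(KnowledgeAggregator()(tokens, radiomics).shape)  # torch.Size([2, 196, 256])
```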

ICLR Conference 2025 Conference Paper

Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

  • Ruichen Shao
  • Bei Li
  • Gangao Liu
  • Yang Chen
  • ZhouXiang
  • Jingang Wang
  • Xunliang Cai
  • Peng Li

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but treat the contribution of rewards uniformly across the sequence, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our code is available at https://github.com/LotuSrc/D2PO.
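
The temporal-decay idea is concrete enough to sketch: weight each token's policy/reference log-ratio by gamma^t before it enters the usual DPO logistic loss, so earlier tokens contribute more. The tensor shapes and names below are illustrative; the authors' repository contains the actual implementation.

```python
import torch
import torch.nn.functional as F

def decayed_dpo_loss(logp_policy_w, logp_ref_w, logp_policy_l, logp_ref_l,
                     beta=0.1, gamma=0.98):
    """Each input: (batch, seq_len) per-token log-probs of chosen/rejected."""
    def weighted_margin(logp_pi, logp_ref):
        t = torch.arange(logp_pi.shape[1], dtype=logp_pi.dtype)
        decay = gamma ** t                      # earlier tokens weigh more
        return ((logp_pi - logp_ref) * decay).sum(dim=1)

    margin = weighted_margin(logp_policy_w, logp_ref_w) \
           - weighted_margin(logp_policy_l, logp_ref_l)
    return -F.logsigmoid(beta * margin).mean()  # standard DPO logistic loss

b, T = 4, 12
args = [torch.randn(b, T) for _ in range(4)]
print(decayed_dpo_loss(*args))
```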

ICML Conference 2025 Conference Paper

GRAM: A Generative Foundation Reward Model for Reward Generalization

  • Chenglong Wang 0002
  • Yang Gan
  • Yifu Huo
  • Yongyu Mu
  • Qiaozhi He
  • Murun Yang
  • Bei Li
  • Tong Xiao 0001

In aligning large language models (LLMs), reward models have played an important role, but they are typically trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model that can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
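
The label-smoothing observation has a compact form: with smoothing eps, the loss on the pairwise margin m = r(x, y_w) - r(x, y_l) becomes -(1 - eps) log σ(m) - eps log σ(-m), i.e., the ordinary ranking loss plus a term that penalizes overconfident margins. This is one standard way to write the equivalence and may differ from the paper's exact derivation; a quick numeric check:

```python
import torch
import torch.nn.functional as F

def smoothed_ranking_loss(margin, eps=0.1):
    # eps = 0 recovers the plain pairwise (Bradley-Terry style) ranking loss;
    # eps > 0 adds a regularizer that grows with the margin.
    return -(1 - eps) * F.logsigmoid(margin) - eps * F.logsigmoid(-margin)

m = torch.tensor([0.5, 2.0, 8.0])
print(smoothed_ranking_loss(m, eps=0.0))  # plain ranking loss
print(smoothed_ranking_loss(m, eps=0.1))  # large margins now penalized
```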

NeurIPS Conference 2025 Conference Paper

MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization

  • Chenglong Wang
  • Yang Gan
  • Hang Zhou
  • Chi Hu
  • Yongyu Mu
  • Kai Song
  • Murun Yang
  • Bei Li

Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture token correlations. In this paper, we define two types of token correlation, intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider token correlations during the denoising process. More specifically, our MRO approach leverages test-time scaling, rejection sampling, and reinforcement learning to directly optimize token correlations with multiple elaborate rewards. Additionally, we introduce group-step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.
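
As a rough illustration of optimizing against multiple rewards with reduced variance, the snippet below scalarizes several reward signals and subtracts a group-mean baseline; the concrete rewards, weights, and update rule in MRO are not reproduced here.

```python
import numpy as np

def group_advantages(rewards, weights):
    """rewards: (group_size, num_rewards) scores for samples of one prompt."""
    combined = rewards @ weights      # scalarize the multiple reward signals
    return combined - combined.mean() # group-mean baseline reduces variance

rng = np.random.default_rng(0)
scores = rng.uniform(size=(8, 3))     # 8 sampled sequences, 3 reward signals
print(group_advantages(scores, weights=np.array([0.5, 0.3, 0.2])))
```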

ICLR Conference 2024 Conference Paper

Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

  • Qingyan Guo
  • Rui Wang 0028
  • Junliang Guo
  • Bei Li
  • Kaitao Song
  • Xu Tan 0003
  • Guoqing Liu
  • Jiang Bian 0002

Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs), as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, without requiring any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population using scores on a development set. We optimize prompts for both closed- and open-source LLMs, including GPT-3.5 and Alpaca, on 31 datasets covering language understanding and generation tasks as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.
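
The EvoPrompt loop is easy to render in miniature: a population of prompts is scored on a development set while an LLM plays the crossover/mutation operator. In the sketch below, `llm_vary` and the toy fitness function are placeholders for real LLM API calls and dev-set evaluation.

```python
import random

def llm_vary(parent_a, parent_b):
    """Hypothetical LLM call: 'combine these two prompts into a new one'."""
    cut_a, cut_b = len(parent_a) // 2, len(parent_b) // 2
    return parent_a[:cut_a] + parent_b[cut_b:]

def fitness(prompt):
    """Toy dev-set score; a real run would evaluate task accuracy."""
    return sum(prompt.count(w) for w in ("step", "careful"))

def evoprompt(population, generations=10, seed=0):
    random.seed(seed)
    for _ in range(generations):
        a, b = random.sample(population, 2)   # select parents
        child = llm_vary(a, b)                # LLM as evolutionary operator
        worst = min(population, key=fitness)
        if fitness(child) > fitness(worst):   # survival of the fittest
            population[population.index(worst)] = child
    return max(population, key=fitness)

seeds = ["Answer step by step.", "Be careful and explain.", "Just answer."]
print(evoprompt(seeds))
```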

AAAI Conference 2024 Conference Paper

ESRL: Efficient Sampling-Based Reinforcement Learning for Sequence Generation

  • Chenglong Wang
  • Hang Zhou
  • Yimin Hu
  • Yifu Huo
  • Bei Li
  • Tongran Liu
  • Tong Xiao
  • Jingbo Zhu

Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (e.g., BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences. This poses a computational challenge in practical sequence generation problems, such as machine translation, where we often deal with a large action space (e.g., a vocabulary) and long action sequences (e.g., a translation). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL. We experiment with our approaches on traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) by training a large language model using a reward model. Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/examples/esrl.
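
The abstract does not spell out the dynamic sampling rule, so the snippet below is only one way to read it: allocate a larger sampling budget to inputs where the model is uncertain. The entropy-based heuristic is our assumption, not ESRL's actual criterion.

```python
import math

def sample_budget(token_entropies, min_k=2, max_k=16, vocab_size=32000):
    """Map mean per-token entropy (nats) to a per-input sample count."""
    h = sum(token_entropies) / len(token_entropies)
    frac = min(h / math.log(vocab_size), 1.0)  # normalize by max vocab entropy
    return min_k + round(frac * (max_k - min_k))

print(sample_budget([0.2, 0.3, 0.1]))   # confident model -> few samples
print(sample_budget([6.0, 7.5, 5.0]))   # uncertain model -> many samples
```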

NeurIPS Conference 2024 Conference Paper

Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

  • Bei Li
  • Tong Zheng
  • Rui Wang
  • Jiahao Liu
  • Qingyan Guo
  • Junliang Guo
  • Xu Tan
  • Tong Xiao

Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error relative to the true "solution." First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT'14 English-German and English-French tasks, our model achieves BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses the strong 3.8B DeepNet by an average of 2.9 SacreBLEU while using only 1/3 of the parameters. Notably, it also beats LLaMA models by 5.7 accuracy points on the LM Harness Evaluation.
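
The ODE view translates directly into code: a residual block is an Euler step x <- x + F(x), and the scheme above replaces it with a multistep predictor followed by a corrector. In the sketch below, F is a stand-in for an attention/FFN sublayer and the EMA-style coefficients are toy values, not the learned ones.

```python
import torch

def F(x):
    """Placeholder sublayer standing in for attention/FFN."""
    return torch.tanh(x)

def predictor_corrector_step(x, prev_fx, gamma=0.9):
    fx = F(x)
    # Predictor: a two-step update whose slope weights follow an EMA-style
    # mixing of the current and previous evaluations.
    x_pred = x + gamma * fx + (1 - gamma) * prev_fx
    # Corrector: re-evaluate at the predicted point and average the slopes
    # (trapezoidal, implicit-style refinement).
    x_next = x + 0.5 * (fx + F(x_pred))
    return x_next, fx

x = torch.randn(2, 8)
fx_prev = torch.zeros_like(x)
for _ in range(3):
    x, fx_prev = predictor_corrector_step(x, fx_prev)
print(x.shape)
```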

ICML Conference 2022 Conference Paper

Learning Multiscale Transformer Models for Sequence Generation

  • Bei Li
  • Tong Zheng
  • Yi Jing
  • Chengbo Jiao
  • Tong Xiao 0001
  • JingBo Zhu

Multiscale feature hierarchies have witnessed great success in computer vision. This motivates researchers to design multiscale Transformers for natural language processing, mostly based on the self-attention mechanism, for example by restricting the receptive field across heads or extracting local fine-grained features via convolutions. However, most existing works directly model local features while ignoring word-boundary information, which results in redundant and ambiguous attention distributions and a lack of interpretability. In this work, we define scales in terms of linguistic units, including sub-words, words, and phrases. We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge. The proposed Universal MultiScale Transformer (UMST) is evaluated on two sequence generation tasks. Notably, it yields consistent performance gains over strong baselines on several test sets without sacrificing efficiency.
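
The word-boundary ingredient can be illustrated by building a coarser "word" scale from sub-word tokens. The '##' continuation convention and mean pooling in this sketch are assumptions; UMST's actual scale construction may differ.

```python
import torch

def word_ids(subwords):
    """Assign each sub-word token the index of the word it belongs to."""
    ids, cur = [], -1
    for tok in subwords:
        if not tok.startswith("##"):  # a new word starts here
            cur += 1
        ids.append(cur)
    return torch.tensor(ids)

def pool_to_words(sub_vecs, ids):
    """Mean-pool sub-word vectors into one vector per word."""
    n_words = int(ids.max()) + 1
    out = torch.zeros(n_words, sub_vecs.shape[1])
    out.index_add_(0, ids, sub_vecs)  # sum sub-word vectors per word
    counts = torch.bincount(ids, minlength=n_words).unsqueeze(1)
    return out / counts

toks = ["trans", "##form", "##ers", "rock"]
vecs = torch.randn(4, 8)
print(pool_to_words(vecs, word_ids(toks)).shape)  # torch.Size([2, 8])
```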

AAAI Conference 2021 Conference Paper

Learning Light-Weight Translation Models from Deep Transformer

  • Bei Li
  • Ziyang Wang
  • Hui Liu
  • Quan Du
  • Tong Xiao
  • Chunliang Zhang
  • Jingbo Zhu

Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation based knowledge distillation approach to compressing the deep Transformer model into a shallow model. Experimental results on several benchmarks validate the effectiveness of our method. Our compressed model is 8 times shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers during training to introduce perturbation, which achieves a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at https://github.com/libeineu/GPKD.
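
In outline, the compression recipe partitions the deep teacher's layers into groups and trains each shallow student layer to mimic its group's output. The mapping and loss below are a simplified sketch; the group-permutation training schedule itself lives in the linked repository.

```python
import torch
import torch.nn as nn

teacher = nn.ModuleList(nn.Linear(16, 16) for _ in range(48))  # deep model
student = nn.ModuleList(nn.Linear(16, 16) for _ in range(6))   # 8x shallower
group = len(teacher) // len(student)

def kd_loss(x):
    loss, t, s = 0.0, x, x
    for i, s_layer in enumerate(student):
        # Run the i-th group of teacher layers, then one student layer,
        # and make the student match the group's output.
        for t_layer in teacher[i * group:(i + 1) * group]:
            t = torch.tanh(t_layer(t))
        s = torch.tanh(s_layer(s))
        loss = loss + nn.functional.mse_loss(s, t.detach())
    return loss

print(kd_loss(torch.randn(4, 16)))
```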