Arrow Research

Author name cluster

Oleksii Kuchaiev

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
2 author rows

Possible papers (3)

ICLR 2025 Conference Paper

HelpSteer2-Preference: Complementing Ratings with Preferences

  • Zhilin Wang
  • Alexander Bukharin
  • Olivier Delalleau
  • Daniel Egert
  • Gerald Shen
  • Jiaqi Zeng
  • Oleksii Kuchaiev
  • Yi Dong 0003

Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other when adequately matched for data. This is primarily because these approaches require data collected in different (and incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement the existing ratings (designed for Regression-style training) in the HelpSteer2 dataset. To improve data interpretability, the preference annotations are accompanied by human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from this comparison, we propose a novel approach to combining Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, ranking first among more than 140 reward models as of 1 Oct 2024. This reward model can then be used with the REINFORCE algorithm (RLHF) to align an Instruct model to reach 85.0 on Arena Hard, which is No. 1 as of 1 Oct 2024. We open-source this dataset (CC-BY-4.0 license) at https://huggingface.co/datasets/nvidia/HelpSteer2#preferences-new---1-oct-2024 and openly release the trained Reward and Instruct models at https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward and https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct.
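As a rough illustration of the two reward-modeling paradigms the abstract contrasts, the sketch below shows the corresponding training losses in PyTorch. The function names and the assumption of a scalar reward head are illustrative only and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style: push the reward of the preferred response above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def regression_loss(r_pred: torch.Tensor, human_rating: torch.Tensor) -> torch.Tensor:
    # Regression style: fit the scalar reward directly to a human-assigned rating.
    return F.mse_loss(r_pred, human_rating)

# Toy usage with random scalars standing in for a real reward model's outputs.
r_chosen, r_rejected = torch.randn(8), torch.randn(8)
preds, ratings = torch.randn(8), torch.rand(8) * 4
print(bradley_terry_loss(r_chosen, r_rejected).item(), regression_loss(preds, ratings).item())
```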

NeurIPS 2025 Conference Paper

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

  • Zhilin Wang
  • Jiaqi Zeng
  • Olivier Delalleau
  • Hoo-Chang Shin
  • Felipe Soares
  • Alexander Bukharin
  • Ellie Evans
  • Yi Dong

Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding, and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate that HelpSteer3-Preference can also be applied to train Generative RMs, and show how policy models can be aligned with RLHF using our RMs.
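As a purely hypothetical sketch of what a single human-annotated preference sample of this kind might look like, the record below uses illustrative field names and values; it is not the dataset's actual schema.

```python
# Hypothetical record layout for a pairwise preference sample; fields are illustrative.
sample = {
    "prompt": "Write a Python function that merges two sorted lists.",
    "response_1": "def merge(a, b): ...",
    "response_2": "def merge(a, b): ...",
    "preference": 1,       # which response the annotator preferred (1 or 2)
    "domain": "coding",    # e.g. STEM, coding, or a multilingual task
    "language": "en",
}
print(sample["preference"])
```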

NeurIPS 2024 Conference Paper

HelpSteer2: Open-source dataset for training top-performing reward models

  • Zhilin Wang
  • Yi Dong
  • Olivier Delalleau
  • Jiaqi Zeng
  • Gerald Shen
  • Daniel Egert
  • Jimmy J. Zhang
  • Makesh N. Sreedhar

High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer, need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute-labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful Nemotron-4-340B base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. Additionally, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute scores predicted by our reward models. HelpSteer2 is available at https://huggingface.co/datasets/nvidia/HelpSteer2 and code is available at https://github.com/NVIDIA/NeMo-Aligner.
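A minimal sketch of pulling the dataset from the Hugging Face Hub URL given above, using the standard `datasets` library; the split name and the record fields printed are assumptions about the dataset layout rather than facts stated in the abstract.

```python
from datasets import load_dataset

# Load the permissively licensed HelpSteer2 dataset referenced above.
ds = load_dataset("nvidia/HelpSteer2", split="train")  # split name is an assumption

# Each record is expected to pair a prompt/response with multi-attribute human ratings.
print(ds[0])
```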