Iterative Reasoning Preference Optimization

Richard Y. Pang; Weizhe Yuan; Kyunghyun Cho; He He; Sainbayar Sukhbaatar; Jason Weston

doi:10.52202/079017-3702

NeurIPS 2024

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks. In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps. We train using a modified DPO loss with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
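
To make the two mechanisms named in the abstract concrete, here is a minimal Python sketch of a DPO loss with an added negative log-likelihood term for a single preference pair, plus majority voting over sampled answers. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the exact formula, so the function names, the weights `beta` and `alpha`, their default values, and the length normalization of the NLL term are all assumptions. Inputs are summed token log-probabilities of the winning and losing candidates under the trained model and a frozen reference model.

```python
import math
from collections import Counter


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, len_w,
                 beta=0.1, alpha=1.0):
    """DPO loss plus an NLL term on the winning sequence (one pair).

    logp_w, logp_l         : summed log-probs under the trained model
    ref_logp_w, ref_logp_l : summed log-probs under the reference model
    len_w                  : token count of the winning sequence
    beta, alpha            : hypothetical hyperparameters (assumed values)
    """
    # Standard DPO margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo_term = -log_sigmoid(margin)
    # Assumed form of the extra term: length-normalized NLL of the winner.
    nll_term = -logp_w / len_w
    return nll_term + alpha * dpo_term


def majority_vote(answers):
    """Return the most frequent final answer among sampled generations."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    loss = dpo_nll_loss(logp_w=-20.0, logp_l=-25.0,
                        ref_logp_w=-20.0, ref_logp_l=-25.0, len_w=10)
    print(f"loss when policy equals reference: {loss:.4f}")
    print(majority_vote(["18", "18", "20", "18", "21"]))
```

When the trained model equals the reference model the margin is zero, so the DPO term equals log 2 and only the NLL term distinguishes candidates by likelihood. Raising the winner's log-probability lowers both terms, which is one plausible reading of why the extra term helps.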

Context

Venue: Annual Conference on Neural Information Processing Systems