Arrow Research search

Author name cluster

Shawn Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

1 paper
1 author row

Possible papers

1

NeurIPS Conference 2025 Conference Paper

ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

  • Shulin Huang
  • Linyi Yang
  • Yan Song
  • Shawn Chen
  • Leyang Cui
  • Ziyu Wan
  • Qingcheng Zeng
  • Ying Wen

Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly evaluate the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that the performance of most LLMs is far from robust and that they are subject to a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination. Our data and code are available at https://github.com/huangshulin123/ThinkBench.