Arrow Research search

Author name cluster

Xuefeng Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers

5

NeurIPS Conference 2025 Conference Paper

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

  • Jiangjie Chen
  • Qianyu He
  • Siyu Yuan
  • Aili Chen
  • Zhicheng Cai
  • Weinan Dai
  • Hongli Yu
  • Jiaze Chen

Large Language Models (LLMs), such as OpenAI’s o1 and DeepSeek’s R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce ENIGMATA, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across 7 categories, each with: 1) a generator that produces unlimited examples with controllable difficulty, and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose ENIGMATA-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-ENIGMATA, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks like ENIGMATA-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from ENIGMATA further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing the generalization benefits of ENIGMATA. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Project page: https://seed-enigmata.github.io.

AAAI Conference 2025 Conference Paper

Evaluating Mathematical Reasoning Beyond Accuracy

  • Shijie Xia
  • Xuefeng Li
  • Yixin Liu
  • Tongshuang Wu
  • Pengfei Liu

The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs validity and redundancy to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. We explore different design options for the LLM-based evaluators and empirically demonstrate that ReasonEval, when instantiated with base models possessing strong mathematical knowledge and trained with high-quality labeled data, consistently outperforms baseline methods in the meta-evaluation datasets. We also highlight the strong generalization capabilities of ReasonEval. By utilizing ReasonEval to evaluate LLMs specialized in math, we find that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of the reasoning steps for challenging mathematical problems. Additionally, we observe that ReasonEval can play a significant role in data selection. We open-source the best-performing model, meta-evaluation script, and all evaluation results to facilitate future research.

EAAI Journal 2024 Journal Article

Deocclusion and integration of advantages for a better hand pose

  • Xuefeng Li
  • Xiangbo Lin

Estimating hand pose in the case of hand and object interaction faces the challenge of occlusion. Traditional methods that use contact information to alleviate this problem have limited applications because accurately estimating the required shape and pose of unfamiliar objects is a difficult task. This paper addresses this problem with occlusion removal at the image level, removing the object by the proposed self-supervision method, which reduces the labor required to collect the paired occlusion and deocclusion labels of the hand. In addition, this paper considers occlusion as valuable information and proposes an integration strategy to enrich the features extracted from the occluded hand image and the deoccluded hand image. Validation experiments are performed to show the proposed model’s main contributions. The experimental results on two widely used public datasets demonstrate that the proposed model outperforms other state-of-the-art methods.

NeurIPS Conference 2024 Conference Paper

MathPile: A Billion-Token-Scale Pretraining Corpus for Math

  • Zengzhi Wang
  • Xuefeng Li
  • Rui Xia
  • Pengfei Liu

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of “less is more”, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates and conducted continual pre-training experiments, boosting the performance on common mathematical reasoning benchmarks. We aim for MathPile to boost language models’ mathematical reasoning abilities and open-source its different versions and processing scripts to advance the field.

NeurIPS Conference 2024 Conference Paper

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

  • Zhen Huang
  • Zengzhi Wang
  • Shijie Xia
  • Xuefeng Li
  • Haoyang Zou
  • Ruijie Xu
  • Run-Ze Fan
  • Lyumanshan Ye

The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy (28.67% for mathematics and 29.71% for physics), illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.