Arrow Research search

Author name cluster

Zimu Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers

8

AAAI Conference 2026 Conference Paper

From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

  • Weikang Shi
  • Houxing Ren
  • Junting Pan
  • Aojun Zhou
  • Ke Wang
  • Zimu Lu
  • Yunqiao Yang
  • Yuxuan Hu

Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

ICLR Conference 2025 Conference Paper

MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

  • Zimu Lu
  • Aojun Zhou
  • Ke Wang 0036
  • Houxing Ren
  • Weikang Shi
  • Junting Pan
  • Mingjie Zhan
  • Hongsheng Li 0001

Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline.

TMLR Journal 2025 Journal Article

Step-Controlled DPO: Leveraging Stepwise Errors for Enhancing Mathematical Reasoning of Language Models

  • Zimu Lu
  • Aojun Zhou
  • Ke Wang
  • Houxing Ren
  • Weikang Shi
  • Yunqiao Yang
  • Junting Pan
  • Mingjie Zhan

Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to avoid reasoning errors and output accurate reasoning steps. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves competitive scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method. The code, models and data are released to inspire future work.

NeurIPS Conference 2025 Conference Paper

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

  • Han Xiao
  • Guozhi Wang
  • Yuxiang Chai
  • Zimu Lu
  • Weifeng Lin
  • Hao He
  • Lue Fan
  • Liuyang Bian

In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research in https: //github. com/Euphoria16/UI-Genie.

NeurIPS Conference 2025 Conference Paper

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

  • Zimu Lu
  • Yunqiao Yang
  • Houxing Ren
  • Haotian Hou
  • Han Xiao
  • Ke Wang
  • Weikang Shi
  • Aojun Zhou

LLM‑based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks—Bolt. diy, OpenHands, and Aider—using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt. diy powered by DeepSeek-R1, achieves only 27. 8\% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6, 667 website-generation instructions. Training Qwen2. 5-Coder-32B-Instruct on Bolt. diy trajectories generated from a subset of the training set achieves an accuracy of 38. 2\%, surpassing the performance of the best proprietary model. We release our data-generation, training, and testing code, along with both the datasets and model weights at https: //github. com/mnluzimu/WebGen-Bench.

ICLR Conference 2024 Conference Paper

MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

  • Ke Wang 0036
  • Houxing Ren
  • Aojun Zhou
  • Zimu Lu
  • Sichun Luo
  • Weikang Shi
  • Renrui Zhang
  • Linqi Song

The recently released GPT-4 Code Interpreter has demonstrated remarkable proficiency in solving challenging math problems, primarily attributed to its ability to seamlessly reason with natural language, generate code, execute code, and continue reasoning based on the execution output. In this paper, we present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations and, consequently, enhancing their mathematical reasoning abilities. We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions, referred to as MathCodeInstruct. Each solution interleaves $\textit{natural language}$, $\textit{code}$, and $\textit{execution results}$. We also introduce a customized supervised fine-tuning and inference approach. This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems. Impressively, the MathCoder models achieve state-of-the-art scores among open-source LLMs on the MATH (45.2%) and GSM8K (83.9%) datasets, substantially outperforming other open-source alternatives. Notably, the MathCoder model not only surpasses ChatGPT-3.5 and PaLM-2 on GSM8K and MATH but also outperforms GPT-4 on the competition-level MATH dataset. The proposed dataset and models will be released upon acceptance.

NeurIPS Conference 2024 Conference Paper

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

  • Ke Wang
  • Junting Pan
  • Weikang Shi
  • Zimu Lu
  • Houxing Ren
  • Aojun Zhou
  • Mingjie Zhan
  • Hongsheng Li

Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models exceeding human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3, 040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs. Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on \datasetname, underscoring the imperative for further advancements in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development. The dataset is released at MathLLMs/MathVision

ICLR Conference 2024 Conference Paper

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

  • Aojun Zhou
  • Ke Wang 0036
  • Zimu Lu
  • Weikang Shi
  • Sichun Luo
  • Zipeng Qin
  • Shaoqing Lu
  • Anya Jia

Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit $\underline{\text{c}}$ode-based $\underline{\text{s}}$elf-$\underline{\text{v}}$erification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on MATH dataset.