Arrow Research search

Author name cluster

Yi Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

14

NeurIPS Conference 2025 Conference Paper

Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs

  • Yi Hu
  • Shijia Kang
  • Haotong Yang
  • Haotian Xu
  • Muhan Zhang

Length generalization—the ability to solve problems longer than those seen during training—remains a critical challenge for large language models (LLMs). Previous work modifies positional encodings (PEs) and data formats to improve length generalization on specific symbolic tasks such as addition and sorting. However, these approaches are fundamentally limited to special tasks, often degrading general language performance. Furthermore, they are typically evaluated on small transformers trained from scratch on single tasks and can cause performance drops when applied during the post-training stage of practical LLMs with general capabilities. Hu et al. (2024) proposed Rule-Following Fine-Tuning (RFFT) to improve length generalization in the post-training stage of LLMs. Despite its compatibility with practical models and strong performance, RFFT is likewise designed for single tasks, requiring re-training for each individual task with extensive examples. In this paper, we study length generalization in multi-task settings and propose Meta Rule-Following Fine-Tuning (Meta-RFFT), the first framework enabling robust cross-task length generalization. As our first contribution, we construct a large length generalization dataset containing 86 tasks spanning code execution, number processing, symbolic and logical reasoning tasks, beyond the common addition or multiplication tasks. Secondly, we show that cross-task length generalization is possible with Meta-RFFT—after training on a large number of tasks and instances, the models achieve remarkable length generalization ability on unseen tasks with minimal fine-tuning or one-shot prompting. For example, after fine-tuning on 1 to 5 digit addition, our 32B model achieves 95% accuracy on 30 digit addition, significantly outperforming state-of-the-art reasoning models (DeepSeek-R1-671B: 72%; QwQ-32B: 32%), despite never seeing this task during RF-pretraining.
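The evaluation protocol described above (train on 1-5 digit operands, test on 30-digit operands) can be sketched with a simple length-split data generator. This is an illustrative reconstruction, not the paper's actual dataset code; the field names and sizes are assumptions.

```python
import random

def addition_examples(n_digits: int, k: int, seed: int = 0):
    # Sample k addition problems whose operands each have exactly n_digits digits.
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    out = []
    for _ in range(k):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        out.append({"prompt": f"{a} + {b} =", "answer": str(a + b)})
    return out

# Train split: short operands (1-5 digits); length-generalization
# test split: far longer operands (30 digits) never seen in training.
train = [ex for d in range(1, 6) for ex in addition_examples(d, 100, seed=d)]
test = addition_examples(30, 100, seed=42)
```

A model's length-generalization accuracy is then simply the fraction of `test` prompts it answers exactly.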

ICLR Conference 2025 Conference Paper

Number Cookbook: Number Understanding of Language Models and How to Improve It

  • Haotong Yang 0001
  • Yi Hu
  • Shijia Kang
  • Zhouchen Lin
  • Muhan Zhang

Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as $9.11 > 9.9$). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work paid little attention to it or discussed only a few restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and the rules of these tasks are very simple and clear. Through the benchmark, we find that current LLMs fail frequently on many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, PEs, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can substantially improve NUPA on many, but not all, tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work provides a more detailed and comprehensive understanding of NUPA in LLMs.
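The $9.11 > 9.9$ failure mentioned above is a decimal-comparison task of exactly the kind the benchmark covers. A ground-truth checker for it is trivial in ordinary code, which is what makes the LLM failures so striking; the sketch below is illustrative, not the paper's evaluation code.

```python
from decimal import Decimal

def compare_decimals(a: str, b: str) -> str:
    # Exact numeric comparison; avoids both float rounding and
    # the naive "more digits means bigger" heuristic.
    da, db = Decimal(a), Decimal(b)
    if da > db:
        return ">"
    if da < db:
        return "<"
    return "="

assert compare_decimals("9.11", "9.9") == "<"   # 9.11 < 9.90
assert compare_decimals("10.2", "9.9") == ">"   # lexicographic order would say otherwise
```

Note that naive string comparison also fails here: `"10.2" > "9.9"` is `False` in Python because `'1' < '9'`, a trap analogous to the digit-level shortcuts LLM tokenizations can induce.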

NeurIPS Conference 2025 Conference Paper

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

  • Shi Qiu
  • Shaoyang Guo
  • Zhuo-Yang Song
  • Yunbo Sun
  • Zeyu Cai
  • Jiashen Wei
  • Tianyu Luo
  • Yixuan Yin

Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.
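The EED Score gives partial credit to near-correct expressions instead of the all-or-nothing binary grade. A simple stand-in for the idea (not the paper's exact metric, which operates on expression trees) is a normalized edit distance between tokenized expressions:

```python
import re

def edit_distance(a, b):
    # Classic Levenshtein dynamic program over token sequences,
    # using a single rolling row to keep memory at O(len(b)).
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def tokenize(expr: str):
    # Split an expression into identifiers, numbers, and single symbols.
    return re.findall(r"[A-Za-z]+|\d+|\S", expr)

def eed_like_score(pred: str, ref: str) -> float:
    # 1.0 for an exact token match, graded partial credit for near misses.
    p, r = tokenize(pred), tokenize(ref)
    return max(0.0, 1.0 - edit_distance(p, r) / max(len(r), 1))
```

For example, `eed_like_score("x^2 + 2x", "x^2 + 2x + 1")` lands strictly between 0 and 1, whereas binary scoring would discard that partial-correctness signal entirely.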

ICML Conference 2024 Conference Paper

Case-Based or Rule-Based: How Do Transformers Do the Math?

  • Yi Hu
  • Xiaojuan Tang
  • Haotong Yang 0001
  • Muhan Zhang

Despite their impressive performance on a variety of complex tasks, modern large language models (LLMs) still have trouble with some math problems that are simple and intuitive for humans, such as addition. While we can easily learn the basic rules of addition and apply them to new problems of any length, LLMs struggle to do the same. Instead, they may rely on similar cases seen in the training corpus. We define these two different reasoning mechanisms as "rule-based reasoning" and "case-based reasoning". Since rule-based reasoning is essential for acquiring systematic generalization ability, we aim to explore exactly whether transformers use rule-based or case-based reasoning for math problems. Through carefully designed intervention experiments on five math tasks, we confirm that transformers perform case-based reasoning regardless of whether a scratchpad is used, which aligns with previous observations that transformers use subgraph matching/shortcut learning to reason. To mitigate such problems, we propose a Rule-Following Fine-Tuning (RFFT) technique to teach transformers to perform rule-based reasoning. Specifically, we provide explicit rules in the input and then instruct transformers to recite and follow the rules step by step. Through RFFT, we successfully enable LLMs fine-tuned on 1-5 digit addition to generalize to up to 12-digit addition with over 95% accuracy, which is over 40% higher than scratchpad. This significant improvement demonstrates that teaching LLMs to use rules explicitly helps them learn rule-based reasoning and generalize better to longer problems. Code is available at https://github.com/GraphPKU/Case_or_Rule.
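The RFFT recipe described above puts explicit rules in the input and instructs the model to recite and follow them step by step. A minimal, hypothetical sketch of building such a prompt for addition follows; the rule text and instruction wording are illustrative, not the paper's exact template.

```python
ADDITION_RULES = """\
Rules for multi-digit addition:
1. Align the two numbers on their least significant digits.
2. Add digit pairs from right to left, tracking a carry of 0 or 1.
3. Write each result digit; after the last pair, write any remaining carry.
"""

def rule_following_prompt(a: int, b: int) -> str:
    # The explicit rules go directly in the input; the model is then
    # instructed to recite the relevant rule before each step it performs.
    return (
        f"{ADDITION_RULES}\n"
        f"Task: compute {a} + {b}.\n"
        "At each step, first restate the rule you are applying, then "
        "perform the step, then give the final answer."
    )

print(rule_following_prompt(12345, 6789))
```

Fine-tuning targets would pair each such prompt with a step-by-step trace that interleaves rule recitations with digit-level operations.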

IROS Conference 2024 Conference Paper

Deep Ad-hoc Sub-Team Partition Learning for Multi-Agent Air Combat Cooperation

  • Songyuan Fan
  • Haiyin Piao
  • Yi Hu
  • Feng Jiang 0001
  • Roushu Yang

In the future, unmanned autonomous air combat will encounter large-scale confrontation scenarios, where agents must consider complex time-varying relationships among aircraft when making decisions. Previous works have already introduced Multi-Agent Reinforcement Learning (MARL) into air combat and succeeded in surpassing human expert level. However, they mainly focus on small-scale air combat with low relationship complexity, e.g., 1-vs-1 or 2-vs-2. As more agents join the confrontation, existing algorithms tend to suffer significant performance degradation due to the increase in problem dimensionality. To address this, this paper proposes Deep Ad-hoc Sub-Team Partition Learning (DASPL) for large-scale air combat problems. DASPL models multi-agent air combat as a graph to handle the complex relations and introduces an automatic partitioning mechanism to generate dynamic sub-teams, which converts the existing large-scale multi-agent air combat cooperation problem into multiple small-scale equivalent problems. Additionally, DASPL incorporates an efficient message passing method among the participating sub-teams.

IJCAI Conference 2024 Conference Paper

InstructEdit: Instruction-Based Knowledge Editing for Large Language Models

  • Ningyu Zhang
  • Bozhong Tian
  • Siyuan Cheng
  • Xiaozhuan Liang
  • Yi Hu
  • Kouying Xue
  • Yanjie Gou
  • Xi Chen

Knowledge editing for large language models can offer an efficient solution to alter a model's behavior without negatively impacting overall performance. However, current approaches encounter issues with limited generalizability across tasks, necessitating one distinct editor for each task, which significantly hinders broader application. To address this, we take the first step toward analyzing the multi-task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor's adaptation to various tasks simultaneously using simple instructions. With only one unified editor per LLM, we empirically demonstrate that InstructEdit can improve the editor's control, leading to an average 14.86% increase in Reliability in the multi-task editing setting. Furthermore, experiments on held-out unseen tasks show that InstructEdit consistently surpasses previous strong baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the editing gradient directions, revealing that instructions can help control the optimization direction with stronger OOD generalization.

AAAI Conference 2024 Conference Paper

T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering

  • Lei Wang
  • Yi Hu
  • Jiabang He
  • Xing Xu
  • Ning Liu
  • Hui Liu
  • Heng Tao Shen

Large Language Models (LLMs) have recently demonstrated exceptional performance in various Natural Language Processing (NLP) tasks. They have also shown the ability to perform chain-of-thought (CoT) reasoning to solve complex problems. Recent studies have explored CoT reasoning in complex multimodal scenarios, such as the science question answering task, by fine-tuning multimodal models with high-quality human-annotated CoT rationales. However, collecting high-quality CoT rationales is usually time-consuming and costly. Moreover, the annotated rationales are often inaccurate because essential external information is missing. To address these issues, we propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. The T-SciQ approach generates high-quality CoT rationales as teaching signals and uses them to train much smaller models to perform CoT reasoning in complex modalities. Additionally, we introduce a novel data mixing strategy to produce more effective teaching data samples for simple and complex science question answering problems. Extensive experimental results show that our T-SciQ method achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%. Moreover, our approach outperforms the most powerful fine-tuned baseline by 4.5%. The code is publicly available at https://github.com/T-SciQ/T-SciQ.

IROS Conference 2023 Conference Paper

Autonomous Ultrasound Scanning Towards Standard Plane Using Interval Interaction Probabilistic Movement Primitives

  • Yi Hu
  • Mahdi Tavakoli

Learning from demonstrations is the paradigm in which robots acquire new skills demonstrated by an expert, alleviating the physical burden on experts performing repetitive tasks. Ultrasound scanning is one way to view the anatomical structures of soft tissues, but some tissue scanning tasks are highly repetitive. In this study, a framework for autonomous ultrasound scanning towards a standard plane is proposed. Interaction probabilistic movement primitives (iProMP) were previously proposed for collaborative human-robot movement tasks. Inspired by the interval type-2 fuzzy system, an interval iProMP is proposed to learn the ultrasound scanning navigation strategy from scanning demonstrations, with robot movement and ultrasound image information serving as the collaborative agents. The proposed interval iProMP improves the ability to handle uncertainties caused by insufficient observations during reproduction. U-Net is applied to recognize the desired ultrasound image shown during demonstrations, and a confidence map is used to evaluate ultrasound image quality. Breast seroma scanning is chosen as the task to validate the performance of the proposed autonomous ultrasound scanning framework, where ultrasound navigation realizes autonomous scanning for localizing the breast seroma. Simulation comparisons show that the proposed interval iProMP performs better than the traditional iProMP under insufficient observation. Experimental results validate the feasibility and generality of the proposed framework, with the interval iProMP achieving a higher success rate than the traditional iProMP.