Author name cluster

Yi Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

NeurIPS Conference 2025 Conference Paper

Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs

  • Yi Hu
  • Shijia Kang
  • Haotong Yang
  • Haotian Xu
  • Muhan Zhang

Length generalization, the ability to solve problems longer than those seen during training, remains a critical challenge for large language models (LLMs). Previous work modifies positional encodings (PEs) and data formats to improve length generalization on specific symbolic tasks such as addition and sorting. However, these approaches are fundamentally limited to special tasks, often degrading general language performance. Furthermore, they are typically evaluated on small transformers trained from scratch on single tasks and can cause a performance drop when applied during the post-training stage of practical LLMs with general capabilities. Hu et al. (2024) proposed Rule-Following Fine-Tuning (RFFT) to improve length generalization in the post-training stage of LLMs. Despite its compatibility with practical models and strong performance, RFFT is also proposed for single tasks, requiring re-training for each individual task with extensive examples. In this paper, we study length generalization in multi-task settings and propose Meta Rule-Following Fine-Tuning (Meta-RFFT), the first framework enabling robust cross-task length generalization. As our first contribution, we construct a large length generalization dataset containing 86 tasks spanning code execution, number processing, symbolic and logical reasoning tasks, beyond the common addition or multiplication tasks. Secondly, we show that cross-task length generalization is possible with Meta-RFFT: after training on a large number of tasks and instances, the models achieve remarkable length generalization ability on unseen tasks with minimal fine-tuning or one-shot prompting. For example, after fine-tuning on 1 to 5 digit addition, our 32B model achieves 95% accuracy on 30 digit addition, significantly outperforming the state-of-the-art reasoning models (DeepSeek-R1-671B: 72%; QwQ-32B: 32%), despite never seeing this task during RF-pretraining.

ICLR Conference 2025 Conference Paper

Number Cookbook: Number Understanding of Language Models and How to Improve It

  • Haotong Yang
  • Yi Hu
  • Shijia Kang
  • Zhouchen Lin
  • Muhan Zhang

Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as $9.11 > 9.9$). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work paid little attention to it or only discussed several restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and the rules of these tasks are very simple and clear. Through the benchmark, we find that current LLMs fail frequently in many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, PEs, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can improve NUPA a lot on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work provides a more detailed and comprehensive understanding of NUPA in LLMs.
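The comparison mentioned in the abstract ($9.11 > 9.9$, which is false) can be checked with a minimal Python sketch. The function names are hypothetical and are not taken from the paper's benchmark. The second function shows one plausible misreading (treating the dot-separated parts as integers, as in version numbers); the abstract does not say this is why models make the mistake, so treat it as an illustrative assumption only.

```python
from decimal import Decimal


def greater_as_decimals(a: str, b: str) -> bool:
    """Return True if a > b when both strings are read as decimal numbers."""
    return Decimal(a) > Decimal(b)


def greater_as_dotted_fields(a: str, b: str) -> bool:
    """Return True if a > b when each dot-separated field is compared as an
    integer, the way version numbers are usually ordered (hypothetical misreading)."""
    fields_a = [int(part) for part in a.split(".")]
    fields_b = [int(part) for part in b.split(".")]
    return fields_a > fields_b


if __name__ == "__main__":
    print(greater_as_decimals("9.11", "9.9"))       # False: 9.11 < 9.90
    print(greater_as_dotted_fields("9.11", "9.9"))  # True: 11 > 9 in the second field
```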

NeurIPS Conference 2025 Conference Paper

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

  • Shi Qiu
  • Shaoyang Guo
  • Zhuo-Yang Song
  • Yunbo Sun
  • Zeyu Cai
  • Jiashen Wei
  • Tianyu Luo
  • Yixuan Yin

Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.

ICML Conference 2024 Conference Paper

Case-Based or Rule-Based: How Do Transformers Do the Math?

  • Yi Hu
  • Xiaojuan Tang
  • Haotong Yang
  • Muhan Zhang

Despite the impressive performance in a variety of complex tasks, modern large language models (LLMs) still have trouble dealing with some math problems that are simple and intuitive for humans, such as addition. While we can easily learn basic rules of addition and apply them to new problems of any length, LLMs struggle to do the same. Instead, they may rely on similar cases seen in the training corpus for help. We define these two different reasoning mechanisms as "rule-based reasoning" and "case-based reasoning". Since rule-based reasoning is essential for acquiring systematic generalization ability, we aim to explore exactly whether transformers use rule-based or case-based reasoning for math problems. Through carefully designed intervention experiments on five math tasks, we confirm that transformers are performing case-based reasoning, no matter whether scratchpad is used, which aligns with the previous observations that transformers use subgraph matching/shortcut learning to reason. To mitigate such problems, we propose a Rule-Following Fine-Tuning (RFFT) technique to teach transformers to perform rule-based reasoning. Specifically, we provide explicit rules in the input and then instruct transformers to recite and follow the rules step by step. Through RFFT, we successfully enable LLMs fine-tuned on 1-5 digit addition to generalize to up to 12-digit addition with over 95% accuracy, which is over 40% higher than scratchpad. The significant improvement demonstrates that teaching LLMs to use rules explicitly helps them learn rule-based reasoning and generalize better in length. Code is available at https://github.com/GraphPKU/Case_or_Rule.
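To make the idea of applying an addition rule step by step concrete, here is a minimal Python sketch of schoolbook addition with an explicit carry, logging each rule application. It is an illustration only: the function name and the trace wording are hypothetical, and this is neither the paper's RFFT prompt format nor the code in the linked repository. It assumes non-negative integers given as digit strings without leading zeros.

```python
def add_with_rule_trace(a: str, b: str) -> tuple[str, list[str]]:
    """Add two non-negative integers given as digit strings.

    Returns the sum as a string plus a human-readable trace of every
    per-digit step, so the same rule is visibly applied at any length.
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter operand with zeros

    carry = 0
    digits: list[str] = []  # result digits, least significant first
    trace: list[str] = []

    # Walk from the rightmost digit to the leftmost one.
    for i in range(width - 1, -1, -1):
        total = int(a[i]) + int(b[i]) + carry
        digit, new_carry = total % 10, total // 10
        trace.append(
            f"position {width - i}: {a[i]} + {b[i]} + carry {carry} = {total}; "
            f"write {digit}, carry {new_carry}"
        )
        digits.append(str(digit))
        carry = new_carry

    if carry:
        digits.append(str(carry))
        trace.append(f"final carry {carry}: write {carry}")

    return "".join(reversed(digits)), trace


if __name__ == "__main__":
    result, steps = add_with_rule_trace("987", "45")
    for step in steps:
        print(step)
    print(result)  # 1032
```

Because the loop applies one fixed rule per digit, the procedure behaves the same on 5-digit and 30-digit inputs, which is the property the abstract associates with rule-based reasoning.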

IROS Conference 2024 Conference Paper

Deep Ad-hoc Sub-Team Partition Learning for Multi-Agent Air Combat Cooperation

  • Songyuan Fan
  • Haiyin Piao
  • Yi Hu
  • Feng Jiang
  • Roushu Yang

In the future, unmanned autonomous air combat will encounter large-scale confrontation scenarios, where agents must consider complex time-varying relationships among aircraft when making decisions. Previous works have already introduced Multi-Agent Reinforcement Learning (MARL) into air combat and succeeded in surpassing the human expert level. However, they mainly focus on small-scale air combat with low relationship complexity, e.g., 1-vs-1 or 2-vs-2. As more agents join the confrontation, existing algorithms tend to suffer significant performance degradation due to the increase in problem dimensions. In view of this, this paper proposes Deep Ad-hoc Sub-Team Partition Learning (DASPL) to address large-scale air combat problems. DASPL models multi-agent air combat as a graph to handle the complex relations and introduces an automatic partitioning mechanism to generate dynamic sub-teams, which converts the existing large-scale multi-agent air combat cooperation problem into multiple small-scale equivalence problems. Additionally, DASPL incorporates an efficient message passing method among the participating sub-teams.

IJCAI Conference 2024 Conference Paper

InstructEdit: Instruction-Based Knowledge Editing for Large Language Models

  • Ningyu Zhang
  • Bozhong Tian
  • Siyuan Cheng
  • Xiaozhuan Liang
  • Yi Hu
  • Kouying Xue
  • Yanjie Gou
  • Xi Chen

Knowledge editing for large language models can offer an efficient solution to alter a model's behavior without negatively impacting the overall performance. However, the current approaches encounter issues with limited generalizability across tasks, necessitating one distinct editor for each task, significantly hindering the broader applications. To address this, we take the first step to analyze the multi-task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor's adaptation to various task performances simultaneously using simple instructions. With only one unified editor for each LLM, we empirically demonstrate that InstructEdit can improve the editor's control, leading to an average 14.86% increase in Reliability in the multi-task editing setting. Furthermore, experiments involving a held-out unseen task illustrate that InstructEdit consistently surpasses previous strong baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the editing gradient directions, which reveals that instructions can help control the optimization direction with stronger OOD generalization.

AAAI Conference 2024 Conference Paper

T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering

  • Lei Wang
  • Yi Hu
  • Jiabang He
  • Xing Xu
  • Ning Liu
  • Hui Liu
  • Heng Tao Shen

Large Language Models (LLMs) have recently demonstrated exceptional performance in various Natural Language Processing (NLP) tasks. They have also shown the ability to perform chain-of-thought (CoT) reasoning to solve complex problems. Recent studies have explored CoT reasoning in complex multimodal scenarios, such as the science question answering task, by fine-tuning multimodal models with high-quality human-annotated CoT rationales. However, collecting high-quality CoT rationales is usually time-consuming and costly. Besides, the annotated rationales are often inaccurate because essential external information is missing. To address these issues, we propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. The T-SciQ approach generates high-quality CoT rationales as teaching signals and is advanced to train much smaller models to perform CoT reasoning in complex modalities. Additionally, we introduce a novel data mixing strategy to produce more effective teaching data samples for simple and complex science question answering problems. Extensive experimental results show that our T-SciQ method achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%. Moreover, our approach outperforms the most powerful fine-tuned baseline by 4.5%. The code is publicly available at https://github.com/T-SciQ/T-SciQ.

YNIMG Journal 2024 Journal Article

The acquired dyad inclination and decreased interpersonal brain communication in the pursuit of collective benefit

  • Shuyi Li
  • Linwei Yu
  • Xiaorong Gan
  • Yingying Hou
  • Yafeng Pan
  • Yi Luo
  • Yi Hu

People perform better collectively than individually, a phenomenon known as the collective benefit. To pursue the benefit, they may learn from previous behaviors, come to know whose initial opinion should be valued, and develop the inclination to take it as the collective one. Such learning may affect interpersonal brain communication. To test these hypotheses, this study recruited participant dyads to conduct a perceptual task on which they made individual decisions first and then the collective one. The enhanced interpersonal brain synchronization (IBS) between participants was explored when individual decisions were in disagreement vs. agreement. Computational modeling revealed that participant dyads developed the dyad inclination of taking the higher-able participants', not the lower-able ones' decisions as their collective ones. Brain analyses unveiled the enhanced IBS at frontopolar areas, premotor areas, supramarginal gyri, and right temporal-parietal junctions. The premotor IBS correlated negatively with dyad inclination and collective benefit in the absence of correction. The Granger causality analyses further supported the negative relation of dyad inclination with inter-brain communication. This study highlights that dyads learn to weigh individuals' decisions, resulting in dyad inclinations, and explores associated inter-brain communication, offering insights into the dynamics of collective decision-making.

IROS Conference 2023 Conference Paper

Autonomous Ultrasound Scanning Towards Standard Plane Using Interval Interaction Probabilistic Movement Primitives

  • Yi Hu
  • Mahdi Tavakoli

Learning from demonstrations is the paradigm where robots acquire new skills demonstrated by an expert and alleviate the physical burden on experts to perform repetitive tasks. Ultrasound scanning is one of the ways to view the anatomical structures of soft tissues, but it is repetitive for some tissue scanning tasks. In this study, a framework for autonomous ultrasound scanning towards a standard plane is proposed. Interaction probabilistic movement primitives (iProMP) were previously proposed for collaborative tasks involving human and robot movement. Inspired by the interval type-2 fuzzy system, an interval iProMP is proposed to learn the ultrasound scanning navigation strategy from scanning demonstrations, where the collaborative agents are the robot movement and the ultrasound image information. The proposed interval iProMP improves the capacity for dealing with uncertainties due to insufficient observations during reproduction. U-Net is applied to recognize the desired ultrasound image shown during demonstrations, and a confidence map is used to evaluate the ultrasound image quality. Breast seroma scanning is chosen as the ultrasound scanning task to validate the performance of the proposed autonomous ultrasound scanning framework. Ultrasound navigation aims to realize autonomous ultrasound scanning for localizing the breast seroma. The simulation comparison result shows the better performance of the proposed interval iProMP under insufficient observation, compared to traditional iProMP. The experiment result validates the feasibility and generality of the proposed autonomous ultrasound scanning framework using interval iProMP, with a higher success rate than that with traditional iProMP.

YNIMG Journal 2022 Journal Article

Group polarization calls for group-level brain communication

  • Yingying Hou
  • Dingning Zhang
  • Xiaorong Gan
  • Yi Hu

Groups of people show a shift towards extreme decisions compared with individuals. Previous studies have revealed two directions of group polarization, i.e., risky shift and cautious shift, but how groups of brains drive these shifts remains unknown. In the current study, we arranged risk advantage and disadvantage situations to elicit group polarization of risky shift and cautious shift respectively, and examined the averaged inter-brain synchronization (ABS) among participant triads during group decision making versus individual decision making. The elicited group polarizations were accompanied by enhanced ABS at bilateral prefrontal areas and the left temporoparietal junction (TPJ). Specifically, the TPJ ABS was equivalent in risky shift and cautious shift and, based on machine learning analyses, could predict the extent of group polarization; for both shifts, it negatively correlated with negative emotion. However, the right prefrontal ABS was stronger in risky shift than in cautious shift, and the same area showed larger brain deactivation in the former shift, indicating weaker executive control. For the left prefrontal ABS, only equivalent ABS was found for the two shifts. In sum, group polarization of risky shift and cautious shift calls for inter-brain communication at the group level, and the former shift is accompanied by deactivation and more brain synchronization. Our study suggests emotional and cognitive adjustment in decision making of the group compared with individuals.

YNIMG Journal 2022 Journal Article

Integration of social status and trust through interpersonal brain synchronization

  • Xiaojun Cheng
  • Yujiao Zhu
  • Yinying Hu
  • Xiaolin Zhou
  • Yafeng Pan
  • Yi Hu

Trust can be a dynamic social process, during which the social identity of the interacting agents (e.g., an investor and a trustee) can bias trust outcomes. Here, we investigated how social status modulates trust and the neural mechanisms underlying this process. An investor and a trustee performed a 10-round repeated trust game while their brain activity was being simultaneously recorded using functional near-infrared spectroscopy. The social status (either high or low) of both investors and trustees was manipulated via a math competition task. The behavioral results showed that in the initial round, individuals invested more in low-status partners. However, the investment ratio increased faster as the number of rounds increased during trust interaction when individuals were paired with a high-status partner. This increasing trend was particularly prominent in the low (investor)-high (trustee) status group. Moreover, the low-high group showed increased investor-trustee brain synchronization in the right temporoparietal junction as the number of rounds increased, while brain activation in the right dorsolateral prefrontal cortex of the investor decreased as the number of rounds increased. Both interpersonal brain synchronization and brain activation predicted investment performance at the early stage; furthermore, two-brain data provided earlier predictions than did single-brain data. These effects were detectable in the investment phase in the low-high group only; no comparable effects were observed in the repayment phase or other groups. Overall, this study demonstrated a multi-brain mechanism for the integration of social status and trust.

YNIMG Journal 2020 Journal Article

Instructor-learner brain coupling discriminates between instructional approaches and predicts learning

  • Yafeng Pan
  • Suzanne Dikker
  • Pavel Goldstein
  • Yi Zhu
  • Cuirong Yang
  • Yi Hu

The neural mechanisms that support naturalistic learning via effective pedagogical approaches remain elusive. Here we used functional near-infrared spectroscopy to measure brain activity from instructor-learner dyads simultaneously during dynamic conceptual learning. Results revealed that brain-to-brain coupling was correlated with learning outcomes, and, crucially, appeared to be driven by specific scaffolding behaviors on the part of the instructors (e.g., asking guiding questions or providing hints). Brain-to-brain coupling enhancement was absent when instructors used an explanation approach (e.g., providing definitions or clarifications). Finally, we found that machine-learning techniques were more successful when decoding instructional approaches (scaffolding vs. explanation) from brain-to-brain coupling data than when using a single-brain method. These findings suggest that brain-to-brain coupling as a pedagogically relevant measure tracks the naturalistic instructional process during instructor-learner interaction throughout constructive engagement, but not information clarification.

YNIMG Journal 2020 Journal Article

The averaged inter-brain coherence between the audience and a violinist predicts the popularity of violin performance

  • Yingying Hou
  • Bei Song
  • Yinying Hu
  • Yafeng Pan
  • Yi Hu

Why is some music well-received whereas other music is not? Previous research has indicated the close temporal dependencies of neural activity among performers and among audiences. However, it is unknown whether similar neural contingencies exist between performers and audiences. Here, we used dual near-infrared spectroscopy (NIRS) to assess whether inter-brain synchronization between violinist and audience underlies the popularity of violin performance. In the experiment, individual audience members (16 females) watched pre-recorded videos, each lasting 100 s or so, in which a violinist performed 12 musical pieces. The results showed that the popularity of the performance correlated with the left-temporal inter-brain coherence (IBC) between the audience and the violinist. The correlation was stronger at late watching (>50 s) than at early watching (≤50 s). The smaller the Granger causality from the audience to the violinist was, the higher was the popularity of the piece with the audience. Discriminant analysis showed that the IBC could distinguish high popularity from low popularity. Further analysis using support vector regression showed that the IBC could also predict the popularity. These findings reveal the association of IBC with the popularity of violin performance. Music appreciation involves the brains of music producers and perceivers in a temporally aligned network through which audiences perceive the intentions of the performer and show positive emotions related to the musical performance.

YNIMG Journal 2018 Journal Article

Interpersonal synchronization of inferior frontal cortices tracks social interactive learning of a song

  • Yafeng Pan
  • Giacomo Novembre
  • Bei Song
  • Xianchun Li
  • Yi Hu

Much of human learning emerges as a result of interaction with others. Yet, this interpersonal process has been poorly characterized from a neurophysiological perspective. This study investigated (i) whether Interpersonal Brain Synchronization (IBS) can reliably mark social interactive learning, and specifically (ii) during what kind of interactive behavior. We recorded brain activity from learner-instructor dyads using functional Near-Infrared Spectroscopy (fNIRS) during the acquisition of a song. We made four fundamental observations. First, during the interactive learning task, brain activity recorded from the bilateral Inferior Frontal Cortex (IFC) synchronized across the learner and the instructor. Second, such IBS was observed in particular when the learner was observing the instructor's vocal behavior and when the learning experience entailed a turn-taking and more active mode of interaction. Third, this specific enhancement of IBS predicted the learner's behavioral performance. Fourth, Granger causality analyses further disclosed that the signal recorded from the instructor's brain better predicted that recorded from the learner's brain than vice versa. Together, these results indicate that social interactive learning can be neurophysiologically characterized in terms of IBS. Furthermore, they suggest that the learner's involvement in the learning experience, alongside the instructor's modeling, are key factors driving the alignment of neural processes across learner and instructor. Such alignment impacts upon the real-time acquisition of new information and eventually upon the learning (behavioral) performance. Hence, besides providing a biological characterization of social interactive learning, our results hold relevance for clinical and pedagogical practices.