Arrow Research search

Author name cluster

Yiling Lou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
1 author row

Possible papers

3

AAAI Conference 2026 Conference Paper

BAMAS: Structuring Budget-Aware Multi-Agent Systems

  • Liming Yang
  • Junyu Luo
  • Xuanzhe Liu
  • Yiling Lou
  • Zhenpeng Chen

Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.

AAAI Conference 2026 Conference Paper

Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-Based Test Oracles

  • Zihao Xu
  • Junchen Ding
  • Yiling Lou
  • Kun Zhang
  • Dong Gong
  • Yuekang Li

Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SMARTYPAT-BENCH, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling, such as fallacy-type imbalance and labor-intensive annotation, we introduce SMARTYPAT, an automated framework powered by logic programming-based oracles. SMARTYPAT utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural language sentences by LLMs, ensuring precise fallacy rep- resentation. Extensive evaluation demonstrates that SMARTYPAT produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

NeurIPS Conference 2025 Conference Paper

Can Agent Fix Agent Issues?

  • Alfin Wijaya Rahardja
  • Junwei Liu
  • Weitong Chen
  • Zhenpeng Chen
  • Yiling Lou

LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i. e. ,bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e. g. , SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AgentIssue-bench, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AgentIssue-bench and reveal their limited effectiveness (. e. , with only 0. 67% - 4. 67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues.