Arrow Research search

Author name cluster

Bowen Cao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers

5

TMLR Journal 2026 Journal Article

AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

  • Shijue Huang
  • Hongru WANG
  • Wanjun Zhong
  • Zhaochen Su
  • Jiazhan Feng
  • Bowen Cao
  • Yi R. Fung

With the advent of test-time scaling, Large Reasoning Models have achieved remarkable performance. However, the reinforcement learning process used to unlock these capabilities often leads to uncontrolled generation length, resulting in substantial computational overhead and unnecessary "overthinking" on simple tasks. Current methods either uniformly minimize reasoning tokens, thereby neglecting the necessity for more intricate reasoning on complex tasks, or employ precise token-level control, which often hinges on accurate difficulty estimation and suffers from unreliable model interpretation of nuanced instructions. To address these limitations, we introduce AdaCtrl, a novel framework that can dynamically adjust its reasoning length based on the model's self-assessed problem difficulty and also allow human-in-the-loop control of the budget to prioritize either efficiency or effectiveness. Specifically, we carefully develop a two-stage training pipeline: 1) a cold-start fine-tuning stage, where we first design explicit difficulty-aware tags (e.g., "[Easy]" or "[Hard]") to indicate the difficulty of problems, and train the model on a curated dataset to align its reasoning behavior with these difficulty levels; and 2) a difficulty-aware reinforcement learning stage, which further refines the model's adaptive reasoning behavior and calibrates its self-assessment of problem difficulty. In this way, AdaCtrl not only empowers the model to adaptively assess the difficulty of a problem and adjust its reasoning budget allocation, but also enables the user to explicitly control the desired reasoning mode by injecting the specific difficulty-aware tag. Empirical results across four benchmarks show that, compared to different types of baselines, AdaCtrl effectively balances performance and computational efficiency, leading to performance improvements while dynamically reducing response lengths by up to 90%.
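The tag-injection interface the abstract describes can be pictured with a minimal sketch. The tag names ([Easy]/[Hard]) come from the abstract; the specific budget values and the `build_prompt` helper are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch of tag-based budget control: injecting a
# difficulty-aware tag forces a reasoning mode, while omitting it
# leaves the model to self-assess difficulty. Budget values are
# assumed for illustration.

BUDGETS = {"[Easy]": 256, "[Hard]": 4096}  # assumed token budgets

def build_prompt(question, tag=None):
    """Return (prompt, token_budget); tag=None means adaptive mode,
    where the model emits its own difficulty tag and picks a budget."""
    if tag is None:
        return question, None
    if tag not in BUDGETS:
        raise ValueError(f"unknown difficulty tag: {tag}")
    return f"{tag} {question}", BUDGETS[tag]
```

Forcing "[Easy]" would cap generation at the small budget regardless of the model's own assessment, which is the human-in-the-loop control the abstract refers to.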

NeurIPS Conference 2024 Conference Paper

On the Worst Prompt Performance of Large Language Models

  • Bowen Cao
  • Deng Cai
  • Zhisong Zhang
  • Yuexian Zou
  • Wai Lam

The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in task-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets. To address these limitations, we introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries and emphasizes the importance of using the worst prompt performance to gauge the lower bound of model performance. Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance; for instance, a difference of 45.48% between the worst and best performance for the Llama-2-70B-chat model, with its worst performance dipping as low as 9.38%. We further illustrate the difficulty in identifying the worst prompt from both model-agnostic and model-dependent perspectives, emphasizing the absence of a shortcut to characterize the worst prompt. We also attempt to enhance the worst prompt performance using existing prompt engineering and prompt consistency methods, but find that their impact is limited. These findings underscore the need to create more resilient LLMs that can maintain high performance across diverse prompts.
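The worst-prompt metric the benchmark emphasizes reduces to taking the minimum score over an example's semantically equivalent phrasings. A minimal sketch, assuming per-paraphrase scores are already available from some evaluation (the input numbers below are illustrative, chosen to reproduce the spread reported in the abstract):

```python
# Sketch of the worst-prompt lower bound: for one example, score every
# semantically equivalent paraphrase, then report the worst, the best,
# and the spread between them.

def worst_prompt_score(paraphrase_scores):
    """paraphrase_scores: scores for one example's paraphrases."""
    worst, best = min(paraphrase_scores), max(paraphrase_scores)
    return {"worst": worst, "best": best, "spread": best - worst}

# Illustrative numbers matching the Llama-2-70B-chat figures above:
# worst 9.38%, spread 45.48 points.
stats = worst_prompt_score([0.0938, 0.3100, 0.5486])
```

Averaging over paraphrases would hide exactly the failure mode the paper targets; the minimum is what gauges reliability for an arbitrary real-world phrasing.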

ICLR Conference 2024 Conference Paper

Retrieval is Accurate Generation

  • Bowen Cao
  • Deng Cai 0002
  • Leyang Cui
  • Xuxin Cheng
  • Wei Bi
  • Yuexian Zou
  • Shuming Shi 0001

Standard language models generate text by selecting tokens from a fixed, finite, and standalone vocabulary. We introduce a novel method that selects context-aware phrases from a collection of supporting documents. One of the most significant challenges for this paradigm shift is determining the training oracles, because a string of text can be segmented in various ways and each segment can be retrieved from numerous possible documents. To address this, we propose to initialize the training oracles using linguistic heuristics and, more importantly, bootstrap the oracles through iterative self-reinforcement. Extensive experiments show that our model not only outperforms standard language models on a variety of knowledge-intensive tasks but also demonstrates improved generation quality in open-ended text generation. For instance, compared to the standard language model counterpart, our model raises the accuracy from 23.47% to 36.27% on OpenbookQA, and improves the MAUVE score from 42.61% to 81.58% in open-ended text generation. Remarkably, our model also achieves the best performance and the lowest latency among several retrieval-augmented baselines. In conclusion, we assert that retrieval is more accurate generation and hope that our work will encourage further research on this new paradigm shift.
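The segmentation ambiguity the abstract highlights can be made concrete with a toy sketch: generation proceeds by copying multi-word phrases found in supporting documents rather than emitting one vocabulary token at a time. The greedy longest-match scorer below is an illustrative stand-in for the paper's learned, bootstrapped oracle, not the actual method.

```python
# Toy phrase-retrieval generation: cover a target string with the
# longest phrases that appear verbatim in the supporting documents,
# falling back to single tokens for unseen words.

def candidate_phrases(docs, max_len=3):
    """Enumerate all word n-grams (n <= max_len) found in the documents."""
    phrases = set()
    for doc in docs:
        words = doc.split()
        for i in range(len(words)):
            for n in range(1, max_len + 1):
                if i + n <= len(words):
                    phrases.add(tuple(words[i:i + n]))
    return phrases

def generate(target, docs, max_len=3):
    """Greedily segment `target` into retrieved phrases (oracle-style)."""
    phrases = candidate_phrases(docs, max_len)
    words, out, i = target.split(), [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):  # prefer longer
            if tuple(words[i:i + n]) in phrases:
                out.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            out.append(words[i])  # word absent from all documents
            i += 1
    return out
```

Even in this toy setting, the same string admits many segmentations (three unigrams vs. one trigram), which is why the paper needs heuristic initialization plus iterative self-reinforcement to pick good training oracles.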

AAAI Conference 2023 Conference Paper

FiTs: Fine-Grained Two-Stage Training for Knowledge-Aware Question Answering

  • Qichen Ye
  • Bowen Cao
  • Nuo Chen
  • Weiyuan Xu
  • Yuexian Zou

Knowledge-aware question answering (KAQA) requires the model to answer questions over a knowledge base, which is essential for both open-domain QA and domain-specific QA, especially when language models alone cannot provide all the knowledge needed. Despite the promising results of recent KAQA systems, which tend to integrate linguistic knowledge from pre-trained language models (PLMs) and factual knowledge from knowledge graphs (KGs) to answer complex questions, a bottleneck exists in effectively fusing the representations from PLMs and KGs because of (i) the semantic and distributional gaps between them, and (ii) the difficulties in joint reasoning over the provided knowledge from both modalities. To address the above two problems, we propose a Fine-grained Two-stage training framework (FiTs) to boost KAQA system performance: The first stage, named knowledge adaptive post-training, aims at aligning representations from the PLM and the KG, thus bridging the modality gaps between them. The second stage, called knowledge-aware fine-tuning, aims to improve the model's joint reasoning ability based on the aligned representations. In detail, we fine-tune the post-trained model via two auxiliary self-supervised tasks in addition to the QA supervision. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on three benchmarks in the commonsense reasoning (i.e., CommonsenseQA, OpenbookQA) and medical question answering (i.e., MedQA-USMLE) domains.

AAAI Conference 2023 Conference Paper

FTM: A Frame-Level Timeline Modeling Method for Temporal Graph Representation Learning

  • Bowen Cao
  • Qichen Ye
  • Weiyuan Xu
  • Yuexian Zou

Learning representations for graph-structured data is essential for graph analytical tasks. While remarkable progress has been made on static graphs, research on temporal graphs is still at an early stage. The bottleneck of temporal graph representation learning approaches is the neighborhood aggregation strategy, through which graph attributes explicitly share and gather information. Existing neighborhood aggregation strategies fail to capture either the short-term features or the long-term features of temporal graph attributes, leading to unsatisfactory model performance and even poor robustness and domain generality of the representation learning method. To address this problem, we propose a Frame-level Timeline Modeling (FTM) method that helps to capture both short-term and long-term features and thus learns more informative representations on temporal graphs. In particular, we present a novel link-based framing technique to preserve the short-term features and then incorporate a timeline aggregator module to capture the intrinsic dynamics of graph evolution as long-term features. Our method can be easily assembled with most temporal GNNs. Extensive experiments on common datasets show that our method brings great improvements to the capability, robustness, and domain generality of backbone methods in downstream tasks. Our code can be found at https://github.com/yeeeqichen/FTM.