Arrow Research search

Author name cluster

Michel Galley

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

12

ICML Conference 2025 Conference Paper

CollabLLM: From Passive Responders to Active Collaborators

  • Shirley Wu
  • Michel Galley
  • Baolin Peng
  • Hao Cheng 0002
  • Gavin Li
  • Yao Dou
  • Weixin Cai
  • James Y. Zou

Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning these rewards, CollabLLM goes beyond responding to user requests, and actively uncovers user intent and offers insightful suggestions—a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks such as document creation. CollabLLM significantly outperforms our baselines with averages of 18. 5% higher task performance and 46. 3% improved interactivity by LLM judges. Finally, we conduct a large user study with 201 judges, where CollabLLM increases user satisfaction by 17. 6% and reduces user spent time by 10. 4%.

ICLR Conference 2025 Conference Paper

ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning

  • Xiao Yu 0011
  • Baolin Peng
  • Vineeth Vajipey
  • Hao Cheng 0002
  • Michel Galley
  • Jianfeng Gao 0001
  • Zhou Yu 0005

Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97\% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.

ICLR Conference 2024 Conference Paper

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

  • Pan Lu
  • Hritik Bansal
  • Tony Xia
  • Jiacheng Liu 0010
  • Chunyuan Li
  • Hannaneh Hajishirzi
  • Hao Cheng 0002
  • Kai-Wei Chang 0001

Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.

NeurIPS Conference 2023 Conference Paper

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

  • Pan Lu
  • Baolin Peng
  • Hao Cheng
  • Michel Galley
  • Kai-Wei Chang
  • Ying Nian Wu
  • Song-Chun Zhu
  • Jianfeng Gao

Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks due to emergent reasoning abilities. However, LLMs have inherent limitations as they are incapable of accessing up-to-date information (stored on the Web or in task-specific knowledge bases), using external tools, and performing precise mathematical and logical reasoning. In this paper, we present Chameleon, an AI system that mitigates these limitations by augmenting LLMs with plug-and-play modules for compositional reasoning. Chameleon synthesizes programs by composing various tools (e. g. , LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) for accomplishing complex reasoning tasks. At the heart of Chameleon is an LLM-based planner that assembles a sequence of tools to execute to generate the final response. We showcase the effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86. 54% overall accuracy on ScienceQA, improving the best published few-shot result by 11. 37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17. 0%, lifting the state of the art to 98. 78%. Our analysis also shows that the GPT-4-powered planner exhibits more consistent and rational tool selection via inferring potential constraints from instructions, compared to a ChatGPT-powered planner.

NeurIPS Conference 2023 Conference Paper

Guiding Large Language Models via Directional Stimulus Prompting

  • Zekun Li
  • Baolin Peng
  • Pengcheng He
  • Michel Galley
  • Jianfeng Gao
  • Xifeng Yan

We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) towards specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model (e. g. , T5) to generate an auxiliary directional stimulus prompt for each input instance. These directional stimulus prompts act as nuanced, instance-specific hints and clues to guide LLMs in generating desired outcomes, such as including specific keywords in the generated summary. Our approach sidesteps the challenges of direct LLM tuning by optimizing the policy model to explore directional stimulus prompts that align LLMs with desired behaviors. The policy model can be optimized through 1) supervised fine-tuning using labeled data and 2) reinforcement learning from offline or online rewards based on the LLM's output. We evaluate our method across various tasks, including summarization, dialogue response generation, and chain-of-thought reasoning. Our experiments indicate a consistent improvement in the performance of LLMs such as ChatGPT, Codex, and InstructGPT on these supervised tasks with minimal labeled data. Remarkably, by utilizing merely 80 dialogues from the MultiWOZ dataset, our approach boosts ChatGPT's performance by a relative 41. 4%, achieving or exceeding the performance of some fully supervised state-of-the-art models. Moreover, the instance-specific chain-of-thought prompt generated through our method enhances InstructGPT's reasoning accuracy, outperforming both generalized human-crafted prompts and those generated through automatic prompt engineering. The code and data are publicly available at https: //github. com/Leezekun/Directional-Stimulus-Prompting.

AAAI Conference 2022 Conference Paper

RetGen: A Joint Framework for Retrieval and Grounded Text Generation Modeling

  • Yizhe Zhang
  • Siqi Sun
  • Xiang Gao
  • Yuwei Fang
  • Chris Brockett
  • Michel Galley
  • Jianfeng Gao
  • Bill Dolan

Recent advances in large-scale pre-training such as GPT-3 allow seemingly high quality text to be generated from a given prompt. However, such generation systems often suffer from problems of hallucinated facts, and are not inherently designed to incorporate useful external information. Grounded generation models appear to offer remedies, but their training typically relies on rarely-available parallel data where information-relevant documents are provided for context. We propose a framework that alleviates this data constraint by jointly training a grounded generator and document retriever on the language model signal. The model learns to reward retrieval of the documents with the highest utility in generation, and attentively combines them using a Mixture-of-Experts (MoE) ensemble to generate follow-on text. We demonstrate that both generator and retriever can take advantage of this joint training and work synergistically to produce more informative and relevant text in both prose and dialogue generation.

AAAI Conference 2021 Conference Paper

A Controllable Model of Grounded Response Generation

  • Zeqiu Wu
  • Michel Galley
  • Chris Brockett
  • Yizhe Zhang
  • Xiang Gao
  • Chris Quirk
  • Rik Koncel-Kedziorski
  • Jianfeng Gao

Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process, often resulting in uninteresting responses. Attempts to boost informativeness alone come at the expense of factual accuracy, as attested by pretrained language models’ propensity to “hallucinate” facts. While this may be mitigated by access to background knowledge, there is scant guarantee of relevance and informativeness in generated responses. We propose a framework that we call controllable grounded response generation (CGRG), in which lexical control phrases are either provided by a user or automatically extracted by a control phrase predictor from dialogue context and grounding knowledge. Quantitative and qualitative results show that, using this framework, a transformer based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.

AAAI Conference 2021 Conference Paper

Data Augmentation for Abstractive Query-Focused Multi-Document Summarization

  • Ramakanth Pasunuru
  • Asli Celikyilmaz
  • Michel Galley
  • Chenyan Xiong
  • Yizhe Zhang
  • Mohit Bansal
  • Jianfeng Gao

The progress in Query-focused Multi-Document Summarization (QMDS) has been limited by the lack of sufficient largescale high-quality training datasets. We present two QMDS training datasets, which we construct using two data augmentation methods: (1) transferring the commonly used singledocument CNN/Daily Mail summarization dataset to create the QMDSCNN dataset, and (2) mining search-query logs to create the QMDSIR dataset. These two datasets have complementary properties, i. e. , QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries. To cover both these real summary and query aspects, we build abstractive end-to-end neural network models on the combined datasets that yield new state-of-theart transfer results on DUC datasets. We also introduce new hierarchical encoders that enable a more efficient encoding of the query together with multiple documents. Empirical results demonstrate that our data augmentation and encoding methods outperform baseline models on automatic metrics, as well as on human evaluations along multiple attributes.

AAAI Conference 2018 Conference Paper

A Knowledge-Grounded Neural Conversation Model

  • Marjan Ghazvininejad
  • Chris Brockett
  • Ming-Wei Chang
  • Bill Dolan
  • Jianfeng Gao
  • Wen-tau Yih
  • Michel Galley

Neural network models are capable of generating extremely natural sounding conversational interactions. However, these models have been mostly applied to casual scenarios (e. g. , as “chatbots”) and have yet to demonstrate they can serve in more useful conversational applications. This paper presents a novel, fully data-driven, and knowledge-grounded neural conversation model aimed at producing more contentful responses. We generalize the widely-used Sequence-to- Sequence (SEQ2SEQ) approach by conditioning responses on both conversation history and external “facts”, allowing the model to be versatile and applicable in an open-domain setting. Our approach yields significant improvements over a competitive SEQ2SEQ baseline. Human judges found that our outputs are significantly more informative.

NeurIPS Conference 2018 Conference Paper

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization

  • Yizhe Zhang
  • Michel Galley
  • Jianfeng Gao
  • Zhe Gan
  • Xiujun Li
  • Chris Brockett
  • Bill Dolan

Responses generated by neural conversational models tend to lack informativeness and diversity. We present Adversarial Information Maximization (AIM), an adversarial learning framework that addresses these two related but distinct problems. To foster response diversity, we leverage adversarial training that allows distributional matching of synthetic and real responses. To improve informativeness, our framework explicitly optimizes a variational lower bound on pairwise mutual information between query and response. Empirical results from automatic and human evaluations demonstrate that our methods significantly boost informativeness and diversity.

AAAI Conference 2006 Short Paper

Automatic Summarization of Conversational Multi-Party Speech

  • Michel Galley

My research interests lie in natural language processing, in particular text-to-text generation, summarization, machine translation, and computational models of syntax. My dissertation addresses the problem of generating abstractive summaries of meeting recordings, for which I am currently exploring probabilistic techniques that capitalize on discourse and syntax-level representations.

IJCAI Conference 2003 Conference Paper

Improving Word Sense Disambiguation in Lexical Chaining

  • Michel Galley
  • Kathleen McKeown

Previous algorithms to compute lexical chains suffer either from a lack of accuracy in word sense disambiguation (WSD) or from computational inefficiency. In this paper, we present a new lineartime algorithm for lexical chaining that adopts the assumption of one sense per discourse. Our results show an improvement over previous algorithms when evaluated on a WSD task.