Arrow Research search

Author name cluster

Vu Le

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
1 author row

Possible papers

6

AAAI Conference 2025 Conference Paper

Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions

  • Bhuvanashree Murugadoss
  • Christian Poelitz
  • Ian Drosos
  • Vu Le
  • Nick McKenna
  • Carina Suzana Negreanu
  • Chris Parnin
  • Advait Sarkar

LLMs-as-a-judge is a recently popularized method which replaces human judgements in task evaluation with automatic evaluation using LLMs. Due to widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to have strong alignment with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompts, or reflect its preference for high-quality data similar to its fine-tune data. To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality.

AAAI Conference 2024 System Paper

EmFORE: Learning Email Folder Classification Rules by Demonstration

  • Mukul Singh
  • Gust Verbruggen
  • José Cambronero
  • Vu Le
  • Sumit Gulwani

Tools that help with email folder management are limited, as users have to manually write rules to assign emails to folders. We present EMFORE, an iterative learning system that automatically learns and updates such rules from observations. EMFORE is fast enough to suggest and update rules in real time and suppresses mails with low confidence to reduce the number of false positives. EMFORE can use different rule grammars, and thus be adapted to different clients, without changing the user experience. Previous methods do not learn rules, require complete retraining or multiple new examples after making a mistake, and do not distinguish between inbox and other folders. EMFORE learns rules incrementally and can make the neutral decision of leaving emails in the inbox, making it an ideal candidate for integration in email clients.

AAAI Conference 2024 Conference Paper

FLAME: A Small Language Model for Spreadsheet Formulas

  • Harshit Joshi
  • Abishai Ebenezer
  • José Cambronero Sanchez
  • Sumit Gulwani
  • Aditya Kanade
  • Vu Le
  • Ivan Radiček
  • Gust Verbruggen

Spreadsheets are a vital tool for end-user data management. Using large language models for formula authoring assistance in these environments can be difficult, as these models are expensive to train and challenging to deploy due to their size (up to billions of parameters). We present FLAME, a transformer-based model trained exclusively on Excel formulas that leverages domain insights to achieve competitive performance while being substantially smaller (60M parameters) and training on two orders of magnitude less data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction and noisy auto-encoding as pre-training objectives. We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval. FLAME can outperform much larger models, such as the Davinci (175B) and Cushman (12B) variants of Codex and CodeT5 (220M), in 10 of 14 evaluation settings for the repair and completion tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and GraphCodeBERT.

AAAI Conference 2023 Conference Paper

Repair Is Nearly Generation: Multilingual Program Repair with LLMs

  • Harshit Joshi
  • José Cambronero Sanchez
  • Sumit Gulwani
  • Vu Le
  • Gust Verbruggen
  • Ivan Radiček

Most programmers make mistakes when writing code. Some of these mistakes are small and require few edits to the original program – a class of errors recently termed last mile mistakes. These errors break the flow for experienced developers and can stump novice programmers. Existing automated repair techniques targeting this class of errors are language-specific and do not easily carry over to new languages. Transferring symbolic approaches requires substantial engineering and neural approaches require data and retraining. We introduce RING, a multilingual repair engine powered by a large language model trained on code (LLMC) such as Codex. Such a multilingual engine enables a flipped model for programming assistance, one where the programmer writes code and the AI assistance suggests fixes, compared to traditional code suggestion technology. Taking inspiration from the way programmers manually fix bugs, we show that a prompt-based strategy that conceptualizes repair as localization, transformation, and candidate ranking, can successfully repair programs in multiple languages with minimal effort. We present the first results for such a multilingual repair engine by evaluating on 6 different languages and comparing performance to language-specific repair engines. We show that RING can outperform language-specific repair engines for three of these languages.

AAAI Conference 2005 System Paper

A Learning and Reasoning System for Intelligence Analysis

  • Mihai Boicu
  • Cindy Ayers
  • Cristina Boicu
  • Bogdan Stanescu
  • Vu Le

This paper presents a personal cognitive assistant, called Disciple-LTA, that can acquire expertise in intelligence analysis directly from intelligence analysts, can train new analysts, and can help analysts find solutions to complex problems through mixed-initiative reasoning, making possible the synergistic integration of a human’s experience and creativity with an automated agent’s knowledge and speed, and facilitating the collaboration with complementary experts and their agents.