Arrow Research search

Author name cluster

Daniel Khashabi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers (20)

TMLR Journal 2026 Journal Article

Genomic Next-Token Predictors are In-Context Learners

  • Nathan Breslow
  • Aayush Mishra
  • Michael Schatz
  • Anqi Liu
  • Mahler Revsine
  • Daniel Khashabi

In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in *human* language. This raises a fundamental question: can ICL arise *organically* in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data.
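
As a rough illustration of the experimental setup the abstract sketches, the toy harness below instantiates a symbolic pattern-induction task in the nucleotide alphabet and tracks accuracy as the number of in-context demonstrations grows. It is a minimal sketch, not the authors' framework: `query_model` is a hypothetical stub where a genomic LM such as Evo2 would be called, and the reversal rule is an invented example task.

```python
# Minimal sketch (not the authors' code): a symbolic pattern-induction task
# instantiated over the nucleotide alphabet, measuring how accuracy changes
# with the number of in-context demonstrations.
import random

ALPHABET = "ATCG"

def make_example(rng):
    # Toy rule for illustration: the "answer" is the reverse of the input triplet.
    x = "".join(rng.choice(ALPHABET) for _ in range(3))
    return x, x[::-1]

def build_prompt(demos, query):
    lines = [f"{x}>{y}" for x, y in demos]
    lines.append(f"{query}>")
    return "\n".join(lines)

def query_model(prompt):
    # Hypothetical stub: a real run would call the genomic LM here.
    return "AAA"

def accuracy_at_k(k, n_trials=100, seed=0):
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        demos = [make_example(rng) for _ in range(k)]
        x, y = make_example(rng)
        correct += query_model(build_prompt(demos, x)) == y
    return correct / n_trials

for k in (1, 2, 4, 8, 16):  # log-spaced, to expose log-linear gains
    print(k, accuracy_at_k(k))
```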

ICLR Conference 2025 Conference Paper

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

  • Jingyu Zhang
  • Ahmed Elgohary
  • Ahmed Magooda
  • Daniel Khashabi
  • Benjamin Van Durme

The current paradigm for safety alignment of large language models (LLMs) follows a _one-size-fits-all_ approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with _static_ safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose _Controllable Safety Alignment_ (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow _safety configs_—free-form natural language descriptions of the desired safety behaviors—that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable that, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, a _human-authored_ benchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign leads to substantial gains in controllability over strong baselines including in-context alignment. Our framework encourages better representation of and adaptation to pluralistic human values in LLMs, thereby increasing their practicality.
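
The interface the abstract describes is simple enough to show directly: a free-form safety config carried in the system prompt and editable at inference time. This is a minimal sketch under an assumed chat-message schema; the config wording is illustrative, not taken from the paper.

```python
# Minimal sketch of the CoSA-style interface: a free-form "safety config"
# placed in the system prompt and swapped at inference time, no retraining.
# The config text and message format below are illustrative assumptions.
safety_config = (
    "Allowed: graphic violence in a fiction-writing context. "
    "Disallowed: instructions for real-world harm, hate speech."
)

messages = [
    {"role": "system", "content": f"Follow this safety config:\n{safety_config}"},
    {"role": "user", "content": "Write a battle scene for my war novel."},
]
# An authorized user adapts safety behavior by editing `safety_config` alone;
# a model aligned with CoSAlign is expected to follow the new config.
print(messages)
```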

NeurIPS Conference 2025 Conference Paper

FEEDBACK FRICTION: LLMs Struggle to Fully Incorporate External Feedback

  • Dongwei Jiang
  • Bowei Zhang
  • Andrew Wang
  • Nicholas Andrews
  • Daniel Khashabi

Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and reach correct solutions. In this paper, we systematically investigate LLMs’ ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 with extended thinking. Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We analyze FEEDBACK FRICTION and find that models’ confidence on specific questions, measured by semantic entropy, predicts feedback resistance: high-confidence predictions remain resistant to external correction. We hope that highlighting this issue in LLMs will help future research in self-improvement.
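
A minimal sketch of the solver-feedback loop described above, with hypothetical `solve` and `give_feedback` stubs standing in for the actual LLM calls; the paper's finding is that a loop like this often fails to converge even with near-complete feedback.

```python
# Minimal sketch of the evaluation loop: a solver attempts a problem, a
# feedback generator with access to the ground truth critiques the attempt,
# and the solver retries. Both functions are hypothetical stubs.
def solve(problem, feedback=None):
    return "42"  # placeholder for an LLM call conditioned on feedback

def give_feedback(attempt, ground_truth):
    if attempt == ground_truth:
        return None
    return f"Your answer {attempt!r} is wrong; the key step you missed is ..."

def run_episode(problem, ground_truth, max_rounds=5):
    feedback = None
    for round_idx in range(max_rounds):
        attempt = solve(problem, feedback)
        feedback = give_feedback(attempt, ground_truth)
        if feedback is None:
            return round_idx + 1, True   # solved after this many rounds
    return max_rounds, False             # feedback friction: never converged

print(run_episode("What is 6 * 7?", "42"))
```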

ICLR Conference 2025 Conference Paper

GenEx: Generating an Explorable World

  • Taiming Lu
  • Tianmin Shu
  • Alan L. Yuille
  • Daniel Khashabi
  • Jieneng Chen

Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing *GenEx*, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms expectations about the surrounding environments. *GenEx* generates high-quality, continuous 360-degree virtual environments, achieving robust loop consistency and active 3D mapping over extended trajectories. Leveraging generative imagination, GPT-assisted agents can undertake complex embodied tasks, including goal-agnostic exploration and goal-driven navigation. Agents utilize imagined observations to update their beliefs, simulate potential outcomes, and enhance their decision-making. Training on the synthetic urban dataset *GenEx-DB* and evaluation on *GenEx-EQA* demonstrate that our approach significantly improves agents' planning capabilities, providing a transformative platform toward intelligent, imaginative embodied exploration.

AAAI Conference 2025 Conference Paper

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

  • Dongwei Jiang
  • Jingyu Zhang
  • Orion Weller
  • Nathaniel Weir
  • Benjamin Van Durme
  • Daniel Khashabi

Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives than at generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than at generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
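
The unified generative-vs-discriminative comparison can be sketched as follows; `generate` and `discriminate` are hypothetical stand-ins for sampling from and querying the same LLM, and the random stubs only fix the measurement protocol, not real model behavior.

```python
# Minimal sketch: generative accuracy is whether a fresh sample is correct;
# discriminative accuracy is whether the model picks the correct answer out
# of its own earlier samples (scored only when a correct option exists).
import random

rng = random.Random(0)

def generate(task):
    return rng.choice(["right", "wrong"])   # stand-in for sampling the LLM

def discriminate(task, candidates):
    return rng.choice(candidates)           # stand-in for the LLM's choice

def compare(tasks, n_candidates=4):
    gen_acc = sum(generate(t) == "right" for t in tasks) / len(tasks)
    hits = trials = 0
    for t in tasks:
        candidates = [generate(t) for _ in range(n_candidates)]
        if "right" in candidates:           # a correct alternative exists
            trials += 1
            hits += discriminate(t, candidates) == "right"
    return gen_acc, hits / max(trials, 1)

print(compare(list(range(200))))
```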

ICML Conference 2025 Conference Paper

SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

  • Tianjian Li
  • Daniel Khashabi

Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 on AlpacaEval 2.0. Moreover, it surpasses prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.
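
Since the abstract presents SimpleMix as literally mixing the two data sources before standard preference training, a minimal sketch looks like the following; the 50/50 mixing ratio and the tuple layout are assumptions for illustration.

```python
# Minimal sketch of the SimpleMix idea as the abstract describes it: build a
# preference-training set by mixing on-policy pairs (sampled from the policy
# being trained) with off-policy pairs (from a fixed dataset).
import random

def simple_mix(on_policy_pairs, off_policy_pairs, on_frac=0.5, seed=0):
    rng = random.Random(seed)
    n = min(len(on_policy_pairs), len(off_policy_pairs)) * 2
    n_on = int(n * on_frac)
    mixed = rng.sample(on_policy_pairs, n_on) + rng.sample(off_policy_pairs, n - n_on)
    rng.shuffle(mixed)
    return mixed  # feed to a standard DPO trainer

on = [("prompt", "chosen_on", "rejected_on")] * 100
off = [("prompt", "chosen_off", "rejected_off")] * 100
print(len(simple_mix(on, off)))
```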

AAAI Conference 2025 Conference Paper

WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment

  • Jiefu Ou
  • Arda Uzunoğlu
  • Benjamin Van Durme
  • Daniel Khashabi

AI systems make decisions in physical environments through primitive actions or affordances that are accessed via API calls. While deploying AI agents in the real world involves numerous high-level actions, existing embodied simulators offer a limited set of domain-salient APIs. This naturally raises the questions: how many primitive actions (APIs) are needed for a versatile embodied agent, and what should they look like? We explore this via a thought experiment: assuming that wikiHow tutorials cover a wide variety of human-written tasks, what is the space of APIs needed to cover these instructions? We propose a framework to iteratively induce new APIs by grounding wikiHow instructions into situated agent policies. Inspired by recent successes of large language models (LLMs) in embodied planning, we use few-shot prompting to steer GPT-4 to generate Pythonic programs as agent policies and bootstrap a universe of APIs by 1) reusing a seed set of APIs, and 2) fabricating new API calls when necessary. The focus of this thought experiment is on defining these APIs rather than their executability. We apply the proposed pipeline to instructions from wikiHow tutorials. On a small fraction (0.5%) of tutorials, we induce an action space of 300+ APIs necessary for capturing the rich variety of tasks in the physical world. A detailed automatic and human analysis of the induction output reveals that the proposed pipeline enables effective reuse and creation of APIs. Moreover, a manual review revealed that existing simulators support only a small subset of the induced APIs (9 of the top 50 frequent APIs), motivating the development of action-rich embodied environments.
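
A minimal sketch of the induction loop as the abstract describes it: write a Pythonic policy per tutorial, reuse known APIs, and register any newly fabricated calls. `write_policy` is a hypothetical stub for the GPT-4 prompting step, and the call extraction is deliberately naive.

```python
# Minimal sketch of iterative API induction: prompt an LLM for a policy per
# tutorial, then grow the API universe with any new calls the policy uses.
def write_policy(tutorial, known_apis):
    # Hypothetical stub for the few-shot GPT-4 step.
    return "grab(obj='cup'); pour(src='kettle', dst='cup')"

def extract_api_calls(policy_code):
    # Naive parse: take the function name before each '('.
    return {call.split("(")[0].strip() for call in policy_code.split(";")}

def induce_apis(tutorials, seed_apis):
    apis = set(seed_apis)
    for tutorial in tutorials:
        policy = write_policy(tutorial, apis)
        apis |= extract_api_calls(policy)  # fabricate new APIs when needed
    return apis

print(induce_apis(["How to make tea"], seed_apis={"walk_to", "grab"}))
```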

NeurIPS Conference 2024 Conference Paper

DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

  • Weiting Tan
  • Jingyu Zhang
  • Lingfeng Shen
  • Daniel Khashabi
  • Philipp Koehn

Non-autoregressive Transformers (NATs) have recently been applied to direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results due to complex data distributions (e.g., acoustic and linguistic variations in speech). In this work, we introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models. After training with a self-supervised noise estimation objective, DiffNorm constructs normalized target data by denoising synthetically corrupted speech features. Additionally, we propose to regularize NATs with classifier-free guidance, improving model robustness and translation quality by randomly dropping out source information during training. Our strategies result in a notable improvement of about $+7$ ASR-BLEU for English-Spanish (En-Es) translation and $+2$ ASR-BLEU for English-French (En-Fr) on the CVSS benchmark, while attaining over $14\times$ speedup for En-Es and $5\times$ speedup for En-Fr translations compared to autoregressive baselines.
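
The classifier-free guidance regularizer mentioned at the end is a standard construction, sketched below under assumed tensor shapes and an assumed 0.15 drop rate; this is not the paper's code.

```python
# Minimal sketch of classifier-free guidance for a conditional generator:
# during training, source conditioning is randomly dropped so the model also
# learns an unconditional distribution; at inference the two predictions are
# blended. Shapes and the drop rate are illustrative assumptions.
import torch

def drop_source(source_embeds, drop_prob=0.15):
    # Zero out the whole source sequence for a random subset of the batch.
    keep = (torch.rand(source_embeds.size(0), 1, 1) > drop_prob).float()
    return source_embeds * keep

def cfg_blend(cond_logits, uncond_logits, guidance_scale=1.5):
    # Standard classifier-free guidance combination at inference time.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

src = torch.randn(8, 120, 256)                 # (batch, frames, dim)
print(drop_source(src).abs().sum(dim=(1, 2)))  # some rows are zeroed out
```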

NeurIPS Conference 2024 Conference Paper

Efficient Large Multi-modal Models via Visual Context Compression

  • Jieneng Chen
  • Luoxin Ye
  • Ju He
  • Zhao-Yang Wang
  • Daniel Khashabi
  • Alan Yuille

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this work, we study the redundancy of visual tokens and efficient training in these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage via simple average pooling leads to only a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta, a light and staged training scheme that incorporates stage-wise visual context compression to progressively move from heavy to light compression of the visual tokens during training, yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs and improving inference efficiency.
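
The test-time probe behind the 70% figure is plain average pooling over the visual token sequence, which fits in a few lines; the shapes and the 4x reduction below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the test-time probe: average-pool the visual token
# sequence to cut its length (here by 4x) before it enters the LLM.
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens, stride=4):
    # tokens: (batch, num_visual_tokens, hidden_dim)
    x = tokens.transpose(1, 2)                       # (batch, dim, tokens)
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride)
    return x.transpose(1, 2)                         # (batch, tokens/stride, dim)

vis = torch.randn(2, 576, 1024)                      # e.g., a 24x24 ViT patch grid
print(pool_visual_tokens(vis).shape)                 # torch.Size([2, 144, 1024])
```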

ICLR Conference 2024 Conference Paper

Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

  • Tianjian Li
  • Haoran Xu
  • Philipp Koehn
  • Daniel Khashabi
  • Kenton Murray

Text generation models are notoriously vulnerable to errors in the training data. As massive amounts of web-crawled data become more commonplace, how can we enhance the robustness of models trained on such noisy text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement of the standard training objective that truncates noisy data. Compared to methods that use only the negative log-likelihood loss to estimate data quality, our method provides a more accurate estimation by considering the distribution of non-target tokens, which is often overlooked by previous work. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and previous soft and hard truncation methods. Furthermore, we show that our method improves the robustness of models against two of the most detrimental types of noise in machine translation, resulting in an increase of more than 2 BLEU points over the MLE baseline when up to 50% of noise is added to the data.
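
The quantity ENT gates on can be written down compactly: the L2 distance between the model's predicted token distribution and the one-hot target, with high-error tokens dropped from the loss. The sketch below is an assumed simplification (a fixed threshold rather than whatever schedule the paper uses).

```python
# Minimal sketch of error-norm truncation: compute the L2 distance between
# the predicted distribution and the one-hot target, then exclude tokens
# whose error norm exceeds a threshold from the loss. Threshold is assumed.
import torch
import torch.nn.functional as F

def error_norm(logits, targets):
    # logits: (batch, vocab); targets: (batch,). Result lies in [0, sqrt(2)].
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    return (probs - one_hot).norm(dim=-1)

def ent_loss(logits, targets, threshold=1.2):
    keep = (error_norm(logits, targets) < threshold).float()  # truncate noisy tokens
    nll = F.cross_entropy(logits, targets, reduction="none")
    return (nll * keep).sum() / keep.sum().clamp(min=1)

logits = torch.randn(16, 32000)
targets = torch.randint(0, 32000, (16,))
print(ent_loss(logits, targets))
```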

ICML Conference 2024 Conference Paper

Position: Do pretrained Transformers Learn In-Context by Gradient Descent?

  • Lingfeng Shen
  • Aayush Mishra
  • Daniel Khashabi

The emergence of In-Context Learning (ICL) in LLMs remains a remarkable phenomenon that is only partially understood. To explain ICL, recent studies have created theoretical connections to Gradient Descent (GD). We ask, do such connections hold up in actual pre-trained language models? We highlight the limiting assumptions in prior works that make their setup considerably different from the practical setup in which language models are trained. For example, their experimental verification uses an ICL objective (training models explicitly for ICL), which differs from the emergent ICL in the wild. Furthermore, the theoretical hand-constructed weights used in these studies have properties that don’t match those of real LLMs. We also look for evidence in real models. We observe that ICL and GD have different sensitivity to the order in which they observe demonstrations. Finally, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMa-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD modify the output distribution of language models differently. These results indicate that the equivalence between ICL and GD remains an open hypothesis and calls for further study.

ICLR Conference 2024 Conference Paper

The Trickle-down Impact of Reward Inconsistency on RLHF

  • Lingfeng Shen
  • Sihao Chen
  • Linfeng Song
  • Lifeng Jin
  • Baolin Peng
  • Haitao Mi
  • Daniel Khashabi
  • Dong Yu 0001

Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable yet understudied subject is the (in-)consistency of RMs --- whether they can recognize the semantic changes to different prompts and appropriately adapt their reward assignments --- and their impact on the downstream RLHF model. In this paper, we visit a series of research questions relevant to RM inconsistency: (1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from the RLHF model training? We propose **Contrast Instruction** -- a benchmarking strategy for the consistency of RM. Each example in **Contrast Instruction** features a pair of lexically similar instructions with different ground truth responses. A consistent RM is expected to rank the corresponding instruction and response higher than other combinations. We observe that current RMs trained with the standard ranking objective fail miserably on **Contrast Instruction** compared to average humans. To show that RM consistency can be improved efficiently without using extra training budget, we propose two techniques, **ConvexDA** and **RewardFusion**, which enhance reward consistency through extrapolation during the RM training and inference stages, respectively. We show that RLHF models trained with a more consistent RM yield more useful responses, suggesting that reward inconsistency exhibits a trickle-down effect on the downstream RLHF process.
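
The **Contrast Instruction** consistency check reduces to a ranking test over swapped instruction-response combinations, sketched below with a toy word-overlap `reward` standing in for a real RM.

```python
# Minimal sketch of the consistency test: for two lexically similar
# instructions with different ground-truth responses, a consistent RM should
# score each instruction's own response above the swapped combination.
def reward(instruction, response):
    # Toy word-overlap scorer standing in for a trained reward model.
    return float(len(set(instruction.split()) & set(response.split())))

def is_consistent(pair):
    (i1, r1), (i2, r2) = pair
    return reward(i1, r1) > reward(i1, r2) and reward(i2, r2) > reward(i2, r1)

pair = (("name a fruit that is red", "a ripe apple is red"),
        ("name a fruit that is yellow", "a banana is yellow"))
print(is_consistent(pair))
```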

TMLR Journal 2023 Journal Article

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

  • Aarohi Srivastava
  • Abhinav Rastogi
  • Abhishek Rao
  • Abu Awal Md Shoeb
  • Abubakar Abid
  • Adam Fisch
  • Adam R. Brown
  • Adam Santoro

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

ICLR Conference 2023 Conference Paper

Generating Sequences by Learning to Self-Correct

  • Sean Welleck
  • Ximing Lu
  • Peter West
  • Faeze Brahman
  • Tianxiao Shen
  • Daniel Khashabi
  • Yejin Choi 0001

Sequence generation applications require satisfying semantic constraints, such as ensuring that programs are correct, using certain keywords, or avoiding undesirable content. Language models, whether fine-tuned or prompted with few-shot demonstrations, frequently violate these constraints, and lack a mechanism to iteratively revise their outputs. Moreover, some powerful language models are of extreme scale or inaccessible, making it inefficient, if not infeasible, to update their parameters for task-specific adaptation. We present Self-Correction, an approach that decouples an imperfect base generator (an off-the-shelf language model or supervised sequence-to-sequence model) from a separate corrector that learns to iteratively correct imperfect generations. To train the corrector, we propose an online training procedure that can use either scalar or natural language feedback on intermediate imperfect generations. We show that Self-Correction improves upon the base generator in three diverse generation tasks - mathematical program synthesis, lexically-constrained generation, and toxicity control - even when the corrector is much smaller than the base generator.
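
A minimal sketch of the generator-corrector decomposition, with hypothetical `base_generate` and `correct` stubs in place of the off-the-shelf LM and the trained corrector.

```python
# Minimal sketch: a fixed base generator produces a draft, and a separate
# learned corrector revises it iteratively. Both functions are stubs.
def base_generate(prompt):
    return "draft solution"                  # off-the-shelf LM, never updated

def correct(prompt, hypothesis):
    return hypothesis + " (revised)"         # small trained corrector

def self_correct(prompt, n_steps=3):
    y = base_generate(prompt)
    for _ in range(n_steps):
        y = correct(prompt, y)               # corrector may also stop early
    return y

print(self_correct("Write a program that sums a list."))
```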

NeurIPS Conference 2022 Conference Paper

COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics

  • Lianhui Qin
  • Sean Welleck
  • Daniel Khashabi
  • Yejin Choi

Many applications of text generation require incorporating different constraints to control the semantics or style of generated text. These constraints can be hard (e.g., ensuring certain keywords are included in the output) or soft (e.g., contextualizing the output with the left- or right-hand context). In this paper, we present Energy-based Constrained Decoding with Langevin Dynamics (COLD), a decoding framework which unifies constrained generation as specifying constraints through an energy function, then performing efficient differentiable reasoning over the constraints through gradient-based sampling. COLD decoding is a flexible framework that can be applied directly to off-the-shelf left-to-right language models without the need for any task-specific fine-tuning, as demonstrated through three challenging text generation applications: lexically-constrained generation, abductive reasoning, and counterfactual reasoning. Our experiments on these constrained generation tasks point to the effectiveness of our approach, both in terms of automatic and human evaluation.
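
The core sampling rule, Langevin dynamics over a continuous relaxation of the output, is compact enough to sketch; the quadratic toy energy below stands in for COLD's fluency and constraint energies, and the step sizes are illustrative.

```python
# Minimal sketch of Langevin-dynamics sampling over a continuous variable:
# y <- y - eta * grad E(y) + noise. A toy quadratic energy replaces COLD's
# actual fluency + constraint energies over soft token representations.
import torch

def langevin_sample(energy_fn, y, steps=200, eta=0.05, noise_scale=0.01):
    for _ in range(steps):
        y = y.detach().requires_grad_(True)
        energy_fn(y).backward()
        with torch.no_grad():
            y = y - eta * y.grad + noise_scale * torch.randn_like(y)
    return y.detach()

target = torch.tensor([1.0, -2.0, 0.5])
energy = lambda y: ((y - target) ** 2).sum()  # toy energy, minimum at `target`
y0 = torch.zeros(3)
print(langevin_sample(energy, y0))            # approaches `target`
```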

IJCAI Conference 2020 Conference Paper

TransOMCS: From Linguistic Graphs to Commonsense Knowledge

  • Hongming Zhang
  • Daniel Khashabi
  • Yangqiu Song
  • Dan Roth

Commonsense knowledge acquisition is a key problem for artificial intelligence. Conventional methods of acquiring commonsense knowledge generally require laborious and costly human annotations, which are not feasible on a large scale. In this paper, we explore a practical way of mining commonsense knowledge from linguistic graphs, with the goal of transferring cheap knowledge obtained with linguistic patterns into expensive commonsense knowledge. The result is a conversion of ASER [Zhang et al., 2020], a large-scale selectional preference knowledge resource, into TransOMCS, of the same representation as ConceptNet [Liu and Singh, 2004] but two orders of magnitude larger. Experimental results demonstrate the transferability of linguistic knowledge to commonsense knowledge and the effectiveness of the proposed approach in terms of quantity, novelty, and quality. TransOMCS is publicly available at: https://github.com/HKUST-KnowComp/TransOMCS.

AAAI Conference 2018 Conference Paper

Question Answering as Global Reasoning Over Semantic Abstractions

  • Daniel Khashabi
  • Tushar Khot
  • Ashish Sabharwal
  • Dan Roth

We propose a novel method for exploiting the semantic structure of text to answer multiple-choice questions. The approach is especially suitable for domains that require reasoning over a diverse set of linguistic constructs but have limited training data. To address these challenges, we present the first system, to the best of our knowledge, that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pre-trained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, we translate question answering (QA) into a search for an optimal subgraph that satisfies certain global and local properties. This formulation generalizes several prior structured QA systems. Our system, SEMANTICILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad-coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%.

AAAI Conference 2016 Conference Paper

Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions

  • Peter Clark
  • Oren Etzioni
  • Tushar Khot
  • Ashish Sabharwal
  • Oyvind Tafjord
  • Peter Turney
  • Daniel Khashabi

What capabilities are required for an AI system to pass standard 4th Grade Science Tests? Previous work has examined the use of Markov Logic Networks (MLNs) to represent the requisite background knowledge and interpret test questions, but did not improve upon an information retrieval (IR) baseline. In this paper, we describe an alternative approach that operates at three levels of representation and reasoning: information retrieval, corpus statistics, and simple inference over a semi-automatically constructed knowledge base, to achieve substantially improved results. We evaluate the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam (using only non-diagram, multiple choice questions), and show that our overall system’s score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work. We conclude with a detailed analysis, illustrating the complementary strengths of each method in the ensemble. Our datasets are being released to enable further research.

IJCAI Conference 2016 Conference Paper

Question Answering via Integer Programming over Semi-Structured Knowledge

  • Daniel Khashabi
  • Tushar Khot
  • Ashish Sabharwal
  • Peter Clark
  • Oren Etzioni
  • Dan Roth

Answering science questions posed in natural language is an important AI challenge. Answering such questions often requires non-trivial inference and knowledge that goes beyond factoid retrieval. Yet, most systems for this task are based on relatively shallow Information Retrieval (IR) and statistical correlation techniques operating on large unstructured corpora. We propose a structured inference system for this task, formulated as an Integer Linear Program (ILP), that answers natural language questions using a semi-structured knowledge base derived from text, including questions requiring multi-step inference and a combination of multiple facts. On a dataset of real, unseen science questions, our system significantly outperforms (+14%) the best previous attempt at structured reasoning for this task, which used Markov Logic Networks (MLNs). It also improves upon a previous ILP formulation by 17.7%. When combined with unstructured inference methods, the ILP system significantly boosts overall performance (+10%). Finally, we show our approach is substantially more robust to a simple answer perturbation compared to statistical correlation methods.

NeurIPS Conference 2015 Conference Paper

Online Learning with Adversarial Delays

  • Kent Quanrud
  • Daniel Khashabi

We study the performance of standard online learning algorithms when the feedback is delayed by an adversary. We show that \texttt{online-gradient-descent} and \texttt{follow-the-perturbed-leader} achieve regret $O(\sqrt{D})$ in the delayed setting, where $D$ is the sum of delays of each round's feedback. This bound collapses to an optimal $O(\sqrt{T})$ bound in the usual setting of no delays (where $D = T$). Our main contribution is to show that standard algorithms for online learning already have simple regret bounds in the most general setting of delayed feedback, making adjustments to the analysis and not to the algorithms themselves. Our results help affirm and clarify the success of recent algorithms in optimization and machine learning that operate in a delayed feedback model.
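
The setting is easy to simulate: a gradient computed at round $t$ surfaces $d_t$ rounds later, and the learner applies feedback as it arrives. The sketch below is an illustrative toy (a 1-D quadratic loss with randomly drawn delays standing in for the adversary), not the paper's experiments.

```python
# Minimal sketch of online gradient descent with delayed feedback: the
# gradient of round t's loss, evaluated at the point played at round t, is
# delivered d_t rounds later and applied on arrival. Regret scales as
# O(sqrt(D)) with D the total delay; problem and delays here are toys.
import random

def delayed_ogd(grads, delays, eta=0.1, lo=-1.0, hi=1.0):
    x, inbox = 0.0, {}
    for t, (g, d) in enumerate(zip(grads, delays)):
        inbox.setdefault(t + d, []).append(g(x))   # feedback surfaces later
        for g_val in inbox.pop(t, []):             # apply whatever arrived
            x = min(hi, max(lo, x - eta * g_val))  # projected gradient step
    return x

rng = random.Random(0)
T = 1000
grads = [lambda x: 2 * (x - 0.3)] * T              # losses (x - 0.3)^2
delays = [rng.randint(1, 20) for _ in range(T)]    # random stand-in adversary
print(delayed_ogd(grads, delays))                  # converges near 0.3
```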