Arrow Research search

Author name cluster

Hal Daumé III

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
2 author rows

Possible papers

32

ICLR Conference 2025 Conference Paper

Natural Language Inference Improves Compositionality in Vision-Language Models

  • Paola Cascante-Bonilla
  • Yu Hou
  • Yang Trista Cao
  • Hal Daumé III
  • Rachel Rudinger

Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break them down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE along the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of $+19.2\%$ (group score) and $+12.9\%$ on EqBen (group score) over the best prior work (finetuned with targeted data).

ICLR Conference 2025 Conference Paper

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

  • Ruijie Zheng
  • Yongyuan Liang
  • Shuaiyi Huang
  • Jianfeng Gao 0001
  • Hal Daumé III
  • Andrey Kolobov
  • Furong Huang
  • Jianwei Yang

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models’ spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset, rivals the 7B OpenVLA baseline while significantly improving inference efficiency.

ICLR Conference 2024 Conference Paper

DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization

  • Guowei Xu 0001
  • Ruijie Zheng
  • Yongyuan Liang
  • Xiyao Wang
  • Zhecheng Yuan
  • Tianying Ji
  • Yu Luo
  • Xiaoyu Liu 0003

Visual reinforcement learning (RL) has shown promise in continuous control tasks. Despite its progress, current algorithms are still unsatisfactory in virtually every aspect of the performance such as sample efficiency, asymptotic performance, and their robustness to the choice of random seeds. In this paper, we identify a major shortcoming in existing visual RL methods that is the agents often exhibit sustained inactivity during early training, thereby limiting their ability to explore effectively. Expanding upon this crucial observation, we additionally unveil a significant correlation between the agents' inclination towards motorically inactive exploration and the absence of neuronal activity within their policy networks. To quantify this inactivity, we adopt dormant ratio as a metric to measure inactivity in the RL agent's network. Empirically, we also recognize that the dormant ratio can act as a standalone indicator of an agent's activity level, regardless of the received reward signals. Leveraging the aforementioned insights, we introduce DrM, a method that uses three core mechanisms to guide agents' exploration-exploitation trade-offs by actively minimizing the dormant ratio. Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmark environments, including DeepMind Control Suite, MetaWorld, and Adroit. Most importantly, DrM is the first model-free algorithm that consistently solves tasks in both the Dog and Manipulator domains from the DeepMind Control Suite as well as three dexterous hand manipulation tasks without demonstrations in Adroit, all based on pixel observations.

ICML Conference 2024 Conference Paper

Premier-TACO is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss

  • Ruijie Zheng
  • Yongyuan Liang
  • Xiyao Wang
  • Shuang Ma
  • Hal Daumé III
  • Huazhe Xu
  • John Langford 0001
  • Praveen Palanisamy

We present Premier-TACO, a multitask feature representation learning approach designed to improve few-shot policy learning efficiency in sequential decision-making tasks. Premier-TACO leverages a subset of multitask offline datasets for pretraining a general feature representation, which captures critical environmental dynamics and is fine-tuned using minimal expert demonstrations. It advances the temporal action contrastive learning (TACO) objective, known for state-of-the-art results in visual control tasks, by incorporating a novel negative example sampling strategy. This strategy is crucial in significantly boosting TACO’s computational efficiency, making large-scale multitask offline pretraining feasible. Our extensive empirical evaluation in a diverse set of continuous control benchmarks including Deepmind Control Suite, MetaWorld, and LIBERO demonstrate Premier-TACO’s effective- ness in pretraining visual representations, significantly enhancing few-shot imitation learning of novel tasks.

ICML Conference 2024 Conference Paper

PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control

  • Ruijie Zheng
  • Ching-An Cheng
  • Hal Daumé III
  • Furong Huang
  • Andrey Kolobov

Temporal action abstractions, along with belief state representations, are a powerful knowledge sharing mechanism for sequential decision making. In this work, we propose a novel view that treats inducing temporal action abstractions as a sequence compression problem. To do so, we bring a subtle but critical component of LLM training pipelines – input tokenization via byte pair encoding (BPE) – to bear on the seemingly distant task of learning skills of variable time span in continuous control domains. We introduce an approach called Primitive Sequence Encoding (PRISE) that combines continuous action quantization with BPE to learn powerful action abstractions. We empirically show that high-level skills discovered by PRISE from a multitask set of robotic manipulation demonstrations significantly boost the learning performance of behavior cloning on downstream tasks.

NeurIPS Conference 2023 Conference Paper

$\texttt{TACO}$: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning

  • Ruijie Zheng
  • Xiyao Wang
  • Yanchao Sun
  • Shuang Ma
  • Jieyu Zhao
  • Huazhe Xu
  • Hal Daumé III
  • Furong Huang

Despite recent progress in reinforcement learning (RL) from raw pixel data, sample inefficiency continues to present a substantial obstacle. Prior works have attempted to address this challenge by creating self-supervised auxiliary tasks, aiming to enrich the agent's learned representations with control-relevant information for future state prediction. However, these objectives are often insufficient to learn representations that can represent the optimal policy or value function, and they often consider tasks with small, abstract discrete action spaces and thus overlook the importance of action representation learning in continuous control. In this paper, we introduce $\texttt{TACO}$: $\textbf{T}$emporal $\textbf{A}$ction-driven $\textbf{CO}$ntrastive Learning, a simple yet powerful temporal contrastive learning approach that facilitates the concurrent acquisition of latent state and action representations for agents. $\texttt{TACO}$ simultaneously learns a state and an action representation by optimizing the mutual information between representations of current states paired with action sequences and representations of the corresponding future states. Theoretically, $\texttt{TACO}$ can be shown to learn state and action representations that encompass sufficient information for control, thereby improving sample efficiency. For online RL, $\texttt{TACO}$ achieves 40% performance boost after one million environment interaction steps on average across nine challenging visual continuous control tasks from Deepmind Control Suite. In addition, we show that $\texttt{TACO}$ can also serve as a plug-and-play module adding to existing offline visual RL methods to establish the new state-of-the-art performance for offline visual RL across offline datasets with varying quality.

NeurIPS Conference 2023 Conference Paper

ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition

  • Aashaka Desai
  • Lauren Berger
  • Fyodor Minakov
  • Nessa Milano
  • Chinmay Singh
  • Kriston Pumphrey
  • Richard Ladner
  • Hal Daumé III

Sign languages are used as a primary language by approximately 70 million D/deaf people world-wide. However, most communication technologies operate in spoken and written languages, creating inequities in access. To help tackle this problem, we release ASL Citizen, the first crowdsourced Isolated Sign Language Recognition (ISLR) dataset, collected with consent and containing 83, 399 videos for 2, 731 distinct signs filmed by 52 signers in a variety of environments. We propose that this dataset be used for sign language dictionary retrieval for American Sign Language (ASL), where a user demonstrates a sign to their webcam to retrieve matching signs from a dictionary. We show that training supervised machine learning classifiers with our dataset advances the state-of-the-art on metrics relevant for dictionary retrieval, achieving 63\% accuracy and a recall-at-10 of 91\%, evaluated entirely on videos of users who are not present in the training or validation sets.

ICML Conference 2022 Conference Paper

A Framework for Learning to Request Rich and Contextually Useful Information from Humans

  • Khanh X. Nguyen
  • Yonatan Bisk
  • Hal Daumé III

When deployed, AI agents will encounter problems that are beyond their autonomous problem-solving capabilities. Leveraging human assistance can help agents overcome their inherent limitations and robustly cope with unfamiliar situations. We present a general interactive framework that enables an agent to request and interpret rich, contextually useful information from an assistant that has knowledge about the task and the environment. We demonstrate the practicality of our framework on a simulated human-assisted navigation problem. Aided with an assistance-requesting policy learned by our method, a navigation agent achieves up to a 7{\texttimes} improvement in success rate on tasks that take place in previously unseen environments, compared to fully autonomous behavior. We show that the agent can take advantage of different types of information depending on the context, and analyze the benefits and challenges of learning the assistance-requesting policy when the assistant can recursively decompose tasks into subtasks.

AAAI Conference 2021 Conference Paper

A Novice-Reviewer Experiment to Address Scarcity of Qualified Reviewers in Large Conferences

  • Ivan Stelmakh
  • Nihar B. Shah
  • Aarti Singh
  • Hal Daumé III

Conference peer review constitutes a human-computation process whose importance cannot be overstated: not only it identifies the best submissions for acceptance, but, ultimately, it impacts the future of the whole research area by promoting some ideas and restraining others. A surge in the number of submissions received by leading AI conferences has challenged the sustainability of the review process by increasing the burden on the pool of qualified reviewers which is growing at a much slower rate. In this work, we consider the problem of reviewer recruiting with a focus on the scarcity of qualified reviewers in large conferences. Specifically, we design a procedure for (i) recruiting reviewers from the population not typically covered by major conferences and (ii) guiding them through the reviewing pipeline. In conjunction with the ICML 2020 — a large, top-tier machine learning conference — we recruit a small set of reviewers through our procedure and compare their performance with the general population of ICML reviewers. Our experiment reveals that a combination of the recruiting and guiding mechanisms allows for a principled enhancement of the reviewer pool and results in reviews of superior quality compared to the conventional pool of reviews as evaluated by senior members of the program committee (meta-reviewers).

AAAI Conference 2021 Conference Paper

Meta-Learning Effective Exploration Strategies for Contextual Bandits

  • Amr Sharaf
  • Hal Daumé III

In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, MÊLÉE, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. MÊLÉE uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate MÊLÉE on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. MÊLÉE outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.

JMLR Journal 2019 Journal Article

Active Learning for Cost-Sensitive Classification

  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Tzu-Kuo Huang
  • Hal Daumé III
  • John Langford

We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label's cost and predicting the smallest. On a new example, it uses a set of regressors that perform well on past data to estimate possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. We empirically compare COAL to passive learning and several active learning baselines, showing significant improvements in labeling effort and test cost on real-world datasets. [abs] [ pdf ][ bib ] &copy JMLR 2019. ( edit, beta )

ICML Conference 2019 Conference Paper

Contextual Memory Trees

  • Wen Sun 0002
  • Alina Beygelzimer
  • Hal Daumé III
  • John Langford 0001
  • Paul Mineiro

We design and study a Contextual Memory Tree (CMT), a learning memory controller that inserts new memories into an experience store of unbounded size. It operates online and is designed to efficiently query for memories from that store, supporting logarithmic time insertion and retrieval operations. Hence CMT can be integrated into existing statistical learning algorithms as an augmented memory unit without substantially increasing training and inference computation. Furthermore CMT operates as a reduction to classification, allowing it to benefit from advances in representation or architecture. We demonstrate the efficacy of CMT by augmenting existing multi-class and multi-label classification algorithms with CMT and observe statistical improvement. We also test CMT learning on several image-captioning tasks to demonstrate that it performs computationally better than a simple nearest neighbors memory system while benefitting from reward learning.

ICML Conference 2019 Conference Paper

Non-Monotonic Sequential Text Generation

  • Sean Welleck
  • Kianté Brantley
  • Hal Daumé III
  • Kyunghyun Cho

Standard sequential generation methods assume a pre-specified generation order, such as text generation methods which generate words from left to right. In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. Our framework operates by generating a word at an arbitrary position, and then recursively generating words to its left and then words to its right, yielding a binary tree. Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy’s own preferences. Experimental results demonstrate that using the proposed method, it is possible to learn policies which generate text without pre-specifying a generation order, while achieving competitive performance with conventional left-to-right generation.

ICML Conference 2019 Conference Paper

Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback

  • Chicheng Zhang
  • Alekh Agarwal
  • Hal Daumé III
  • John Langford 0001
  • Sahand Negahban

We investigate the feasibility of learning from both fully-labeled supervised data and contextual bandit data. We specifically consider settings in which the underlying learning signal may be different between these two data sources. Theoretically, we state and prove no-regret algorithms for learning that is robust to divergences between the two sources. Empirically, we evaluate some of these algorithms on a large selection of datasets, showing that our approaches are feasible, and helpful in practice.

ICML Conference 2018 Conference Paper

Hierarchical Imitation and Reinforcement Learning

  • Hoang Minh Le 0002
  • Nan Jiang 0008
  • Alekh Agarwal
  • Miroslav Dudík
  • Yisong Yue
  • Hal Daumé III

We study how to effectively leverage expert feedback to learn sequential decision-making policies. We focus on problems with sparse rewards and long time horizons, which typically pose significant challenges in reinforcement learning. We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Our framework can incorporate different combinations of imitation learning (IL) and reinforcement learning (RL) at different levels, leading to dramatic reductions in both expert effort and cost of exploration. Using long-horizon benchmarks, including Montezuma’s Revenge, we demonstrate that our approach can learn significantly faster than hierarchical RL, and be significantly more label-efficient than standard IL. We also theoretically analyze labeling cost for certain instantiations of our framework.

ICML Conference 2017 Conference Paper

Active Learning for Cost-Sensitive Classification

  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Tzu-Kuo Huang
  • Hal Daumé III
  • John Langford 0001

We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label’s cost and predicting the smallest. On a new example, it uses a set of regressors that perform well on past data to estimate possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. Our experiment with COAL show significant improvements in labeling effort and test cost over passive and active baselines.

ICML Conference 2017 Conference Paper

Logarithmic Time One-Against-Some

  • Hal Daumé III
  • Nikos Karampatziakis
  • John Langford 0001
  • Paul Mineiro

We create a new online reduction of multiclass classification to binary classification for which training and prediction time scale logarithmically with the number of classes. We show that several simple techniques give rise to an algorithm which is superior to previous logarithmic time classification approaches while competing with one-against-all in space. The core construction is based on using a tree to select a small subset of labels with high recall, which are then scored using a one-against-some structure with high precision.

ICML Conference 2015 Conference Paper

Learning to Search Better than Your Teacher

  • Kai-Wei Chang 0001
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Hal Daumé III
  • John Langford 0001

Methods for learning to search for structured prediction typically imitate a reference policy, with existing theoretical guarantees demonstrating low regret compared to that reference. This is unsatisfactory in many applications where the reference policy is suboptimal and the goal of learning is to improve upon it. Can learning to search work even when the reference is poor? We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee. Consequently, LOLS can improve upon the reference policy, unlike previous algorithms. This enables us to develop structured contextual bandits, a partial information structured prediction setting with many potential applications.

ICML Conference 2013 Conference Paper

Predictable Dual-View Hashing

  • Mohammad Rastegari
  • Jonghyun Choi
  • Shobeir Fakhraei
  • Hal Daumé III
  • Larry S. Davis

We propose a Predictable Dual-View Hashing (PDH) algorithm which embeds proximity of data samples in the original spaces. We create a cross-view hamming space with the ability to compare information from previously incomparable domains with a notion of ‘predictability’. By performing comparative experimental analysis on two large datasets, PASCAL-Sentence and SUN-Attribute, we demonstrate the superiority of our method to the state-of-the-art dual-view binary code learning algorithms.

ICRA Conference 2012 Conference Paper

Towards a Watson that sees: Language-guided action recognition for robots

  • Ching Lik Teo
  • Yezhou Yang
  • Hal Daumé III
  • Cornelia Fermüller
  • Yiannis Aloimonos

For robots of the future to interact seamlessly with humans, they must be able to reason about their surroundings and take actions that are appropriate to the situation. Such reasoning is only possible when the robot has knowledge of how the World functions, which must either be learned or hard-coded. In this paper, we propose an approach that exploits language as an important resource of high-level knowledge that a robot can use, akin to IBM's Watson in Jeopardy! . In particular, we show how language can be leveraged to reduce the ambiguity that arises from recognizing actions involving hand-tools from video data. Starting from the premise that tools and actions are intrinsically linked, with one explaining the existence of the other, we trained a language model over a large corpus of English newswire text so that we can extract this relationship directly. This model is then used as a prior to select the best tool and action that explains the video. We formalize the approach in the context of 1) an unsupervised recognition and 2) a supervised classification scenario by an EM formulation for the former and integrating language features for the latter. Results are validated over a new hand-tool action dataset, and comparisons with state of the art STIP features showed significantly improved results when language is used. In addition, we discuss the implications of these results and how it provides a framework for integrating language into vision on other robotic applications.

IJCAI Conference 2009 Conference Paper

  • Piyush Rai
  • Hal Daumé III
  • Suresh Venkatasubramanian

We present a streaming model for large-scale classification (in the context of 2-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The 2-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB) problem, and an efficient algorithm based on the idea of core sets exists (CVM) [Tsang et al. , 2005]. CVM learns a (1+ε)-approximate MEB for a set of points and yields an approximate solution to corresponding SVM instance. However CVM works in batch mode requiring multiple passes over the data. This paper presents a single-pass SVM which is based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector in a way similar to using online stochastic gradient updates. Our algorithm performs polylogarithmic computation at each example, and requires very small and constant storage. Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and get accuracies comparable to other stateof-the-art SVM solvers (batch and online). We also give an analysis of the algorithm, and discuss some open issues and possible extensions.

IJCAI Conference 2009 Conference Paper

  • Arvind Agarwal
  • Hal Daumé III

We present an approach to semi-supervised learning based on an exponential family characterization. Our approach generalizes previous work on coupled priors for hybrid generative/discriminative models. Our model is more flexible and natural than previous approaches. Experimental results on several data sets show that our approach also performs better in practice.

UAI Conference 2009 Conference Paper

Bayesian Multitask Learning with Latent Hierarchies

  • Hal Daumé III

We learn multiple hypotheses for related tasks under a latent hierarchical relationship between tasks. We exploit the intuition that for domain adaptation, we wish to share classifier structure, but for multitask learning, we wish to share covariance structure. Our hierarchical model is seen to subsume several previously proposed multitask learning models and performs well on three distinct realworld data sets.

JMLR Journal 2005 Journal Article

A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

  • Hal Daumé III
  • Daniel Marcu

We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these "reference types") that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple---but general---parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. [abs] [ pdf ][ bib ] &copy JMLR 2005. ( edit, beta )