Author name cluster

Carlos Guestrin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers

2 author rows

ICML Conference 2025 Conference Paper

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun 0020
Xinhao Li
Karan Dalal
Jiarui Xu
Arjun Vikram
Genghan Zhang
Yann Dubois
Xinlei Chen

Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.

NeurIPS Conference 2025 Conference Paper

metaTextGrad: Automatically optimizing language model optimizers

Guowei Xu
Mert Yuksekgonul
Carlos Guestrin
James Zou

Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.

ICLR Conference 2025 Conference Paper

Model Equality Testing: Which Model is this API Serving?

Irena Gao
Percy Liang
Carlos Guestrin

Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution, possibly without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs from Summer 2024 for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
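
The two-sample test described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the position-wise agreement kernel, the fixed comparison length of 8 characters, the biased MMD estimator, and the permutation-based p-value are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import random


def hamming_kernel(a: str, b: str, length: int = 8) -> float:
    """Assumed string kernel: fraction of positions where the two strings agree.

    Both strings are truncated or NUL-padded to `length` characters first.
    """
    a = a[:length].ljust(length, "\0")
    b = b[:length].ljust(length, "\0")
    return sum(x == y for x, y in zip(a, b)) / length


def mmd_squared(xs, ys, kernel=hamming_kernel) -> float:
    """Biased (V-statistic) estimate of the squared Maximum Mean Discrepancy."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def permutation_test(xs, ys, num_permutations: int = 200, seed: int = 0) -> float:
    """Permutation p-value for the null hypothesis that xs and ys share a distribution."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys)
    pooled = list(xs) + list(ys)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        # Re-split the pooled samples at random and recompute the statistic.
        if mmd_squared(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (at_least_as_extreme + 1) / (num_permutations + 1)


if __name__ == "__main__":
    api_samples = ["aaaa"] * 10        # hypothetical completions from an API
    reference_samples = ["bbbb"] * 10  # hypothetical completions from reference weights
    print(permutation_test(api_samples, reference_samples))
```

A small p-value suggests the two sets of completions come from different distributions; identical sample sets give a p-value of 1 under this sketch.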

NeurIPS Conference 2024 Conference Paper

Post-Hoc Reversal: Are We Selecting Models Prematurely?

Rishabh Ranjan
Saurabh Garg
Mrigank Raman
Carlos Guestrin
Zachary Lipton

Trained models are often composed with post-hoc transforms such as temperature scaling (TS), ensembling and stochastic weight averaging (SWA) to improve performance, robustness, uncertainty estimation, etc. However, such transforms are typically applied only after the base models have already been finalized by standard means. In this paper, we challenge this practice with an extensive empirical study. In particular, we demonstrate a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying post-hoc transforms. This phenomenon is especially prominent in high-noise settings. For example, while base models overfit badly early in training, both ensembling and SWA favor base models trained for more epochs. Post-hoc reversal can also prevent the appearance of double descent and mitigate mismatches between test loss and test error seen in base models. Preliminary analyses suggest that these transforms induce reversal by suppressing the influence of mislabeled examples, exploiting differences in their learning dynamics from those of clean examples. Based on our findings, we propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions such as early stopping, checkpointing, and broader hyperparameter choices. Our experiments span real-world vision, language, tabular and graph datasets. On an LLM instruction tuning dataset, post-hoc selection results in >1.5x MMLU improvement compared to naive selection.

NeurIPS Conference 2023 Conference Paper

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

Yann Dubois
Chen Xuechen Li
Rohan Taori
Tianyi Zhang
Ishaan Gulrajani
Jimmy Ba
Carlos Guestrin
Percy S. Liang

Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these bottlenecks with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design an LLM-based simulator for human feedback that is 45x cheaper than crowdworkers and displays high agreement with humans. Second, we identify an evaluation dataset representative of real-world instructions and propose an automatic evaluation procedure. Third, we contribute reference implementations for several methods (PPO, best-of-n, expert iteration, among others) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% win-rate improvement against Davinci003.

NeurIPS Conference 2023 Conference Paper

Beyond Confidence: Reliable Models Should Also Consider Atypicality

Mert Yuksekgonul
Linjun Zhang
James Y. Zou
Carlos Guestrin

While most machine learning models can provide confidence in their predictions, confidence is insufficient to understand a prediction's reliability. For instance, the model may have a low confidence prediction if the input is not well-represented in the training dataset or if the input is inherently ambiguous. In this work, we investigate the relationship between how atypical (rare) a sample or a class is and the reliability of a model's predictions. We first demonstrate that atypicality is strongly related to miscalibration and accuracy. In particular, we empirically show that predictions for atypical inputs or atypical classes are more overconfident and have lower accuracy. Using these insights, we show incorporating atypicality improves uncertainty quantification and model performance for discriminative neural networks and large language models. In a case study, we show that using atypicality improves the performance of a skin lesion classifier across different skin tone groups without having access to the group attributes. Overall, we propose that models should use not only confidence but also atypicality to improve uncertainty quantification and performance. Our results demonstrate that simple post-hoc atypicality estimators can provide significant value.

IJCAI Conference 2021 Conference Paper

Beyond Accuracy: Behavioral Testing of NLP Models with Checklist (Extended Abstract)

Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

ICML Conference 2021 Conference Paper

Learning Neural Network Subspaces

Mitchell Wortsman
Maxwell Horton
Carlos Guestrin
Ali Farhadi
Mohammad Rastegari

Recent observations have advanced our understanding of the neural network optimization landscape, revealing the existence of (1) paths of high accuracy containing diverse solutions and (2) wider minima offering improved performance. Previous methods observing diverse paths require multiple training runs. In contrast, we aim to leverage both properties (1) and (2) with a single method and in a single training run. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks. These neural network subspaces contain diverse solutions that can be ensembled, approaching the ensemble performance of independently trained networks without the training cost. Moreover, using the subspace midpoint boosts accuracy, calibration, and robustness to label noise, outperforming Stochastic Weight Averaging.

ICML Conference 2020 Conference Paper

AdaScale SGD: A User-Friendly Algorithm for Distributed Training

Tyler B. Johnson
Pulkit Agrawal 0002
Haijie Gu
Carlos Guestrin

When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient’s variance, AdaScale automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with AdaScale’s convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular “linear learning rate scaling” rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. AdaScale’s qualitative behavior is similar to that of "warm-up" heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making AdaScale an attractive choice for large-scale training in practice.

ICML Conference 2019 Conference Paper

Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment

Chen Huang 0001
Shuangfei Zhai
Walter Talbott
Miguel Ángel Bautista 0001
Shih-Yu Sun
Carlos Guestrin
Joshua M. Susskind

In most machine learning training paradigms a fixed, often handcrafted, loss function is assumed to be a good proxy for an underlying evaluation metric. In this work we assess this assumption by meta-learning an adaptive loss function to directly optimize the evaluation metric. We propose a sample efficient reinforcement learning approach for adapting the loss dynamically during training. We empirically show how this formulation improves performance by simultaneously optimizing the evaluation metric and smoothing the loss landscape. We verify our method in metric learning and classification scenarios, showing considerable improvements over the state-of-the-art on a diverse set of tasks. Importantly, our method is applicable to a wide range of loss functions and evaluation metrics. Furthermore, the learned policies are transferable across tasks and data, demonstrating the versatility of the method.

NeurIPS Conference 2019 Conference Paper

Adversarial Fisher Vectors for Unsupervised Representation Learning

Shuangfei Zhai
Walter Talbott
Carlos Guestrin
Joshua Susskind

We examine Generative Adversarial Networks (GANs) through the lens of deep Energy Based Models (EBMs), with the goal of exploiting the density model that follows from this formulation. In contrast to a traditional view where the discriminator learns a constant function when reaching convergence, here we show that it can provide useful information for downstream tasks, e.g., feature extraction for classification. To be concrete, in the EBM formulation, the discriminator learns an unnormalized density function (i.e., the negative energy term) that characterizes the data manifold. We propose to evaluate both the generator and the discriminator by deriving corresponding Fisher Score and Fisher Information from the EBM. We show that by assuming that the generated examples form an estimate of the learned density, both the Fisher Information and the normalized Fisher Vectors are easy to compute. We also show that we are able to derive a distance metric between examples and between sets of examples. We conduct experiments showing that the GAN-induced Fisher Vectors demonstrate competitive performance as unsupervised feature extractors for classification and perceptual similarity tasks. Code is available at https://github.com/apple/ml-afv.

AAAI Conference 2018 Conference Paper

Anchors: High-Precision Model-Agnostic Explanations

Marco Tulio Ribeiro
Sameer Singh
Carlos Guestrin

We introduce a novel model-agnostic system that explains the behavior of complex models with high-precision rules called anchors, representing local, “sufficient” conditions for predictions. We propose an algorithm to efficiently compute these explanations for any black-box model with high-probability guarantees. We demonstrate the flexibility of anchors by explaining a myriad of different models for different domains and tasks. In a user study, we show that anchors enable users to predict how a model would behave on unseen instances with less effort and higher precision, as compared to existing linear explanations or no explanations.

NeurIPS Conference 2018 Conference Paper

Learning to Optimize Tensor Programs

Tianqi Chen
Lianmin Zheng
Eddie Yan
Ziheng Jiang
Thierry Moreau
Luis Ceze
Carlos Guestrin
Arvind Krishnamurthy

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high-dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server-class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.

NeurIPS Conference 2018 Conference Paper

Training Deep Models Faster with Robust, Approximate Importance Sampling

Tyler Johnson
Carlos Guestrin

In theory, importance sampling speeds up stochastic gradient algorithms for supervised learning by prioritizing training examples. In practice, the cost of computing importances greatly limits the impact of importance sampling. We propose a robust, approximate importance sampling procedure (RAIS) for stochastic gradient descent. By approximating the ideal sampling distribution using robust optimization, RAIS provides much of the benefit of exact importance sampling with drastically reduced overhead. Empirically, we find RAIS-SGD and standard SGD follow similar learning curves, but RAIS moves faster through these paths, achieving speed-ups of at least 20% and sometimes much more.

ICML Conference 2017 Conference Paper

StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent

Tyler B. Johnson
Carlos Guestrin

Coordinate descent (CD) is a scalable and simple algorithm for solving many optimization problems in machine learning. Despite this fact, CD can also be very computationally wasteful. Due to sparsity in sparse regression problems, for example, the majority of CD updates often result in no progress toward the solution. To address this inefficiency, we propose a modified CD algorithm named “StingyCD.” By skipping over many updates that are guaranteed to not decrease the objective value, StingyCD significantly reduces convergence times. Since StingyCD only skips updates with this guarantee, however, StingyCD does not fully exploit the problem’s sparsity. For this reason, we also propose StingyCD+, an algorithm that achieves further speed-ups by skipping updates more aggressively. Since StingyCD and StingyCD+ rely on simple modifications to CD, it is also straightforward to use these algorithms with other approaches to scaling optimization. In empirical comparisons, StingyCD and StingyCD+ improve convergence times considerably for several L1-regularized optimization problems.
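
To make the "wasteful update" observation in this abstract concrete, here is a minimal sketch of plain cyclic coordinate descent for the Lasso objective 0.5 * ||Ax - b||^2 + lam * ||x||_1. It is not StingyCD: it performs every update and only counts the ones that leave a zero coordinate at zero. The function names and the counter are assumptions made for illustration.

```python
def soft_threshold(value: float, threshold: float) -> float:
    """Shrink `value` toward zero by `threshold` (proximal operator of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(A, b, lam: float, num_epochs: int = 50):
    """Plain cyclic coordinate descent; A is a list of rows, b a list of targets.

    Returns the solution and the number of updates that made no progress
    because the coordinate was zero and stayed zero.
    """
    n, d = len(A), len(A[0])
    x = [0.0] * d
    residual = list(b)  # residual = b - A x, and x starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(d)]
    zero_updates = 0
    for _ in range(num_epochs):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # an all-zero column never affects the objective
            # Correlation of column j with the residual, with x[j]'s own effect added back.
            rho = sum(A[i][j] * residual[i] for i in range(n)) + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            if new_xj == 0.0 and x[j] == 0.0:
                zero_updates += 1  # work was done, but nothing changed
                continue
            delta = new_xj - x[j]
            for i in range(n):
                residual[i] -= A[i][j] * delta
            x[j] = new_xj
    return x, zero_updates


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]
    b = [3.0, 0.5]
    print(lasso_cd(A, b, lam=1.0, num_epochs=10))
```

In this toy problem the second coordinate never leaves zero, so every pass spends one update on it without changing the iterate. That is the kind of work the abstract describes skipping.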

NeurIPS Conference 2016 Conference Paper

Unified Methods for Exploiting Piecewise Linear Structure in Convex Optimization

Tyler Johnson
Carlos Guestrin

We develop methods for rapidly identifying important components of a convex optimization problem for the purpose of achieving fast convergence times. By considering a novel problem formulation—the minimization of a sum of piecewise functions—we describe a principled and general mechanism for exploiting piecewise linear structure in convex optimization. This result leads to a theoretically justified working set algorithm and a novel screening test, which generalize and improve upon many prior results on exploiting structure in convex optimization. In empirical comparisons, we study the scalability of our methods. We find that screening scales surprisingly poorly with the size of the problem, while our working set algorithm convincingly outperforms alternative approaches.

ICML Conference 2015 Conference Paper

Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization

Tyler B. Johnson
Carlos Guestrin

By reducing optimization to a sequence of small subproblems, working set methods achieve fast convergence times for many challenging problems. Despite excellent performance, theoretical understanding of working sets is limited, and implementations often resort to heuristics to determine subproblem size, makeup, and stopping criteria. We propose Blitz, a fast working set algorithm accompanied by useful guarantees. Making no assumptions on data, our theory relates subproblem size to progress toward convergence. This result motivates methods for optimizing algorithmic parameters and discarding irrelevant variables as iterations progress. Applied to L1-regularized learning, Blitz convincingly outperforms existing solvers in sequential, limited-memory, and distributed settings. Blitz is not specific to L1-regularized learning, making the algorithm relevant to many applications involving sparsity or constraints.

NeurIPS Conference 2014 Conference Paper

Divide-and-Conquer Learning by Anchoring a Conical Hull

Tianyi Zhou
Jeff Bilmes
Carlos Guestrin

We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ "anchors" lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme "DCA" that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each of which can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.

ICML Conference 2014 Conference Paper

Stochastic Gradient Hamiltonian Monte Carlo

Tianqi Chen 0001
Emily B. Fox
Carlos Guestrin

Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limitation of HMC methods is the required gradient computation for simulation of the Hamiltonian dynamical system; such computation is infeasible in problems involving a large sample size or streaming data. Instead, we must rely on a noisy gradient estimate computed from a subset of the data. In this paper, we explore the properties of such a stochastic gradient HMC approach. Surprisingly, the natural implementation of the stochastic approximation can be arbitrarily bad. To address this problem we introduce a variant that uses second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution. Results on simulated data validate our theory. We also provide an application of our methods to a classification task using neural networks and to online Bayesian matrix factorization.

ICML Conference 2012 Conference Paper

Hierarchical Exploration for Accelerating Contextual Bandits

Yisong Yue
Sue Ann Hong
Carlos Guestrin

IJCAI Conference 2011 Conference Paper

Connecting the Dots between News Articles

Dafna Shahaf
Carlos Guestrin

The process of extracting useful knowledge from large datasets has become one of the most pressing problems in today's society. The problem spans entire sectors, from scientists to intelligence analysts and web users, all of whom are constantly struggling to keep up with the larger and larger amounts of content published every day. With this much data, it is often easy to miss the big picture. In this paper, we investigate methods for automatically connecting the dots - providing a structured, easy way to navigate within a new topic and discover hidden connections. We focus on the news domain: given two news articles, our system automatically finds a coherent chain linking them together. For example, it can recover the chain of events leading from the decline of home prices (2007) to the health-care debate (2009). We formalize the characteristics of a good chain and provide efficient algorithms to connect two fixed endpoints. We incorporate user feedback into our framework, allowing the stories to be refined and personalized. Finally, we evaluate our algorithm over real news data. Our user studies demonstrate the algorithm's effectiveness in helping users understand the news.

UAI Conference 2011 Conference Paper

Efficient Probabilistic Inference with Partial Ranking Queries

Jonathan Huang
Ashish Kapoor
Carlos Guestrin

Distributions over rankings are used to model data in various settings such as preference analysis and political elections. The factorial size of the space of rankings, however, typically forces one to make structural assumptions, such as smoothness, sparsity, or probabilistic independence about these underlying distributions. We approach the modeling problem from the computational principle that one should make structural assumptions which allow for efficient calculation of typical probabilistic queries. For ranking models, "typical" queries predominantly take the form of partial ranking queries (e.g., given a user's top-k favorite movies, what are his preferences over remaining movies?). In this paper, we argue that riffled independence factorizations proposed in recent literature [7, 8] are a natural structural assumption for ranking distributions, allowing for particularly efficient processing of partial ranking queries.

NeurIPS Conference 2011 Conference Paper

Linear Submodular Bandits and their Application to Diversified Retrieval

Yisong Yue
Carlos Guestrin

Diversified retrieval and online learning are two core research areas in the design of modern information retrieval systems. In this paper, we propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of feature-rich submodular utility models for diversified retrieval. We present an algorithm, called LSBGREEDY, and prove that it efficiently converges to a near-optimal model. As a case study, we applied our approach to the setting of personalized news recommendation, where the system must recommend small sets of news articles selected from tens of thousands of available articles each day. In a live user study, we found that LSBGREEDY significantly outperforms existing online learning approaches.

ICML Conference 2011 Conference Paper

Parallel Coordinate Descent for L1-Regularized Loss Minimization

Joseph K. Bradley
Aapo Kyrola
Danny Bickson
Carlos Guestrin

TIST Journal 2011 Journal Article

Submodularity and its applications in optimized information gathering

Andreas Krause
Carlos Guestrin

Where should we place sensors to efficiently monitor natural drinking water resources for contamination? Which blogs should we read to learn about the biggest stories on the Web? These problems share a fundamental challenge: How can we obtain the most useful information about the state of the world, at minimum cost? Such information gathering, or active learning, problems are typically NP-hard, and were commonly addressed using heuristics without theoretical guarantees about the solution quality. In this article, we describe algorithms which efficiently find provably near-optimal solutions to large, complex information gathering problems. Our algorithms exploit submodularity, an intuitive notion of diminishing returns common to many sensing problems: the more sensors we have already deployed, the less we learn by placing another sensor. In addition to identifying the most informative sensing locations, our algorithms can handle more challenging settings, where sensors need to be able to reliably communicate over lossy links, where mobile robots are used for collecting data, or where solutions need to be robust against adversaries and sensor failures. We also present results applying our algorithms to several real-world sensing tasks, including environmental monitoring using robotic sensors, activity recognition using a built sensing chair, a sensor placement challenge, and deciding which blogs to read on the Web.
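
The diminishing-returns idea in this abstract can be sketched with the standard greedy rule for a coverage objective, which is monotone and submodular. This is an illustrative sketch, not the article's algorithms: the candidate sites, the regions they cover, and the function name are hypothetical.

```python
def greedy_placement(coverage, budget: int):
    """Pick up to `budget` sites, each time taking the largest marginal coverage gain.

    `coverage` maps a candidate site to the set of regions a sensor there observes.
    """
    chosen = []
    covered = set()
    for _ in range(budget):
        best_site, best_gain = None, 0
        for site in sorted(coverage):  # sorted for deterministic tie-breaking
            if site in chosen:
                continue
            # Marginal gain: regions this site adds beyond what is already covered.
            gain = len(coverage[site] - covered)
            if gain > best_gain:
                best_site, best_gain = site, gain
        if best_site is None:
            break  # no remaining site adds any new region
        chosen.append(best_site)
        covered |= coverage[best_site]
    return chosen, covered


if __name__ == "__main__":
    # Hypothetical candidate sites and the regions each would monitor.
    sites = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
    print(greedy_placement(sites, budget=2))
```

Each added sensor contributes fewer new regions as coverage grows. For monotone submodular objectives under a cardinality constraint, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value.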

NeurIPS Conference 2010 Conference Paper

Evidence-Specific Structures for Rich Tractable CRFs

Anton Chechetka
Carlos Guestrin

We present a simple and effective approach to learning tractable conditional random fields with structure that depends on the evidence. Our approach retains the advantages of tractable discriminative models, namely efficient exact inference and exact parameter learning. At the same time, our algorithm does not suffer a large expressive power penalty inherent to fixed tractable structures. On real-life relational datasets, our approach matches or exceeds state of the art accuracy of the dense models, and at the same time provides an order of magnitude speedup.

UAI Conference 2010 Conference Paper

GraphLab: A New Framework For Parallel Machine Learning

Yucheng Low
Joseph E. Gonzalez
Aapo Kyrola
Danny Bickson
Carlos Guestrin
Joseph M. Hellerstein

Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.

NeurIPS Conference 2010 Conference Paper

Inference with Multivariate Heavy-Tails in Linear Models

Danny Bickson
Carlos Guestrin

Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, Lévy and Gaussian distributions, we show for the first time how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to only stable distributions; in fact, LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Another potential application is iterative decoding of linear channels with non-Gaussian noise.

ICML Conference 2010 Conference Paper

Learning Hierarchical Riffle Independent Groupings from Rankings

Jonathan Huang
Carlos Guestrin

ICML Conference 2010 Conference Paper

Learning Tree Conditional Random Fields

Joseph K. Bradley
Carlos Guestrin

UAI Conference 2009 Conference Paper

Distributed Parallel Inference on Large Factor Graphs

Joseph E. Gonzalez
Yucheng Low
Carlos Guestrin
David R. O'Hallaron

As computer clusters become more common and the size of the problems encountered in the field of AI grows, there is an increasing demand for efficient parallel inference algorithms. We consider the problem of parallel inference on large factor graphs in the distributed memory setting of computer clusters. We develop a new efficient parallel inference algorithm, DBRSplash, which incorporates over-segmented graph partitioning, belief residual scheduling, and uniform work Splash operations. We empirically evaluate the DBRSplash algorithm on a 120 processor cluster and demonstrate linear to super-linear performance gains on large factor graph models.

JMLR Journal 2009 Journal Article

Fourier Theoretic Probabilistic Inference over Permutations

Jonathan Huang
Carlos Guestrin
Leonidas Guibas

Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the "low-frequency" terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations. Low order Fourier-based approximations, however, may lead to functions that do not correspond to valid distributions. To address this problem, we present a quadratic program defined directly in the Fourier domain for projecting the approximation onto a relaxation of the polytope of legal marginal distributions. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking scenario.

NeurIPS Conference 2009 Conference Paper

Riffled Independence for Ranked Data

Jonathan Huang
Carlos Guestrin

Representing distributions over permutations can be a daunting task due to the fact that the number of permutations of n objects scales factorially in n. One recent way that has been used to reduce storage complexity has been to exploit probabilistic independence, but as we argue, full independence assumptions impose strong sparsity constraints on distributions and are unsuitable for modeling rankings. We identify a novel class of independence structures, called riffled independence, which encompasses a more expressive family of distributions while retaining many of the properties necessary for performing efficient inference and reducing sample complexity. In riffled independence, one draws two permutations independently, then performs the riffle shuffle, common in card games, to combine the two permutations to form a single permutation. In ranking, riffled independence corresponds to ranking disjoint sets of objects independently, then interleaving those rankings. We provide a formal introduction and present algorithms for using riffled independence within Fourier-theoretic frameworks which have been explored by a number of recent papers.
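The generative procedure described in this abstract (rank two disjoint item sets independently, then interleave the two rankings) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function names `riffle_interleave` and `sample_riffled` are hypothetical, and both the sub-rankings and the interleaving are drawn uniformly at random here, whereas the paper's model allows arbitrary distributions over each component.

```python
import random


def riffle_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings while preserving the internal order of each one."""
    total = len(ranking_a) + len(ranking_b)
    # Assumption: every interleaving is equally likely (a uniform riffle).
    slots_for_a = set(rng.sample(range(total), len(ranking_a)))
    iter_a, iter_b = iter(ranking_a), iter(ranking_b)
    return [next(iter_a) if i in slots_for_a else next(iter_b)
            for i in range(total)]


def sample_riffled(items_a, items_b, rng):
    """Rank two disjoint item sets independently, then interleave them."""
    # Assumption: each sub-ranking is a uniformly random permutation.
    ranking_a = rng.sample(items_a, len(items_a))
    ranking_b = rng.sample(items_b, len(items_b))
    return riffle_interleave(ranking_a, ranking_b, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    fruits = ["apple", "banana", "cherry"]
    vegetables = ["kale", "leek"]
    print(sample_riffled(fruits, vegetables, rng))
```

Whatever interleaving is drawn, the relative order of the fruits among themselves and of the vegetables among themselves is left untouched, which is the structural property the abstract attributes to riffled independence.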

JMLR Journal 2008 Journal Article

Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

Andreas Krause
Ajit Singh
Carlos Guestrin

When monitoring spatial phenomena, which can often be modeled as Gaussian processes (GPs), choosing sensor locations is a fundamental task. There are several common strategies to address this task, for example, geometry or disk models, placing sensors at the points of highest entropy (variance) in the GP model, and A-, D-, or E-optimal design. In this paper, we tackle the combinatorial optimization problem of maximizing the mutual information between the chosen locations and the locations which are not selected. We prove that the problem of finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1-1/e) of the optimum by exploiting the submodularity of mutual information. We also show how submodularity can be used to obtain online bounds, and design branch and bound search procedures. We then extend our algorithm to exploit lazy evaluations and local structure in the GP, yielding significant speedups. We also extend our approach to find placements which are robust against node failures and uncertainties in the model. These extensions are again associated with rigorous theoretical approximation guarantees, exploiting the submodularity of the objective function. We demonstrate the advantages of our approach towards optimizing mutual information in a very extensive empirical study on two real-world data sets.

JMLR Journal 2008 Journal Article

Robust Submodular Observation Selection

Andreas Krause
H. Brendan McMahan
Carlos Guestrin
Anupam Gupta

In many applications, one has to actively select among a set of expensive observations before making an informed decision. For example, in environmental monitoring, we want to select locations to measure in order to most effectively predict spatial phenomena. Often, we want to select observations which are robust against a number of possible objective functions. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for cases where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We show how our algorithm can be extended to handle complex cost functions (incorporating non-unit observation cost or communication and path costs). We also show how the algorithm can be used to near-optimally trade off expected-case (e.g., the Mean Square Prediction Error in Gaussian Process regression) and worst-case (e.g., maximum predictive variance) performance. We show that many important machine learning problems fit our robust submodular observation selection formalism, and provide extensive empirical evaluation on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms.

IJCAI Conference 2007 Conference Paper

Amarjeet Singh
Andreas Krause
Carlos Guestrin
William Kaiser
Maxim Batalin

In many sensing applications, including environmental monitoring, we must cover a large space with limited resources. One approach is to use robots to move sensors around this space. Planning the motion of these robots - coordinating their paths in order to maximize the amount of information collected while placing bounds on their resources (e.g., path length or battery capacity) - is an NP-hard problem. In this paper, we present an efficient path planning algorithm that coordinates multiple robots, each having a resource constraint, that maximizes the "informativeness" of their visited locations. In particular, we use a Gaussian Process to model the underlying phenomenon, and use the mutual information between the visited locations and remainder of the space to characterize the amount of information collected. We provide strong theoretical approximation guarantees for our algorithm by exploiting the submodularity property of mutual information. In addition, we improve the efficiency of our approach by extending the algorithm using branch and bound and a region-based decomposition of the space. We provide an extensive empirical analysis of our algorithm, comparing with existing heuristics on datasets from several real world sensing applications.

NeurIPS Conference 2007 Conference Paper

Efficient Inference for Distributions on Permutations

Jonathan Huang
Carlos Guestrin
Leonidas Guibas

Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-people tracking setting.

NeurIPS Conference 2007 Conference Paper

Efficient Principled Learning of Thin Junction Trees

Anton Chechetka
Carlos Guestrin

We present the first truly polynomial algorithm for learning the structure of bounded-treewidth junction trees -- an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efficient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity, and provides strong theoretical guarantees in terms of KL divergence from the true distribution. We also present a lazy extension of our approach that leads to very significant speedups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of random variables with only a polynomial number of mutual information computations on fixed-size subsets of variables, when the underlying distribution can be approximated by a bounded treewidth junction tree.

ICML Conference 2007 Conference Paper

Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach

Andreas Krause 0001
Carlos Guestrin

AAAI Conference 2007 Conference Paper

Nonmyopic Informative Path Planning in Spatio-Temporal Models

Alexandra Meliou
Carlos Guestrin

In many sensing applications we must continuously gather information to provide a good estimate of the state of the environment at every point in time. A robot may tour an environment, gathering information every hour. In a wireless sensor network, these tours correspond to packets being transmitted. In these settings, we are often faced with resource restrictions, like energy constraints. The users issue queries with certain expectations on the answer quality. Thus, we must optimize the tours to ensure the satisfaction of the user constraints, while at the same time minimize the cost of the query plan. For a single timestep, this optimization problem is NP-hard, but recent approximation algorithms with theoretical guarantees provide good solutions. In this paper, we present a new efficient algorithm, exploiting dynamic programming and submodularity of the information collected, that efficiently plans data collection tours for an entire (finite) horizon. Our algorithm can use any single step procedure as a black box, and, based on its properties, provides strong theoretical guarantees for the solution. We also provide an extensive empirical analysis demonstrating the benefits of nonmyopic planning in two real world sensing applications.

NeurIPS Conference 2007 Conference Paper

Selecting Observations against Adversarial Objectives

Andreas Krause
Brendan McMahan
Carlos Guestrin
Anupam Gupta

In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms.

ICML Conference 2006 Conference Paper

Data association for topic intensity tracking

Andreas Krause 0001
Jure Leskovec
Carlos Guestrin

NeurIPS Conference 2006 Conference Paper

Distributed Inference in Dynamical Systems

Stanislav Funiak
Carlos Guestrin
Rahul Sukthankar
Mark Paskin

We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors.

ICML Conference 2005 Conference Paper

Learning structured prediction models: a large margin approach

Ben Taskar
Vassil Chatalbashev
Daphne Koller
Carlos Guestrin

We consider large margin estimation in a broad range of prediction models where inference involves solving combinatorial optimization problems, for example, weighted graph-cuts or matchings. Our goal is to learn parameters such that inference using the model reproduces correct answers on the training data. Our method relies on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured prediction models. Directly embedding this structure within the learning formulation produces concise convex problems for efficient estimation of very complex and diverse models. We describe experimental results on a matching task, disulfide connectivity prediction, showing significant improvements over state-of-the-art methods.

UAI Conference 2005 Conference Paper

Near-optimal Nonmyopic Value of Information in Graphical Models

Andreas Krause 0001
Carlos Guestrin

A fundamental issue in real-world systems, such as sensor networks, is the selection of observations which most effectively reduce uncertainty. More specifically, we address the long standing problem of nonmyopically selecting the most informative subset of variables in a graphical model. We present the first efficient randomized algorithm providing a constant factor (1-1/e-epsilon) approximation guarantee for any epsilon > 0 with high confidence. The algorithm leverages the theory of submodular functions, in combination with a polynomial bound on sample complexity. We furthermore prove that no polynomial time algorithm can provide a constant factor approximation better than (1 - 1/e) unless P = NP. Finally, we provide extensive evidence of the effectiveness of our method on two complex real-world datasets.

ICML Conference 2005 Conference Paper

Near-optimal sensor placements in Gaussian processes

Carlos Guestrin
Andreas Krause 0001
Ajit Paul Singh

When monitoring spatial phenomena, which are often modeled as Gaussian Processes (GPs), choosing sensor locations is a fundamental task. A common strategy is to place sensors at the points of highest entropy (variance) in the GP model. We propose a mutual information criterion, and show that it produces better placements. Furthermore, we prove that finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1 - 1/e) of the optimum by exploiting the submodularity of our criterion. This algorithm is extended to handle local structure in the GP, yielding significant speedups. We demonstrate the advantages of our approach on two real-world data sets.
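For readers unfamiliar with how submodularity yields a (1 - 1/e) guarantee, the standard route for monotone submodular objectives is a greedy rule: repeatedly add the candidate with the largest marginal gain. The sketch below illustrates that rule only. It is not the paper's implementation: the names `greedy_select` and `make_coverage_objective` are hypothetical, and a simple set-coverage count stands in for the GP mutual information criterion, which would require a covariance model not given here.

```python
def greedy_select(candidates, k, objective):
    """Pick up to k candidates, each time adding the largest marginal gain."""
    selected = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        base = objective(selected)
        # Marginal gain of adding candidate c to the current selection.
        best = max(remaining, key=lambda c: objective(selected + [c]) - base)
        selected.append(best)
    return selected


def make_coverage_objective(coverage_map):
    """Stand-in submodular objective: number of distinct cells covered."""
    def objective(chosen):
        covered = set()
        for sensor in chosen:
            covered |= coverage_map[sensor]
        return len(covered)
    return objective


if __name__ == "__main__":
    # Hypothetical sensors and the grid cells each one observes.
    coverage_map = {
        "s1": {1, 2, 3},
        "s2": {3, 4},
        "s3": {4, 5, 6, 7},
        "s4": {1},
    }
    objective = make_coverage_objective(coverage_map)
    print(greedy_select(list(coverage_map), 2, objective))
```

In this toy instance the first pick is `s3` (four new cells) and the second is `s1` (three new cells), since `s2` would add only cell 3 once `s3` is chosen. That shrinking gain is the diminishing-returns behavior that the greedy guarantee relies on.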

UAI Conference 2004 Conference Paper

Robust Probabilistic Inference in Distributed Systems

Mark A. Paskin
Carlos Guestrin

Probabilistic inference problems arise naturally in distributed systems such as sensor networks and teams of mobile robots. Inference algorithms that use message passing are a natural fit for distributed systems, but they must be robust to the failure situations that arise in real-world settings, such as unreliable communication and node failures. Unfortunately, the popular sum-product algorithm can yield very poor estimates in these settings because the nodes' beliefs before convergence can be arbitrarily different from the correct posteriors. In this paper, we present a new message passing algorithm for probabilistic inference which provides several crucial guarantees that the standard sum-product algorithm does not. Not only does it converge to the correct posteriors, but it is also guaranteed to yield a principled approximation at any point before convergence. In addition, the computational complexity of the message passing updates depends only upon the model, and is independent of the network topology of the distributed system. We demonstrate the approach with detailed experimental results on a distributed sensor calibration task using data from an actual sensor network deployment.

UAI Conference 2004 Conference Paper

Solving Factored MDPs with Continuous and Discrete Variables

Carlos Guestrin
Milos Hauskrecht
Branislav Kveton

Although many real-world stochastic planning problems are more naturally formulated by hybrid models with both discrete and continuous variables, current state-of-the-art methods cannot adequately address these problems. We present the first framework that can exploit problem structure for modeling and solving hybrid problems efficiently. We formulate these problems as hybrid Markov decision processes (MDPs with continuous and discrete state and action variables), which we assume can be represented in a factored way using a hybrid dynamic Bayesian network (hybrid DBN). This formulation also allows us to apply our methods to collaborative multiagent settings. We present a new linear program approximation method that exploits the structure of the hybrid MDP and lets us compute approximate value functions more efficiently. In particular, we describe a new factored discretization of continuous variables that avoids the exponential blow-up of traditional approaches. We provide theoretical bounds on the quality of such an approximation and on its scale-up potential. We support our theoretical arguments with experiments on a set of control problems with up to 28-dimensional continuous state space and 22-dimensional action space.

IJCAI Conference 2003 Conference Paper

Generalizing Plans to New Environments in Relational MDPs

Carlos Guestrin
Daphne Koller
Chris Gearhart
Neal Kanodia

A longstanding goal in planning research is the ability to generalize plans developed for some set of environments to a new but similar environment, with minimal or no replanning. Such generalization can both reduce planning time and allow us to tackle larger domains than the ones tractable for direct planning. In this paper, we present an approach to the generalization problem based on a new framework of relational Markov Decision Processes (RMDPs). An RMDP can model a set of similar environments by representing objects as instances of different classes. In order to generalize plans to multiple environments, we define an approximate value function specified in terms of classes of objects and, in a multiagent setting, by classes of agents. This class-based approximate value function is optimized relative to a sampled subset of environments, and computed using an efficient linear programming method. We prove that a polynomial number of sampled environments suffices to achieve performance close to the performance achievable when optimizing over the entire space. Our experimental results show that our method generalizes plans successfully to new, significantly larger, environments, with minimal loss of performance relative to environment-specific planning. We demonstrate our approach on a real strategic computer war game.

NeurIPS Conference 2003 Conference Paper

Max-Margin Markov Networks

Ben Taskar
Carlos Guestrin
Daphne Koller

In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.

ICML Conference 2002 Conference Paper

Algorithm-Directed Exploration for Model-Based Reinforcement Learning in Factored MDPs

Carlos Guestrin
Relu Patrascu
Dale Schuurmans

AAAI Conference 2002 Conference Paper

Context-Specific Multiagent Coordination and Planning with Factored MDPs

Carlos Guestrin
Daphne Koller

We present an algorithm for coordinated decision making in cooperative multiagent settings, where the agents’ value function can be represented as a sum of context-specific value rules. The task of finding an optimal joint action in this setting leads to an algorithm where the coordination structure between agents depends on the current state of the system and even on the actual numerical values assigned to the value rules. We apply this framework to the task of multiagent planning in dynamic systems, showing how a joint value function of the associated Markov Decision Process can be approximated as a set of value rules using an efficient linear programming algorithm. The agents then apply the coordination graph algorithm at each iteration of the process to decide on the highest-value joint action, potentially leading to a different coordination pattern at each step of the plan.

ICML Conference 2002 Conference Paper

Coordinated Reinforcement Learning

Carlos Guestrin
Michail G. Lagoudakis
Ronald Parr

UAI Conference 2002 Conference Paper

Distributed Planning in Hierarchical Factored MDPs

Carlos Guestrin
Geoffrey J. Gordon

We present a principled and efficient planning algorithm for collaborative multiagent dynamical systems. All computation, during both the planning and the execution phases, is distributed among the agents; each agent only needs to model and plan for a small part of the system. Each of these local subsystems is small, but once they are combined they can represent an exponentially larger problem. The subsystems are connected through a subsystem hierarchy. Coordination and communication between the agents is not imposed, but derived directly from the structure of this hierarchy. A globally consistent plan is achieved by a message passing algorithm, where messages correspond to natural local reward functions and are computed by local linear programs; another message passing algorithm allows us to execute the resulting policy. When two portions of the hierarchy share the same structure, our algorithm can reuse plans and messages to speed up computation.

NeurIPS Conference 2001 Conference Paper

Multiagent Planning with Factored MDPs

Carlos Guestrin
Daphne Koller
Ronald Parr

We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.

UAI Conference 2001 Conference Paper

Robust Combination of Local Controllers

Carlos Guestrin
Dirk Ormoneit

Planning problems are hard; motion planning, for example, is PSPACE-hard. Such problems are even more difficult in the presence of uncertainty. Although Markov Decision Processes (MDPs) provide a formal framework for such problems, finding solutions to high dimensional continuous MDPs is usually difficult, especially when the actions and time measurements are continuous. Fortunately, problem-specific knowledge allows us to design controllers that are good locally, though having no global guarantees. We propose a method of nonparametrically combining local controllers to obtain globally good solutions. We apply this formulation to two types of problems: motion planning (stochastic shortest path) and discounted MDPs. For motion planning, we argue that the usual MDP optimality criterion (expected cost) may not be practically relevant. We propose an alternative: finding the minimum cost path, subject to the constraint that the robot must reach the goal with high probability. For this problem, we prove that a polynomial number of samples is sufficient to obtain a high probability path. For discounted MDPs, we propose a formulation that explicitly deals with model uncertainty, i.e., the problem introduced when transition probabilities are not known exactly. We formulate the problem as a robust linear program which directly incorporates this type of uncertainty.

IROS Conference 1998 Conference Paper

Fast software image stabilization with color registration

Carlos Guestrin
Fábio Gagliardi Cozman
Eric Krotkov

We present the formulation and implementation of an image stabilization system capable of stabilizing video with very large displacements between frames. A coarse-to-fine technique is applied in resolution and in model spaces. The registration algorithm uses phase correlation to obtain an initial estimate for translation between images; then the Levenberg-Marquardt method for nonlinear optimization is applied to refine the solution. Registration is performed in color space, using a subset of the pixels selected by a gradient-based sub-sampling criterion. This software implementation runs at 5 Hz on non-dedicated hardware (Silicon Graphics R10000 workstation).