Arrow Research search

Author name cluster

Daphne Koller

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

102 papers
2 author rows

Possible papers

102

ICML Conference 2013 Conference Paper

A Fast and Exact Energy Minimization Algorithm for Cycle MRFs

  • Huayan Wang
  • Daphne Koller

The presence of cycles is a principal source of difficulty in performing inference for MRFs. Handling cycles efficiently would greatly enhance our ability to tackle general MRFs. In particular, for dual decomposition of energy minimization (MAP inference), using cycle subproblems leads to a much tighter relaxation than using trees, but solving the cycle subproblems turns out to be the bottleneck. In this paper, we present a fast and exact algorithm for energy minimization in cycle MRFs, which can be used as a subroutine in tackling general MRFs. Our method builds on junction-tree message passing, with a large portion of the message entries pruned for efficiency. The pruning conditions fully exploit the structure of a cycle. Experimental results show that our algorithm is more than an order of magnitude faster than other state-of-the-art fast inference methods, and it performs consistently well in several different real problems.

ICML Conference 2013 Conference Paper

Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passing

  • Huayan Wang
  • Daphne Koller

Max-product (max-sum) message passing algorithms are widely used for MAP inference in MRFs. They have many variants sharing a common flavor of passing "messages" over some graph object. Recent advances revealed that their convergent versions (such as MPLP, MSD, TRW-S) can be viewed as performing block coordinate descent (BCD) in a dual objective. That is, each BCD step achieves dual-optimality w.r.t. a block of dual variables (messages), thereby decreasing the dual objective monotonically. However, most existing algorithms are limited to updating blocks selected in rather restricted ways. In this paper, we show a "unified" message passing algorithm that: (a) subsumes MPLP, MSD, and TRW-S as special cases when applied to their respective choices of dual objective and blocks, and (b) is able to perform BCD under much more flexible choices of blocks (including very large blocks) as well as the dual objective itself (arising from an arbitrary dual decomposition).

NeurIPS Conference 2012 Conference Paper

Shifting Weights: Adapting Object Detectors from Image to Video

  • Kevin Tang
  • Vignesh Ramanathan
  • Li Fei-Fei
  • Daphne Koller

Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection and LabelMe Video datasets that illustrate the benefit of our approach to adapt object detectors to video.

NeurIPS Conference 2011 Conference Paper

Active Classification based on Value of Classifier

  • Tianshi Gao
  • Daphne Koller

Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are applied at test time and pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation against its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.
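The selection step described above is, at its core, a value-of-information loop. A minimal sketch under strong simplifying assumptions (myopic greedy selection with fixed gain estimates, standing in for a probabilistic model conditioned on earlier outcomes; all names illustrative):

```python
def select_observations(candidates, budget):
    """Greedy myopic value-of-information selection (simplified sketch).

    candidates: list of (name, expected_gain, cost) tuples. In the paper's
    setting, expected_gain would be recomputed after each observation from
    a probabilistic model; here it is a fixed placeholder.
    Picks the candidate with the best gain-minus-cost trade-off until no
    candidate has positive net value or the budget is exhausted.
    """
    chosen, spent = [], 0.0
    remaining = list(candidates)
    while remaining:
        name, gain, cost = max(remaining, key=lambda c: c[1] - c[2])
        if gain - cost <= 0 or spent + cost > budget:
            break                      # no observation is worth its cost
        chosen.append(name)
        spent += cost
        remaining = [c for c in remaining if c[0] != name]
    return chosen, spent
```

Because the loop stops as soon as no classifier's expected gain exceeds its cost, each test instance gets its own (typically short) decision path.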

IJCAI Conference 2011 Conference Paper

Discovering Deformable Motifs in Continuous Time Series Data

  • Suchi Saria
  • Andrew Duchi
  • Daphne Koller

Continuous time series data often comprise or contain repeated motifs - patterns that have similar shape, and yet exhibit nontrivial variability. Identifying these motifs, even in the presence of variation, is an important subtask in both unsupervised knowledge discovery and constructing useful features for discriminative tasks. This paper addresses this task using a probabilistic framework that models generation of data as switching between a random walk state and states that generate motifs. A motif is generated from a continuous shape template that can undergo non-linear transformations such as temporal warping and additive noise. We propose an unsupervised algorithm that simultaneously discovers both the set of canonical shape templates and a template-specific model of variability manifested in the data. Experimental results on three real-world data sets demonstrate that our model is able to recover templates in data where repeated instances show large variability. The recovered templates provide higher classification accuracy and coverage when compared to those from alternatives such as random projection based methods and simpler generative models that do not model variability. Moreover, in analyzing physiological signals from infants in the ICU, we discover both known signatures as well as novel physiomarkers.

ICRA Conference 2010 Conference Paper

Real-time identification and localization of body parts from depth images

  • Christian Plagemann
  • Varun Ganapathi
  • Daphne Koller
  • Sebastian Thrun

We deal with the problem of detecting and identifying body parts in depth images at video frame rates. Our solution involves a novel interest point detector for mesh and range data that is particularly well suited for analyzing human shape. The interest points, which are based on identifying geodesic extrema on the surface mesh, coincide with salient points of the body, which can be classified as, e.g., hand, foot, or head using local shape descriptors. Our approach also provides a natural way of estimating a 3D orientation vector for a given interest point. This can be used to normalize the local shape descriptors to simplify the classification problem as well as to directly estimate the orientation of body parts in space. Experiments involving ground truth labels acquired via an active motion capture system show that our interest points in conjunction with a boosted patch classifier are significantly better in detecting body parts in depth images than state-of-the-art sliding-window based detectors.

NeurIPS Conference 2010 Conference Paper

Self-Paced Learning for Latent Variable Models

  • M. Kumar
  • Benjamin Packer
  • Daphne Koller

Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state-of-the-art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition.
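The alternating scheme can be illustrated on a deliberately tiny problem. Here the "model" is just a scalar mean, easiness is squared loss below a threshold 1/k, and k is annealed so the threshold grows until every sample is included; this is a toy stand-in for the schedule, not the latent structural SVM of the paper:

```python
def self_paced_mean(data, k0=1.0, anneal=0.5):
    """Self-paced estimation of a mean (toy stand-in for the latent model).

    Starts from the median, repeatedly refits on the 'easy' samples
    (squared loss below 1/k), then anneals k so the threshold grows
    until all samples are admitted.
    """
    theta = sorted(data)[len(data) // 2]   # robust starting point
    k = k0
    while True:
        easy = [x for x in data if (x - theta) ** 2 < 1.0 / k]
        if easy:
            theta = sum(easy) / len(easy)  # refit on the selected samples
        if len(easy) == len(data):
            return theta                   # whole training set considered
        k *= anneal                        # anneal: admit harder samples
```

In early iterations the hard outlier is excluded and the fit stays near the inliers; only after annealing does the full data enter the objective, mirroring the curriculum described above.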

NeurIPS Conference 2009 Conference Paper

Learning a Small Mixture of Trees

  • M. Kumar
  • Daphne Koller

The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, e.g., variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the $\alpha$-divergence with $\alpha = \infty$. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.

UAI Conference 2009 Conference Paper

MAP Estimation of Semi-Metric MRFs via Hierarchical Graph Cuts

  • M. Pawan Kumar
  • Daphne Koller

We consider the task of obtaining the maximum a posteriori estimate of discrete pairwise random fields with arbitrary unary potentials and semimetric pairwise potentials. For this problem, we propose an accurate hierarchical move making strategy where each move is computed efficiently by solving an st-MINCUT problem. Unlike previous move making approaches, e.g., the widely used α-expansion algorithm, our method obtains the guarantees of the standard linear programming (LP) relaxation for the important special case of metric labeling. Unlike the existing LP relaxation solvers, e.g., interior-point algorithms or tree-reweighted message passing, our method is significantly faster as it uses only the efficient st-MINCUT algorithm in its design. Using both synthetic and real data experiments, we show that our technique outperforms several commonly used algorithms.

NeurIPS Conference 2009 Conference Paper

Region-based Segmentation and Object Detection

  • Stephen Gould
  • Tianshi Gao
  • Daphne Koller

Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other. However, current state-of-the-art models use a separate representation for each task making joint inference clumsy and leaving classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach reasons about pixels, regions and objects in a coherent probabilistic model. Importantly, our model gives a single unified description of the scene. We explain every pixel in the image and enforce global consistency between all variables in our model. We run experiments on challenging vision datasets and show significant improvement over state-of-the-art object detection accuracy.

NeurIPS Conference 2008 Conference Paper

Cascaded Classification Models: Combining Models for Holistic Scene Understanding

  • Geremy Heitz
  • Stephen Gould
  • Ashutosh Saxena
  • Daphne Koller

One of the original goals of computer vision was to fully understand a natural scene. This requires solving several problems simultaneously, including object detection, labeling of meaningful regions, and 3d reconstruction. While great progress has been made in tackling each of these problems in isolation, only recently have researchers again been considering the difficult task of assembling various methods to the mutual benefit of all. We consider learning a set of such classification models in such a way that they both solve their own problem and help each other. We develop a framework known as Cascaded Classification Models (CCM), where repeated instantiations of these classifiers are coupled by their input/output variables in a cascade that improves performance at each level. Our method requires only a limited “black box” interface with the models, allowing us to use very sophisticated, state-of-the-art classifiers without having to look under the hood. We demonstrate the effectiveness of our method on a large set of natural images by combining the subtasks of scene categorization, object detection, multiclass image segmentation, and 3d scene reconstruction.

UAI Conference 2008 Conference Paper

Constrained Approximate Maximum Entropy Learning of Markov Random Fields

  • Varun Ganapathi
  • David Vickrey
  • John C. Duchi
  • Daphne Koller

Parameter estimation in Markov random fields (MRFs) is a difficult task, in which inference over the network is run in the inner loop of a gradient descent procedure. Substituting approximate methods such as loopy belief propagation (LBP) for exact inference can suffer from poor convergence. In this paper, we provide a different approach for combining MRF learning and Bethe approximation. We consider the dual of maximum likelihood Markov network learning — maximizing entropy with moment matching constraints — and then approximate both the objective and the constraints in the resulting optimization problem. Unlike previous work along these lines (Teh & Welling, 2003), our formulation allows parameter sharing between features in a general log-linear model, parameter regularization and conditional training. We show that piecewise training (Sutton & McCallum, 2005) is a very restricted special case of this formulation. We study two optimization strategies: one based on a single convex approximation and one that uses repeated convex approximations. We show results on several real-world networks that demonstrate that these algorithms can significantly outperform learning with loopy belief propagation and with piecewise training. Our results also provide a framework for analyzing the trade-offs of different relaxations of the entropy objective and of the constraints.

UAI Conference 2008 Conference Paper

Convex Point Estimation using Undirected Bayesian Transfer Hierarchies

  • Gal Elidan
  • Benjamin Packer
  • Geremy Heitz
  • Daphne Koller

When related learning tasks are naturally arranged in a hierarchy, an appealing approach for coping with scarcity of instances is that of transfer learning using a hierarchical Bayes framework. As fully Bayesian computations can be difficult and computationally demanding, it is often desirable to use posterior point estimates that facilitate (relatively) efficient prediction. However, the hierarchical Bayes framework does not always lend itself naturally to this maximum a posteriori goal. In this work we propose an undirected reformulation of hierarchical Bayes that relies on priors in the form of similarity measures. We introduce the notion of “degree of transfer” weights on components of these similarity measures, and show how they can be automatically learned within a joint probabilistic framework. Importantly, our reformulation results in a convex objective for many learning problems, thus facilitating optimal posterior point estimation using standard optimization techniques. In addition, we no longer require proper priors, allowing for flexible and straightforward specification of joint distributions over transfer hierarchies. We show that our framework is effective for learning models that are part of transfer hierarchies for two real-life tasks: object shape modeling using Gaussian density estimation and document classification.

JMLR Journal 2008 Journal Article

Max-margin Classification of Data with Absent Features

  • Gal Chechik
  • Geremy Heitz
  • Gal Elidan
  • Pieter Abbeel
  • Daphne Koller

We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than having an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images.
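The core geometric idea, measuring each sample's margin only in its own relevant subspace, is easy to state in code. A sketch, assuming sparse samples given as dicts of present features; this illustrates the objective's reading of the margin, not the SOCP/QP solvers of the paper:

```python
def subspace_margin(w, b, x, y):
    """Geometric margin of one sample over its present features only.

    x is a dict {feature_index: value}; structurally absent features simply
    do not appear, so w is restricted to the sample's own subspace before
    normalizing. y is the label in {-1, +1}.
    """
    score = sum(w[i] * v for i, v in x.items()) + b
    norm = sum(w[i] ** 2 for i in x) ** 0.5   # ||w|| restricted to present features
    return y * score / norm if norm > 0 else 0.0
```

Note that the same weight vector yields different geometric margins depending on which features exist for the sample, which is exactly why the margin must be normalized per-sample rather than globally.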

UAI Conference 2008 Conference Paper

Projected Subgradient Methods for Learning Sparse Gaussians

  • John C. Duchi
  • Stephen Gould
  • Daphne Koller

Gaussian Markov random fields (GMRFs) are useful in a broad range of applications. In this paper we tackle the problem of learning a sparse GMRF in a high-dimensional space. Our approach uses the ℓ1-norm as a regularization on the inverse covariance matrix. We utilize a novel projected gradient method, which is faster than previous methods in practice and equal to the best performing of these in asymptotic complexity. We also extend the ℓ1-regularized objective to the problem of sparsifying entire blocks within the inverse covariance matrix. Our methods generalize fairly easily to this case, while other methods do not. We demonstrate that our extensions give better generalization performance on two real domains—biological network analysis and a 2D-shape modeling image task.

NeurIPS Conference 2008 Conference Paper

Shape-Based Object Localization for Descriptive Classification

  • Geremy Heitz
  • Gal Elidan
  • Benjamin Packer
  • Daphne Koller

Discriminative tasks, including object categorization and detection, are central components of high-level computer vision. Sometimes, however, we are interested in more refined aspects of the object in an image, such as pose or particular regions. In this paper we develop a method (LOOPS) for learning a shape and image feature model that can be trained on a particular object class, and used to outline instances of the class in novel images. Furthermore, while the training data consists of uncorresponded outlines, the resulting LOOPS model contains a set of landmark points that appear consistently across instances, and can be accurately localized in an image. Our model achieves state-of-the-art results in precisely outlining objects that exhibit large deformations and articulations in cluttered natural images. These localizations can then be used to address a range of tasks, including descriptive classification, search, and clustering.

ICML Conference 2007 Conference Paper

Learning a meta-level prior for feature relevance from multiple related tasks

  • Su-In Lee
  • Vassil Chatalbashev
  • David Vickrey
  • Daphne Koller

In many prediction tasks, selecting relevant features is essential for achieving good generalization performance. Most feature selection algorithms consider all features to be a priori equally likely to be relevant. In this paper, we use transfer learning---learning on an ensemble of related tasks---to construct an informative prior on feature relevance. We assume that features themselves have meta-features that are predictive of their relevance to the prediction task, and model their relevance as a function of the meta-features using hyperparameters (called meta-priors). We present a convex optimization algorithm for simultaneously learning the meta-priors and feature weights from an ensemble of related prediction tasks which share a similar relevance structure. Our approach transfers the "meta-priors" among different tasks, which makes it possible to deal with settings where tasks have nonoverlapping features or the relevance of the features varies over the tasks. We show that learning feature relevance improves performance on two real data sets which illustrate such settings: (1) predicting ratings in a collaborative filtering task, and (2) distinguishing arguments of a verb in a sentence.

UAI Conference 2007 Conference Paper

Reasoning at the Right Time Granularity

  • Suchi Saria
  • Uri Nodelman
  • Daphne Koller

Most real-world dynamic systems are composed of different components that often evolve at very different rates. In traditional temporal graphical models, such as dynamic Bayesian networks, time is modeled at a fixed granularity, generally selected based on the rate at which the fastest component evolves. Inference must then be performed at this fastest granularity, potentially at significant computational cost. Continuous Time Bayesian Networks (CTBNs) avoid time-slicing in the representation by modeling the system as evolving continuously over time. The expectation-propagation (EP) inference algorithm of Nodelman et al. (2005) can then vary the inference granularity over time, but the granularity is uniform across all parts of the system, and must be selected in advance. In this paper, we provide a new EP algorithm that utilizes a general cluster graph architecture where clusters contain distributions that can overlap in both space (set of variables) and time. This architecture allows different parts of the system to be modeled at very different time granularities, according to their current rate of evolution. We also provide an information-theoretic criterion for dynamically re-partitioning the clusters during inference to tune the level of approximation to the current rate of evolution. This avoids the need to hand-select the appropriate granularity, and allows the granularity to adapt as information is transmitted across the network. We present experiments demonstrating that this approach can result in significant computational savings.

UAI Conference 2006 Conference Paper

Continuous Time Markov Networks

  • Tal El-Hay
  • Nir Friedman
  • Daphne Koller
  • Raz Kupferman

A central task in many applications is reasoning about processes that change in continuous time. The mathematical framework of Continuous Time Markov Processes provides the basic foundations for modeling such systems. Recently, Nodelman et al. introduced continuous time Bayesian networks (CTBNs), which allow a compact representation of continuous-time processes over a factored state space. In this paper, we introduce continuous time Markov networks (CTMNs), an alternative representation language that represents a different type of continuous-time dynamics. In many real-life processes, such as biological and chemical systems, the dynamics of the process can be naturally described as an interplay between two forces - the tendency of each entity to change its state, and the overall fitness or energy function of the entire system. In our model, the first force is described by a continuous-time proposal process that suggests possible local changes to the state of the system at different rates. The second force is represented by a Markov network that encodes the fitness, or desirability, of different states; a proposed local change is then accepted with a probability that is a function of the change in the fitness distribution. We show that the fitness distribution is also the stationary distribution of the Markov process, so that this representation provides a characterization of a temporal process whose stationary distribution has a compact graphical representation. This allows us to naturally capture a different type of structure in complex dynamical processes, such as evolving biological sequences. We describe the semantics of the representation, its basic properties, and how it compares to CTBNs. We also provide algorithms for learning such models from data, and discuss its applicability to biological sequence evolution.

NeurIPS Conference 2006 Conference Paper

Efficient Structure Learning of Markov Networks using $L_1$-Regularization

  • Su-In Lee
  • Varun Ganapathi
  • Daphne Koller

Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1-based method achieves considerably higher generalization performance than the more standard L2-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
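The effect of the L1 penalty, driving many weights exactly to zero via soft-thresholding, can be seen with a generic proximal-gradient (ISTA) sketch. Here a least-squares term stands in for the (approximate) log-likelihood gradient of the log-linear model, which in the paper's setting requires inference to compute; names and data are illustrative:

```python
def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrinks toward zero, producing exact zeros."""
    return max(z - t, 0.0) if z > 0 else min(z + t, 0.0)

def ista_lasso(X, y, lam, lr=0.01, iters=5000):
    """Proximal gradient (ISTA) for 0.5*||Xw - y||^2 + lam*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # gradient of the smooth squared-error term (stand-in for the
        # approximate log-likelihood gradient of the log-linear model)
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(d)]
        # proximal step: gradient descent followed by soft-thresholding
        w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w
```

Weights of irrelevant features land exactly at zero rather than merely near it, which is what turns parameter estimation into implicit structure selection.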

JMLR Journal 2006 Journal Article

Learning Factor Graphs in Polynomial Time and Sample Complexity

  • Pieter Abbeel
  • Daphne Koller
  • Andrew Y. Ng

We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.

NeurIPS Conference 2006 Conference Paper

Max-margin classification of incomplete data

  • Gal Chechik
  • Geremy Heitz
  • Gal Elidan
  • Pieter Abbeel
  • Daphne Koller

We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images.

UAI Conference 2006 Conference Paper

Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing

  • Gal Elidan
  • Ian McGraw
  • Daphne Koller

Inference for probabilistic graphical models is still very much a practical challenge in large domains. The commonly used and effective belief propagation (BP) algorithm and its generalizations often do not converge when applied to hard, real-life inference tasks. While it is widely recognized that the scheduling of messages in these algorithms may have significant consequences, this issue remains largely unexplored. In this work, we address the question of how to schedule messages for asynchronous propagation so that a fixed point is reached faster and more often. We first show that any reasonable asynchronous BP converges to a unique fixed point under conditions similar to those that guarantee convergence of synchronous BP. In addition, we show that the convergence rate of a simple round-robin schedule is at least as good as that of synchronous propagation. We then propose residual belief propagation (RBP), a novel, easy-to-implement, asynchronous propagation algorithm that schedules messages in an informed way that pushes down a bound on the distance from the fixed point. Finally, we demonstrate the superiority of RBP over state-of-the-art methods for a variety of challenging synthetic and real-life problems: RBP converges significantly more often than other methods; and it significantly reduces running time until convergence, even when other methods converge.
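The scheduling rule described in the abstract — always update the message whose pending change (residual) is largest — can be sketched with a priority queue. This is a hypothetical reconstruction, not the authors' code; `compute_message`, the edge list, and the dependency map are all assumed inputs:

```python
import heapq
import itertools
import numpy as np

def residual_bp(compute_message, edges, init_messages, dependents,
                tol=1e-8, max_updates=100000):
    """Residual-scheduled asynchronous message passing (a sketch).

    compute_message(e, messages) -> the would-be new message for edge e.
    dependents[e] -> edges whose update reads the message on e.
    """
    messages = {e: np.asarray(m, dtype=float) for e, m in init_messages.items()}
    counter = itertools.count()  # tie-breaker so the heap never compares edges
    heap = []

    def push(e):
        r = np.abs(compute_message(e, messages) - messages[e]).max()
        heapq.heappush(heap, (-r, next(counter), e))  # largest residual first

    for e in edges:
        push(e)
    for _ in range(max_updates):
        if not heap:
            break  # every residual fell below tol: converged
        _, _, e = heapq.heappop(heap)
        new_m = compute_message(e, messages)  # recompute: popped priority may be stale
        if np.abs(new_m - messages[e]).max() < tol:
            continue  # stale or already-converged entry
        messages[e] = new_m
        for d in dependents[e]:  # only readers of e see their residual change
            push(d)
    return messages
```

Stale heap entries are handled lazily: a popped edge is re-evaluated, and skipped if its actual residual has already fallen below `tol`.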

NeurIPS Conference 2006 Conference Paper

Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks

  • Alexis Battle
  • Gal Chechik
  • Daphne Koller

We present a probabilistic model applied to the fMRI video rating prediction task of the Pittsburgh Brain Activity Interpretation Competition (PBAIC) [2]. Our goal is to predict a time series of subjective, semantic ratings of a movie given functional MRI data acquired during viewing by three subjects. Our method uses conditionally trained Gaussian Markov random fields, which model both the relationships between the subjects' fMRI voxel measurements and the ratings, as well as the dependencies of the ratings across time steps and between subjects. We also employed non-traditional methods for feature selection and regularization that exploit the spatial structure of voxel activity in the brain. The model displayed good performance in predicting the scored ratings for the three subjects in test data sets, and a variant of this model was the third place entrant to the 2006 PBAIC.

NeurIPS Conference 2006 Conference Paper

Using Combinatorial Optimization within Max-Product Belief Propagation

  • Daniel Tarlow
  • Gal Elidan
  • Daphne Koller
  • John Duchi

In general, the problem of computing a maximum a posteriori (MAP) assignment in a Markov random field (MRF) is computationally intractable. However, in certain subclasses of MRF, an optimal or close-to-optimal assignment can be found very efficiently using combinatorial optimization algorithms: certain MRFs with mutual exclusion constraints can be solved using bipartite matching, and MRFs with regular potentials can be solved using minimum cut methods. However, these solutions do not apply to the many MRFs that contain such tractable components as sub-networks, but also other non-complying potentials. In this paper, we present a new method, called COMPOSE, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. COMPOSE uses combinatorial optimization for computing exact max-marginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. We describe highly efficient methods for computing max-marginals for subnetworks corresponding both to bipartite matchings and to regular networks. We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. We compare to a range of current methods, showing that the ability of COMPOSE to transmit information globally across the network leads to improved convergence, decreased running time, and higher-scoring assignments.
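The quantity COMPOSE computes for a matching sub-network — exact max-marginals for a bipartite matching — can be illustrated naively: the max-marginal for assigning item i to slot j is the score of the best complete matching with that pair forced. The paper computes these far more efficiently; the brute-force sketch below (hypothetical names, using SciPy's assignment solver) only shows what quantity is meant:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_max_marginals(W):
    """Exact max-marginals for a bipartite matching subnetwork (a sketch).

    W[i, j] is the score of matching item i to slot j. The max-marginal
    M[i, j] is the value of the best complete matching that forces i -> j,
    computed here by brute force: O(n*m) forced-assignment matchings.
    """
    n, m = W.shape
    M = np.full((n, m), -np.inf)
    for i in range(n):
        for j in range(m):
            # Force i -> j, then optimally match the remaining items/slots.
            rows = [r for r in range(n) if r != i]
            cols = [c for c in range(m) if c != j]
            sub = W[np.ix_(rows, cols)]
            r_ind, c_ind = linear_sum_assignment(sub, maximize=True)
            M[i, j] = W[i, j] + sub[r_ind, c_ind].sum()
    return M
```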

UAI Conference 2005 Conference Paper

Expectation Maximization and Complex Duration Distributions for Continuous Time Bayesian Networks

  • Uri Nodelman
  • Christian R. Shelton
  • Daphne Koller

Continuous time Bayesian networks (CTBNs) describe structured stochastic processes with finitely many states that evolve over continuous time. A CTBN is a directed (possibly cyclic) dependency graph over a set of variables, each of which represents a finite state continuous time Markov process whose transition model is a function of its parents. We address the problem of learning the parameters and structure of a CTBN from partially observed data. We show how to apply expectation maximization (EM) and structural expectation maximization (SEM) to CTBNs. The availability of the EM algorithm allows us to extend the representation of CTBNs to allow a much richer class of transition duration distributions, known as phase distributions. This class is a highly expressive semi-parametric representation, which can approximate any duration distribution arbitrarily closely. This extension to the CTBN framework addresses one of the main limitations of both CTBNs and DBNs - the restriction to exponentially/geometrically distributed durations. We present experimental results on a real data set of people's life spans, showing that our algorithm learns reasonable models - structure and parameters - from partially observed data, and, with the use of phase distributions, achieves better performance than DBNs.

UAI Conference 2005 Conference Paper

Expectation Propagation for Continuous Time Bayesian Networks

  • Uri Nodelman
  • Daphne Koller
  • Christian R. Shelton

Continuous time Bayesian networks (CTBNs) describe structured stochastic processes with finitely many states that evolve over continuous time. A CTBN is a directed (possibly cyclic) dependency graph over a set of variables, each of which represents a finite state continuous time Markov process whose transition model is a function of its parents. As shown previously, exact inference in CTBNs is intractable. We address the problem of approximate inference, allowing for general queries conditioned on evidence over continuous time intervals and at discrete time points. We show how CTBNs can be parameterized within the exponential family, and use that insight to develop a message passing scheme over cluster graphs that allows us to apply expectation propagation to CTBNs. The clusters in our cluster graph do not contain distributions over the cluster variables at individual time points, but distributions over trajectories of the variables throughout a duration. Thus, unlike discrete time temporal models such as dynamic Bayesian networks, we can adapt the time granularity at which we reason for different variables and in different conditions.

UAI Conference 2005 Conference Paper

Learning Factor Graphs in Polynomial Time & Sample Complexity

  • Pieter Abbeel
  • Daphne Koller
  • Andrew Y. Ng

We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded factor size and bounded connectivity can be learned in polynomial time and from a polynomial number of samples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Unlike maximum likelihood estimation, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks.

JMLR Journal 2005 Journal Article

Learning Module Networks

  • Eran Segal
  • Dana Pe'er
  • Aviv Regev
  • Daphne Koller
  • Nir Friedman

Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper, we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules' composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks.

ICML Conference 2005 Conference Paper

Learning structured prediction models: a large margin approach

  • Ben Taskar
  • Vassil Chatalbashev
  • Daphne Koller
  • Carlos Guestrin

We consider large margin estimation in a broad range of prediction models where inference involves solving combinatorial optimization problems, for example, weighted graph-cuts or matchings. Our goal is to learn parameters such that inference using the model reproduces correct answers on the training data. Our method relies on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured prediction models. Directly embedding this structure within the learning formulation produces concise convex problems for efficient estimation of very complex and diverse models. We describe experimental results on a matching task, disulfide connectivity prediction, showing significant improvements over state-of-the-art methods.

UAI Conference 2005 Conference Paper

Ordering-Based Search: A Simple and Effective Algorithm for Learning Bayesian Networks

  • Marc Teyssier 0001
  • Daphne Koller

One of the basic tasks for Bayesian networks (BNs) is that of learning a network structure from data. The BN-learning problem is NP-hard, so the standard solution is heuristic search. Many approaches have been proposed for this task, but only a very small number outperform the baseline of greedy hill-climbing with tabu lists; moreover, many of the proposed algorithms are quite complex and hard to implement. In this paper, we propose a very simple and easy-to-implement method for addressing this task. Our approach is based on the well-known fact that the best network (of bounded in-degree) consistent with a given node ordering can be found very efficiently. We therefore propose a search not over the space of structures, but over the space of orderings, selecting for each ordering the best network consistent with it. This search space is much smaller, makes more global search steps, has a lower branching factor, and avoids costly acyclicity checks. We present results for this algorithm on both synthetic and real data sets, evaluating both the score of the network found and the running time. We show that ordering-based search outperforms the standard baseline, and is competitive with recent algorithms that are much harder to implement.
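The fact the abstract relies on — the best bounded-in-degree network consistent with an ordering decomposes into independent per-node parent choices — can be sketched as follows. `family_score` is a hypothetical decomposable local score; this toy uses adjacent-transposition moves and is not the paper's implementation:

```python
from itertools import combinations

def best_network_score(order, family_score, max_indegree=2):
    """Score of the best network consistent with `order` (a sketch).

    Under a fixed ordering, each node independently picks its best parent
    set from its predecessors -- this is what makes ordering search cheap.
    """
    total = 0.0
    for i, x in enumerate(order):
        preds = order[:i]
        candidates = [ps for k in range(min(len(preds), max_indegree) + 1)
                      for ps in combinations(preds, k)]
        total += max(family_score(x, ps) for ps in candidates)
    return total

def ordering_search(order, family_score):
    """Greedy hill-climbing over orderings via adjacent swaps (a toy sketch)."""
    order = list(order)
    score = best_network_score(order, family_score)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            cand = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
            s = best_network_score(cand, family_score)
            if s > score:
                order, score, improved = cand, s, True
    return order, score
```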

ICRA Conference 2004 Conference Paper

Detecting and Modeling Doors with Mobile Robots

  • Dragomir Anguelov
  • Daphne Koller
  • Evan Parker
  • Sebastian Thrun

We describe a probabilistic framework for detection and modeling of doors from sensor data acquired in corridor environments with mobile robots. The framework captures shape, color, and motion properties of door and wall objects. The probabilistic model is optimized with a version of the expectation maximization algorithm, which segments the environment into door and wall objects and learns their properties. The framework allows the robot to generalize the properties of detected object instances to new object instances. We demonstrate the algorithm on real-world data acquired by a Pioneer robot equipped with a laser range finder and an omni-directional camera. Our results show that our algorithm reliably segments the environment into walls and doors, finding both doors that move and doors that do not move. We show that our approach achieves better results than models that only capture behavior, or only capture appearance.

NeurIPS Conference 2004 Conference Paper

Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

  • Haidong Wang
  • Eran Segal
  • Asa Ben-Hur
  • Daphne Koller
  • Douglas Brutlag

Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on a probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites.

1 Introduction

Many cellular functions are carried out through physical interactions between proteins. Discovering the protein interaction map can therefore help to better understand the workings of the cell. Indeed, there has been much work recently on developing high-throughput methods to produce a more complete map of protein-protein interactions [1, 2, 3]. Interactions between two proteins arise from physical interactions between small regions on the surface of the proteins [4] (see Fig. 2(b)). Finding interaction sites is an important task, which is of particular relevance to drug design. There is currently no high-throughput experimental method to achieve this goal, so computational methods are required. Existing methods either require solving a protein's 3D structure (e.g., [5]), and therefore are computationally very costly and not applicable on a genome-wide scale, or use known interaction sites as training data (e.g., [6]), which are relatively scarce and hence have poor coverage. Other work focuses on refining the highly noisy high-throughput interaction maps [7, 8, 9], or on assessing the confidence levels of the observed interactions [10].

In this paper, we propose a computational method for predicting protein interactions and the sites at which the interactions take place, which uses as input only high-throughput protein-protein interaction data and the protein sequences. In particular, our method assumes no knowledge of the 3D protein structure, or of the sites at which binding occurs.

[Figure 1: (a) Simple illustration of our assumptions for protein-protein interactions. The small elements denote motif occurrences on proteins, with red denoting active and gray denoting inactive motifs. (b) A fragment of our probabilistic model, for the proteins P1, P2, P5. We use yellow to denote an assignment of the value true, and black to denote the value false; full circles denote an assignment observed in the data, and patterned circles an assignment hypothesized by our algorithm. The dependencies involving inactive motif pairs were removed from the graph because they do not affect the rest of the model.]

Our approach is based on the assumption that interaction sites can be described using a limited repertoire of conserved sequence motifs [11]. This is a reasonable assumption since interaction sites are significantly more conserved than the rest of the protein surface [12]. Given a protein interaction map, our method tries to explain the observed interactions by identifying a set of sites of motif occurrence on every pair of interacting proteins through which the interaction is mediated. To understand the intuition behind our approach, consider the example of Fig. 1(a).
Here, the interaction pattern of the protein P1 can best be explained using the motif pair a, b, where a appears in P1 and b in the proteins P2, P3, P4 but not in P5. By contrast, the motif pair d, b is not as good an explanation, because d also appears in P5, which has a different interaction pattern. In general, our method aims to identify motif pairs that have high affinity, potentially leading to interaction between protein pairs that contain them. However, a sequence motif might be used for a different purpose, and not give rise to an active binding site; it might also be buried inside the protein, and thus be inaccessible for interaction. Thus, the appearance of an appropriate motif does not always imply interaction. A key feature of our approach is that we allow each motif occurrence in a protein to be either active or inactive. Interactions are then induced only by the interactions of high-affinity active motifs in the two proteins. Thus, in our example, the motif d in P2 is inactive, and hence does not lead to an interaction between P2 and P4, despite the affinity between the motif pair c, d. We note that Deng et al. [8] proposed a somewhat related method for genome-wide analysis of protein interaction data, based on protein domains. However, their method is focused on predicting protein-protein interactions and not on revealing the site of interaction, and they do not allow for the possibility that some domains are inactive.

Our goal is thus to identify two components: the affinities between pairs of motifs, and the activity of the occurrences of motifs in different proteins. Our algorithm addresses this problem by using the framework of Bayesian networks [13] and probabilistic relational models [14], which allows us to handle the inherent noise in the protein interaction data and the uncertain relationship between interactions and motif pairs.
We construct a model encoding our assumption that protein interactions are induced by the interactions of active motif pairs. We then use the EM algorithm [15] to fill in the details of the model, learning both the motif affinities and activities from the observed data of protein-protein interactions and protein motif occurrences. We address the computational complexity of the E-step in these large, densely connected models by using an approximate inference procedure based on branch-and-bound. We evaluated our model on protein-protein interactions in yeast and Prosite motifs [11]. As a basic performance measure, we evaluated the ability of our method to predict new protein-protein interactions, showing that it achieves better performance than several other models. In particular, our results validate our assumption that we can explain interactions via the interactions of active sequence motifs. More importantly, we analyze the ability of our method to discover the mechanism by which the interaction occurs. Finally, we examined co-crystallized protein pairs where the 3D structure of the interaction is known, so that we can determine the sites at which the interaction took place. We show that our active motifs are more likely to participate in interactions.

2 The Probabilistic Model

The basic entities in our probabilistic model are the proteins and the set of sequence motifs that can mediate protein interactions. Our model therefore contains a set of protein entities P = {P1, ..., Pn}, with the motifs that occur in them. Each protein P is associated with the set of motifs that occur in it, denoted by P.M. As we discussed, a key premise of our approach is that a specific occurrence of a sequence motif may or may not be active. Thus, each motif occurrence a ∈ P.M is associated with a binary-valued variable P.Aa, which takes the value true if a is active in protein P and false otherwise. We structure the prior probability as P(P.Aa = true) = min{0.8, (3 + 0.1|P.M|) / |P.M|}, to capture our intuition that the number of active motifs in a protein is roughly a constant fraction of the total number of motifs in the protein, but that even proteins with few motifs tend to have at least some number of active motifs.

A pair of active motifs in two proteins can potentially bind and induce an interaction between the corresponding proteins. Thus, in our model, a pair of proteins interact if each contains an active motif, and this pair of motifs bind to each other. The probability with which two motifs bind to each other is called their affinity. We encode this assumption by including in our model entities Tij corresponding to a pair of proteins Pi, Pj. For each pair of motifs a ∈ Pi.M and b ∈ Pj.M, we introduce a variable Tij.Aab, which is a deterministic AND of the activity of these two motifs. Intuitively, this variable represents whether the pair of motifs can potentially interact. The probability with which two active motif occurrences bind is their affinity. We model the binding event between two motif occurrences using a variable Tij.Bab, and define: P(Tij.Bab = true | Tij.Aab = true) = λab and P(Tij.Bab = true | Tij.Aab = false) = 0, where λab is the affinity between motifs a and b. This model reflects our assumption that two motif occurrences can bind only if they are both active, but their actual binding probability depends on their affinity. Note that this affinity is a feature of the motif pair and does not depend on the proteins in which they appear. We must also account for interactions that are not explained by our set of motifs, whether because of false positives in the data, or because of inadequacies of our model or of our motif set. Thus, we add a spurious binding variable Tij.S, for cases where an interaction between Pi and Pj exists, but cannot be explained well by our set of active motifs. The probability that a spurious binding occurs is given by P(Tij.S = true) = λS.
Finally, we observe an interaction between two proteins if and only if some form of binding occurs, whether by a motif pair or a spurious binding. Thus, we define a variable Tij.O, which represents whether protein i was observed to interact with protein j, to be a deterministic OR of all the binding variables Tij.S and Tij.Bab. Overall, Tij.O is a noisy-OR [13] of all motif pair variables Tij.Aab. Note that our model accounts for both types of errors in the protein interaction data. False negatives (missing interactions) in the data are addressed through the fact that the presence of an active motif pair only implies that binding takes place with some probability. False positives (wrong interactions) in the data are addressed through the introduction of the spurious interaction variables. The full model defines a joint probability distribution over the entire set of attributes:

P(P.A, T.A, T.B, T.S, T.O) =
  [ ∏_i ∏_{a ∈ Pi.M} P(Pi.Aa) ]
  × [ ∏_{ij} ∏_{a ∈ Pi.M, b ∈ Pj.M} P(Tij.Aab | Pi.Aa, Pj.Ab) P(Tij.Bab | Tij.Aab) ]
  × [ ∏_{ij} P(Tij.S) P(Tij.O | Tij.B, Tij.S) ]

where each of these conditional probability distributions is as specified above. We use θ to denote the entire set of model parameters {λab}_{a,b} ∪ {λS}. An instantiation of our probabilistic model is illustrated in Fig. 1(b).

3 Learning the Model

We now turn to the task of learning the model from the data. In a typical setting, we are given as input a protein interaction data set, specifying a set of proteins P and a set of observed interacting pairs T.O. We are also given a set of potentially relevant motifs, and the occurrences of these motifs in the different proteins in P. Thus, all the variables except for the O variables are hidden. Our learning task is thus twofold: we need to infer the values of the hidden variables, both the activity variables P.A, T.A, and the binding variables T.B, T.S; we also need to find a setting of the model parameters, which specify the motif affinities.
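For a single protein pair, the observation model above is a noisy-OR: an interaction is observed unless every active motif pair fails to bind and no spurious binding occurs. A minimal sketch (hypothetical names; the affinity table and the spurious-binding probability are assumed inputs):

```python
def p_interaction(active_i, active_j, affinity, lam_s):
    """P(O_ij = true) under the noisy-OR interaction model (a sketch).

    active_i, active_j: motifs that are active in proteins i and j.
    affinity[(a, b)]: binding probability lambda_ab for an active motif pair.
    lam_s: spurious-binding probability lambda_S.
    P(O = true) = 1 - (1 - lam_s) * prod over active pairs of (1 - lambda_ab).
    """
    p_no_bind = 1.0 - lam_s
    for a in active_i:
        for b in active_j:
            p_no_bind *= 1.0 - affinity.get((a, b), 0.0)
    return 1.0 - p_no_bind
```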
We use a variant of the EM algorithm [15] to find both an assignment to the parameters θ, and an assignment to the motif variables P.A, which is a local maximum of the likelihood function P(T.O, P.A | θ). Note that, to maximize this objective, we search for a MAP assignment to the motif activity variables, but sum out over the other hidden variables. This design decision is reasonable in our setting, where determining motif activities is an important goal; it is a key assumption for our computational procedure.

As in most applications of EM, our main difficulty arises in the E-step, where we need to compute the distribution over the hidden variables given the settings of the observed variables and the current parameter settings. In our model, any two motif variables (both within the same protein and across different proteins) are correlated, as there exists a path of influence between them in the underlying Bayesian network (see Fig. 1(c)). These correlations make the task of computing the posterior distribution over the hidden variables intractable, and we must resort to an approximate computation. While we could apply a general purpose approximate inference algorithm such as loopy belief propagation [16], such methods may not converge in a densely connected model such as this one, and there are few guarantees on the quality of the results even if they do converge. Fortunately, our model turns out to have additional structure that we can exploit. We now describe an approximate inference algorithm that is tailored to our model, and is guaranteed to converge to a (strong) local maximum.

Our first observation is that the only variables that correlate the different protein pairs Tij are the motif variables P.A. Given an assignment to these activity variables, the network decomposes into a set of independent subnetworks, one for each protein pair. Based on this observation, we divide our computation of the E-step into two parts.
In the first, we find an assignment to the motif variables in each protein, P.A; in the second, we compute the posterior probability over the binding motif pair variables T.B, T.S, given the assignment to the motif variables.

We begin by describing the second phase. We observe that, as all the motif pair variables, T.A, are fully determined by the motif variables, the only variables left to reason about are the binding variables T.B and T.S. The variables for any pair Tij are independent of the rest of the model given the instantiation to T.A and the interaction evidence. That fact, combined with the noisy-OR form of the interaction, allows us to compute the posterior probability required in the E-step exactly and efficiently. Specifically, the computation for the variables associated with a particular protein pair Tij is as follows, where we omit the common prefix Tij to simplify notation. If Aab = false, then P(Bab = true | Aab = false, O, θ) = 0. Otherwise, if Aab = true, then

P(Bab = true | A, O, θ) = P(Bab = true | A, θ) P(O | Bab = true, A, θ) / P(O | A, θ).

The first term in the numerator is simply the motif affinity λab; the second term is 1 if O = true and 0 otherwise. The denominator can easily be computed as P(O = true | A, θ) = 1 - (1 - λS) ∏_{a,b : Aab = true} (1 - λab). The computation for P(S) is very similar.

We now turn to the first phase, of finding a setting to all of the motif variables. Unfortunately, as we discussed, the model is highly interconnected, and finding an optimal joint setting to all of these variables P.A is intractable. We thus approximate finding this joint assignment using a method that exploits our specific structure. Our method iterates over proteins, finding in each iteration the optimal assignment to the motif variables of each protein given the current assignment to the motif activities in the remaining proteins. The process repeats, iterating over proteins, until convergence.
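The closed-form E-step computation just described can be sketched directly; names are hypothetical and this is an illustration of the formula, not the paper's code:

```python
def posterior_binding(active_pairs, lam, lam_s, o_observed):
    """E-step posterior P(B_ab = true | A, O) for one protein pair (a sketch).

    active_pairs: motif pairs (a, b) with A_ab = true; lam[(a, b)] = lambda_ab;
    lam_s = lambda_S. Inactive pairs have posterior 0. If no interaction is
    observed, no pair bound, so the posterior is also 0; otherwise Bayes'
    rule with the noisy-OR likelihood gives lambda_ab / P(O = true | A).
    """
    if not o_observed:
        return {ab: 0.0 for ab in active_pairs}
    p_none = 1.0 - lam_s
    for ab in active_pairs:
        p_none *= 1.0 - lam[ab]
    p_o = 1.0 - p_none  # P(O = true | A), the denominator above
    return {ab: lam[ab] / p_o for ab in active_pairs}
```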
As we discussed, the likelihood of each assignment to Pi.A can be easily computed using the method described above. However, the computation for each protein is still exponential in the number of motifs it contains, which can be large (e.g., 15). However, in our specific model, we can apply the following branch-and-bound algorithm (similar to an approach proposed by Henrion [17] for BN2O networks) to find the globally optimal assignment to the motif variables of each protein. The idea is that we search over the space of possible assignments Pi.A for one that maximizes the objective we wish to maximize. We can show that if making a motif active relative to one assignment does not improve the objective, it will also not improve the objective relative to a large set of other assignments.

More precisely, let f(Pi.A) = P(Pi.A, P-i.A | O, θ) denote the objective we wish to maximize, where P-i.A is the fixed assignment to motif variables in all proteins except Pi. Let Pi.A-a denote the assignment to all the motif variables in Pi except for Aa. We compute the ratio of f after we switch Pi.Aa from false to true. Let ha(Pj) = ∏_{b : Pj.Ab = true} (1 - λab) denote the probability that motif a does not bind with any active motif in Pj. We can now compute:

ρa(Pi.A-a) = f(Pi.Aa = true, Pi.A-a) / f(Pi.Aa = false, Pi.A-a)
           = (g / (1 - g))
             × ∏_{1 ≤ j ≤ n : Tij.O = false} ha(Pj)
             × ∏_{1 ≤ j ≤ n : Tij.O = true} [1 - (1 - λS) ha(Pj) ∏_{b ≠ a : Pi.Ab = true} hb(Pj)] / [1 - (1 - λS) ∏_{b ≠ a : Pi.Ab = true} hb(Pj)]   (1)

where g is the prior probability for a motif in protein Pi to be active.

Now, consider a different point in the search, where our current motif activity assignment is Pi.A′-a, which has all the active motifs in Pi.A-a and some additional ones. The first two terms in the product of Eq. (1) are the same for ρa(Pi.A-a) and ρa(Pi.A′-a). For the final term (the large fraction), one can show using some algebraic manipulation that this term in ρa(Pi.A′-a) is lower than that for ρa(Pi.A-a). We conclude that ρa(Pi.A′-a) ≤ ρa(Pi.A-a), and hence that:

if f(Pi.Aa = true, Pi.A-a) / f(Pi.Aa = false, Pi.A-a) ≤ 1, then f(Pi.Aa = true, Pi.A′-a) / f(Pi.Aa = false, Pi.A′-a) ≤ 1.

It follows that, if switching motif a from inactive to active relative to Pi.A decreases f, it will also decrease f if we have some additional active motifs.

We can exploit this property in a branch-and-bound algorithm in order to find the globally optimal assignment Pi.A. Our algorithm keeps a set V of viable candidates for motif assignments. For presentation, we encode assignments via the set of active motifs they contain. Initially, V contains only the empty assignment {}. We start out by considering motif assignments with a single active motif. We put such an assignment {a} in V if its f-score is higher than f({}). Now, we consider assignments {a, b} that have two active motifs. We consider {a, b} only if both {a} and {b} are in V. If so, we evaluate its f-score, and add it to V if this score is greater than that of {a} and {b}. Otherwise, we throw it away. We continue this process for all assignments of size k: for each assignment with active motif set S, we test whether S - {a} ∈ V for all a ∈ S; if so, we compare f(S) to each f(S - {a}), and add it if it dominates all of them. The algorithm terminates when, for some k, no assignment of size k is saved.

To understand the intuition behind this pruning procedure, consider a candidate assignment {a, b, c, d}, and assume that {a, b, c} ∈ V, but {b, c, d} ∉ V. In this case, we must have that {b, c} ∈ V, but adding d to that assignment reduces the f-score. In this case, as shown by our analysis, adding d to the superset {a, b, c} would also reduce the f-score. This algorithm is still exponential in the worst case. However, in our setting, a protein with many motifs has a low prior probability that each of them is active. Hence, adding new motifs is less likely to increase the f-score, and the algorithm tends to terminate quickly.
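The level-wise pruning just described can be sketched with f treated as a black-box objective over sets of active motifs. This is a toy reconstruction, not the paper's implementation, and it only finds the global optimum when f satisfies the dominance property derived above:

```python
def prune_search(motifs, f):
    """Level-wise branch-and-bound over active-motif sets (a sketch).

    f maps a frozenset of active motifs to its (hypothetical) objective
    value. A set S of size k survives only if every subset S - {a} survived
    level k-1 and f(S) beats f(S - {a}) for all a in S.
    Returns (best surviving set, its f-score).
    """
    viable = {frozenset(): f(frozenset())}
    best = frozenset()
    k = 1
    while True:
        level = {}
        for s_prev in [s for s in viable if len(s) == k - 1]:
            for a in motifs:
                if a in s_prev:
                    continue
                s = s_prev | {a}
                if s in level:
                    continue
                subs = [s - {b} for b in s]
                # Dominance check: all one-smaller subsets viable and beaten.
                if all(sub in viable for sub in subs):
                    v = f(s)
                    if all(v > viable[sub] for sub in subs):
                        level[s] = v
        if not level:  # no assignment of size k saved: terminate
            return best, viable[best]
        viable.update(level)
        best = max(viable, key=viable.get)
        k += 1
```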
As we show in Section 4, this algorithm significantly reduces the cost of our procedure. Our E-step finds an assignment to P.A which is a strong local optimum of the objective function P(P.A | T.O, θ): the assignment has higher probability than any assignment that changes the motif variables of any single protein. For that assignment, our algorithm also computes the distribution over all of the binding variables, as described above. Using this completion, we can now easily compute the expected sufficient statistics for the different parameters in the model. As each of these parameters is a simple binomial distribution, the maximum-likelihood estimation in the M-step is entirely standard; we omit the details.
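The candidate-set construction above can be sketched in code. This is a minimal sketch under simplifying assumptions: `f_score` is a toy stand-in for the actual objective f(Pi.A), which the real algorithm would evaluate through the probabilistic model.

```python
from itertools import combinations

def viable_assignments(motifs, f_score):
    """Level-by-level search over sets of active motifs, following the
    pruning rule above: a set S of size k is evaluated only if every
    S - {a} survived, and kept only if f(S) beats f of all those subsets."""
    empty = frozenset()
    V = {empty: f_score(empty)}          # viable candidates and their scores
    best_set, best_val = empty, V[empty]
    k, alive = 0, True
    while alive:
        k += 1
        alive = False
        for S in combinations(motifs, k):
            S = frozenset(S)
            subsets = [S - {a} for a in S]
            if not all(T in V for T in subsets):
                continue                  # some subset was already pruned
            val = f_score(S)
            if all(val > V[T] for T in subsets):
                V[S] = val
                alive = True
                if val > best_val:
                    best_set, best_val = S, val
    return best_set, best_val
```

With an additive toy score over three motifs where one motif has negative weight, the search evaluates {a}, {b}, {a, b} and prunes every assignment containing the unhelpful motif.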

UAI Conference 2004 Conference Paper

Recovering Articulated Object Models from 3D Range Data

  • Dragomir Anguelov
  • Daphne Koller
  • Hoi-Cheung Pang
  • Praveen Srinivasan
  • Sebastian Thrun

We address the problem of unsupervised learning of complex articulated object models from 3D range data. We describe an algorithm whose input is a set of meshes corresponding to different configurations of an articulated object. The algorithm automatically recovers a decomposition of the object into approximately rigid parts, the location of the parts in the different object instances, and the articulated object skeleton linking the parts. Our algorithm first registers all the meshes using an unsupervised non-rigid technique described in a companion paper. It then segments the meshes using a graphical model that captures the spatial contiguity of parts. The segmentation is done using the EM algorithm, iterating between finding a decomposition of the object into rigid parts, and finding the location of the parts in the object instances. Although the graphical model is densely connected, the object decomposition step can be performed optimally and efficiently, allowing us to identify a large number of object parts while avoiding local maxima. We demonstrate the algorithm on real-world datasets, recovering a 15-part articulated model of a human puppet from just 7 different puppet configurations, as well as a 4-part model of a flexing arm where significant non-rigid deformation was present.
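The alternation described in the abstract (decompose into rigid parts, then locate the parts) can be sketched as a simple EM-style loop. This is a simplified illustration, not the paper's method: it uses a least-squares (Kabsch) rigid fit in 2D with exactly two parts, an ad-hoc anchor-based initialization, and omits the spatial-contiguity graphical model entirely.

```python
import numpy as np

def fit_rigid(P, Q):
    """Least-squares rigid transform (R, t) with Q ≈ P @ R.T + t (Kabsch)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    D = np.diag([1.0] * (P.shape[1] - 1) + [float(d)])
    R = Vt.T @ D @ U.T
    return R, cQ - cP @ R.T

def segment_rigid_parts(X, Y, iters=10):
    """Alternate between assigning each point to the rigid transform that
    best explains its motion (E-step) and refitting the per-part transforms
    (M-step). Two parts, initialized from two anchor displacements."""
    disp = Y - X
    anchors = np.stack([disp[0], disp[-1]])
    labels = np.linalg.norm(disp[:, None, :] - anchors[None], axis=2).argmin(axis=1)
    for _ in range(iters):
        parts = [fit_rigid(X[labels == p], Y[labels == p]) for p in (0, 1)]
        residuals = np.stack([np.linalg.norm(Y - (X @ R.T + t), axis=1)
                              for R, t in parts], axis=1)
        labels = residuals.argmin(axis=1)
    return labels
```

On synthetic data where two point groups undergo different translations, the loop recovers the two rigid parts from motion alone.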

NeurIPS Conference 2004 Conference Paper

The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces

  • Dragomir Anguelov
  • Praveen Srinivasan
  • Hoi-cheung Pang
  • Daphne Koller
  • Sebastian Thrun
  • James Davis

We present an unsupervised algorithm for registering 3D surface scans of an object undergoing significant deformations. Our algorithm does not need markers, nor does it assume prior knowledge about object shape, the dynamics of its deformation, or scan alignment. The algorithm registers two meshes by optimizing a joint probabilistic model over all point-to-point correspondences between them. This model enforces preservation of local mesh geometry, as well as more global constraints that capture the preservation of geodesic distance between corresponding point pairs. The algorithm applies even when one of the meshes is an incomplete range scan; thus, it can be used to automatically fill in the remaining surfaces for this partial scan, even if those surfaces were previously only seen in a different configuration. We evaluate the algorithm on several real-world datasets, where we demonstrate good results in the presence of significant movement of articulated parts and non-rigid surface deformation. Finally, we show that the output of the algorithm can be used for compelling computer graphics tasks such as interpolation between two scans of a non-rigid object and automatic recovery of articulated object models.

1 Introduction

The construction of 3D object models is a key task for many graphics applications. It is becoming increasingly common to acquire these models from a range scan of a physical object. This paper deals with an important subproblem of this acquisition task -- the problem of registering two deforming surfaces corresponding to different configurations of the same non-rigid object. The main difficulty in the 3D registration problem is determining the correspondences of points on one surface to points on the other. Local regions on the surface are rarely distinctive enough to determine the correct correspondence, whether because of noise in the scans, or because of symmetries in the object shape. 
Thus, the set of candidate correspondences to a given point is usually large. Determining the correspondence for all object points results in a combinatorially large search problem.

(A results video is available at http://robotics.stanford.edu/drago/cc/video.mp4)

Figure 1: A) Registration results for two meshes. Nonrigid ICP and its variant augmented with spin images get stuck in local maxima. Our CC algorithm produces a largely correct registration, although with an artifact in the right shoulder (inset). B) Illustration of the link deformation process. C) The CC algorithm using only deformation potentials can violate mesh geometry: near regions can map to far ones (segment AB), and far regions can map to near ones (points C, D).

The existing algorithms for deformable surface registration make the problem tractable by assuming significant prior knowledge about the objects being registered. Some rely on the presence of markers on the object [1, 20], while others assume prior knowledge about the object dynamics [16], or about the space of non-rigid deformations [15, 5]. Algorithms that make neither restriction [18, 12] simplify the problem by decorrelating the choice of correspondences for the different points in the scan. However, this approximation is only good when the object deformation is small; otherwise, it results in poor local maxima as nearby points in one scan are allowed to map to far-away points in the other. Our algorithm defines a joint probabilistic model over all correspondences, which explicitly models the correlations between them -- specifically, that nearby points in one mesh should map to nearby points in the other. Importantly, the notion of "nearby" used in our model is defined in terms of geodesic distance over the mesh. 
We define a probabilistic model over the set of correspondences that encodes these geodesic distance constraints as well as penalties for link twisting and stretching, and high-level local surface features [14]. We then apply loopy belief propagation [21] to this model, in order to solve for the entire set of correspondences simultaneously. The result is a registration that respects the surface geometry. To the best of our knowledge, the algorithm we present in this paper is the first algorithm which allows the registration of 3D surfaces of an object where the object configurations can vary significantly, there is no prior knowledge about object shape or dynamics of deformation, and nothing whatsoever is known about the object alignment. Moreover, unlike many methods, our algorithm can be used to register a partial scan to a complete model, greatly increasing its applicability. We apply our approach to three datasets containing 3D scans of a wooden puppet, a human arm and entire human bodies in different configurations. We demonstrate good registration results for scan pairs exhibiting articulated motion, non-rigid deformations, or both. We also describe three applications of our method. In our first application, we show how a partial scan of an object can be registered onto a fully specified model in a different configuration. The resulting registration allows us to use the model to "complete" the partial scan in a way that preserves the local surface geometry. In the second, we use the correspondences found by our algorithm to smoothly interpolate between two different poses of an object. In our final application, we use a set of registered scans of the same object in different positions to recover a decomposition of the object into approximately rigid parts, and recover an articulated skeleton linking the parts. 
All of these applications are done in an unsupervised way, using only the output of our Correlated Correspondence algorithm applied to pairs of poses with widely varying deformations, and unknown initial alignments. These results demonstrate the value of a high-quality solution to the registration problem to a range of graphics tasks.

2 Previous Work

Surface registration is a fundamental building block in computer graphics. The classical solution for registering rigid surfaces is the Iterative Closest Point algorithm (ICP) [4, 6, 17]. Recently, there has been work extending ICP to non-rigid surfaces [18, 8, 12, 1]. These algorithms treat one of the scans (usually a complete model of the surface) as a deformable template. The links between adjacent points on the surface can be thought of as springs, which are allowed to deform at a cost. Similarly to ICP, these algorithms iterate between two subproblems -- estimating the non-rigid transformation Θ and estimating the set of correspondences C between the scans. The step estimating the correspondences assumes that a good estimate of the non-rigid transformation is available. Under this assumption, the assignments to the correspondence variables become decorrelated: each point in the second scan is associated with the nearest point (in the Euclidean-distance sense) in the deformed template scan. However, the decomposition also induces the algorithm's main limitation. By assigning points in the second scan to points on the deformed model independently, nearby points in the scan can get associated to remote points in the model if the estimate of Θ is poor (Fig. 1A). While several approaches have been proposed to address this problem of incorrect correspondences, their applicability is largely limited to problems where the deformation is local, and the initial alignment is approximately correct. Another line of related work is the work on deformable template matching in the computer vision community. 
In the 3D case, this framework is used for detection of articulated object models in images [13, 22, 19]. The algorithms assume that the decomposition of the object into a relatively small number of parts is known, and that a detector for each object part is available. Template matching approaches have also been applied to deformable 2D objects, where very efficient solutions exist [9, 11]. However, these methods do not extend easily to the case of 3D surfaces.

3 The Correlated Correspondence Algorithm

The input to the algorithm is a set of two meshes (surfaces tessellated into polygons). The model mesh X = (V^X, E^X) is a complete model of the object in a particular pose. V^X = (x1, ..., xN) denotes the mesh points, while E^X is the set of links between adjacent points on the mesh surface. The data mesh Z = (V^Z, E^Z) is either a complete model or a partial view of the object in a different configuration. Each data mesh point zk is associated with a correspondence variable ck, specifying the corresponding model mesh point. The task of registration is one of estimating the set of all correspondences C and a non-rigid transformation Θ which aligns the corresponding points.

3.1 Probabilistic Model

We formulate the registration problem as one of finding an embedding of the data mesh Z into the model mesh X, which is encoded as an assignment to all correspondence variables C = (c1, ..., cK). The main idea behind our approach is to preserve the consistency of the embedding by explicitly correlating the assignments to the correspondence variables. We define a joint distribution over the correspondence variables c1, ..., cK, represented as a Markov network. For each pair of adjacent data mesh points zk, zl, we want to define a probabilistic potential ψ(ck, cl) that constrains this pair of correspondences to be reasonable and consistent. 
This gives rise to a joint probability distribution of the form

p(C) = (1/Z) · ∏_k ψ(ck) · ∏_{k,l} ψ(ck, cl)

which contains only single and pairwise potentials. Performing probabilistic inference to find the most likely joint assignment to the entire set of correspondence variables C should yield a good and consistent registration.

Deformation Potentials. We want our model to encode a preference for embeddings of mesh Z into mesh X which minimize the amount of deformation induced by the embedding. In order to quantify the amount of deformation applied to the model, we follow the ideas of Hahnel et al. [12] and treat the links in the set E^X as springs, which resist stretching and twisting at their endpoints. Stretching is easily quantified by looking at changes in the link length induced by the transformation. Link twisting, however, is ill-specified by looking at the Cartesian coordinates of the points alone. Following [12], we attach an imaginary local coordinate system to each point on the model. This local coordinate system allows us to quantify the "twist" of a point xj relative to a neighbor xi. A non-rigid transformation Θ defines, for each point xi, a translation of its coordinates and a rotation of its local coordinate system. To evaluate the deformation penalty, we parameterize each link in the model in terms of its length and its direction relative to its endpoints (see Fig. 1B). Specifically, we define li,j to be the distance between xi and xj; dij is a unit vector denoting the direction of the point xj in the coordinate system of xi (and vice versa). We use ei,j to denote the set of edge parameters (li,j, dij, dji). It is now straightforward to specify the penalty for model deformations. Let Θ be a transformation, and let ~ei,j denote the triple of parameters associated with the link between xi and xj after applying Θ. 
Our model penalizes twisting and stretching, using a separate zero-mean Gaussian noise model for each:

P(~ei,j | ei,j) = P(~li,j | li,j) · P(~dij | dij) · P(~dji | dji)    (1)

In the absence of prior information, we assume that all links are equally likely to deform. In order to quantify the deformation induced by an embedding C, we include a potential ψd(ck, cl) for each link e^Z_{k,l} ∈ E^Z. Each value ψd(ck = i, cl = j) corresponds to the deformation penalty incurred by deforming model link ei,j to generate link e^Z_{k,l}, and is defined in (1). We do not restrict ourselves to the set of links in E^X, since the original mesh tessellation is sparse and local; any two points in X are allowed to implicitly define a link. Unfortunately, we cannot directly estimate the quantity P(e^Z_{k,l} | ei,j), since the link parameters e^Z_{k,l} depend on knowing the non-rigid transformation Θ, which is not given as part of the input. The key issue is estimating the (unknown) relative rotation of the link endpoints. In effect, this rotation is an additional latent variable, which must also be part of the probabilistic model. To remain within the realm of discrete Markov networks, allowing the application of standard probabilistic inference algorithms, we discretize the space of possible rotations and fold it into the domains of the correspondence variables. For each possible value ck = i of a correspondence variable, we select a small set of candidate rotations consistent with local geometry. We do this by aligning local patches around the points xi and zk using rigid ICP. We extend the domain of each correspondence variable ck so that each value encodes a matching point and a particular rotation from the precomputed set for that point. Now the edge parameters e^Z_{k,l} are fully determined, and so is the probabilistic potential.

Geodesic Distances. 
Our proposed approach raises the question of what constitutes the best constraint between neighboring correspondence variables. The literature on scan registration -- for rigid and non-rigid models alike -- relies on preserving Euclidean distance. While Euclidean distance is meaningful for rigid objects, it is very sensitive to deformations, especially those induced by moving parts. For example, in Fig. 1C, we see that the two legs in one configuration of our puppet are fairly close together, allowing the algorithm to map two adjacent points in the data mesh to the two separate legs, with minimal deformation penalty. In the complementary situation, especially when object symmetries are present, two distant yet similar points in one scan might get mapped to the same region in the other. For example, in the same figure, we see that points in both an arm and a leg in the data mesh get mapped to a single leg in the model mesh. We therefore want to enforce constraints preserving distance along the mesh surface (geodesic distance). Our probabilistic framework easily incorporates such constraints as correlations between pairs of correspondence variables.

Figure 2: A) Automatic interpolation between two scans of an arm and a wooden puppet. B) Registration results on two scans of the same man sitting and standing up (select points were displayed). C) Registration results on scans of a larger man and a smaller woman. The algorithm is robust to small changes in object scale.

We encode a nearness preservation constraint which prevents adjacent points in mesh Z from being mapped to distant points in X in the geodesic-distance sense. For adjacent points zk, zl in the data mesh, we define the following potential:

ψn(ck = i, cl = j) = 0 if distGeodesic(xi, xj) > α·ε, and 1 otherwise    (2)

where ε is the data mesh resolution and α is some constant, chosen to be 3.5. The farness preservation potentials encode the complementary constraint. 
For every pair of points zk, zl whose geodesic distance is more than 5ε on the data mesh, we have a potential:

ψf(ck = i, cl = j) = 0 if distGeodesic(xi, xj) < β·ε, and 1 otherwise    (3)

where β is also a constant, chosen to be 2 in our implementation. The intuition behind this constraint is fairly clear: if zk, zl are far apart on the data mesh, then their corresponding points must be far apart on the model mesh.

Local Surface Signatures. Finally, we encode a set of potentials that correspond to the preservation of local surface properties between the model mesh and data mesh. The use of local surface signatures is important, because it helps to guide the optimization in the exponential space of assignments. We use spin images [14] compressed with principal component analysis to produce a low-dimensional signature sx of the local surface geometry around a point x. When data and model points correspond, we expect their local signatures to be similar. We introduce a potential ψs(ck = i) that enforces a zero-mean Gaussian penalty for discrepancies between the signatures s_xi and s_zk.

3.2 Optimization

In the previous section, we defined a Markov network, which encodes a joint probability distribution over the correspondence variables as a product of single and pairwise potentials. Our goal is to find a joint assignment to these variables that maximizes this probability. This problem is one of standard probabilistic inference over the Markov network. However, the Markov network is quite large, and contains a large number of loops, so exact inference is computationally infeasible. We therefore apply an approximate inference method known as loopy belief propagation (LBP) [21], which has been shown to work in a wide variety of applications. Running LBP until convergence results in a set of probabilistic assignments to the different correspondence variables, which are locally consistent. 
We then simply extract the most likely assignment for each variable to obtain a correspondence. One remaining complication arises from the form of our farness preservation constraints. In general, most pairs of points in the mesh are not close, so the total number of such potentials grows as O(M^2), where M is the number of points in the data mesh. However, rather than introducing all these potentials into the Markov net from the start, we introduce them as needed. First, we run LBP without any farness preservation potentials. If the solution violates a set of farness preservation constraints, we add those constraints and rerun BP. In practice, this approach adds a very small number of such constraints.

4 Experimental Results

Basic Registration. We applied our registration algorithm to three different datasets, containing meshes of a human arm, a wooden puppet, and the CAESAR dataset of whole human bodies [1], all acquired by a 3D range scanner. The meshes were not complete surfaces, but several techniques exist for filling the holes (e.g., [10]). We ran the Correlated Correspondence algorithm using the same probabilistic model and the same parameters on all data sets. We use a coarse-to-fine strategy, using the result of a coarse sub-sampling of the mesh surface to constrain the correspondences at a finer-grained level. The resulting set of correspondences was used as markers to initialize the non-rigid ICP algorithm of Hahnel et al. [12]. The Correlated Correspondence algorithm successfully aligned all mesh pairs in our human arm data set containing 7 arms. In the puppet data set we registered one of the meshes to the remaining 6 puppets. The algorithm correctly registered 4 out of 6 data meshes to the model mesh. In the two remaining cases, the algorithm produced a registration where the torso was flipped, so that the front was mapped to the back. This problem arises from ambiguities induced by the puppet symmetry, whose front and back are almost identical. 
Importantly, our probabilistic model assigns a higher likelihood score to the correct solution, so that the incorrect registration is a consequence of local maxima in the LBP algorithm. This fact allows us to address the issue in an unsupervised way simply by running loopy BP several times, with different initializations. For details on the unsupervised initialization scheme we used, please refer to our technical report [2]. We ran the modified algorithm to register one puppet mesh to the remaining 6 meshes in the dataset, obtaining the correct registration in all cases. In particular, as shown in Fig. 1A, we successfully deal with the case on which the straightforward nonrigid ICP algorithm failed. The modified algorithm was applied to the CAESAR dataset and produced very good registrations for challenging cases exhibiting both articulated motion and deformation (Fig. 2B), or exhibiting deformation and a (small) change in object scale (Fig. 2C). Overall, the algorithm performed robustly, producing close-to-optimal registrations even for pairs of meshes that involve large deformations, articulated motion, or both. The registration is accomplished in an unsupervised way, without any prior knowledge about object shape, dynamics, or alignment.

Partial view completion. The Correlated Correspondence algorithm allows us to register a data mesh containing only a partial scan of an object to a known complete surface model of the object, which serves as a template. We can then transform the template mesh to the partial scan, a process which leaves undisturbed the links that are not involved in the partial mesh. The result is a mesh that matches the data on the observed points, while completing the unknown portion of the surface using the template. We take a partial mesh, which is missing the entire back part of the puppet in a particular pose. The resulting partial model is displayed in Fig. 
3B-1; for comparison, the correct complete model in this configuration (which was not available to the algorithm) is shown in Fig. 3B-2. We register the partial mesh to models of the object in a different pose (Fig. 3B-3), and compare the completions we obtain (Fig. 3B-4) to the ground truth in Fig. 3B-2. The result demonstrates a largely correct reconstruction of the complete surface geometry from the partial scan and the deformed template. We report additional shape completion results in [2].

Figure 3: A) The results produced by the CC algorithm were used for unsupervised recovery of articulated models. 15 puppet parts and 4 arm parts, as well as the articulated object skeletons, were recovered. B) Partial view completion results. The missing parts of the surface were estimated by registering the partial view to a complete model of the object in a different configuration.

Interpolation. Current research [20] shows that if a non-rigid transformation between the poses is available, believable animation can be produced by linear interpolation between the model mesh and the transformed model mesh. The interpolation is performed in the space of local link parameters (li,j, dij, dji). We demonstrate that transformation estimates produced by our algorithm can be used to automatically generate believable animation sequences between fairly different poses, as shown in Fig. 2A.

Recovering Articulated Models. Articulated object models have a number of applications in animation and motion capture, and there has been work on recovering them automatically from 3D data [7, 3]. We show that our unsupervised registration capability can greatly assist articulated model recovery. In particular, the algorithm in [3] requires an estimate of the correspondences between a template mesh and the remaining meshes in the dataset. We supplied it with registrations computed with the Correlated Correspondence algorithm. 
As a result, we managed to recover, in a completely unsupervised way, all 15 rigid parts of the puppet, as well as the joints between them (Fig. 3A). We demonstrate successful articulation recovery even for objects which are not purely rigid, as is the case with the human arm (see Fig. 3A).
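The inference step used throughout the paper, max-product loopy belief propagation over a pairwise Markov network, can be sketched generically. In this sketch the potentials are arbitrary dense tables rather than the paper's deformation and geodesic terms, and the synchronous update schedule and message normalization are illustrative choices.

```python
import numpy as np

def max_product_lbp(unary, edges, pair, iters=50):
    """Max-product loopy belief propagation on a pairwise MRF.
    unary: {node: np.array of potentials}; edges: list of (u, v) pairs;
    pair: {(u, v): np.array of shape (|u|, |v|)}. Returns an assignment
    by maximizing each node's max-belief."""
    nbrs = {v: [] for v in unary}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    # messages in both directions, initialized uniformly
    msgs = {}
    for u, v in edges:
        msgs[(u, v)] = np.ones(len(unary[v]))
        msgs[(v, u)] = np.ones(len(unary[u]))
    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            psi = pair[(u, v)] if (u, v) in pair else pair[(v, u)].T
            b = unary[u].copy()
            for w in nbrs[u]:
                if w != v:
                    b = b * msgs[(w, u)]
            m = (psi * b[:, None]).max(axis=0)
            new[(u, v)] = m / m.max()     # normalize for numerical stability
        msgs = new
    return {v: int((unary[v] * np.prod([msgs[(w, v)] for w in nbrs[v]],
                                       axis=0)).argmax())
            for v in unary}
```

On a tree-structured model (such as a small chain), max-product is exact, which makes a sanity check against brute-force enumeration easy.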

AIIM Journal 2004 Journal Article

Understanding tuberculosis epidemiology using structured statistical models

  • Lise Getoor
  • Jeanne T Rhee
  • Daphne Koller
  • Peter Small

Molecular epidemiological studies can provide novel insights into the transmission of infectious diseases such as tuberculosis. Typically, risk factors for transmission are identified using traditional hypothesis-driven statistical methods such as logistic regression. However, limitations become apparent in these approaches as the scope of these studies expands to include additional epidemiological and bacterial genomic data. Here we examine the use of Bayesian models to analyze tuberculosis epidemiology. We begin by exploring the use of Bayesian networks (BNs) to identify the distribution of tuberculosis patient attributes (including demographic and clinical attributes). Using existing algorithms for constructing BNs from observational data, we learned a BN from data about tuberculosis patients collected in San Francisco from 1991 to 1999. We verified that the resulting probabilistic models did in fact capture known statistical relationships. Next, we examine the use of newly introduced methods for representing and automatically constructing probabilistic models in structured domains. We use statistical relational models (SRMs) to model distributions over relational domains. SRMs are ideally suited to richly structured epidemiological data. We use a data-driven method to construct a statistical relational model directly from data stored in a relational database. The resulting model reveals the relationships between variables in the data and describes their distribution. We applied this procedure to the data on tuberculosis patients in San Francisco from 1991 to 1999, their Mycobacterium tuberculosis strains, and data on contact investigations. The resulting statistical relational model corroborated previously reported findings and revealed several novel associations. These models illustrate the potential for this approach to reveal relationships within richly structured data that may not be apparent using conventional statistical approaches. 
We show that Bayesian methods, in particular statistical relational models, are an important tool for understanding infectious disease epidemiology.
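As a toy illustration of the parameter-estimation step in learning such models, conditional probability tables can be fit by counting. The attribute names below (`hiv`, `clustered`) and the add-one smoothing are hypothetical illustrations, not taken from the study, and this stands in only for one small piece of BN/SRM learning.

```python
from collections import Counter

def fit_cpt(records, child, parents):
    """Estimate P(child | parents) by counting, with add-one smoothing.
    `records` is a list of dicts of attribute values; returns a table
    mapping (parent values..., child value) -> probability."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
    marg = Counter(tuple(r[p] for p in parents) for r in records)
    values = sorted({r[child] for r in records})
    return {pa + (v,): (joint[(pa, v)] + 1) / (marg[pa] + len(values))
            for pa in marg for v in values}
```

Each row of the resulting table sums to one over the child's values for a fixed parent configuration.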

IJCAI Conference 2003 Conference Paper

A Continuation Method for Nash Equilibria in Structured Games

  • Ben Blum
  • Christian R. Shelton
  • Daphne Koller

We describe algorithms for computing Nash equilibria in structured game representations, including both graphical games and multi-agent influence diagrams (MAIDs). The algorithms are derived from a continuation method for normal-form and extensive-form games due to Govindan and Wilson; they follow a trajectory through the space of perturbed games and their equilibria. Our algorithms exploit game structure through fast computation of the Jacobian of the game's payoff function. They are guaranteed to find at least one equilibrium of the game and may find more. Our approach provides the first algorithm for computing an exact equilibrium in graphical games with arbitrary topology, and the first algorithm to exploit fine-grained structural properties of MAIDs. We present experimental results for our algorithms. The running time for our graphical game algorithm is similar to, and often better than, the running time of previous approximate algorithms. Our algorithm for MAIDs can effectively solve games that are much larger than those that could be solved using previous methods.

IJCAI Conference 2003 Conference Paper

FastSLAM 2.0: An Improved Particle Filtering Algorithm for Simultaneous Localization and Mapping that Provably Converges

  • Mike Montemerlo
  • Sebastian Thrun
  • Daphne Koller
  • Ben Wegbreit

In [15], Montemerlo et al. proposed an algorithm called FastSLAM as an efficient and robust solution to the simultaneous localization and mapping problem. This paper describes a modified version of FastSLAM that overcomes important deficiencies of the original algorithm. We prove convergence of this new algorithm for linear SLAM problems and provide real-world experimental results that illustrate an order of magnitude improvement in accuracy over the original FastSLAM algorithm.
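The particle-filtering core that FastSLAM builds on (sample, weight, resample) can be sketched on a toy 1-D localization problem. This is not FastSLAM itself, which additionally maintains per-particle landmark estimates; the landmark position, noise level, and Gaussian measurement model below are illustrative assumptions.

```python
import math
import random

def particle_filter(motions, observations, landmark, n=2000, noise=0.5, seed=1):
    """Sample-propagate-weight-resample loop for 1-D localization against
    a single known landmark; observations are noisy distances to it."""
    rng = random.Random(seed)
    particles = [0.0] * n                 # all particles start at the origin
    for u, z in zip(motions, observations):
        # propagate each particle through the (noisy) motion model
        particles = [x + u + rng.gauss(0.0, noise) for x in particles]
        # weight by the Gaussian likelihood of the observed distance
        weights = [math.exp(-0.5 * ((abs(landmark - x) - z) / noise) ** 2)
                   for x in particles]
        # resample in proportion to weight
        particles = rng.choices(particles, weights=weights, k=n)
    return sum(particles) / n             # posterior mean estimate
```

With three unit moves and distance readings consistent with a landmark at 10, the posterior mean lands near the true position of 3.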

UAI Conference 2003 Conference Paper

Learning Continuous Time Bayesian Networks

  • Uri Nodelman
  • Christian R. Shelton
  • Daphne Koller

Continuous time Bayesian networks (CTBNs) describe structured stochastic processes with finitely many states that evolve over continuous time. A CTBN is a directed (possibly cyclic) dependency graph over a set of variables, each of which represents a finite state continuous time Markov process whose transition model is a function of its parents. We address the problem of learning parameters and structure of a CTBN from fully observed data. We define a conjugate prior for CTBNs, and show how it can be used both for Bayesian parameter estimation and as the basis of a Bayesian score for structure learning. Because acyclicity is not a constraint in CTBNs, we can show that the structure learning problem is significantly easier, both in theory and in practice, than structure learning for dynamic Bayesian networks (DBNs). Furthermore, as CTBNs can tailor the parameters and dependency structure to the different time granularities of the evolution of different variables, they can provide a better fit to continuous-time processes than DBNs with a fixed time granularity.
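The fully observed parameter-estimation setting can be illustrated for a single continuous-time Markov process. The sketch below computes the classical maximum-likelihood intensity matrix from the sufficient statistics (dwell times and transition counts); the paper's Bayesian estimator with conjugate priors, and the conditioning of each variable's process on its parents, are omitted.

```python
from collections import defaultdict

def mle_intensity(trajectory):
    """MLE of a CTMC intensity matrix from a fully observed trajectory,
    given as a list of (state, dwell_time) pairs. Sufficient statistics:
    total time T[s] in each state and transition counts M[s][s']."""
    T = defaultdict(float)
    M = defaultdict(lambda: defaultdict(int))
    for (s, dt), (s_next, _) in zip(trajectory, trajectory[1:]):
        T[s] += dt
        M[s][s_next] += 1
    last_state, last_dt = trajectory[-1]
    T[last_state] += last_dt
    Q = {}
    for s in T:
        Q[s] = {s2: M[s][s2] / T[s] for s2 in M[s]}
        Q[s][s] = -sum(Q[s].values())     # rows of an intensity matrix sum to 0
    return Q
```

For example, a trajectory that alternates between two states yields off-diagonal rates equal to transitions divided by total dwell time in the source state.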

UAI Conference 2003 Conference Paper

Learning Module Networks

  • Eran Segal
  • Dana Pe'er
  • Aviv Regev
  • Daphne Koller
  • Nir Friedman

Methods for learning Bayesian network structure can discover dependency structure between observed variables, and have been shown to be useful in many applications. However, in domains that involve a large number of variables, the space of possible network structures is enormous, making it difficult, for both computational and statistical reasons, to identify a good model. In this paper, we consider a solution to this problem, suitable for domains where many variables have similar behavior. Our method is based on a new class of models, which we call module networks. A module network explicitly represents the notion of a module --- a set of variables that have the same parents in the network and share the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns a module network from data. The algorithm learns both the partitioning of the variables into modules and the dependency structure between the variables. We evaluate our algorithm on synthetic data, and on real data in the domains of gene expression and the stock market. Our results show that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks.
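The notion of a shared module CPD can be illustrated with a toy scoring function. This sketch pools the samples of all variables assigned to a module into one Gaussian and returns the data log-likelihood; it is a stand-in only, with none of the paper's structure search, parent sets, or Bayesian scoring.

```python
import math

def module_loglik(data_by_var, assignment, n_modules):
    """Gaussian log-likelihood where all variables assigned to a module
    pool their samples into one shared distribution (the module's CPD)."""
    pooled = {m: [] for m in range(n_modules)}
    for var, m in assignment.items():
        pooled[m].extend(data_by_var[var])
    ll = 0.0
    for samples in pooled.values():
        if not samples:
            continue
        mu = sum(samples) / len(samples)
        var = sum((x - mu) ** 2 for x in samples) / len(samples) + 1e-6
        ll += sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
                  for x in samples)
    return ll
```

A partition that groups variables with similar behavior scores higher than one that mixes dissimilar variables, which is the signal a module-assignment search would exploit.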

NeurIPS Conference 2003 Conference Paper

Link Prediction in Relational Data

  • Ben Taskar
  • Ming-fai Wong
  • Pieter Abbeel
  • Daphne Koller

Many real-world domains are relational in nature, consisting of a set of objects related to each other in complex ways. This paper focuses on predicting the existence and the type of links between entities in such domains. We apply the relational Markov network framework of Taskar et al. to define a joint probabilistic model over the entire link graph — entity attributes and links. The application of the RMN algorithm to this task requires the definition of probabilistic patterns over subgraph structures. We apply this method to two new relational datasets, one involving university webpages, and the other a social network. We show that the collective classification approach of RMNs, and the introduction of subgraph patterns over link labels, provide significant improvements in accuracy over flat classification, which attempts to predict each link in isolation.

NeurIPS Conference 2003 Conference Paper

Max-Margin Markov Networks

  • Ben Taskar
  • Carlos Guestrin
  • Daphne Koller

In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.

UAI Conference 2002 Conference Paper

Continuous Time Bayesian Networks

  • Uri Nodelman
  • Christian R. Shelton
  • Daphne Koller

In this paper we present a language for finite state continuous time Bayesian networks (CTBNs), which describe structured stochastic processes that evolve over continuous time. The state of the system is decomposed into a set of local variables whose values change over time. The dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. The model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. These distributions are determined by the variable's current value and the current values of its parents in the graph. More formally, each variable is modelled as a finite state continuous time Markov process whose transition intensities are functions of its parents. We present a probabilistic semantics for the language in terms of the generative model a CTBN defines over sequences of events. We list types of queries one might ask of a CTBN, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process.

UAI Conference 2002 Conference Paper

Discriminative Probabilistic Models for Relational Data

  • Ben Taskar
  • Pieter Abbeel
  • Daphne Koller

In many supervised learning tasks, the entities to be labeled are related to each other in complex ways and their labels are not independent. For example, in hypertext classification, the labels of linked pages are highly correlated. A standard approach is to classify each entity independently, ignoring the correlations between them. Recently, Probabilistic Relational Models, a relational version of Bayesian networks, were used to define a joint probabilistic model for a collection of related entities. In this paper, we present an alternative framework that builds on (conditional) Markov networks and addresses two limitations of the previous approach. First, undirected models do not impose the acyclicity constraint that hinders representation of many important relational dependencies in directed models. Second, undirected models are well suited for discriminative training, where we optimize the conditional likelihood of the labels given the features, which generally improves classification accuracy. We show how to train these models effectively, and how to use approximate probabilistic inference over the learned model for collective classification of multiple related entities. We provide experimental results on a webpage classification task, showing that accuracy can be significantly improved by modeling relational dependencies.

UAI Conference 2002 Conference Paper

Learning Hierarchical Object Maps of Non-Stationary Environments with Mobile Robots

  • Dragomir Anguelov
  • Rahul Biswas
  • Daphne Koller
  • Benson Limketkai
  • Sebastian Thrun

Building models, or maps, of robot environments is a highly active research area; however, most existing techniques construct unstructured maps and assume static environments. In this paper, we present an algorithm for learning object models of non-stationary objects found in office-type environments. Our algorithm exploits the fact that many objects found in office environments look alike (e.g., chairs, recycling bins). It does so through a two-level hierarchical representation, which links individual objects with generic shape templates of object classes. We derive an approximate EM algorithm for learning shape parameters at both levels of the hierarchy, using local occupancy grid maps for representing shape. Additionally, we develop a Bayesian model selection algorithm that enables the robot to estimate the total number of objects and object templates in the environment. Experimental results using a real robot equipped with a laser range finder indicate that our approach performs well at learning object-based maps of simple office environments. The approach outperforms a previously developed non-hierarchical algorithm that models objects but lacks class templates.

JMLR Journal 2002 Journal Article

Learning Probabilistic Models of Link Structure

  • Lisa Getoor
  • Nir Friedman
  • Daphne Koller
  • Benjamin Taskar

Most real-world data is heterogeneous and richly interconnected. Examples include the Web, hypertext, bibliometric data and social networks. In contrast, most statistical learning methods work with "flat" data representations, forcing us to convert our data into a form that loses much of the link structure. The recently introduced framework of probabilistic relational models (PRMs) embraces the object-relational nature of structured data by capturing probabilistic interactions between attributes of related entities. In this paper, we extend this framework by modeling interactions between the attributes and the link structure itself. An advantage of our approach is a unified generative model for both content and relational structure. We propose two mechanisms for representing a probabilistic distribution over link structures: reference uncertainty and existence uncertainty. We describe the appropriate conditions for using each model and present learning algorithms for each. We present experimental results showing that the learned models can be used to predict link structure and, moreover, the observed link structure can be used to provide better predictions for the attributes in the model.

UAI Conference 2002 Conference Paper

Monitoring a Complex Physical System using a Hybrid Dynamic Bayes Net

  • Uri Lerner
  • Brooks Moses
  • Maricia Scott
  • Sheila A. McIlraith
  • Daphne Koller

The Reverse Water Gas Shift system (RWGS) is a complex physical system designed to produce oxygen from the carbon dioxide atmosphere on Mars. If sent to Mars, it would operate without human supervision, thus requiring a reliable automated system for monitoring and control. The RWGS presents many challenges typical of real-world systems, including: noisy and biased sensors, nonlinear behavior, effects that are manifested over different time granularities, and unobservability of many important quantities. In this paper we model the RWGS using a hybrid (discrete/continuous) Dynamic Bayesian Network (DBN), where the state at each time slice contains 33 discrete and 184 continuous variables. We show how the system state can be tracked using probabilistic inference over the model. We discuss how to deal with the various challenges presented by the RWGS, providing a suite of techniques that are likely to be useful in a wide range of applications. In particular, we describe a general framework for dealing with nonlinear behavior using numerical integration techniques, extending the successful Unscented Filter. We also show how to use a fixed-point computation to deal with effects that develop at different time scales, specifically rapid changes occurring during slowly changing processes. We test our model using real data collected from the RWGS, demonstrating the feasibility of hybrid DBNs for monitoring complex real-world physical systems.

UAI Conference 2001 Conference Paper

Exact Inference in Networks with Discrete Children of Continuous Parents

  • Uri Lerner
  • Eran Segal
  • Daphne Koller

Many real life domains contain a mixture of discrete and continuous variables and can be modeled as hybrid Bayesian Networks. An important subclass of hybrid BNs are conditional linear Gaussian (CLG) networks, where the conditional distribution of the continuous variables given an assignment to the discrete variables is a multivariate Gaussian. Lauritzen's extension to the clique tree algorithm can be used for exact inference in CLG networks. However, many domains also include discrete variables that depend on continuous ones, and CLG networks do not allow such dependencies to be represented. No exact inference algorithm has been proposed for these enhanced CLG networks. In this paper, we generalize Lauritzen's algorithm, providing the first "exact" inference algorithm for augmented CLG networks --- networks where continuous nodes are conditional linear Gaussians but that also allow discrete children of continuous parents. Our algorithm is exact in the sense that it computes the exact distributions over the discrete nodes, and the exact first and second moments of the continuous ones, up to the accuracy obtained by numerical integration used within the algorithm. When the discrete children are modeled with softmax CPDs (as is the case in many real world domains) the approximation of the continuous distributions using the first two moments is particularly accurate. Our algorithm is simple to implement and often comparable in its complexity to Lauritzen's algorithm. We show empirically that it achieves substantially higher accuracy than previous approximate algorithms.

NeurIPS Conference 2001 Conference Paper

Multiagent Planning with Factored MDPs

  • Carlos Guestrin
  • Daphne Koller
  • Ronald Parr

We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.

NeurIPS Conference 2001 Conference Paper

Probabilistic Abstraction Hierarchies

  • Eran Segal
  • Daphne Koller
  • Dirk Ormoneit

Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.

JMLR Journal 2001 Journal Article

Support Vector Machine Active Learning with Applications to Text Classification

  • Simon Tong
  • Daphne Koller

Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some number of them. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
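The querying step can be illustrated with a small hedged sketch: assuming a trained linear SVM with weight vector `w` and bias `b` (toy values below, not from the paper), the learner requests the label of the pool instance lying closest to the current separating hyperplane.

```python
def query_next(pool, w, b):
    """Index of the unlabeled instance closest to the hyperplane w.x + b = 0."""
    def margin(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return min(range(len(pool)), key=lambda i: margin(pool[i]))

w, b = [1.0, -1.0], 0.0                       # toy linear SVM: x1 - x2 = 0
pool = [[3.0, 0.0], [0.5, 0.4], [-2.0, 2.0]]  # unlabeled pool
chosen = query_next(pool, w, b)               # the middle point is nearest the boundary
```

Intuitively, the point nearest the hyperplane is the one the classifier is least certain about, so labeling it approximately halves the remaining version space.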

NeurIPS Conference 2000 Conference Paper

Active Learning for Parameter Estimation in Bayesian Networks

  • Simon Tong
  • Daphne Koller

Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.

UAI Conference 2000 Conference Paper

Being Bayesian about Network Structure

  • Nir Friedman
  • Daphne Koller

In many domains, we are interested in analyzing the structure of the underlying distribution, e.g., whether one variable is a direct parent of the other. Bayesian model-selection attempts to find the MAP model and use its structure to answer these questions. However, when the amount of available data is modest, there might be many models that have non-negligible posterior. Thus, we want to compute the Bayesian posterior of a feature, i.e., the total posterior probability of all models that contain it. In this paper, we propose a new approach for this task. We first show how to efficiently compute a sum over the exponential number of networks that are consistent with a fixed ordering over network variables. This allows us to compute, for a given ordering, both the marginal probability of the data and the posterior of a feature. We then use this result as the basis for an algorithm that approximates the Bayesian posterior of a feature. Our approach uses a Markov Chain Monte Carlo (MCMC) method, but over orderings rather than over network structures. The space of orderings is much smaller and more regular than the space of structures, and has a smoother posterior "landscape". We present empirical results on synthetic and real-life datasets that compare our approach to full model averaging (when possible), to MCMC over network structures, and to a non-Bayesian bootstrap approach.

NeurIPS Conference 2000 Conference Paper

Discovering Hidden Variables: A Structure-Based Approach

  • Gal Elidan
  • Noam Lotner
  • Nir Friedman
  • Daphne Koller

A serious problem in learning probabilistic models is the presence of hidden variables. These variables are not observed, yet interact with several of the observed variables. As such, they induce seemingly complex dependencies among the latter. In recent years, much attention has been devoted to the development of algorithms for learning parameters, and in some cases structure, in the presence of hidden variables. In this paper, we address the related problem of detecting hidden variables that interact with the observed variables. This problem is of interest both for improving our understanding of the domain and as a preliminary step that guides the learning procedure towards promising models. A very natural approach is to search for "structural signatures" of hidden variables - substructures in the learned network that tend to suggest the presence of a hidden variable. We make this basic idea concrete, and show how to integrate it with structure-search algorithms. We evaluate this method on several synthetic and real-life datasets, and show that it performs surprisingly well.

UAI Conference 2000 Conference Paper

Policy Iteration for Factored MDPs

  • Daphne Koller
  • Ronald Parr

Many large MDPs can be represented compactly using a dynamic Bayesian network. Although the structure of the value function does not retain the structure of the process, recent work has shown that value functions in factored MDPs can often be approximated well using a decomposed value function: a linear combination of restricted basis functions, each of which refers only to a small subset of variables. An approximate value function for a particular policy can be computed using approximate dynamic programming, but this approach (and others) can only produce an approximation relative to a distance metric which is weighted by the stationary distribution of the current policy. This type of weighted projection is ill-suited to policy improvement. We present a new approach to value determination that uses a simple closed-form computation to directly compute a least-squares decomposed approximation to the value function for any weights. We then use this value determination algorithm as a subroutine in a policy iteration process. We show that, under reasonable restrictions, the policies induced by a factored value function are compactly represented, and can be manipulated efficiently in a policy iteration process. We also present a method for computing error bounds for decomposed value functions using a variable-elimination algorithm for function optimization. The complexity of all of our algorithms depends on the factorization of system dynamics and of the approximate value function.
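The closed-form weighted least-squares projection at the heart of this style of value determination can be illustrated on a toy instance. This sketch assumes just two basis functions and illustrative data; it shows the 2x2 normal equations, not the paper's full algorithm.

```python
def weighted_projection(basis, v, d):
    """Weighted least-squares fit of v by two basis functions:
    solve the 2x2 normal equations (A^T D A) w = A^T D v by Cramer's rule."""
    def dot(f, g):
        return sum(ds * fs * gs for ds, fs, gs in zip(d, f, g))
    a11 = dot(basis[0], basis[0])
    a12 = dot(basis[0], basis[1])
    a22 = dot(basis[1], basis[1])
    b1, b2 = dot(basis[0], v), dot(basis[1], v)
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det]

weights = weighted_projection(
    basis=[[1, 1, 1, 1], [0, 1, 2, 3]],  # a constant feature and a linear feature
    v=[1, 3, 5, 7],                      # exactly 1 + 2x, so the projection is exact
    d=[1.0, 1.0, 1.0, 1.0],              # uniform weighting over the four states
)
```

The point of the closed form is that the state weighting `d` can be chosen freely, rather than being tied to the stationary distribution of the current policy.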

UAI Conference 2000 Conference Paper

Utilities as Random Variables: Density Estimation and Structure Discovery

  • Urszula Chajewska
  • Daphne Koller

Decision theory does not traditionally include uncertainty over utility functions. We argue that a person's utility value for a given outcome can be treated as we treat other domain attributes: as a random variable with a density function over its possible values. We show that we can apply statistical density estimation techniques to learn such a density function from a database of partially elicited utility functions. In particular, we define a Bayesian learning framework for this problem, assuming the distribution over utilities is a mixture of Gaussians, where the mixture components represent statistically coherent subpopulations. We can also extend our techniques to the problem of discovering generalized additivity structure in the utility functions in the population. We define a Bayesian model selection criterion for utility function structure and a search procedure over structures. The factorization of the utilities in the learned model, and the generalization obtained from density estimation, allows us to provide robust estimates of utilities using a significantly smaller number of utility elicitation questions. We experiment with our technique on synthetic utility data and on a real database of utility functions in the domain of prenatal diagnosis.

UAI Conference 1999 Conference Paper

A General Algorithm for Approximate Inference and Its Application to Hybrid Bayes Nets

  • Daphne Koller
  • Uri Lerner
  • Dragomir Anguelov

The clique tree algorithm is the standard method for doing inference in Bayesian networks. It works by manipulating clique potentials - distributions over the variables in a clique. While this approach works well for many networks, it is limited by the need to maintain an exact representation of the clique potentials. This paper presents a new unified approach that combines approximate inference and the clique tree algorithm, thereby circumventing this limitation. Many known approximate inference algorithms can be viewed as instances of this approach. The algorithm essentially does clique tree propagation, using approximate inference to estimate the densities in each clique. In many settings, the computation of the approximate clique potential can be done easily using statistical importance sampling. Iterations are used to gradually improve the quality of the estimation.
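The importance-sampling step can be illustrated in isolation: to estimate an expectation under a target density p, sample from a tractable proposal q and reweight each sample by p(x)/q(x). The toy discrete distributions below are illustrative stand-ins for clique potentials, not the paper's setup.

```python
import random

def importance_estimate(f, p, q, sample_q, n=20000, seed=0):
    """Estimate E_p[f(x)] by sampling x ~ q and weighting by p(x)/q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_q(rng)
        total += (p(x) / q(x)) * f(x)
    return total / n

# Toy discrete target over {0, 1, 2}; proposal is uniform
p_tab = [0.2, 0.5, 0.3]
est = importance_estimate(
    f=lambda x: float(x),
    p=lambda x: p_tab[x],
    q=lambda x: 1.0 / 3.0,
    sample_q=lambda rng: rng.randrange(3),
)
# true E_p[x] = 0*0.2 + 1*0.5 + 2*0.3 = 1.1; est should be close
```

The estimator is unbiased whenever q puts mass everywhere p does; its variance, and hence the number of samples needed, depends on how well q matches p.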

IJCAI Conference 1999 Conference Paper

Computing factored value functions for policies in structured MDPs

  • Daphne Koller
  • Ronald Parr

Many large Markov decision processes (MDPs) can be represented compactly using a structured representation such as a dynamic Bayesian network. Unfortunately, the compact representation does not help standard MDP algorithms, because the value function for the MDP does not retain the structure of the process description. We argue that in many such MDPs, structure is approximately retained. That is, the value functions are nearly additive: closely approximated by a linear function over factors associated with small subsets of problem features. Based on this idea, we present a convergent, approximate value determination algorithm for structured MDPs. The algorithm maintains an additive value function, alternating dynamic programming steps with steps that project the result back into the restricted space of additive functions. We show that both the dynamic programming and the projection steps can be computed efficiently, despite the fact that the number of states is exponential in the number of state variables.

UAI Conference 1999 Conference Paper

Discovering the Hidden Structure of Complex Dynamic Systems

  • Xavier Boyen
  • Nir Friedman
  • Daphne Koller

Dynamic Bayesian networks provide a compact and natural representation for complex dynamic systems. However, in many cases, there is no expert available from whom a model can be elicited. Learning provides an alternative approach for constructing models of dynamic systems. In this paper, we address some of the crucial computational aspects of learning the structure of dynamic systems, particularly those where some relevant variables are partially observed or even entirely unknown. Our approach is based on the Structural Expectation Maximization (SEM) algorithm. The main computational cost of the SEM algorithm is the gathering of expected sufficient statistics. We propose a novel approximation scheme that allows these sufficient statistics to be computed efficiently. We also investigate the fundamental problem of discovering the existence of hidden variables without exhaustive and expensive search. Our approach is based on the observation that, in dynamic systems, ignoring a hidden variable typically results in a violation of the Markov property. Thus, our algorithm searches for such violations in the data, and introduces hidden variables to explain them. We provide empirical results showing that the algorithm is able to learn the dynamics of complex systems in a computationally tractable way.

IJCAI Conference 1999 Conference Paper

Efficient Reinforcement Learning in Factored MDPs

  • Michael Kearns
  • Daphne Koller

We present a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN). Our algorithm generalizes the recent algorithm of Kearns and Singh, and assumes that we are given both an algorithm for approximate planning, and the graphical structure (but not the parameters) of the DBN. Unlike the original algorithm, our new algorithm exploits the DBN structure to achieve a running time that scales polynomially in the number of parameters of the DBN, which may be exponentially smaller than the number of global states.

AAAI Conference 1999 Conference Paper

Exploiting the Architecture of Dynamic Systems

  • Xavier Boyen
  • Daphne Koller

Consider the problem of monitoring the state of a complex dynamic system, and predicting its future evolution. Exact algorithms for this task typically maintain a belief state, or distribution over the states at some point in time. Unfortunately, these algorithms fail when applied to complex processes such as those represented as dynamic Bayesian networks (DBNs), as the representation of the belief state grows exponentially with the size of the process. In (Boyen & Koller 1998), we recently proposed an efficient approximate tracking algorithm that maintains an approximate belief state that has a compact representation as a set of independent factors. Its performance depends on the error introduced by approximating a belief state of this process by a factored one. We informally argued that this error is low if the interaction between variables in the processes is “weak”. In this paper, we give formal information-theoretic definitions for notions such as weak interaction and sparse interaction of processes. We use these notions to analyze the conditions under which the error induced by this type of approximation is small. We demonstrate several cases where our results formally support intuitions about strength of interaction.

NeurIPS Conference 1999 Conference Paper

Policy Search via Density Estimation

  • Andrew Ng
  • Ronald Parr
  • Daphne Koller

We propose a new approach to the problem of searching a space of stochastic controllers for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). Following several other authors, our approach is based on searching in parameterized families of policies (for example, via gradient descent) to optimize solution quality. However, rather than trying to estimate the values and derivatives of a policy directly, we do so indirectly using estimates for the probability densities that the policy induces on states at the different points in time. This enables our algorithms to exploit the many techniques for efficient and robust approximate density propagation in stochastic systems. We show how our techniques can be applied both to deterministic propagation schemes (where the MDP's dynamics are given explicitly in compact form) and to stochastic propagation schemes (where we have access only to a generative model, or simulator, of the MDP). We present empirical results for both of these variants on complex problems.

NeurIPS Conference 1999 Conference Paper

Reinforcement Learning Using Approximate Belief States

  • Andres Rodriguez
  • Ronald Parr
  • Daphne Koller

The problem of developing good policies for partially observable Markov decision problems (POMDPs) remains one of the most challenging areas of research in stochastic planning. One line of research in this area involves the use of reinforcement learning with belief states, probability distributions over the underlying model states. This is a promising method for small problems, but its application is limited by the intractability of computing or representing a full belief state for large problems. Recent work shows that, in many settings, we can maintain an approximate belief state, which is fairly close to the true belief state. In particular, great success has been shown with approximate belief states that marginalize out correlations between state variables. In this paper, we investigate two methods of full belief state reinforcement learning and one novel method for reinforcement learning using factored approximate belief states. We compare the performance of these algorithms on several well-known problems from the literature. Our results demonstrate the importance of approximate belief state representations for large problems.

UAI Conference 1999 Conference Paper

SPOOK: A system for probabilistic object-oriented knowledge representation

  • Avi Pfeffer
  • Daphne Koller
  • Brian Milch
  • Ken T. Takusagawa

In previous work, we pointed out the limitations of standard Bayesian networks as a modeling framework for large, complex domains. We proposed a new, richly structured modeling language, Object-Oriented Bayesian Networks, that we argued would be able to deal with such domains. However, it turns out that OOBNs are not expressive enough to model many interesting aspects of complex domains: the existence of specific named objects, arbitrary relations between objects, and uncertainty over domain structure. These aspects are crucial in real-world domains such as battlefield awareness. In this paper, we present SPOOK, an implemented system that addresses these limitations. SPOOK implements a more expressive language that allows it to represent the battlespace domain naturally and compactly. We present a new inference algorithm that utilizes the model structure in a fundamental way, and show empirically that it achieves orders of magnitude speedup over existing approaches.

NeurIPS Conference 1998 Conference Paper

Approximate Learning of Dynamic Models

  • Xavier Boyen
  • Daphne Koller

Inference is a key component in learning probabilistic models from partially observable data. When learning temporal models, each of the many inference phases requires a traversal over an entire long data sequence; furthermore, the data structures manipulated are exponentially large, making this process computationally expensive. In [2], we describe an approximate inference algorithm for monitoring stochastic processes, and prove bounds on its approximation error. In this paper, we apply this algorithm as an approximate forward propagation step in an EM algorithm for learning temporal Bayesian networks. We provide a related approximation for the backward step, and prove error bounds for the combined algorithm. We show empirically that, for a real-life domain, EM using our inference algorithm is much faster than EM using exact inference, with almost no degradation in quality of the learned model. We extend our analysis to the online learning task, showing a bound on the error resulting from restricting attention to a small window of observations. We present an online EM learning algorithm for dynamic systems, and show that it learns much faster than standard offline EM.

AAAI Conference 1998 Conference Paper

Probabilistic Frame-Based Systems

  • Daphne Koller

Two of the most important threads of work in knowledge representation today are frame-based representation systems (FRS’s) and Bayesian networks (BNs). FRS’s provide an excellent representation for the organizational structure of large complex domains, but their applicability is limited because of their inability to deal with uncertainty and noise. BNs provide an intuitive and coherent probabilistic representation of our uncertainty, but are very limited in their ability to handle complex structured domains. In this paper, we provide a language that cleanly integrates these approaches, preserving the advantages of both. Our approach allows us to provide natural and compact definitions of probability models for a class, in a way that is local to the class frame. These models can be instantiated for any set of interconnected instances, resulting in a coherent probability distribution over the instance properties. Our language also allows us to represent important types of uncertainty that cannot be accommodated within the framework of traditional BNs: uncertainty over the set of entities present in our model, and uncertainty about the relationships between these entities. We provide an inference algorithm for our language via a reduction to inference in standard Bayesian networks. We describe an implemented system that allows most of the main frame systems in existence today to annotate their knowledge bases with probabilistic information, and to use that information in answering probabilistic queries.

UAI Conference 1998 Conference Paper

Tractable Inference for Complex Stochastic Processes

  • Xavier Boyen
  • Daphne Koller

The monitoring and control of any dynamic system depends crucially on the ability to reason about its current status and its future trajectory. In the case of a stochastic system, these tasks typically involve the use of a belief state---a probability distribution over the state of the process at a given point in time. Unfortunately, the state spaces of complex processes are very large, making an explicit representation of a belief state intractable. Even in dynamic Bayesian networks (DBNs), where the process itself can be represented compactly, the representation of the belief state is intractable. We investigate the idea of maintaining a compact approximation to the true belief state, and analyze the conditions under which the errors due to the approximations taken over the lifetime of the process do not accumulate to make our answers completely irrelevant. We show that the error in a belief state contracts exponentially as the process evolves. Thus, even with multiple approximations, the error in our process remains bounded indefinitely. We show how the additional structure of a DBN can be used to design our approximation scheme, improving its performance significantly. We demonstrate the applicability of our ideas in the context of a monitoring task, showing that orders of magnitude faster inference can be achieved with only a small degradation in accuracy.
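The projection onto a compact approximate belief state that this paper analyzes can be illustrated, in its simplest fully factored form, by collapsing a joint belief over two binary state variables to the product of its marginals; a minimal sketch with invented numbers:

```python
import numpy as np

# Joint belief over two binary state variables X, Y (illustrative numbers).
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])

# Project onto the factored family: keep only the marginals,
# discarding the correlation between X and Y.
px = joint.sum(axis=1)          # P(X)
py = joint.sum(axis=0)          # P(Y)
approx = np.outer(px, py)       # factored approximation P(X) * P(Y)

# KL divergence from the true belief measures the one-step projection error;
# the paper's contribution is showing such errors contract as the process evolves.
kl = np.sum(joint * np.log(joint / approx))
```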

AAAI Conference 1997 Conference Paper

Effective Bayesian Inference for Stochastic Programs

  • Daphne Koller

In this paper, we propose a stochastic version of a general purpose functional programming language as a method of modeling stochastic processes. The language contains random choices, conditional statements, structured values, defined functions, and recursion. By imagining an experiment in which the program is “run” and the random choices made by sampling, we can interpret a program in this language as encoding a probability distribution over a (potentially infinite) set of objects. We provide an exact algorithm for computing conditional probabilities of the form Pr(P(x) | Q(x)) where x is chosen randomly from this distribution. This algorithm terminates precisely when sampling x and computing P(x) and Q(x) terminates in all possible stochastic executions (under lazy evaluation semantics, in which only values needed to compute the output of the program are evaluated). We demonstrate the applicability of the language and the efficiency of the inference algorithm by encoding both Bayesian networks and stochastic context-free grammars in our language, and showing that our algorithm derives efficient inference algorithms for both. Our language easily supports interesting and useful extensions to these formalisms (e.g., recursive Bayesian networks), to which our inference algorithm will automatically apply.
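The exact conditional-probability computation described here can be mimicked for a toy stochastic "program" by enumerating every execution path; a sketch in which the two-coin program and the predicates P and Q are invented for illustration (this is plain enumeration, not the paper's lazy-evaluation algorithm):

```python
from itertools import product
from fractions import Fraction

# A tiny stochastic "program": two fair coin flips (illustrative).
choices = [(('H', Fraction(1, 2)), ('T', Fraction(1, 2)))] * 2

def conditional_prob(P, Q):
    """Exact Pr(P | Q) by enumerating every stochastic execution
    and summing path probabilities."""
    num = den = Fraction(0)
    for outcome in product(*choices):
        vals = [v for v, _ in outcome]
        prob = Fraction(1)
        for _, pr in outcome:
            prob *= pr
        if Q(vals):
            den += prob
            if P(vals):
                num += prob
    return num / den

ans = conditional_prob(P=lambda v: v[0] == 'H',
                       Q=lambda v: 'H' in v)   # Fraction(2, 3)
```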

IJCAI Conference 1997 Conference Paper

Learning probabilities for noisy first-order rules

  • Daphne Koller
  • Avi Pfeffer

First-order logic is the traditional basis for knowledge representation languages. However, its applicability to many real-world tasks is limited by its inability to represent uncertainty. Bayesian belief networks, on the other hand, are inadequate for complex KR tasks due to the limited expressivity of the underlying (propositional) language. The need to incorporate uncertainty into an expressive language has led to a resurgence of work on first-order probabilistic logic. This paper addresses one of the main objections to the incorporation of probabilities into the language: "Where do the numbers come from?" We present an approach that takes a knowledge base in an expressive rule-based first-order language, and learns the probabilistic parameters associated with those rules from data cases. Our approach, which is based on algorithms for learning in traditional Bayesian networks, can handle data cases where many of the relevant aspects of the situation are unobserved. It is also capable of utilizing a rich variety of data cases, including instances with varying causal structure, and even involving a varying number of individuals. These features allow the approach to be used for a wide range of tasks, such as learning genetic propagation models or learning first-order STRIPS planning operators with uncertain effects.

UAI Conference 1997 Conference Paper

Nonuniform Dynamic Discretization in Hybrid Networks

  • Alexander V. Kozlov
  • Daphne Koller

We consider probabilistic inference in general hybrid networks, which include continuous and discrete variables in an arbitrary topology. We reexamine the question of variable discretization in a hybrid network aiming at minimizing the information loss induced by the discretization. We show that a nonuniform partition across all variables as opposed to uniform partition of each variable separately reduces the size of the data structures needed to represent a continuous function. We also provide a simple but efficient procedure for nonuniform partition. To represent a nonuniform discretization in the computer memory, we introduce a new data structure, which we call a Binary Split Partition (BSP) tree. We show that BSP trees can be an exponential factor smaller than the data structures in the standard uniform discretization in multiple dimensions and show how the BSP trees can be used in the standard join tree algorithm. We show that the accuracy of the inference process can be significantly improved by adjusting discretization with evidence. We construct an iterative anytime algorithm that gradually improves the quality of the discretization and the accuracy of the answer on a query. We provide empirical evidence that the algorithm converges.

UAI Conference 1997 Conference Paper

Object-Oriented Bayesian Networks

  • Daphne Koller
  • Avi Pfeffer

Bayesian networks provide a modeling language and associated inference algorithm for stochastic domains. They have been successfully applied in a variety of medium-scale applications. However, when faced with a large complex domain, the task of modeling using Bayesian networks begins to resemble the task of programming using logical circuits. In this paper, we describe an object-oriented Bayesian network (OOBN) language, which allows complex domains to be described in terms of inter-related objects. We use a Bayesian network fragment to describe the probabilistic relations between the attributes of an object. These attributes can themselves be objects, providing a natural framework for encoding part-of hierarchies. Classes are used to provide a reusable probabilistic model which can be applied to multiple similar objects. Classes also support inheritance of model fragments from a class to a subclass, allowing the common aspects of related classes to be defined only once. Our language has clear declarative semantics: an OOBN can be interpreted as a stochastic functional program, so that it uniquely specifies a probabilistic model. We provide an inference algorithm for OOBNs, and show that much of the structural information encoded by an OOBN--particularly the encapsulation of variables within an object and the reuse of model fragments in different contexts--can also be used to speed up the inference process.

AAAI Conference 1997 Conference Paper

P-CLASSIC: A Tractable Probabilistic Description Logic

  • Daphne Koller

Knowledge representation languages invariably reflect a trade-off between expressivity and tractability. Evidence suggests that the compromise chosen by description logics is a particularly successful one. However, description logic (as for all variants of first-order logic) is severely limited in its ability to express uncertainty. In this paper, we present P-CLASSIC, a probabilistic version of the description logic CLASSIC. In addition to terminological knowledge, the language utilizes Bayesian networks to express uncertainty about the basic properties of an individual, the number of fillers for its roles, and the properties of these fillers. We provide a semantics for P-CLASSIC and an effective inference procedure for probabilistic subsumption: computing the probability that a random individual in class C is also in class D. The effectiveness of the algorithm relies on independence assumptions and on our ability to execute lifted inference: reasoning about similar individuals as a group rather than as separate ground terms. We show that the complexity of the inference algorithm is the best that can be hoped for in a language that combines description logic with Bayesian networks. In particular, if we restrict to Bayesian networks that support polynomial time inference, the complexity of our inference procedure is also polynomial time.

AIJ Journal 1997 Journal Article

Representations and solutions for game-theoretic problems

  • Daphne Koller
  • Avi Pfeffer

A system with multiple interacting agents (whether artificial or human) is often best analyzed using game-theoretic tools. Unfortunately, while the formal foundations are well-established, standard computational techniques for game-theoretic reasoning are inadequate for dealing with realistic games. This paper describes the Gala system, an implemented system that allows the specification and efficient solution of large imperfect information games. The system contains the first implementation of a recent algorithm, due to Koller, Megiddo and von Stengel. Experimental results from the system demonstrate that the algorithm is exponentially faster than the standard algorithm in practice, not just in theory. It therefore allows the solution of games that are orders of magnitude larger than were previously possible. The system also provides a new declarative language for compactly and naturally representing games by their rules. As a whole, the Gala system provides the capability for automated game-theoretic analysis of complex real-world situations.

UAI Conference 1997 Conference Paper

Update Rules for Parameter Estimation in Bayesian Networks

  • Eric Bauer
  • Daphne Koller
  • Yoram Singer

This paper re-examines the problem of parameter estimation in Bayesian networks with missing values and hidden variables from the perspective of recent work in on-line learning [Kivinen & Warmuth, 1994]. We provide a unified framework for parameter estimation that encompasses both on-line learning, where the model is continuously adapted to new data cases as they arrive, and the more traditional batch learning, where a pre-accumulated set of samples is used in a one-time model selection process. In the batch case, our framework encompasses both the gradient projection algorithm and the EM algorithm for Bayesian networks. The framework also leads to new on-line and batch parameter update schemes, including a parameterized version of EM. We provide both empirical and theoretical results indicating that parameterized EM allows faster convergence to the maximum likelihood parameters than does standard EM.
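As a concrete (if simplified) illustration of the update schemes compared here, the following sketches a single iteration of standard batch EM for a two-component Bernoulli mixture with hidden component labels; the data, initialization, and model are invented, and this is plain EM rather than the paper's parameterized variant:

```python
import numpy as np

# Mixture of two biased coins; which coin produced each flip is hidden
# (illustrative data and starting parameters).
data = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 1])   # observed flips
pi, p = 0.5, np.array([0.6, 0.3])                 # mixing weight, per-coin P(heads)

def em_step(pi, p, x):
    """One EM iteration: the E-step computes responsibilities,
    the M-step re-estimates the parameters from expected counts."""
    like = np.stack([np.where(x == 1, p[0], 1 - p[0]),
                     np.where(x == 1, p[1], 1 - p[1])])
    prior = np.array([pi, 1 - pi])[:, None]
    resp = prior * like
    resp /= resp.sum(axis=0)                       # E-step: P(coin | flip)
    pi_new = resp[0].mean()                        # M-step: expected counts
    p_new = (resp * x).sum(axis=1) / resp.sum(axis=1)
    return pi_new, p_new

pi, p = em_step(pi, p, data)
```

The parameterized version studied in the paper generalizes the M-step with a learning-rate-like exponent, recovering standard EM as a special case.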

UAI Conference 1996 Conference Paper

Context-Specific Independence in Bayesian Networks

  • Craig Boutilier
  • Nir Friedman
  • Moisés Goldszmidt
  • Daphne Koller

Bayesian networks provide a language for qualitatively representing the conditional independence properties of a distribution. This allows a natural and compact representation of the distribution, eases knowledge acquisition, and supports effective inference algorithms. It is well-known, however, that there are certain independencies that we cannot capture qualitatively within the Bayesian network structure: independencies that hold only in certain contexts, i.e., given a specific assignment of values to certain variables. In this paper, we propose a formal notion of context-specific independence (CSI), based on regularities in the conditional probability tables (CPTs) at a node. We present a technique, analogous to (and based on) d-separation, for determining when such independence holds in a given network. We then focus on a particular qualitative representation scheme---tree-structured CPTs---for capturing CSI. We suggest ways in which this representation can be used to support effective inference algorithms. In particular, we present a structural decomposition of the resulting network which can improve the performance of clustering algorithms, and an alternative algorithm based on cutset conditioning.
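The tree-structured CPTs the paper advocates can be sketched in a few lines; the variables and probabilities below are invented for illustration:

```python
# A CPT for P(Y=1 | A, B, C) exhibiting context-specific independence:
# in the context A=1, Y is independent of B and C (illustrative numbers).
def tree_cpt(a, b, c):
    """Tree-structured CPT: each root-to-leaf path is a context."""
    if a == 1:
        return 0.9          # context A=1: B and C are irrelevant
    if b == 1:
        return 0.7          # context A=0, B=1: C is irrelevant
    return 0.2 if c == 1 else 0.1

# The full table needs 2^3 = 8 entries; the tree stores only 4 leaves.
table = {(a, b, c): tree_cpt(a, b, c)
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}
distinct = len(set(table.values()))   # 4 distinct parameters
```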

AIJ Journal 1996 Journal Article

From statistical knowledge bases to degrees of belief

  • Fahiem Bacchus
  • Adam J. Grove
  • Joseph Y. Halpern
  • Daphne Koller

An intelligent agent will often be uncertain about various properties of its environment, and when acting in that environment it will frequently need to quantify its uncertainty. For example, if the agent wishes to employ the expected-utility paradigm of decision theory to guide its actions, it will need to assign degrees of belief (subjective probabilities) to various assertions. Of course, these degrees of belief should not be arbitrary, but rather should be based on the information available to the agent. This paper describes one approach for inducing degrees of belief from very rich knowledge bases, that can include information about particular individuals, statistical correlations, physical laws, and default rules. We call our approach the random-worlds method. The method is based on the principle of indifference: it treats all of the worlds the agent considers possible as being equally likely. It is able to integrate qualitative default reasoning with quantitative probabilistic reasoning by providing a language in which both types of information can be easily expressed. Our results show that a number of desiderata that arise in direct inference (reasoning from statistical information to conclusions about individuals) and default reasoning follow directly from the semantics of random worlds. For example, random worlds captures important patterns of reasoning such as specificity, inheritance, indifference to irrelevant information, and default assumptions of independence. Furthermore, the expressive power of the language used and the intuitive semantics of random worlds allow the method to deal with problems that are beyond the scope of many other nondeductive reasoning systems.

AAAI Conference 1996 Conference Paper

Irrelevance and Conditioning in First-Order Probabilistic Logic

  • Daphne Koller

First-order probabilistic logic is a powerful knowledge representation language. Unfortunately, deductive reasoning based on the standard semantics for this logic does not support certain desirable patterns of reasoning, such as indifference to irrelevant information or substitution of constants into universal rules. We show that both these patterns rely on a first-order version of probabilistic independence, and provide semantic conditions to capture them. The resulting insight enables us to understand the effect of conditioning on independence, and allows us to describe a procedure for determining when independencies are preserved under conditioning. We apply this procedure in the context of a sound and powerful inference algorithm for reasoning from statistical knowledge bases.

UAI Conference 1995 Conference Paper

Stochastic simulation algorithms for dynamic probabilistic networks

  • Keiji Kanazawa
  • Daphne Koller
  • Stuart Russell 0001

Stochastic simulation algorithms such as likelihood weighting often give fast, accurate approximations to posterior probabilities in probabilistic networks, and are the methods of choice for very large networks. Unfortunately, the special characteristics of dynamic probabilistic networks (DPNs), which are used to represent stochastic temporal processes, mean that standard simulation algorithms perform very poorly. In essence, the simulation trials diverge further and further from reality as the process is observed over time. In this paper, we present simulation algorithms that use the evidence observed at each time step to push the set of trials back towards reality. The first algorithm, "evidence reversal" (ER), restructures each time slice of the DPN so that the evidence nodes for the slice become ancestors of the state variables. The second algorithm, called "survival of the fittest" sampling (SOF), "repopulates" the set of trials at each time step using a stochastic reproduction rate weighted by the likelihood of the evidence according to each trial. We compare the performance of each algorithm with likelihood weighting on the original network, and also investigate the benefits of combining the ER and SOF methods. The ER/SOF combination appears to maintain bounded error independent of the number of time steps in the simulation.
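The SOF repopulation step is essentially likelihood-weighted resampling; a minimal sketch, in which the scalar state space, the evidence likelihood, and all numbers are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sof_resample(particles, obs_likelihood):
    """'Survival of the fittest' step: reproduce trials in proportion
    to how well each one explains the current evidence."""
    w = np.array([obs_likelihood(s) for s in particles])
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx]

# Illustrative: trials are scalar states centered at 0,
# while the evidence favors states near 1.0.
particles = list(rng.normal(0.0, 1.0, size=500))
survivors = sof_resample(particles,
                         obs_likelihood=lambda s: np.exp(-(s - 1.0) ** 2))
```

After the step, the trial population is pulled toward states consistent with the evidence, which is exactly the "push back towards reality" described above.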

FOCS Conference 1994 Conference Paper

(De)randomized Construction of Small Sample Spaces in NC

  • David R. Karger
  • Daphne Koller

D. Koller and N. Megiddo (1993) introduced the paradigm of constructing compact distributions that satisfy a given set of constraints, and showed how it can be used to efficiently derandomize certain types of algorithms. In this paper, we significantly extend their results in two ways. First, we show how their approach can be applied to deal with more general expectation constraints. More importantly, we provide the first parallel (NC) algorithm for constructing a compact distribution that satisfies the constraints up to a small relative error. This algorithm deals with constraints over any event that can be verified by finite automata, including all independence constraints as well as constraints over events relating to the parity or sum of a certain set of variables. Our construction relies on a new and independently interesting parallel algorithm for converting a solution to a linear system into an almost basic approximate solution to the same system. We use these techniques in the first NC derandomization of an algorithm for constructing large independent sets in d-uniform hypergraphs for arbitrary d. We also show how the linear programming perspective suggests new proof techniques which might be useful in general probabilistic analysis.

UAI Conference 1994 Conference Paper

Generating New Beliefs from Old

  • Fahiem Bacchus
  • Adam J. Grove
  • Joseph Y. Halpern
  • Daphne Koller

In previous work [BGHK92, BGHK93], we have studied the random-worlds approach -- a particular (and quite powerful) method for generating degrees of belief (i.e., subjective probabilities) from a knowledge base consisting of objective (first-order, statistical, and default) information. But allowing a knowledge base to contain only objective information is sometimes limiting. We occasionally wish to include information about degrees of belief in the knowledge base as well, because there are contexts in which old beliefs represent important information that should influence new beliefs. In this paper, we describe three quite general techniques for extending a method that generates degrees of belief from objective information to one that can make use of degrees of belief as well. All of our techniques are based on well-known approaches, such as cross-entropy. We discuss general connections between the techniques and in particular show that, although conceptually and technically quite different, all of the techniques give the same answer when applied to the random-worlds method.

IJCAI Conference 1993 Conference Paper

Statistical Foundations for Default Reasoning

  • Fahiem Bacchus
  • Adam J. Grove
  • Joseph Y. Halpern
  • Daphne Koller

We describe a new approach to default reasoning, based on a principle of indifference among possible worlds. We interpret default rules as extreme statistical statements, thus obtaining a knowledge base KB comprised of statistical and first-order statements. We then assign equal probability to all worlds consistent with KB in order to assign a degree of belief to a statement φ. The degree of belief can be used to decide whether to defeasibly conclude φ. Various natural patterns of reasoning, such as a preference for more specific defaults, indifference to irrelevant information, and the ability to combine independent pieces of evidence, turn out to follow naturally from this technique. Furthermore, our approach is not restricted to default reasoning; it supports a spectrum of reasoning, from quantitative to qualitative. It is also related to other systems for default reasoning. In particular, we show that the work of [Goldszmidt et al., 1990], which applies maximum entropy ideas to ε-semantics, can be embedded in our framework.
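The world-counting idea behind this approach can be made concrete in a toy domain; the sketch below uses an invented KB (exactly 9 of 10 birds fly) and computes the degree of belief in fly(bird_0) by enumerating the consistent worlds:

```python
from itertools import product
from fractions import Fraction

# Toy random-worlds computation (domain size and KB are illustrative).
# KB: there are 10 birds, and exactly 9 of them fly.
N = 10
worlds = [w for w in product((0, 1), repeat=N) if sum(w) == 9]

# All consistent worlds are equally likely; the degree of belief in
# fly(bird_0) is the fraction of those worlds in which bird 0 flies.
belief = Fraction(sum(w[0] for w in worlds), len(worlds))   # Fraction(9, 10)
```

The answer matches the statistical statement, which is the kind of direct-inference behavior the paper derives from indifference rather than postulating.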

AAAI Conference 1992 Conference Paper

From Statistics to Beliefs

  • Fahiem Bacchus
  • Daphne Koller

An intelligent agent uses known facts, including statistical knowledge, to assign degrees of belief to assertions it is uncertain about. We investigate three principled techniques for doing this. All three are applications of the principle of indifference, because they assign equal degree of belief to all basic “situations” consistent with the knowledge base. They differ because there are competing intuitions about what the basic situations are. Various natural patterns of reasoning, such as the preference for the most specific statistical data available, turn out to follow from some or all of the techniques. This is an improvement over earlier theories, such as work on direct inference and reference classes, which arbitrarily postulate these patterns without offering any deeper explanations or guarantees of consistency. The three methods we investigate have surprising characterisations: there are connections to the principle of maximum entropy, a principle of maximal independence, and a “center of mass” principle. There are also unexpected connections between the three, that help us understand why the specific language chosen (for the knowledge base) is much more critical in inductive reasoning of the sort we consider than it is in traditional deductive reasoning.

FOCS Conference 1991 Conference Paper

Finding the Hidden Path: Time Bounds for All-Pairs Shortest Paths

  • David R. Karger
  • Daphne Koller
  • Steven J. Phillips

The all-pairs shortest paths problem in weighted graphs is investigated. An algorithm called the hidden paths algorithm, which finds these paths in time O(m*n + n^2 log n), where m* is the number of edges participating in shortest paths, is presented. It is argued that m* is likely to be small in practice, since m* = O(n log n) with high probability for many probability distributions on edge weights. An Omega(mn) lower bound on the running time of any path-comparison-based algorithm for the all-pairs shortest paths problem is proved.

UAI Conference 1991 Conference Paper

Probability Estimation in Face of Irrelevant Information

  • Adam J. Grove
  • Daphne Koller

In this paper, we consider one aspect of the problem of applying decision theory to the design of agents that learn how to make decisions under uncertainty. This aspect concerns how an agent can estimate probabilities for the possible states of the world, given that it only makes limited observations before committing to a decision. We show that the naive application of statistical tools can be improved upon if the agent can determine which of his observations are truly relevant to the estimation problem at hand. We give a framework in which such determinations can be made, and define an estimation procedure to use them. Our framework also suggests several extensions, which show how additional knowledge can be used to improve tile estimation procedure still further.

FOCS Conference 1987 Conference Paper

Achievable Cases in an Asynchronous Environment (Extended Abstract)

  • Hagit Attiya
  • Amotz Bar-Noy
  • Danny Dolev
  • Daphne Koller
  • David Peleg
  • Rüdiger Reischuk

The paper deals with achievability of fault tolerant goals in a completely asynchronous distributed system. Fischer, Lynch, and Paterson [FLP] proved that in such a system "nontrivial agreement" cannot be achieved even in the (possible) presence of a single "benign" fault. In contrast, we exhibit two pairs of goals that are achievable even in the presence of up to t < n/2 faulty processors, contradicting the widely held assumption that no nontrivial goals are attainable in such a system. The first pair deals with renaming processors so as to reduce the size of the initial name space. When only uniqueness is required of the new names, we present a lower bound of n + 1 on the size of the new name space, and a renaming algorithm which establishes an upper bound of n + t. In case the new names are required also to preserve the original order, a tight bound of 2^t(n - t + 1) - 1 is obtained. The second pair of goals deals with the multi-slot critical section problem. We present algorithms for controlled access to a critical section. As for the number of slots required, a tight bound of t + 1 is proved in case the slots are identical. In the case of distinct slots the upper bound is 2t + 1.