Arrow Research search

Author name cluster

Jeff Bilmes

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers
1 author row

Possible papers

34

TMLR Journal 2025 Journal Article

Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective

  • Sahil Verma
  • Gantavya Bhatt
  • Avi Schwarzschild
  • Soumye Singhal
  • Arnav Mohanty Das
  • Chirag Shah
  • John P Dickerson
  • Pin-Yu Chen

Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in multimodal models, such as CleanCLIP, which is the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives, which lead to higher zero-shot classification performance, correlate with backdoor behaviors that are harder to remove. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP, even with extensive hyperparameter tuning, is ineffective in poison removal when stronger pre-training objectives are used. Our findings underscore critical considerations for ML practitioners who train models using large-scale web-curated data and are concerned about potential backdoor threats.

TMLR Journal 2025 Journal Article

How Many Images Does It Take? Estimating Imitation Thresholds in Text-to-Image Models

  • Sahil Verma
  • Royi Rassin
  • Arnav Mohanty Das
  • Gantavya Bhatt
  • Preethi Seshadri
  • Chirag Shah
  • Jeff Bilmes
  • Hannaneh Hajishirzi

Text-to-image models are trained using large datasets of image-text pairs collected from the internet. These datasets often include copyrighted and private images. Training models on such datasets enables them to generate images that might violate copyright laws and individual privacy. This phenomenon is termed imitation – generation of images with content that has recognizable similarity to its training images. In this work we estimate the point at which a model was trained on enough instances of a concept to be able to imitate it – the imitation threshold. We posit this question as a new problem and propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training these models from scratch. We experiment with two domains – human faces and art styles, and evaluate four text-to-image models that were trained on three pretraining datasets. We estimate the imitation threshold of these models to be in the range of 200-700 images, depending on the domain and the model. The imitation threshold provides an empirical basis for copyright violation claims and acts as a guiding principle for text-to-image model developers that aim to comply with copyright and privacy laws.

TMLR Journal 2023 Journal Article

Accelerating Batch Active Learning Using Continual Learning Techniques

  • Arnav Mohanty Das
  • Gantavya Bhatt
  • Megh Manoj Bhalerao
  • Vianne R. Gao
  • Rui Yang
  • Jeff Bilmes

A major problem with Active Learning (AL) is high training costs, since models are typically retrained from scratch after every query round. We start by demonstrating that standard AL on neural networks with warm starting fails both to accelerate training and to avoid catastrophic forgetting when using fine-tuning over AL query rounds. We then develop a new class of techniques that circumvents this problem by biasing further training towards previously labeled sets. We accomplish this by employing existing, and developing novel, replay-based Continual Learning (CL) algorithms that are effective at quickly learning the new without forgetting the old, especially when data comes from an evolving distribution. We call this paradigm "Continual Active Learning" (CAL). We show CAL achieves significant speedups using a plethora of replay schemes that use model distillation and that select diverse/uncertain points from the history. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with different neural architectures and dataset sizes. CAL consistently provides a ~3x reduction in training time, while retaining performance and out-of-distribution robustness, showing its wide applicability.
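A minimal sketch of the replay idea behind CAL, assuming uncertainty sampling for queries and a uniform replay buffer drawn from the labeled history; the paper's distillation-based and diversity-based replay schemes are not reproduced here, and the dataset, model, and batch sizes are placeholders.

```python
# Minimal sketch of replay-based continual active learning (CAL-style).
# Assumptions: uncertainty sampling for queries, uniform replay from the
# labeled history; the paper's distillation/diversity schemes are omitted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=100, replace=False))   # seed set
unlabeled = list(set(range(len(X))) - set(labeled))

model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(X[labeled], y[labeled], classes=np.unique(y))

for round_ in range(5):
    # Query: pick the most uncertain unlabeled points (maximum entropy).
    probs = model.predict_proba(X[unlabeled])
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    query = [unlabeled[i] for i in np.argsort(-entropy)[:100]]

    # Replay: mix the new queries with a sample of previously labeled data,
    # then fine-tune (warm start) instead of retraining from scratch.
    replay = rng.choice(labeled, size=min(100, len(labeled)), replace=False)
    batch = np.concatenate([query, replay])
    for _ in range(3):  # a few passes of warm-started fine-tuning
        model.partial_fit(X[batch], y[batch])

    labeled.extend(query)
    queried = set(query)
    unlabeled = [i for i in unlabeled if i not in queried]
```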

AAAI Conference 2022 Conference Paper

PRISM: A Rich Class of Parameterized Submodular Information Measures for Guided Data Subset Selection

  • Suraj Kothawade
  • Vishal Kaushal
  • Ganesh Ramakrishnan
  • Jeff Bilmes
  • Rishabh Iyer

With ever-increasing dataset sizes, subset selection techniques are becoming increasingly important for a plethora of tasks. It is often necessary to guide the subset selection to achieve certain desiderata, which include focusing on or targeting certain data points, while avoiding others. Examples of such problems include: i) targeted learning, where the goal is to find subsets with rare classes or rare attributes on which the model is underperforming, and ii) guided summarization, where data (e.g., image collections, text, documents, or video) is summarized for quicker human consumption with specific additional user intent. Motivated by such applications, we present PRISM, a rich class of PaRameterIzed Submodular information Measures. Through novel functions and their parameterizations, PRISM offers a variety of modeling capabilities that enable a trade-off between desired qualities of a subset like diversity or representation and similarity/dissimilarity with a set of data points. We demonstrate how PRISM can be applied to the two real-world problems mentioned above, which require guided subset selection. In doing so, we show that PRISM interestingly generalizes some past work, therein reinforcing its broad utility. Through extensive experiments on diverse datasets, we demonstrate the superiority of PRISM over the state-of-the-art in targeted learning and in guided image-collection summarization. PRISM is available as part of the SUBMODLIB (https://github.com/decile-team/submodlib) and TRUST (https://github.com/decile-team/trust) toolkits.
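As an illustration of guided subset selection of this kind, the sketch below greedily maximizes a facility-location-style score that blends representation of the full pool with similarity to a small query set via a trade-off parameter eta. This is a generic stand-in, not PRISM's actual parameterized measures; the Gaussian data and exponential-kernel similarities are arbitrary choices.

```python
# Illustrative greedy guided subset selection: a submodular score that trades
# off representation of the ground set against similarity to a query set via
# a parameter eta. Generic sketch, not PRISM's actual parameterization.
import numpy as np

def score(selected, sims, query_sims, eta=1.0):
    """Facility-location-style representation plus eta-weighted query relevance."""
    if not selected:
        return 0.0
    cover = sims[:, selected].max(axis=1).sum()             # represent all points
    relevance = query_sims[:, selected].max(axis=1).sum()   # cover the query set
    return cover + eta * relevance

def greedy_select(sims, query_sims, k, eta=1.0):
    selected, current = [], 0.0
    for _ in range(k):
        gains = [(score(selected + [j], sims, query_sims, eta) - current, j)
                 for j in range(sims.shape[1]) if j not in selected]
        best_gain, best_j = max(gains)
        selected.append(best_j)
        current += best_gain
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # candidate pool
Q = rng.normal(size=(10, 16)) + 2.0     # query/target examples (e.g. a rare class)
sims = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1))        # pool-pool
query_sims = np.exp(-np.linalg.norm(Q[:, None] - X[None, :], axis=-1))  # query-pool
print(greedy_select(sims, query_sims, k=5, eta=5.0))
```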

AAAI Conference 2021 Conference Paper

Submodular Span, with Applications to Conditional Data Summarization

  • Lilly Kumari
  • Jeff Bilmes

As an extension to the matroid span problem, we propose the submodular span problem that involves finding a large set of elements with small gain relative to a given query set. We then propose a two-stage Submodular Span Summarization (S3) framework to achieve a form of conditional or query-focused data summarization. The first stage encourages the summary to be relevant to a given query set, and the second stage encourages the final summary to be diverse, thus achieving two important necessities for a good query-focused summary. Unlike previous methods, our framework uses only a single submodular function defined over both data and query. We analyze theoretical properties in the context of both matroids and polymatroids that elucidate when our methods should work well. We find that a scalable approximation algorithm to the polymatroid submodular span problem has good theoretical and empirical properties. We provide empirical and qualitative results on three real-world tasks: conditional multi-document summarization on the DUC 2005-2007 datasets, conditional video summarization on the UT-Egocentric dataset, and conditional image corpus summarization on the ImageNet dataset. We use deep neural networks, specifically a BERT model for text, AlexNet for video frames, and Bi-directional Generative Adversarial Networks (BiGAN) for ImageNet images to help instantiate the submodular functions. The result is a minimally supervised form of conditional summarization that matches or improves over the previous state-of-the-art.

NeurIPS Conference 2019 Conference Paper

On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks

  • Sunil Thulasidasan
  • Gopinath Chennupati
  • Jeff Bilmes
  • Tanmoy Bhattacharya
  • Sarah Michalak

Mixup (Zhang et al., 2017) is a recently proposed method for training deep neural networks where additional samples are generated during training by convexly combining random pairs of images and their associated labels. While simple to implement, it has been shown to be a surprisingly effective method of data augmentation for image classification; DNNs trained with mixup show noticeable gains in classification performance on a number of image classification benchmarks. In this work, we discuss a hitherto untouched aspect of mixup training -- the calibration and predictive uncertainty of models trained with mixup. We find that DNNs trained with mixup are significantly better calibrated -- i.e., the predicted softmax scores are much better indicators of the actual likelihood of a correct prediction -- than DNNs trained in the regular fashion. We conduct experiments on a number of image classification architectures and datasets -- including large-scale datasets like ImageNet -- and find this to be the case. Additionally, we find that merely mixing features does not result in the same calibration benefit and that the label smoothing in mixup training plays a significant role in improving calibration. Finally, we also observe that mixup-trained DNNs are less prone to over-confident predictions on out-of-distribution and random-noise data. We conclude that the typical overconfidence seen in neural networks, even on in-distribution data, is likely a consequence of training with hard labels, suggesting that mixup training be employed for classification tasks where predictive uncertainty is a significant concern.
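A minimal sketch of the mixup batch construction the abstract describes: random pairs of inputs and their one-hot labels are convexly combined with a Beta-distributed coefficient, and training proceeds on the resulting soft labels. The value alpha=0.2 and the batch shapes are illustrative choices, not settings taken from the paper.

```python
# Minimal sketch of mixup batch construction: convexly combine random pairs of
# inputs and their one-hot labels with a Beta(alpha, alpha) coefficient.
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng(0)):
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]   # soft labels
    return x_mix, y_mix

x = np.random.rand(32, 3, 32, 32)           # e.g. a CIFAR-sized batch
labels = np.random.randint(0, 10, size=32)
y_onehot = np.eye(10)[labels]
x_mix, y_mix = mixup_batch(x, y_onehot)
```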

NeurIPS Conference 2018 Conference Paper

Diverse Ensemble Evolution: Curriculum Data-Model Marriage

  • Tianyi Zhou
  • Shengjie Wang
  • Jeff Bilmes

We study a new method (``Diverse Ensemble Evolution (DivE$^2$)'') to train an ensemble of machine learning models that assigns data to models at each training epoch based on each model's current expertise and an intra- and inter-model diversity reward. DivE$^2$ schedules, over the course of training epochs, the relative importance of these characteristics; it starts by selecting easy samples for each model, and then gradually adjusts towards the models having specialized and complementary expertise on subsets of the training data, thereby encouraging high accuracy of the ensemble. We utilize an intra-model diversity term on data assigned to each model, and an inter-model diversity term on data assigned to pairs of models, to penalize both within-model and cross-model redundancy. We formulate the data-model marriage problem as a generalized bipartite matching, represented as submodular maximization subject to two matroid constraints. DivE$^2$ solves a sequence of continuous-combinatorial optimizations with slowly varying objectives and constraints. The combinatorial part handles the data-model marriage while the continuous part updates model parameters based on the assignments. In experiments, DivE$^2$ outperforms other ensemble training methods under a variety of model aggregation techniques, while also maintaining competitive efficiency.

NeurIPS Conference 2018 Conference Paper

Submodular Maximization via Gradient Ascent: The Case of Deep Submodular Functions

  • Wenruo Bai
  • William Stafford Noble
  • Jeff Bilmes

We study the problem of maximizing deep submodular functions (DSFs) subject to a matroid constraint. DSFs are an expressive class of submodular functions that include, as strict subfamilies, the facility location, weighted coverage, and sums of concave composed with modular functions. We use a strategy similar to the continuous greedy approach, but we show that the multilinear extension of any DSF has a natural and computationally attainable concave relaxation that we can optimize using gradient ascent. Our results show a guarantee of $\max_{0<\delta<1}(1-\epsilon-\delta-e^{-\delta^2\Omega(k)})$ with a running time of $O(n^2/\epsilon^2)$ plus time for pipage rounding to recover a discrete solution, where $k$ is the rank of the matroid constraint. This bound is often better than the standard $1-1/e$ guarantee of the continuous greedy algorithm, but runs much faster. Our bound also holds even for fully curved ($c=1$) functions, where the guarantee of $1-c/e$ degenerates to $1-1/e$, with $c$ the curvature of $f$. We perform computational experiments that support our theoretical results.

NeurIPS Conference 2016 Conference Paper

Deep Submodular Functions: Definitions and Learning

  • Brian Dolhansky
  • Jeff Bilmes

We propose and study a new class of submodular functions called deep submodular functions (DSFs). We define DSFs and situate them within the broader context of classes of submodular functions in relationship both to various matroid ranks and sums of concave composed with modular functions (SCMs). Notably, we find that DSFs constitute a strictly broader class than SCMs, thus motivating their use, but that they do not comprise all submodular functions. Interestingly, some DSFs can be seen as special cases of certain deep neural networks (DNNs), hence the name. Finally, we provide a method to learn DSFs in a max-margin framework, and offer preliminary results applying this both to synthetic and real-world data instances.
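The sketch below evaluates a small two-layer function of the layered concave-of-modular form the abstract describes, and spot-checks diminishing returns on one chain of sets; the weights, layer sizes, and square-root activations are arbitrary illustrations rather than anything taken from the paper.

```python
# Hedged sketch of evaluating a small two-layer deep submodular function:
# nonnegative weights with nondecreasing concave activations (sqrt) at each
# layer, built on "concave composed with modular" units at the bottom.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_feats, n_meta = 50, 8, 3
W1 = rng.random((n_feats, n_items))   # item -> feature weights (nonnegative)
W2 = rng.random((n_meta, n_feats))    # feature -> meta-feature weights (nonnegative)
w3 = rng.random(n_meta)               # final mixture weights (nonnegative)

def dsf(A):
    """f(A) = w3 . sqrt(W2 @ sqrt(W1[:, A].sum(axis=1)))"""
    if not A:
        return 0.0
    layer1 = np.sqrt(W1[:, A].sum(axis=1))   # concave of a modular function
    layer2 = np.sqrt(W2 @ layer1)            # concave of a nonnegative combination
    return float(w3 @ layer2)

# Diminishing-returns spot check: the gain of adding item v to a smaller set
# should be at least the gain of adding it to a superset.
v = 7
print(dsf([1, 2, v]) - dsf([1, 2]), ">=", dsf([1, 2, 3, 4, v]) - dsf([1, 2, 3, 4]))
```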

NeurIPS Conference 2015 Conference Paper

Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications

  • Kai Wei
  • Rishabh Iyer
  • Shengjie Wang
  • Wenruo Bai
  • Jeff Bilmes

We investigate two novel mixed robust/average-case submodular data partitioning problems that we collectively call Submodular Partitioning. These problems generalize purely robust instances of the problem, namely max-min submodular fair allocation (SFA) and min-max submodular load balancing (SLB), as well as average-case instances, namely the submodular welfare problem (SWP) and submodular multiway partition (SMP). While the robust versions have been studied in the theory community, existing work has focused on tight approximation guarantees, and the resultant algorithms are not generally scalable to large real-world applications. This contrasts with the average-case instances, where most of the algorithms are scalable. In the present paper, we bridge this gap by proposing several new algorithms (including greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large datasets but also achieve theoretical approximation guarantees comparable to the state-of-the-art. We moreover provide new scalable algorithms that apply to additive combinations of the robust and average-case objectives. We show that these problems have many applications in machine learning (ML), including data partitioning and load balancing for distributed ML, data clustering, and image segmentation. We empirically demonstrate the efficacy of our algorithms on real-world problems involving data partitioning for distributed optimization (of convex and deep neural network objectives), and also purely unsupervised image segmentation.
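For the robust (max-min) side, one simple greedy heuristic in the spirit of the abstract is sketched below: repeatedly give the currently worst-off block the remaining item with the largest marginal gain for it, with facility location standing in for the block objective. This is an illustrative heuristic, not the paper's algorithms.

```python
# Hedged sketch of a greedy heuristic for max-min (robust) submodular
# partitioning: assign each item, in turn, to the block with the smallest
# current value, picking the item with the largest marginal gain for it.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
sims = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1))

def fl(A):
    """Facility-location value of a block (list of item indices)."""
    return sims[:, A].max(axis=1).sum() if A else 0.0

m = 4                                   # number of blocks
blocks = [[] for _ in range(m)]
remaining = set(range(len(X)))
while remaining:
    worst = min(range(m), key=lambda b: fl(blocks[b]))       # worst-off block
    base = fl(blocks[worst])
    item = max(remaining, key=lambda v: fl(blocks[worst] + [v]) - base)
    blocks[worst].append(item)
    remaining.remove(item)

print([round(fl(b), 2) for b in blocks])   # block values should be fairly balanced
```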

NeurIPS Conference 2015 Conference Paper

Submodular Hamming Metrics

  • Jennifer Gillenwater
  • Rishabh Iyer
  • Bethany Lusch
  • Rahul Kidambi
  • Jeff Bilmes

We show that there is a largely unexplored class of functions (positive polymatroids) that can define proper discrete metrics over pairs of binary vectors and that are fairly tractable to optimize over. By exploiting submodularity, we are able to give hardness results and approximation algorithms for optimizing over such metrics. Additionally, we demonstrate empirically the effectiveness of these metrics and associated algorithms on both a metric minimization task (a form of clustering) and also a metric maximization task (generating diverse k-best lists).

NeurIPS Conference 2014 Conference Paper

Divide-and-Conquer Learning by Anchoring a Conical Hull

  • Tianyi Zhou
  • Jeff Bilmes
  • Carlos Guestrin

We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ ``anchors'' lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme ``DCA'' that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-dimensional random hyperplanes, each of which can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF, and subspace clustering, and then show its competitive performance and scalability over other methods on rich datasets.

NeurIPS Conference 2014 Conference Paper

Learning Mixtures of Submodular Functions for Image Collection Summarization

  • Sebastian Tschiatschek
  • Rishabh Iyer
  • Haochen Wei
  • Jeff Bilmes

We address the problem of image collection summarization by learning mixtures of submodular functions. We argue that submodularity is very natural to this problem, and we show that a number of previously used scoring functions are submodular — a property not explicitly mentioned in these publications. We provide classes of submodular functions capturing the necessary properties of summaries, namely coverage, likelihood, and diversity. To learn mixtures of these submodular functions as scoring functions, we formulate summarization as a supervised learning problem using large-margin structured prediction. Furthermore, we introduce a novel evaluation metric, which we call V-ROUGE, for automatic summary scoring. While a similar metric called ROUGE has been successfully applied to document summarization [14], no such metric was known for quantifying the quality of image collection summaries. We provide a new dataset consisting of 14 real-world image collections along with many human-generated ground truth summaries collected using Mechanical Turk. We also extensively compare our method with previously explored methods for this problem and show that our learning approach outperforms all competitors on this new dataset. This paper provides, to our knowledge, the first systematic approach for quantifying the problem of image collection summarization, along with a new dataset of image collections and human summaries.

NeurIPS Conference 2013 Conference Paper

Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

  • Rishabh Iyer
  • Stefanie Jegelka
  • Jeff Bilmes

We investigate three related and important problems connected to machine learning, namely approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [26]), and constrained minimization of submodular functions. In all three problems, we provide improved bounds which depend on the “curvature” of a submodular function and improve on the previously known best results for these problems [9, 3, 7, 25] when the function is not too curved – a property which is true of many real-world submodular functions. In the former two problems, we obtain these bounds through a generic black-box transformation (which can potentially work for any algorithm), while in the case of submodular minimization, we propose a framework of algorithms which depend on choosing an appropriate surrogate for the submodular function. In all these cases, we provide almost matching lower bounds. While improved curvature-dependent bounds were shown for monotone submodular maximization [4, 27], the existence of similar improved bounds for the aforementioned problems has been open. We resolve this question in this paper by showing that the same notion of curvature provides these improved results. Empirical experiments add further support to our claims.
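For reference, the snippet below computes one standard notion of total curvature consistent with this line of work, kappa_f = 1 - min_j f(j | V \ {j}) / f({j}), for a toy monotone submodular function (the square root of a modular sum). The function and weights are illustrative and not taken from the paper.

```python
# Worked illustration of total curvature for a monotone submodular f with
# f({j}) > 0: kappa_f = 1 - min_j f(j | V \ {j}) / f({j}).
# The toy function below is sqrt of a weighted sum (concave of modular).
import numpy as np

weights = np.array([4.0, 1.0, 9.0, 2.0, 6.0])     # modular weights m_j
V = list(range(len(weights)))

def f(A):
    return float(np.sqrt(weights[list(A)].sum())) if A else 0.0

def curvature(f, V):
    ratios = []
    for j in V:
        rest = [v for v in V if v != j]
        gain_last = f(V) - f(rest)     # marginal gain of j given everything else
        gain_first = f([j])            # marginal gain of j given nothing
        ratios.append(gain_last / gain_first)
    return 1.0 - min(ratios)

print(round(curvature(f, V), 3))   # 0 means modular; values near 1 mean highly curved
```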

NeurIPS Conference 2013 Conference Paper

Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints

  • Rishabh Iyer
  • Jeff Bilmes

We investigate two new optimization problems — minimizing a submodular function subject to a submodular lower bound constraint (submodular cover) and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [9, 23], which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both these problems are closely related, and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems, thus showing that our approximation factors are tight up to log-factors. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms.

NeurIPS Conference 2012 Conference Paper

Submodular-Bregman and the Lovász-Bregman Divergences with Applications

  • Rishabh Iyer
  • Jeff Bilmes

We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient, unlike other order-based distance measures.

TIST Journal 2011 Journal Article

Inferring colocation and conversation networks from privacy-sensitive audio with implications for computational social science

  • Danny Wyatt
  • Tanzeem Choudhury
  • Jeff Bilmes
  • James A. Kitts

New technologies have made it possible to collect information about social networks as they are acted and observed in the wild, instead of as they are reported in retrospective surveys. These technologies offer opportunities to address many new research questions: How can meaningful information about social interaction be extracted from automatically recorded raw data on human behavior? What can we learn about social networks from such fine-grained behavioral data? And how can all of this be done while protecting privacy? With the goal of addressing these questions, this article presents new methods for inferring colocation and conversation networks from privacy-sensitive audio. These methods are applied in a study of face-to-face interactions among 24 students in a graduate school cohort during an academic year. The resulting analysis shows that networks derived from colocation and conversation inferences are quite different. This distinction can inform future research in computational social science, especially work that only measures colocation or employs colocation data as a proxy for conversation networks.

NeurIPS Conference 2011 Conference Paper

On fast approximate submodular minimization

  • Stefanie Jegelka
  • Hui Lin
  • Jeff Bilmes

We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n^7) (O(n^5) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies.

NeurIPS Conference 2011 Conference Paper

Online Submodular Set Cover, Ranking, and Repeated Active Learning

  • Andrew Guillory
  • Jeff Bilmes

We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings.
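The loss in this setting is the cover time of a ranking. The sketch below computes it for a toy monotone coverage function, counting how many items of the sequence must be taken, in order, before a coverage threshold is met; the convention for sequences that never reach the threshold is an assumption of this sketch.

```python
# Small sketch of the "cover time" loss: how many prefix items of a fixed
# ranking are needed before a monotone submodular function reaches a
# coverage threshold. The coverage objective and threshold are illustrative.
def cover_time(sequence, f, threshold):
    """Return the number of prefix items needed so that f(prefix) >= threshold."""
    prefix = []
    for t, item in enumerate(sequence, start=1):
        prefix.append(item)
        if f(prefix) >= threshold:
            return t
    return len(sequence) + 1   # assumed convention: penalize sequences that never cover

# Example: set cover over a small universe.
universe = set(range(10))
items = {"a": {0, 1, 2}, "b": {2, 3, 4, 5}, "c": {5, 6}, "d": {6, 7, 8, 9}}
coverage = lambda prefix: len(set().union(*(items[i] for i in prefix)))

print(cover_time(["a", "b", "c", "d"], coverage, threshold=len(universe)))  # 4
print(cover_time(["b", "d", "a", "c"], coverage, threshold=len(universe)))  # 3
```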

JMLR Journal 2011 Journal Article

Semi-Supervised Learning with Measure Propagation

  • Amarnag Subramanya
  • Jeff Bilmes

We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets; for example, in one instance we use a data set with over $10^8$ samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case.

AAAI Conference 2010 Conference Paper

Discovering Long Range Properties of Social Networks with Multi-Valued Time-Inhomogeneous Models

  • Danny Wyatt
  • Tanzeem Choudhury
  • Jeff Bilmes

The current methods used to mine and analyze temporal social network data make two assumptions: all edges have the same strength, and all parameters are time-homogeneous. We show that those assumptions may not hold for social networks and propose an alternative model with two novel aspects: (1) the modeling of edges as multi-valued variables that can change in intensity, and (2) the use of a curved exponential family framework to capture time-inhomogeneous properties while retaining a parsimonious and interpretable model. We show that our model outperforms traditional models on two real-world social network data sets.

AIJ Journal 2010 Journal Article

Panlingual lexical translation via probabilistic inference

  • Mausam
  • Stephen Soderland
  • Oren Etzioni
  • Daniel S. Weld
  • Kobi Reiter
  • Michael Skinner
  • Marcus Sammer
  • Jeff Bilmes

This paper introduces a novel approach to the task of lexical translation between languages for which no translation dictionaries are available. We build a massive translation graph, automatically constructed from over 630 machine-readable dictionaries and Wiktionaries. In this graph each node denotes a word in some language and each edge (v_i, v_j) denotes a word sense shared by v_i and v_j. Our current graph contains over 10,000,000 nodes and expresses more than 60,000,000 pairwise translations. The composition of multiple translation dictionaries leads to a transitive inference problem: if word A translates to word B which in turn translates to word C, what is the probability that C is a translation of A? The paper describes a series of probabilistic inference algorithms that solve this problem at varying precision and recall levels. All algorithms enable us to quantify our confidence in a translation derived from the graph, and thus trade precision for recall. We compile the results of our best inference algorithm to yield PanDictionary, a novel multilingual dictionary. PanDictionary contains more than four times as many translations as the largest Wiktionary at precision 0.90 and over 200,000,000 pairwise translations in over 200,000 language pairs at precision 0.8.
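As a toy illustration of the transitive-inference question the abstract raises, the sketch below combines two-hop translation paths with independent-path (noisy-OR) probabilities on a hand-made graph. The words, edge confidences, and the noisy-OR combination are all assumptions of this sketch; the paper's inference algorithms are considerably more careful than this.

```python
# Toy illustration of transitive inference on a translation graph: if A
# translates to B and B to C, how confident should we be that A translates
# to C? Two-hop paths are combined with a naive noisy-OR; all values are made up.
from collections import defaultdict

edges = [  # (word, language) nodes with a per-edge confidence of a shared sense
    (("dog", "en"), ("perro", "es"), 0.95),
    (("perro", "es"), ("cane", "it"), 0.90),
    (("dog", "en"), ("chien", "fr"), 0.95),
    (("chien", "fr"), ("cane", "it"), 0.85),
]

graph = defaultdict(list)
for u, v, p in edges:
    graph[u].append((v, p))
    graph[v].append((u, p))

def two_hop_confidence(src, dst):
    """Noisy-OR over all two-hop paths src -> mid -> dst."""
    prob_no_shared_sense = 1.0
    for mid, p1 in graph[src]:
        for nxt, p2 in graph[mid]:
            if nxt == dst:
                prob_no_shared_sense *= 1.0 - p1 * p2
    return 1.0 - prob_no_shared_sense

print(round(two_hop_confidence(("dog", "en"), ("cane", "it")), 3))
```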

NeurIPS Conference 2009 Conference Paper

Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

  • Amarnag Subramanya
  • Jeff Bilmes

We prove certain theoretical properties of a graph-regularized transductive learning objective that is based on minimizing a Kullback-Leibler divergence based loss. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache-cognizant and leads to a linear speedup in parallel computations. This ensures that the algorithm scales to large data sets. By making use of empirical evaluation on the TIMIT and Switchboard I corpora, we show this approach is able to outperform other state-of-the-art SSL approaches. In one instance, we solve a problem on a 120-million-node graph.

NeurIPS Conference 2009 Conference Paper

Label Selection on Graphs

  • Andrew Guillory
  • Jeff Bilmes

We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e., offline, non-sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets.

NeurIPS Conference 2009 Conference Paper

Submodularity Cuts and Applications

  • Yoshinobu Kawahara
  • Kiyohito Nagano
  • Koji Tsuda
  • Jeff Bilmes

Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint --- the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial time) exact algorithms that perform reasonably in practice, there has been very little in the literature although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finite iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper-bounds of the optimal solution, while our algorithm provides successively increasing lower-bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications showing good performance.
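For contrast with the exact cutting-plane approach described above, the sketch below implements the standard (lazy) greedy baseline for cardinality-constrained monotone submodular maximization, which attains the (1 - 1/e) factor the abstract alludes to. The weighted-coverage objective is only an example; this is not the paper's algorithm.

```python
# Standard lazy greedy for cardinality-constrained monotone submodular
# maximization, the (1 - 1/e)-approximate baseline the abstract refers to.
import heapq

def lazy_greedy(ground_set, f, k):
    """Maximize monotone submodular f over subsets of size at most k."""
    selected, f_selected = [], 0.0
    # Max-heap (via negated values) of possibly stale upper bounds on gains.
    heap = [(-(f([v]) - f([])), v) for v in ground_set]
    heapq.heapify(heap)
    while len(selected) < k and heap:
        neg_gain, v = heapq.heappop(heap)
        gain = f(selected + [v]) - f_selected      # re-evaluate the stale bound
        if not heap or gain >= -heap[0][0]:        # still the best candidate
            selected.append(v)
            f_selected += gain
        else:
            heapq.heappush(heap, (-gain, v))       # push back with refreshed gain
    return selected, f_selected

# Example: coverage objective (sensor-placement / feature-selection flavour).
covers = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {7, 8}, 4: {0, 9}}
f = lambda A: float(len(set().union(*(covers[v] for v in A)) if A else set()))
print(lazy_greedy(list(covers), f, k=2))
```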

IJCAI Conference 2007 Conference Paper

  • Mukund Narasimhan
  • Jeff Bilmes

In this paper, we consider the problem of producing balanced clusterings with respect to a submodular objective function. Submodular objective functions occur frequently in many applications, and hence this problem is broadly applicable. We show that the results of Patkar and Narayanan (2003) can be applied to cases when the submodular function is derived from a bipartite object-feature graph, and moreover, in this case we have an efficient flow based algorithm for finding local improvements. We show the effectiveness of this approach by applying it to the clustering of words in language models.

IJCAI Conference 2007 Conference Paper

  • Danny Wyatt
  • Tanzeem Choudhury
  • Jeff Bilmes
  • Henry Kautz

In this paper we introduce a new dynamic Bayesian network that separates the speakers and their speaking turns in a multi-person conversation. We protect the speakers' privacy by using only features from which intelligible speech cannot be reconstructed. The model we present combines data from multiple audio streams, segments the streams into speech and silence, separates the different speakers, and detects when other nearby individuals who are not wearing microphones are speaking. No pre-trained speaker specific models are used, so the system can be easily applied in new and different environments. We show promising results in two very different datasets that vary in background noise, microphone placement and quality, and conversational dynamics.

AAAI Conference 2007 Conference Paper

Learning Large Scale Common Sense Models of Everyday Life

  • William Pentney
  • Jeff Bilmes

Recent work has shown promise in using large, publicly available, hand-contributed commonsense databases as joint models that can be used to infer human state from day-to-day sensor data. The parameters of these models are mined from the web. We show in this paper that learning these parameters using sensor data (with the mined parameters as priors) can improve performance of the models significantly. The primary challenge in learning is scale. Since the model comprises roughly 50,000 irregularly connected nodes in each time slice, it is intractable either to completely label observed data manually or to compute the expected likelihood of even a single time slice. We show how to solve the resulting semi-supervised learning problem by combining a variety of conventional approximation techniques and a novel technique for simplifying the model called context-based pruning. We show empirically that the learned model is substantially better at interpreting sensor data, and we provide a detailed analysis of how various techniques contribute to the performance.

NeurIPS Conference 2006 Conference Paper

Multi-dynamic Bayesian Networks

  • Karim Filali
  • Jeff Bilmes

We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-specific independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efficient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems.

NeurIPS Conference 2005 Conference Paper

Q-Clustering

  • Mukund Narasimhan
  • Nebojsa Jojic
  • Jeff Bilmes

We show that Queyranne's algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length (MDL) criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently "discriminative". It is known that, for this criterion, optimal clusterings into k clusters can be computed in polynomial time for any given k. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitionings (guaranteed to be within a factor of 2 of the optimal) for more clusters, can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application-specific criteria for which efficient algorithms are not known.

NeurIPS Conference 2004 Conference Paper

Optimal sub-graphical models

  • Mukund Narasimhan
  • Jeff Bilmes

We investigate the problem of reducing the complexity of a graphical model (G, PG) by finding a subgraph H of G, chosen from a class of subgraphs H, such that H is optimal with respect to KL-divergence. We do this by first defining a decomposition tree representation for G, which is closely related to the junction-tree representation for G. We then give an algorithm which uses this representation to compute the optimal H H. Gavril [2] and Tarjan [3] have used graph separation properties to solve several combinatorial optimization problems when the size of the minimal separators in the graph is bounded. We present an extension of this technique which applies to some important choices of H even when the size of the minimal separators of G are arbitrarily large. In particular, this applies to problems such as finding an optimal subgraphical model over a (k - 1)-tree of a graphical model over a k-tree (for arbitrary k) and selecting an optimal subgraphical model with (a constant) d fewer edges with respect to KL-divergence can be solved in time polynomial in |V (G)| using this formulation. 1 Introduction and Preliminaries The complexity of inference in graphical models is typically exponential in some parame- ter of the graph, such as the size of the largest clique. Therefore, it is often required to find a subgraphical model that has lower complexity (smaller clique size) without introducing a large error in inference results. The KL-divergence between the original probability dis- tribution and the probability distribution on the simplified graphical model is often used to measure the impact on inference. Existing techniques for reducing the complexity of graph- ical models including annihilation and edge-removal [4] are greedy in nature and cannot make any guarantees regarding the optimality of the solution. This problem is NP-complete [9] and so, in general, one cannot expect a polynomial time algorithm to find the optimal solution. However, we show that when we restrict the problem to some sets of subgraphs, the optimal solution can be found quite quickly using a dynamic programming algorithm in time polynomial in the tree-width of the graph. 1. 1 Notation and Terminology A graph G = (V, E) is said to be triangulated if every cycle of length greater than 3 has a chord. A clique of G is a non-empty set S V such that {a, b} E for all This work was supported by NSF grant IIS-0093430 and an Intel Corporation Grant. {b, c, d} {c, f, g} d {b, c} {f, c} {c, e} {b, e, c} {e, c, f } b c g {b, e} {a, b, e} a e f Figure 1: A triangulated graph G and a junction-tree for G a, b S. A clique S is maximal if S is not properly contained in another clique. If and are non-adjacent vertices of G then a set of vertices S V \ {, } is called an (, )-separator if and are in distinct components of G[V \ S]. S is a minimal (, )-separator if no proper subset of S is an (, )-separator. S is said to be a minimal separator if S is a minimal (, )-separator for some non adjacent a, b V. If T = (K, S) is a junction-tree for G (see [7]), then the nodes K of T correspond to the maximal- cliques of G, while the links S correspond to minimal separators of G (We reserve the terms vertices/edges for elements of G, and nodes/links for the elements of T ). If G is triangulated, then the number of maximal cliques is at most |V |. For example, in the graph G shown in Figure 1, K = {{b, c, d}, {a, b, e}, {b, e, c}, {e, c, f }, {c, f, g}}. The links S of T correspond to minimal-separators of G in the following way. 
If ViVj S (where Vi, Vj K and hence are cliques of G), then Vi Vj =. We label each edge ViVj S with the set Vij = Vi Vj, which is a non-empty complete separator in G. The removal of any link ViVj S disconnects T into two subtrees which we denote T (i) and T (j) (chosen so that T (i) contains Vi). We will let K(i) be the nodes of T (i), and V (i) = V K(i)V be the set of vertices corresponding to the subtree T (i). The junction tree property ensures that V (i) V (j) = Vi Vj = Vij. We will let G(i) be the subgraph induced by V (i). A graphical model is a pair (G, P ) where P is the joint probability distribution for random variables X1, X2, .. ., Xn, and G is a graph with vertex set V (G) = {X1, X2, .. ., Xn} such that the separators in G imply conditional independencies in P (so P factors according to G). If G is triangulated, then the junction-tree algorithm can be used for exact inference in the probability distribution P. The complexity of this algorithm grows with the treewidth of G (which is one less than the size of the largest clique in G when G is triangulated). The growth is exponential when P is a discrete probability distribution, thus rendering exact inference for graphs with large treewidth impractical. Therefore, we seek another graphical model (H, PH ) which allows tractable inference (so H should have lower treewidth than G has). The general problem of finding a graphical model of tree-width at most k so as to minimize the KL-divergence from a specified probability distribution is NP complete for general k ([9]) However, it is known that this problem is solvable in polynomial time (in |V (G)|) for some special cases cases (such as when G has bounded treewidth or when k = 1 [1]). If (G, PG) and (H, PH ) are graphical models, then we say that (H, PH ) is a subgraphical model of (G, PG) if H is a spanning subgraph of G. Note in particular that separators in G are separators in H, and hence (G, PH ) is also a graphical model. 2 Graph Decompositions and Divide-and-Conquer Algorithms For the remainder of the paper, we will be assuming that G = (V, E) is some triangulated graph, with junction tree T = (K, S). As observed above, if ViVj S, then the removal {b, c, d} {c, f, g} d {b, c} {f, c} {b, e, c} {e, c, f } b c c g {b, e} {a, b, e} a e e f Figure 2: The graphs G(i), G(j) and junction-trees T (i) and T (j) resulting from the removal of the link Vij = {c, e} of Vij = Vi Vj disconnects G into two (vertex-induced) subgraphs G(i) and G(j) which are both triangulated, with junction-trees T (i) and T (j) respectively. We can recursively decompose each of G(i) and G(j) into smaller and smaller subgraphs till the resulting sub- graphs are cliques. When the size of all the minimal separators are bounded, we may use these decompositions to easily solve problems that are hard in general. For example, in [5] it is shown that NP-complete problems like vertex coloring, and finding maximum inde- pendent sets can be solved in polynomial time on graphs with bounded tree-width (which are equivalent to spanning graphs with bounded size separators). 
We will be interested in finding (triangulated) subgraphs of G that satisfy some conditions, such as a bound on the number of edges, or a bound on the tree-width and which optimize separable objective functions (described in Section 2) One reason why problems such as this can often be solved easily when the tree-width of G is bounded by some constant is this: If Vij is a separator decomposing G into G(i) and G(j), then a divide-and-conquer approach would suggest that we try and find optimal subgraphs of G(i) and G(j) and then splice the two together to get an optimal subgraph of G. There are two issues with this approach. First, the optimal subgraphs of G(i) and G(j) need not necessarily match up on Vij, the set of common vertices. Second, even if the two subgraphs agree on the set of common vertices, the graph resulting from splicing the two subgraphs together need not be triangulated (which could happen even if the two subgraphs individually are triangulated). To rectify the situation, we can do the following. We parti- tion the set of subgraphs of G(i) and G(j) into classes, so that any subgraph of G(i) and any subgraph G(j) corresponding to the same class are compatible in the sense that they match up on their intersection namely Vij, and so that by splicing the two subgraphs together, we get a subgraph of G which is acceptable (and in particular is triangulated). Then given op- timal subgraphs of both G(i) and G(j) corresponding to each class, we can enumerate over all the classes and pick the best one. Of course, to ensure that we do not repeatedly solve the same problem, we need to work bottom-up (a. k. a dynamic programming) or memoize our solutions. This procedure can be carried out in polynomial (in |V |) time as long as we have only a polynomial number of classes. Now, if we have a polynomial number of classes, these classes need not actually be a partition of all the acceptable subgraphs, though the union of the classes must cover all acceptable subgraphs (so the same subgraph can be contained in more than one class). For our application, every class can be thought of to be the set of subgraphs that satisfy some constraint, and we need to pick a polynomial number of constraints that cover all possibilities. The bound on the tree-width helps us here. If |V ) ij | = k, then in any subgraph H of G, H [Vij ] must be one of the 2(k 2 possible subgraphs of G[V ) ij ]. So, if k is sufficiently small (so 2(k 2 is bounded by some polynomial in |V |), then this procedure results in a polynomial time algorithm. In this paper, we show that in some cases we can characterize the space H so that we still have a polynomial number of constraints even when the tree-width of G is not bounded by a small constant. 2. 1 Separable objective functions For cases where exact inference in the graphical model (G, PG) is intractable, it is natural to try to find a subgraphical model (H, PH ) such that D(PG PH ) is minimized, and inference using H is tractable. We will denote by H the set of subgraphs of G that are tractable for inference. For example, this set could be the set of subgraphs of G with treewidth one less than the treewidth of G, or perhaps the set of subgraphs of G with at d fewer edges. For a specified subgraph H of G, there is a unique probability distribution PH factoring over H that minimizes D(PG PH ). Hence, finding a optimal subgraphical model is equivalent to finding a subgraph H for which D(PG PH ) is minimized. 
If Vij is a separator of G, we will attempt to find optimal subgraphs of G by finding optimal subgraphs of G(i) and G(j) and splicing them together. However, to do this, we need to ensure that the objective criteria also decomposes along the separator Vij. Suppose that H is any triangulated subgraph of G. Let PG(i) and PG(j) be the (marginalized) distributions of PG on V (i) and V (j) respectively, and PH(i) and PH(j) be the (marginalized) distributions of the distribution PH on V (i) and V (j) where H(i) = H[V (i)] and H(j) = H[V (j)], The following result assures us that the KL-divergence also factors according to the separator Vij. Lemma 1. Suppose that (G, PG) is a graphical model, H is a triangulated subgraph of G, and PH factors over H. Then D(PG PH ) = D(PG(i) PH(i)) + D(PG(j) PH(j)) - D(PG[Vij] PH[Vij]). Proof. Since H is a subgraph of G, and Vij is a separator of G, Vij must also be a sepa- P rator of H. Therefore, P H(i) ({Xv }vV (i) )PH(j) ({Xv }vV (j) ) H {Xv} =. The result vV PH[V ) ij ] ({Xv }vVij follows immediately. Therefore, there is hope that we can reduce our our original problem of finding an optimal subgraph H H as one of finding subgraphs of H (i) G(i) and H(j) G(j) that are compatible, in the sense that they match up on the overlap Vij, and for which D(PG PH ) is minimized. Throughout this paper, for the sake of concreteness, we will assume that the objective criterion is to minimize the KL-divergence. However, all the results can be extended to other objective functions, as long as they "separate" in the sense that for any separator, the objective function is the sum of the objective functions of the two parts, possibly modulo some correction factor which is purely a function of the separator. Another example might be the complexity r(H) of representing the graphical model H. A very natural representation satisfies r(G) = r(G(i)) + r(G(j)) if G has a separator G(i) G(j). Therefore, the representation cost reduction would satisfy r(G) - r(H) = (r(G(i)) - r(H(i))) + (r(G(j)) - r(H(j))), and so also factors according to the separators. Finally note that any linear combinations of such separable functions is also separable, and so this technique could also be used to determine tradeoffs (representation cost vs. KL-divergence loss for example). In Section 4 we discuss some issues regarding computing this function. 2. 2 Decompositions and decomposition trees For the algorithms considered in this paper, we will be mostly interested in the decompo- sitions that are specified by the junction tree, and we will represent these decompositions by a rooted tree called a decomposition tree. This representation was introduced in [2, 3], and is similar in spirit to Darwiche's dtrees [6] which specify decompositions of directed acyclic graphs. In this section and the next, we show how a decomposition tree for a graph may be constructed, and show how it is used to solve a number of optimization problems. abd; ce; gf a; be; cd e; cf; g d; bc; e abe dbc ebc cef cf g Figure 3: The separator tree corresponding to Figure 1 A decomposition tree for G is a rooted tree whose vertices correspond to separators and cliques of G. We describe the construction of the decomposition tree in terms of a junction- tree T = (K, S) for G. The interior nodes of the decomposition tree R(T ) correspond to S (the links of T and hence the minimal separators of G). The leaf or terminal nodes represent the elements of K (the nodes of T and hence the maximal cliques of G). 
R(T ) can be recursively constructed from T as follows: If T consists of just one node K, (and hence no edges), then R consists of just one node, which is given the label K as well. If however, T has more than one node, then T must contain at least one link. To begin, let ViVj S be any link in T. Then removal of the link ViVj results in two disjoint junction- trees T (i) and T (j). We label the root of R by the decomposition (V (i); Vij; V (j)). The rest of R is recursively built by successively picking links of T (i) and T (j) (decompositions of G(i) and G(j)) to form the interior nodes of R. The effect of this procedure on the junction tree of Figure 1 is shown in Figure 3, where the decomposition associated with the interior nodes is shown inside the nodes. Let M be the set of all nodes of R(T ). For any interior node M induced by the the link ViVj S of T, then we will let M (i) and M (j) represent the left and right children of M, and R(i) and R(j) be the left and right trees below M. 3 Finding optimal subgraphical models 3. 1 Optimal sub (k - 1)-trees of k-trees Suppose that G is a k-tree. A sub (k - 1)-tree of G is a subgraph H of G that is (k - 1)- tree. Now, if Vij is any minimal separator of G, then both G(i) and G(j) are k-trees on vertex sets V (i) and V (j) respectively. It is clear that the induced subgraphs H[V (i)] and H[V (j)] are subgraphs of G(i) and G(j) and are partial (k - 1)-trees. We will be interested in finding sub (k - 1)-trees of k trees and this problem is trivial by the result of [1] when k = 2. Therefore, we assume that k 3. The following result characterizes the various possibilities for H[Vij] in this case. Lemma 2. Suppose that G is a k-tree, and S = Vij is a minimal separator of G corre- sponding to the link ij of the junction-tree T. In any (k - 1)-tree H G either 1. There is a u S such that u is not connected to vertices in both V (i) \ S and V (j) \ S. Then S \ {u} is a minimal separator in H and hence is complete. 2. Every vertex in S is connected to vertices in both V (i) \S and V (j) \S. Then there are vertices {x, y} S such that the edge H[S] is missing only the edge {x, y}. Further either H[V (i)] or H[V (j)] does not contain a unchorded x-y path. Proof. We consider two possibilities. In the first, there is some vertex u S such that u is not connected to vertices in both V (i) \ S and V (j). Since the removal of S disconnects G, the removal of S must also disconnect H. Therefore, S must contain a minimal separator of H. Since H is a (k - 1)-tree, all minimal separators of H must contain k - 1 vertices which must therefore be S {u}. This corresponds to case (1) above. Clearly this possiblity can occur. If there is no such u S, then every vertex in S is connected to vertices in both V (i) \ S and V (j) \ S. If x S is connected to some yi V (i) \ S and yj V (j) \ S, then x is contained in every minimal yi/yj separator (see [5]). Therefore, every vertex in S is part of a minimal separator. Since each minimal separator contains k - 1 vertices, there must be at least two distinct minimum separators contained in S. Let Sx = S \ {x} and Sy = S \ {y} be two distinct minimal separators. We claim that H[S] contains all edges except the edge {x, y}. To see this, note that if z, w S, with z = w and {z, w} = {x, y} (as sets), then either {z, w} Sy or {z, w} Sx. Since both Sx and Sy are complete in H, this edge must be present in H. The edge {x, y} is not present in H[S] because all minimal separators in H must be of size k - 1. 
Further, if both V (i) and V (j) contain an unchorded path between x and y, then by joining the two paths at x and y, we get a unchorded cycle in H which contradicts the fact that H is triangulated. Therefore, we may associate k 2 + 2 k constraints with each separator V 2 ij of G as follows. There are k possible constraints corresponding to case (1) above (one for each choice of x), and k 2 choices corresponding to case (2) above. This is because for each 2 pair {x, y} corresponding to the missing edge, we have either V (i) contains no unchorded xy paths or V (j) contains no unchorded xy paths. More explicitly, we can encode the set of constraints CM associated with each separator S corresponding to an interior node M of the decomposition tree as follows: CM = { (x, y, s): x S, y S, s {i, j}}. If y = x, then this corresponds to case (1) of the above lemma. If s = i, then x is connected only to H(i) and if s = j, then x is connected only to H(j). If y = x, then this corresponds to case (2) in the above lemma. If s = i, then H (i) does not contain any unchorded path between x and y, and there is no constraint on H(j). Similarly if s = j, then H(j) does not contain any unchorded path between x and y, and there is no constraint on H (i). Now suppose that H(i) and H(j) are triangulated subgraphs of G(i) and G(j) respectively, then it is clear that if H(i) and H(j) both satisfy the same constraint they must match up on the common vertices Vij. Therefore to splice together two solutions corresponding to the same constraint, we only need to check that the graph obtained by splicing the graphs is triangulated. Lemma 3. Suppose that H(i) and H(j) are triangulated subgraphs of G(i) and G(j) re- spectively such that both of them satisfy the same constraint as described above. Then the graph H obtained by splicing H(i) and H(j) together is triangulated. Proof. Suppose that both H(i) and H(j) are both triangulated and both satisfy the same constraint. If both H(i) and H(j) satisfy the same constraint corresponding to case (1) in Lemma 2 and H has an unchorded cycle, then this cycle must involve elements of both H(i) and H(j). Therefore, there must be two vertices of S {u} on the cycle, and hence this cycle has a chord as S \ {u} is complete. This contradiction shows that H is triangulated. So assume that both of them satisfy the constraint corresponding to case (2) of Lemma 2. Then if H is not triangulated, there must be a t-cycle (for t 4) with no chord. Now, since {x, y} is the only missing edge of S in H, and because H (i) and H(j) are individually triangulated, the cycle must contain x, y and vertices of both V (i) \ S and V (j) \ S. We may split this unchorded cycle into two unchorded paths, one contained in V (i) and one in V (j) thus violating our assumption that both H(i) and H(j) satisfy the same constraint. If |S| = k, then there are 2k + 2 k O(k2) O(n2). We can use a divide and conquer 2 strategy to find the optimal sub (k - 1) tree once we have taken care of the base case, where G is just a single clique (of k + 1) elements. However, for this case, it is easily checked that any subgraph of G obtained by deleting exactly one edge results in a (k - 1) tree, and every sub (k-1)-tree results from this operation. Therefore, the optimal (k-1)-tree can be found using Algorithm 1, and in this case, the complexity of Algorithm 1 is O(n(k + 1)2). This procedure can be generalized to find the optimal sub (k - d)- tree for any fixed d. 
If |S| = k, then there are 2k + 2·(k choose 2) = O(k^2) ⊆ O(n^2) such constraints. We can use a divide and conquer strategy to find the optimal sub (k − 1)-tree once we have taken care of the base case, where G is just a single clique (of k + 1 elements). For this case, it is easily checked that any subgraph of G obtained by deleting exactly one edge is a (k − 1)-tree, and every sub (k − 1)-tree results from this operation. Therefore, the optimal (k − 1)-tree can be found using Algorithm 1, and in this case the complexity of Algorithm 1 is O(n(k + 1)^2). This procedure can be generalized to find the optimal sub (k − d)-tree for any fixed d. However, the number of constraints grows exponentially with d (though it is still polynomial in n). Therefore, for small fixed values of d, we can compute the optimal sub (k − d)-tree of G. While we could compute a (k − d)-tree of G greedily, by first going from a k-tree to a (k − 1)-tree, then from a (k − 1)-tree to a (k − 2)-tree, and so on, this is not optimal in general. It might nevertheless be a reasonable algorithm to try when d is large.

3.2 Optimal triangulated subgraphs with |E(G)| − d edges

Suppose that we are interested in a (triangulated) subgraph of G that contains d fewer edges than G does. That is, we want to find an optimal subgraph H ⊆ G such that |E(H)| = |E(G)| − d. Note that by the result of [4] there is always a triangulated subgraph with d fewer edges (provided d is at most |E(G)| − |V(G)| + 1). Two natural approaches are:

1. Use the procedure described in [4]. This is a greedy procedure which works in d steps by deleting an edge at each step. At each stage, the edge is picked from the set of edges whose deletion leaves a triangulated graph, and among these the edge which causes the least increase in KL-divergence is chosen.

2. For each possible subset A of E(G) of size d whose deletion leaves a triangulated graph, compute the KL-divergence using the formula above, and then pick the optimal one. Since there are (|E(G)| choose d) such sets, this can be done in polynomial time (in |V(G)|) when d is a constant.

The first, greedy algorithm is not guaranteed to yield the optimal solution, while the second takes time that is O(n^(2d)). Now let us solve this problem using the framework we have described. Let 𝓗 be the set of subgraphs of G which may be obtained by deleting d edges. For each M ∈ 𝓜 corresponding to the separator Vij, let

CM = { (l, r, c, s, A) : l + r − c = d, s a d-bit string, A ⊆ E(G[Vij]) }.

The constraint represented by (l, r, c, s, A) is this: A is the set of edges of G[Vij] that are missing in H, l edges are missing from the left subgraph, and r edges are missing from the right subgraph; c represents the double count, and so is subtracted from the total. If k is the size of the largest clique, then the total number of such constraints is O(k^(2d)), which can be better than O(n^(2d)) and is polynomial in |V| when d is constant. See [10] for additional details.

Algorithm 1: Finding the optimal set of constraints efficiently

R ← separator tree for G;
for each vertex M of R in order of increasing height (bottom up) do
    for each constraint cM of M do
        if M is an interior vertex of R corresponding to edge ij of the junction tree then
            let Ml and Mr be the left and right children of M;
            pick the constraint cl ∈ CMl compatible with cM that minimizes table[Ml, cl];
            pick the constraint cr ∈ CMr compatible with cM that minimizes table[Mr, cr];
            loss ← D(PG[M] || PH[M]);
            table[M, cM] ← table[Ml, cl] + table[Mr, cr] − loss;
        else
            table[M, cM] ← D(PG[M] || PH[M]);
        end
    end
end
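As a concrete illustration of how the table in Algorithm 1 might be filled, here is a brief Python sketch. It is not the authors' implementation; the node fields (children, constraints, height), the oracle kl_loss standing in for D(PG[M] || PH[M]), and the predicate compatible are interfaces assumed only for this example.

def all_nodes(node):
    """Yield every node of the separator tree rooted at `node`."""
    yield node
    for child in node.children:
        yield from all_nodes(child)

def fill_table(R, kl_loss, compatible):
    """Bottom-up dynamic program over the separator tree R (cf. Algorithm 1)."""
    table = {}
    for M in sorted(all_nodes(R), key=lambda n: n.height):    # increasing height = bottom up
        for c in M.constraints:
            if not M.children:                                # leaf: a single clique
                table[(M, c)] = kl_loss(M, c)
                continue
            Ml, Mr = M.children                               # left and right children
            best_l = min((table[(Ml, cl)] for cl in Ml.constraints if compatible(cl, c)),
                         default=float("inf"))
            best_r = min((table[(Mr, cr)] for cr in Mr.constraints if compatible(cr, c)),
                         default=float("inf"))
            loss = kl_loss(M, c)                              # separator term counted twice
            table[(M, c)] = best_l + best_r - loss
    return table

The optimal objective would then be the minimum of table[(root, c)] over the root's constraints, and the minimizing choices can be traced back down the tree to recover the subgraph H.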
4 Conclusions

Algorithm 1 will compute the optimal H ∈ 𝓗 for the two examples discussed above and is polynomial (for fixed constant d) even if k is O(n). In [10] a generalization is presented which allows finding the optimal solution for other classes of subgraphical models. We assume an oracle model for computing KL-divergences of probability distributions on the vertex sets of cliques. These KL-divergences can clearly be computed for distributions such as Gaussians, but for discrete distributions this may not be possible when k is large. Even in this case, however, the algorithm makes only polynomially many calls to the oracle, whereas the standard algorithm [3], which is exponential in the treewidth, makes O(2^k) calls. Therefore, when the cost of computing the KL-divergence is large, this algorithm becomes even more attractive, as it yields an exponential speedup over the standard algorithm. Alternatively, if we can compute approximate KL-divergences, or approximately optimal solutions, then we can compute an approximate solution using the same algorithm.

NeurIPS Conference 2003 Conference Paper

Necessary Intransitive Likelihood-Ratio Classifiers

  • Gang Ji
  • Jeff Bilmes

In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction, which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolated-word speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results.

NeurIPS Conference 2001 Conference Paper

Intransitive Likelihood-Ratio Classifiers

  • Jeff Bilmes
  • Gang Ji
  • Marina Meila

In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game, and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
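To make the classification scheme sketched in this abstract concrete, here is a small illustrative Python example, not the paper's implementation: pairwise log-likelihood-ratio comparisons with additive per-pair correction terms, resolved by a simple round-robin tournament. The function names, the dictionary-based interface, the tie-breaking rule, and the toy numbers are all assumptions made for the illustration.

from itertools import combinations

def pairwise_winner(loglik, bias, a, b):
    """Compare classes a and b via a corrected log likelihood ratio.

    loglik: dict class -> log p(x | class) for the observation at hand.
    bias:   dict (a, b) -> additive correction term for that class pair.
    Returns the preferred class; the preference need not be transitive.
    """
    score = loglik[a] - loglik[b] + bias.get((a, b), 0.0)
    return a if score >= 0.0 else b

def round_robin_classify(loglik, bias, classes):
    """Run every pairwise comparison and return the class with the most wins."""
    wins = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        wins[pairwise_winner(loglik, bias, a, b)] += 1
    return max(classes, key=lambda c: wins[c])

# Toy numbers producing an intransitive cycle: "b" beats "a", "c" beats "b", "a" beats "c".
# All classes then tie on wins and max() breaks the tie by listing order.
loglik = {"a": -1.0, "b": -1.1, "c": -1.2}
bias = {("a", "b"): -0.2, ("b", "c"): -0.2, ("a", "c"): 0.1}
print(round_robin_classify(loglik, bias, ["a", "b", "c"]))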

NeurIPS Conference 1991 Conference Paper

Software for ANN training on a Ring Array Processor

  • Phil Kohn
  • Jeff Bilmes
  • Nelson Morgan
  • James Beck

Experimental research on Artificial Neural Network (ANN) algorithms requires either writing variations on the same program or making one monolithic program with many parameters and options. By using an object-oriented library, the size of these experimental programs is reduced while making them easier to read, write and modify. An efficient and flexible realization of this idea is the Connectionist Layered Object-oriented Network Simulator (CLONES). CLONES runs on UNIX workstations and on the 100-1000 MFLOP Ring Array Processor (RAP) that we built with ANN algorithms in mind. In this report we describe CLONES and show how it is implemented on the RAP.