Author name cluster

Jeff A. Bilmes

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

41 papers

2 author rows

ICLR Conference 2025 Conference Paper

Many-Objective Multi-Solution Transport

Ziyue Li
Tian Li 0005
Virginia Smith
Jeff A. Bilmes
Tianyi Zhou 0001

Optimizing the performance of many objectives (instantiated by tasks or clients) jointly with a few Pareto stationary solutions (models) is critical in machine learning. However, previous multi-objective optimization methods often focus on a few objectives and cannot scale to many objectives that outnumber the solutions, leading to either subpar performance or ignored objectives. We introduce ''Many-objective multi-solution Transport (MosT)'', a framework that finds multiple diverse solutions in the Pareto front of many objectives. Our insight is to seek multiple solutions, each performing as a domain expert and focusing on a specific subset of objectives while collectively covering all of them. MosT formulates the problem as a bi-level optimization of weighted objectives for each solution, where the weights are defined by an optimal transport between objectives and solutions. Our algorithm ensures convergence to Pareto stationary solutions for complementary subsets of objectives. On a range of applications in federated learning, multi-task learning, and mixture-of-prompt learning for LLMs, MosT distinctly outperforms strong baselines, delivering high-quality, diverse solutions that profile the entire Pareto frontier, thus ensuring balanced trade-offs across many objectives.

ICML Conference 2025 Conference Paper

Tilted Sharpness-Aware Minimization

Tian Li 0005
Tianyi Zhou 0001
Jeff A. Bilmes

Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of overparameterized models by seeking flat minima on the loss landscape through optimizing model parameters that incur the largest loss within a neighborhood. Nevertheless, such min-max formulations are computationally challenging especially when the problem is highly non-convex. Additionally, focusing only on the worst-case local solution while ignoring potentially many other local solutions may be suboptimal when searching for flat minima. In this work, we propose Tilted SAM (TSAM), a smoothed generalization of SAM inspired by exponential tilting that effectively assigns higher priority to local solutions that incur larger losses. TSAM is parameterized by a tilt hyperparameter $t$ and reduces to SAM as $t$ approaches infinity. We show that TSAM is smoother than SAM and thus easier to optimize, and it explicitly favors flatter minima. We develop algorithms motivated by the discretization of Hamiltonian dynamics to solve TSAM. Empirically, TSAM arrives at flatter local minima and results in superior test performance than the baselines of SAM and ERM across a range of image and text tasks.

UAI Conference 2024 Conference Paper

Efficient Interactive Maximization of BP and Weakly Submodular Objectives

Adhyyan Narang
Omid Sadeghi
Lillian J. Ratliff
Maryam Fazel
Jeff A. Bilmes

In the context of online interactive machine learning with combinatorial objectives, we extend purely submodular prior work to more general non-submodular objectives. This includes: (1) those that are additively decomposable into a sum of two terms (a monotone submodular and monotone supermodular term, known as a BP decomposition); and (2) those that are only weakly submodular. In both cases, this allows representing not only competitive (submodular) but also complementary (supermodular) relationships between objects, enhancing this setting to a broader range of applications (e. g. , movie recommendations, medical treatments, etc.) where this is beneficial. In the two-term case, moreover, we study not only the more typical monolithic feedback approach but also a novel framework where feedback is available separately for each term. With real-world practicality and scalability in mind, we integrate \Nystrom{} sketching techniques to significantly improve the computational complexity, including for the purely submodular case. In the Gaussian process contextual bandits setting, we show sub-linear theoretical regret bounds in all cases. We also empirically show good applicability to recommendation systems and data subset selection.

ICRA Conference 2023 Conference Paper

High Resolution Point Clouds from mmWave Radar

Akarsh Prabhakara
Tao Jin
Arnav Das 0001
Gantavya Bhatt
Lilly Kumari
Elahe Soltanaghai
Jeff A. Bilmes
Swarun Kumar

This paper explores a machine learning approach on data from a single-chip mmWave radar for generating high resolution point clouds – a key sensing primitive for robotic applications such as mapping, odometry and localization. Unlike lidar and vision-based systems, mmWave radar can operate in harsh environments and see through occlusions like smoke, fog, and dust. Unfortunately, current mmWave processing techniques offer poor spatial resolution compared to lidar point clouds. This paper presents RadarHD, an end-to-end neural network that constructs lidar-like point clouds from low resolution radar input. Enhancing radar images is challenging due to the presence of specular and spurious reflections. Radar data also doesn't map well to traditional image processing techniques due to the signal's sinc-like spreading pattern. We overcome these challenges by training RadarHD on a large volume of raw I/Q radar data paired with lidar point clouds across diverse indoor settings. Our experiments show the ability to generate rich point clouds even in scenes unobserved during training and in the presence of heavy smoke occlusion. Further, RadarHD's point clouds are high-quality enough to work with existing lidar odometry and mapping workflows.

ICLR Conference 2022 Conference Paper

Diverse Client Selection for Federated Learning via Submodular Maximization

Ravikumar Balakrishnan
Tian Li 0005
Tianyi Zhou 0001
Nageen Himayat
Virginia Smith
Jeff A. Bilmes

In every communication round of federated learning, a random subset of clients communicate their model updates back to the server which then aggregates them all. The optimal size of this subset is not known and several studies have shown that typically random selection does not perform very well in terms of convergence, learning efficiency and fairness. We, in this paper, propose to select a small diverse subset of clients, namely those carrying representative gradient information, and we transmit only these updates to the server. Our aim is for updating via only a subset to approximate updating via aggregating all client information. We achieve this by choosing a subset that maximizes a submodular facility location function defined over gradient space. We introduce “federated averaging with diverse client selection (DivFL)”. We provide a thorough analysis of its convergence in the heterogeneous setting and apply it both to synthetic and to real datasets. Empirical results show several benefits to our approach including improved learning efficiency, faster convergence and also more uniform (i.e., fair) performance across clients. We further show a communication-efficient version of DivFL that can still outperform baselines on the above metrics.

ICLR Conference 2021 Conference Paper

Robust Curriculum Learning: from clean label detection to noisy label self-correction

Tianyi Zhou 0001
Shengjie Wang 0001
Jeff A. Bilmes

Neural network training can easily overfit noisy labels resulting in poor generalization performance. Existing methods address this problem by (1) filtering out the noisy data and only using the clean data for training or (2) relabeling the noisy data by the model during training or by another model trained only on a clean dataset. However, the former does not leverage the features' information of wrongly-labeled data, while the latter may produce wrong pseudo-labels for some data and introduce extra noises. In this paper, we propose a smooth transition and interplay between these two strategies as a curriculum that selects training samples dynamically. In particular, we start with learning from clean data and then gradually move to learn noisy-labeled data with pseudo labels produced by a time-ensemble of the model and data augmentations. Instead of using the instantaneous loss computed at the current step, our data selection is based on the dynamics of both the loss and output consistency for each sample across historical steps and different data augmentations, resulting in more precise detection of both clean labels and correct pseudo labels. On multiple benchmarks of noisy labels, we show that our curriculum learning strategy can significantly improve the test accuracy without any auxiliary model or extra clean data.

ICML Conference 2020 Conference Paper

Coresets for Data-efficient Training of Machine Learning Models

Baharan Mirzasoleiman
Jeff A. Bilmes
Jure Leskovec

Incremental gradient (IG) methods, such as stochastic gradient descent and its variants are commonly used for large scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near)optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive set of experiments show that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.

ICML Conference 2020 Conference Paper

Time-Consistent Self-Supervision for Semi-Supervised Learning

Tianyi Zhou 0001
Shengjie Wang 0001
Jeff A. Bilmes

Semi-supervised learning (SSL) leverages unlabeled data when training a model with insufficient labeled data. A common strategy for SSL is to enforce the consistency of model outputs between similar samples, e. g. , neighbors or data augmentations of the same sample. However, model outputs can vary dramatically on unlabeled data over different training stages, e. g. , when using large learning rates. This can introduce harmful noises and inconsistent objectives over time that may lead to concept drift and catastrophic forgetting. In this paper, we study the dynamics of neural net outputs in SSL and show that selecting and using first the unlabeled samples with more consistent outputs over the course of training (i. e. , "time-consistency") can improve the final test accuracy and save computation. Under the time-consistent data selection, we design an SSL objective composed of two self-supervised losses, i. e. , a consistency loss between a sample and its augmentation, and a contrastive loss encouraging different samples to have different outputs. Our approach achieves SOTA on several SSL benchmarks with much fewer computations.

ICML Conference 2019 Conference Paper

Bias Also Matters: Bias Attribution for Deep Neural Network Explanation

Shengjie Wang 0001
Tianyi Zhou 0001
Jeff A. Bilmes

The gradient of a deep neural network (DNN) w. r. t. the input provides information that can be used to explain the output prediction in terms of the input features and has been widely studied to assist in interpreting DNNs. In a linear model (i. e. , g(x) = wx + b), the gradient corresponds to the weights w. Such a model can reasonably locally-linearly approximate a smooth nonlinear DNN, and hence the weights of this local model are the gradient. The bias b, however, is usually overlooked in attribution methods. In this paper, we observe that since the bias in a DNN also has a non-negligible contribution to the correctness of predictions, it can also play a significant role in understanding DNN behavior. We propose a backpropagation-type algorithm “bias back-propagation (BBp)” that starts at the output layer and iteratively attributes the bias of each layer to its input nodes as well as combining the resulting bias term of the previous layer. Together with the backpropagation of the gradient generating w, we can fully recover the locally linear model g(x) = wx + b. In experiments, we show that BBp can generate complementary and highly interpretable explanations.

ICML Conference 2019 Conference Paper

Combating Label Noise in Deep Learning using Abstention

Sunil Thulasidasan
Tanmoy Bhattacharya 0001
Jeff A. Bilmes
Gopinath Chennupati
Jamal Mohd-Yusof

We introduce a novel method to combat label noise when training deep neural networks for classification. We propose a loss function that permits abstention during training thereby allowing the DNN to abstain on confusing samples while continuing to learn and improve classification performance on the non-abstained samples. We show how such a deep abstaining classifier (DAC) can be used for robust learning in the presence of different types of label noise. In the case of structured or systematic label noise {–} where noisy training labels or confusing examples are correlated with underlying features of the data{–} training with abstention enables representation learning for features that are associated with unreliable labels. In the case of unstructured (arbitrary) label noise, abstention during training enables the DAC to be used as an effective data cleaner by identifying samples that are likely to have label noise. We provide analytical results on the loss function behavior that enable dynamic adaption of abstention rates based on learning progress during training. We demonstrate the utility of the deep abstaining classifier for various image classification tasks under different types of label noise; in the case of arbitrary label noise, we show significant im- provements over previously published results on multiple image benchmarks.

ICML Conference 2019 Conference Paper

Jumpout: Improved Dropout for Deep Neural Networks with ReLUs

Shengjie Wang 0001
Tianyi Zhou 0001
Jeff A. Bilmes

We discuss three novel insights about dropout for DNNs with ReLUs: 1) dropout encourages each local linear piece of a DNN to be trained on data points from nearby regions; 2) the same dropout rate results in different (effective) deactivation rates for layers with different portions of ReLU-deactivated neurons; and 3) the rescaling factor of dropout causes a normalization inconsistency between training and test when used together with batch normalization. The above leads to three simple but nontrivial modifications resulting in our method “jumpout. ” Jumpout samples the dropout rate from a monotone decreasing distribution (e. g. , the right half of a Gaussian), so each local linear piece is trained, with high probability, to work better for data points from nearby than more distant regions. Jumpout moreover adaptively normalizes the dropout rate at each layer and every training batch, so the effective deactivation rate on the activated neurons is kept the same. Furthermore, it rescales the outputs for a better trade-off that keeps both the variance and mean of neurons more consistent between training and test phases, thereby mitigating the incompatibility between dropout and batch normalization. Jumpout significantly improves the performance of different neural nets on CIFAR10, CIFAR100, Fashion-MNIST, STL10, SVHN, ImageNet-1k, etc. , while introducing negligible additional memory and computation costs.

ICML Conference 2018 Conference Paper

Constrained Interacting Submodular Groupings

Andrew Cotter
Mahdi Milani Fard
Seungil You
Maya R. Gupta
Jeff A. Bilmes

We introduce the problem of grouping a finite ground set into blocks where each block is a subset of the ground set and where: (i) the blocks are individually highly valued by a submodular function (both robustly and in the average case) while satisfying block-specific matroid constraints; and (ii) block scores interact where blocks are jointly scored highly, thus making the blocks mutually non-redundant. Submodular functions are good models of information and diversity; thus, the above can be seen as grouping the ground set into matroid constrained blocks that are both intra- and inter-diverse. Potential applications include forming ensembles of classification/regression models, partitioning data for parallel processing, and summarization. In the non-robust case, we reduce the problem to non-monotone submodular maximization subject to multiple matroid constraints. In the mixed robust/average case, we offer a bi-criterion guarantee for a polynomial time deterministic algorithm and a probabilistic guarantee for randomized algorithm, as long as the involved submodular functions (including the inter-block interaction terms) are monotone. We close with a case study in which we use these algorithms to find high quality diverse ensembles of classifiers, showing good results.

ICML Conference 2018 Conference Paper

Greed is Still Good: Maximizing Monotone Submodular+Supermodular (BP) Functions

Wenruo Bai
Jeff A. Bilmes

We analyze the performance of the greedy algorithm, and also a discrete semi-gradient based algorithm, for maximizing the sum of a suBmodular and suPermodular (BP) function (both of which are non-negative monotone non-decreasing) under two types of constraints, either a cardinality constraint or $p\geq 1$ matroid independence constraints. These problems occur naturally in several real-world applications in data science, machine learning, and artificial intelligence. The problems are ordinarily inapproximable to any factor. Using the curvature $\curv_f$ of the submodular term, and introducing $\curv^g$ for the supermodular term (a natural dual curvature for supermodular functions), however, both of which are computable in linear time, we show that BP maximization can be efficiently approximated by both the greedy and the semi-gradient based algorithm. The algorithms yield multiplicative guarantees of $\frac{1}{\curv_f}\left[1-e^{-(1-\curv^g)\curv_f}\right]$ and $\frac{1-\curv^g}{(1-\curv^g)\curv_f + p}$ for the two types of constraints respectively. For pure monotone supermodular constrained maximization, these yield $1-\curvg$ and $(1-\curvg)/p$ for the two types of constraints respectively. We also analyze the hardness of BP maximization and show that our guarantees match hardness by a constant factor and by $O(\ln(p))$ respectively. Computational experiments are also provided supporting our analysis.

ICML Conference 2016 Conference Paper

Algorithms for Optimizing the Ratio of Submodular Functions

Wenruo Bai
Rishabh K. Iyer
Kai Wei
Jeff A. Bilmes

We investigate a new optimization problem involving minimizing the Ratio of Submodular (RS) functions. We argue that this problem occurs naturally in several real world applications. We then show the connection between this problem and several related problems, including minimizing the difference of submodular functions, and to submodular optimization subject to submodular constraints. We show RS that optimization can be solved within bounded approximation factors. We also provide a hardness bound and show that our tightest algorithm matches the lower bound up to a \log factor. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms.

ICML Conference 2016 Conference Paper

Analysis of Deep Neural Networks with Extended Data Jacobian Matrix

Shengjie Wang 0001
Abdel-rahman Mohamed
Rich Caruana
Jeff A. Bilmes
Matthai Philipose
Matthew Richardson
Krzysztof J. Geras
Gregor Urban

Deep neural networks have achieved great successes on various machine learning tasks, however, there are many open fundamental questions to be answered. In this paper, we tackle the problem of quantifying the quality of learned wights of different networks with possibly different architectures, going beyond considering the final classification error as the only metric. We introduce \emphExtended Data Jacobian Matrix to help analyze properties of networks of various structures, finding that, the spectrum of the extended data jacobian matrix is a strong discriminating factor for networks of different structures and performance. Based on such observation, we propose a novel regularization method, which manages to improve the network performance comparably to dropout, which in turn verifies the observation.

ICML Conference 2015 Conference Paper

Entropic Graph-based Posterior Regularization

Maxwell W. Libbrecht
Michael M. Hoffman
Jeff A. Bilmes
William Stafford Noble

Graph smoothness objectives have achieved great success in semi-supervised learning but have not yet been applied extensively to unsupervised generative models. We define a new class of entropic graph-based posterior regularizers that augment a probabilistic model by encouraging pairs of nearby variables in a regularization graph to have similar posterior distributions. We present a three-way alternating optimization algorithm with closed-form updates for performing inference on this joint model and learning its parameters. This method admits updates linear in the degree of the regularization graph, exhibits monotone convergence and is easily parallelizable. We are motivated by applications in computational biology in which temporal models such as hidden Markov models are used to learn a human-interpretable representation of genomic data. On a synthetic problem, we show that our method outperforms existing methods for graph-based regularization and a comparable strategy for incorporating long-range interactions using existing methods for approximate inference. Using genome-scale functional genomics data, we integrate genome 3D interaction data into existing models for genome annotation and demonstrate significant improvements in predicting genomic activity.

ICML Conference 2015 Conference Paper

On Deep Multi-View Representation Learning

Weiran Wang
Raman Arora
Karen Livescu
Jeff A. Bilmes

We consider learning representations (features) in the setting in which we have access to multiple unlabeled views of the data for representation learning while only one view is available at test time. Previous work on this problem has proposed several techniques based on deep neural networks, typically involving either autoencoder-like networks with a reconstruction objective or paired feedforward networks with a correlation-based objective. We analyze several techniques based on prior work, as well as new variants, and compare them experimentally on visual, speech, and language domains. To our knowledge this is the first head-to-head comparison of a variety of such techniques on multiple tasks. We find an advantage for correlation-based representation learning, while the best results on most tasks are obtained with our new variant, deep canonically correlated autoencoders (DCCAE).

ICML Conference 2015 Conference Paper

Submodularity in Data Subset Selection and Active Learning

Kai Wei
Rishabh K. Iyer
Jeff A. Bilmes

We study the problem of selecting a subset of big data to train a classifier while incurring minimal performance loss. We show the connection of submodularity to the data likelihood functions for Naive Bayes (NB) and Nearest Neighbor (NN) classifiers, and formulate the data subset selection problems for these classifiers as constrained submodular maximization. Furthermore, we apply this framework to active learning and propose a novel scheme filtering active submodular selection (FASS), where we combine the uncertainty sampling method with a submodular data subset selection framework. We extensively evaluate the proposed framework on text categorization and handwritten digit recognition tasks with four different classifiers, including Deep Neural Network (DNN) based classifiers. Empirical results indicate that the proposed framework yields significant improvement over the state-of-the-art algorithms on all classifiers.

ICML Conference 2014 Conference Paper

Fast Multi-stage Submodular Maximization

Kai Wei
Rishabh K. Iyer
Jeff A. Bilmes

We introduce a new multi-stage algorithmic framework for submodular maximization. We are motivated by extremely large scale machine learning problems, where both storing the whole data for function evaluation and running the standard accelerated greedy algorithm are prohibitive. We propose a multi-stage framework (called MultGreed), where at each stage we apply an approximate greedy procedure to maximize surrogate submodular functions. The surrogates serve as proxies for a target submodular function but require less memory and are easy to evaluate. We theoretically analyze the performance guarantee of the multi-stage framework, and give examples on how to design instances of MultGreed for a broad range of natural submodular functions. We show that MultGreed performs very close to the standard greedy algorithm, given appropriate surrogate functions, and argue how our framework can easily be integrated with distributive algorithms for optimization. We complement our theory by empirically evaluating on several real world problems, including data subset selection on millions of speech samples, where MultGreed yields at least a thousand times speedup and superior results over the state-of-the-art selection methods.

UAI Conference 2014 Conference Paper

Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry

John T. Halloran
Jeff A. Bilmes
William Stafford Noble

We present a peptide-spectrum alignment strategy that employs a dynamic Bayesian network (DBN) for the identification of spectra produced by tandem mass spectrometry (MS/MS). Our method is fundamentally generative in that it models peptide fragmentation in MS/MS as a physical process. The model traverses an observed MS/MS spectrum and a peptide-based theoretical spectrum to calculate the best alignment between the two spectra. Unlike all existing state-of-the-art methods for spectrum identification that we are aware of, our method can learn alignment probabilities given a dataset of high-quality peptide-spectrum pairs. The method, moreover, accounts for noise peaks and absent theoretical peaks in the observed spectrum. We demonstrate that our method outperforms, on a majority of datasets, several widely used, state-of-the-art database search tools for spectrum identification. Furthermore, the proposed approach provides an extensible framework for MS/MS analysis and provides useful information that is not produced by other methods, thanks to its generative structure.

UAI Conference 2014 Conference Paper

Monotone Closure of Relaxed Constraints in Submodular Optimization: Connections Between Minimization and Maximization

Rishabh K. Iyer
Stefanie Jegelka
Jeff A. Bilmes

It is becoming increasingly evident that many machine learning problems may be reduced to submodular optimization. Previous work addresses generic discrete approaches and specific relaxations. In this work, we take a generic view from a relaxation perspective. We show a relaxation formulation and simple rounding strategy that, based on the monotone closure of relaxed constraints, reveals analogies between minimization and maximization problems, and includes known results as special cases and extends to a wider range of settings. Our resulting approximation factors match the corresponding integrality gaps. For submodular maximization, a number of relaxation approaches have been proposed. A critical challenge for the practical applicability of these techniques, however, is the complexity of evaluating the multilinear extension. We show that this extension can be efficiently evaluated for a number of useful submodular functions, thus making these otherwise impractical algorithms viable for real-world machine learning problems.

ICML Conference 2013 Conference Paper

Deep Canonical Correlation Analysis

Galen Andrew
Raman Arora
Jeff A. Bilmes
Karen Livescu

We introduce Deep Canonical Correlation Analysis (DCCA), a method to learn complex nonlinear transformations of two views of data such that the resulting representations are highly linearly correlated. Parameters of both transformations are jointly learned to maximize the (regularized) total correlation. It can be viewed as a nonlinear extension of the linear method \emphcanonical correlation analysis (CCA). It is an alternative to the nonparametric method \emphkernel canonical correlation analysis (KCCA) for learning correlated nonlinear transformations. Unlike KCCA, DCCA does not require an inner product, and has the advantages of a parametric method: training time scales well with data size and the training data need not be referenced when computing the representations of unseen instances. In experiments on two real-world datasets, we find that DCCA learns representations with significantly higher correlation than those learned by CCA and KCCA. We also introduce a novel non-saturating sigmoid function based on the cube root that may be useful more generally in feedforward neural networks.

ICML Conference 2013 Conference Paper

Fast Semidifferential-based Submodular Function Optimization

Rishabh K. Iyer
Stefanie Jegelka
Jeff A. Bilmes

We present a practical and powerful new framework for both unconstrained and constrained submodular function optimization based on discrete semidifferentials (sub- and super-differentials). The resulting algorithms, which repeatedly compute and then efficiently optimize submodular semigradients, offer new and generalize many old methods for submodular optimization. Our approach, moreover, takes steps towards providing a unifying paradigm applicable to both submodular minimization and maximization, problems that historically have been treated quite distinctly. The practicality of our algorithms is important since interest in submodularity, owing to its natural and wide applicability, has recently been in ascendance within machine learning. We analyze theoretical properties of our algorithms for minimization and maximization, and show that many state-of-the-art maximization algorithms are special cases. Lastly, we complement our theoretical analyses with supporting empirical experiments.

UAI Conference 2013 Conference Paper

The Lovasz-Bregman Divergence and connections to rank aggregation, clustering, and web ranking

Rishabh K. Iyer
Jeff A. Bilmes

We extend the recently introduced theory of Lovász Bregman (LB) divergences [19] in several ways. We show that they represent a distortion between a “score” and an “ordering”, thus providing a new view of rank aggregation and order based clustering with interesting connections to web ranking. We show how the LB divergences have a number of properties akin to many permutation based metrics, and in fact have as special cases forms very similar to the Kendall-τ metric. We also show how the LB divergences subsume a number of commonly used ranking measures in information retrieval, like NDCG [22] and AUC [35]. Unlike the traditional permutation based metrics, however, the LB divergence naturally captures a notion of “confidence” in the orderings, thus providing a new representation to applications involving aggregating scores as opposed to just orderings. We show how a number of recently used web ranking models are forms of Lovász Bregman rank aggregation and also observe that a natural form of Mallow’s model using the LB divergence has been used as conditional ranking models for the “Learning to Rank” problem.

UAI Conference 2012 Conference Paper

Algorithms for Approximate Minimization of the Difference Between Submodular Functions, with Applications

Rishabh K. Iyer
Jeff A. Bilmes

We extend the work of Narasimhan and Bilmes [30] for minimizing set functions representable as a difference between submodular functions. Similar to [30], our new algorithms are guaranteed to monotonically reduce the objective function at every step. We empirically and theoretically show that the per-iteration cost of our algorithms is much less than [30], and our algorithms can be used to efficiently minimize a difference between submodular functions under various combinatorial constraints, a problem not previously addressed. We provide computational bounds and a hardness result on the multiplicative inapproximability of minimizing the difference between submodular functions. We show, however, that it is possible to give worst-case additive bounds by providing a polynomial time computable lower-bound on the minima. Finally we show how a number of machine learning problems can be modeled as minimizing the difference between submodular functions. We experimentally show the validity of our algorithms by testing them on the problem of feature selection with submodular cost features.

UAI Conference 2012 Conference Paper

Learning Mixtures of Submodular Shells with Application to Document Summarization

Hui Lin 0001
Jeff A. Bilmes

We introduce a method to learn a mixture of submodular “shells” in a large-margin setting. A submodular shell is an abstract submodular function that can be instantiated with a ground set and a set of parameters to produce a submodular function. A mixture of such shells can then also be so instantiated to produce a more complex submodular function. What our algorithm learns are the mixture weights over such shells. We provide a risk bound guarantee when learning in a large-margin structured-prediction setting using a projected subgradient method when only approximate submodular optimization is possible (such as with submodular function maximization). We apply this method to the problem of multi-document summarization and produce the best results reported so far on the widely used NIST DUC-05 through DUC-07 document summarization corpora.

UAI Conference 2012 Conference Paper

Spectrum Identification using a Dynamic Bayesian Network Model of Tandem Mass Spectra

Ajit P. Singh
John T. Halloran
Jeff A. Bilmes
Katrin Kirchhoff
William Stafford Noble

Shotgun proteomics is a high-throughput technology used to identify unknown proteins in a complex mixture. At the heart of this process is a prediction task, the spectrum identification problem, in which each fragmentation spectrum produced by a shotgun proteomics experiment must be mapped to the peptide (protein subsequence) which generated the spectrum. We propose a new algorithm for spectrum identification, based on dynamic Bayesian networks, which significantly outperforms the de-facto standard tools for this task: SEQUEST and Mascot.

UAI Conference 2011 Conference Paper

Active Semi-Supervised Learning using Submodular Functions

Andrew Guillory
Jeff A. Bilmes

We consider active, semi-supervised learning in an offline transductive setting. We show that a previously proposed error bound for active learning on undirected weighted graphs can be generalized by replacing graph cut with an arbitrary symmetric submodular function. Arbitrary non-symmetric submodular functions can be used via symmetrization. Different choices of submodular functions give different versions of the error bound that are appropriate for different kinds of problems. Moreover, the bound is deterministic and holds for adversarially chosen labels. We show exactly minimizing this error bound is NP-complete. However, we also introduce for any submodular function an associated active semi-supervised learning method that approximately minimizes the corresponding error bound. We show that the error bound is tight in the sense that there is no other bound of the same form which is better. Our theoretical results are supported by experiments on real data.

ICML Conference 2011 Conference Paper

Approximation Bounds for Inference using Cooperative Cuts

Stefanie Jegelka
Jeff A. Bilmes

ICML Conference 2011 Conference Paper

Online Submodular Minimization for Combinatorial Structures

Stefanie Jegelka
Jeff A. Bilmes

ICML Conference 2011 Conference Paper

Simultaneous Learning and Covering with Adversarial Noise

Andrew Guillory
Jeff A. Bilmes

JMLR Journal 2010 Journal Article

Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

Franz Pernkopf
Jeff A. Bilmes

We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N k+1 ) score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ~10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. [abs] [ pdf ][ bib ] &copy JMLR 2010. ( edit, beta )

ICML Conference 2010 Conference Paper

Interactive Submodular Set Cover

Andrew Guillory
Jeff A. Bilmes

UAI Conference 2007 Conference Paper

Consensus ranking under the exponential model

Marina Meila
Kapil Phadnis
Arthur Patterson
Jeff A. Bilmes

We analyze the generalized Mallows model, a popular exponential model over rankings. Estimating the central (or consensus) ranking from data is NP-hard. We obtain the following new results: (1) We show that search methods can estimate both the central ranking pi0 and the model parameters theta exactly. The search is n! in the worst case, but is tractable when the true distribution is concentrated around its mode; (2) We show that the generalized Mallows model is jointly exponential in (pi0; theta), and introduce the conjugate prior for this model class; (3) The sufficient statistics are the pairwise marginal probabilities that item i is preferred to item j. Preliminary experiments confirm the theoretical predictions and compare the new algorithm and existing heuristics.

UAI Conference 2006 Conference Paper

Non-Minimal Triangulations for Mixed Stochastic/Deterministic Graphical Models

Chris D. Bartels
Jeff A. Bilmes

We observe that certain large-clique graph triangulations can be useful to reduce both computational and space requirements when making queries on mixed stochastic/deterministic graphical models. We demonstrate that many of these large-clique triangulations are non-minimal and are thus unattainable via the variable elimination algorithm. We introduce ancestral pairs as the basis for novel triangulation heuristics and prove that no more than the addition of edges between ancestral pairs need be considered when searching for state space optimal triangulations in such graphs. Empirical results on random and real world graphs show that the resulting triangulations that yield significant speedups are almost always non-minimal. We also give an algorithm and correctness proof for determining if a triangulation can be obtained via elimination, and we show that the decision problem associated with finding optimal state space triangulations in this mixed stochastic/deterministic setting is NP-complete.

UAI Conference 2006 Conference Paper

Recognizing Activities and Spatial Context Using Wearable Sensors

Amar Subramanya
Alvin Raj
Jeff A. Bilmes
Dieter Fox

We introduce a new dynamic model with the capability of recognizing both activities that an individual is performing as well as where that ndividual is located. Our model is novel in that it utilizes a dynamic graphical model to jointly estimate both activity and spatial context over time based on the simultaneous use of asynchronous observations consisting of GPS measurements, and measurements from a small mountable sensor board. Joint inference is quite desirable as it has the ability to improve accuracy of the model. A key goal, however, in designing our overall system is to be able to perform accurate inference decisions while minimizing the amount of hardware an individual must wear. This minimization leads to greater comfort and flexibility, decreased power requirements and therefore increased battery life, and reduced cost. We show results indicating that our joint measurement model outperforms measurements from either the sensor board or GPS alone, using two types of probabilistic inference procedures, namely particle filtering and pruned exact inference.

UAI Conference 2005 Conference Paper

A Submodular-supermodular Procedure with Applications to Discriminative Structure Learning

Mukund Narasimhan
Jeff A. Bilmes

In this paper, we present an algorithm for minimizing the difference between two submodular functions using a variational framework which is based on (an extension of) the concave-convex procedure [17]. Because several commonly used metrics in machine learning, like mutual information and conditional mutual information, are submodular, the problem of minimizing the difference of two submodular problems arises naturally in many machine learning applications. Two such applications are learning discriminatively structured graphical models and feature selection under computational complexity constraints. A commonly used metric for measuring discriminative capacity is the EAR measure which is the difference between two conditional mutual information terms. Feature selection taking complexity considerations into account also fall into this framework because both the information that a set of features provide and the cost of computing and using the features can be modeled as submodular functions. This problem is NP-hard, and we give a polynomial time heuristic for it. We also present results on synthetic data to show that classifiers based on discriminative graphical models using this algorithm can significantly outperform classifiers based on generative graphical models.

ICML Conference 2005 Conference Paper

Discriminative versus generative parameter and structure learning of Bayesian network classifiers

Franz Pernkopf
Jeff A. Bilmes

UAI Conference 2004 Conference Paper

PAC-learning Bounded Tree-width Graphical Models

Mukund Narasimhan
Jeff A. Bilmes

We show that the class of strongly connected graphical models with treewidth at most k can be properly efficiently PAC-learnt with respect to the Kullback-Leibler Divergence. Previous approaches to this problem, such as those of Chow ([1]), and Ho gen ([7]) have shown that this class is PAC-learnable by reducing it to a combinatorial optimization problem. However, for k > 1, this problem is NP-complete ([15]), and so unless P=NP, these approaches will take exponential amounts of time. Our approach differs significantly from these, in that it first attempts to find approximate conditional independencies by solving (polynomially many) submodular optimization problems, and then using a dynamic programming formulation to combine the approximate conditional independence information to derive a graphical model with underlying graph of the tree-width specified. This gives us an efficient (polynomial time in the number of random variables) PAC-learning algorithm which requires only polynomial number of samples of the true distribution, and only polynomial running time.

UAI Conference 2003 Conference Paper

On Triangulating Dynamic Graphical Models

Jeff A. Bilmes
Chris D. Bartels

This paper introduces new methodology to triangulate dynamic Bayesian networks (DBNs) and dynamic graphical models (DGMs). While most methods to triangulate such networks use some form of constrained elimination scheme based on properties of the underlying directed graph, we find it useful to view triangulation and elimination using properties only of the resulting undirected graph, obtained after the moralization step. We first briefly introduce the Graphical model toolkit (GMTK) and its notion of dynamic graphical models, one that slightly extends the standard notion of a DBN. We next introduce the ``boundary algorithm', a method to find the best boundary between partitions in a dynamic model. We find that using this algorithm, the notions of forward- and backward-interface become moot --- namely, the size and fill-in of the best forward- and backward- interface are identical. Moreover, we observe that finding a good partition boundary allows for constrained elimination orders (and therefore graph triangulations) that are not possible using standard slice-by-slice constrained eliminations. More interestingly, with certain boundaries it is possible to obtain constrained elimination schemes that lie outside the space of possible triangulations using only unconstrained elimination. Lastly, we report triangulation results on invented graphs, standard DBNs from the literature, novel DBNs used in speech recognition research systems, and also random graphs. Using a number of different triangulation quality measures (max clique size, state-space, etc.), we find that with our boundary algorithm the triangulation quality can dramatically improve

UAI Conference 2000 Conference Paper

Dynamic Bayesian Multinets

Jeff A. Bilmes

In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and class-conditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters.