Arrow Research search

Author name cluster

Eran Segal

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

ICLR Conference 2025 Conference Paper

Causal Representation Learning from Multimodal Biomedical Observations

  • Yuewen Sun
  • Lingjing Kong
  • Guangyi Chen 0002
  • Loka Li
  • Gongxu Luo
  • Zijian Li 0001
  • Yixuan Zhang 0001
  • Yujia Zheng 0001

Prevalent in biomedical applications (e.g., human phenotype research), multimodal datasets can provide valuable insights into the underlying physiological mechanisms. However, current machine learning (ML) models designed to analyze these datasets often lack interpretability and identifiability guarantees, which are essential for biomedical research. Recent advances in causal representation learning have shown promise in identifying interpretable latent causal variables with formal theoretical guarantees. Unfortunately, most current work on multimodal distributions either relies on restrictive parametric assumptions or yields only coarse identification results, limiting their applicability to biomedical research that favors a detailed understanding of the mechanisms. In this work, we aim to develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. Theoretically, we consider a nonparametric latent distribution (cf. the parametric assumptions in previous work) that allows for causal relationships across potentially different modalities. We establish identifiability guarantees for each latent component, extending the subspace identification results from previous work. Our key theoretical contribution is the structural sparsity of causal connections between modalities, which, as we will discuss, is natural for a large collection of biomedical systems. Empirically, we present a practical framework to instantiate our theoretical insights. We demonstrate the effectiveness of our approach through extensive experiments on both numerical and synthetic datasets. Results on a real-world human phenotype dataset are consistent with established biomedical research, validating our theoretical and methodological framework.

NeurIPS Conference 2025 Conference Paper

Decoding Causal Structure: End-to-End Mediation Pathways Inference

  • Yulong Li
  • Xiwei Liu
  • Feilong Tang
  • Ming Hu
  • Jionglong Su
  • Zongyuan Ge
  • Imran Razzak
  • Eran Segal

Causal mediation analysis is crucial for deconstructing complex mechanisms of action. However, in current mediation analysis, complex structures derived from causal discovery lack direct interpretation of mediation pathways, while traditional mediation analysis and effect estimation are limited by the reliance on pre-specified pathways, leading to a disconnection between structure discovery and causal mechanism understanding. A unified framework is therefore needed that integrates structure discovery, pathway identification, and effect estimation, systematically quantifies mediation pathways under structural uncertainty, and enables their automated identification and inference. To this end, we propose Structure-Informed Guided Mediation Analysis (SIGMA), which guides automated mediation pathway identification through probabilistic causal structure discovery and uncertainty quantification, enabling end-to-end propagation of structural uncertainty from structure learning to effect estimation. Specifically, SIGMA employs differentiable Flow-Structural Equation Models to learn structural posteriors, generating diverse Directed Acyclic Graphs (DAGs) to quantify structural uncertainty. Based on these DAGs, we introduce the Path Stability Score to evaluate the marginal probability of pathways, identifying high-confidence mediation paths. For identified mediation pathways, we integrate Efficient Influence Functions with Bayesian model averaging to fuse within-structure estimation uncertainty and between-structure effect variation, propagating uncertainty to the final effect estimates. In synthetic data experiments, SIGMA achieves state-of-the-art performance in pathway identification accuracy and effect quantification precision under structural uncertainty, concurrent multiple pathways, and nonlinear scenarios.
In real-world applications using Human Phenotype Project data, SIGMA identifies mediation effects of sleep quality on cardiovascular health through inflammatory and metabolic pathways, uncovering previously unspecified multiple mediation paths.
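The Path Stability Score described in the abstract can be sketched as a Monte Carlo estimate over posterior DAG samples. The function name, the toy three-node posterior, and all numbers below are illustrative assumptions, not SIGMA's actual implementation:

```python
import numpy as np

def path_stability_score(dag_samples, path):
    """Fraction of sampled DAGs (adjacency matrices) that contain every
    edge of `path`; e.g. path (T, M, Y) requires edges T->M and M->Y.
    This is a Monte Carlo estimate of the pathway's marginal probability."""
    edges = list(zip(path[:-1], path[1:]))
    hits = sum(all(A[i, j] for i, j in edges) for A in dag_samples)
    return hits / len(dag_samples)

# Toy structural posterior: 100 sampled 3-node DAGs (0 = T, 1 = M, 2 = Y).
rng = np.random.default_rng(0)
samples = []
for _ in range(100):
    A = np.zeros((3, 3), dtype=int)
    A[0, 1] = rng.random() < 0.9   # edge T -> M appears in ~90% of draws
    A[1, 2] = rng.random() < 0.8   # edge M -> Y appears in ~80% of draws
    samples.append(A)

score = path_stability_score(samples, (0, 1, 2))
```

A pathway whose every edge is stable across the sampled DAGs scores near 1 and would be flagged as a high-confidence mediation path.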

NeurIPS Conference 2018 Conference Paper

Regularization Learning Networks: Deep Learning for Tabular Datasets

  • Ira Shavitt
  • Eran Segal

Despite their impressive performance, Deep Neural Networks (DNNs) typically underperform Gradient Boosting Trees (GBTs) on many tabular-dataset learning tasks. We propose that applying a different regularization coefficient to each weight might boost the performance of DNNs by allowing them to make more use of the more relevant inputs. However, this will lead to an intractable number of hyperparameters. Here, we introduce Regularization Learning Networks (RLNs), which overcome this challenge by introducing an efficient hyperparameter tuning scheme which minimizes a new Counterfactual Loss. Our results show that RLNs significantly improve DNNs on tabular datasets, and achieve comparable results to GBTs, with the best performance achieved with an ensemble that combines GBTs and RLNs. RLNs produce extremely sparse networks, eliminating up to 99.8% of the network edges and 82% of the input features, thus providing more interpretable models and revealing the importance that the network assigns to different inputs. RLNs could efficiently learn a single network in datasets that comprise both tabular and unstructured data, such as in the setting of medical imaging accompanied by electronic health records. An open source implementation of RLN can be found at https://github.com/irashavitt/regularization_learning_networks.
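The core idea of a separate regularization coefficient per weight can be illustrated with a per-weight L1 soft-threshold (proximal) step. This is only a sketch of the mechanism; RLNs tune the coefficients via the Counterfactual Loss, which is not reproduced here, and all values below are made up:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal step for per-weight L1 penalties lam_i * |w_i|:
    each weight is shrunk by its own regularization coefficient."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.9, 0.9, 0.9])
lam = np.array([0.0, 0.1, 2.0])   # one coefficient per weight
w_new = soft_threshold(w, lam)
# Weights with large coefficients are pruned to exactly zero,
# which is how heavily regularized inputs drop out of the network.
```

With a single global coefficient, all three weights would shrink equally; per-weight coefficients let the network keep relevant inputs untouched while eliminating irrelevant ones.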

JMLR Journal 2005 Journal Article

Learning Module Networks

  • Eran Segal
  • Dana Pe'er
  • Aviv Regev
  • Daphne Koller
  • Nir Friedman

Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper, we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules' composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks.

NeurIPS Conference 2004 Conference Paper

Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

  • Haidong Wang
  • Eran Segal
  • Asa Ben-Hur
  • Daphne Koller
  • Douglas Brutlag

Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on probabilistic relational models that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference procedure for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites.
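The paper's generative model treats an observed interaction as a noisy-OR over motif-pair bindings plus a spurious-binding term. A minimal sketch in that spirit, with hypothetical names and toy affinity values:

```python
def p_interaction(active_i, active_j, affinity, p_spurious):
    """Noisy-OR over motif pairs: two proteins are observed to interact
    unless no active motif pair binds AND no spurious binding occurs.
    `affinity[(a, b)]` is the binding probability of motif pair (a, b)."""
    p_no_bind = 1.0 - p_spurious
    for a in active_i:
        for b in active_j:
            p_no_bind *= 1.0 - affinity.get((a, b), 0.0)
    return 1.0 - p_no_bind

# Toy example: motif 'a' is active in protein 1, 'b' in protein 2,
# and the pair (a, b) has affinity 0.7.
aff = {("a", "b"): 0.7}
p = p_interaction({"a"}, {"b"}, aff, p_spurious=0.05)
```

The spurious term absorbs false positives in the interaction data, while the probabilistic (rather than deterministic) binding of active pairs accounts for false negatives, matching the error model the abstract alludes to.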

UAI Conference 2003 Conference Paper

Learning Module Networks

  • Eran Segal
  • Dana Pe'er
  • Aviv Regev
  • Daphne Koller
  • Nir Friedman

Methods for learning Bayesian network structure can discover dependency structure between observed variables, and have been shown to be useful in many applications. However, in domains that involve a large number of variables, the space of possible network structures is enormous, making it difficult, for both computational and statistical reasons, to identify a good model. In this paper, we consider a solution to this problem, suitable for domains where many variables have similar behavior. Our method is based on a new class of models, which we call module networks. A module network explicitly represents the notion of a module --- a set of variables that have the same parents in the network and share the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns a module network from data. The algorithm learns both the partitioning of the variables into modules and the dependency structure between the variables. We evaluate our algorithm on synthetic data, and on real data in the domains of gene expression and the stock market. Our results show that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks.
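The defining property of a module network (variables in a module share both parents and conditional probability distribution) can be sketched with a toy discrete model. The data structures and numbers below are illustrative, not the paper's representation:

```python
import math

# Each module lists its member variables, a single shared parent set,
# and a single shared CPD (keyed by the parents' joint assignment).
modules = {
    "m1": {"vars": ["g1", "g2", "g3"], "parents": [],
           "cpd": {(): {0: 0.3, 1: 0.7}}},
    "m2": {"vars": ["g4", "g5"], "parents": ["g1"],
           "cpd": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}}},
}

def log_likelihood(sample, modules):
    """Each variable contributes P(x | parents) taken from its module's
    shared CPD, so parameters are pooled across the module's members."""
    ll = 0.0
    for m in modules.values():
        for v in m["vars"]:
            pa = tuple(sample[p] for p in m["parents"])
            ll += math.log(m["cpd"][pa][sample[v]])
    return ll

sample = {"g1": 1, "g2": 0, "g3": 1, "g4": 1, "g5": 0}
ll = log_likelihood(sample, modules)
```

Because g1, g2, and g3 share one CPD (and g4, g5 another), the model has far fewer parameters than a Bayesian network with a separate CPD per variable, which is the statistical advantage the abstract describes.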

NeurIPS Conference 2002 Conference Paper

Boosting Density Estimation

  • Saharon Rosset
  • Eran Segal

Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
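The gradient-boosting view of density estimation can be sketched as repeatedly mixing in a weak density learner fitted to the points the current estimate under-fits. This uses simple 1-D Gaussian weak learners and a fixed mixing weight as stand-ins for the Bayesian-network learners and line search in the paper; everything below is an illustrative assumption:

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def boost_density(data, steps=5, alpha=0.3):
    """Boosting-style density estimation sketch: each round fits a Gaussian
    weak learner to data reweighted by 1/f (the functional-gradient direction
    upweights under-fitted points), then mixes it in with weight alpha."""
    mixture = [(1.0, sum(data) / len(data), 1.0)]   # (weight, mu, sigma)
    f = lambda x: sum(w * normal_pdf(x, m, s) for w, m, s in mixture)
    for _ in range(steps):
        wts = [1.0 / max(f(x), 1e-12) for x in data]
        tot = sum(wts)
        mu = sum(w * x for w, x in zip(wts, data)) / tot
        var = sum(w * (x - mu) ** 2 for w, x in zip(wts, data)) / tot
        # Convex update keeps f a valid density: f <- (1-alpha) f + alpha h.
        mixture = [(w * (1 - alpha), m, s) for w, m, s in mixture]
        mixture.append((alpha, mu, max(math.sqrt(var), 1e-3)))
    return mixture, f

random.seed(0)
data = [random.gauss(-2, 0.5) for _ in range(50)] + [random.gauss(3, 0.5) for _ in range(50)]
mixture, f = boost_density(data)
avg_ll = sum(math.log(f(x)) for x in data) / len(data)
```

The convex combination keeps every intermediate estimate a normalized density, which is what allows likelihood-based convergence arguments of the kind the abstract mentions.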

UAI Conference 2001 Conference Paper

Exact Inference in Networks with Discrete Children of Continuous Parents

  • Uri Lerner
  • Eran Segal
  • Daphne Koller

Many real life domains contain a mixture of discrete and continuous variables and can be modeled as hybrid Bayesian networks. An important subclass of hybrid BNs are conditional linear Gaussian (CLG) networks, where the conditional distribution of the continuous variables given an assignment to the discrete variables is a multivariate Gaussian. Lauritzen's extension to the clique tree algorithm can be used for exact inference in CLG networks. However, many domains also include discrete variables that depend on continuous ones, and CLG networks do not allow such dependencies to be represented. No exact inference algorithm has been proposed for these enhanced CLG networks. In this paper, we generalize Lauritzen's algorithm, providing the first ``exact'' inference algorithm for augmented CLG networks --- networks where continuous nodes are conditional linear Gaussians but that also allow discrete children of continuous parents. Our algorithm is exact in the sense that it computes the exact distributions over the discrete nodes, and the exact first and second moments of the continuous ones, up to the accuracy obtained by the numerical integration used within the algorithm. When the discrete children are modeled with softmax CPDs (as is the case in many real world domains) the approximation of the continuous distributions using the first two moments is particularly accurate. Our algorithm is simple to implement and often comparable in its complexity to Lauritzen's algorithm. We show empirically that it achieves substantially higher accuracy than previous approximate algorithms.
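The core computation the abstract describes, obtaining the discrete child's distribution and the first two posterior moments of its continuous parent by numerical integration, can be sketched for the simplest case: a binary child with a logistic (two-class softmax) CPD and a single Gaussian parent. All parameter values are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior_moments(mu=0.0, sigma=1.0, a=2.0, b=0.0, n=4001, span=8.0):
    """Binary child D of a continuous parent X ~ N(mu, sigma^2) with a
    logistic CPD P(D=1 | x) = sigmoid(a*x + b). Trapezoid-rule integration
    over a grid yields P(D=1) and the first two moments of X | D=1."""
    lo, hi = mu - span * sigma, mu + span * sigma
    h = (hi - lo) / (n - 1)
    z = m1 = m2 = 0.0
    for i in range(n):
        x = lo + i * h
        w = h * (0.5 if i in (0, n - 1) else 1.0)   # trapezoid weights
        p = w * sigmoid(a * x + b) \
            * math.exp(-0.5 * ((x - mu) / sigma) ** 2) \
            / (sigma * math.sqrt(2 * math.pi))
        z += p
        m1 += p * x
        m2 += p * x * x
    return z, m1 / z, m2 / z   # P(D=1), E[X | D=1], E[X^2 | D=1]

pd1, ex, ex2 = posterior_moments()
```

Summarizing the continuous posterior by its first two moments is exactly the approximation the abstract argues is accurate for softmax CPDs; here observing D = 1 shifts the parent's posterior mean upward, as one would expect.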

NeurIPS Conference 2001 Conference Paper

Probabilistic Abstraction Hierarchies

  • Eran Segal
  • Daphne Koller
  • Dirk Ormoneit

Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.
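One of the three quantities the algorithm optimizes, the assignment of instances to classes, can be sketched for a toy taxonomy in which each class carries its own generative model. The tree, the 1-D Gaussian class models, and all names below are illustrative assumptions:

```python
import math

# Toy abstraction hierarchy: each node carries a generative model
# (here a 1-D Gaussian); leaves are the classes instances are assigned to.
taxonomy = {
    "root":    {"mu": 0.0, "sigma": 5.0, "children": ["animals", "plants"]},
    "animals": {"mu": -3.0, "sigma": 1.0, "children": []},
    "plants":  {"mu": 4.0, "sigma": 1.0, "children": []},
}

def log_pdf(x, node):
    mu, s = taxonomy[node]["mu"], taxonomy[node]["sigma"]
    return -0.5 * ((x - mu) / s) ** 2 - math.log(s * math.sqrt(2 * math.pi))

def assign(x):
    """Assign an instance to the leaf class whose model gives it the
    highest likelihood; internal nodes abstract their descendants."""
    leaves = [n for n, d in taxonomy.items() if not d["children"]]
    return max(leaves, key=lambda n: log_pdf(x, n))

label = assign(-2.5)
```

The full algorithm additionally re-estimates the class models and restructures the tree itself using global optimization steps, which this assignment-only sketch omits.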