Author name cluster

Sofus A. Macskassy

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

2 author rows

JMLR Journal 2017 Journal Article

Joint Label Inference in Networks

Deepayan Chakrabarti
Stanislav Funiak
Jonathan Chang
Sofus A. Macskassy

We consider the problem of inferring node labels in a partially labeled graph where each node in the graph has multiple label types and each label type has a large number of possible labels. Our primary example, and the focus of this paper, is the joint inference of label types such as hometown, current city, and employers for people connected by a social network; by predicting these user profile fields, the network can provide a better experience to its users. Existing approaches such as Label Propagation (Zhu et al., 2003) fail to consider interactions between the label types. Our proposed method, called EDGEEXPLAIN, explicitly models these interactions, while still allowing scalable inference under a distributed message-passing architecture. On a large subset of the Facebook social network, collected in a previous study (Chakrabarti et al., 2014), EDGEEXPLAIN outperforms label propagation for several label types, with lifts of up to 120% for recall@1 and 60% for recall@3.

ICML Conference 2014 Conference Paper

Joint Inference of Multiple Label Types in Large Networks

Deepayan Chakrabarti
Stanislav Funiak
Jonathan Chang
Sofus A. Macskassy

We tackle the problem of inferring node labels in a partially labeled graph where each node in the graph has multiple label types and each label type has a large number of possible labels. Our primary example, and the focus of this paper, is the joint inference of label types such as hometown, current city, and employers, for users connected by a social network. Standard label propagation fails to consider the properties of the label types and the interactions between them. Our proposed method, called EdgeExplain, explicitly models these, while still enabling scalable inference under a distributed message-passing architecture. On a billion-node subset of the Facebook social network, EdgeExplain significantly outperforms label propagation for several label types, with lifts of up to 120% for recall@1 and 60% for recall@3.

ICML Conference 2009 Conference Paper

Workshop summary: The fourth workshop on evaluation methods for machine learning

Chris Drummond
Nathalie Japkowicz
William Klement
Sofus A. Macskassy

JMLR Journal 2007 Journal Article

Classification in Networked Data: A Toolkit and a Univariate Case Study

Sofus A. Macskassy
Foster Provost

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no prior work has systematically evaluated the power of class linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well, well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes: Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection.
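The univariate setting described in the abstract above (only links and some known class labels) can be illustrated with a weighted-vote, relational-neighbor style classifier. The following is a minimal sketch under stated assumptions, not NetKit's API: the function name `relational_neighbor`, the edge format `(u, v, weight)`, and the fixed iteration count are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def relational_neighbor(edges, known, classes, iterations=20):
    """Estimate class scores for unlabeled nodes using links and known labels only.

    edges:   iterable of (u, v, weight) tuples, treated as undirected (assumed format)
    known:   dict mapping node -> class label for the initially labeled nodes
    classes: list of all class labels
    """
    # Build an undirected, weighted adjacency map.
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w

    # Labeled nodes are clamped to their class; the rest start uniform.
    uniform = {c: 1.0 / len(classes) for c in classes}
    scores = {}
    for node in neighbors:
        if node in known:
            scores[node] = {c: 1.0 if c == known[node] else 0.0 for c in classes}
        else:
            scores[node] = dict(uniform)

    # Repeatedly replace each unlabeled node's scores with the
    # weighted average of its neighbors' current scores.
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in known:
                updated[node] = scores[node]
                continue
            total = sum(nbrs.values())
            updated[node] = {
                c: sum(w * scores[n][c] for n, w in nbrs.items()) / total
                for c in classes
            }
        scores = updated
    return scores


# Usage: a four-node chain a - b - c - d with only the endpoints labeled.
chain = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0)]
result = relational_neighbor(chain, {"a": "x", "d": "y"}, ["x", "y"])
```

In this toy chain, node `b` leans toward the label of its labeled neighbor `a`, and node `c` toward that of `d`. No node attributes are used at any point, which is what makes the setting univariate.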

AAAI Conference 2007 Conference Paper

Improving Learning in Networked Data by Combining Explicit and Mined Links

Sofus A. Macskassy

ICML Conference 2005 Conference Paper

ROC confidence bands: an empirical evaluation

Sofus A. Macskassy
Foster J. Provost
Saharon Rosset

AIJ Journal 2003 Journal Article

Converting numerical classification into text classification

Sofus A. Macskassy
Haym Hirsh
Arunava Banerjee
Aynur A. Dayanik

Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, the use of a text-classification system on this is a bit more problematic: in the most straightforward approach each number would be considered a distinct token and treated as a word. This paper presents an alternative approach for the use of text classification methods for supervised learning problems with numerical-valued features in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text classification methods. We show that even on purely numerical-valued data the results of text classification on the derived text-like representation outperform the more naive numbers-as-tokens representation and, more importantly, are competitive with mature numerical classification methods such as C4.5, Ripper, and SVM. We further show that on mixed-mode data adding numerical features using our approach can improve performance over not adding those features.
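The abstract above does not say how numbers are turned into bag-of-words features. As a rough illustration only, the sketch below assumes a threshold-based encoding in which each numeric value emits one token per cut point; the function names, the token format, and the thresholds are hypothetical and are not taken from the paper.

```python
def numeric_to_tokens(name, value, thresholds):
    """Turn one numeric feature value into word-like tokens, one per threshold.

    Assumed encoding: for each cut point t, emit "<name>_gt_<t>" if value > t,
    otherwise "<name>_le_<t>".
    """
    tokens = []
    for t in sorted(thresholds):
        side = "gt" if value > t else "le"
        tokens.append(f"{name}_{side}_{t}")
    return tokens


def example_to_bag(numeric_features, thresholds_by_feature, text=""):
    """Combine an example's words with the tokens derived from its numeric features."""
    bag = text.split()
    for name, value in numeric_features.items():
        bag.extend(numeric_to_tokens(name, value, thresholds_by_feature[name]))
    return bag


# Usage: one mixed-mode example with a text field and a numeric field.
bag = example_to_bag({"age": 30}, {"age": [18, 65]}, text="likes hiking")
# bag == ["likes", "hiking", "age_gt_18", "age_le_65"]
```

Under this assumed encoding, nearby values such as 30 and 31 produce identical tokens, whereas the numbers-as-tokens representation would treat "30" and "31" as unrelated words.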
