Classification in Networked Data:
A toolkit and a univariate case study
CeDER Working Paper CeDER-04-08, Stern School of Business, New York
University, New York, NY 10012. Journal version accepted to the
Journal of Machine Learning Research. (Updated December 2006)
Abstract
This paper is about classifying entities that are interlinked with
entities for which the class is known. After surveying prior work, we
present NetKit, a modular toolkit for classification in networked
data, and a case study of its application to networked data used in
prior machine learning research. NetKit is based on a node-centric
framework in which classifiers comprise a local classifier, a
relational classifier, and a collective inference procedure. Various
existing node-centric relational learning algorithms can be
instantiated with appropriate choices for these components, and new
combinations of components realize new algorithms. The case study
focuses on univariate network classification, for which the only
information used is the structure of class linkage in the network
(i.e., only links and some class labels). To our knowledge, no prior
work has systematically evaluated the power of class linkage alone
for classification in machine learning benchmark data sets. The
results demonstrate that very simple network-classification models
perform quite well, well enough that they should be used regularly as
baseline classifiers for studies of learning with networked data. The
simplest method (which performs remarkably well) highlights the close
correspondence between several existing methods introduced for
different purposes, namely Gaussian-field classifiers, Hopfield
networks, and relational-neighbor classifiers. The case study also
shows that two sets of techniques are preferable in different
situations, depending on whether few or many labels are known
initially. We also demonstrate that link selection plays an important
role similar to traditional feature selection.
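
To make the three-component, node-centric framework concrete, the following is a minimal Python sketch of univariate network classification. It is not NetKit code. It assumes a class-prior local classifier, a weighted-vote relational-neighbor classifier, and a simple synchronous iterative update as the collective inference procedure. The function name `wvrn_classify`, its parameters, and the toy graph are hypothetical, and edge weights are assumed to be positive.

```python
from collections import defaultdict


def wvrn_classify(edges, known, classes, n_iter=50):
    """Estimate class probabilities from links and a few known labels only.

    edges:   iterable of (u, v, weight) tuples, weight > 0 (assumed undirected)
    known:   dict mapping node -> known class label
    classes: list of all class labels
    """
    # Build an undirected weighted adjacency map.
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    # Local classifier (assumed here): the class prior among known labels.
    counts = {c: 0 for c in classes}
    for label in known.values():
        counts[label] += 1
    total = sum(counts.values())
    prior = {c: counts[c] / total for c in classes}

    # Known nodes are clamped to their label; unknown nodes start at the prior.
    probs = {}
    for node in adj:
        if node in known:
            probs[node] = {c: 1.0 if c == known[node] else 0.0 for c in classes}
        else:
            probs[node] = dict(prior)

    unknown = [n for n in adj if n not in known]

    # Collective inference (assumed here): repeated synchronous updates.
    for _ in range(n_iter):
        updated = {}
        for node in unknown:
            z = sum(adj[node].values())
            # Relational classifier: weighted average of neighbor estimates.
            updated[node] = {
                c: sum(w * probs[nb][c] for nb, w in adj[node].items()) / z
                for c in classes
            }
        probs.update(updated)
    return probs


# Toy path graph a - b - c - d with only the two end nodes labeled.
toy_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0)]
toy_probs = wvrn_classify(toy_edges, {"a": "x", "d": "y"}, ["x", "y"])
```

On this toy graph the estimate for each unlabeled node leans toward the label of the nearer labeled end, which is the kind of class-linkage signal the case study evaluates. Swapping in a different local classifier, relational classifier, or inference loop would give a different member of the same family.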