Improving Learning in Networked Data by Combining Explicit and Mined
In Proceedings of the Twenty-Second Conference on Artificial
Intelligence (AAAI-2007), July 22-26, 2007, Vancouver, Canada.
Abstract
This paper addresses the use of multiple types of information for
classifying networked data in a semi-supervised setting: given a
fully described network (nodes and edges) with known labels for some
of the nodes, predict the labels of the remaining nodes. One recently
developed method for such inference is the guilt-by-association
model, which has been developed independently in two different
settings: relational learning and semi-supervised learning. In
relational learning, the networked data is assumed to have explicit
links, such as hyperlinks between web pages or citations between
research papers. The semi-supervised setting assumes a corpus
of non-relational data and creates links based on similarity measures
between the instances. Both settings use only the known labels in the
network to predict the remaining labels, but they draw on very
different information sources. The thesis of this paper is that if we
combine these two
types of links, the resulting network will carry more information than
either type of link by itself. We test this thesis on six benchmark
data sets using a within-network learning algorithm and show that
combining the links yields significant improvements in predictive
performance. We describe a principled way of combining
multiple types of edges with different edge-weights and semantics
using an objective graph measure called node-based assortativity. We
investigate the use of this measure to combine text-mined links with
explicit links and show that our approach significantly improves
classifier performance over naively combining these two types of
links.
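
As an illustration of the idea described in the abstract, the following is a minimal sketch of guilt-by-association inference over a network that merges explicit and mined links. It is not the paper's algorithm: the function names (combine_edges, guilt_by_association), the per-type scale factors (w_explicit, w_mined), the toy graph, and the simple iterative neighbor-voting scheme are all assumptions made for this example. In particular, the scale factors are set by hand here, whereas the paper chooses how to combine edge types using node-based assortativity, which this sketch does not implement.

```python
from collections import defaultdict


def combine_edges(explicit, mined, w_explicit=1.0, w_mined=1.0):
    """Merge two undirected weighted edge lists into one adjacency map.

    Each edge is a (u, v, weight) tuple. w_explicit and w_mined are
    hypothetical per-type scale factors; weights of edges present in
    both lists are summed after scaling.
    """
    graph = defaultdict(dict)
    for edges, scale in ((explicit, w_explicit), (mined, w_mined)):
        for u, v, w in edges:
            graph[u][v] = graph[u].get(v, 0.0) + scale * w
            graph[v][u] = graph[v].get(u, 0.0) + scale * w
    return graph


def guilt_by_association(graph, known, classes, iterations=20):
    """Predict labels of unlabeled nodes by weighted neighbor voting.

    Labeled nodes are clamped to their known class; unlabeled nodes
    start from a uniform distribution and are repeatedly replaced by
    the weighted average of their neighbors' class scores.
    """
    uniform = {c: 1.0 / len(classes) for c in classes}
    scores = {}
    for node in graph:
        if node in known:
            scores[node] = {c: float(known[node] == c) for c in classes}
        else:
            scores[node] = dict(uniform)

    for _ in range(iterations):
        updated = {}
        for node in graph:
            if node in known:
                updated[node] = scores[node]
                continue
            totals = {c: 0.0 for c in classes}
            for neighbor, weight in graph[node].items():
                for c in classes:
                    totals[c] += weight * scores[neighbor][c]
            norm = sum(totals.values())
            if norm > 0:
                updated[node] = {c: totals[c] / norm for c in classes}
            else:
                updated[node] = dict(uniform)
        scores = updated

    # Report the highest-scoring class for every unlabeled node.
    return {
        node: max(classes, key=lambda c: scores[node][c])
        for node in graph
        if node not in known
    }


# Toy data (made up): explicit links such as citations, plus links
# mined from text similarity that connect otherwise separate parts.
explicit_links = [("a", "b", 1.0), ("c", "d", 1.0)]
mined_links = [("b", "c", 0.5), ("d", "e", 0.8)]
known_labels = {"a": "x", "e": "y"}

combined = combine_edges(explicit_links, mined_links)
predictions = guilt_by_association(combined, known_labels, ["x", "y"])
print(predictions)
```

In this toy network the explicit links alone leave nodes c and d disconnected from every labeled node, so the known labels can only reach them once the mined links are added. That is the intuition behind combining the two link types.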