Classification with Pedigree and its Applicability to Record Linkage
Appears in the Workshop on Text-Mining and Link Analysis (Textlink) at the Twentieth International Joint Conference on Artificial Intelligence, January 7, 2007, Hyderabad, India.

Evan S. Gamble, Sofus A. Macskassy, Steve Minton

Abstract

Real-world data is virtually never noise-free. Current methods handle noise either by removing noisy instances or by trying to clean noisy attributes. Neither approach deals with the noise directly, and removing a noisy instance is not a viable option in many real systems. In this paper, we consider the problem of noise in the context of record linkage, a frequent problem in text mining. We present a new method for dealing with data sources whose noisy attributes reflect the pedigree of the source. Our method, which assumes that the training data is clean and that noise is present only in the test set, extends decision trees to handle noise directly at classification time by changing how the tree is traversed at each node, similar to how current decision trees handle missing values. We test the efficacy of our method on the IMDb movie database, where we classify whether pairs of records refer to the same person. Our results show that handling pedigree directly at classification time dramatically improves performance.
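
The abstract does not spell out how the tree traversal changes, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a hypothetical per-attribute reliability score in [0, 1] standing in for pedigree, and it borrows the fractional-instance treatment commonly used for missing values: a node's test is trusted in proportion to the reliability of the attribute, and the remaining weight is split between both branches according to the training proportions. The names `Node`, `classify`, `reliability`, `left_frac`, and `name_sim` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    # A leaf stores class probabilities in `dist`.
    # An internal node tests `record[attr] <= threshold`.
    attr: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch for value <= threshold
    right: Optional["Node"] = None   # branch for value > threshold
    left_frac: float = 0.5           # fraction of training instances sent left
    dist: Optional[Dict[str, float]] = None


def classify(node: Node, record: Dict[str, float],
             reliability: Dict[str, float]) -> Dict[str, float]:
    """Return class probabilities for `record`.

    `reliability[attr]` is a hypothetical pedigree score: the assumed
    probability that the observed value of `attr` can be trusted.
    Attributes without a score are treated as fully reliable.
    """
    if node.dist is not None:
        return dict(node.dist)

    value = record.get(node.attr)
    # A missing value carries no information, so it gets reliability 0.
    r = 0.0 if value is None else reliability.get(node.attr, 1.0)
    goes_left = value is not None and value <= node.threshold

    # Trust the observed test outcome with weight r; with weight (1 - r),
    # fall back to the training proportions, as is done for missing values.
    w_left = r * (1.0 if goes_left else 0.0) + (1.0 - r) * node.left_frac

    out: Dict[str, float] = {}
    for child, weight in ((node.left, w_left), (node.right, 1.0 - w_left)):
        if weight == 0.0:
            continue  # skip branches that receive no weight
        for label, p in classify(child, record, reliability).items():
            out[label] = out.get(label, 0.0) + weight * p
    return out


# Toy one-node tree for a record-pair comparison (made-up values).
tree = Node(
    attr="name_sim",
    threshold=0.5,
    left=Node(dist={"match": 0.0, "nonmatch": 1.0}),
    right=Node(dist={"match": 1.0, "nonmatch": 0.0}),
    left_frac=0.5,
)
pair = {"name_sim": 0.9}
trusted = classify(tree, pair, {"name_sim": 1.0})
noisy = classify(tree, pair, {"name_sim": 0.6})
```

In this toy setup, a fully reliable attribute sends the pair down a single branch, while a less reliable one spreads part of the weight over both branches, so the prediction becomes less confident rather than being discarded or "cleaned". With reliability 0 the traversal reduces to the usual missing-value behaviour.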