Abstract
Real-world data is virtually never noise-free. Current methods for
handling noise do so either by removing noisy instances or by trying
to clean noisy attributes. Neither approach deals directly with the
issue of noise, and in fact removing a noisy instance is not a viable
option in many real systems. In this paper, we consider the problem
of noise in the context of record linkage, a frequent problem in text
mining. We present a new method for dealing with data sources whose
noisy attributes reflect the pedigree of the source. Our method,
which assumes that training data
is clean and that noise is only present in the test set, is an
extension of decision trees which directly handles noise at
classification time by changing how it walks through the tree at the
various nodes, similar to how current trees handle missing values. We
test the efficacy of our method on the IMDb movie database, where we
classify whether pairs of records refer to the same person. Our
results show that handling pedigree directly at classification time
dramatically improves performance.
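
To make the idea of a pedigree-aware tree walk concrete, the following is a minimal sketch and not the algorithm evaluated in this paper. It assumes each attribute carries a hypothetical trust score in [0, 1] derived from its source's pedigree. At a node testing a fully trusted attribute, the walk follows the observed branch as usual. At a node testing a fully untrusted attribute, the value is treated as missing and the branches are weighted by the training-set split fraction, in the style of fractional-instance handling of missing values. Intermediate trust blends the two. All names (Leaf, Node, classify, trust, left_fraction, name_similarity) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Leaf:
    p_match: float  # estimated probability that the record pair matches


@dataclass
class Node:
    attribute: str        # attribute tested at this node
    threshold: float      # go left when value <= threshold
    left: "Tree"
    right: "Tree"
    left_fraction: float  # fraction of (clean) training instances sent left


Tree = Union[Leaf, Node]


def classify(tree: Tree,
             record: Dict[str, Optional[float]],
             trust: Dict[str, float]) -> float:
    """Return P(match), blending branches where an attribute is untrusted.

    `trust` maps attribute name -> assumed reliability in [0, 1];
    attributes absent from the map are treated as fully trusted.
    """
    if isinstance(tree, Leaf):
        return tree.p_match

    p_left = classify(tree.left, record, trust)
    p_right = classify(tree.right, record, trust)

    # Missing-value style estimate: weight both branches by how the
    # training data split at this node.
    prior = tree.left_fraction * p_left + (1.0 - tree.left_fraction) * p_right

    value = record.get(tree.attribute)
    if value is None:
        return prior

    # Estimate obtained by believing the observed value outright.
    observed = p_left if value <= tree.threshold else p_right

    # Blend the two according to how much the source is trusted.
    t = trust.get(tree.attribute, 1.0)
    return t * observed + (1.0 - t) * prior


# Illustrative one-node tree over a hypothetical name-similarity score.
example_tree = Node("name_similarity", 0.8, Leaf(0.1), Leaf(0.9),
                    left_fraction=0.7)
example_record = {"name_similarity": 0.95}

fully_trusted = classify(example_tree, example_record, {})
untrusted = classify(example_tree, example_record, {"name_similarity": 0.0})
half_trusted = classify(example_tree, example_record, {"name_similarity": 0.5})
```

In this sketch a trust of 1 reproduces an ordinary tree walk and a trust of 0 reduces to the missing-value case, so the clean-training-data assumption is preserved: only the traversal at classification time changes, not the learned tree.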