Matthew Michelson,
Sofus A. Macskassy,
and Steve N. Minton
Abstract
One common framework for data integration in practice is federated
search. Here an agent queries disjoint sources simultaneously, and
then clusters the returned records in the absence of unique
keys. However, formulating the correct queries for the sources can be
challenging because the same value may appear in different forms across sources. For
instance, some sources may contain a first name as "John" while other
sources use the name "Jonathan" for the same person. If the underlying
sources do not support sophisticated matching, then a single query of
"John" will miss many records from the "Jonathan" sources. This paper
presents an approach to formulating queries for federated search that
leverages automatically discovered transformations such as synonyms
and abbreviations to create the set of possible queries for the given
sources. Our preliminary results demonstrate that
transformations mined from a subset of sources do apply to a new,
distinct source, thereby allowing query expansions based on the
discovered transformations.
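
As a rough illustration of the query expansion idea described above, the following minimal sketch generates the set of candidate queries from a table of transformations. The table contents, the name TRANSFORMATIONS, and the function expand_query are hypothetical and chosen only for illustration; the sketch assumes the transformations have already been discovered, treats them as one-directional token substitutions, and does not reproduce the mining step or the authors' implementation.

```python
from itertools import product

# Hypothetical transformation table (assumption): each token maps to the
# variants that some source might use for the same value. In the approach
# described above these would be discovered automatically, not hand-written.
TRANSFORMATIONS = {
    "john": {"jonathan"},
    "street": {"st"},
}


def expand_query(query, transformations):
    """Return every query formed by swapping tokens for their known variants."""
    options = []
    for token in query.lower().split():
        # Keep the original token and add any known variants of it.
        variants = {token} | transformations.get(token, set())
        options.append(sorted(variants))
    # Cartesian product over the per-token choices yields all combinations.
    return [" ".join(combo) for combo in product(*options)]
```

Under these assumptions, expand_query("John Smith", TRANSFORMATIONS) yields both "john smith" and "jonathan smith", so a source that stores only "Jonathan" would still be reached. The number of generated queries grows multiplicatively with the number of variants per token, so a practical system would likely need to bound or rank the expansions.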