Abstract
Recently, much attention has been given to extracting tables from Web
data. In this problem, the column definitions and tuples (such as which
"company" is headquartered in which "city") are extracted from Web
text, structured Web data such as lists, or results of querying the
deep Web, creating the table of interest. In this paper, we examine
the problem of extracting and discovering multiple tables in a given
domain, generating a truly multi-relational database as output.
Beyond discovering the relations that define single tables, our approach
discovers and leverages "within column" set membership relations, and
discovers relations across the extracted tables (e.g., joins). By
leveraging within-column relations, our method can extract table
instances that are ambiguous or rare, and by discovering joins, our
method generates truly multi-relational output. Further, our approach
uses taxonomic queries to bootstrap the extraction, rather than the
more traditional "seed instances." Creating seeds often requires more
domain knowledge than taxonomic queries, and previous work has shown
that extraction methods may be sensitive to which input seeds they are
given. We test our approach on two real-world domains: NBA basketball
and cancer information. Our results demonstrate that our approach
generates databases of relevant tables from disparate Web information,
and discovers the relations between them. Further, we show that by
leveraging the "within column" relation, our approach can identify a
significant number of relevant tuples that would otherwise be difficult
to extract.