New Techniques in Intelligent Information Filtering
180 pages. Available in: PostScript (835 KB, compressed) and PDF (2.37 MB)
Ph.D. dissertation.
Department of Computer Science,
Rutgers University,
New Brunswick, NJ. January 2003.
Abstract
Intelligent Information Filtering is the process of receiving or
monitoring large amounts of dynamically generated information and
extracting the subset of information that would be of interest to a
user based on some specified information need. Historically, this
need has been based on user profiles that are directly evaluable---the
information can be immediately classified as interesting or not. In
this thesis I introduce a new type of user interestingness criterion
which is {\em prospective}---the criterion defines the interestingness
of an information item based on events that happen {\em subsequent} to
the information item appearing. Hence, the interestingness cannot be
directly evaluated. A new technique is described which takes such a
criterion and {\em operationalizes} it, using machine learning to
generate a predictive model that can directly evaluate a piece of
information. On five information filtering case studies, I show that
this technique performs statistically significantly better than a
baseline that predicts based on the class distribution.
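The labeling step behind a prospective criterion can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: the `Item` type, the `label_prospectively` helper, the `horizon` parameter, and the sample items and event times are all hypothetical. The idea is that historical items are labeled by looking at what happened after each one arrived, and the resulting labeled set can then be handed to any machine learning method to produce a model that evaluates new items directly.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    time: float  # arrival time, in arbitrary units


def label_prospectively(items, event_times, horizon):
    """Label an item True if some event occurs within `horizon` after it arrives.

    Assumption: "interesting" means "followed soon by an event of interest".
    The label is only known in hindsight, which is why a model must be learned.
    """
    labeled = []
    for item in items:
        interesting = any(
            item.time < t <= item.time + horizon for t in event_times
        )
        labeled.append((item.text, interesting))
    return labeled


# Hypothetical historical data: three items and two later events.
items = [
    Item("earnings warning issued", 1.0),
    Item("routine office opening", 5.0),
    Item("merger talks rumored", 9.0),
]
event_times = [2.5, 10.0]

training_set = label_prospectively(items, event_times, horizon=2.0)
for text, label in training_set:
    print(f"{label!s:5}  {text}")
```

A text classifier trained on `training_set` would then score a newly arrived item without waiting for the subsequent events.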
However, a predictive model is only as good as the trust that its user
puts in it. Many predictive models are {\em opaque}, in the sense
that they are not easily understood by, or explained to, a human. Thus, I
introduce a technique for taking an opaque model and generating a
small set of rules that attempt to replicate its performance. I show
that the rules generated by my technique on the five case studies are
plausible representations of the predictive model and help explain how
it works.
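The general idea of approximating an opaque model with rules can be illustrated in its simplest form: query the model as a black box, then search for a rule that reproduces its answers. The sketch below is an assumption-laden toy, not the technique developed in the thesis. The `opaque_model` stand-in, its weights, the sample documents, and the restriction to a single one-word rule are all hypothetical. Fidelity here means the fraction of documents on which the rule agrees with the model.

```python
def opaque_model(words):
    """Stand-in for an opaque model: a hypothetical weighted vote over words."""
    weights = {"merger": 2.0, "rumor": 0.3, "routine": -1.5}
    return sum(weights.get(w, 0.0) for w in words) > 0.4


def best_single_word_rule(docs, predict):
    """Find the word whose presence best reproduces the model's predictions.

    The rule has the form "predict interesting if the word occurs".
    Returns the word and its fidelity to the model on `docs`.
    """
    targets = [predict(d) for d in docs]
    vocab = sorted({w for d in docs for w in d})
    best_word, best_fidelity = None, -1.0
    for w in vocab:
        agree = sum((w in d) == t for d, t in zip(docs, targets))
        fidelity = agree / len(docs)
        if fidelity > best_fidelity:
            best_word, best_fidelity = w, fidelity
    return best_word, best_fidelity


# Hypothetical documents, each represented as a set of words.
docs = [
    {"merger", "rumor"},
    {"merger", "routine"},
    {"routine", "filing"},
    {"rumor", "filing"},
]

rule_word, fidelity = best_single_word_rule(docs, opaque_model)
print(f"IF '{rule_word}' in document THEN interesting (fidelity {fidelity:.2f})")
```

A realistic rule set would combine several conditions and be judged on held-out data, but the fidelity measure is the same.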
Finally, since many information filtering tasks involve primarily
text, I have developed a new technique for enabling the use of
numerical features in text classification---a technology often used to
generate predictive models for information filtering. This technique
converts a number into a bag of tokens such that numbers close to each
other have high overlap in the tokens and numbers far apart do not. I
show that this approach improves performance significantly over using
only text and, further, that this approach is competitive with
state-of-the-art numerical classifiers such as C4.5 and Ripper on pure
numerical classification problems that do not even involve text.
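One way to obtain the overlap property described above is to emit one token per threshold, recording which side of the threshold the number falls on. This is only an illustrative encoding under stated assumptions, and the abstract does not say whether the thesis uses this scheme. The function names, the token format, and the threshold list are hypothetical. Two numbers then differ only in the tokens for thresholds that lie between them, so nearby numbers share most tokens and distant numbers share few.

```python
def number_to_tokens(value, thresholds, name="num"):
    """Encode `value` as a bag of tokens, one per threshold.

    Each token records whether the value is above ("gt") or at/below ("le")
    the threshold, so the bag can be mixed with ordinary word tokens.
    """
    tokens = set()
    for t in thresholds:
        side = "gt" if value > t else "le"
        tokens.add(f"{name}_{side}_{t}")
    return tokens


def overlap(a, b):
    """Number of tokens two bags have in common."""
    return len(a & b)


# Hypothetical thresholds; in practice they might be chosen from training data.
thresholds = [10, 20, 30, 40, 50, 60, 70, 80, 90]

near = overlap(number_to_tokens(42, thresholds), number_to_tokens(47, thresholds))
mid = overlap(number_to_tokens(42, thresholds), number_to_tokens(68, thresholds))
far = overlap(number_to_tokens(5, thresholds), number_to_tokens(95, thresholds))
print(near, mid, far)
```

With these thresholds, 42 and 47 share all nine tokens, 42 and 68 share seven, and 5 and 95 share none.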