Abstract
Consider a supervised learning problem in which examples contain both
numerical- and text-valued features. To use traditional
feature-vector-based learning methods, one could treat the presence or
absence of a word as a Boolean feature and use these binary-valued
features together with the numerical features. However, applying a
text-classification system to such data is more problematic: in
the most straightforward approach, each number would be considered a
distinct token and treated as a word. This paper presents an
alternative approach to using text classification methods for
supervised learning problems with numerical-valued features: the
numerical features are converted into bag-of-words features,
thereby making them directly usable by text classification methods.
We show that, even on purely numerical-valued data, text
classification on the derived text-like representation
outperforms the more naive numbers-as-tokens representation and, more
importantly, is competitive with mature numerical classification
methods such as C4.5 and Ripper.
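
To make the contrast between the two representations concrete, the following
Python sketch may help. It is illustrative only: the abstract does not say how
numbers are converted into bag-of-words features, so the threshold-based
encoding, the function names, the token format, and the cut points below are
all assumptions rather than the method of the paper.

```python
def numbers_as_tokens(example):
    """Naive representation: each number becomes a distinct token (a "word")."""
    return [str(value) for _, value in sorted(example.items())]


def threshold_tokens(example, cut_points):
    """Hypothetical bag-of-words encoding (an assumption, not the paper's method).

    For every cut point of a feature, emit one token recording on which
    side of the cut the value falls, so nearby values share most tokens.
    """
    tokens = []
    for name, value in sorted(example.items()):
        for cut in cut_points[name]:
            side = "le" if value <= cut else "gt"
            tokens.append(f"{name}_{side}_{cut}")
    return tokens


# Hypothetical example and cut points, chosen only for illustration.
example = {"age": 37, "income": 52000.0}
cut_points = {"age": [30, 50], "income": [40000]}

print(numbers_as_tokens(example))
print(threshold_tokens(example, cut_points))
```

Under the naive scheme, the values 37 and 38 yield unrelated tokens, whereas
under a threshold-style scheme they would share every token, which is one
plausible reason a text classifier could benefit from such a representation.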