Abstract
Many real-world problems involve both text-valued and
numerical-valued features. For example, in email classification, it
is possible to use instance representations that consider not only the
text of each message, but also numerical-valued features such as the
length of the message or the time of day at which it was sent.
Text-classification methods have thus far not easily incorporated
numerical features. In earlier work we described an approach for
converting numerical features into bags of tokens so that
text-classification methods can be applied to numerical classification
problems, and showed that the resulting learning methods are
competitive with traditional numerical classification methods. In
this paper we use that conversion to learn on problems that involve a
combination of text and numbers. We show that the resulting approach
outperforms competing methods. Further, we show that selecting the best
classification method using text-only features and then adding
numerical features to the problem (as might happen if numerical
features are only later added to a pre-existing text-classification
problem) gives performance that rivals the more time-consuming approach
of evaluating all classification methods using the full set of both
text and numerical features.