Significance Testing against the Random Model for Scoring Models on Top k Predictions
[postscript]
[pdf]
CeDER Working Paper CeDER-05-09, Stern
School of Business, New York
University, NY, NY 10012.
Abstract
Performance at top k predictions, where instances are ranked by a
(learned) scoring model, has been used as an evaluation metric in
machine learning for various reasons such as where the entire corpus
is unknown ({\em e.g.}, the web) or where the results are to be used
by a person with limited time or resources ({\em e.g.,} ranking
financial news stories where the investor only has time to look at
relatively few stories per day). This evaluation metric is primarily
used to report whether the performance of a given method is
significantly better than other (baseline) methods. It has not,
however, been used to show whether the result is {\em significant}
when compared to the simplest of baselines --- the random model. If
no models outperform the random model at a given confidence interval,
then the results may not be worth reporting. This paper introduces a
technique to perform an analysis of the expected performance of the
top k predictions from the random model given k and a p-value on
an evaluation dataset D. The technique is based on the
realization that the distribution of the number of positives seen in
the top k predictions follows a {\em hypergeometric distribution},
which has well-defined statistical density functions. As this
distribution is discrete, we show that using parametric estimations
based on a binomial distribution are almost always in complete
agreement with the discrete distribution and that, if they differ, an
interpolation of the discrete bounds gets very close to the parametric
estimations. The technique is demonstrated on results from three
prior published works, in which it clearly shows that even though
performance is greatly increased (sometimes over 100%) with respect
to the expected performance of the random model (at p=0.5), these
results, although qualitatively impressive, are not always as
significant (p=0.1) as might be suggested by the impressive
qualitative improvements. The technique is used to show, given k,
both how many positive instances are needed to achieve a specific
significance threshold is as well as how significant a given top k
performance is. The technique when used in a more global setting is
able to identify the crossover points, with respect to k, when a
method becomes significant for a given p. Lastly, the technique is
used to generate a complete confidence curve, which shows a general
trend over all k and visually shows where a method is significantly
better than the random model over all values of k.