Abstract
We address the problem of comparing the performance of classifiers.
In this paper we study techniques for generating and evaluating
confidence bands on ROC curves. Historically this has been done using
one-dimensional confidence intervals by freezing one variable: either
the false-positive rate or the threshold on the classification scoring
function. We adapt two prior methods and introduce a new radial sweep
method to generate confidence bands. We show, through empirical
studies, that the resulting bands are too tight, and we introduce a general
optimization methodology for creating bands that better fit the data,
as well as methods for evaluating confidence bands. We show
empirically that the optimized confidence bands fit much better and
that, using our new evaluation method, it is possible to gauge the
relative fit of different confidence bands.
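
To make the historical one-dimensional approach concrete, the following is a minimal sketch of pointwise intervals obtained by freezing the false-positive rate and examining the spread of the true-positive rate at each frozen value. It is an illustration under stated assumptions, not the procedure studied in this paper: the function names `tpr_at_fpr` and `vertical_band`, the stratified bootstrap, and the percentile interval are all choices made here for illustration.

```python
import random


def tpr_at_fpr(scores, labels, fpr_target):
    """Highest TPR reachable by a score threshold whose FPR is <= fpr_target.

    labels are 1 (positive) or 0 (negative); both classes must be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(pairs):
        # Instances with tied scores cross the threshold together.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > fpr_target:
            break
        best_tpr = tp / n_pos
        i = j
    return best_tpr


def vertical_band(scores, labels, fpr_grid, n_boot=500, alpha=0.05, seed=0):
    """Pointwise (1 - alpha) percentile intervals on TPR at each frozen FPR.

    Assumption: a stratified bootstrap (positives and negatives resampled
    separately) stands in for whatever resampling scheme is actually used.
    Returns a list of (fpr, lower, upper) tuples.
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    samples = {f: [] for f in fpr_grid}
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        boot_scores = boot_pos + boot_neg
        boot_labels = [1] * len(boot_pos) + [0] * len(boot_neg)
        for f in fpr_grid:
            samples[f].append(tpr_at_fpr(boot_scores, boot_labels, f))
    band = []
    for f in fpr_grid:
        values = sorted(samples[f])
        lower = values[int((alpha / 2) * (n_boot - 1))]
        upper = values[int((1 - alpha / 2) * (n_boot - 1))]
        band.append((f, lower, upper))
    return band
```

Each interval produced this way is valid only at its own frozen false-positive rate. Joining the pointwise intervals across the grid does not by itself guarantee simultaneous coverage of the whole curve, which is one way a band built from them can turn out too tight.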