Abstract
This paper addresses the construction and evaluation of
pointwise confidence bounds on an ROC curve. We describe four
confidence-bound methods, two from the medical field and two used
previously in machine learning research. We evaluate whether the
bounds indeed contain the relevant operating point on the "true" ROC
curve with a confidence of 1-delta. We then evaluate pointwise
confidence bounds on the region where the future performance of a
model is expected to lie. For evaluation we use a synthetic world
representing "binormal" distributions--the classification scores for
positive and negative instances are drawn from (separate) normal
distributions. For the "true-curve" bounds, all methods are
sensitive to how well the distributions are separated, which
corresponds directly to the area under the ROC curve. One method
produces bounds that are universally too loose, another universally
too tight, and the remaining two are close to the desired containment,
although containment breaks down at the extremes of the ROC curve. As
would be expected, all methods fail when used to contain "future"
ROC curves. Widening the bounds to account for the increased
uncertainty yields results qualitatively identical to those of the
"true-curve" evaluation. We conclude by recommending a simple, very efficient
method (vertical averaging) for large sample sizes and a more
computationally expensive method (kernel estimation) for small sample
sizes.
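
To make the "binormal" setting concrete, the following is a minimal sketch in Python of a synthetic world of the kind described above. It is an illustration under stated assumptions, not the experimental code behind the results: the function names (binormal_auc, true_tpr_at_fpr, empirical_auc), the means and standard deviations, and the sample size of 500 are all hypothetical choices. It uses the standard closed form for the area under a binormal ROC curve, AUC = Phi((mu_pos - mu_neg) / sqrt(sigma_neg^2 + sigma_pos^2)), which shows how separation between the two score distributions maps directly to AUC.

```python
import math
import random
from statistics import NormalDist


def binormal_auc(mu_neg, sigma_neg, mu_pos, sigma_pos):
    """Closed-form AUC when negative and positive scores are both normal."""
    separation = (mu_pos - mu_neg) / math.hypot(sigma_neg, sigma_pos)
    return NormalDist().cdf(separation)


def true_tpr_at_fpr(fpr, mu_neg, sigma_neg, mu_pos, sigma_pos):
    """TPR on the "true" binormal ROC curve at a given FPR in (0, 1)."""
    # Threshold t chosen so that P(negative score > t) equals fpr.
    threshold = NormalDist(mu_neg, sigma_neg).inv_cdf(1.0 - fpr)
    return 1.0 - NormalDist(mu_pos, sigma_pos).cdf(threshold)


def empirical_auc(neg_scores, pos_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative parameters only (assumed, not taken from the paper).
rng = random.Random(0)
neg = [rng.gauss(0.0, 1.0) for _ in range(500)]
pos = [rng.gauss(1.0, 1.0) for _ in range(500)]

print("true AUC:     ", binormal_auc(0.0, 1.0, 1.0, 1.0))
print("empirical AUC:", empirical_auc(neg, pos))
print("true TPR at FPR=0.1:", true_tpr_at_fpr(0.1, 0.0, 1.0, 1.0, 1.0))
```

Because the "true" curve is known in closed form here, a confidence-bound method can be checked by repeating the sampling step many times and counting how often the bounds computed from each sample contain true_tpr_at_fpr at the operating point of interest.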