To appear in Proceedings of the Second Workshop on ROC Analysis in ML, at the 22nd International Conference on Machine Learning.

This paper is about constructing and evaluating pointwise confidence bounds on an ROC curve. We describe four confidence-bound methods, two from the medical field and two used previously in machine learning research. We first evaluate whether the bounds contain the relevant operating point on the "true" ROC curve with the nominal confidence of 1-delta. We then evaluate pointwise confidence bounds on the region where the future performance of a model is expected to lie. For evaluation we use a synthetic "binormal" world: the classification scores for positive and negative instances are drawn from separate normal distributions. For the "true-curve" bounds, all methods are sensitive to how well the distributions are separated, which corresponds directly to the area under the ROC curve. One method produces bounds that are universally too loose, another universally too tight, and the remaining two come close to the desired containment, although containment breaks down at the extremes of the ROC curve. As would be expected, all methods fail when used to contain "future" ROC curves. Widening the bounds to account for the increased uncertainty yields results qualitatively identical to those of the "true-curve" evaluation. We conclude by recommending a simple, very efficient method (vertical averaging) for large sample sizes and a more computationally expensive method (kernel estimation) for small sample sizes.
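To make the binormal setting concrete, the following is a minimal sketch of such a synthetic world. It is not the experimental code behind the results summarized above. The function names, the class means (0 and 1), the unit variances, and the sample size of 500 are all illustrative assumptions. The sketch relies on the standard binormal result that the area under the true ROC curve is Phi((mu_pos - mu_neg) / sqrt(sigma_pos^2 + sigma_neg^2)), where Phi is the standard normal CDF, which is why separation between the two score distributions maps directly to AUC.

```python
import random
from statistics import NormalDist


def binormal_auc(mu_neg, sigma_neg, mu_pos, sigma_pos):
    """Closed-form AUC when each class's scores are normally distributed."""
    d = (mu_pos - mu_neg) / (sigma_pos ** 2 + sigma_neg ** 2) ** 0.5
    return NormalDist().cdf(d)


def true_tpr(fpr, mu_neg, sigma_neg, mu_pos, sigma_pos):
    """TPR on the true binormal ROC curve at a given FPR (0 < fpr < 1)."""
    # Threshold t such that a negative score exceeds t with probability fpr.
    t = NormalDist(mu_neg, sigma_neg).inv_cdf(1.0 - fpr)
    # TPR is the probability that a positive score exceeds the same threshold.
    return 1.0 - NormalDist(mu_pos, sigma_pos).cdf(t)


def sample_scores(n, mu, sigma, rng):
    """Draw n classifier scores for one class from a normal distribution."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def empirical_auc(neg_scores, pos_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative parameters only (assumed, not taken from the paper).
rng = random.Random(0)
neg = sample_scores(500, 0.0, 1.0, rng)
pos = sample_scores(500, 1.0, 1.0, rng)

print("true AUC:     ", binormal_auc(0.0, 1.0, 1.0, 1.0))
print("empirical AUC:", empirical_auc(neg, pos))
print("true TPR at FPR=0.1:", true_tpr(0.1, 0.0, 1.0, 1.0, 1.0))
```

Because the true curve is known in closed form here, a containment check for any bound method reduces to asking, at each operating point, whether the value from `true_tpr` falls inside the interval that the method builds from the sampled scores.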