A ROC curve provides a graphical representation of the relationship between the true-positive and false-positive prediction rates of a model. The y-axis corresponds to the sensitivity of the model, i.e. how well the model distinguishes true positives (real cleavages) from sites that are not cleaved, and the y-coordinates are calculated as:

sensitivity = TP / (TP + FN)
The x-axis corresponds to the specificity (expressed on the curve as 1 - specificity), i.e. the ability of the model to identify true negatives. An increase in specificity (i.e. a move to the left along the x-axis) comes at the cost of a decrease in sensitivity. The x-coordinates are calculated as:

1 - specificity = FP / (FP + TN)
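The two coordinate formulas can be sketched in a few lines of Python. This is a minimal illustration; the function and count names (tp, fn, tn, fp) are my own, and the example counts are illustrative, not from the text above.

```python
# Minimal sketch of the ROC coordinate formulas.
def sensitivity(tp, fn):
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Illustrative counts: 18 TP, 14 FN, 92 TN, 1 FP.
# y-coordinate = sensitivity, x-coordinate = 1 - specificity.
print(round(sensitivity(18, 14), 2), round(1 - specificity(92, 1), 2))
```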
The greater the sensitivity at high specificity values (i.e. high y-axis values at low x-axis values), the better the model. A numerical measure of the accuracy of the model can be obtained from the area under the curve: an area of 1.0 signifies perfect accuracy, an area of 0.5 indicates performance no better than random, and an area below 0.5 indicates a model worse than random guessing. The relationship between area and qualitative accuracy is roughly linear, so the following can be used as a guide:
- 0.9-1: Excellent
- 0.8-0.9: Very good
- 0.7-0.8: Good
- 0.6-0.7: Average
- 0.5-0.6: Poor
Introduction to ROC Curves
The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test--they also depend on the definition of what constitutes an abnormal test. Look at the idealized graph at right showing the number of patients with and without a disease arranged according to the value of a diagnostic test. These distributions overlap--the test (like most) does not distinguish normal from disease with 100% accuracy. The area of overlap indicates where the test cannot distinguish normal from disease. In practice, we choose a cutpoint (indicated by the vertical black line) above which we consider the test to be abnormal and below which we consider the test to be normal. The position of the cutpoint determines the number of true positives, true negatives, false positives and false negatives. We may wish to use different cutpoints in different clinical situations, depending on which type of erroneous result we most want to minimize.
We can use the hypothyroidism data from the likelihood ratio section to illustrate how sensitivity and specificity change depending on the choice of T4 level that defines hypothyroidism. Recall the data on patients with suspected hypothyroidism reported by Goldstein and Mushlin (J Gen Intern Med 1987;2:20-24). The data on T4 values in hypothyroid and euthyroid patients are shown graphically and in simplified tabular form below.
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| 5 or less | 18 | 1 |
| 5.1 - 7 | 7 | 17 |
| 7.1 - 9 | 4 | 36 |
| Totals: | 32 | 93 |
Suppose that patients with T4 values of 5 or less are considered to be hypothyroid. The data display then reduces to:
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| 5 or less | 18 | 1 |
| > 5 | 14 | 92 |
| Totals: | 32 | 93 |

The sensitivity at this cutpoint is 18/32 = 0.56 and the specificity is 92/93 = 0.99.
Now, suppose we decide to make the definition of hypothyroidism less stringent and now consider patients with T4 values of 7 or less to be hypothyroid. The data display will now look like this:
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| 7 or less | 25 | 18 |
| > 7 | 7 | 75 |
| Totals: | 32 | 93 |

The sensitivity is now 25/32 = 0.78 and the specificity 75/93 = 0.81.
Let's move the cutpoint for hypothyroidism one more time, to 9:
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| < 9 | 29 | 54 |
| 9 or more | 3 | 39 |
| Totals: | 32 | 93 |

The sensitivity is 29/32 = 0.91 and the specificity 39/93 = 0.42.
Now, take the sensitivity and specificity values above and put them into a table:
| Cutpoint | Sensitivity | Specificity |
| --- | --- | --- |
| 5 | 0.56 | 0.99 |
| 7 | 0.78 | 0.81 |
| 9 | 0.91 | 0.42 |
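As a sketch of the arithmetic behind this table (variable names are my own, not from the study), each row can be reproduced from the raw T4 counts: for a given cutpoint, the true positives are the hypothyroid patients at or below the cut, and the true negatives are the euthyroid patients above it.

```python
# Reproduce the cutpoint table from the raw T4 counts.
# Each bin holds (hypothyroid, euthyroid) counts, in increasing T4 order:
# <=5, 5.1-7, 7.1-9, 9 or more.
bins = [(18, 1), (7, 17), (4, 36), (3, 39)]
total_hypo = sum(h for h, _ in bins)   # 32
total_eu = sum(e for _, e in bins)     # 93

rows = []
for n_bins, cutpoint in [(1, 5), (2, 7), (3, 9)]:
    tp = sum(h for h, _ in bins[:n_bins])   # hypothyroid with a "positive" test
    tn = sum(e for _, e in bins[n_bins:])   # euthyroid with a "negative" test
    rows.append((cutpoint, round(tp / total_hypo, 2), round(tn / total_eu, 2)))

print(rows)   # [(5, 0.56, 0.99), (7, 0.78, 0.81), (9, 0.91, 0.42)]
```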
The next section covers how to use the numbers we just calculated to draw and interpret an ROC curve.
Plotting and Interpreting an ROC Curve
The operating characteristics computed in the previous section can be reformulated slightly and then presented graphically as shown below:
| Cutpoint | True Positives | False Positives |
| --- | --- | --- |
| 5 | 0.56 | 0.01 |
| 7 | 0.78 | 0.19 |
| 9 | 0.91 | 0.58 |
This type of graph is called a Receiver Operating Characteristic curve (or ROC curve). It is a plot of the true-positive rate against the false-positive rate for the different possible cutpoints of a diagnostic test.
An ROC curve demonstrates several things:
- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
- The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test. You can check this out on the graph above. Recall that the LR for T4 > 9 is 0.2. This corresponds to the far-right, nearly horizontal portion of the curve.
- The area under the curve is a measure of test accuracy. This is discussed further in the next section.
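The slope property in the last bullet can be checked numerically. In this sketch the (FPR, TPR) points are the rounded values tabulated above, plus the (0, 0) and (1, 1) endpoints; because the tabulated FPR of 0.01 is rounded, the first slope (56) overshoots the exact LR for T4 <= 5, but the last slope reproduces the LR of 0.2 for T4 > 9.

```python
# Slope of each ROC segment ~ likelihood ratio of that test-result interval.
points = [(0.0, 0.0), (0.01, 0.56), (0.19, 0.78), (0.58, 0.91), (1.0, 1.0)]

slopes = [(y1 - y0) / (x1 - x0)            # rise over run = LR of the segment
          for (x0, y0), (x1, y1) in zip(points, points[1:])]

print([round(s, 1) for s in slopes])   # [56.0, 1.2, 0.3, 0.2]
```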
The Area Under an ROC Curve
The graph at right shows three ROC curves representing excellent, good, and worthless tests plotted on the same graph. The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question. Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:
- .90-1 = excellent (A)
- .80-.90 = good (B)
- .70-.80 = fair (C)
- .60-.70 = poor (D)
- .50-.60 = fail (F)
ROC curves can also be constructed from clinical prediction rules. The graphs at right come from a study of how clinical findings predict strep throat (Wigton RS, Connor JL, Centor RM. Transportability of a decision rule for the diagnosis of streptococcal pharyngitis. Arch Intern Med. 1986;146:81-83). In that study, the presence of tonsillar exudate, fever, adenopathy and the absence of cough all predicted strep. The curves were constructed by computing the sensitivity and specificity of increasing numbers of clinical findings (from 0 to 4) in predicting strep. The study compared patients in Virginia and Nebraska and found that the rule performed more accurately in Virginia (area under the curve = .78) compared to Nebraska (area under the curve = .73). This difference turns out not to be statistically significant, however.
At this point, you may be wondering what this area number really means and how it is computed. The area measures discrimination, that is, the ability of the test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups. You randomly pick one from the disease group and one from the no-disease group and do the test on both. The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which this is true (that is, the test correctly classifies the two patients in the random pair).
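This pairwise interpretation can be sketched directly. The scores below are made up for illustration (they are not data from the study); for T4, a lower value is the more abnormal result, so a pair is ranked correctly when the diseased patient has the lower score, with ties counting half.

```python
# AUC as the proportion of (diseased, healthy) pairs ranked correctly.
diseased = [2.0, 3.5, 4.0]   # toy T4 values; lower = more abnormal
healthy = [5.0, 3.0, 8.0]

wins = ties = 0
for d in diseased:
    for h in healthy:
        if d < h:        # diseased patient has the more abnormal (lower) T4
            wins += 1
        elif d == h:
            ties += 1

auc = (wins + 0.5 * ties) / (len(diseased) * len(healthy))
print(round(auc, 2))   # 0.78
```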
Computing the area is more difficult to explain and beyond the scope of this introductory material. Two methods are commonly used: a non-parametric method based on constructing trapezoids under the curve as an approximation of area, and a parametric method using a maximum likelihood estimator to fit a smooth curve to the data points. Both methods are available as computer programs and give an estimate of area and standard error that can be used to compare different tests or the same test in different patient populations. For more on quantitative ROC analysis, see Metz CE. Basic principles of ROC analysis. Sem Nuc Med. 1978;8:283-298.
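As an illustration of the non-parametric approach only (not the study's published estimate), the trapezoidal rule applied to the rounded T4 points from the earlier table, with (0, 0) and (1, 1) endpoints added, gives an area of roughly 0.85:

```python
# Trapezoidal estimate of the area under the T4 ROC curve.
points = [(0.0, 0.0), (0.01, 0.56), (0.19, 0.78), (0.58, 0.91), (1.0, 1.0)]

auc = sum((x1 - x0) * (y0 + y1) / 2        # trapezoid: width * mean height
          for (x0, y0), (x1, y1) in zip(points, points[1:]))

print(round(auc, 2))   # 0.85
```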
A final note of historical interest: You may be wondering where the name "Receiver Operating Characteristic" came from. ROC analysis is part of a field called "Signal Detection Theory" developed during World War II for the analysis of radar images. Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise. Signal detection theory measures the ability of radar receiver operators to make these important distinctions. Their ability to do so was called the Receiver Operating Characteristics. It was not until the 1970s that signal detection theory was recognized as useful for interpreting medical test results.