A ROC curve provides a graphical representation of the relationship between the true-positive and false-positive prediction rates of a model. The y-axis corresponds to the sensitivity of the model, i.e. how well the model distinguishes true positives (real cleavages) from sites that are not cleaved, and the y-coordinates are calculated as:

sensitivity = TP / (TP + FN)
The x-axis corresponds to the specificity (expressed on the curve as 1 - specificity), i.e. the ability of the model to identify true negatives. An increase in specificity (i.e. a move to the left along the x-axis) comes at the cost of a decrease in sensitivity. The x-coordinates are calculated as:

1 - specificity = FP / (FP + TN)
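The two coordinate formulas can be sketched in a few lines of Python. This is a minimal illustration; the function and count names (tp, fn, tn, fp) are my own, and the example counts are illustrative, not from the text above.

```python
# Minimal sketch of the ROC coordinate formulas.
def sensitivity(tp, fn):
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Illustrative counts: 18 TP, 14 FN, 92 TN, 1 FP.
# y-coordinate = sensitivity, x-coordinate = 1 - specificity.
print(round(sensitivity(18, 14), 2), round(1 - specificity(92, 1), 2))
```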
The greater the sensitivity at high specificity values (i.e. high y-axis values at low x-axis values), the better the model. A numerical measure of the accuracy of the model can be obtained from the area under the curve: an area of 1.0 signifies perfect accuracy, an area of 0.5 indicates performance no better than random, and an area below 0.5 indicates a model worse than random guessing. The relationship between area and qualitative accuracy is roughly linear, so the following can be used as a guide:
- 0.9-1: Excellent
- 0.8-0.9: Very good
- 0.7-0.8: Good
- 0.6-0.7: Average
- 0.5-0.6: Poor
Introduction to ROC Curves
The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test--they also depend on the definition of what constitutes an abnormal test. Look at the idealized graph at right showing the number of patients with and without a disease arranged according to the value of a diagnostic test. These distributions overlap--the test (like most) does not distinguish normal from disease with 100% accuracy. The area of overlap indicates where the test cannot distinguish normal from disease. In practice, we choose a cutpoint (indicated by the vertical black line) above which we consider the test to be abnormal and below which we consider the test to be normal. The position of the cutpoint determines the number of true positives, true negatives, false positives and false negatives. We may wish to use different cutpoints in different clinical situations, depending on which type of erroneous result we most want to minimize.
We can use the hypothyroidism data from the likelihood ratio section to illustrate how sensitivity and specificity change depending on the choice of T4 level that defines hypothyroidism. Recall the data on patients with suspected hypothyroidism reported by Goldstein and Mushlin (J Gen Intern Med 1987;2:20-24). The data on T4 values in hypothyroid and euthyroid patients are shown graphically and in simplified tabular form below.
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| 5 or less | 18 | 1 |
| 5.1 - 7 | 7 | 17 |
| 7.1 - 9 | 4 | 36 |
| Totals: | 32 | 93 |
Suppose that patients with T4 values of 5 or less are considered to be hypothyroid. The data display then reduces to:
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| 5 or less | 18 | 1 |
| > 5 | 14 | 92 |
| Totals: | 32 | 93 |

The sensitivity at this cutpoint is 18/32 = 0.56 and the specificity is 92/93 = 0.99.
Now, suppose we decide to make the definition of hypothyroidism less stringent and now consider patients with T4 values of 7 or less to be hypothyroid. The data display will now look like this:
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| 7 or less | 25 | 18 |
| > 7 | 7 | 75 |
| Totals: | 32 | 93 |

The sensitivity is now 25/32 = 0.78 and the specificity 75/93 = 0.81.
Let's move the cutpoint for hypothyroidism one more time, to 9:
| T4 value | Hypothyroid | Euthyroid |
| --- | --- | --- |
| < 9 | 29 | 54 |
| 9 or more | 3 | 39 |
| Totals: | 32 | 93 |

The sensitivity is 29/32 = 0.91 and the specificity 39/93 = 0.42.
Now, take the sensitivity and specificity values above and put them into a table:
| Cutpoint | Sensitivity | Specificity |
| --- | --- | --- |
| 5 | 0.56 | 0.99 |
| 7 | 0.78 | 0.81 |
| 9 | 0.91 | 0.42 |
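As a sketch of the arithmetic behind this table (variable names are my own, not from the study), each row can be reproduced from the raw T4 counts: for a given cutpoint, the true positives are the hypothyroid patients at or below the cut, and the true negatives are the euthyroid patients above it.

```python
# Reproduce the cutpoint table from the raw T4 counts.
# Each bin holds (hypothyroid, euthyroid) counts, in increasing T4 order:
# <=5, 5.1-7, 7.1-9, 9 or more.
bins = [(18, 1), (7, 17), (4, 36), (3, 39)]
total_hypo = sum(h for h, _ in bins)   # 32
total_eu = sum(e for _, e in bins)     # 93

rows = []
for n_bins, cutpoint in [(1, 5), (2, 7), (3, 9)]:
    tp = sum(h for h, _ in bins[:n_bins])   # hypothyroid with a "positive" test
    tn = sum(e for _, e in bins[n_bins:])   # euthyroid with a "negative" test
    rows.append((cutpoint, round(tp / total_hypo, 2), round(tn / total_eu, 2)))

print(rows)   # [(5, 0.56, 0.99), (7, 0.78, 0.81), (9, 0.91, 0.42)]
```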
The next section covers how to use the numbers we just calculated to draw and interpret an ROC curve.
Plotting and Interpreting an ROC Curve
The operating characteristics computed in the previous section can be reformulated slightly and then presented graphically as shown below:
| Cutpoint | True Positives | False Positives |
| --- | --- | --- |
| 5 | 0.56 | 0.01 |
| 7 | 0.78 | 0.19 |
| 9 | 0.91 | 0.58 |
This type of graph is called a Receiver Operating Characteristic curve (or ROC curve). It is a plot of the true-positive rate against the false-positive rate for the different possible cutpoints of a diagnostic test.
An ROC curve demonstrates several things:
- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
- The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test. You can check this out on the graph above. Recall that the LR for T4 > 9 is 0.2. This corresponds to the far-right, nearly horizontal portion of the curve.
- The area under the curve is a measure of test accuracy. This is discussed further in the next section.
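The slope property in the last bullet can be checked numerically. In this sketch the (FPR, TPR) points are the rounded values tabulated above, plus the (0, 0) and (1, 1) endpoints; because the tabulated FPR of 0.01 is rounded, the first slope (56) overshoots the exact LR for T4 <= 5, but the last slope reproduces the LR of 0.2 for T4 > 9.

```python
# Slope of each ROC segment ~ likelihood ratio of that test-result interval.
points = [(0.0, 0.0), (0.01, 0.56), (0.19, 0.78), (0.58, 0.91), (1.0, 1.0)]

slopes = [(y1 - y0) / (x1 - x0)            # rise over run = LR of the segment
          for (x0, y0), (x1, y1) in zip(points, points[1:])]

print([round(s, 1) for s in slopes])   # [56.0, 1.2, 0.3, 0.2]
```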
The Area Under an ROC Curve
The graph at right shows three ROC curves representing excellent, good, and worthless tests plotted on the same graph. The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question. Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:
- .90-1 = excellent (A)
- .80-.90 = good (B)
- .70-.80 = fair (C)
- .60-.70 = poor (D)
- .50-.60 = fail (F)
ROC curves can also be constructed from clinical prediction rules. The graphs at right come from a study of how clinical findings predict strep throat (Wigton RS, Connor JL, Centor RM. Transportability of a decision rule for the diagnosis of streptococcal pharyngitis. Arch Intern Med. 1986;146:81-83). In that study, the presence of tonsillar exudate, fever, adenopathy and the absence of cough all predicted strep. The curves were constructed by computing the sensitivity and specificity of increasing numbers of clinical findings (from 0 to 4) in predicting strep. The study compared patients in Virginia and Nebraska and found that the rule performed more accurately in Virginia (area under the curve = .78) compared to Nebraska (area under the curve = .73). This difference turns out not to be statistically significant, however.
At this point, you may be wondering what this area number really means and how it is computed. The area measures discrimination, that is, the ability of the test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups. You randomly pick one from the disease group and one from the no-disease group and do the test on both. The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which this is true (that is, the test correctly classifies the two patients in the random pair).
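This pairwise interpretation can be sketched directly. The scores below are made up for illustration (they are not data from the study); for T4, a lower value is the more abnormal result, so a pair is ranked correctly when the diseased patient has the lower score, with ties counting half.

```python
# AUC as the proportion of (diseased, healthy) pairs ranked correctly.
diseased = [2.0, 3.5, 4.0]   # toy T4 values; lower = more abnormal
healthy = [5.0, 3.0, 8.0]

wins = ties = 0
for d in diseased:
    for h in healthy:
        if d < h:        # diseased patient has the more abnormal (lower) T4
            wins += 1
        elif d == h:
            ties += 1

auc = (wins + 0.5 * ties) / (len(diseased) * len(healthy))
print(round(auc, 2))   # 0.78
```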
Computing the area is more difficult to explain and beyond the scope of this introductory material. Two methods are commonly used: a non-parametric method based on constructing trapezoids under the curve as an approximation of area, and a parametric method using a maximum likelihood estimator to fit a smooth curve to the data points. Both methods are available as computer programs and give an estimate of area and standard error that can be used to compare different tests or the same test in different patient populations. For more on quantitative ROC analysis, see Metz CE. Basic principles of ROC analysis. Sem Nuc Med. 1978;8:283-298.
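As an illustration of the non-parametric approach only (not the study's published estimate), the trapezoidal rule applied to the rounded T4 points from the earlier table, with (0, 0) and (1, 1) endpoints added, gives an area of roughly 0.85:

```python
# Trapezoidal estimate of the area under the T4 ROC curve.
points = [(0.0, 0.0), (0.01, 0.56), (0.19, 0.78), (0.58, 0.91), (1.0, 1.0)]

auc = sum((x1 - x0) * (y0 + y1) / 2        # trapezoid: width * mean height
          for (x0, y0), (x1, y1) in zip(points, points[1:]))

print(round(auc, 2))   # 0.85
```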
A final note of historical interest: You may be wondering where the name "Receiver Operating Characteristic" came from. ROC analysis is part of a field called "Signal Detection Theory" developed during World War II for the analysis of radar images. Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise. Signal detection theory measures the ability of radar receiver operators to make these important distinctions. Their ability to do so was called the Receiver Operating Characteristics. It was not until the 1970s that signal detection theory was recognized as useful for interpreting medical test results.