Johnny Deng's Column: Spearman Correlation

Spearman's rank correlation coefficient, named after Charles Spearman and often denoted by the Greek letter ρ (rho), is a non-parametric measure of correlation – that is, it assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. Unlike the Pearson product-moment correlation coefficient, it does not require the assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales; it can be used for variables measured at the ordinal level.

In principle, ρ is simply a special case of the Pearson product-moment coefficient in which the data are converted to rankings before calculating the coefficient. In practice, however, a simpler procedure is normally used to calculate ρ. The raw scores are converted to ranks, and the differences d between the ranks of each observation on the two variables are calculated.

If there are no tied ranks, i.e. $\neg\exists_{i,j} i\ne j \wedge (x_i=x_j \vee y_i=y_j)$

then ρ is given by:

$\rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}$

where:

$d_i$ = the difference between the ranks of the $i$-th pair of values of x and y, and

$n$ = the number of pairs of values.
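The shortcut formula above can be sketched in a few lines of Python. This is only an illustration: the function names `rank_distinct` and `spearman_rho_no_ties` are made up for this example, ranks are assigned in ascending order (smallest value gets rank 1), and the code refuses to run when ties are present, since the formula does not apply in that case.

```python
def rank_distinct(values):
    """Rank distinct values in ascending order: the smallest gets rank 1."""
    if len(set(values)) != len(values):
        raise ValueError("tied values found; the shortcut formula does not apply")
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]


def spearman_rho_no_ties(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); requires no ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two pairs of values")
    rank_x, rank_y = rank_distinct(x), rank_distinct(y)
    # d_i is the difference between the two ranks of the i-th pair.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical data: the ranks of y are [2, 1, 4, 3, 5], so sum(d^2) = 4.
print(spearman_rho_no_ties([10, 20, 30, 40, 50], [2, 1, 4, 3, 5]))
```

A perfectly increasing relationship gives 1, and a perfectly decreasing one gives -1, whatever the actual spacing of the values.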

If tied ranks exist, the classic Pearson correlation coefficient between ranks has to be used instead of this formula. Equal values are all assigned the same rank, which is the average of the positions they occupy in the sorted order of the values (descending order in the table below):

An Example of Averaging Ranks

Variable	Position in the descending order	Rank
0.8	5	5
1.2	4	$\frac{4+3}{2}=3.5$
1.2	3	$\frac{4+3}{2}=3.5$
2.3	2	2
18	1	1
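The averaging rule in the table above could be implemented roughly as follows. This is a minimal sketch, not a reference implementation: the function name `average_ranks` and its `descending` flag are assumptions made for this example. With `descending=True` it mirrors the table, where the largest value takes position 1.

```python
def average_ranks(values, descending=False):
    """Assign ranks; tied values share the mean of the positions they occupy."""
    # Indices of the values in sorted order.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions are 1-based, so the group covers start + 1 .. end + 1.
        mean_position = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks


# The values from the table above, ranked in descending order.
print(average_ranks([0.8, 1.2, 1.2, 2.3, 18], descending=True))
# prints [5.0, 3.5, 3.5, 2.0, 1.0]
```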

Spearman's rank correlation coefficient is equivalent to the Pearson correlation computed on ranks. The formula above is a shortcut for its product-moment form that is valid only when there are no ties. The product-moment form can be used in both tied and untied cases.
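To make the product-moment form concrete, here is a hedged Python sketch that ranks both variables (averaging tied ranks) and then applies the ordinary Pearson formula to the ranks. The helper names `average_ranks`, `pearson` and `spearman_rho` are illustrative only, and the code assumes that neither variable is constant, because a constant variable would cause a division by zero.

```python
from math import sqrt


def average_ranks(values):
    """Ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end + 2) / 2  # mean of 1-based positions
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rho as Pearson's r on the ranks; works with or without ties."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data with a tie in x; the shortcut formula would not apply here.
print(spearman_rho([1, 2, 2, 3], [1, 2, 3, 4]))
```

When there are no ties, this gives the same value as the shortcut formula, up to floating-point rounding.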

A version of this correlation is called Spearman's rho. In this case ranks are calculated as above, but in the formula for Pearson's correlation the standard deviations are taken as if there were no ties.

Another popular method for computing rank correlation is the Kendall tau rank correlation coefficient.

Example

The raw data used in this example is shown below.

IQ	Hours of TV per week
106	7
86	0
100	27
101	50
99	28
103	29
97	20
113	12
112	6
110	17

The first step is to sort this data by the first column. Next, two more columns are created, holding the ranks of the first two columns. This data set contains no ties; if it did, each tied value would receive the mean of the ranks the tied values would otherwise occupy. Then a column "d" is created to hold the differences between the two rank columns. Finally, another column "d²" is created, which is just column d squared.

After doing this process with the example data you should end up with something like:

IQ (i)	Hours of TV per week (t)	rank (i)	rank (t)	d	d²
86	0	1	1	0	0
97	20	2	6	-4	16
99	28	3	8	-5	25
100	27	4	7	-3	9
101	50	5	10	-5	25
103	29	6	9	-3	9
106	7	7	3	4	16
110	17	8	5	3	9
112	6	9	2	7	49
113	12	10	4	6	36
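As a cross-check, the table above can be regenerated with a short Python sketch. The names `iq`, `tv` and `ranks_no_ties` are illustrative and not part of the original column, and the helper assumes there are no tied values, which holds for this data set. Here d is computed as rank(i) minus rank(t), so it carries a sign; only d² enters the formula.

```python
# Raw data from the example, in the original order.
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks_no_ties(values):
    """Ascending ranks for distinct values: the smallest gets rank 1."""
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]


# Step 1: sort the pairs by the first column (IQ).
rows = sorted(zip(iq, tv))
iq_sorted = [r[0] for r in rows]
tv_sorted = [r[1] for r in rows]

# Step 2: rank each column, then compute d and d squared.
rank_i = ranks_no_ties(iq_sorted)
rank_t = ranks_no_ties(tv_sorted)
d = [a - b for a, b in zip(rank_i, rank_t)]
d2 = [v * v for v in d]

print("IQ\tTV\trank(i)\trank(t)\td\td^2")
for row in zip(iq_sorted, tv_sorted, rank_i, rank_t, d, d2):
    print("\t".join(str(v) for v in row))
print("sum of d^2 =", sum(d2))
```

The signed differences always sum to zero, which is a quick sanity check on the ranking step.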

The values in the d² column can now be added to find $\sum d_i^2 = 194$. The value of n is 10. These values can now be substituted into the equation,

$\rho = 1- {\frac {6\times194}{10(10^2 - 1)}}$

which evaluates to $\rho = -0.175758$. If the original values contain ties, this formula should not be used. Instead, the Pearson correlation coefficient should be calculated on the ranks (where tied values are given averaged ranks, as described above).
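Spelled out step by step, the arithmetic for the no-ties example above is:

$6 \times 194 = 1164$

$10(10^2 - 1) = 10 \times 99 = 990$

$\rho = 1 - \frac{1164}{990} \approx 1 - 1.175758 = -0.175758$

A value this close to zero suggests at most a weak negative association between the two rankings in this small sample.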


Saturday, 1 September 2007
