Johnny Deng's Column: Nonparametric Estimate I

A very good introduction to nonparametric statistics, referenced from http://www.uwsp.edu/psych/stat/14/nonparm.htm

I. Introduction

Nonparametric tests are sometimes called distribution free statistics because they do not require that the data fit a normal distribution. More generally, nonparametric tests require less restrictive assumptions about the data. Another important reason for using these tests is that they allow for the analysis of categorical as well as rank data.

Why Not Used All the Time?
Since nonparametric tests require fewer assumptions and can be used with a broader range of data types, the question becomes, "Why not use them all of the time?" Parametric tests are often preferred because:
1. They are robust (they tolerate moderate violations of their assumptions).
2. They have greater power efficiency, in other words, they have greater power relative to the sample size.
3. They provide unique information (e.g., the interaction in a factorial design).
4. Parametric and nonparametric tests often address two different types of questions.
Relation to Parametric Tests
The Summary of Statistical Tests should help put into perspective where nonparametric tests fit into what we have learned. For example, we have already learned about the binomial test for the simplest case of nominal data and Spearman's Rho for correlations involving rank data. In this unit, we will learn about the chi-square test. The other tests listed in the table (that we have not yet covered) are beyond the scope of the course.
It is important to note that even with metric data, if assumptions are badly violated, nonparametric tests are likely to be employed.

II. Chi Square

This statistic is used to test expected versus observed frequencies. There are two situations in which it is used.

One Variable (or Sample) Case
This is sometimes called the goodness of fit test. Consider an example.
1. Research Question
  Do people have a preference for movie type?
2. Hypotheses
  
  In words:
  
  H_O The observed distribution fits the expected or, in
  other words, there is no preference.
  
  H_A The observed distribution does not fit that expected
  (there is a preference).
Notice that there is no mention made of parameters.
3. Assumptions
  1. The sample is chosen randomly.
  2. The scores are independent (i.e., each subject is allowed only one preference).
  3. The null hypothesis.
4. Decision rules
  Let c equal the number of columns. In this case, there are four preferences or columns. Thus, df=c-1 or 4-1=3 and with an α level of .05 the critical value of chi square is 7.82 (see table).
  
  If χ²_obs ≥ 7.82, reject H_O.
  If χ²_obs < 7.82, do not reject H_O.
5. Computation
  The appropriate descriptive statistic is the percentage of people preferring each type of movie.
  
  If it looks like these percentages are worthy of additional analysis, we must first determine the expected frequencies. If we are asking folks which of four movie types they prefer and there is no preference, we would expect 25% to prefer each type. Let:
  
  E_j = the Expected frequency in the j-th column.
  O_j = the Observed frequency in the j-th column.
  In our example, j runs over the types of movies (j = 1, ..., c, with c = 4), so under the null hypothesis E_j = N/4 for every column.
  
  Then:
  
  χ²_obs = Σ_j (O_j - E_j)² / E_j
  
  Now let's consider the following data:
  Substituting the numbers in the formula gives:
6. Decision
  Since χ²_obs (10.00) > χ²_crit (7.82), we reject H_O and conclude that folks do have a preference for which type of movie they like best. They like comedy the best and sci fi the least.
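
The one-variable computation can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the course's own code: the function name `goodness_of_fit` is made up for illustration, the counts are hypothetical (they are not the data from the movie example), and the expected frequency is taken to be the same in every column, as in the no-preference null hypothesis. The critical value 7.82 is the one quoted in the decision rules.

```python
def goodness_of_fit(observed):
    """Return (chi_square, df) for a one-variable chi-square test,
    assuming every category is equally likely under the null."""
    n = sum(observed)
    c = len(observed)
    expected = n / c  # same expected frequency in every column
    chi_square = sum((o - expected) ** 2 / expected for o in observed)
    return chi_square, c - 1


# Hypothetical counts for four movie types (illustration only).
counts = [32, 28, 22, 18]
chi_square, df = goodness_of_fit(counts)

CRITICAL_VALUE = 7.82  # df = 3, alpha = .05, as quoted in the text
decision = "reject H_O" if chi_square >= CRITICAL_VALUE else "do not reject H_O"
print(f"chi-square = {chi_square:.2f}, df = {df}: {decision}")
```

With these made-up counts the statistic falls below the critical value, so the null hypothesis of no preference would not be rejected.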

Two Variable (or Sample) Case
This test goes by several names. It is most commonly called the Pearson Chi Square, but is sometimes called a test of independence between two variables or crosstabs.

Consider the following data (called a contingency table) on drug usage that I collected when I was a student in college.

Contingency Table                  Frequency of Marijuana Use
Categories of Other Drugs Tried    Less frequent    Frequent (3 times/week)    Total
1-3                                26               6                          32
4-6                                17               25                         42
Total                              43               31                         74

It looks like folks that smoked marijuana more frequently also tried more categories of other drugs.

Research Question
Is frequency of marijuana smoking related to number of other drugs tried?
Hypotheses

In words:

H_O There is no relationship (or contingency) between
the two variables, that is, they are independent.

H_A The two variables are related.

Again, notice that there is no mention made of parameters.

Assumptions
1. The individuals in each sample are chosen randomly.
2. The scores are independent (i.e., each subject fits in only one cell of the table).
3. For a 2x2 table, all expected cell frequencies should be at least equal to 10 (for larger tables, this value is 5).
4. The null hypothesis.
Decision rules
Again, let c equal the number of columns. Since we are also considering another variable, let r equal the number of rows. Thus, df=(c-1)(r-1) or (2-1)(2-1)=1 and with an α level of .05 the critical value of chi square is 3.84 (see table).

If χ²_obs ≥ 3.84, reject H_O.
If χ²_obs < 3.84, do not reject H_O.
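
The decision rule can be written as a small Python sketch. The helper names `degrees_of_freedom` and `decide` are hypothetical, the lookup table holds only the two critical values quoted in this column (α = .05), and the two test statistics passed to `decide` are made-up numbers chosen to fall on either side of the cutoff.

```python
# Critical values of chi-square at alpha = .05, as quoted in the text.
CRITICAL_VALUES_05 = {1: 3.84, 3: 7.82}


def degrees_of_freedom(n_rows, n_cols):
    """df = (c - 1)(r - 1) for a table with r rows and c columns."""
    return (n_cols - 1) * (n_rows - 1)


def decide(chi_square_obs, df):
    """Compare the observed statistic with the alpha = .05 critical value."""
    critical = CRITICAL_VALUES_05[df]
    return "reject H_O" if chi_square_obs >= critical else "do not reject H_O"


df = degrees_of_freedom(2, 2)
print(df, CRITICAL_VALUES_05[df])
print(decide(5.0, df))  # hypothetical statistic above the cutoff
print(decide(2.0, df))  # hypothetical statistic below the cutoff
```

For any other df the critical value would have to be read from a chi-square table; the dictionary here is deliberately limited to the values the text supplies.
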
Computation
First we must determine the expected frequencies. Let:

E_jk	= the expected frequency of the cell defined by the j-th column and the k-th row.
O_jk	= the observed frequency of the cell defined by the j-th column and the k-th row.
C_j	= the total of the j-th column.
R_k	= the total of the k-th row.
Where j = 1, ..., c indexes the columns and k = 1, ..., r indexes the rows.

And:

E_jk = (C_j × R_k) / N

Note, a helpful check is that the sum of the expected cell frequencies is equal to N, that is:

Σ_j Σ_k E_jk = N

Then:

χ²_obs = Σ_j Σ_k (O_jk - E_jk)² / E_jk

So, let's compute the E_jk values for the data above.

Contingency Table (expected frequencies in parentheses)

                                   Frequency of Marijuana Use
Categories of Other Drugs Tried    Less frequent    Frequent (3 times/week)    Total
1-3                                26 (18.59)       6 (13.41)                  32
4-6                                17 (24.41)       25 (17.59)                 42
Total                              43               31                         74

To be clear, E₁₁ = (32*43)/74 = 18.59
and checking our work, 18.59 + 13.41 + 24.41 + 17.59 = 74.00 = N.

So 6/32 or about 19% of folks who had tried 1-3 other drugs smoked marijuana frequently whereas 25/42 or about 60% of folks who had tried 4-6 other drugs smoked frequently. These percentages are the relevant descriptive statistics that give us the reason for performing the chi square test.

Substituting the values in the formula gives:

χ²_obs = (26 - 18.59)²/18.59 + (6 - 13.41)²/13.41 + (17 - 24.41)²/24.41 + (25 - 17.59)²/17.59 ≈ 12.4
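
The same arithmetic can be checked with a short Python sketch. It uses the observed counts from the contingency table; the variable names are illustrative, and the code is a plain restatement of "row total times column total over N", not a particular statistics package.

```python
observed = [[26, 6],    # tried 1-3 categories of other drugs
            [17, 25]]   # tried 4-6 categories of other drugs

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell: (row total * column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Helpful check: the expected frequencies should sum to N.
assert abs(sum(map(sum, expected)) - n) < 1e-9

# Chi-square: sum of (O - E)^2 / E over all four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_square, 2))
```

Run as written, this reproduces the expected frequencies in the table and a chi-square of about 12.4. Carrying full precision instead of two-decimal expected frequencies changes the result only in the second decimal place.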

Decision
Since χ²_obs (12.4) > χ²_crit (3.84), we reject H_O and conclude that frequent users of marijuana are more likely to have tried more categories of other drugs.