Hi all, this blog is an English archive of my PhD experience at Imperial College London, mainly logging my research and working process, as well as some visual records.

Saturday 25 August 2007

Nonparametric Estimate I

A very good introduction to nonparametric statistics, referenced from http://www.uwsp.edu/psych/stat/14/nonparm.htm

I. Introduction

Nonparametric tests are sometimes called distribution free statistics because they do not require that the data fit a normal distribution. More generally, nonparametric tests require less restrictive assumptions about the data. Another important reason for using these tests is that they allow for the analysis of categorical as well as rank data.

  1. Why Not Used All the Time?
    Since nonparametric tests require fewer assumptions and can be used with a broader range of data types, the question becomes, "Why not use them all of the time?" Parametric tests are often preferred because:
    1. They are robust.
    2. They have greater power efficiency, in other words, they have greater power relative to the sample size.
    3. They provide unique information (e.g., the interaction in a factorial design).
    4. Parametric and nonparametric tests often address two different types of questions.

  2. Relation to Parametric Tests
    The Summary of Statistical Tests should help put into perspective where nonparametric tests fit into what we have learned. For example, we have already learned about the binomial test for the simplest case of nominal data and Spearman's Rho for correlations involving rank data. In this unit, we will learn about the chi-square test. The other tests listed in the table (that we have not yet covered) are beyond the scope of the course.

    It is important to note that even with metric data, if assumptions are badly violated, nonparametric tests are likely to be employed.


II. Chi Square

This statistic is used to test expected versus observed frequencies. There are two situations in which it is used.

  1. One Variable (or Sample) Case
    This is sometimes called the goodness of fit test. Consider an example.

    1. Research Question
      Do people have a preference for movie type?

    2. Hypotheses


      In words:
      H0: The observed distribution fits the expected distribution or, in other words, there is no preference.
      HA: The observed distribution does not fit the expected distribution (there is a preference).

    Notice that there is no mention made of parameters.

    1. Assumptions
      1. The sample is chosen randomly.
      2. The scores are independent (i.e., each subject is allowed only one preference).
      3. The null hypothesis.

    2. Decision rules
      Let c equal the number of columns. In this case, there are four preferences or columns. Thus, df = c - 1 = 4 - 1 = 3, and with an α level of .05 the critical value of chi square is 7.82 (see table).

      If χ²obs ≥ 7.82, reject H0.
      If χ²obs < 7.82, do not reject H0.

    3. Computation
      The appropriate descriptive statistic is the percentage of people preferring each type of movie.

      If it looks like these percents are worthy of additional analysis, we must first determine the expected frequencies. If we are asking folks which of four movie types they prefer and there is no preference, we would expect 25% to prefer each type. Let:

        Ej = the Expected frequency in the j-th column.
        Oj = the Observed frequency in the j-th column.
        In our example, j indexes the types of movies (j = 1, ..., 4).

      Then:

        $$\chi^2_{obs} = \sum_{j=1}^{c} \frac{(O_j - E_j)^2}{E_j}$$

      Now let's consider the following data:


                            Comedy   Horror   Drama   Sci fi
        Expected (as %s)      25       25       25      25
        Observed (n=100)      35       30       20      15
        Observed %            35       30       20      15

    Substituting the numbers in the formula gives:

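    $$\chi^2_{obs} = \frac{(35-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(15-25)^2}{25}$$

    $$= \frac{100}{25} + \frac{25}{25} + \frac{25}{25} + \frac{100}{25}$$

    $$= 4 + 1 + 1 + 4$$

    $$= 10.00$$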

    1. Decision
      Since χ²obs (10.00) > χ²crit (7.82), we reject H0 and conclude that folks do have a preference for which type of movie they like best. They like comedy the best and sci fi the least.
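
The goodness of fit computation can also be checked in a few lines of code. This is only a minimal sketch in plain Python: the variable names are my own, and the critical value 7.82 is simply copied from the text rather than computed.

```python
# Goodness-of-fit chi square for the movie-preference example.
observed = {"Comedy": 35, "Horror": 30, "Drama": 20, "Sci fi": 15}

n = sum(observed.values())        # total number of respondents
expected = n / len(observed)      # no preference -> equal count per type

# Sum of (O - E)^2 / E over the columns.
chi2_obs = sum((o - expected) ** 2 / expected for o in observed.values())

df = len(observed) - 1            # c - 1
chi2_crit = 7.82                  # alpha = .05, df = 3, copied from the text

reject_h0 = chi2_obs >= chi2_crit
print(f"chi2 = {chi2_obs:.2f}, df = {df}, reject H0: {reject_h0}")
```

Running it should reproduce the value of 10.00 used in the decision above.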

  2. Two Variable (or Sample) Case
    This test goes by several names. It is most commonly called the Pearson Chi Square, but is sometimes called a test of independence between two variables or crosstabs.

    Consider the following data (called a contingency table) on drug usage that I collected when I was a student in college.

    Contingency Table
                             Frequency of Marijuana Use
    Categories of Other
    Drugs Tried               <>      3 times/week     Total
    1-3                       26           6             32
    4-6                       17          25             42
    Total                     43          31             74

    It looks like folks that smoked marijuana more frequently also tried more categories of other drugs.

    1. Research Question
      Is frequency of marijuana smoking related to number of other drugs tried?

    2. Hypotheses


      In words:
      H0: There is no relationship (or contingency) between the two variables, that is, they are independent.
      HA: The two variables are related.

    Again, notice that there is no mention made of parameters.

    1. Assumptions
      1. The individuals in each sample are chosen randomly.
      2. The scores are independent (i.e., each subject fits in only one cell of the table).
      3. For a 2x2 table, all expected cell frequencies should be at least equal to 10 (for larger tables, this value is 5).
      4. The null hypothesis.

    2. Decision rules
      Again, let c equal the number of columns. Since we are also considering another variable, let r equal the number of rows. Thus, df = (c-1)(r-1) = (2-1)(2-1) = 1, and with an α level of .05 the critical value of chi square is 3.84 (see table).

      If χ²obs ≥ 3.84, reject H0.
      If χ²obs < 3.84, do not reject H0.

    3. Computation
      First we must determine the expected frequencies. Let:

        Ejk = the expected frequency of the cell defined by the j-th column and the k-th row.
        Ojk = the observed frequency of the cell defined by the j-th column and the k-th row.
        Here j indexes the columns and k indexes the rows.

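    And:

    $$E_{jk} = \frac{(\text{total of row } k)(\text{total of column } j)}{N}$$
    Note, a helpful check is that the sum of the expected cell frequencies is equal to N, that is:

    $$\sum_{j}\sum_{k} E_{jk} = N$$

    Then:

      $$\chi^2_{obs} = \sum_{j}\sum_{k} \frac{(O_{jk} - E_{jk})^2}{E_{jk}}$$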

    So, let's compute the Ejks for the data above.

    Contingency Table (expected frequencies in parentheses)
                             Frequency of Marijuana Use
    Categories of Other
    Drugs Tried               <>            3 times/week     Total
    1-3                       26 (18.59)     6 (13.40)         32
    4-6                       17 (24.41)    25 (17.59)         42
    Total                     43            31                 74

    To be clear, E11 = (32*43)/74 = 18.59
    and checking our work, 18.59 + 13.40 + 24.41 + 17.59 = 73.99 ≈ 74.
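
The expected frequencies can be reproduced from the row and column totals alone. Below is a small sketch in plain Python, assuming the cells are stored row by row in the same order as the table (the list layout and names are my own choice):

```python
# Observed counts: rows = categories of other drugs tried (1-3, 4-6),
# columns = the two marijuana-use frequency groups, in table order.
observed = [[26, 6],
            [17, 25]]

row_totals = [sum(row) for row in observed]          # totals of each row
col_totals = [sum(col) for col in zip(*observed)]    # totals of each column
n = sum(row_totals)                                  # grand total N

# Expected count for each cell = (row total * column total) / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])

# Helpful check: the expected counts should add back up to N.
total_expected = sum(sum(row) for row in expected)
print(total_expected)
```

Unrounded, the second cell is about 13.405, which the table lists as 13.40, so the rounded figures sum to 73.99 rather than exactly 74.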

    So 6/32 or about 19% of folks who had tried 1-3 other drugs smoked marijuana frequently whereas 25/42 or about 60% of folks who had tried 4-6 other drugs smoked frequently. These percentages are the relevant descriptive statistics that give us the reason for performing the chi square test.
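
These row percentages are easy to recompute. A tiny sketch, assuming (as the sentence above implies) that the second column of the table holds the frequent smokers; the names `less` and `frequent` are just my labels:

```python
# (less frequent, frequent) marijuana users per row of the table.
rows = {"1-3": (26, 6), "4-6": (17, 25)}

shares = {}
for label, (less, frequent) in rows.items():
    # Proportion of frequent smokers within this row.
    shares[label] = frequent / (less + frequent)
    print(f"{label} other drugs tried: {shares[label]:.1%} smoked frequently")
```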

    Substituting the values in the formula gives:

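    $$\chi^2_{obs} = \frac{(26-18.59)^2}{18.59} + \frac{(6-13.40)^2}{13.40} + \frac{(17-24.41)^2}{24.41} + \frac{(25-17.59)^2}{17.59}$$

    $$= \frac{54.91}{18.59} + \frac{54.76}{13.40} + \frac{54.91}{24.41} + \frac{54.91}{17.59}$$

    $$= 2.95 + 4.09 + 2.25 + 3.12$$

    $$\approx 12.4$$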

    1. Decision
      Since χ²obs (12.4) > χ²crit (3.84), we reject H0 and conclude that frequent users of marijuana are more likely to have tried more categories of other drugs.
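
For completeness, here is a rough sketch of the whole two-variable test in plain Python. The critical value 3.84 is copied from the text. The p-value line is my own addition and relies on the fact that, for df = 1 only, the upper-tail probability of chi square equals erfc(sqrt(x/2)); it would not be valid for larger tables.

```python
import math

# Observed counts, rows = categories of other drugs tried (1-3, 4-6),
# columns = the two marijuana-use frequency groups, in table order.
observed = [[26, 6],
            [17, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum of (O - E)^2 / E over all cells, with E = row total * column total / N.
chi2_obs = 0.0
for k, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[k] * col_totals[j] / n
        chi2_obs += (o - e) ** 2 / e

df = (len(col_totals) - 1) * (len(row_totals) - 1)   # (c-1)(r-1)
chi2_crit = 3.84                                     # alpha = .05, df = 1, from the text

# Upper-tail probability; this shortcut holds for df = 1 only.
p_value = math.erfc(math.sqrt(chi2_obs / 2))

print(f"chi2 = {chi2_obs:.2f}, df = {df}, p = {p_value:.5f}")
print("reject H0" if chi2_obs >= chi2_crit else "do not reject H0")
```

Because the code uses unrounded expected frequencies, the statistic may differ slightly in the second decimal from a hand computation with rounded values, but the decision is the same.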
