Hi all. This blog is an English archive of my PhD experience at Imperial College London, mainly logging my research and working process, along with some visual records.

Wednesday 29 August 2007

A Partial Syllabus of Data Analysis

Probability

PROBABILITY THEORY :

Distributions

CONTINUOUS DISTRIBUTIONS :

DISCRETE DISTRIBUTIONS :

Linear Regression

SIMPLE LINEAR REGRESSION

MULTIPLE LINEAR REGRESSION

Estimation

Confidence intervals

Confidence intervals for means of normal distributions

One-sample confidence intervals.
Two-sample confidence intervals: paired samples, independent samples (variances known, unknown but equal, unknown and unequal).

Approximate confidence intervals on means

Asymptotic interval (no demonstration).
Welch's approximation.
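
A minimal sketch of Welch's approximation for two independent samples with unknown and unequal variances, using only the Python standard library. The function name `welch_interval_parts` is hypothetical; the interval itself would be the difference of means plus or minus a Student t quantile (at the returned degrees of freedom) times the standard error.

```python
import math
from statistics import mean, variance

def welch_interval_parts(x, y):
    """Return (difference of means, standard error, Welch degrees of freedom)."""
    n1, n2 = len(x), len(y)
    # Squared standard error of each sample mean (unbiased sample variances)
    v1, v2 = variance(x) / n1, variance(y) / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return mean(x) - mean(y), se, df

diff, se, df = welch_interval_parts([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The approximate degrees of freedom always fall between min(n1, n2) - 1 and n1 + n2 - 2, and are generally not an integer.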

Mean Square Error (MSE)

Mean Square Error (MSE)
Minimum Mean Square Error (MMSE) estimators

MSE of a parameter estimator.
Best estimate of a random variable X.
Best estimate of X when a second r.v. Y is available.
Properties of Minimum Mean Square Error estimators.

Sufficient statistic

First examples of sufficient statistics

Sufficient statistics for :
* The Bernoulli distribution b(p),
* The uniform distribution U[0, θ],
* The Poisson distribution P(λ),
from the definition only.

The factorization theorem and applications

A necessary and sufficient condition for a statistic to be sufficient.
Examples : Bernoulli, uniform, Poisson, normal (two methods), Gamma, exponential.
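
As an illustration of the factorization theorem, here is a sketch of the Poisson case. The notation (a sample x_1, ..., x_n, the statistic T, and the functions g and h) is an assumption chosen for this example, not taken from the syllabus.

```latex
% Joint probability of an i.i.d. Poisson sample x_1, ..., x_n with parameter \lambda
\[
f(x_1,\dots,x_n;\lambda)
  = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}
  = \underbrace{e^{-n\lambda}\,\lambda^{\sum_{i} x_i}}_{g(T(x),\,\lambda)}
    \;\cdot\;
    \underbrace{\frac{1}{\prod_{i} x_i!}}_{h(x)},
\qquad T(x) = \sum_{i=1}^{n} x_i .
\]
% The parameter \lambda appears only in g, and g depends on the data only through T,
% so T is a sufficient statistic for \lambda.
```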

Tests

ANOVA (One way)

Overview of ANOVA

General principle of ANOVA

Variance decomposition

Total Sum of Squares, Factorial and Residual sums of Squares.
A purely geometrical step.
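
The decomposition can be checked numerically. The sketch below, with a hypothetical helper name, computes the three sums of squares for a one-way layout and lets the reader verify that the total equals the factorial (between-group) part plus the residual (within-group) part.

```python
from statistics import mean

def anova_sums_of_squares(groups):
    """Return (total, factorial, residual) sums of squares for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Total: every observation against the grand mean
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    # Factorial: group means against the grand mean, weighted by group size
    ss_factorial = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual: every observation against its own group mean
    ss_residual = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return ss_total, ss_factorial, ss_residual

total, factorial, residual = anova_sums_of_squares([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

The identity is Pythagoras' theorem applied to orthogonal vectors, which is why it holds for any data, with no distributional assumption.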

Distributions of the Sums of Squares

Sums of Squares as random variables. Distributions, independence. Properties as estimators of the common variance.

ANOVA's F test

ANOVA is an F test.

Dunnett's test

Dunnett's test

Comparing group means to the mean of a reference group.

t-test

What are t-tests ?

Is a sample average trustworthy ?

One-sample t-test

Is the sample mean significantly different from expected ?
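
A minimal sketch of the one-sample t statistic, assuming the usual form t = (sample mean - expected mean) / (s / sqrt(n)) with s the unbiased sample standard deviation. The function name and the example data are hypothetical; turning t into a p-value requires the Student distribution with n - 1 degrees of freedom, which is not computed here.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 denominator
    t = (mean(sample) - mu0) / standard_error
    return t, n - 1

t, df = one_sample_t([1, 2, 3, 4, 5], mu0=2)
```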

Student's t

Distribution of the mean when the variance is unknown

"Two dependent samples" t-test

Are the means of 2 dependent samples equal ?

"Two Independent samples" t-test

Are the means of 2 independent samples equal ?

t-test results

How do I read software results of t-tests ?

Chi-square tests

The basic Chi-square test

Does a sample match a multinomial distribution ?

Continuous reference distribution

Adapting the test to a continuous variable

Estimated parameters

If some parameters of the reference distribution are unknown

The Chi-square test of equality

Do several samples originate from the same distribution ?

The Chi-square test of independence

Are two categorical variables independent ?

Complements on the Chi-square of independence

Largest value, contributions, alternate coefficients.

The Fisher-Irwin test

The Fisher-Irwin test

Are these two coins identically biased ?

The Kolmogorov-Smirnov test

The Kolmogorov statistic

Its definition, distribution function, and the ensuing test.
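
As a sketch of the definition, the Kolmogorov statistic can be taken as the largest absolute gap between the empirical distribution function of the sample and a fully specified continuous reference distribution function. The function name and the `cdf` argument are assumptions made for this example.

```python
def kolmogorov_statistic(sample, cdf):
    """Largest absolute gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Example against the uniform distribution on [0, 1], whose CDF is the identity there
d = kolmogorov_statistic([0.1, 0.4, 0.7], lambda x: x)
```

Only the sample points need to be examined, because between two consecutive observations the empirical function is flat while the reference function is non-decreasing.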

Complements on the Kolmogorov test

Very short on : K-test or Chi-2 test ? Estimated parameters. Normality test.

The Mann-Whitney test

The Mann-Whitney statistic

Its definition, distribution function, and the ensuing test.
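
A minimal sketch of one common convention for the Mann-Whitney statistic: count the pairs in which an observation of the first sample exceeds an observation of the second, with ties counted as one half. Other texts use the complementary count or the rank-sum form, so the convention here is an assumption, as is the function name.

```python
def mann_whitney_u(x, y):
    """Number of pairs (a, b), a in x and b in y, with a > b; ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

u_xy = mann_whitney_u([4, 5], [1, 2, 3])
u_yx = mann_whitney_u([1, 2, 3], [4, 5])
```

Whatever the data, the two counts add up to the number of pairs, len(x) * len(y), so either one determines the other.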

Complements on the Mann-Whitney test

Very short on : Why ranks ? Location-shift test.

Newman-Keuls test

The Newman-Keuls test

Pairwise comparisons of group means that avoid "paradoxical" conclusions.

Classification

Fisher's linear discriminant

Fisher's criterion and Fisher's vector

Definition and justification of Fisher's criterion.
Maximizing Fisher's criterion (2 classes).
Fisher's discriminant.
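
For two classes, maximizing Fisher's criterion leads to a vector proportional to the inverse of the within-class scatter matrix applied to the difference of the class means. The sketch below works this out for two-dimensional points in plain Python; the function name is hypothetical, and it assumes the pooled scatter matrix is invertible.

```python
from statistics import mean

def fisher_vector_2d(class_a, class_b):
    """Fisher's vector w = S_W^{-1} (m_a - m_b) for lists of 2-D points."""
    def centroid(points):
        return mean(p[0] for p in points), mean(p[1] for p in points)

    def scatter(points, m):
        # Entries (sxx, sxy, syy) of the symmetric scatter matrix around m
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    ma, mb = centroid(class_a), centroid(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sxx, sxy, syy = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]
    det = sxx * syy - sxy * sxy  # assumed non-zero
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det
    return (syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det

w = fisher_vector_2d([(0, 0), (1, 0), (0, 1), (1, 1)],
                     [(3, 0), (4, 0), (3, 1), (4, 1)])
```

Only the direction of the vector matters: projecting the observations on it gives the one-dimensional view in which the two classes are best separated in the sense of the criterion.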

Maximizing the generalized Fisher's criterion

Maximizing the ratio of two quadratic forms.
Maximizing the generalized Fisher's criterion.

Discriminant Analysis

What is Discriminant Analysis ?

The most basic classification technique.

Discriminant Function Analysis

Finding new variables that are good at separating classes.

Building a classifier

Creating linear or quadratic Classification Functions.

Complements on DA

Just a little bit of maths.

Logistic Regression

What is Logistic Regression ?

LR is a powerful generalization of Discriminant Analysis.

What is the "logit" ?

The information needed to build a score.

Linear logit beyond DA

Getting rid of the normality assumption.

Estimating the coefficients of the model

Likelihood, and how it is maximized.

Decision Trees

What are Decision Trees ?

Heuristic, yet powerful classifiers. Can do Regression too.

Growing a Tree

Node splitting, Tree growth and Tree use.

Three types of predictors

Handling categorical, ordinal and numerical predictors

Splitting a node

Misclassification, Gini index, Entropy, Chi-square, Twoing
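
The first three splitting criteria are node impurity measures computed from the class counts in a node. A minimal sketch, with hypothetical function names, assuming the node contains at least one observation:

```python
import math

def class_proportions(counts):
    """Proportions of the non-empty classes in a node."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def misclassification(counts):
    # Error rate when predicting the majority class
    return 1.0 - max(class_proportions(counts))

def gini(counts):
    # Probability that two observations drawn with replacement differ in class
    return 1.0 - sum(p * p for p in class_proportions(counts))

def entropy(counts):
    # Shannon entropy in bits; empty classes contribute nothing
    return -sum(p * math.log2(p) for p in class_proportions(counts))

balanced = (misclassification([5, 5]), gini([5, 5]), entropy([5, 5]))
pure = (misclassification([10, 0]), gini([10, 0]), entropy([10, 0]))
```

All three are zero for a pure node and largest for a perfectly balanced one. A candidate split is then typically scored by the decrease in impurity from the parent to the size-weighted average of its children.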

Priors and costs

Weighting the observations to favorably bias the Tree.

Stopping rules and Pruning

Getting the right size Tree to avoid overfitting

Exploratory Data Analysis

Principal Components Analysis (PCA)

What is PCA ?

An optimal way to display data on a plane, and more.

What are Principal Components ?

The most efficient synthetic variables for representing data.

Finding the Principal Components

Maximizing the inertia of projected observations.

Projection of the observations

The best projection of data on a plane.

Projection of the variables

Visualizing correlation between variables.

Interpreting PCA results

Interpreting the Principal Components and data distribution.

Other applications of PCA

Data Compression and Dimensionality reduction.

Correspondence Analysis

Overview of Correspondence Analysis

Visualizing the interaction of two categorical variables.

Reformatting data

Contingency tables, frequencies, profiles.

The Chi-square distance

...is more appropriate than Euclidean distance.
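
A minimal sketch of the squared Chi-square distance between two row profiles of a contingency table: the usual squared difference of profiles, with each column weighted by the inverse of its overall relative frequency. The function name is hypothetical, and the sketch assumes no row or column of the table is entirely zero.

```python
def chi_square_distance_sq(table, i, k):
    """Squared Chi-square distance between row profiles i and k of a count table."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Relative frequency (mass) of each column over the whole table
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    # Row profiles: each row divided by its own total
    profile_i = [v / sum(table[i]) for v in table[i]]
    profile_k = [v / sum(table[k]) for v in table[k]]
    return sum((a - b) ** 2 / m for a, b, m in zip(profile_i, profile_k, col_mass))

d2 = chi_square_distance_sq([[10, 20], [20, 10]], 0, 1)
```

Dividing by the column mass keeps frequent columns from dominating the distance, and two rows with proportional counts are at distance zero regardless of their totals.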

The two PCAs

How many dimensions, barycenters, total inertia.

General principles of interpretation of CA

Factors, weights, inertias, plots, quality of representation.

Complete treatment of a real case

A simple example from A to Z

Complete treatment of a real case (1)

Interpreting the inertia, the Chi-square, the factors.

Complete treatment of a real case (2)

Interpreting the plot of modalities for each variable.

Complete treatment of a real case (3)

Interpreting the combined plot of modalities.

Complements on CA

Supplementary variables, ordinal variables, Guttman effect.
