Hi all, this blog is an English archive of my PhD experience at Imperial College London, mainly logging my research and working process, as well as some visual records.

Monday, 3 September 2007

Linear regression

In statistics, linear regression is a regression method that models the relationship between a dependent variable Y, independent variables Xi, i = 1, ..., p, and a random term ε. The model can be written as

Example of linear regression with one dependent and one independent variable.

Y = \beta_1  + \beta_2 X_2 +  \cdots +\beta_p X_p + \varepsilon

where β1 is the intercept ("constant" term), the βi (i = 2, ..., p) are the respective parameters of the independent variables, and p is the number of parameters to be estimated in the linear regression. Linear regression can be contrasted with nonlinear regression.

This method is called "linear" because the relation of the response (the dependent variable Y) to the independent variables is assumed to be a linear function of the parameters. It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of Y = β0 + βx is a straight line or that Y is a linear function of the X variables. But if the model is (for example)

Y = \alpha + \beta x + \gamma x^2 + \varepsilon

the problem is still one of linear regression, that is, linear in x and x^2 respectively (and hence linear in the parameters α, β and γ), even though the graph of Y against x by itself is not a straight line.
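To make the point concrete, here is a minimal sketch in plain Python that fits the quadratic model with ordinary linear least squares. It treats 1, x and x^2 as three separate regressors and solves the normal equations. The helper names (`solve_linear_system`, `fit_quadratic`) and the sample data are illustrative assumptions, not part of the original post.

```python
def solve_linear_system(a, b):
    """Solve the square system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_quadratic(xs, ys):
    """Least-squares fit of y = alpha + beta*x + gamma*x**2 (linear in the parameters)."""
    # Each row of the design matrix holds the regressors 1, x and x**2.
    design = [[1.0, x, x * x] for x in xs]
    k = 3
    # Normal equations: (X^T X) params = X^T y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
alpha, beta, gamma = fit_quadratic(xs, ys)
print(alpha, beta, gamma)
```

The curve is a parabola in x, yet the fitting step is only a linear system in α, β and γ, which is what "linear" refers to.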


Sample data.

Say we have a set of data, shown at the left. If we have reason to believe that there exists a linear relationship between the variables x and y, we can plot the data and draw a "best-fit" straight line through the data. Of course, this relationship is governed by the familiar equation y = mx + b. We can then find the slope, m, and y-intercept, b, for the data, which are shown in the figure below.

Let's enter the above data into an Excel spreadsheet, plot the data, create a trendline and display its slope, y-intercept and R-squared value. Recall that the R-squared value is the square of the correlation coefficient. (Most statistical texts show the correlation coefficient as "r", but Excel shows the coefficient as "R". Whether you write it as r or R, the correlation coefficient gives us a measure of the reliability of the linear relationship between the x and y values. (Values close to 1 in magnitude indicate excellent linear reliability.))

Enter your data as we did in columns B and C. The reason for this is strictly cosmetic as you will soon see.


Linear regression equations.

If we expect a set of data to have a linear correlation, it is not necessary for us to plot the data in order to determine the constants m (slope) and b (y-intercept) of the equation y = mx + b. Instead, we can apply a statistical treatment known as linear regression to the data and determine these constants.

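Given a set of data with n data points, the slope, y-intercept and correlation coefficient, r, can be determined using the following:

m = \frac{n \sum (xy) - \sum x \sum y}{n \sum (x^2) - (\sum x)^2}

b = \frac{\sum y - m \sum x}{n}

r = \frac{n \sum (xy) - \sum x \sum y}{\sqrt{[n \sum (x^2) - (\sum x)^2][n \sum (y^2) - (\sum y)^2]}}

(Note that the limits of the summation, which run from i = 1 to n, and the summation indices on x and y have been omitted.)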
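As a rough illustration, the standard closed-form least-squares expressions for m, b and r can be written as one small Python function. The function name `linear_regression` and the six data points are hypothetical, chosen only for this sketch; the formulas appear in the comments so the block can be read on its own.

```python
import math


def linear_regression(xs, ys):
    """Return (slope, intercept, r) for the best-fit line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # n*sum(x^2) - (sum x)^2 and its y counterpart; zero means no spread.
    spread_x = n * sum_x2 - sum_x ** 2
    spread_y = n * sum_y2 - sum_y ** 2
    if spread_x == 0 or spread_y == 0:
        raise ValueError("x and y must each take at least two distinct values")

    # m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum x)^2)
    slope = (n * sum_xy - sum_x * sum_y) / spread_x
    # b = (sum(y) - m*sum(x)) / n
    intercept = (sum_y - slope * sum_x) / n
    # r = (n*sum(xy) - sum(x)*sum(y)) / sqrt(spread_x * spread_y)
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(spread_x * spread_y)
    return slope, intercept, r


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
m, b, r = linear_regression(xs, ys)
print(f"slope={m:.4f} intercept={b:.4f} r={r:.4f} r^2={r * r:.4f}")
```

Squaring the returned r gives the R-squared value that a trendline reports for the same data.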


Implicitly applying regression to the sample data.

The above equations may look quite complicated; however, upon inspection, we see that their components are nothing more than simple algebraic manipulations of the raw data. We can expand our spreadsheet to include these components.

  1. First, add three columns that will be used to determine the quantities xy, x^2 and y^2, for each data point.

  2. Next, use Excel to evaluate the following: Σx, Σy, Σ(xy), Σ(x^2), Σ(y^2), (Σx)^2, (Σy)^2. Recall that the symbol Σ means "summation". Additionally, the term xy is the product of x and y, that is: x * y. Also, the term Σ(x^2) is very different from the term (Σx)^2. Be careful with your order of operations!

  3. Now use Excel to count the number of data points, n. (To do this, use the Excel COUNT() function. The syntax for COUNT() in this example is: =COUNT(B3:B8) and is shown in the formula bar in the screenshot below.)

  4. Finally, use the above components and the linear regression equations given in the previous section to calculate the slope (m), y-intercept (b) and correlation coefficient (r) of the data. If you are careful, your spreadsheet should look like ours. Note that our equations for the slope, y-intercept and correlation coefficient are highlighted in yellow.
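The four spreadsheet steps can also be mirrored line by line in Python, which makes the difference between Σ(x^2) and (Σx)^2 easy to check. This is only a sketch: the six data points are hypothetical stand-ins for the values in columns B and C, and the variable names are made up for illustration.

```python
import math

# Hypothetical sample data standing in for spreadsheet columns B and C.
x_col = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_col = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

# Step 1: three helper columns, one value per data point.
xy_col = [x * y for x, y in zip(x_col, y_col)]
x2_col = [x * x for x in x_col]
y2_col = [y * y for y in y_col]

# Step 2: column totals, then the squares of the x and y totals.
sum_x, sum_y = sum(x_col), sum(y_col)
sum_xy, sum_x2, sum_y2 = sum(xy_col), sum(x2_col), sum(y2_col)
sq_sum_x = sum_x ** 2  # (sum of x) squared, NOT the sum of x squared
sq_sum_y = sum_y ** 2

# Step 3: number of data points, the counterpart of =COUNT(B3:B8).
n = len(x_col)

# Step 4: slope, y-intercept and correlation coefficient from the components.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sq_sum_x)
b = (sum_y - m * sum_x) / n
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sq_sum_x) * (n * sum_y2 - sq_sum_y)
)

print(f"sum(x^2) = {sum_x2}, (sum x)^2 = {sq_sum_x}")
print(f"n = {n}, m = {m:.4f}, b = {b:.4f}, r = {r:.4f}")
```

For this made-up data Σ(x^2) is 91 while (Σx)^2 is 441, so mixing the two up would change the slope completely.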
