Linear regression



Overview

In statistics, linear regression is a regression method that models the relationship between a dependent variable Y, independent variables Xi, i = 1, ..., p, and a random term ε. The model can be written as

<math>Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots +\beta_p X_p + \varepsilon</math>

where <math>\beta_0</math> is the intercept ("constant" term), the <math>\beta_i</math>s are the parameters associated with the respective independent variables, and <math>p</math> is the number of independent variables (so that <math>p+1</math> parameters, including the intercept, are estimated in the linear regression). Linear regression can be contrasted with nonlinear regression.

This method is called "linear" because the relation of the response (the dependent variable <math>Y</math>) to the independent variables is assumed to be a linear function of the parameters. It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of <math>Y = \beta_{0}+\beta_{1} x </math> is a straight line or that <math>Y</math> is a linear function of the X variables. But if the model is (for example)

<math>Y = \alpha + \beta x + \gamma x^2 + \varepsilon</math>

the problem is still one of linear regression, because the model is linear in the parameters <math>\alpha</math>, <math>\beta</math> and <math>\gamma</math>, even though the graph of <math>Y</math> against <math>x</math> by itself is not a straight line.
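As a concrete illustration, the quadratic model above can be fitted by ordinary least squares simply by treating x and x² as two separate regressors. The following is a minimal sketch in Python/NumPy using synthetic, made-up data:

<pre>
import numpy as np

# Synthetic data (illustrative only): y = 2 + 1.5*x - 0.3*x^2 + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 1.5 * x - 0.3 * x**2 + rng.normal(scale=0.5, size=x.size)

# Design matrix with columns 1, x, x^2 -- the model is linear in the
# parameters (alpha, beta, gamma) even though it is quadratic in x.
X = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # estimates of (alpha, beta, gamma)
</pre>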

Historical remarks

The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805,[1] and by Gauss in 1809.[2] The term “least squares” is from Legendre’s term, moindres carrés. However, Gauss claimed that he had known the method since 1795.

Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun. Euler had worked on the same problem (1748) without success. Gauss published a further development of the theory of least squares in 1821,[3] including a version of the Gauss–Markov theorem.

Notation and naming convention

In the notation below:

  • a vector of variables is denoted with an arrow over the symbol, such as <math> \vec X</math>
  • matrices are denoted using a bolded font, such as X
  • the vector of parameters ("constants") is written as β without a subscript

The product of the design matrix X and the parameter vector β is written as Xβ. The dependent variable, Y, in regression is conventionally called the "response variable." The independent variables (in vector form) are called the explanatory variables or regressors. Other terms include "exogenous variables," "input variables," and "predictor variables".

A hat, <math>\hat{}</math>, over a variable denotes that the variable or parameter has been estimated; for example, <math>\hat\beta</math> denotes the estimated values of the parameter vector β.

The linear regression model

The linear regression model can be written in vector-matrix notation as

<math> \ Y = X\beta + \varepsilon.\, </math>

The term ε is the model's "error term" (a misnomer but a standard usage) and represents the unpredicted or unexplained variation in the response variable; it is conventionally called the “error” whether it is really a measurement error or not, and is assumed to be independent of <math>\vec X</math>. For simple linear regression, where there is only a single explanatory variable and two parameters, the above equation reduces to:

<math>y = a+bx+\varepsilon.\, </math>

An equivalent formulation that explicitly shows the linear regression as a model of conditional expectation can be given as

<math> \mbox{E}(y|x) = \alpha + \beta x \, </math>

where the conditional distribution of y given x is the distribution of the error term shifted by <math>\alpha + \beta x</math>.
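For simple linear regression the least-squares estimates have a well-known closed form, b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and a = ȳ − b x̄. The sketch below (synthetic data; NumPy assumed available) checks this against numpy.polyfit:

<pre>
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 0.8 * x + rng.normal(scale=1.0, size=x.size)   # y = a + b*x + noise

# Closed-form least-squares estimates for the simple model y = a + b*x + eps
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Cross-check against NumPy's degree-1 polynomial fit
b_np, a_np = np.polyfit(x, y, 1)   # returns slope, then intercept
print(a, b)                        # closed-form estimates
print(a_np, b_np)                  # should agree to floating-point precision
</pre>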

Types of linear regression

There are many different approaches to solving the regression problem, that is, determining suitable estimates for the parameters.

Least-squares analysis

The theory of least-squares analysis was developed by Carl Friedrich Gauss in the early nineteenth century (see the historical remarks above). This method uses the following Gauss-Markov assumptions:

  • The random errors εi have expected value 0.
  • The random errors εi are uncorrelated (this is weaker than an assumption of probabilistic independence).
  • The random errors εi are homoscedastic, i.e., they all have the same variance.

(See also the Gauss-Markov theorem.) These assumptions imply that least-squares estimates of the parameters are optimal in a certain sense: among all linear unbiased estimators, they have the smallest variance.

A linear regression with p parameters (including the regression intercept <math>\beta_1</math>) and n data points (sample size), with <math>n\geq p+1</math> so that standard errors can also be estimated, allows construction of the following vectors and design matrix:

<math> \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{12} & x_{13} & \dots & x_{1p} \\ 1 & x_{22} & x_{23} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n2} & x_{n3} & \dots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} </math>

or, from vector-matrix notation above,

<math> \ y = \mathbf{X}\cdot\beta + \varepsilon.\, </math>

Each data point can be written as <math>(\vec x_i, y_i)</math>, <math>i=1,2,\dots,n</math>. For n = p, standard errors of the parameter estimates cannot be calculated. For n less than p, the parameters themselves cannot be estimated.

The estimated values of the parameters can be given as

<math>\widehat{\beta} =(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T {\vec y}</math>

Using the assumptions provided by the Gauss-Markov theorem, it is possible to analyse the results and determine whether or not the model obtained using least squares is valid. The number of degrees of freedom is given by n − p.
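The estimate <math>\widehat{\beta} =(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \vec y</math> can be computed directly in code. A minimal sketch with made-up data (using a linear solve rather than an explicit matrix inverse, a common numerical choice):

<pre>
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3                          # n data points, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)

dof = n - p                           # degrees of freedom
</pre>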

The residuals, representing the difference between the observations and the model's predictions, are required to analyse the regression. They are determined from

<math>\hat{\vec\varepsilon} = \vec y - \mathbf{X} \hat\beta\,</math>

The estimated standard deviation of the errors, <math>\hat \sigma </math>, is determined from

<math>\hat \sigma = \sqrt{ \frac {\hat{\vec\varepsilon}^T \hat{\vec\varepsilon}} {n-p}} = \sqrt {\frac{ \vec y^T \vec y - \hat\beta^T \mathbf{X}^T \vec y}{n - p}}</math>

Assuming the errors are normally distributed, the variance of the errors can be described using the chi-square distribution:

<math>\hat\sigma^2 \sim \frac { \chi_{n-p}^2 \ \sigma^2 } {n-p}</math>

The <math>100(1-\alpha)\% </math> confidence interval for the parameter, <math>\beta_i </math>, is computed as follows:

<math>

{\widehat \beta_i \pm t_{\frac{\alpha }{2},n - p} \hat \sigma \sqrt {(\mathbf{X}^T \mathbf{X})_{ii}^{ - 1} } }

</math>

where t follows Student's t-distribution with <math>n-p</math> degrees of freedom and <math> (\mathbf{X}^T \mathbf{X})_{ii}^{ - 1}</math> denotes the value located in the <math>i^{th}</math> row and column of the matrix <math>(\mathbf{X}^T \mathbf{X})^{-1}</math>.
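Putting the residuals, <math>\hat\sigma</math> and the t quantile together, a confidence interval for each parameter can be computed as in the following sketch (synthetic data; SciPy is assumed to be available for the t-distribution):

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p))     # estimated error std. deviation

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)    # t_{alpha/2, n-p}
half_width = t_crit * sigma_hat * np.sqrt(np.diag(XtX_inv))
for i, (b, h) in enumerate(zip(beta_hat, half_width)):
    print(f"beta_{i + 1}: {b:.3f} +/- {h:.3f}")  # 95% confidence intervals
</pre>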

The <math>100(1-\alpha)\% </math> mean response confidence interval for a prediction (interpolation or extrapolation) at a value <math>\vec{x} = \vec {x_0}</math> is given by:

<math>

{ \vec {x_0} \widehat\beta \pm t_{\frac{\alpha }{2},n - p} \hat \sigma \sqrt { \vec {x_0} (\mathbf{X}^T \mathbf{ X})_{}^{ - 1} \vec {x_0}^T } }

</math>

where <math>\vec {x_0} = \langle 1, x_{2}, x_{3}, \dots, x_{p} \rangle </math>.

The <math>100(1-\alpha)\% </math> predicted response (prediction) interval for a new observation at <math>\vec {x_0}</math> is given by:

<math>

{ \vec {x_0} \widehat\beta \pm t_{\frac{\alpha }{2},n - p} \hat \sigma \sqrt {1 + \vec {x_0} (\mathbf{X}^T \mathbf{X})_{}^{ - 1} \vec {x_0}^T } }

</math>.
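The mean-response and prediction intervals at a chosen point <math>\vec x_0</math> differ only by the extra "1 +" term under the square root, as the following sketch (again with made-up data) shows:

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p))
t_crit = stats.t.ppf(0.975, df=n - p)                    # 95% intervals

x0 = np.array([1.0, 5.0])                                # <1, x_2> at the point of interest
y0_hat = x0 @ beta_hat
leverage = x0 @ XtX_inv @ x0                             # x_0 (X^T X)^{-1} x_0^T

mean_half = t_crit * sigma_hat * np.sqrt(leverage)       # mean-response interval
pred_half = t_crit * sigma_hat * np.sqrt(1 + leverage)   # prediction interval
print(f"mean response: {y0_hat:.2f} +/- {mean_half:.2f}")
print(f"new observation: {y0_hat:.2f} +/- {pred_half:.2f}")
</pre>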

The regression sum of squares SSR is given by:

<math>

{\mathit{SSR} = \sum {\left( {\hat y_i - \bar y} \right)^2 } = \hat\beta^T \mathbf{X}^T \vec y - \frac{1}{n}\left( { \vec y^T \vec u \vec u^T \vec y} \right)}

</math>

where <math> \bar y = \frac{1}{n} \sum y_i</math> and <math> \vec u </math> is an n by 1 unit vector (i.e. each element is 1). Note that the term <math>\frac{1}{n} y^T u u^T y</math> is equivalent to <math> \frac{1}{n} (\sum y_i)^2</math>.

The error sum of squares ESS is given by:

<math>

{\mathit{ESS} = \sum {\left( {y_i - \hat y_i } \right)^2 } = \vec y^T \vec y - \hat\beta^T \mathbf{X}^T \vec y}. </math>

The total sum of squares TSS is given by

<math>

{\mathit{TSS} = \sum {\left( {y_i - \bar y} \right)^2 } = \vec y^T \vec y - \frac{1}{n}\left( { \vec y^T \vec u \vec u^T \vec y} \right) = \mathit{SSR}+ \mathit{ESS}}. </math>

The coefficient of determination, <math>R^2</math>, is then given as

<math>

{R^2 = \frac{\mathit{SSR}}{{\mathit{TSS}}} = 1 - \frac{\mathit{ESS}}{\mathit{TSS}}}. </math>
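The sums of squares and <math>R^2</math> can be computed directly from the fitted values; a short sketch with synthetic data, which also verifies the decomposition TSS = SSR + ESS:

<pre>
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -1.5]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

SSR = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
ESS = np.sum((y - y_hat) ** 2)          # error (residual) sum of squares
TSS = np.sum((y - y.mean()) ** 2)       # total sum of squares

print(np.isclose(TSS, SSR + ESS))       # the decomposition TSS = SSR + ESS
print("R^2 =", SSR / TSS)               # equivalently 1 - ESS/TSS
</pre>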

Assessing the least-squares model

Once the above values have been computed, the model should be checked for two different things:

  1. Whether the assumptions of least-squares are fulfilled and
  2. Whether the model is valid

Checking model assumptions

The model assumptions are checked by calculating the residuals and plotting them. The residuals are calculated as follows:

<math>\hat{\vec\varepsilon} = \vec y - \hat{\vec y} = \vec y - \mathbf{X} \hat\beta\,</math>

The following plots can be constructed to test the validity of the assumptions:

  1. A normal probability plot of the residuals to test normality. The points should lie along a straight line.
  2. A time series plot of the residuals, that is, plotting the residuals as a function of time.
  3. Residuals against the explanatory variables, <math> \mathbf{ X} </math>.
  4. Residuals against the fitted values, <math>\hat{\vec y}\,</math>.
  5. Residuals against the preceding residual.

There should not be any noticeable pattern to the data in all but the first plot.
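A sketch of the first, third and fourth of these plots, using matplotlib and SciPy (both assumed available) on synthetic data:

<pre>
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(6)
n = 60
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.7]) + rng.normal(scale=0.6, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
resid = y - y_hat

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
stats.probplot(resid, plot=axes[0])              # normal probability plot
axes[0].set_title("Normal probability plot")
axes[1].scatter(x, resid)                        # residuals vs. explanatory variable
axes[1].axhline(0, color="grey")
axes[1].set_title("Residuals vs. x")
axes[2].scatter(y_hat, resid)                    # residuals vs. fitted values
axes[2].axhline(0, color="grey")
axes[2].set_title("Residuals vs. fitted values")
plt.tight_layout()
plt.show()
</pre>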

Checking model validity

The validity of the model can be checked using any of the following methods:

  1. Using the confidence interval for each of the parameters, <math>\beta_i </math>. If the confidence interval includes 0, the parameter can be removed from the model. Ideally, a new regression analysis excluding that parameter should then be performed, and the process repeated until there are no more parameters to remove.
  2. Calculating the coefficient of determination, <math>R^2</math>. The closer the value is to 1, the better the regression. This coefficient indicates what fraction of the observed variation is explained by the given variables.
  3. Examining the observational and prediction confidence intervals. The smaller they are, the better.
  4. Computing the F-statistic (see the sketch after this list).
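For item 4, a common choice (not spelled out above, so treat it as an assumption of this sketch) is the overall F-statistic, F = (SSR/(p−1)) / (ESS/(n−p)), which under the null hypothesis that all non-intercept parameters are zero follows an F distribution with p−1 and n−p degrees of freedom:

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -1.5]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
SSR = np.sum((y_hat - y.mean()) ** 2)
ESS = np.sum((y - y_hat) ** 2)

# Overall F test: H0 says all parameters except the intercept are zero.
F = (SSR / (p - 1)) / (ESS / (n - p))
p_value = stats.f.sf(F, p - 1, n - p)
print(f"F = {F:.2f}, p-value = {p_value:.3g}")
</pre>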

Modifications of least-squares analysis

There are various ways in which least-squares analysis can be modified, including

  • weighted least squares, which is a generalisation of the least squares method
  • polynomial fitting, which involves fitting a polynomial to the given data.

Polynomial fitting

A polynomial fit is a specific type of multiple regression. The simple regression model (a first-order polynomial) can be trivially extended to higher orders. The regression model <math>\scriptstyle y_i \,=\, \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 + \cdots + \alpha_m x_i^m + \varepsilon_i\ (i = 1, 2, \dots , n) </math> is a polynomial model of order m with coefficients <math>\scriptstyle \{ \alpha_0, \dots, \alpha_m \}</math>. As before, we can express the model using the data matrix <math>\scriptstyle \mathbf{X}</math>, target vector <math>\scriptstyle\vec y</math> and parameter vector <math>\scriptstyle\vec \alpha</math>. The ith row of <math>\scriptstyle\mathbf{X}</math> and <math>\scriptstyle\vec y</math> contains the x and y values for the ith data sample. Then the model can be written as a system of linear equations:

<math> \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}= \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^m \\ 1 & x_2 & x_2^2 & \dots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^m \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} </math>

which when using pure matrix notation remains, as before,

<math>Y = \mathbf{X} \vec \alpha + \varepsilon, \,</math>

and the vector of polynomial coefficients is

<math>\widehat{\vec \alpha} = (\mathbf{X}^T \mathbf{X})^{-1}\; \mathbf{X}^T Y. \,</math>
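A sketch of a degree-m polynomial fit built from the design matrix above, using synthetic data; numpy.polyfit is used only as a cross-check:

<pre>
import numpy as np

rng = np.random.default_rng(8)
m = 3                                          # polynomial order
x = np.linspace(-2, 2, 40)
y = 1 - 2 * x + 0.5 * x**3 + rng.normal(scale=0.2, size=x.size)

# Columns 1, x, x^2, ..., x^m (increasing powers, matching the model above)
X = np.vander(x, N=m + 1, increasing=True)
alpha_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y
print(alpha_hat)                               # alpha_0, ..., alpha_m

# np.polyfit returns coefficients highest power first, hence the reversal
print(np.polyfit(x, y, m)[::-1])               # should agree with alpha_hat
</pre>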

Robust regression

A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error as in least-squares regression. Robust regression is much more computationally intensive than least-squares regression and is somewhat more difficult to implement as well. While least-squares estimates are not very sensitive to violations of the normality assumption for the errors, this is not true when the variance or mean of the error distribution is not bounded, or when an analyst who can screen out outliers is not available.

In Stata usage, "robust regression" means linear regression with Huber-White standard error estimates. This relaxes the assumption of homoscedasticity for variance estimates only; the predictors are still ordinary least squares (OLS) estimates.
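To illustrate the least-absolute-deviations idea mentioned above, one can minimise the sum of absolute residuals numerically. This is only a rough sketch with made-up data and a few artificial outliers, not a production robust-regression routine:

<pre>
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=n)
y[:3] += 15                                   # a few gross outliers

X = np.column_stack([np.ones(n), x])

def sum_abs_resid(beta):
    """Least-absolute-deviations criterion (instead of squared error)."""
    return np.sum(np.abs(y - X @ beta))

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # ordinary least squares
beta_lad = minimize(sum_abs_resid, x0=beta_ols, method="Nelder-Mead").x
print("OLS:", beta_ols)   # pulled toward the outliers
print("LAD:", beta_lad)   # much less affected
</pre>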

Applications of linear regression

The trend line

For trend lines as used in technical analysis, see Trend lines (technical analysis)

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over a period of time. A trend line can simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques such as linear regression. Trend lines are typically straight lines, although some variations use higher-degree polynomials depending on the degree of curvature desired in the line.
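A minimal sketch of fitting a straight trend line to a (synthetic) yearly series by least squares:

<pre>
import numpy as np

rng = np.random.default_rng(10)
t = np.arange(2000, 2024)                        # e.g. yearly observations
series = 100 + 2.5 * (t - t[0]) + rng.normal(scale=5, size=t.size)

slope, intercept = np.polyfit(t - t[0], series, 1)
trend = intercept + slope * (t - t[0])           # the fitted trend line
print(f"estimated trend: {slope:.2f} units per year")
</pre>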

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

Examples

Linear regression is widely used in biological, behavioral and social sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines.

Medicine

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are considered to be more trustworthy than a regression analysis.

Finance

Linear regression underlies the capital asset pricing model, and the concept of using Beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the Beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.
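A sketch of estimating Beta by regressing an investment's excess returns on market excess returns; the return series here are synthetic and purely illustrative:

<pre>
import numpy as np

rng = np.random.default_rng(11)
market = rng.normal(0.01, 0.04, size=250)         # market excess returns
asset = 0.002 + 1.3 * market + rng.normal(0, 0.02, size=250)

beta, alpha = np.polyfit(market, asset, 1)        # regression slope = Beta
print(f"Beta ~= {beta:.2f}, alpha ~= {alpha:.4f}")
</pre>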

References

  1. A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes (1805). “Sur la Méthode des moindres quarrés” appears as an appendix.
  2. C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum. (1809)
  3. C.F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae. (1821/1823)

Additional sources

  • Cohen, J., Cohen P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. (2nd ed.) Hillsdale, NJ: Lawrence Erlbaum Associates
  • Charles Darwin. The Variation of Animals and Plants under Domestication. (1869) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)
  • Draper, N.R. and Smith, H. Applied Regression Analysis Wiley Series in Probability and Statistics (1998)
  • Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15:246-263 (1886). (Facsimile at: [1])
  • Robert S. Pindyck and Daniel L. Rubinfeld (1998, 4th ed.). Econometric Models and Economic Forecasts, ch. 1 (Intro, incl. appendices on Σ operators & derivation of parameter est.) & Appendix 4.3 (mult. regression in matrix form).
