This note proves Exercise 7 in Section 3.7 of An Introduction to Statistical Learning: for simple linear regression, \(R^2 = Cor(X, Y)^2\).
For \(n\) observations \((x_i, y_i), i = 1, \dots, n\), let:

\[
ss_{xx} = \sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n x_i^2 - n {\bar x}^2
\tag{1}\label{eq1}
\]
Substituting \(x\) with \(y\), we get:

\[
ss_{yy} = \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n y_i^2 - n {\bar y}^2
\tag{2}\label{eq2}
\]
And:

\[
ss_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \sum_{i=1}^n x_i y_i - n \bar x \bar y
\tag{3}\label{eq3}
\]
For the correlation between \(X\) and \(Y\), also denoted \(Cor(X, Y)\):

\[
Cor(X, Y) = r = \frac{ss_{xy}}{\sqrt{ss_{xx}\,ss_{yy}}}
\tag{4}\label{eq4}
\]
Here \(\bar x = \frac{\sum_{i=1}^n x_i}{n}\), \(\bar y = \frac{\sum_{i=1}^n y_i}{n}\).
See Correlation Coefficient and Least Squares Fitting for the detailed reasoning.
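As a quick numerical sanity check of eq\(\eqref{eq1}\)–\(\eqref{eq4}\) (a sketch; the sample arrays are arbitrary), the closed form for \(r\) agrees with NumPy's built-in correlation coefficient:

```python
import numpy as np

# Arbitrary sample data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
xbar, ybar = x.mean(), y.mean()

# eq (1)-(3): the ss quantities, in centered form
ss_xx = np.sum((x - xbar) ** 2)
ss_yy = np.sum((y - ybar) ** 2)
ss_xy = np.sum((x - xbar) * (y - ybar))

# The expanded forms give the same values
assert np.isclose(ss_xx, np.sum(x ** 2) - n * xbar ** 2)
assert np.isclose(ss_xy, np.sum(x * y) - n * xbar * ybar)

# eq (4): r = ss_xy / sqrt(ss_xx * ss_yy)
r = ss_xy / np.sqrt(ss_xx * ss_yy)
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```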
For the fitted regression line \(y = a + bx\), \(a\) and \(b\) are the values that minimize \(RSS\), where

\[
RSS = \sum_{i=1}^n (y_i - a - b x_i)^2
\tag{5}\label{eq5}
\]
Setting the partial derivatives to zero, we have:

\[
\frac{\partial RSS}{\partial a} = -2 \sum_{i=1}^n (y_i - a - b x_i) = 0
\]

\[
\frac{\partial RSS}{\partial b} = -2 \sum_{i=1}^n x_i (y_i - a - b x_i) = 0
\]
These lead to:

\[
n a + b \sum_{i=1}^n x_i = \sum_{i=1}^n y_i
\tag{6}\label{eq6}
\]

\[
a \sum_{i=1}^n x_i + b \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i
\tag{7}\label{eq7}
\]
Since \(\sum_{i=1}^n x_i = n \bar x\), eq\(\eqref{eq7}\) can be written as:

\[
a n \bar x + b \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i
\tag{8}\label{eq8}
\]
From eq\(\eqref{eq6}\) we have:

\[
a = \bar y - b \bar x
\tag{9}\label{eq9}
\]
Substituting eq\(\eqref{eq1}\), \(\eqref{eq3}\) and \(\eqref{eq9}\) into eq\(\eqref{eq8}\), we get:

\[
(\bar y - b \bar x) n \bar x + b \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i
\]

\[
b \left( \sum_{i=1}^n x_i^2 - n {\bar x}^2 \right) = \sum_{i=1}^n x_i y_i - n \bar x \bar y
\]

\[
b = \frac{ss_{xy}}{ss_{xx}}
\tag{10}\label{eq10}
\]
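The closed forms \(a = \bar y - b \bar x\) and \(b = ss_{xy} / ss_{xx}\) can be checked against a library least-squares fit (a sketch with arbitrary data; `np.polyfit` with degree 1 returns the fitted slope and intercept):

```python
import numpy as np

# Arbitrary sample data, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.8, 5.1, 6.9, 9.2])

xbar, ybar = x.mean(), y.mean()
ss_xx = np.sum((x - xbar) ** 2)
ss_xy = np.sum((x - xbar) * (y - ybar))

b = ss_xy / ss_xx    # eq (10)
a = ybar - b * xbar  # eq (9)

# Least-squares fit of a degree-1 polynomial: coefficients are [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(b, slope)
assert np.isclose(a, intercept)
```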
Substituting eq\(\eqref{eq9}\) and \(\eqref{eq10}\) into eq\(\eqref{eq5}\), we get:

\[
RSS = \sum_{i=1}^n \bigl( (y_i - \bar y) - b (x_i - \bar x) \bigr)^2
= ss_{yy} - 2 b\, ss_{xy} + b^2 ss_{xx}
= ss_{yy} - \frac{ss_{xy}^2}{ss_{xx}}
\tag{11}\label{eq11}
\]
Substituting eq\(\eqref{eq11}\) into equation (3.17) of An Introduction to Statistical Learning, and noting that \(TSS = \sum_{i=1}^n (y_i - \bar y)^2 = ss_{yy}\), we have:

\[
R^2 = \frac{TSS - RSS}{TSS}
= 1 - \frac{ss_{yy} - ss_{xy}^2 / ss_{xx}}{ss_{yy}}
= \frac{ss_{xy}^2}{ss_{xx}\,ss_{yy}}
\]
With eq\(\eqref{eq4}\), we get \(R^2 = Cor(X, Y)^2\).
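Putting it all together, a short numeric sketch (arbitrary data again) confirms that the \(R^2\) of the least-squares fit equals the squared correlation:

```python
import numpy as np

# Arbitrary sample data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 4.2, 5.8, 8.3, 9.7, 12.1])

xbar, ybar = x.mean(), y.mean()
ss_xx = np.sum((x - xbar) ** 2)
ss_yy = np.sum((y - ybar) ** 2)
ss_xy = np.sum((x - xbar) * (y - ybar))

b = ss_xy / ss_xx    # eq (10)
a = ybar - b * xbar  # eq (9)

rss = np.sum((y - a - b * x) ** 2)  # ISLR eq (3.16)
tss = ss_yy                         # TSS = sum of (y_i - ybar)^2
r_squared = 1 - rss / tss           # ISLR eq (3.17)

# R^2 equals the squared correlation, as proved above
assert np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
```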
Here \(RSS\), \(R^2\), \(TSS\) and \(Cor(X, Y)\) are defined in Equations (3.16)–(3.18) of An Introduction to Statistical Learning. \(ss\) and \(r\) are defined in Correlation Coefficient and Least Squares Fitting.
Other references: