"

Chapter 1 Simple Linear Regression

Regression is a statistical technique that involves describing the relationship between one or more independent variables and a single dependent variable. For simplicity, assume for now that there is just a single independent variable. To establish some notation, let

  • X be an independent variable, also called an explanatory variable, predictor variable, or regressor, which is typically assumed to take on fixed values (that is, X is not a random variable) which can be observed without error, and
  • Y be a dependent variable, also called a response variable, which is typically a continuous random variable.

The relationship between the independent variable X and the dependent variable Y is often established by collecting n data pairs denoted by (X1, Y1), (X2, Y2), …, (Xn, Yn), plotting these pairs on a pair of axes, and looking for a pattern that can be translated to a mathematical form. This process establishes an empirical mathematical model for the underlying relationship between the independent variable X and the dependent variable Y.

1.1 Deterministic Models

Regression analysis establishes a functional relationship between X and Y. The simplest type of relationship between X and Y is a deterministic relationship [latex]Y=f(X)[/latex]. In this rare case, the value of Y can be determined without error once the value of X is known, so Y is not a random variable when the relationship between X and Y is deterministic. The deterministic model is described by [latex]Y=f(X)[/latex]. Deterministic relationships are uncommon in real-world applications because there is typically uncertainty in the dependent variable. If data pairs (X1, Y1), (X2, Y2), …, (Xn, Yn) are collected and the deterministic relationship [latex]Y=f(X)[/latex] establishes the correct functional relationship between X and Y, then all of the data pairs will fall on the graph of the function [latex]Y=f(X)[/latex].

Figure 1.1: A deterministic linear relationship between X and Y.

Long Description for Figure 1.1

The horizontal axis X ranges from 0 to 8 in increments of 1 unit. The vertical axis Y ranges from 0 to 400 in increments of 100 units. The ordered pair X 1, Y 1 is plotted at (6, 300); X 2, Y 2 is plotted at (8, 400); and X 3, Y 3 is plotted at (2, 100). A line with a positive slope begins from the origin and passes through the three plotted points. All data are approximate.

 

Determining the relationship between the number of sales per week X and the commissions paid per week Y did not require the collection of any data to determine the function [latex]Y=f(X)[/latex]. That linear relationship was implicit in the problem statement. Other cases can arise, such as (a) the relationship is deterministic but requires data to determine its functional form, or (b) the relationship is deterministic, but unlike the relationship in the previous example, it is not linear. The following example illustrates a nonlinear deterministic relationship between the independent variable X and the dependent variable Y.

In most applications, the relationship between the independent variable X and the dependent variable Y is not deterministic because Y is typically a random variable. The next section introduces some of the thinking behind the development of a statistical model that describes the relationship between X and Y.

1.2 Statistical Models

The goal in constructing a statistical model is to write a formula that adequately captures the governing probabilistic relationship between an independent variable X and a dependent variable Y. This formula might be used subsequently for prediction or some other form of statistical inference. In this section, we assume that the dependent variable Y is a continuous random variable that can assume a range of values associated with a particular setting of the independent variable X. The relationship

Y=f(X)

that was used in the previous section is no longer adequate because X is assumed to be observed without error, and this formula results in a value of Y which is deterministic rather than random. One way of overcoming this problem is to replace the left-hand side of this equation by the expected value of Y, which is a constant, resulting in

E[Y]=f(X).

To be a little more careful about what is meant by this statistical relationship, the left-hand side is actually a conditional expectation, namely

E[Y|X=x]=f(x).

In words, given that the independent variable X assumes the value x, the transformation [latex]f(X)[/latex] gives the conditional expected value of the dependent variable Y. Notice that this statistical model does not specify the distribution of the random variable Y for a particular value of X; it only tells us the expected value of Y for a particular value of X. This statistical regression model defines a hypothesized relationship between the observed value of X on the right-hand side of the model and the conditional expected value of Y on the left-hand side of the model. The hypothesized relationship might be adequate for modeling or it might need some refining. There is typically no model that perfectly captures the relationship between X and Y. This was recognized by George Box, who wrote:

All models are wrong; some models are useful.

In a statistical model that involves parameters, the estimation of the model parameters will be followed by assessments to determine whether the model holds in an empirical sense. If the model needs refining, the parameters of the refined model are estimated and new assessments are made to see if the refined model is an improvement over the previous model. Regression modeling is an iterative process.

There is a second way to write a statistical model that is equivalent to the statistical model described in the previous paragraph. The model can be written as

Y=f(X)+ϵ,

where the error term ϵ (also known as the “noise” or “disturbance” term) is a random variable that accounts for the fact that the independent variable cannot predict the dependent variable with certainty. This term makes the relationship between X and Y a random (or statistical or stochastic) relationship rather than a deterministic relationship. If the probability distribution of the error term is specified, then not only is the expected value of Y conditioned on the value of X determined, but also the entire probability distribution of Y conditioned on the value of X is specified. It is common practice to assume that the expected value of ϵ is zero. The probability distribution of ϵ establishes the nature and magnitude of the scatter of the data values about the regression function. When the population variance of ϵ is small, the values of Y are tightly clustered about the regression function [latex]f(X)[/latex]; when the population variance of ϵ is large, the values of Y stray further from the regression function [latex]f(X)[/latex].
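
As a hedged illustration of this formulation, the short Python sketch below simulates data pairs from the model Y = f(X) + ϵ for a hypothetical linear regression function and normally distributed errors; the particular function, design points, seed, and error variance are assumptions chosen only for demonstration.

```python
import numpy as np

# Minimal sketch (hypothetical values): simulate Y = f(X) + epsilon with
# f(X) = 2 + 3X and normal errors with mean zero and variance sigma^2.
rng = np.random.default_rng(seed=1)

x = np.linspace(0, 10, 25)               # fixed X values, assumed observed without error
f = 2 + 3 * x                            # hypothetical regression function f(X)
sigma = 1.5                              # population standard deviation of the error term
epsilon = rng.normal(0, sigma, x.size)   # error terms with E[epsilon] = 0
y = f + epsilon                          # observed values of the dependent variable

# E[Y | X = x] equals f(x); the simulated Y values scatter about f(x) with a
# spread governed by sigma.
print(y[:5])
```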

Regression modeling involves determining the functional form of [latex]f(X)[/latex] from a data set of n data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex]. The statistical model for X and Y in a general sense also applies to each of the data points, so

[latex]Y_i = f(X_i) + \epsilon_i[/latex]

for i = 1, 2, . . . , n. The sign of ϵi indicates whether the observed data pair (Xi, Yi) falls above [latex](\epsilon_i > 0)[/latex] or below [latex](\epsilon_i < 0)[/latex] the conditional expected value of Yi, for i = 1, 2, . . . , n.

The function [latex]f(X)[/latex] is called the regression function, and was first referred to in print as such by Sir Francis Galton (1822–1911), a British anthropologist and meteorologist, in his 1885 paper titled “Regression Toward Mediocrity in Hereditary Stature” published in the Journal of the Anthropological Institute. He established a regression function relating the adult height of an offspring, Y, as a function of an average of the parent’s heights, X, which had been adjusted for gender.

The regression function [latex]Y=f(X)[/latex] can be either linear or nonlinear. The next section focuses on the easier case, a linear regression function. In this case, the model is typically referred to as a simple linear regression model, which is often abbreviated as an SLR model. The model is simple because there is only one independent variable X that is used to predict the dependent variable Y. The model is linear because the regression function [latex]f(X)=\beta_{0}+\beta_{1}X[/latex] is assumed to be linear in the parameters β0 and β1. The more complicated cases of multiple linear regression, which involve more than one independent variable, and nonlinear regression, in which [latex]f(X)[/latex] is not a linear function, will be introduced later.

1.3 Simple Linear Regression Model

A simple linear regression model assumes a linear relationship between an independent variable X and a dependent variable Y. In this section, the more general regression model

Y=f(X)+ϵ

is reduced to the simple linear regression model given in the definition below.

Definition 1.1 A simple linear regression model is given by

[latex]Y = \beta_0 + \beta_1 X + \epsilon,[/latex]

where

  • X is the independent variable, assumed to be a fixed value observed without error,
  • Y is the dependent variable, which is a continuous random variable,
  • β0 is the population intercept of the regression line, which is an unknown constant,
  • β1 is the population slope of the regression line, which is an unknown constant, and
  • ϵ is the error term, a continuous random variable with population mean zero and positive, finite population variance σ2 that accounts for the randomness in the relationship between X and Y.

Stating the simple linear regression model in this fashion might not seem natural from the perspective of probability theory. As a non-regression illustration from probability theory, [latex]W \sim N(\mu, \sigma^{2})[/latex] indicates that W has a normal distribution with population mean μ and population variance σ2. Although much less compact, the probability distribution of W can also be written as [latex]W = \mu + \epsilon[/latex], where [latex]\epsilon \sim N(0, \sigma^{2})[/latex]. This illustration reflects the essence behind writing the simple linear regression model in the form [latex]Y=\beta_{0}+\beta_{1}X + \epsilon[/latex] in Definition 1.1.

The formulation of the simple linear model from Definition 1.1 involves a random variable ϵ on the right-hand side of the model. In some settings, this model might be viewed as a transformation of a random variable, but this is not the correct interpretation of the model in this setting. The simple linear regression model defines a hypothesized relationship between the random variable on the left-hand side of the model and terms on the right-hand side of the model. This probability model is hypothesized to govern the relationship between X and Y. The goal in constructing a simple linear regression model is to determine if it adequately captures the probabilistic relationship between X and Y. Estimation of the model parameters will be followed by assessment to see if the model holds in an empirical sense.

The assumption that the random variable ϵ has population mean zero and population variance σ2 in the most basic simple linear regression model in Definition 1.1 allows for mathematically tractable statistical inference. In models that allow for confidence intervals and hypothesis testing concerning the estimated slope and intercept, the error term is assumed to have a specific distribution, which is typically the normal distribution. The error term models all sources of variation, both known and unknown, other than the variation in Y associated with the particular level of X. Notice that σ2 is constant over all values of X.

The assumption that the independent variable X is not subject to random variability is not always satisfied in practice. The fitting procedure becomes more complicated when X is considered to be a random variable. For this reason, we assume that the observed value of X is either exact or that the variation of X is small enough so that its observed value can be assumed to be exact.

The assumption of a linear relationship between X and Y might also be flawed. In some cases the relationship might not be perfectly linear, but a linear relationship provides a close enough approximation to be useful for associated statistical inference. In other cases, a linear relationship might be appropriate for some range of values of X, known as the scope of the model, but not others. One important step in establishing a simple linear regression model is to specify the values of X for which the simple linear regression model is valid.

The procedure for establishing a simple linear regression model that relates the dependent variable Y to the independent variable X is given below.

  1. Collect the data pairs. The data pairs are denoted by [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex]. In some settings, it is possible to exert some control over the Xi values. As will be seen later, there are advantages to having the Xi values spread out as much as possible in terms of the precision of the fitted regression model.
  2. Make a scatterplot of data pairs. A scatterplot is just a plot of the points [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex] on a set of axes. The purpose of the scatterplot is to see if the linear relationship between X and Y is appropriate and to visually assess the spread of the data values about the regression function. With modern statistical software, scatterplots are easy to generate.
  3. Inspect the scatterplot. Although this step is subjective, it is important to visually assess
    (a) whether the relationship between X and Y appears to be linear or nonlinear, (b) whether the spread of the data pairs about the regression function is small or large, and (c) whether the variability of the data pairs about the regression function remains constant over the range of X values that have been collected.
  4. State the regression model. In this chapter, the regression model is assumed to be the simple linear regression model [latex]Y=\beta_{0}+\beta_{1}X + \epsilon[/latex]. Nonlinear regression models, such as the quadratic model [latex]Y=\beta_{0}+\beta_{1}X +\beta_{2}X^{2} + \epsilon[/latex], and multiple regression models with more than one independent variable, such as [latex]Y=\beta_{0}+\beta_{1}X_{1} +\beta_{2}X_{2} + \epsilon[/latex], will be considered later.
  5. Fit the regression model to the data pairs. The method of least squares, which will be described in the next section, is commonly used to estimate the parameters in the regression model. The least squares criterion is to choose the regression model that minimizes the sum of the squares of the vertical differences between data points and the fitted regression model.
  6. Assess the adequacy of the fitted regression model. Visual assessment techniques for assessing the fitted regression model include superimposing the fitted regression model onto the scatterplot of the data pairs and examining a plot of the residuals. A residual is the signed vertical distance between a data pair and its associated value on the regression function. In addition, there are statistical methods that can be applied to the fitted regression model to see if it adequately describes the relationship between X and Y.
  7. Perform statistical inference. Once the fitted regression model is deemed an acceptable approximation to the relationship between X and Y, it can be used for statistical inference. One simple example of statistical inference that occurs often in practice is the prediction of a future value of Y for a particular level of X.

The seven steps for establishing a regression model are not necessarily performed in the order given here. Many times the fitted regression model is rejected in Step 6, and it is necessary to return to Step 4 in order to formulate an alternative model. Steps 4 through 6 might need to be repeated several times before arriving at an acceptable model for statistical inference.
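
A compact Python sketch of this workflow is given below. The data values are hypothetical, and the plotting steps (Steps 2, 3, and part of 6) are omitted because they would normally be carried out with graphics software; the sketch is an illustration of the flow of the steps, not the textbook's example.

```python
import numpy as np

# Step 1: collect the data pairs (hypothetical values for illustration).
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 8.3, 9.8, 14.2, 16.1])

# Steps 4-5: state the SLR model Y = beta0 + beta1*X + epsilon and fit it by
# least squares; np.polyfit returns the slope first, then the intercept.
b1, b0 = np.polyfit(x, y, deg=1)

# Step 6: residuals, the signed vertical distances from the fitted line.
residuals = y - (b0 + b1 * x)

# Step 7: predict a future value of Y at a particular X within the model's scope.
x_new = 6.0
y_pred = b0 + b1 * x_new
print(b0, b1, residuals.round(3), y_pred)
```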

The simple linear regression model given in Definition 1.1 implies that all of the [latex](X_{i}, Y_{i})[/latex] pairs also follow the simple linear regression model:

[latex]Y_i = \beta_0 + \beta_1 X_i + \epsilon_i[/latex]

for [latex]i = 1, 2,\ldots, n[/latex], where

  • (Xi, Yi) are the data pairs, for [latex]i = 1, 2,\ldots, n[/latex],
  • Xi is the value of the independent variable for observation i, which is observed without error, for [latex]i = 1, 2,\ldots, n[/latex],
  • Yi is the value of the dependent variable for observation i, which is a continuous random variable, for [latex]i = 1, 2,\ldots, n[/latex],
  • β0 is the population intercept of the regression line,
  • β1 is the population slope of the regression line, and
  • ϵi is the random error term for observation i which satisfies
    • [latex]E[\epsilon_{i}]=0[/latex] for [latex]i = 1, 2,\ldots, n[/latex],
    • [latex]V[\epsilon_{i}]=\sigma^{2}[/latex] for [latex]i = 1, 2,\ldots, n[/latex],
    • the random ϵi values are mutually independent random variables, which implies that their variance–covariance matrix is diagonal.

When the simple linear regression model is stated in this fashion, four properties become apparent. First, Yi is a random variable that can be broken into two components: a deterministic component [latex]\beta_{0}+\beta_{1}X_{i}[/latex], and a random component ϵi, for [latex]i = 1, 2,\ldots, n[/latex]. Second, Yi has population mean

[latex]E[Y_i] = E[\beta_0 + \beta_1 X_i + \epsilon_i] = \beta_0 + \beta_1 X_i[/latex]

for [latex]i = 1, 2,\ldots, n[/latex] and population variance

[latex]V[Y_i] = V[\beta_0 + \beta_1 X_i + \epsilon_i] = V[\epsilon_i] = \sigma^2[/latex]

for [latex]i = 1, 2,\ldots, n[/latex]. Using slightly different notation, it would be reasonable to write the population mean and variance as the conditional expectations

[latex]E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i \qquad\text{and}\qquad V[Y_i \mid X_i] = \sigma^2[/latex]

for [latex]i = 1, 2,\ldots, n[/latex]. The property that the variance does not change with Xi is known as homoscedasticity. Temporarily dropping the subscripts, the line

[latex]E[Y] = \beta_0 + \beta_1 X,[/latex]

with β0 and β1 replaced by the associated estimated values [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex], is oftentimes superimposed onto the scatterplot to visualize the fitted regression model. Third, each data pair (Xi, Yi) has a Yi value that misses the regression function by the error term ϵi, for [latex]i = 1, 2,\ldots, n[/latex]. Fourth, the values of the observed dependent variables Y1, Y2, …, Yn must be mutually independent random variables because the error terms [latex]\epsilon_{1}, \epsilon_{2},\dots, \epsilon_{n}[/latex] are mutually independent random variables.

1.4 Least Squares Estimators

We now turn to the question of estimating the intercept β0 and the slope β1 by the method of least squares. German mathematician Carl Friedrich Gauss (1777–1855) invented the least squares method and French mathematician Adrien–Marie Legendre (1752–1833) first published the method in 1805. The least squares method determines the values of β0 and β1 that minimize the sum of the squares of the errors, where the error is the vertical distance between the Yi value and the fitted regression line. The term estimator will be used here to refer to a generic formula for [latex]\hat{\beta}_0[/latex] or [latex]\hat{\beta}_1[/latex]; the term estimate will be used to refer to a specific numeric value for [latex]\hat{\beta}_0[/latex] or [latex]\hat{\beta}_1[/latex].

One bit of notation that will make the expressions of the point estimators more compact is

[latex]S_{XY} = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n}\left(X_i Y_i - X_i\bar{Y} - \bar{X}Y_i + \bar{X}\bar{Y}\right) = \sum_{i=1}^{n}X_i Y_i - n\bar{X}\bar{Y} - n\bar{X}\bar{Y} + n\bar{X}\bar{Y} = \sum_{i=1}^{n}X_i Y_i - n\bar{X}\bar{Y}.[/latex]

Similarly,

[latex]S_{XX} = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}X_i^2 - n\bar{X}^2[/latex]

and

[latex]S_{YY} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}Y_i^2 - n\bar{Y}^2.[/latex]

This new notation allows us to express [latex]nS_{XY}[/latex], [latex]nS_{XX}[/latex], and [latex]nS_{YY}[/latex] as

[latex]nS_{XY} = n\sum_{i=1}^{n}X_i Y_i - \sum_{i=1}^{n}X_i \sum_{i=1}^{n}Y_i,[/latex]

[latex]nS_{XX} = n\sum_{i=1}^{n}X_i^2 - \left(\sum_{i=1}^{n}X_i\right)^2,[/latex]

and

[latex]nS_{YY} = n\sum_{i=1}^{n}Y_i^2 - \left(\sum_{i=1}^{n}Y_i\right)^2.[/latex]

Using this notation, the least squares estimators for the slope and intercept of the model, denoted by [latex]\hat{\beta}_1[/latex] and [latex]\hat{\beta}_0[/latex], are given in the following theorem. Notice that the term normal equations in the theorem is not related to the normal distribution.

The requirement that there are at least two distinct Xi values in Theorem 1.1 is consistent with intuition. Figure 1.3 shows [latex]n=5[/latex] data pairs in which the independent variable assumes the same value for each pair: [latex]X_{1}=X_{2}=X_{3}=X_{4}=X_{5}=3[/latex]. It is not possible to estimate the slope of the regression line in this particular setting. This is the geometric reason for the requirement that there are at least two distinct Xi values. In addition, the denominator in [latex]\hat{\beta}_1 = S_{XY} / S_{XX}[/latex] is zero when all Xi values are equal, which gives the associated algebraic reason for the requirement. From this point forward, whenever the simple linear regression model is used, it is assumed that the associated data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex] have at least two distinct Xi values.

A graph of five identical independent data pair values on the X Y coordinate plane.
Figure 1.3: Identical independent variable values for all [latex]n=5[/latex] data pairs.

Long Description for Figure 1.3

The horizontal axis X ranges from 0 to 8 in increments of 1 unit. The vertical axis Y ranges from 0 to 40 in increments of 10 units. The ordered pair X 1, Y 1 is plotted at (3, 23); X 2, Y 2 is plotted at (3, 29); X 3, Y 3 is plotted at (3, 2); X 4, Y 4 is plotted at (3, 40); and X 5, Y 5 is plotted at (3, 9). All data are approximate.

Figure 1.4 shows the geometric interpretation associated with the estimated intercept [latex]\hat{\beta}_0[/latex] and estimated slope [latex]\hat{\beta}_1[/latex]. The [latex]n=9[/latex] data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_9, Y_9)[/latex] are plotted as points, along with the associated estimated regression line [latex]Y=\hat{\beta}_{0}+\hat{\beta}_{1}X[/latex]. The y-intercept of the graph [latex]\hat{\beta}_0[/latex] is the height of the estimated regression line at [latex]X = 0[/latex]. The “rise over run” interpretation of the slope is illustrated by the right triangle with legs consisting of dotted lines.

A graph with 9 data points depicts the geometry associated with estimated intercept and estimated slope.
Figure 1.4: Geometry associated with [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex].

Long Description for Figure 1.4

Nine data points are plotted about the estimated regression line, which has a positive slope and intersects the vertical axis. The line is labeled Y equals beta cap 0 plus beta cap 1 X. Four points fall below the line, one point is on the line, and four points are above the line. The height of the line at X equals 0 is labeled beta cap 0. The rise over run interpretation of the slope is indicated by a right triangle drawn with dotted lines; the horizontal leg is labeled 1 and the vertical leg is labeled beta cap 1.

 

The next example illustrates the mechanics associated with calculating the least squares estimates [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex]. In order to focus on the calculations performed by hand, a small sample size of [latex]n = 3[/latex] data pairs is used. The numbers have been handpicked in order to make the resulting parameter estimates come out to whole numbers. A sample size of [latex]n = 2[/latex] is too simplistic in that two points determine a line, and the estimated regression line will always pass through those two points.
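
A hedged sketch of those mechanics appears below. The three data pairs are hypothetical stand-ins (not the textbook's example), chosen only so that the estimates work out to whole numbers under the formulas [latex]\hat{\beta}_1 = S_{XY}/S_{XX}[/latex] and [latex]\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}[/latex].

```python
import numpy as np

# Hypothetical handpicked data: n = 3 pairs with whole-number estimates.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 8.0])

s_xy = np.sum((x - x.mean()) * (y - y.mean()))   # S_XY = 6
s_xx = np.sum((x - x.mean()) ** 2)               # S_XX = 2

beta1_hat = s_xy / s_xx                          # slope estimate: 3.0
beta0_hat = y.mean() - beta1_hat * x.mean()      # intercept estimate: -2.0
print(beta1_hat, beta0_hat)
```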

1.5 Properties of Least Squares Estimators

The least squares estimators of β0 and β1 possess several properties which are important for statistical inference. The four properties established in this section are:

  • the least squares estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] are unbiased estimators of β0 and β1,
  • the least squares estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] can be written as linear combinations of the dependent variables [latex]Y_{1}, Y_{2}, \dots, Y_{n}[/latex],
  • the variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] can be written in closed form, and
  • the least squares estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] have the smallest population variance among all unbiased estimators that can be expressed as linear combinations of the dependent variables.

Proofs of the associated results are included in each of the following subsections.

1.5.1 [latex]\bf \hat{\beta}_0[/latex] and [latex]\bf \hat{\beta}_1[/latex] are Unbiased Estimators of β0 and β1

A key property associated with the least squares estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] is that their expected values equal the associated population values β0 and β1. The next result establishes the unbiasedness of the two point estimators.

The fact that the least squares estimators of the slope and intercept of the regression line are unbiased will be supported by a Monte Carlo simulation experiment in the next example. Unlike the typical simple linear regression setting in which data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex] are used to estimate the unknown parameters β0 and β1, the simulation will generate data pairs and associated regression lines for known parameters β0 and β1.
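
A minimal sketch of such a Monte Carlo experiment is given below, under hypothetical parameter values, design points, and replication count rather than those used in the textbook's example. The averages of the simulated estimates should land close to the known β0 and β1, which is the empirical signature of unbiasedness.

```python
import numpy as np

# Hedged Monte Carlo sketch: simulate from a known SLR model and average the
# least squares estimates across replications (all settings are hypothetical).
rng = np.random.default_rng(seed=2)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = np.linspace(1, 10, 10)                    # fixed design points
reps = 5000

b0_hats, b1_hats = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))
    s_xx = np.sum((x - x.mean()) ** 2)
    b1_hats[r] = s_xy / s_xx
    b0_hats[r] = y.mean() - b1_hats[r] * x.mean()

# The sample means of the estimates should be close to beta0 and beta1.
print(b0_hats.mean(), b1_hats.mean())
```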

1.5.2 [latex]\bf \hat{\beta}_0[/latex] and [latex]\bf \hat{\beta}_1[/latex] are Linear Combinations of [latex]\bf Y_{1}, Y_{2}, \dots , Y_{n}[/latex]

Theorem 1.2, which states that [latex]E [\hat {\beta}_0] = \beta_{0}[/latex] and [latex]E [\hat {\beta}_1] = \beta_{1}[/latex], concerns the accuracy of the least squares estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex]. These estimators are “on target” in the sense that their expected values equal their associated population values. The histograms in Figure 1.11 show that the estimators for β0 and β1 do not systematically deviate above or below their population values.

The precision of the estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] is also of interest. This requires that we also compute their population variances. Before doing so, it is helpful to see that both of these point estimators can be written as linear combinations of the values of the dependent variables [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex]. It is not immediately apparent from the formula for the point estimator for the slope of the regression line [latex]\hat{\beta}_1 = S_{XY} / S_{XX}[/latex], but the estimator can be written as a linear combination of the dependent variables:

[latex]\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})Y_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2}[/latex]

because [latex]\bar{Y} \sum^{n}_{i=1} (X_{i} - \bar{X}) = \bar{Y} (n\bar{X} - n\bar{X}) = 0[/latex]. This formula indicates that the point estimator for the slope of the regression line is the linear combination

[latex]\hat{\beta}_1 = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n,[/latex]

where

[latex]a_i = \frac{X_i - \bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}[/latex]

for [latex]i = 1, 2,\ldots, n[/latex].

The coefficients [latex]a_{1}, a_{2}, \dots, a_{n}[/latex] in the linear combination [latex]\hat{\beta}_1 = a_{1}Y_{1} + a_{2}Y_{2} + \cdots + a_{n}Y_{n}[/latex] satisfy three properties. First, [latex]\sum^{n}_{i=1}a_{i}=0[/latex] because

[latex]\sum_{i=1}^{n}a_i = \frac{1}{S_{XX}}\sum_{i=1}^{n}(X_i - \bar{X}) = \frac{n\bar{X} - n\bar{X}}{S_{XX}} = 0.[/latex]

Second, [latex]\sum^{n}_{i=1}a_{i}X_{i}=1[/latex] because

[latex]\sum_{i=1}^{n}a_i X_i = \frac{1}{S_{XX}}\sum_{i=1}^{n}(X_i - \bar{X})X_i = \frac{1}{S_{XX}}\left[\sum_{i=1}^{n}X_i^2 - n\bar{X}^2\right] = \frac{S_{XX}}{S_{XX}} = 1.[/latex]

Third, [latex]\sum^{n}_{i=1}a^{2}_{i} = 1 / S_{XX}[/latex] because

[latex]\sum_{i=1}^{n}a_i^2 = \frac{1}{S_{XX}^2}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{S_{XX}}{S_{XX}^2} = \frac{1}{S_{XX}}.[/latex]

These properties can be useful in deriving results associated with the simple linear regression model.
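
A quick numerical check of the three properties is sketched below for an arbitrary (hypothetical) set of X values; it simply evaluates the coefficients a_i and confirms the sums stated above.

```python
import numpy as np

# Verify sum(a_i) = 0, sum(a_i X_i) = 1, and sum(a_i^2) = 1 / S_XX numerically.
x = np.array([1.0, 3.0, 4.0, 7.0, 10.0])          # hypothetical X values
s_xx = np.sum((x - x.mean()) ** 2)
a = (x - x.mean()) / s_xx                          # coefficients in beta1_hat = sum a_i Y_i

print(np.isclose(a.sum(), 0.0))                    # first property
print(np.isclose((a * x).sum(), 1.0))              # second property
print(np.isclose((a ** 2).sum(), 1.0 / s_xx))      # third property
```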

Likewise, the least squares point estimator for the intercept of the regression line is also a linear combination of the Yi values:

[latex]\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} = \frac{1}{n}\sum_{i=1}^{n}Y_i - \bar{X}\sum_{i=1}^{n}\frac{X_i - \bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,Y_i = \sum_{i=1}^{n}\left(\frac{1}{n} - \frac{\bar{X}(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)Y_i.[/latex]

This formula indicates that the point estimator for the intercept of the regression line can also be written as a linear combination:

[latex]\hat{\beta}_0 = c_1 Y_1 + c_2 Y_2 + \cdots + c_n Y_n,[/latex]

where

[latex]c_i = \frac{1}{n} - \frac{\bar{X}(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}[/latex]

for [latex]i = 1, 2,\ldots, n[/latex]. This derivation constitutes a proof of the following result.

These formulas will be illustrated for the small data set consisting of [latex]n = 3[/latex] data pairs.

1.5.3 Variance–Covariance Matrix of [latex]\bf \hat{\beta}_0[/latex] and [latex]\bf\hat{\beta}_1[/latex]

Theorem 1.2 states that [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] are unbiased estimators of β0 and β1 because [latex]E[\hat{\beta}_0] = \beta_0[/latex] and [latex]E[\hat{\beta}_1] = \beta_1[/latex]. This result concerns the accuracy of the least squares estimators, but does not address the precision of the least squares estimators. We now return to the question of assessing the precision of the point estimators. Being able to express the point estimators of the least squares estimators as linear combinations of the dependent variables as summarized in Theorem 1.3 will be very useful as we proceed. In order to assess the precision of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex], it is necessary to compute [latex]V[\hat{\beta}_0][/latex] and [latex]V[\hat{\beta}_1][/latex]. More generally, we will compute the variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] in this subsection. Returning to the Monte Carlo simulation in Example 1.4, the magnitudes of the diagonal elements of the variance–covariance matrix reflect the spread of the histograms in Figure 1.11, and the off-diagonal elements of the variance–covariance matrix give the population covariance between [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] which is apparent in the simulation results displayed in Figure 1.12. The general form for the population covariance between [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] will indicate whether the negative sample covariance between [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] that was encountered in the Monte Carlo simulation was due to the particular values of the parameters in the simple linear regression model or whether the negative covariance is generally the case.

We begin with the lower-right-hand element of the variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex]. In the simple linear regression model

[latex]Y_i = \beta_0 + \beta_1 X_i + \epsilon_i[/latex]

for [latex]i = 1, 2,\ldots, n[/latex], the error terms [latex]\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n[/latex] are assumed to be mutually independent random variables. This implies that the dependent variables [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] are also mutually independent random variables. Using the fact that [latex]\hat{\beta}_1[/latex] can be written as a linear combination of the dependent variables from Theorem 1.3, the population variance of [latex]\hat{\beta}_1[/latex] is

[latex]V[\hat{\beta}_1] = V[a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n] = \sum_{i=1}^{n}V[a_i Y_i] = \sum_{i=1}^{n}a_i^2 V[Y_i] = \left(\sum_{i=1}^{n}a_i^2\right)\sigma^2 = \frac{\sigma^2}{S_{XX}}[/latex]

because [latex]\sum^{n}_{i=1}a^{2}_{i} = 1 / S_{XX}[/latex] by Theorem 1.3. Although the experimenter typically has no control over σ2, the experimenter may have control over selecting the values of [latex]X_{1}, X_{2}, \dots , X_{n}[/latex] in some applications of simple linear regression. In order to make [latex]V[\hat{\beta}_1][/latex] as small as possible, the experimenter should make [latex]S_{XX}[/latex] as large as possible. Spreading the Xi values as much as possible gives the most stability to the estimated slope of the regression line. Simple linear regression modeling can still be performed when the Xi values are tightly clustered together, but the estimated slope will be less stable, and the scope of the model will be limited. As an extreme example of spreading the Xi values, consider clustering all of the Xi values at the left-most and right-most extreme possible values for the independent variable. The good news is that this will give you the largest possible [latex]S_{XX}[/latex] and the associated smallest possible [latex]V[\hat{\beta}_1][/latex]. The bad news is that you will not be able to assess linearity in this case because you have observed the dependent variable at only two values of the independent variable. A multitude of functions can model the average of the dependent variables at these two extreme values of the independent variable. So the usual practice is to select the Xi values in an approximately uniform fashion over as wide a range as possible. This gives the experimenter the opportunity to assess linearity and also achieves a large [latex]S_{XX}[/latex], resulting in an associated small [latex]V[\hat{\beta}_1][/latex].
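
The sketch below compares [latex]S_{XX}[/latex], and hence [latex]V[\hat{\beta}_1] = \sigma^2 / S_{XX}[/latex], for three hypothetical designs of ten X values on the interval from 0 to 10; the specific designs and σ2 are illustrative assumptions.

```python
import numpy as np

# Compare S_XX (and the resulting V[beta1_hat]) across three hypothetical designs.
sigma2 = 1.0

designs = {
    "clustered near 5": np.linspace(4.5, 5.5, 10),
    "uniform over [0, 10]": np.linspace(0.0, 10.0, 10),
    "half at 0, half at 10": np.repeat([0.0, 10.0], 5),
}

for name, x in designs.items():
    s_xx = np.sum((x - x.mean()) ** 2)
    print(f"{name}: S_XX = {s_xx:.2f}, V[beta1_hat] = {sigma2 / s_xx:.4f}")
```

The two-point design yields the largest S_XX, but, as noted above, it leaves no way to assess linearity, which is why an approximately uniform spread is the usual practice.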

The next step is to calculate the upper-left-hand element of the variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex]. Before calculating the population variance of [latex]\hat{\beta}_0[/latex], it is necessary to establish that [latex]\bar{Y}[/latex] and [latex]\hat{\beta}_1[/latex] are uncorrelated. Since [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] are mutually independent random variables, each with population variance [latex]V[Y_{i}] = \sigma^{2}[/latex], the population covariance between [latex]\bar{Y}[/latex] and [latex]\hat{\beta}_1[/latex] is

[latex]\text{Cov}(\bar{Y}, \hat{\beta}_1) = \text{Cov}\!\left(\frac{Y_1}{n} + \frac{Y_2}{n} + \cdots + \frac{Y_n}{n},\ a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n\right) = \sum_{i=1}^{n}\sum_{j=1}^{n}\text{Cov}\!\left(\frac{Y_i}{n}, a_j Y_j\right) = \sum_{i=1}^{n}\text{Cov}\!\left(\frac{Y_i}{n}, a_i Y_i\right) = \sum_{i=1}^{n}\frac{a_i}{n}V[Y_i] = \frac{\sigma^2}{n}\sum_{i=1}^{n}a_i = 0[/latex]

because [latex]\sum^{n}_{i=1}a_{i}=0[/latex] by Theorem 1.3. So [latex]\bar{Y}[/latex] and [latex]\hat{\beta}_1[/latex] are uncorrelated.

Based on the fact that the population covariance between [latex]\bar{Y}[/latex] and [latex]\hat{\beta}_1[/latex] is zero, the population variance of [latex]\hat{\beta}_0[/latex] is

[latex]V[\hat{\beta}_0] = V[\bar{Y} - \hat{\beta}_1\bar{X}] = V[\bar{Y}] + \bar{X}^2 V[\hat{\beta}_1] = \frac{\sigma^2}{n} + \bar{X}^2\frac{\sigma^2}{S_{XX}} = \left[\frac{1}{n} + \frac{\bar{X}^2}{S_{XX}}\right]\sigma^2 = \left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2 + n\bar{X}^2}{n\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]\sigma^2 = \frac{\sum_{i=1}^{n}X_i^2}{nS_{XX}}\,\sigma^2.[/latex]

The last step is to calculate the off-diagonal elements of the variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex]. Since [latex]\text{Cov}(\bar{Y},\hat\beta_{1}) = 0[/latex], the population covariance between [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] is

[latex]\text{Cov}(\hat{\beta}_0, \hat{\beta}_1) = \text{Cov}(\bar{Y} - \hat{\beta}_1\bar{X}, \hat{\beta}_1) = \text{Cov}(\bar{Y}, \hat{\beta}_1) - \text{Cov}(\hat{\beta}_1\bar{X}, \hat{\beta}_1) = -\text{Cov}(\hat{\beta}_1\bar{X}, \hat{\beta}_1) = -\bar{X}\,\text{Cov}(\hat{\beta}_1, \hat{\beta}_1) = -\bar{X}\,V[\hat{\beta}_1] = -\frac{\bar{X}\sigma^2}{S_{XX}}.[/latex]

All of the elements of the variance–covariance matrix have now been established, which constitutes a proof of the following theorem.

There are two important observations that can be made from Theorem 1.4. First, the elements of the variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] are a function of only the Xi values and the typically unknown population error variance σ2; the values of [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] do not play a role. Recall from Definition 1.1 that the independent variable observations [latex]X_{1}, X_{2}, \dots , X_{n}[/latex] are assumed to be observed without error. Second, since [latex]S_{XX} > 0[/latex] because at least two of the Xi values are distinct, the population covariance between [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] takes the opposite sign of [latex]\bar{X}[/latex]. This provides an explanation of why [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] appeared to have negative covariance in the results of the 5000 simulated estimates plotted in Figure 1.12.
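
Because the matrix depends only on the Xi values and σ2, it can be evaluated directly once those quantities are specified. The sketch below does so for hypothetical X values and a hypothetical σ2; both are assumptions chosen for illustration.

```python
import numpy as np

# Evaluate the variance-covariance matrix of (beta0_hat, beta1_hat) from the
# formulas above for hypothetical X values and sigma^2 = 4.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
n = x.size
sigma2 = 4.0
s_xx = np.sum((x - x.mean()) ** 2)

var_b0 = sigma2 * np.sum(x ** 2) / (n * s_xx)
var_b1 = sigma2 / s_xx
cov_b0_b1 = -sigma2 * x.mean() / s_xx     # opposite sign of X bar

vcov = np.array([[var_b0, cov_b0_b1],
                 [cov_b0_b1, var_b1]])
print(vcov)
```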

So far we have found the expected values and the variance–covariance matrix of the least squares estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex]. But there is a lingering doubt as to whether better point estimators for β0 and β1 exist. An example of such a better point estimator would be an unbiased estimator of β0 with a smaller population variance than the least squares estimator of β0. This lingering doubt will be addressed in the next subsection.

1.5.4 Gauss–Markov Theorem

Recall from Theorem 1.3 that the least squares estimators for the slope and intercept of the regression line were expressed as linear combinations of the dependent variables:

[latex]\hat{\beta}_1 = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n[/latex]

and

[latex]\hat{\beta}_0 = c_1 Y_1 + c_2 Y_2 + \cdots + c_n Y_n.[/latex]

But are these linear combinations the best possible linear combinations for estimating β1 and β0? The Gauss–Markov theorem is used to show that these estimators have the minimum variance of all possible unbiased estimators which are linear combinations of the dependent variables. These estimators are known as Best Linear Unbiased Estimators, typically abbreviated with the colorful acronym BLUE. The Venn diagram in Figure 1.13 might be helpful in categorizing the various types of estimators. The set L consists of all point estimators for the regression parameters β0 and β1 which can be expressed as linear combinations of the dependent variables [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex]. The set U consists of all point estimators for the regression parameters β0 and β1 which are unbiased estimators of β0 and β1. The shaded intersection of L and U (that is, [latex]L \cap U[/latex]) is all estimators which are both linear combinations of [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] and unbiased. An example of an estimator of β1 which is neither in L nor in U is [latex]Y^{2}_{i}[/latex]. The Gauss–Markov theorem states that the least squares estimators have the smallest possible variance among all estimators in [latex]L \cap U[/latex].

A Venn diagram with two side by side, overlapping circles, each labeled L and U. The area of overlap of the circles is shaded.
Figure 1.13: Venn diagram of sets L (linear combinations) and U (unbiased estimators).
 

The Gauss–Markov theorem indicates that the least squares estimators for β0 and β1 have minimal variance among all linear estimators. It does not indicate whether the least squares estimators for β0 and β1 have minimal variance among all estimators. The Gauss–Markov theorem extends to the case of multiple linear regression in which there are several independent variables. The least squares estimators are also the best linear unbiased estimators in this case.

To review the results that have been introduced so far, the simple linear regression model

[latex]Y = \beta_0 + \beta_1 X + \epsilon[/latex]

defines a linear statistical relationship between an independent variable X, observed without error, and a random dependent variable Y as given in Definition 1.1. The point estimators for β1 and β0 from n data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex] using the least squares criterion are

[latex]\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} \qquad\text{and}\qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}[/latex]

as given in Theorem 1.1. The least squares estimators are unbiased estimators of their associated parameters because

[latex]E[\hat{\beta}_1] = \beta_1 \qquad\text{and}\qquad E[\hat{\beta}_0] = \beta_0[/latex]

as given in Theorem 1.2. The least squares estimators of β0 and β1 can be expressed as linear combinations of [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] as

[latex]\hat{\beta}_0 = c_1 Y_1 + c_2 Y_2 + \cdots + c_n Y_n \qquad\text{and}\qquad \hat{\beta}_1 = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n,[/latex]

with coefficients [latex]c_{1}, c_{2}, \dots , c_{n}[/latex] and [latex]a_{1}, a_{2}, \dots , a_{n}[/latex] given in Theorem 1.3. The variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] is

[latex]\begin{bmatrix} V[\hat{\beta}_0] & \text{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ \text{Cov}(\hat{\beta}_1, \hat{\beta}_0) & V[\hat{\beta}_1] \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n}X_i^2 / (nS_{XX}) & -\bar{X}/S_{XX} \\ -\bar{X}/S_{XX} & 1/S_{XX} \end{bmatrix}\sigma^2[/latex]

as given in Theorem 1.4. Finally, the Gauss–Markov theorem given in Theorem 1.5 states that the least squares estimators of β0 and β1 have the smallest population variance among all unbiased estimators that can be expressed as a linear combination of [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex].

The next section defines fitted values and residuals. Fitted values are the heights of the regression line associated with the observed values of the independent variable [latex]X_{1}, X_{2}, \dots , X_{n}[/latex]. The residuals are the vertical signed distances between the observed values of the dependent variable [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] and the associated fitted values that fall on the regression line. Residuals play an analogous role to the error terms in the simple linear regression model.

1.6 Fitted Values and Residuals

The simple linear regression model

[latex]Y = \beta_0 + \beta_1 X + \epsilon[/latex]

was introduced in the previous section as a linear statistical model for describing the relationship between an independent variable X and a dependent variable Y. Taking the expected value of both sides of this equation yields

[latex]E[Y] = \beta_0 + \beta_1 X[/latex]

because [latex]E[\varepsilon] = 0[/latex] and X is a fixed value assumed to be observed without error, which are two key assumptions in Definition 1.1. When the population intercept β0 and the population slope β1 are replaced by their associated least squares point estimators [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex], the resulting estimated regression line is

[latex]\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X.[/latex]

This estimated regression line is typically plotted on a scatterplot that contains the data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex]. Seeing the data pairs and the least squares regression line on the same plot often makes the visual assessment of linearity easier. For any value X in which the simple linear regression model is valid, [latex]\hat{Y}[/latex] is the point estimator for the value of the dependent variable based on the data pairs and associated estimated regression line. This equation can be rewritten for the particular values of the independent variable collected as

[latex]\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i[/latex]

for [latex]i = 1, 2,\ldots, n[/latex]. The value [latex]\hat{Y}_{i}[/latex] is known as the fitted value associated with data pair i, for [latex]i = 1, 2,\ldots, n[/latex]. The fitted value always falls on the estimated regression line. When [latex]\hat{Y}_i \ne Y_{i}[/latex], which is almost always the case in applications, the data pair [latex](X_{i}, Y_{i})[/latex] does not fall on the estimated regression line; when [latex]\hat{Y}_i = Y_{i}[/latex], the data pair falls on the estimated regression line. The next example illustrates the notion of fitted values for the sales data set.

The spread of the data pair [latex](X_{i}, Y_{i})[/latex] from the fitted regression line [latex]\hat{Y} = \hat{\beta}_0 + \hat{\beta}_{1}X[/latex] is reflected in the vertical signed distance between the data pair [latex](X_{i}, Y_{i})[/latex] and the associated fitted value [latex](X_{i}, \hat{Y}_{i})[/latex]. These signed distances are known as the residuals, and are defined by

[latex]e_i = Y_i - \hat{Y}_i[/latex]

for [latex]i = 1, 2,\ldots, n[/latex]. Data pairs that fall above the regression line correspond to positive residuals; data pairs that fall below the regression line correspond to negative residuals. The least squares approach used so far in estimating the intercept and slope of the regression line is a matter of finding the values of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] which minimize the sum of the squares of the residuals. In other words, minimize

[latex]S = \sum_{i=1}^{n}e_i^2.[/latex]

The fitted values and residuals are formally defined next.

Choosing to use the vertical distance between the observed value of the dependent variable and the regression line in the definition of the residual was based on the fact that the values of the independent variable [latex]X_{1}, X_{2}, \dots , X_{n}[/latex] are assumed to be observed without error in Definition 1.1. The mathematics associated with simple linear regression changes substantially if both X and Y are considered to be random variables.

A subtle but important distinction should be drawn between the model error term ϵi for data pair i and the residual ei for data pair i. The model error terms are defined by

[latex]\epsilon_i = Y_i - (\beta_0 + \beta_1 X_i)[/latex]

for [latex]i = 1, 2,\ldots, n[/latex], and represent the vertical distances between the observed dependent variable Yi and the true (population) regression line [latex]Y=\beta_{0}+\beta_{1}X[/latex]. The simple linear regression model assumes that [latex]\varepsilon_{1}, \varepsilon_{2},\dots, \varepsilon_{n}[/latex] are mutually independent random variables. In nearly all applications, however, β0 and β1 are unknown. This means that for a particular data set, these model error terms are also unknown. On the other hand, the residuals are defined by

[latex]e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)[/latex]

for [latex]i = 1, 2,\ldots, n[/latex], and represent the error for data pair i when compared to the estimated regression line [latex]\hat{Y}=\hat{\beta}_{0}+\hat{\beta}_{1}X[/latex], which is calculated from the n data pairs. Thus, [latex]\hat{\varepsilon}_{i} = e_{i}[/latex], for [latex]i = 1, 2,\ldots, n[/latex]. The [latex]e_{1}, e_{2}, \ldots, e_{n}[/latex] values are not mutually independent random variables because they must sum to zero. (This will be proven subsequently in Theorem 1.6.) For a particular data set, these residuals are known. The residuals are calculated for the sales data next.

A close inspection of the entries in Table 1.2 reveals that there are some curious outcomes that occur, such as

[latex]\sum_{i=1}^{n}e_i = 0 \qquad\text{and}\qquad \sum_{i=1}^{n}Y_i = \sum_{i=1}^{n}\hat{Y}_i.[/latex]

In other words, (a) the sum of the residuals is zero, and (b) the sum of the observed values of the dependent variable equals the sum of the fitted values. These were not just a matter of coincidence. The following theorem confirms that these relationships, along with a few other relationships, are true in general.
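
A quick numerical check of (a) and (b) is sketched below for an arbitrary (hypothetical) data set; any data pairs with at least two distinct X values would do.

```python
import numpy as np

# Fit the least squares line and confirm the two relationships numerically.
x = np.array([1.0, 2.0, 3.0, 5.0, 6.0])      # hypothetical data pairs
y = np.array([3.2, 4.1, 7.0, 9.4, 12.3])

s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
b1 = s_xy / s_xx
b0 = y.mean() - b1 * x.mean()

y_fit = b0 + b1 * x                           # fitted values
e = y - y_fit                                 # residuals

print(np.isclose(e.sum(), 0.0))               # (a) residuals sum to zero
print(np.isclose(y.sum(), y_fit.sum()))       # (b) observed and fitted sums agree
```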

These five results from Theorem 1.6 will be illustrated for the sales data in the example that follows.

1.7 Estimating the Variance of the Error Terms

The emphasis so far has been focused on the estimation of the intercept and slope of the regression line. While [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] are the most critical parameters in most applications of a simple linear regression model, there is another parameter, the population variance of the error terms σ2, which should also be estimated from the data pairs.

To establish a foundation for the estimation of σ2, assume for this paragraph only that there is a univariate, rather than a bivariate, sample of values denoted by [latex]X_{1}, X_{2}, \dots , X_{n}[/latex]. These will not be fixed values observed without error as they were in regression modeling. It is assumed that these values constitute a random sample from a population that has finite population mean μ and finite population variance σ2. The goal in this paragraph is to estimate σ2 as a function of the data values. If the population mean μ is known (which is rare in practice), then an unbiased estimator of σ2 is

[latex]\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2.[/latex]

If the first [latex]n - 1[/latex] deviations between the sample values and the population mean [latex]X_1 - \mu, X_2 - \mu, \ldots, X_{n-1} - \mu[/latex] were known, the final deviation, [latex]X_{n} - \mu[/latex], would be free to take on any value. It is in this sense that the sum of squares

[latex]\sum_{i=1}^{n}(X_i - \mu)^2[/latex]

is said to have n “degrees of freedom.” It is common practice in statistics to divide a sum of squares by its degrees of freedom to arrive at a point estimator. In this particular instance, dividing by n makes the point estimator an unbiased estimator of σ2. The problem that arises more often in practice is to estimate σ2 when μ is unknown. An unbiased estimator of σ2 in this case is the sample variance

[latex]\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2,[/latex]

which is typically denoted by S2 by statisticians. There are three reasons why the term outside of the summation has [latex]n - 1[/latex] in the denominator. The first reason is that this is the appropriate term so that this estimator is an unbiased estimator of σ2. This can be stated as [latex]E[S^2] = \sigma^2[/latex]. The second reason is that one can’t estimate the dispersion of a distribution from a single data value, so the sample variance is undefined when [latex]n = 1[/latex]. The third reason is that the sum of squares has [latex]n - 1[/latex] degrees of freedom. One degree of freedom is lost because the sample mean [latex]\bar{X}[/latex] is used to estimate the population mean μ. If the first [latex]n - 1[/latex] deviations between the sample values and the sample mean [latex]X_{1} - \bar{X}, X_{2} - \bar{X}, \dots, X_{n-1} - \bar{X}[/latex] were known, the final deviation, [latex]X_{n} - \bar{X}[/latex], could be calculated from the other [latex]n - 1[/latex] values because

[latex]\sum_{i=1}^{n}(X_i - \bar{X}) = \sum_{i=1}^{n}X_i - n\bar{X} = 0.[/latex]

It is in this sense that the sum of squares

[latex]\sum_{i=1}^{n}(X_i - \bar{X})^2[/latex]

is said to have [latex]n - 1[/latex] degrees of freedom. This ends the discussion of degrees of freedom for a univariate data set.
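
A hedged Monte Carlo sketch of this unbiasedness claim is given below, under hypothetical values of μ, σ2, and n: dividing the sum of squares by n − 1 gives an estimator whose average is close to σ2, while dividing by n is biased low.

```python
import numpy as np

# Compare dividing the sum of squares by n - 1 versus by n (hypothetical settings).
rng = np.random.default_rng(seed=3)
mu, sigma2, n, reps = 10.0, 9.0, 5, 20000

s2_n1 = np.empty(reps)    # divide by n - 1 (the sample variance)
s2_n = np.empty(reps)     # divide by n
for r in range(reps):
    x = rng.normal(mu, np.sqrt(sigma2), n)
    ss = np.sum((x - x.mean()) ** 2)
    s2_n1[r] = ss / (n - 1)
    s2_n[r] = ss / n

print(s2_n1.mean())   # close to sigma^2 = 9
print(s2_n.mean())    # close to 9 * (n - 1) / n = 7.2, biased low
```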

We now return to the problem of estimating σ2 in simple linear regression. The independent variables [latex]X_{1}, X_{2}, \dots , X_{n}[/latex] are once again assumed to be fixed values observed without error as they have been throughout this chapter. Based on the fact that the error terms [latex]\varepsilon_{1}, \varepsilon_{2},\dots, \varepsilon_{n}[/latex] in the simple linear regression model are assumed to be mutually independent and identically distributed random variables, each with population mean 0 and finite population variance σ2, the population variance of the error terms can be estimated with the unbiased estimator

[latex]\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2[/latex]

if β0 and β1 were known. But in practice, the two parameters β0 and β1 are estimated from the data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex], so two degrees of freedom are lost and an appropriate point estimator for the population variance σ2 is given by

[latex]\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}e_i^2 = \frac{1}{n-2}\sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2.[/latex]

It is important that the population variance of the error terms σ2 remain constant over the range of X values in which the simple linear regression model is appropriate. One tool for visually assessing this assumption is a scatterplot of the data pairs with the estimated regression line superimposed.

The point estimator for σ2 when β0 and β1 are estimated from the data pairs involves the sum of squares of the residuals, and this is often abbreviated as SSE, for sum of squares for error:

[latex]SSE = \sum_{i=1}^{n}e_i^2,[/latex]

which is also known as the error sum of squares, residual sum of squares, and sum of squares due to error. When this quantity is divided by its degrees of freedom, it is known as the mean square error, which is abbreviated by MSE:

[latex]\hat{\sigma}^2 = MSE = \frac{SSE}{n-2} = \frac{1}{n-2}\sum_{i=1}^{n}e_i^2.[/latex]
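
The sketch below computes SSE and MSE for a hypothetical data set (not one of the textbook's examples) by fitting the least squares line and squaring the residuals.

```python
import numpy as np

# Estimate sigma^2 by the MSE for hypothetical data pairs.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.4, 3.1, 5.8, 6.2, 8.9, 10.1])
n = x.size

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)        # residuals
sse = np.sum(e ** 2)         # sum of squares for error
mse = sse / (n - 2)          # point estimate of sigma^2
print(sse, mse)
```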

Some good news is provided by the next result, which states that [latex]MSE = \hat{\sigma}^2[/latex] is an unbiased estimator of σ2.

To summarize, there are three parameters in a simple linear regression model: the population intercept β0, the population slope β1, and the population variance of the error terms σ2. These parameters can be estimated from n data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex] by the least squares method. Theorem 1.2 indicates that the least squares point estimator [latex]\hat{\beta}_0[/latex] is an unbiased estimator of β0 and the least squares point estimator [latex]\hat{\beta}_1[/latex] is an unbiased estimator of β1. Theorem 1.7 indicates that the MSE is an unbiased estimator of σ2. All three parameter estimators are on target on average. The next three examples illustrate the estimation of σ2.

The magnitude of the point estimate of σ2 is a reflection of whether the data points are tightly clustered about the estimated regression line (for small values of [latex]\hat{\sigma}^{2}[/latex]) or whether the data points stray significantly from the estimated regression line (for large values of [latex]\hat{\sigma}^{2}[/latex]). In the previous example involving the sales data pairs, there is significant vertical deviation between the data points and the associated fitted values, as seen in Figure 1.15. The next example illustrates the case in which the data pairs are tightly clustered about the regression line.

In the two previous examples, point estimates of the population variance of the error terms σ2 were calculated. In the sales data example, the estimated error term variance [latex]\hat{\sigma}^2 = 14[/latex] indicated that the data pairs strayed a large distance from the estimated regression line, as illustrated in Figure 1.6. In the Forbes data set, the estimated error term variance [latex]\hat{\sigma}^2 = 0.05421[/latex] reflects data pairs that cluster closely to the estimated regression line, as illustrated in Figure 1.18. But these two examples involving individual data sets do not indicate anything about the distribution of [latex]\hat{\sigma}^2[/latex]. The next example addresses this topic by extending the Monte Carlo simulation experiment from Example 1.4.

Before leaving the topic of the estimation of σ2 behind, consider the case of collecting just [latex]n = 2[/latex] data pairs [latex](X_1, Y_1)[/latex] and [latex](X_2, Y_2)[/latex], as illustrated in Figure 1.22. One of the assumptions associated with the observations in a simple linear regression model is that there are at least two distinct values of the independent variable observed. So when [latex]n = 2[/latex], it must be the case that [latex]X_1 \ne X_2[/latex]. In this case, the least squares regression line will pass through the points [latex](X_1, Y_1)[/latex] and [latex](X_2, Y_2)[/latex]. This means that the fitted values are identical to the data pairs, and hence, both residuals are zero. So the sum of squares for error is [latex]SSE = e^{2}_{1} + e^{2}_{2} = 0[/latex]. But is an SSE of zero an appropriate estimate for the population variance of the spread of the values about the regression line? Can one conclude that this is really a deterministic relationship and any additional data pairs collected will fall on the fitted regression line? Certainly not, because it is not possible to draw that conclusion based on just two data pairs. A third data pair might fall on the regression line or fall significantly off of the regression line, as was the case with the sales data from Example 1.3. The unbiased estimator of σ2 is undefined because of the [latex]n - 2[/latex] in the denominator of the formula for [latex]\hat{\sigma}^2[/latex], as it should be. Two data pairs are adequate for estimating the population slope and population intercept of the regression line, but they are not adequate for estimating σ2. The mathematics and intuition are consistent in this setting.

A scatter plot with two data pair values that fall on the estimated linear regression line, which has a negative slope.
Figure 1.22: Scatterplot and estimated regression line for [latex]n = 2[/latex] data pairs.

Long Description for Figure 1.22

The horizontal axis is labeled X and the vertical axis is labeled Y. Two data points, X 1, Y 1 and X 2, Y 2, are plotted on a diagonal line with a negative slope. The values of X 1 and X 2 and Y 1 and Y 2 are dissimilar.

1.8 Sums of Squares

Certain sums of squares play a key role in simple linear regression. This section considers three topics related to these sums of squares: (a) partitioning the total sum of squares, (b) defining and interpreting the coefficient of determination and the coefficient of correlation, and (c) displaying the sums of squares in an ANOVA table.

1.8.1 Partitioning the Total Sum of Squares

A topic that is closely related to fitted values and residuals is the partitioning of the total sum of squares. Figure 1.23 provides the geometric framework for the mathematical derivation provided next. There are only three points plotted in Figure 1.23. The first point plotted is [latex](X_i, Y_i)[/latex], which is a generic data pair. The other [latex]n - 1[/latex] data pairs are not plotted in order to keep the figure uncluttered. The estimated regression line associated with the n data pairs, which happens to have a negative slope, is also plotted. The second point plotted is the fitted value [latex](X_i, \hat{Y}_i)[/latex] associated with the ith data pair, which is located directly below data pair i and falls on the estimated regression line. The third point plotted is [latex](\bar{X}, \bar{Y})[/latex], which, by Theorem 1.6, will always fall on the regression line.

Figure 1.23 provides a geometric proof of the relationship

[latex]Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)[/latex]

for [latex]i = 1, 2,\ldots, n[/latex]. The relationship can also be established algebraically by recognizing that the right-hand side of this equation can be determined by just adding and subtracting [latex]\hat{Y}_i[/latex] to the left-hand side of the equation. As will be stated and proved subsequently, squaring both sides of this equation and summing results in

[latex]\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2.[/latex]

A graph depicts the partitioning of the total sum of squares.
Figure 1.23: Partitioning the total sum of squares.
 

Long Description for Figure 1.23

Three points are plotted: the data pair (X i, Y i), the fitted value (X i, Y cap i) directly below it on the regression line, and the point (X bar, Y bar), which also falls on the regression line. The regression line has a negative slope. Dotted horizontal and vertical lines mark the coordinates of the plotted points. The vertical distance between (X i, Y i) and the height Y bar is labeled Y i minus Y bar, the vertical distance between (X i, Y i) and (X i, Y cap i) is labeled Y i minus Y cap i, and the vertical distance between (X i, Y cap i) and the height Y bar is labeled Y cap i minus Y bar.

This equation involves three sums of squares that occur so often in regression analysis that they are given abbreviations, allowing the equation to be written compactly as

[latex]SST = SSR + SSE,[/latex]

where SST stands for total sum of squares, SSR stands for sum of squares for regression, and SSE stands for sum of squares for error. (The sum of squares for error has already been encountered in Theorem 1.7.) This equation expresses SST, the total variation of the observed values of the dependent variable [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] about their sample mean [latex]\bar{Y}[/latex], as the sum of two sums of squares. The first term on the right-hand side, SSR, reflects the variation of the fitted values [latex]\hat{Y}_{1}, \hat{Y}_{2}, \dots , \hat{Y}_{n}[/latex] about the sample mean [latex]\bar{Y}[/latex]. The second term on the right-hand side, SSE, reflects the variation of the observed values [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] about their associated fitted values [latex]\hat{Y}_{1}, \hat{Y}_{2}, \dots , \hat{Y}_{n}[/latex]. Since all three terms in this equation are sums of squares, all three terms are nonnegative. Notice that [latex]SST / (n-1)[/latex] is the sample variance of [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex].

The equation

[latex]SST = SSR + SSE[/latex]

partitions SST into two pieces: SSR, the portion of the total variability in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] that is explained by the regression line (that is, by the linear relationship between X and Y), and SSE, the remaining variability that is not associated with the regression line. This is why SSR is said to measure the variability in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] “explained” by the relationship between X and Y, whereas SSE measures the variability in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] left “unexplained” by the relationship between X and Y. It is reasonable to think of SSR as measuring the “signal” associated with the linear relationship and SSE as measuring the “noise” associated with the linear relationship. The result is stated formally and proven next.
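As a quick numerical check of this partition, the three sums of squares can be computed directly from a fitted model in R. The sketch below is not taken from the text; the data pairs are simulated rather than real.

# A minimal sketch using simulated data pairs
set.seed(1)
x <- 1:10
y <- 1 + 0.5 * x + rnorm(10)            # hypothetical data pairs
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)             # total sum of squares
SSR <- sum((fitted(fit) - mean(y))^2)   # sum of squares for regression
SSE <- sum(resid(fit)^2)                # sum of squares for error
c(SST = SST, SSR.plus.SSE = SSR + SSE)  # the two entries agree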

1.8.2 Coefficients of Determination and Correlation

There are two measures that are helpful in assessing the degree of the linear relationship between X and Y in a simple linear regression model. The coefficient of determination and the coefficient of correlation are defined next. The thinking behind the definition of the coefficient of determination [latex]R^{2} = SSR / SST[/latex] is as follows. The value of SST reflects the variability in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] when the values of the associated independent variables [latex]X_{1}, X_{2}, \dots , X_{n}[/latex] are ignored. The value of SSE reflects the variability in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] that remains when a fitted regression model uses [latex]X_{1}, X_{2}, \dots , X_{n}[/latex] as predictors. Their difference, [latex]SSR = SST - SSE[/latex], reflects the reduction in variability achieved by using the regression model, and the ratio [latex]SSR / SST[/latex] expresses that reduction as a fraction of the total variability.

The coefficient of determination R2 is the fraction of the variation in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] about [latex]\bar{Y}[/latex] that is accounted for by the linear relationship between X and Y. Based on the result from Theorem 1.8, [latex]SST = SSR + SSE[/latex], the coefficient of determination must satisfy [latex]0 \leq R^{2} \leq 1[/latex]. Likewise, the coefficient of correlation must satisfy [latex]-1 \leq r \leq 1[/latex], which is true for all population and sample correlations.

Values of R2 that are near 1 indicate that nearly all of the variation in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] about [latex]\bar{Y}[/latex] can be explained by the linear relationship between X and Y. This in turn implies that X is a useful predictor for Y. On the other hand, values of R2 that are near 0 indicate that very little of the variation in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] about [latex]\bar{Y}[/latex] can be explained by the linear relationship between X and Y. This in turn implies that X is not a useful predictor for Y. It is in this sense that R2 is a measure of the strength of the linear relationship between X and Y.
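A short sketch, not from the text and using simulated data pairs, shows how R2 and r can be computed from the sums of squares and checked against the value that R reports.

# A minimal sketch using simulated data pairs
set.seed(1)
x <- 1:10
y <- 1 + 0.5 * x + rnorm(10)
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(fit) - mean(y))^2)
R2  <- SSR / SST                               # coefficient of determination
r   <- sign(unname(coef(fit)[2])) * sqrt(R2)   # sign taken from the estimated slope
c(R2 = R2, reported = summary(fit)$r.squared)  # these agree
c(r = r, sample.correlation = cor(x, y))       # r matches the sample correlation of the data pairs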

There are some important limitations associated with R2 and r. First, it is important to remember that the linear relationship between X and Y might only be appropriate on a limited range of X values. Second, even a relatively large value of R2 might not provide the precision necessary for a particular application. Third, regardless of the value of R2, the scatterplot of the data pairs must always be inspected to see if a simple linear regression model is warranted. Both high and low values of R2 can be associated with a strong nonlinear relationship between X and Y. Fourth, in the case in which the experimenter can control the values of [latex]X_{1}, X_{2}, \dots , X_{n}[/latex], the magnitude of R2 depends on the choices of the independent variables, which clouds its interpretation. Fifth, the usual interpretation of the coefficient of correlation r as an estimator of [latex]\rho = {\text{Cov}(X,Y)} / {\sigma_X \sigma_Y}[/latex] is only appropriate when X and Y are random variables, which is not the case in simple linear regression because X is assumed to be observed without error.

It is a useful thought experiment to consider the scatterplots associated with the values of SST, SSR, and SSE at their extremes. These three extreme cases will be described in the next three paragraphs.

The first of these extreme cases is illustrated for [latex]n = 7[/latex] in Figure 1.24 in which

[latex]SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}e_i^2 = 0.[/latex]

The only way to achieve a sum of squares for error of zero is to have the data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex] all fall on a line, which is the regression line. Using the result from Theorem 1.8 that [latex]SST = SSR + SSE[/latex], in this case [latex]SST = SSR[/latex], which implies that [latex]R^{2} = 1[/latex]. Therefore, all of the variation in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] is explained by the linear relationship between X and Y. In addition, [latex]r = -1[/latex] if the slope of the regression line is negative and [latex]r = 1[/latex] if the slope of the regression line is positive.

A scatter plot graph of seven data pair values that fall on the linear regression line, with a negative slope. The horizontal axis is labeled X and the vertical axis is labeled Y. Seven data points are plotted on a diagonal line with a negative slope. The data points form a declining trend.
Figure 1.24: Data pairs with [latex]SSE = 0[/latex] and [latex]\hat{\beta}_{1} \ne 0[/latex] (which implies that [latex]SST = SSR[/latex] and [latex]R^{2} = 1[/latex]).

The second of these extreme cases is illustrated for [latex]n = 7[/latex] in Figure 1.25 in which

[latex]SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 = 0.[/latex]

A scatter plot graph of seven data points. The horizontal axis is labeled X and the vertical axis is labeled Y. A horizontal line with slope 0 runs parallel to the X axis, dividing the plotted points into two groups: three of the plotted points are above the horizontal line, and four are below it.
Figure 1.25: Data pairs with [latex]SSR = 0[/latex] (which implies that [latex]SST = SSE[/latex] and [latex]R^{2} = 0[/latex]).

The only way to achieve a sum of squares for regression of zero is to have an estimated regression line with slope zero. Using the result from Theorem 1.8 that [latex]SST = SSR + SSE[/latex], in this case [latex]SST = SSE[/latex], which implies that [latex]R^{2} = 0[/latex]. This means that none of the variation in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] is explained by the linear relationship between X and Y. In addition, [latex]r = 0[/latex].

The third of these extreme cases is illustrated for [latex]n = 7[/latex] in Figure 1.26 in which

[latex]SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = 0.[/latex]

A scatter plot graph with seven data points. The horizontal axis is labeled X and the vertical axis is labeled Y. A horizontal line, originating from Y axis with slope 0, runs parallel to X axis. Seven plotted points with the same Y value are plotted on the horizontal line.
Figure 1.26: Data pairs with [latex]SST = 0[/latex] (which implies that [latex]SSR = SSE = 0[/latex] and R2 is undefined).

The only way to achieve a total sum of squares of zero is to have an estimated regression line with slope zero and all points lying on the estimated regression line. Using the result from Theorem 1.8 that [latex]SST = SSR + SSE[/latex], in this case [latex]SSR = SSE = 0[/latex], and the coefficient of determination and coefficient of correlation are undefined because the denominator is zero.
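The three extreme cases can be reproduced numerically. The sketch below is not from the text; each data set is hypothetical and uses [latex]n = 7[/latex] values of the independent variable.

# A minimal sketch; each y below is a hypothetical data set paired with x = 1, ..., 7
x <- 1:7
sums.of.squares <- function(y) {
  fit <- lm(y ~ x)
  c(SST = sum((y - mean(y))^2),
    SSR = sum((fitted(fit) - mean(y))^2),
    SSE = sum(resid(fit)^2))
}
sums.of.squares(10 - 2 * x)              # SSE = 0, so SST = SSR and R^2 = 1
sums.of.squares(c(1, 2, 3, 4, 3, 2, 1))  # estimated slope is 0, so SSR = 0 and R^2 = 0
sums.of.squares(rep(4, 7))               # SST = 0, so R^2 = 0/0 is undefined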

Each of the sums of squares has an associated degrees of freedom. The total sum of squares

[latex]SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2[/latex]

has [latex]n - 1[/latex] degrees of freedom for either of two reasons: (1) one degree of freedom is lost because [latex]\bar{Y}[/latex] is used to estimate the population mean, and (2) the deviations [latex]Y_i - \bar{Y}[/latex] in the summation are subject to one constraint: they must sum to zero. The sum of squares for regression

[latex]SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2[/latex]

has 1 degree of freedom because each of the [latex]\hat{Y}_{i}[/latex] values is calculated from the same estimated regression line, which contributes two degrees of freedom (one for each estimated parameter) but is subject to the additional constraint [latex]\sum^{n}_{i=1} (\hat{Y}_{i} - \bar{Y})=0[/latex] by Theorem 1.6. The sum of squares for error

[latex]SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2[/latex]

has [latex]n - 2[/latex] degrees of freedom for the reasons outlined just before Theorem 1.7.

An alternative definition for computing the coefficient of correlation r can save on computation time, as given in the following theorem.

1.8.3 The ANOVA Table

The three sums of squares for the simple linear regression model and their associated degrees of freedom can be summarized in an analysis of variance (ANOVA) table. The four columns in the generic ANOVA table shown in Table 1.5 are (a) the source of variation, (b) the sum of squares, (c) the degrees of freedom, and (d) the mean square. The sums of squares and the degrees of freedom add to the values in the row labeled “Total”. The mean square is the ratio of the sum of squares to the associated degrees of freedom. The regression mean square is [latex]MSR = SSR/1 = SSR[/latex]. The mean square error is [latex]MSE = SSE / (n-2)[/latex]. The mean square entries do not add. Tradition dictates that the mean square associated with SST is not reported in an ANOVA table, but it does have meaning as the sample variance of [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex]. The next chapter shows how the ANOVA table can be used for hypothesis testing concerning the population slope β1 by adding a fifth column to the table.

Table 1.5: Partial ANOVA table for simple linear regression.

Source        SS     df                    MS
Regression    SSR    1                     MSR
Error         SSE    [latex]n-2[/latex]    MSE
Total         SST    [latex]n-1[/latex]
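In R, the anova function applied to a fitted lm object produces the regression and error rows of Table 1.5. The sketch below is not from the text and uses simulated data pairs; note that R does not print a Total row and also reports an F statistic and p-value, which correspond to the additional column discussed in the next chapter.

# A minimal sketch using simulated data pairs
set.seed(1)
x <- 1:10
y <- 1 + 0.5 * x + rnorm(10)
fit <- lm(y ~ x)
anova(fit)   # rows: x (regression, 1 df) and Residuals (error, n - 2 df), with Sum Sq and Mean Sq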

The definitions and theorems that are associated with fitted values, residuals, estimating the population variance σ2, partitioning the sums of squares, the coefficient of determination, the coefficient of correlation, and the ANOVA table are briefly reviewed here. The simple linear regression model

[latex]Y = \beta_0 + \beta_1 X + \epsilon[/latex]

from Definition 1.1 establishes a linear statistical relationship between an independent variable X and a dependent random variable Y. The error term ϵ has population mean 0 and finite population variance σ2. The n data pairs collected are denoted by [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)[/latex]. The fitted values [latex]\hat{Y}_{1}, \hat{Y}_{2}, \dots , \hat{Y}_{n}[/latex] are the values on the estimated regression line associated with the independent variables [latex]X_{1}, X_{2}, \dots , X_{n}[/latex]:

[latex]\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i[/latex]

for [latex]i = 1, 2,\ldots, n[/latex], as established in Definition 1.2. The associated residuals are defined by

[latex]e_i = Y_i - \hat{Y}_i[/latex]

for [latex]i = 1, 2,\ldots, n[/latex], as established in Definition 1.2. An unbiased estimator of the population variance of the error terms is

[latex]\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2[/latex]

as given in Theorem 1.7. The total sum of squares SST can be partitioned into the regression sum of squares SSR and the sum of squares for error SSE as

[latex]SST = SSR + SSE[/latex]

or

[latex]\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2[/latex]

as given in Theorem 1.8. Two quantities that measure the linear association between X and Y are the coefficient of determination

[latex]R^2 = \frac{SSR}{SST},[/latex]

which satisfies [latex]0 \leq R^{2} \leq 1[/latex], and the coefficient of correlation

[latex]r = \pm\sqrt{R^2},[/latex]

which satisfies [latex]-1 \leq r \leq 1[/latex] as defined in Definition 1.3. The coefficient of determination is the fraction of variation in [latex]Y_{1}, Y_{2}, \dots , Y_{n}[/latex] that is explained by the linear relationship with X. The sums of squares are often presented in an ANOVA table, which includes columns for the source of variation, the sum of squares, the associated degrees of freedom, and the mean squares. An additional column will be added to the ANOVA table in the next chapter, when statistical inference in simple linear regression is introduced.
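The quantities reviewed above can all be computed directly from their defining formulas, using the least squares expressions [latex]\hat{\beta}_1 = S_{XY} / S_{XX}[/latex] and [latex]\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}[/latex]. The following sketch is not from the text; the data pairs are simulated, and the calculations deliberately avoid the lm function.

# A minimal sketch using simulated data pairs
set.seed(1)
n <- 10
X <- 1:n
Y <- 1 + 0.5 * X + rnorm(n)
Sxx    <- sum((X - mean(X))^2)
Sxy    <- sum((X - mean(X)) * (Y - mean(Y)))
beta1  <- Sxy / Sxx                     # estimated slope
beta0  <- mean(Y) - beta1 * mean(X)     # estimated intercept
Yhat   <- beta0 + beta1 * X             # fitted values
e      <- Y - Yhat                      # residuals
sigma2 <- sum(e^2) / (n - 2)            # unbiased estimate of the error variance
SST    <- sum((Y - mean(Y))^2)
SSR    <- sum((Yhat - mean(Y))^2)
SSE    <- sum(e^2)
R2     <- SSR / SST                     # coefficient of determination
r      <- sign(beta1) * sqrt(R2)        # coefficient of correlation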

The point estimators for β0, β1, and σ2 in the simple linear regression model have now all been established and many of their properties have been surveyed. But without additional assumptions, it is not possible to easily obtain interval estimators or perform hypothesis testing concerning these parameters. The next chapter addresses this issue.

1.9 Exercises

  • 1.1 Establish a linear deterministic relationship between the independent variable X, the temperature in degrees Fahrenheit, and the dependent variable Y, the associated temperature in degrees Celsius.

  • 1.2 Establish a nonlinear deterministic relationship between the independent variable X, the distance between two objects with fixed masses m1 and m2, and the dependent variable Y, the gravitational force acting between the two objects, using Newton’s Law of Universal Gravitation.

  • 1.3 For the following interpretations of the independent and dependent variables, predict whether the estimated slope [latex]\hat{\beta}_1[/latex] in a simple linear regression model will be positive or negative.

    1. The independent variable X is a car’s speed and the dependent variable Y is the car’s stopping distance.
    2. The independent variable X is a car’s weight and the dependent variable Y is the car’s fuel efficiency measured in miles per gallon.
    3. The independent variable X is a husband’s height and the dependent variable Y is the wife’s height for a married couple.
    4. The independent variable X is the average annual unemployment rate and the dependent variable Y is the annual GDP for a particular country.
  • 1.4 For the simple linear regression model, show that solving the [latex]2 \times 2[/latex] set of linear normal equations

    [latex]n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i[/latex]
    [latex]\hat{\beta}_0 \sum_{i=1}^{n} X_i + \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i[/latex]

    for [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] gives the expressions for [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] given in Theorem 1.1.

  • 1.5 Consider the simple linear regression model

    [latex]Y = \beta_0 + \beta_1 X + \epsilon,[/latex]

    where

    • the population intercept is [latex]{\beta}_0 = 1[/latex],
    • the population slope is [latex]{\beta}_1 = 1/2[/latex], and
    • the error term [latex]\varepsilon[/latex] has a [latex]U(-1,1)[/latex] distribution.

    Assume that [latex]n = 10[/latex] data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_{10}, Y_{10})[/latex] are collected. The values of the independent variable X are equally likely to be one of the integers [latex]0, 1, 2,\ldots, 9[/latex]. What are the minimum and maximum values that the estimated parameters [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] can assume?

  • 1.6 For the values of the independent variables [latex]X_{1}, X_{2}, \dots , X_{n}[/latex], show that

    [latex]\sum_{i=1}^{n}(X_i - \bar{X}) = 0.[/latex]

  • 1.7 Write R commands to plot contours of the sum of squares for the sales data pairs

    [latex](X_1, Y_1) = (6, 2), \quad (X_2, Y_2) = (8, 9), \quad (X_3, Y_3) = (2, 2)[/latex]

    in the [latex](\beta_0, \beta_1)[/latex] plane.

  • 1.8 The least squares criterion applied to a simple linear regression model minimizes

    [latex]S = \sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2.[/latex]

    If instead the least absolute deviation criterion (also known as the minimum absolute deviation or MAD criterion) were applied to a simple linear regression model to minimize

    [latex]S = \sum_{i=1}^{n}|Y_i - \beta_0 - \beta_1 X_i|,[/latex]

    what are the values of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] for the sales data pairs

    [latex](X_1, Y_1) = (6, 2), \quad (X_2, Y_2) = (8, 9), \quad (X_3, Y_3) = (2, 2)?[/latex]

  • 1.9 Write a Monte Carlo simulation experiment that uses the same parameters as those in Example 1.4 (that is, [latex]\beta_0 = 1[/latex], [latex]\beta_1 = 1 / 2[/latex], [latex]\varepsilon \sim U(-1,1)[/latex], [latex]n = 10[/latex]) for 5000 replications, but this time selects the independent variable values to be equally likely integers between [latex]-5[/latex] and 5. Produce figures analogous to Figure 1.11 and Figure 1.12. Comment on your figures and how they relate to the variance–covariance matrix from Theorem 1.4.

  • 1.10 For a simple linear regression model with [latex]X_{1} = 1, X_{2} = 2, \dots , X_{n} = n[/latex] and [latex]\sigma^{2} = 1[/latex], find the variance–covariance matrix of [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex].

  • 1.11 Use Theorems 1.2 and 1.4 to show that the least squares estimator of the intercept of the regression line β0 in the simple linear regression model is a consistent estimator of β0.

  • 1.12 Example 1.6 calculates the variance–covariance matrix for a single replication of a Monte Carlo simulation experiment. Conduct this experiment for 5000 replications and report the average of the values in the variance–covariance matrix.

  • 1.13 Let L be the set of all linear estimators of the slope β1 in a simple linear regression model. Let U be the set of all unbiased estimators of the slope β1 in a simple linear regression model. Give an example of an estimator of β1 in [latex]L \cap U'[/latex].

  • 1.14 Show that the fitted simple linear regression model

    [latex]\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i[/latex]

    for [latex]i = 1, 2,\ldots, n[/latex] can be written as

    [latex]\hat{Y}_i - \bar{Y} = \hat{\beta}_1(X_i - \bar{X}),[/latex]

    where [latex]\hat{\beta}_0[/latex] and [latex]\hat{\beta}_1[/latex] are the least squares estimators of β0 and β1 and [latex]\bar{X}[/latex] and [latex]\bar{Y}[/latex] are the sample means of the observed values of the independent and dependent variables.

  • 1.15 Write a paragraph that argues why a fitted least squares regression line cannot pass through all data pairs except for one of the data pairs.

  • 1.16 One of the most common error distributions used in simple linear regression is the normal distribution with population mean 0 and finite population variance σ2, which has probability density function

    [latex]f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}, \qquad -\infty < x < \infty.[/latex]

    An alternative error distribution is the Laplace distribution with probability density function

    [latex]f(x) = \frac{1}{\sqrt{2}\,\sigma}\, e^{-\sqrt{2}\,|x - \mu|/\sigma}, \qquad -\infty < x < \infty.[/latex]

    Since the error distribution must have expected value zero by assumption, this reduces to

    [latex]f(x) = \frac{1}{\sqrt{2}\,\sigma}\, e^{-\sqrt{2}\,|x|/\sigma}, \qquad -\infty < x < \infty.[/latex]

    As parameterized here, the Laplace distribution has population variance σ2. Both of these distributions are symmetric and centered about zero.

    1. Plot the normal and Laplace error probability density functions on [latex]-3 < x < 3[/latex] and comment on any differences between the two error distributions. Use [latex]\sigma = 1[/latex] for the plots.
    2. Plot the normal and Laplace error probability density functions on [latex]4 < x < 5[/latex] and comment on any differences between the tails of the two error distributions.
    3. Fit both of these error distributions (that is, find [latex]\hat{\sigma}^{2}[/latex] for each distribution) for the forbes data set from the MASS package in R using the simple linear regression model.
  • 1.17 Let the independent variable X be a car’s speed and the dependent variable Y be the car’s stopping distance, which are going to be modeled with a simple linear regression model. In which of the following scenarios do you expect to have a larger population variance of the error term?

    1. The data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_{20}, Y_{20})[/latex] are [latex]n = 20[/latex] new cars that are all of the same make and model.
    2. The data pairs [latex](X_1, Y_1), (X_2, Y_2), \ldots, (X_{20}, Y_{20})[/latex] are [latex]n = 20[/latex] new cars from [latex]n = 20[/latex] different car manufacturers.
  • 1.18 Show that the sum of squares for regression in a simple linear regression model can be written as

    [latex]SSR = \hat{\beta}_1 S_{XY}.[/latex]

  • 1.19 Show that the sum of squares for regression in a simple linear regression model can be written as

    [latex]SSR = \hat{\beta}_1^2 S_{XX}.[/latex]

  • 1.20 Consider the data pairs in the Formaldehyde data set built into the base R language. Use the help function in R to determine the interpretation of the independent and dependent variables. Fit a simple linear regression model to the data pairs and interpret the meaning of [latex]\hat{\beta}_0[/latex], [latex]\hat{\beta}_1[/latex], and [latex]\hat{\sigma}^{2}[/latex]. Also, calculate SST, SSR, and SSE for this data set.

  • 1.21 Consider the data pairs collected by James Forbes that are given in the data frame forbes contained in the MASS package in R. The independent variable is the boiling point (in degrees Fahrenheit) and the dependent variable is the barometric pressure (in inches of mercury). For a simple linear regression model, calculate

    • the fitted values,
    • the residuals,
    • the sum of squares for error, and
    • the mean square error

    without using the lm function. Then use the lm function to check the correctness of the values that you calculate.

  • 1.22 This exercise investigates the effect of controllable values of [latex]X_{1}, X_{2}, \dots , X_{n}[/latex] on the coefficient of determination R2 in simple linear regression. Consider the simple linear regression model

    Y=β0+β1X+ϵ,

    where

    • the population intercept is [latex]\beta_{0} = 1[/latex],
    • the population slope is [latex]\beta_{1} = 1 / 2[/latex], and
    • the error term [latex]\varepsilon[/latex] has a [latex]N(0,1)[/latex] distribution.

    Conduct a Monte Carlo simulation with 40,000 replications that estimates the expected coefficient of determination for [latex]n = 10[/latex] data pairs under the following two ways of setting the values of [latex]X_{1}, X_{2}, \dots , X_{10}[/latex].

    1. Let [latex]X_{i} = i[/latex] for [latex]i = 1, 2, \dots, 10[/latex].
    2. Let [latex]X_{1} = X_{2} = \cdots = X_{5} = 5[/latex] and [latex]X_{6} = X_{7} = \cdots = X_{10} = 6[/latex].
  • 1.23 Let SX and SY be the sample standard deviations of the independent and dependent variables, respectively. Show that the following four definitions of the coefficient of correlation are equivalent.

    1. [latex]r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{S_X}\right)\left(\frac{Y_i - \bar{Y}}{S_Y}\right)[/latex]
    2. [latex]r = \pm\sqrt{\frac{SSR}{SST}}[/latex]
    3. [latex]r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}}[/latex]
    4. [latex]r = \hat{\beta}_1\sqrt{\frac{S_{XX}}{S_{YY}}}[/latex]