Chapter 2 Inference in Simple Linear Regression
The focus now shifts to statistical inference in the setting of a simple linear regression model applied to a data set containing the n data pairs [latex]( X_1, \, Y_1 ), \, ( X_2, \, Y_2 ), \, \ldots , \, ( X_n, \, Y_n )[/latex]. The statistical inference typically takes the form of confidence intervals and hypothesis tests concerning the various parameters in the simple linear regression model. More specifically, the sections that follow concern statistical inference concerning σ2, β1, β0, [latex]E[Y_h][/latex], [latex]Y_h^\star[/latex], and joint statistical inference concerning β0 and β1.
2.1 Simple Linear Regression with Normal Error Terms
Drawing mathematically tractable statistical inferences concerning the parameters in a simple linear regression model is not possible with the current assumptions given in Definition 1.1. The problem lies in the vagueness of the assumptions about the error term. The assumption in a simple linear regression model is that the error term ϵ is a random variable with population mean 0 and finite population variance σ2. The most common way of making this assumption more specific is to assume that the error term is normally distributed with population mean 0 and finite population variance σ2. This will be stated formally in the following definition.
Instead of just any probability distribution with a population mean of zero, we now specify that the error term should have a bell-shaped distribution centered about zero. Even though this is a more limiting assumption, it will allow us to establish exact confidence intervals and perform the associated hypothesis tests on the model parameters and other aspects of the model that might be of interest. Under this more restricted model, it is important to ensure that the residuals (which estimate the error terms) do indeed have a bell-shaped distribution with constant variance over the values of the independent variable for which the model is valid. Another way of stating Definition 2.1 is
Since this model is a special case of the simple linear regression model from Definition 1.1, all of the results from the previous chapter still apply to the simple linear regression model with normal error terms. As before, for the n data pairs [latex](X_1, \, Y_1), \, (X_2, \, Y_2), \, \ldots , \, ( X_n, \, Y_n )[/latex], the model becomes
for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], where [latex]\epsilon_1, \, \epsilon_2, \, \ldots , \, \epsilon_n[/latex] are mutually independent and identically distributed [latex]N\left(0, \, \sigma ^ {\, 2} \right)[/latex] random variables. The geometry associated with this model is shown in Figure 2.1. The model regression line (not the estimated regression line) [latex]E[Y] = \beta_0 + \beta_1 X[/latex] is shown with a negative slope. There are [latex]n=4[/latex] data pairs collected from this simple linear regression model with normal error terms. The probability density function of each of the Yi values, rotated clockwise by 90°, highlights the fact that the population error distribution is normal with a population variance that does not change from one data pair to the next. The geometry illustrated here indicates how a simulation of a simple linear regression model with normal error terms is conducted. Once an Xi value has been established, a Yi value is generated as [latex]Y_i \sim N \left( \beta_0 + \beta_1 X_i, \, \sigma ^ {\, 2} \right)[/latex], for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. A realization of four data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \left( X_3, \, Y_3 \right), \, \left( X_4, \, Y_4 \right)[/latex] is given by the points plotted in Figure 2.1. The estimated regression line [latex]\hat{Y} = \hat \beta_0 + \hat \beta_1 X[/latex] can be calculated from these four data pairs in the usual fashion.
Long Description for Figure 2.1
Four data pairs are plotted. Dotted vertical lines rise from the values of the independent variable, labeled [latex]X_2[/latex], [latex]X_1[/latex], [latex]X_4[/latex], and [latex]X_3[/latex] from left to right. The population regression line, labeled [latex]E[Y] = \beta_0 + \beta_1 X[/latex], is drawn with a negative slope. The first data pair lies below the regression line and the remaining three lie above it. Above each dotted line, a probability density function rotated 90 degrees clockwise is centered on the regression line, indicating that [latex]Y_i \sim N \left( \beta_0 + \beta_1 X_i, \, \sigma ^ {\, 2} \right)[/latex] for each value of the independent variable. The data pair above [latex]X_4[/latex] is labeled as an observed value of [latex]Y_4[/latex].
2.2 Maximum Likelihood Estimators
Since we have now specified a parametric distribution for the error terms, maximum likelihood estimation can be used to determine parameter estimates for β0, β1, and σ2. As seen in the next result, the news is good. The maximum likelihood estimators for β0 and β1 are identical to the least squares estimators and the maximum likelihood estimator for σ2 differs from the associated least squares estimator by a constant factor.
The restriction that [latex]SSE > 0[/latex] in Theorem 2.1 is not a particularly restrictive assumption in practice. The only way to achieve a sum of squares for error of zero is to have all of the data pairs fall on a line. If this is indeed the case, then it is possible that a deterministic model, rather than a statistical model, is appropriate.
The fact that the least squares estimators and maximum likelihood estimators for β0 and β1 are identical is welcome news. Since both techniques give the same values for [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex], there is no lingering doubt as to which technique is appropriate for a particular modeling situation. But there is a slight difference between the estimators for σ2. In the previous chapter, the sum of squares for error was divided by the appropriate degrees of freedom to arrive at the following unbiased estimator for σ2:
or [latex]\hat{\sigma} ^ {\, 2} = SSE / (n - 2)[/latex]. On the other hand, the maximum likelihood estimator for σ2 uses a similar formula, but with an n rather than an [latex]n - 2[/latex] in the denominator. For large n, the difference is slight. But for small n, the difference can be significant. For the [latex]n = 3[/latex] sales data pairs first introduced in Example 1.3 with the variance of the error terms estimated in Example 1.10, for instance, the unbiased estimate of σ2 is [latex]\hat{\sigma} ^ {\, 2} = 14[/latex], whereas the maximum likelihood estimate of σ2 is [latex]\hat{\sigma} ^ {\, 2} = 14 / 3[/latex]. The standard practice in regression analysis is to use the unbiased estimator. In general, maximum likelihood estimators are not guaranteed to be unbiased, although they are consistent and asymptotically efficient. For the simple linear regression model with normal error terms, the maximum likelihood estimators for the slope and intercept are unbiased, but the maximum likelihood estimator for σ2 is biased.
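To make the distinction concrete, the short R sketch below computes both estimates of σ2 from a fitted model. The three data pairs used here are hypothetical and are not the sales data from Example 1.3; only the formulas [latex]SSE / (n - 2)[/latex] and [latex]SSE / n[/latex] are the point of the illustration.

```r
# hypothetical n = 3 data pairs (not the sales data from Example 1.3)
x <- c(1, 2, 3)
y <- c(4, 9, 11)
fit <- lm(y ~ x)
SSE <- sum(residuals(fit) ^ 2)   # sum of squares for error
n   <- length(y)
SSE / (n - 2)                    # unbiased estimate of sigma^2 (the MSE)
SSE / n                          # maximum likelihood estimate of sigma^2
```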
In a more advanced course on regression, you will prove that the maximum likelihood estimators for the population intercept β0 and the population slope β1 (which are the same as the least squares estimators) are consistent, sufficient, and efficient. The property of consistency indicates that the estimators will converge to the associated population values as [latex]n \rightarrow \infty[/latex]; symbolically,
and
for any [latex]\delta > 0[/latex]. The property of sufficiency indicates that all of the information concerning the estimation of β0 and β1 is encapsulated in [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex], respectively. The property of efficiency indicates that [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] have the smallest possible population variance among all unbiased estimators for β0 and β1, respectively.
2.3 Inference in Simple Linear Regression
This section considers inference concerning the parameters in the simple linear regression model with normal error terms, which is usually performed by constructing confidence intervals and conducting hypothesis tests. The following three subsections consider the sampling distributions of [latex]\hat{\sigma}^{2}[/latex], [latex]\hat \beta_1[/latex], and [latex]\hat \beta_0[/latex] under the simple linear regression model with normal error terms. We begin with σ2.
2.3.1 Inference Concerning σ2
Even though σ2 is typically the parameter of least interest among the three parameters in simple linear regression, there is an important result concerning the probability distribution of [latex]SSE / \sigma ^ {\, 2}[/latex] that is critical to the derivation of other results, so it is taken up first.
As an illustration of the use of Theorem 2.2, the derivation that follows develops an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for σ2. Under the simple linear regression model with normal error terms, Theorem 2.2 states that
For some α value between 0 and 1, placing an area of [latex]\alpha / 2[/latex] in each tail of the chi-square distribution with [latex]n - 2[/latex] degrees of freedom gives
where the second value in the subscripts corresponds to right-hand tail probabilities. Rearranging the inequality to isolate σ2 in the center of the inequality gives an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for σ2 as
This derivation is a proof of the following theorem.
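As a brief numerical sketch of the interval derived above, the R code below computes the exact two-sided 95% confidence interval for σ2 from a fitted model. The data pairs are hypothetical, and the subscript convention (right-tail probabilities) is handled by qchisq.

```r
# hypothetical data pairs; SSE and n are computed from them
x <- c(2, 4, 5, 7, 9, 12)
y <- c(3, 6, 8, 9, 13, 16)
fit   <- lm(y ~ x)
n     <- length(y)
SSE   <- sum(residuals(fit) ^ 2)
alpha <- 0.05
lower <- SSE / qchisq(1 - alpha / 2, df = n - 2)  # divide by the upper chi-square critical value
upper <- SSE / qchisq(alpha / 2, df = n - 2)      # divide by the lower chi-square critical value
c(lower, upper)                                   # exact 95% confidence interval for sigma^2
```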
2.3.2 Inference Concerning β1
In order to perform statistical inference concerning the population slope of the regression line β1, it is first necessary to establish the sampling distribution of the estimator [latex]\hat \beta_1[/latex]. Since the error terms [latex]\epsilon_1, \, \epsilon_2, \, \ldots , \, \epsilon_n[/latex] are mutually independent and identically distributed [latex]N\left( 0, \, \sigma ^ {\, 2} \right)[/latex] random variables under the simple linear regression model with normal error terms from Definition 2.1, the associated dependent variables [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] are also mutually independent normally distributed random variables because [latex]Y_i = \beta_0 + \beta_1 X_i + \epsilon_i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. Furthermore, recall from Theorem 1.3 that [latex]\hat \beta_1[/latex] can be written as a linear combination of [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] as
Since a linear combination of mutually independent normally distributed random variables is itself normally distributed, we can conclude that [latex]\hat \beta_1[/latex] is normally distributed.
Now that the normality of [latex]\hat \beta_1[/latex] has been established, the next step is to find the population mean and population variance of the point estimator [latex]\hat \beta_1[/latex], which will completely determine the distribution of [latex]\hat \beta_1[/latex]. From Theorem 1.2 and Theorem 1.4, the population mean and the population variance of the point estimator [latex]\hat \beta_1[/latex] are
The usual method for conducting statistical inference on a point estimator that is normally distributed is to subtract its population mean and divide by its population standard deviation. A problem that arises here is that the population variance of [latex]\hat \beta_1[/latex] in Theorem 2.4 is not known for a particular set of n data pairs because σ2 is not known. The population variance of [latex]\hat \beta_1[/latex], however, can be estimated by
where [latex]\hat{\sigma} ^ {\, 2} = MSE = SSE / (n - 2)[/latex], which is a quantity that can be computed from the n data pairs. We can now use
as a pivotal quantity in the following result.
Theorem 2.5 can be used to construct confidence intervals and perform hypothesis tests concerning β1. In many applications, β1 is the key parameter in the regression analysis because statistical evidence showing that it differs from zero indicates a linear relationship between X and Y if the assumptions associated with a simple linear regression model with normal error terms are met.
As an illustration, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for β1 is developed as follows. Theorem 2.5 states that
For some α between 0 and 1, placing an area of [latex]\alpha / 2[/latex] in each tail of the t distribution with [latex]n - 2[/latex] degrees of freedom gives
where the second value in the subscripts corresponds to right-hand tail probabilities. Rearranging the inequality to isolate β1 in the center of the inequality gives an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for β1 as
where
This constitutes a derivation of the following theorem.
The hypothesis test concerning β1 with the null hypothesis
is based on the test statistic
which has the t distribution with [latex]n - 2[/latex] degrees of freedom under H0 and the simple linear regression model with normal errors. The most common value for [latex]\beta_1^\star[/latex] in the null hypothesis is [latex]\beta_1^\star = 0[/latex], which tests whether the estimated slope of the regression line [latex]\hat \beta_1[/latex] differs significantly from zero. This type of hypothesis test concerning β1 will be illustrated later in this chapter.
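A short R sketch of these calculations is given below. The data frame dat and its columns x and y are hypothetical; the point is that the hand calculation based on the pivotal quantity agrees with the interval reported by confint. The analogous calculation for the intercept uses the (Intercept) row of the same output.

```r
# hypothetical data frame with columns x and y
dat <- data.frame(x = c(1, 3, 4, 6, 8, 9, 11, 14),
                  y = c(2, 5, 4, 8, 9, 11, 13, 15))
fit   <- lm(y ~ x, data = dat)
n     <- nrow(dat)
MSE   <- sum(residuals(fit) ^ 2) / (n - 2)
Sxx   <- sum((dat$x - mean(dat$x)) ^ 2)
b1    <- coef(fit)[2]
se_b1 <- sqrt(MSE / Sxx)                         # estimated standard deviation of beta1-hat
t_obs <- b1 / se_b1                              # test statistic for H0: beta1 = 0
p_val <- 2 * pt(-abs(t_obs), df = n - 2)         # two-tailed p-value
b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1    # exact 95% confidence interval for beta1
confint(fit, "x", level = 0.95)                  # matches the interval above
```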
2.3.3 Inference Concerning β0
In order to perform statistical inference concerning the population intercept of the regression line β0, it is first necessary to establish the sampling distribution of [latex]\hat \beta_0[/latex].
Since the error terms [latex]\epsilon_1, \, \epsilon_2, \, \ldots , \, \epsilon_n[/latex] are mutually independent and identically distributed [latex]N\left( 0, \, \sigma ^ {\, 2} \right)[/latex] random variables under the simple linear regression model with normal error terms from Definition 2.1, the associated dependent variables [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] are also mutually independent normally distributed random variables. Furthermore, recall from Theorem 1.3 that [latex]\hat \beta_0[/latex] can be written as a linear combination of [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] as
Since a linear combination of mutually independent normally distributed random variables is itself normally distributed, we can conclude that [latex]\hat \beta_0[/latex] is normally distributed.
Now that the normality of [latex]\hat \beta_0[/latex] has been established, the next step is to find the population mean and population variance of the point estimator [latex]\hat \beta_0[/latex], which will completely determine the distribution of [latex]\hat \beta_0[/latex]. From Theorem 1.2 and Theorem 1.4, the population mean and the population variance of the point estimator [latex]\hat \beta_0[/latex] are
The usual method for conducting statistical inference on a point estimator that is normally distributed is to subtract its population mean and divide by its population standard deviation. A problem that arises here is that the population variance of [latex]\hat \beta_0[/latex] in Theorem 2.7 is not known for a particular set of n data pairs because σ2 is not known. The population variance of [latex]\hat \beta_0[/latex], however, can be estimated by
where [latex]\hat{\sigma} ^ {\, 2} = MSE = SSE / (n - 2)[/latex], which is a quantity that can be computed from the n data pairs. We can now use
as a pivotal quantity in the following result.
Theorem 2.8 can be used to construct confidence intervals and perform hypothesis tests concerning β0. In many applications, there is an interest in whether β0 is statistically different from 0. The results of this hypothesis test and the particular setting for the simple linear regression model indicate whether forcing a simple linear regression model through the origin is appropriate.
As an illustration, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for β0 is developed as follows. Theorem 2.8 states that
For some α between 0 and 1, placing an area of [latex]\alpha / 2[/latex] in each tail of the t distribution with [latex]n - 2[/latex] degrees of freedom gives
where the second value in the subscripts corresponds to right-hand tail probabilities. Rearranging the inequality to isolate β0 in the center of the inequality gives an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for β0 as
where
This constitutes a derivation of the following theorem.
The hypothesis test concerning β0 with the null hypothesis
is based on the test statistic
which has the t distribution with [latex]n - 2[/latex] degrees of freedom under H0 and the simple linear regression model with normal errors. The most common value for [latex]\beta_0^\star[/latex] in the null hypothesis is [latex]\beta_0^\star = 0[/latex], which tests whether the estimated intercept of the regression line [latex]\hat \beta_0[/latex] differs significantly from zero. The p-value associated with this hypothesis test, together with the context associated with the meaning of X and Y, might influence whether a modeler fits a simple linear regression model that is forced through the origin.
2.3.4 Inference Concerning [latex]E[Y_h][/latex]
Many applications of simple linear regression require not only point and interval estimates for the regression parameters β0, β1, and σ2, but also a point and interval estimate for the expected value of Y associated with a particular value of X. In this context, the simple linear regression model is being used to forecast the conditional expected value of Y from the data pairs. Denote the X-value of interest by Xh, which is a fixed constant that is observed without error within the scope of the simple linear regression model. The associated random Y-value is denoted by Yh, which has conditional expected value [latex]E[Y_h][/latex]. This compact notation for the conditional expected value is adopted over the more precise [latex]E[Y_h \, | \, X = X_h][/latex]. If the parameters β0 and β1 are known, then the point estimator for [latex]E[Y_h][/latex] is
which is simply the height of the population regression line at Xh. In nearly all applications, however, we estimate the parameters β0, β1, and σ2 from the data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex]. In this case, the point estimator for [latex]E[Y_h][/latex] is
which is simply the height of the estimated regression line at Xh. When the data pairs [latex]\left( X_1, \, Y_1 \right), \left( X_2, \, Y_2 \right), \dots, \left( X_n, \, Y_n \right)[/latex] are tightly clustered about the regression line, we expect a fairly precise point estimate for [latex]E[Y_h][/latex]. A more explicit notation for [latex]\hat{Y} _ h[/latex] is [latex]\hat E [ Y_h \, | \, X = X_h ][/latex] or [latex]\hat{\mu} _ {Y_h \, | \, X = X_h}[/latex]. We opt for the more compact [latex]\hat{Y}_h[/latex] and leave it to the reader to mentally convert this to the more explicit meaning.
The value of Xh might correspond to one of [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex], or might correspond to another value of X. It is critical that Xh fall in the scope of the simple linear regression model. If Xh is less than [latex]\min \left\{ X_1, \, X_2, \, \ldots, \, X_n \right\}[/latex] or greater than [latex]\max \left\{ X_1, \, X_2, \, \ldots, \, X_n \right\}[/latex], then there should be some evidence, perhaps evidence based on data sets collected previously or evidence provided by experts in the subject matter, that the relationship between X and Y remains linear outside of the scope of the data pairs. Without evidence of this nature, one should not extrapolate beyond the scope of the simple linear regression model.
With the point estimator for [latex]E[Y_h][/latex] established, we now seek a pivotal quantity which can be used to construct confidence intervals and perform hypothesis tests concerning [latex]E[Y_h][/latex]. We continue to assume that the simple linear regression model with normally distributed error terms is appropriate. Recall from Theorem 1.3 that [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] can be written as linear combinations of [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex]:
and
for constants [latex]c_1, \, c_2, \, \ldots, \, c_n[/latex] and [latex]a_1, \, a_2, \, \ldots, \, a_n[/latex]. Furthermore, [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] are mutually independent random variables because [latex]\epsilon_1, \, \epsilon_2, \, \ldots, \, \epsilon_n[/latex] are mutually independent random variables in the simple linear regression model [latex]Y_i = \beta_0 + \beta_1 X_i + \epsilon_i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. This implies that [latex]\hat{Y} _ h[/latex] can be written as
Since a linear combination of mutually independent normally distributed random variables is itself normally distributed, [latex]\hat{Y} _ h[/latex] is normally distributed under the simple linear regression model with normal error terms.
Now that the normality of [latex]\hat{Y} _ h[/latex] has been established, we seek its population mean and population variance, which will completely define its probability distribution. Since [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] are unbiased estimators of β0 and β1, respectively, the population mean of [latex]\hat{Y} _ h[/latex] is
via Theorem 1.2. So the point estimator [latex]\hat{Y}_h = \hat \beta_0 + \hat \beta_1 X_h[/latex] is an unbiased estimator of [latex]{E[Y_h] = \beta_0 + \beta_1 X_h}[/latex]. Next, we calculate the population variance of [latex]\hat{Y} _ h[/latex]. Since [latex]\bar Y[/latex] and [latex]\hat \beta _ 1[/latex] are independent random variables (this was shown in the derivation prior to the establishment of the variance–covariance matrix of [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] in Theorem 1.4),
using the lower-right hand entry in the variance–covariance matrix for [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] from Theorem 1.4. This constitutes a derivation of the following result.
The population variance of [latex]\hat{Y} _ h[/latex] in Theorem 2.10 is of particular interest. If the experimenter has complete control over the choice of the values of the independent variables [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] in the data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex], the best choice is to (a) choose [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] so that [latex]S_{XX}[/latex] is as large as possible (that is, spread the [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] out as much as possible), and (b) choose [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] such that [latex]\bar X[/latex] equals Xh. These choices for the values of the independent variables will result in the smallest possible population variance for [latex]\hat{Y} _ h[/latex].
The geometry associated with the choice of the [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] values is illustrated in Figure 2.2. In each of the two scatterplots, there are [latex]n = 24[/latex] simulated data pairs drawn from simple linear regression models with normal error terms having identical population parameters β0, β1, and σ2.
Long Description for Figure 2.2
In both scatterplots, the horizontal axis is labeled X and the vertical axis is labeled Y. Left graph: 24 data points plotted in the X Y plane are clustered around the center of a fitted line with a positive slope; 10 points fall above the line, 10 fall below it, and 4 lie on it. Right graph: 24 data points plotted in the X Y plane are spread out along a fitted line with a positive slope; 12 points lie above the line and 12 lie below it.
Although they are not labeled, the axes on the two graphs have identical scales, and the two regression lines have nearly the same slope and intercept. The key difference between the two graphs is that the values of the independent variable are less spread out in the left-hand graph and more spread out in the right-hand graph. This difference in the spread of [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] leads to three conclusions. First, the scope of the regression model is narrower for the graph on the left. Second, the estimation of β1 is less stable when [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] are tightly clustered as in the graph on the left. Third, inference on [latex]E[Y_h][/latex] will be less precise for the graph on the left because the variance of [latex]\hat{Y} _ h[/latex] is larger via Theorem 2.10.
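A small simulation, with entirely hypothetical parameter values, illustrates the effect of the spread of the independent variable on the precision of [latex]\hat \beta_1[/latex]; the empirical variances should be close to [latex]\sigma ^ {\, 2} / S_{XX}[/latex] for each design.

```r
set.seed(7)
beta0 <- 2; beta1 <- 1; sigma <- 1; n <- 24      # hypothetical parameter values
x_tight  <- seq(4.5, 5.5, length.out = n)        # tightly clustered values of X
x_spread <- seq(0, 10, length.out = n)           # spread-out values of X (same mean)
slope <- function(x) {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)  # simulate one realization of the model
  coef(lm(y ~ x))[2]                             # return the estimated slope
}
var(replicate(5000, slope(x_tight)))             # empirical variance of beta1-hat, clustered design
var(replicate(5000, slope(x_spread)))            # empirical variance of beta1-hat, spread-out design
sigma ^ 2 / sum((x_tight  - mean(x_tight)) ^ 2)  # theoretical variance sigma^2 / Sxx, clustered
sigma ^ 2 / sum((x_spread - mean(x_spread)) ^ 2) # theoretical variance sigma^2 / Sxx, spread-out
```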
The development of a pivotal quantity for statistical inference concerning [latex]E[Y_h][/latex] follows along the same line of reasoning as that for β1 and β0. We cannot calculate the population variance of [latex]\hat{Y} _ h[/latex] from the n data pairs because the value of σ2 is unknown, so it is estimated by
where [latex]\hat{\sigma} ^ {\, 2} = MSE = SSE / (n - 2)[/latex], which is a quantity that can be computed from the n data pairs. We can now use
as a pivotal quantity in the following result.
The proof of this result is analogous to the associated proofs for the pivotal quantities for inference concerning β0 and β1. This pivotal quantity can be used as a test statistic when conducting a hypothesis test concerning [latex]E[Y_h][/latex]. Proceeding in an analogous fashion to the development of the confidence intervals for β1 and β0, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for [latex]E[Y_h][/latex] is given next.
The calculation of an exact two-sided confidence interval for [latex]E[Y_h][/latex] from a data set consisting of n data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex] using Theorem 2.12 will be illustrated in the next example.
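As a generic R sketch of this calculation (the data frame dat and the value Xh below are hypothetical), predict with interval = "confidence" returns the point estimate [latex]\hat{Y}_h[/latex] and an interval that should reproduce the one in Theorem 2.12; a direct computation is included as a check.

```r
# hypothetical data frame with columns x and y
dat <- data.frame(x = c(1, 3, 4, 6, 8, 9, 11, 14),
                  y = c(2, 5, 4, 8, 9, 11, 13, 15))
fit <- lm(y ~ x, data = dat)
Xh  <- 7                                          # value of the independent variable of interest
predict(fit, newdata = data.frame(x = Xh),
        interval = "confidence", level = 0.95)    # point estimate and exact 95% CI for E[Y_h]
# the same interval computed directly
n    <- nrow(dat)
MSE  <- sum(residuals(fit) ^ 2) / (n - 2)
Sxx  <- sum((dat$x - mean(dat$x)) ^ 2)
Yhat <- sum(coef(fit) * c(1, Xh))                 # beta0-hat + beta1-hat * Xh
se   <- sqrt(MSE * (1 / n + (Xh - mean(dat$x)) ^ 2 / Sxx))
Yhat + c(-1, 1) * qt(0.975, df = n - 2) * se
```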
2.3.5 Inference Concerning [latex]Y_h^\star[/latex]
The previous section considered statistical inference on the mean response associated with a value Xh of the independent variable for the data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex]. This section considers statistical inference associated with the introduction of a new data pair. The value of the independent variable for this new data pair is, as before, Xh, which is a fixed constant observed without error within the scope of the model. We would like to perform some type of statistical inference on the associated value of the dependent variable [latex]Y_h^\star[/latex]. The star superscript denotes that this is an additional data pair that is not one of the original n data pairs used to fit the simple linear regression model. A similar, but fundamentally different, analysis must be used when we consider the introduction of an additional data pair
Three examples in which this type of analysis is appropriate are given below.
A sociologist collects the [latex]n = 50[/latex] data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_{50}, \, Y_{50} \right)[/latex], where the independent variable X is the wife’s height and the dependent variable Y is the husband’s height for 50 married couples. These data pairs represent 50 couples surveyed by the sociologist. If the sociologist knows the height of a married woman who is not in the group of 50, what statistical inference can the sociologist make about her husband’s height?
An economist collects the [latex]n = 50[/latex] data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_{50}, \, Y_{50} \right)[/latex], where the independent variable is the average annual unemployment rate and the dependent variable is the annual gross domestic product (GDP) for a particular country. If these data pairs represent the last 50 years of data, and the economist knows the average annual unemployment rate for next year, what statistical inference can the economist perform on the random GDP for next year?
An engineer collects the [latex]n = 50[/latex] data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_{50}, \, Y_{50} \right)[/latex], where the independent variable is the speed of a car and the dependent variable is the car’s stopping distance for 50 different cars. If the engineer knows the speed of a 51st car to be tested, what statistical inference can the engineer perform on its random stopping distance?
The common thread that runs through the three examples is that there is a new data pair, [latex]\left( X_{51}, \, Y_{51} \right) = \left( X_h , \, Y_h^\star \right)[/latex], that is being introduced.
So we would like to predict the outcome for a new value of the dependent variable associated with the new value of the independent variable, namely Xh. As before, the value of Xh need not correspond to one of the [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] values, but it must fall within the scope of the model unless there is compelling evidence supporting statistical inference outside of the scope of the model. In order to help frame the issues associated with the case of a new data pair being introduced, the next paragraph considers the very rare case in which all of the parameters in the simple linear regression model are known.
Consider the simplest case in which all parameters are known in the simple linear regression model. In the previous section, Theorem 2.12 gave a confidence interval for [latex]E \left[ Y_h \right][/latex], which is a fixed constant. In this section, we desire a statistical interval for [latex]Y_h^\star[/latex], which is a random variable. Because of this fundamental difference in the nature of [latex]E \left[ Y_h \right][/latex] and [latex]Y_h^\star[/latex], the interval derived here for [latex]Y_h^\star[/latex] is a prediction interval. If all of the parameters in the regression model are known, Definition 2.1 indicates that
Standardizing this normally distributed random variable,
The probability that this standard normal random variable lies between the [latex]\alpha / 2[/latex] and [latex]1 - \alpha / 2[/latex] fractiles of the standard normal distribution is
Some algebra on the inequality gives an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] prediction interval for [latex]Y_h^\star[/latex] as
Although this derivation is straightforward, the vast majority of regression applications do not have parameters that are known a priori, so we now pivot to the more practical setting in which the parameters must be estimated.
In the case in which the parameters in the simple linear regression model are unknown, they must be estimated from the n data pairs. The point estimator for [latex]Y_h^\star[/latex] is the same as the point estimator in the previous section:
Handling the variability of [latex]Y_h^\star[/latex] requires a little more finesse. When the parameters are estimated from the n data pairs, the population variance of the prediction error [latex]Y_h^\star - \hat{Y}_h[/latex] comes from two sources:
the population variance associated with a new observation of the dependent variable, and
the population variance induced by estimating the intercept and slope of the fitted regression line from the n data pairs.
Since the new data pair is independent of the original n data pairs, the population variance of the prediction error is
Since [latex]\hat{Y} _ h[/latex] is normally distributed via Theorem 2.10 and [latex]Y_h^\star[/latex] is independent of [latex]\hat{Y} _ h[/latex] and is also normally distributed, we have the following result.
The population mean of the normal distribution in Theorem 2.13 is estimated by
and the population variance of the normal distribution is estimated by
Using an analogous approach to the pivotal quantities in the previous sections, the following quantity can be used for statistical inference concerning [latex]Y_h^\star[/latex].
The proof of this result is analogous to the associated proofs for the pivotal quantities for inference concerning β0 and β1. This pivotal quantity can be used as a test statistic when conducting a hypothesis test concerning [latex]Y_h^\star[/latex]. Proceeding in an analogous fashion to the development of the confidence intervals for β1 and β0, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] prediction interval for [latex]Y_h^\star[/latex] is given next.
Adding a 1 inside the expression for [latex]\hat{V} \big[ \hat{Y}_h^{\kern 0.15em \star} \big][/latex] ensures that the prediction interval for [latex]Y_h^\star[/latex] will be wider than the associated confidence interval for [latex]E[ Y_h ][/latex] from Theorem 2.12. In both results, the intervals are narrowest when Xh is near [latex]\bar X[/latex] and when the observations of the independent variable are spread out so as to maximize [latex]S_{XX}[/latex].
A thought experiment that helps clarify the difference between the confidence interval for [latex]E[Y_h][/latex] and the prediction interval for [latex]Y_h^\star[/latex] is to consider the two intervals associated with [latex]X_h = \bar X[/latex]. A careful inspection of the confidence interval given in Theorem 2.12 indicates that the width of the confidence interval for [latex]E[Y_h][/latex] approaches zero as [latex]n \rightarrow \infty[/latex]. Increasing the number of data pairs without bound results in perfect precision for the point estimator of the conditional expected value, [latex]\hat{Y}_h = \hat \beta_0 + \hat \beta_1 X_h[/latex]. On the other hand, a careful inspection of the prediction interval given in Theorem 2.15 indicates that the width of the prediction interval for [latex]Y_h^\star[/latex] approaches a finite, nonzero value as [latex]n \rightarrow \infty[/latex]. When a new observation is collected at [latex]X_h = \bar X[/latex], the prediction error [latex]Y_h^\star - \hat{Y}_h[/latex] has an estimated variance that approaches the MSE (which, in turn, approaches σ2) as [latex]n \rightarrow \infty[/latex]. It is not possible to predict the random response associated with a new data pair with perfect precision.
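The contrast between the two intervals can be seen numerically with predict, using interval = "confidence" and interval = "prediction" on the same hypothetical data frame; the prediction interval is noticeably wider at every value of Xh.

```r
# hypothetical data frame with columns x and y
dat <- data.frame(x = c(1, 3, 4, 6, 8, 9, 11, 14),
                  y = c(2, 5, 4, 8, 9, 11, 13, 15))
fit <- lm(y ~ x, data = dat)
new <- data.frame(x = 7)                                             # hypothetical Xh
predict(fit, newdata = new, interval = "confidence", level = 0.95)   # interval for E[Y_h]
predict(fit, newdata = new, interval = "prediction", level = 0.95)   # wider interval for Y_h-star
```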
This section and the previous four sections have introduced various techniques for statistical inference in the setting of a simple linear regression model with normal error terms. Table 2.1 summarizes many of the key results from these sections. The first column gives the quantity of interest. The second column gives the pivotal quantity and its distribution. This pivotal quantity serves as the test statistic in a hypothesis test concerning the quantity of interest. The third column gives an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for the parameter of interest in the first four rows and an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] prediction interval for [latex]Y_h^\star[/latex] in the last row.
Table 2.1: Pivotal quantities and exact statistical intervals for a simple linear regression model.
The exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence intervals for the slope β1 and the intercept β0 in a simple linear regression model with normal error terms developed in Sections 2.3.2 and 2.3.3 might be combined to provide a joint confidence region for both parameters. Occasions arise in regression modeling in which joint inference on both β0 and β1 simultaneously is required. As a particular instance, recall from Examples 2.2 and 2.3 that the unbiased point estimators for β0 and β1 for the Forbes data set were
and the associated exact two-sided 95% confidence intervals for β0 and β1 calculated separately were
Together, these two confidence intervals define the rectangle depicted in Figure 2.5. The point estimates for β0 and β1 are depicted by the point at the center of the rectangle. Does the rectangle in Figure 2.5, formed from the two individual confidence intervals, constitute an exact 95% confidence region for β0 and β1? It does not. The problems associated with this rectangular-shaped confidence region are outlined in the next two paragraphs.
Long Description for Figure 2.5
The horizontal axis is labeled [latex]\beta_0[/latex] and ranges from −85.44 to −76.69 in increments of 4.38. The vertical axis is labeled [latex]\beta_1[/latex] and ranges from 0.5014 to 0.5444 in increments of 0.0215. The point estimate [latex]\big( \hat \beta_0, \, \hat \beta_1 \big)[/latex] is plotted at (−81.06, 0.5229). A dotted rectangle extends from −85.44 to −76.69 on the horizontal axis and from 0.5014 to 0.5444 on the vertical axis.
If the two confidence intervals had been constructed independently, the actual coverage associated with the confidence region would be [latex](0.95)(0.95) = 0.9025[/latex], making it a 90.25% confidence region. If the confidence intervals were constructed independently, then we could simply adjust the coverages of the individual confidence intervals for β0 and β1 to [latex]\sqrt{0.95} \cong 0.9747[/latex] in order to get an exact 95% confidence region for β0 and β1. But the two confidence intervals are constructed from the same data set, and, as seen from the off-diagonal elements of the variance–covariance matrix in Theorem 1.4, the covariance between [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] is zero only when [latex]\bar X = 0[/latex], which is seldom the case in practice. So while the rectangular region in Figure 2.5 is a confidence region, its actual coverage cannot be easily determined. Some help is provided by the Bonferroni inequality, which states that the actual coverage of the rectangular region is at least [latex]1 - 2 \alpha[/latex], which in this setting is [latex]1 - (2)(0.05) = 0.90[/latex]. The two confidence intervals simultaneously contain the true values of β0 and β1 with at least 90% confidence, but this is all that can be stated concerning the actual coverage of the rectangular-shaped confidence region.
Since we know that the point estimators for β0 and β1 are only independent in the rare case of [latex]\bar X = 0[/latex] from Theorem 1.4, perhaps a rectangular-shaped confidence region is not appropriate. This is certainly the impression that one gets from the Monte Carlo simulation experiment conducted in Example 1.4. The problem here is that the point estimators [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] are typically dependent random variables, which means that a non-rectangular confidence region is appropriate. In an advanced class in regression, you will prove the following result, which is used to determine an exact [latex]{100(1 - \alpha)}\%[/latex] confidence region for β0 and β1.
Let [latex]F_{2, \, n - 2, \, \alpha}[/latex] be the [latex]1 - \alpha[/latex] percentile of the F distribution with 2 and [latex]n - 2[/latex] degrees of freedom. Theorem 2.16 implies that
This inequality can be used to construct an exact [latex]{100(1 - \alpha)}\%[/latex] confidence region for β0 and β1.
The boundary of the confidence region in the [latex](\beta_0, \, \beta_1)[/latex] plane is an ellipse centered at [latex]\big( \hat \beta_0, \, \hat \beta_1 \big)[/latex]. The boundary is found by replacing the inequality in Theorem 2.17 with an equality. The tilt of the ellipse is a function of [latex]\hbox{Cov} \big( \hat \beta_0, \, \hat \beta_1 \big)[/latex], which is [latex]- \bar X \sigma ^ {\, 2} / S_{XX}[/latex] by Theorem 1.4. Notice that [latex]{S_{XX} > 0}[/latex] and [latex]\sigma ^ {\, 2} > 0[/latex] under the simple linear regression model assumptions given in Definition 1.1. If [latex]{\bar X > 0}[/latex], then the covariance between the parameter estimates is negative, which implies that the errors of the two parameter estimates relative to their true values tend to be in opposite directions. If [latex]\hat \beta_0 > \beta_0[/latex], for example, then it is more likely that [latex]\hat \beta_1 < \beta_1[/latex]. This is the more common situation in practice. Conversely, if [latex]\bar X < 0[/latex], then [latex]\hbox{Cov} \big( \hat \beta_0, \, \hat \beta_1 \big) > 0[/latex], which implies that the errors of the two parameter estimates relative to their true values tend to be in the same direction.
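A brief sketch with hypothetical data (chosen so that [latex]\bar X > 0[/latex]) confirms the sign of this covariance; the off-diagonal entry of the estimated variance–covariance matrix reported by vcov is negative and matches the formula with MSE substituted for σ2.

```r
# hypothetical data frame with mean(x) > 0
dat <- data.frame(x = c(1, 3, 4, 6, 8, 9, 11, 14),
                  y = c(2, 5, 4, 8, 9, 11, 13, 15))
fit <- lm(y ~ x, data = dat)
vcov(fit)                                        # off-diagonal entries are negative since mean(x) > 0
MSE <- sum(residuals(fit) ^ 2) / (nrow(dat) - 2)
Sxx <- sum((dat$x - mean(dat$x)) ^ 2)
-mean(dat$x) * MSE / Sxx                         # estimated Cov(beta0-hat, beta1-hat); matches vcov output
```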
The confidence region given in Theorem 2.17 can be plotted for the data pairs [latex]\left( X_1, \, Y_1 \right), \left( X_2, \, Y_2 \right), \ldots \ , \left( X_n, \, Y_n \right)[/latex] using numerical methods. Plotting the boundary of the confidence region in the [latex]( \beta_0, \, \beta_1)[/latex] plane can be performed using a two-dimensional search for points on the boundary. Alternatively, a ray can be extended from [latex]\big( \hat \beta_0, \, \hat \beta_1 \big)[/latex] at a particular angle, and a one-dimensional search can be conducted to find a point on the boundary. The details associated with plotting such a confidence region will be given in one of the examples in the next section.
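As a sketch of the plotting step, the ellipse package referenced in Exercise 2.16 provides a method for fitted lm objects that traces the boundary of the joint confidence region; the data frame below is hypothetical, and the resulting region should agree with the one defined by Theorem 2.17.

```r
# install.packages("ellipse")   # if the package is not already installed
library(ellipse)

# hypothetical data frame with columns x and y
dat <- data.frame(x = c(1, 3, 4, 6, 8, 9, 11, 14),
                  y = c(2, 5, 4, 8, 9, 11, 13, 15))
fit <- lm(y ~ x, data = dat)

# boundary of the 95% joint confidence region for (beta0, beta1)
region <- ellipse(fit, which = c(1, 2), level = 0.95)
plot(region, type = "l", xlab = expression(beta[0]), ylab = expression(beta[1]))
points(coef(fit)[1], coef(fit)[2], pch = 19)   # point estimates at the center of the ellipse
```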
2.4 The ANOVA Table
In most applications of simple linear regression, the slope of the regression line, β1, is the most critical of the three parameters in the model. The most common statistical test that is performed in a simple linear regression application is testing whether the population slope β1 is zero against the two-tailed alternative:
versus
This choice of H0 and H1 is designed to determine whether the independent variable X is a statistically significant predictor of the dependent variable Y. Rejecting H0 indicates that the independent variable is providing some predictive capability. Although this test can be conducted based on Theorem 2.5, a second test based on the F distribution is introduced in this section and its equivalency to the test based on the t distribution is established. Both tests are exact. In addition, the ANOVA table which was introduced in Section 1.8 will be expanded in this section to include an additional column.
Cochran’s theorem, named after the Scottish-born statistician William Cochran (1909–1980), concerns writing sums of squares of independent and identically distributed [latex]N \kern -0.02em \left( 0, \, \sigma ^ {\, 2} \right)[/latex] random variables as the sum of positive semi-definite quadratic forms of these random variables. Applying his theorem to the simple linear regression model with normal error terms yields the following result.
The second of the three results has already been seen in Theorem 2.2. The first and third results are necessary to derive the F test for the significance of the slope β1, which is given in the following theorem.
The ANOVA table which was first introduced in Section 1.8 can be expanded to include an additional column on the right as shown in Table 2.2. Some computer packages will add yet another column on the right-hand side of the ANOVA table which contains the p-value associated with the F test.
Table 2.2: Basic ANOVA table for simple linear regression.

Source     | SS  | df                   | MS  | F
Regression | SSR | 1                    | MSR | [latex]MSR / MSE[/latex]
Error      | SSE | [latex]n - 2[/latex] | MSE |
Total      | SST | [latex]n - 1[/latex] |     |
So the F test for the statistical significance of the slope parameter in the simple linear regression model with normal error terms begins by computing the test statistic [latex]F = MSR / MSE[/latex]. If [latex]F > F_{1, \, n - 2, \, \alpha}[/latex], then H0 is rejected; the rejection region lies entirely in the right-hand tail because slopes far from zero in either direction inflate MSR and therefore F. The ANOVA table will be illustrated in one of the examples in the next section.
To show that the F test developed here is equivalent to the two-tailed t test from Section 2.3.2,
because [latex]MSR = SSR = \hat \beta_1 ^ 2 S_{XX}[/latex] (which is an exercise from Chapter 1), where t is the test statistic for the hypothesis test based on Theorem 2.5. Since the square of a t random variable with [latex]n - 2[/latex] degrees of freedom has the F distribution with 1 and [latex]n - 2[/latex] degrees of freedom, the two tests are equivalent.
We do not pursue the F test any further because the test of significance for the slope of the regression line based on the F distribution is less flexible than that based on the t distribution from Section 2.3.2. The test based on the t distribution is superior because (a) it can adapt to one-tailed alternative hypotheses, and (b) it is capable of testing for slopes other than [latex]\beta_1^\star = 0[/latex]. The primary purpose of introducing the F test here is to append the additional column to the right of the ANOVA table and provide an insightful link between regression, which is presented here, and experimental design, which relies heavily on ANOVA tables.
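For completeness, the anova function in R produces Table 2.2 for a fitted model (the regression row is labeled by the name of the predictor, the error row is labeled Residuals, and no Total row is printed), and the reported F statistic is the square of the t statistic for the slope. The data frame below is hypothetical.

```r
# hypothetical data frame with columns x and y
dat <- data.frame(x = c(1, 3, 4, 6, 8, 9, 11, 14),
                  y = c(2, 5, 4, 8, 9, 11, 13, 15))
fit <- lm(y ~ x, data = dat)
anova(fit)                                               # SS, df, MS, F, and the p-value
t_stat <- summary(fit)$coefficients["x", "t value"]      # t statistic for H0: beta1 = 0
f_stat <- anova(fit)["x", "F value"]                     # F statistic from the ANOVA table
c(t_stat ^ 2, f_stat)                                    # the two values agree: F = t^2
```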
2.5 Examples
This section contains four examples that illustrate the implementation of the simple linear regression modeling techniques that have been developed so far.
The second example illustrates the assessment of the simple linear regression model, point estimation, and interval estimation for a large data set.
The previous example might leave you wondering whether taller (and shorter) women marrying taller (and shorter) men, and having taller (and shorter) children, might eventually result in a planet filled with people of more extreme heights. As first noticed by Sir Francis Galton in 1886 and usually known as “regression to the mean,” this will probably not be the case. Consider the right-hand tail of the height distribution. A taller-than-average woman will indeed typically date and marry a taller-than-average man, but the husband’s height, on average, will not fall as far out in the right-hand tail of the men’s height distribution as the wife’s height falls in the women’s height distribution. Some mathematics associated with the simple linear regression model backs this up. Recall from Definition 1.3 that the coefficient of correlation is
where the sign associated with r is the same as the sign of [latex]\hat \beta_1[/latex]. Theorem 1.9 gave the alternate formula
This can be rewritten as [latex]\hat \beta_1 S_X = r S_Y[/latex], where SX is the sample standard deviation of [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] and SY is the sample standard deviation of [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex]. The left-hand side of this equation represents the expected increase (or decrease) in the dependent variable for a one standard deviation increase in the independent variable. But since [latex]|r| < 1[/latex] in nearly all applications (the only exception is when all data pairs fall on a line), a one standard deviation increase in X results in a change of less than one standard deviation in Y. In the previous example, where [latex]r = 0.763[/latex] was the correlation coefficient between the heights of the wives and their husbands, a one standard deviation increase in the height of a wife corresponds to an increase of just [latex]0.763 S_Y[/latex] in the expected height of her husband. Women do tend to marry taller men on average, but the heights of their husbands, on average, fall at a lower percentile of the men’s height distribution than the wives’ heights do in the women’s height distribution.
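A one-line numerical check of the identity [latex]\hat \beta_1 S_X = r S_Y[/latex] is given below using simulated, hypothetical heights; the identity holds exactly for any data set.

```r
set.seed(1)                                 # hypothetical simulated heights (cm)
x  <- rnorm(50, mean = 160, sd = 6)         # wives' heights
y  <- 100 + 0.4 * x + rnorm(50, sd = 4)     # husbands' heights
b1 <- coef(lm(y ~ x))[2]
c(b1 * sd(x), cor(x, y) * sd(y))            # the two quantities are identical
```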
The next example considers an automotive application of regression which uses speed as an independent variable and stopping distance as a dependent variable.
The fourth and final example concerns a large data set of home sale prices and associated predictors. Real estate platforms, such as Zillow and Trulia, are able to assess home values using a variety of predictors, and one key predictor is illustrated in the final example.
2.6 Exercises
2.1 True or false: An alternative way to express the simple linear regression model with normal error terms is
or
for [latex]i = 1, \, 2, \, \ldots, \, n[/latex].
2.2 Consider a simple linear regression model with normal error terms and population parameters [latex]\beta_0 = 5[/latex], [latex]\beta_1 = 2[/latex], and [latex]\sigma = 2[/latex]. The independent variable assumes the values [latex]x = 1, \, 2, \, \ldots , \, 10[/latex], and [latex]n = 10[/latex] data pairs are collected, one for each potential value of the independent variable.
Conduct a Monte Carlo simulation experiment which determines the shape of the marginal distribution of Y.
How do you think the marginal distribution of Y will change as [latex]\sigma \rightarrow 0[/latex]?
How do you think the marginal distribution of Y will change as [latex]\sigma \rightarrow \infty[/latex]?
2.3 Show that
2.4 For a simple linear regression model with normal error terms and known value of σ2, give an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for β1.
2.5 Fit the data pairs from the first of the Anscombe’s quartet from Example 2.6 to the simple linear regression model with normal error terms and give point and exact two-sided 95% confidence intervals for the parameters β0, β1, and σ2.
2.6 For what value of the independent variable is the confidence interval for the expected value of the associated dependent variable the narrowest?
2.7 For a simple linear regression model with normal error terms, known value of σ2, and a fixed value Xh in the scope of the model, give an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for [latex]E[ Y_h ] = \beta_0 + \beta_1 X_h[/latex].
2.8 Conduct a Monte Carlo simulation that yields compelling numerical evidence that the confidence interval for [latex]E[Y_h][/latex] from Theorem 2.12 is an exact confidence interval for the following parameter settings: [latex]n = 10[/latex], [latex]\beta_0 = 1[/latex], [latex]\beta_1 = 1 / 2[/latex], [latex]\sigma ^ {\, 2} = 1[/latex], [latex]X_h = 3[/latex], [latex]\alpha = 0.05[/latex], and [latex]X_i = i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, 10[/latex].
2.10 True or false: The width of a 95% confidence interval for [latex]E[Y_h][/latex] shrinks to zero in the limit as [latex]n \rightarrow \infty[/latex].
2.11 True or false: The width of a 95% prediction interval for [latex]Y_h^\star[/latex] shrinks to zero in the limit as [latex]n \rightarrow \infty[/latex].
2.12 The R data frame named longley contains seven macroeconomic variables from the United States collected from 1947 to 1962. Use the number of people employed to predict the gross national product (GNP) measured in constant 1954 dollars. Assuming that the simple linear regression model with normal error terms is appropriate,
make a scatterplot of the [latex]n = 16[/latex] data pairs and superimpose the regression line,
make a plot of the standardized residuals,
make a QQ plot of the residuals,
conduct the Shapiro–Wilk test for normality of the residuals,
give a point estimate and an exact 95% confidence interval for the slope β1,
give a point estimate and an exact 95% confidence interval for the intercept β0,
give a point estimate and an exact 95% confidence interval for the mean value of the GNP, [latex]E \left[ Y_h \right][/latex], when [latex]X_h = 65[/latex] million people are employed, and
give a point estimate and an exact 95% prediction interval for the GNP, [latex]Y_h^\star[/latex], associated with a new data pair when [latex]X_h = 65[/latex] million people are employed.
2.13 Under the simple linear regression model with normal error terms and parameters estimated from the data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex], the exact two-sided [latex]{100(1 - \alpha)}\%[/latex] prediction interval for [latex]Y_h^*[/latex] given in Theorem 2.15 is appropriate for a single new observation associated with a fixed value of the independent variable Xh. What if there are m new observations? In this case, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] prediction interval for the mean response [latex]Y_h^*[/latex] is
for [latex]Y_h^\star[/latex], where
and Xh is a fixed value of the independent variable within the scope of the simple linear regression model. Find a 95% prediction interval for the heights data pairs (using the wife’s height as the independent variable) from the PBImisc package from Example 2.7 with [latex]m = 4[/latex] and [latex]X_h = 150[/latex].
2.14 For the prediction interval for the population mean of the average of m new observations at a single setting of the independent variable Xh given in the previous question, what does the prediction interval collapse to in the limit as [latex]m \rightarrow \infty[/latex]?
2.15 Consider the built-in data frame in R named trees, which contains data pairs of diameters (which will be the independent variable and is erroneously labeled Girth in the data frame) measured at 4 feet 6 inches above the ground and associated volumes (which will be the dependent variable) for [latex]n = 31[/latex] felled black cherry trees. Assuming that the simple linear regression model with normal error terms is appropriate, perform the following statistical inference procedures.
Plot the data pairs and the associated regression line.
Find a point estimate and an exact 95% confidence interval for β1. Interpret the point estimate and the confidence interval.
Find a point estimate and an exact 95% confidence interval for the mean volume, [latex]E \left[ Y_h \right][/latex], when the diameter is [latex]X_h = 20[/latex] inches.
Find a point estimate and an exact 95% prediction interval for the volume, [latex]Y_h^\star[/latex], associated with a new data pair with a diameter of [latex]X_h = 20[/latex] inches.
Graph all values of the exact 95% confidence interval bounds for the expected volume for all diameters in the scope of the simple linear regression model. Also, graph all values of the exact 95% prediction interval bounds for the volume for a 32nd tree for all diameters in the scope of the simple linear regression model.
2.16 Plot a 95% confidence region for the data pairs in the cars data set under a simple linear regression model with normal error terms
using numerical methods, and
using the ellipse function from the ellipse package.
Include the maximum likelihood estimates [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] and the 95% confidence intervals for β0 and β1 on your plot.
2.17 Conduct a Monte Carlo simulation that provides convincing numerical evidence that the confidence region given in Theorem 2.17 is an exact confidence region for the following parameter settings: [latex]n = 10[/latex], [latex]\beta_0 = 1[/latex], [latex]\beta_1 = 1 / 2[/latex], [latex]\sigma ^ {\, 2} = 1[/latex], [latex]\alpha = 0.05[/latex], and [latex]X_i = i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, 10[/latex].
2.18 Consider the simple linear regression model with normal error terms applied to the first set of [latex]n = 11[/latex] data pairs from Anscombe’s quartet from Example 2.6. Show that the p-values are identical for testing
versus
using
the F test based on the test statistic which is the ratio of MSR to MSE, and
the t test based on the test statistic [latex]\hat \beta_1 / \sqrt{MSE / S_{XX}}[/latex].
2.19 Figures 1.24, 1.25, and 1.26 depict three examples of extreme cases for SSE, SSR, and SST for [latex]n = 7[/latex] data pairs. Assuming the simple linear regression model with normal error terms is an appropriate model,
plot and label the potential points associated with the extreme cases when SSR is plotted on the horizontal axis and SSE is plotted on the vertical axis, and
on this same graph, shade the area associated with rejecting H0 at level of significance at [latex]\alpha = 0.05[/latex] for the statistical test
versus
2.20 Plot a power function for the F test for testing
versus
for [latex]n = 10[/latex], [latex]\beta_0 = 1[/latex], [latex]\sigma ^ {\, 2} = 1[/latex], [latex]\alpha = 0.05[/latex], and [latex]X_i = i[/latex], for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. You may use Monte Carlo simulation or the noncentral F distribution to generate the power function. Allow [latex]\beta_1[/latex] to vary from [latex]-1[/latex] to 1 in the plot.
2.21 Make plots of the standardized residuals for the four data sets from Anscombe’s quartet given in Example 2.6.
2.22 The confidence interval for [latex]E[Y_h][/latex] given in Theorem 2.12 is meaningful for a fixed value of the independent variable Xh. What if a confidence band that contains the entire regression line with a prescribed probability is desired? The Working–Hotelling [latex]{100(1 - \alpha)}\%[/latex] confidence band for the regression line at any level Xh is given by
under the simple linear regression model with normal error terms, where
Plot a 95% confidence band for the heights data pairs from Example 2.7.