Preface
This book provides a brief introduction to three statistical modeling techniques: regression, survival analysis, and time series analysis. My motivation for writing this book came from a recent article in Nature that indicated that the paper introducing the product–limit estimator by American statisticians Edward Kaplan and Paul Meier in 1958 and the paper introducing the proportional hazards model written by British statistician David Cox in 1972 were the two most cited papers in the statistical literature. Yet most undergraduates majoring in applied mathematics, statistics, data science, systems engineering, and management science do not encounter the statistical models developed in either of these two pivotal papers. This book provides an elementary introduction to these two statistical procedures, and many others.
This book is designed as a one-semester introduction to regression, survival analysis, and time series analysis for advanced undergraduates or first-year graduate students. The prerequisites for this book are (a) a course in linear algebra, (b) a calculus-based introduction to probability, and (c) a course in mathematical statistics that covers point estimation, interval estimation, and hypothesis testing. The book is not comprehensive and is not a replacement for a full-semester class on each of the topics. It contains only brief introductions to the three topics.
Three chapters are devoted to each of the three topics. The initial two chapters move at about the pace one would expect in a full-semester course. The third chapter on each of the topics resembles a “further reading” section that briefly introduces some topics that would be covered in depth in a full-semester course. An instructor might choose to skip or expand on these topics.
The material in the book can be covered at the ambitious pace of one chapter per week. An instructor could also choose to move more slowly if some of this material is part of a course covering another topic.
Most of the data sets used for examples in the book are available as plain text on the website www.math.wm.edu/~leemis/data/topics.
The text is organized into chapters, sections, and subsections. When there are several topics within a subsection, they are set off by boldface headings. Definitions and theorems are boxed; examples are indented; proofs are terminated with a box, like this: [latex]\Box[/latex]. Proofs are included when they are instructive to the material being presented. Exercises are numbered sequentially at the end of each chapter. Computer code is set in monospace font, and is not punctuated. Indentation is used to indicate nesting in code and pseudocode. An index is included. Italicized page numbers in the index correspond to the primary source of information on a topic.
The term estimator is used to describe a point estimator in the abstract or as a random variable; the term estimate is used to describe the specific value that a point estimator assumes when it is calculated from a realization of data values. In some instances the letter case is altered to highlight this distinction. The sample mean [latex]\bar{X}[/latex], for example, is a point estimator for the population mean μ. A numerical value of the sample mean calculated from data values is sometimes denoted by the point estimate [latex]\bar{x}[/latex].
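The distinction between an estimator and an estimate can be illustrated with a short Monte Carlo sketch in R. The population mean, standard deviation, sample size, seed, and number of replications below are all hypothetical values chosen only for illustration; none of them come from an example in the book.

```r
# A minimal sketch of estimator versus estimate (all numeric settings are assumptions)
set.seed(1)      # arbitrary seed, used only so that the run is reproducible
mu    <- 10      # hypothetical population mean
sigma <- 2       # hypothetical population standard deviation
n     <- 25      # hypothetical sample size

# One realization of the data values gives one point estimate, xbar
x    <- rnorm(n, mean = mu, sd = sigma)
xbar <- mean(x)
print(xbar)

# Repeating the sampling experiment many times shows the estimator Xbar
# behaving as a random variable with its own probability distribution
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
print(mean(xbars))   # should be close to mu
print(sd(xbars))     # should be close to sigma / sqrt(n)
```

The single value xbar plays the role of the point estimate, while the vector xbars approximates the sampling distribution of the point estimator.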
The R language is used throughout the text for graphics, computation, and Monte Carlo simulation. In many of the examples involving computations, the results are computed arithmetically, then confirmed in R, and then computed a third time using an R built-in function (such as lm for computing the coefficients in a regression model, coxph from the survival package for computing the regression coefficients in a Cox proportional hazards model, survfit from the survival package to calculate the step heights in the Kaplan–Meier product–limit estimator, or arima to fit a univariate time series). This three-step process is used to avoid treating R functions as black boxes whose inner workings go unexamined. R can be downloaded for free at r-project.org.
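As a rough sketch of what this three-step process might look like for a simple linear regression, the R code below computes the least squares estimates from the closed-form formulas, confirms them with the normal equations in matrix form, and then computes them a third time with lm. The five data pairs are made up for illustration and are not one of the book's data sets; the variable names are likewise assumptions.

```r
# Hypothetical data pairs, invented only for this illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Step 1: closed-form least squares estimates of the intercept and slope
xbar <- mean(x)
ybar <- mean(y)
slope_hand     <- sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)
intercept_hand <- ybar - slope_hand * xbar

# Step 2: confirm by solving the normal equations (X'X) beta = X'y
X <- cbind(1, x)                                # design matrix with intercept column
beta_matrix <- solve(t(X) %*% X, t(X) %*% y)

# Step 3: compute the coefficients a third time with the built-in lm function
fit     <- lm(y ~ x)
beta_lm <- coef(fit)

print(c(intercept_hand, slope_hand))
print(as.vector(beta_matrix))
print(unname(beta_lm))
```

All three printed vectors should agree up to floating-point error, which is the point of the exercise: the built-in function is doing nothing more mysterious than the arithmetic in the first step.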
For readability, no references are cited in the text. The sources of materials in the various chapters are cited in the paragraphs below.
Chapter 1 notes: The quote by George Box is from page 202 of the book chapter: Box, G.E.P. (1979), “Robustness in the Strategy of Scientific Model Building,” from Robustness in Statistics, edited by R.L. Launer and G.N. Wilkinson, New York: Academic Press, pages 201–236. The data pairs associated with the boiling points and barometric pressures in Example 1.11 are from Forbes, J. (1857), “Further Experiments and Remarks on the Measurement of Heights and Boiling Point of Water,” Transactions of the Royal Society of Edinburgh, Volume 21, Issue 2, pages 235–243.
Chapter 2 notes: The four sets of data pairs known as Anscombe’s quartet are from Anscombe, F.J. (1973), “Graphs in Statistical Analysis,” The American Statistician, Volume 27, Number 1, pages 17–21. The housing data set in Example 2.9 is from De Cock, D. (2011), “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project,” Journal of Statistics Education, Volume 19, Number 3, pages 1–15. The Shapiro–Wilk test for normality and several related tests are surveyed in Razali, N., and Wah, Y.B. (2011), “Power Comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling Tests,” Journal of Statistical Modeling and Analytics, Volume 2, Number 1, pages 21–33.
Chapter 3 notes: The chemical data from Example 3.1 are from Bennett, N.A., and Franklin, N.L. (1954), Statistical Analysis in Chemistry and the Chemical Industry, New York: Wiley. Cook’s distances are derived in Cook, R.D. (1977), “Detection of Influential Observations in Linear Regression,” Technometrics, Volume 19, Number 1, pages 15–18. The U.S. national debt over time is from https://www.thebalance.com/national-debt-by-year-compared-to-gdp-and-major-events-3306287. The original paper introducing ridge regression is Hoerl, A.E., and Kennard, R.W. (1970), “Ridge Regression: Biased Estimation for Nonorthogonal Problems,” Technometrics, Volume 12, Number 1, pages 55–67.
Chapter 4 notes: Early references on the Weibull distribution include Fisher, R.A., and Tippett, L.H.C. (1928), “Limiting Forms of the Frequency Distribution of the Largest or Smallest Member of a Sample,” Proceedings of the Cambridge Philosophical Society, Volume 24, Issue 2, pages 180–190, Weibull, W. (1939), “A Statistical Theory of the Strength of Materials,” Ingeniors Vetenskaps Akademien Handlingar, Number 153, and Weibull, W. (1951), “A Statistical Distribution Function of Wide Applicability,” Journal of Applied Mechanics, Volume 18, pages 293–297. The moment ratio diagrams given in Section 4.5 are adapted from those given in Vargo, E., Pasupathy, R., and Leemis, L. (2010), “Moment-Ratio Diagrams for Univariate Distributions,” Journal of Quality Technology, Volume 42, Number 3, pages 1–11. The Cox proportional hazards model was formulated in Cox, D.R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society B, Volume 34, Number 2, pages 187–220.
Chapter 5 notes: The ball bearing failure times from Example 5.5 are from Lieblein, J., and Zelen, M. (1956), “Statistical Investigation of the Fatigue Life of Deep-Groove Ball Bearings,” Journal of Research of the National Bureau of Standards, Volume 57, Number 5, pages 273–316. The 48.48 data value in the ball bearing data set is given as 48.40 on page 99 of Lawless, J.F. (2003), Statistical Models and Methods for Lifetime Data, Second Edition, Hoboken, NJ: John Wiley & Sons, Inc., and page 4 of Meeker, W.Q., and Escobar, L.A. (2022), Statistical Methods for Reliability Data, Second Edition, New York: John Wiley & Sons, Inc., but is listed as 48.48 in Caroni, C. (2002), “The Correct ‘Ball Bearings’ Data,” Lifetime Data Analysis, Volume 8, Number 4, pages 395–399. The 6–MP data set is from Gehan, E.A. (1965), “A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-Censored Samples,” Biometrika, Volume 52, Parts 1 and 2, pages 203–223. The automotive a/c switch failure times are from pages 253–254 of Kapur, K.C., and Lamberson, L.R. (1977), Reliability in Engineering Design, New York: John Wiley & Sons, Inc. The initial estimator for the Weibull shape parameter κ from a complete data set is given by Menon, M.V. (1963), “Estimation of the Shape and Scale Parameters of the Weibull Distribution,” Technometrics, Volume 5, Number 2, pages 175–182.
Chapter 6 notes: The Clopper–Pearson confidence interval was introduced by Clopper, C.J., and Pearson, E.S. (1934), “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial,” Biometrika, Volume 26, Number 4, pages 404–413. The Wilson score confidence interval was introduced by Wilson, E.B. (1927), “Probable Inference, the Law of Succession, and Statistical Inference,” Journal of the American Statistical Association, Volume 22, Number 158, pages 209–212. The Jeffreys confidence interval is described by Brown, L.D., Cai, T.T., and DasGupta, A. (2001), “Interval Estimation for a Binomial Proportion,” Statistical Science, Volume 16, Number 2, pages 101–133. The Agresti–Coull confidence interval was introduced by Agresti, A., and Coull, B.A. (1998), “Approximate is Better than ‘Exact’ for Interval Estimation of Binomial Proportions,” The American Statistician, Volume 52, Number 2, pages 119–126. The product–limit estimator was devised by Kaplan, E.L., and Meier, P. (1958), “Nonparametric Estimation from Incomplete Observations,” Journal of the American Statistical Association, Volume 53, Number 282, pages 457–481. The earliest reference to Greenwood’s formula comes from Greenwood, M. (1926), “The Natural Duration of Cancer,” Reports on Public Health and Medical Subjects, Her Majesty’s Stationery Office, London, Volume 33, pages 1–26. The proof of Theorem 6.1 is given in Appendix C of Leemis, L.M. (2009), Reliability: Probability Models and Statistical Methods, Second Edition, Lightning Source. A lucid presentation of Poisson processes and nonhomogeneous Poisson processes is given by Ross, S.M. (2019), Introduction to Probability Models, Twelfth Edition, London: Academic Press.
Chapter 7 notes: The monthly international airline traveler counts from Example 7.2 are Series G from Box, G.E.P., and Jenkins, G.M. (1976), Time Series Analysis: Forecasting and Control, Oakland, CA: Holden–Day. Introductions to the basics of time series analysis are given in Chatfield, C. (2004), The Analysis of Time Series: An Introduction, Sixth Edition, Boca Raton, FL: Chapman & Hall/CRC, and Brockwell, P.J., and Davis, R.A. (2016), Introduction to Time Series and Forecasting, Third Edition, Springer International Publishing Switzerland.
Chapter 8 notes: The details associated with the Box–Pierce test are given in Box, G.E.P., and Pierce, D.A. (1970), “Distribution of Residual Auto-Correlations in Autoregressive-Integrated Moving Average Time-Series Models,” Journal of the American Statistical Association, Volume 65, Number 332, pages 1509–1526. The details associated with the Ljung–Box test are given in Ljung, G.M., and Box, G.E.P. (1978), “On a Measure of Lack of Fit in Time Series Models,” Biometrika, Volume 65, Number 2, pages 297–303. The turning point test was first devised by Bienaymé, I.–J. (1874), “Sur Une Question de Probabilités,” Bulletin de la Société Mathématique de France, Volume 2, pages 153–154. The Akaike Information Criterion was formulated in Akaike, H. (1974), “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, Volume 19, Number 6, pages 716–723. The corrected Akaike Information Criterion was formulated in Hurvich, C.M., and Tsai, C–L. (1989), “Regression and Time Series Model Selection in Small Samples,” Biometrika, Volume 76, Number 2, pages 297–307.
Chapter 9 notes: The graph displaying the stationarity region in terms of [latex]\Phi_1[/latex] and [latex]\Phi_2[/latex] shown in Figure 9.13 is adapted from a figure in Stralkowski, C.M. (1968), “Lower Order Autoregressive-Moving Average Stochastic Models and Their Use for the Characterization of Abrasive Cutting Tools,” PhD Thesis, The University of Wisconsin. Rom Lipscus from the Virginia Institute of Marine Science helped me find the source of the Lake Huron lake level data given in Example 9.14. The data values are IGLD55, which means that they are lake levels (in feet) measured above the level of the Atlantic Ocean. Lauren M. Fry from the NOAA Great Lakes Environmental Research Laboratory provided the source of the Lake Huron levels on pages 151–154 of https://tidesandcurrents.noaa.gov/publications/Monthly_Annual_Averages_IGLD55_1860thru1985.PDF
The time series of 210 consecutive chemical production yields from Example 9.35 are from pages 120–121 of Box, G.E.P., Hunter, J.S., and Hunter, W.G. (2005), Statistics for Experimenters: Design, Innovation, Discovery, Second Edition, Hoboken, NJ: John Wiley & Sons. The shortened version of the lynx pelt sales from the Hudson’s Bay Company that was fit to the transformed AR(3) model in Example 9.29 was suggested by Wei, W.W.S. (2006), Time Series Analysis: Univariate and Multivariate Methods, Second Edition, Boston: Pearson/Addison–Wesley. An early comprehensive treatment of ARIMA modeling is given in Box, G.E.P., and Jenkins, G.M. (1976), Time Series Analysis: Forecasting and Control, Oakland, CA: Holden–Day.
There are dozens of people to thank for making this book possible. Carrie Cooper, Lisa Nickel, Tami Back, Rosie Liljenquist, and all of the librarians and generous donors at the Swem Library at William & Mary made this book possible through their Library Scholar position. Thanks also go to the William & Mary statisticians Ed Chadraa, Flip deCamp, Greg Hunt, Ross Iaci, Rui Pereira, Heather Sasinowska, and Guannan Wang, who have helped brainstorm about the topics and their sequencing in the book. I am grateful to Olivia Ding, Kexin Feng, Robert Jackson, Yuxin Qin, Chris Weld, and Hailey Young for taking the time to read all or portions of an early draft of the text and for providing helpful feedback. Barry Lawson from Bates College helped with the inset and lines in Figure 7.25. Special thanks go to Kay Helm, who converted the text from its original LaTeX format to the electronic version you are now reading. Five special people have made extraordinary contributions to this book: Heather Sasinowska edited the regression and time series chapters, Robert Lewis edited most of the textbook, Raghu Pasupathy provided keen insight concerning the presentation of the time series material and the moment ratio diagrams, Rosie Liljenquist edited the first two chapters just before having her sixth baby, and my wife Jill helped me push the book over the finish line. Finally, thanks go to Drea George for the handsome book cover.
Since this is an open educational resource, this book is a work in progress. Please e-mail me any typographical errors that you spot or alterations that you would like to suggest. Thank you.
Williamsburg, VA Larry Leemis
January 2023