Reading 10

Quantitative Methods · Simple Linear Regression

MODULE 10.1: LINEAR REGRESSION BASICS

LOS 10.a

Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of these coefficients.

The purpose of simple linear regression is to explain the variation in a dependent variable in terms of the variation in a single independent variable. Here, the term variation is interpreted as the degree to which a variable differs from its mean value. Don't confuse variation with variance—they are related, but they are not the same.

\[\text{variation in } Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2\]
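To make the distinction between variation and variance concrete, the following minimal sketch computes both for a small set of made-up observations (the data are hypothetical, not from the reading). Variation is the raw sum of squared deviations from the mean; sample variance divides that sum by \(n - 1\).

```python
import statistics

# Hypothetical observations of Y (illustrative values only)
y = [2.0, 4.0, 6.0, 8.0]

y_bar = sum(y) / len(y)

# Variation: sum of squared deviations from the mean
variation = sum((yi - y_bar) ** 2 for yi in y)

# Sample variance: variation scaled by (n - 1)
variance = variation / (len(y) - 1)

print(variation)                        # 20.0
print(variance, statistics.variance(y))  # both are 20/3
```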

The dependent variable is the variable whose variation is explained by the independent variable. We are interested in answering the question, "What explains fluctuations in the dependent variable?" The dependent variable is also referred to as the explained variable, endogenous variable, or predicted variable.
The independent variable is the variable used to explain the variation of the dependent variable. The independent variable is also referred to as the explanatory variable, exogenous variable, or predicting variable.

Example: Dependent vs. independent variables

Identify the dependent and independent variable

Suppose you want to predict stock returns with GDP growth. Which variable is the independent variable?

Answer:

Because GDP growth is used as a predictor of stock returns, stock returns are being explained by GDP growth. Hence, stock returns are the dependent (explained) variable, and GDP growth is the independent (explanatory) variable.

Suppose we want to use excess returns on the S&P 500 (the independent variable) to explain the variation in excess returns on ABC common stock (the dependent variable). For this model, we define excess return as the difference between the actual return and the return on 1-month Treasury bills.

We would start by creating a scatter plot with ABC excess returns on the vertical axis and S&P 500 excess returns on the horizontal axis. Monthly excess returns for both variables from June 20X2 to May 20X5 are plotted in Figure 10.1. For example, look at the point labeled May 20X4. In that month, the excess return on the S&P 500 was −7.8%, and the excess return on ABC was 1.1%.

The two variables in Figure 10.1 appear to be positively correlated: excess ABC returns tended to be positive (negative) in the same month that S&P 500 excess returns were positive (negative). This is not the case for all the observations, however (for example, May 20X4). In fact, the correlation between these variables is approximately 0.40.

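Summary

The purpose of simple linear regression is to explain the variation in a dependent variable using the variation in a single independent variable. Here, variation means the degree to which a variable differs from its mean. Do not confuse it with variance; the two are related but not the same.

Variation in \(Y\) \(= \sum_{i=1}^{n}(Y_i - \bar{Y})^2\)

Dependent variable: the variable whose variation is explained by the independent variable; also called the explained variable, endogenous variable, or predicted variable.
Independent variable: the variable used to explain the variation in the dependent variable; also called the explanatory variable, exogenous variable, or predicting variable.

Example: If GDP growth is used to predict stock returns, GDP growth is the independent (explanatory) variable and stock returns are the dependent (explained) variable.

Suppose S&P 500 excess returns (the independent variable) are used to explain the variation in ABC stock excess returns (the dependent variable), where excess return = actual return − return on 1-month Treasury bills. In the scatter plot, ABC and S&P 500 excess returns are positively correlated (correlation of about 0.40), although not in every month (for example, May 20X4).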

Simple Linear Regression Model

The following linear regression model is used to describe the relationship between two variables, \(X\) and \(Y\):

\[Y_i = b_0 + b_1 X_i + \varepsilon_i, \quad i = 1, \dots, n\]

where:

\(Y_i\) = \(i\)th observation of the dependent variable, \(Y\)
\(X_i\) = \(i\)th observation of the independent variable, \(X\)
\(b_0\) = regression intercept term
\(b_1\) = regression slope coefficient
\(\varepsilon_i\) = residual for the \(i\)th observation (also referred to as the disturbance term or error term)

Based on this regression model, the regression process estimates an equation for a line through a scatter plot of the data that "best" explains the observed values for \(Y\) in terms of the observed values for \(X\).

The linear equation, often called the line of best fit or regression line, takes the following form:

\[\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i, \quad i = 1, 2, 3, \dots, n\]

where:

\(\hat{Y}_i\) = estimated value of \(Y_i\) given \(X_i\)
\(\hat{b}_0\) = estimated intercept term
\(\hat{b}_1\) = estimated slope coefficient

Professor's Note

The hat "^" above a variable or parameter indicates an estimated or predicted value.

The regression line is just one of the many possible lines that can be drawn through the scatter plot of \(X\) and \(Y\). The criterion used to estimate this line is the essence of linear regression. The regression line is the line that minimizes the sum of the squared differences (vertical distances) between the \(Y\)-values predicted by the regression equation (\(\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i\)) and the actual \(Y\)-values, \(Y_i\). The sum of the squared vertical distances between the estimated and actual \(Y\)-values is referred to as the sum of squared errors (SSE).

Thus, the regression line is the line that minimizes the SSE. This explains why simple linear regression is frequently referred to as ordinary least squares (OLS) regression, and the values determined by the estimated regression equation, \(\hat{Y}_i\), are called least squares estimates.

The estimated slope coefficient (\(\hat{b}_1\)) for the regression line describes the change in \(Y\) for a one-unit change in \(X\). It can be positive, negative, or zero, depending on the relationship between the regression variables. The slope term is calculated as follows:

\[\hat{b}_1 = \frac{\text{Cov}_{XY}}{\sigma_X^2}\]

The intercept term (\(\hat{b}_0\)) is the line's intersection with the \(Y\)-axis at \(X = 0\). It can be positive, negative, or zero. A property of the least squares method is that the intercept term may be expressed as follows:

\[\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}\]

where \(\bar{Y}\) = mean of \(Y\), and \(\bar{X}\) = mean of \(X\).

The intercept equation highlights the fact that the regression line passes through a point with coordinates equal to the means of the independent and dependent variables (i.e., the point \((\bar{X}, \bar{Y})\)).
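The slope and intercept formulas can be checked numerically. The sketch below uses a small hypothetical data set (not the ABC data) to compute both estimates, confirm that the fitted line passes through the point of means, and show that nudging either coefficient away from its least squares value raises the SSE.

```python
# Hypothetical paired observations (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance of X and Y, and sample variance of X
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

b1_hat = cov_xy / var_x            # slope, about 0.6
b0_hat = y_bar - b1_hat * x_bar    # intercept, about 2.2


def sse(b0, b1):
    """Sum of squared vertical distances between actual and fitted Y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


print(b1_hat, b0_hat)
print(sse(b0_hat, b1_hat))          # about 2.4, the minimum SSE
print(sse(b0_hat + 0.1, b1_hat))    # larger than the minimum
print(sse(b0_hat, b1_hat + 0.1))    # larger than the minimum
```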

Example: Computing the slope coefficient and intercept term

Calculate \(\hat{b}_1\) and \(\hat{b}_0\) for the ABC regression

Compute the slope coefficient and intercept term using the following information:

Cov(S&P 500, ABC) = 0.000336
Var(S&P 500) = 0.000522
Mean return, S&P 500 = −2.70%
Mean return, ABC = −4.05%

Answer:

The slope coefficient is calculated as \(\hat{b}_1 = 0.000336 / 0.000522 = 0.64\).

The intercept term is calculated as follows:

\[\hat{b}_0 = \overline{\text{ABC}} - \hat{b}_1 \overline{\text{S\&P 500}} = -4.05\% - 0.64(-2.70\%) = -2.3\%\]

The estimated regression line that minimizes the SSE in our ABC stock return example has an intercept of −2.3% and a slope of 0.64. The model predicts that if the S&P 500 excess return is −7.8% (May 20X4 value), then the ABC excess return would be \(-2.3\% + (0.64)(-7.8\%) = -7.3\%\). The residual (error) for the May 20X4 ABC prediction is 8.4%—the difference between the actual ABC excess return of 1.1% and the predicted return of −7.3%.
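The arithmetic in this example can be reproduced in a few lines. The sketch below rounds the slope and intercept at each step, as the text does; the variable names are illustrative.

```python
# Inputs from the ABC example (mean returns in percent)
cov_sp_abc = 0.000336
var_sp = 0.000522
mean_sp = -2.70
mean_abc = -4.05

# Slope and intercept, rounded as in the text
b1_hat = round(cov_sp_abc / var_sp, 2)           # 0.64
b0_hat = round(mean_abc - b1_hat * mean_sp, 1)   # -2.3

# Prediction and residual for May 20X4
sp_may = -7.8
abc_may_actual = 1.1
abc_may_pred = round(b0_hat + b1_hat * sp_may, 1)    # -7.3
residual = round(abc_may_actual - abc_may_pred, 1)   # 8.4

print(b1_hat, b0_hat, abc_may_pred, residual)
```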

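Summary

Simple linear regression model: \(Y_i = b_0 + b_1 X_i + \varepsilon_i\), where \(b_0\) is the intercept, \(b_1\) is the slope, and \(\varepsilon_i\) is the residual (error term).

Regression line (line of best fit): \(\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i\). The hat "^" denotes an estimated or predicted value.

The regression line is the line that minimizes the sum of squared errors (SSE), which is why simple linear regression is also called ordinary least squares (OLS) regression.

Slope coefficient: \(\hat{b}_1 = \text{Cov}_{XY} / \sigma_X^2\), the expected change in \(Y\) for a one-unit change in \(X\).

Intercept: \(\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}\), the predicted value of the dependent variable when \(X=0\). The regression line always passes through the point \((\bar{X},\bar{Y})\).

Example calculation: slope \(= 0.000336/0.000522 = 0.64\); intercept \(= -4.05\% - 0.64 \times (-2.70\%) = -2.3\%\). When the S&P 500 excess return is −7.8%, the predicted ABC excess return is −7.3%; the actual value is +1.1%, so the residual = 1.1% − (−7.3%) = 8.4%.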

Interpreting a Regression Coefficient

The estimated intercept represents the value of the dependent variable at the point of intersection of the regression line and the axis of the dependent variable (usually, the vertical axis). In other words, the intercept is an estimate of the dependent variable when the independent variable is zero.

We also mentioned earlier that the estimated slope coefficient is interpreted as the expected change in the dependent variable for a one-unit change in the independent variable. For example, an estimated slope coefficient of 2 would indicate that the dependent variable is expected to change by two units for every one-unit change in the independent variable.

Example: Interpreting regression coefficients

Interpret \(\hat{b}_1 = 0.64\) and \(\hat{b}_0 = -2.3\%\)

In the previous example, the estimated slope coefficient was 0.64 and the estimated intercept term was −2.3%. Interpret each coefficient estimate.

Answer:

The slope coefficient of 0.64 can be interpreted to mean that when excess S&P 500 returns increase (decrease) by 1%, ABC excess returns are expected to increase (decrease) by 0.64%.

The intercept term of −2.3% can be interpreted to mean that when the excess return on the S&P 500 is zero, the expected excess return on ABC stock is −2.3%.

Professor's Note

The slope coefficient in a regression of the excess returns of an individual security (the \(y\)-variable) on the return on the market (the \(x\)-variable) is called the stock's beta, which is an estimate of the stock's systematic risk. Notice that ABC, with a beta of 0.64, has less systematic risk than the average stock, because its returns tend to increase or decrease by less than the overall change in the market returns. A stock with a beta (regression slope coefficient) of 1 has an average level of systematic risk, and a stock with a beta greater than 1 has more-than-average systematic risk. We will apply this concept in the Portfolio Management topic area.

Keep in mind, however, that any conclusions regarding the importance of an independent variable in explaining a dependent variable are based on the statistical significance of the slope coefficient. The magnitude of the slope coefficient tells us nothing about the strength of the linear relationship between the dependent and independent variables. A hypothesis test must be conducted, or a confidence interval must be formed, to assess the explanatory power of the independent variable. Later in this reading we will perform these hypothesis tests.

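Summary

Interpreting the intercept: the estimated value of the dependent variable when the independent variable is zero.

Interpreting the slope: the expected change in the dependent variable for a one-unit change in the independent variable. For example, with a slope of 2, a one-unit increase in the independent variable is expected to increase the dependent variable by two units.

Example interpretation: a slope of 0.64 means that when S&P 500 excess returns increase (decrease) by 1%, ABC excess returns are expected to increase (decrease) by 0.64%; an intercept of −2.3% means that when the S&P 500 excess return is zero, the expected ABC excess return is −2.3%.

Professor's note: the slope in a regression of a stock's excess returns on market returns is the stock's beta, a measure of systematic risk. ABC's beta of 0.64 is less than 1, indicating below-average systematic risk.

Important: the magnitude of the slope coefficient says nothing about the strength of the linear relationship; a hypothesis test or confidence interval is needed to assess the explanatory power of the independent variable.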

LOS 10.b

Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these assumptions may have been violated.

Linear regression is based on numerous assumptions. Most of the major assumptions pertain to the regression model's residual term (\(\varepsilon\)). Linear regression assumes the following:

A linear relationship exists between the dependent and the independent variables.
The variance of the residual term is constant for all observations (homoskedasticity).
The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation (or, the paired \(x\) and \(y\) observations are independent of each other).
The residual term is normally distributed.

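Summary

The four assumptions of linear regression (mostly concerning the residual term \(\varepsilon\)):

A linear relationship exists between the dependent and independent variables.
The residual term has the same variance for all observations (homoskedasticity).
The residual terms are independent of (uncorrelated with) one another.
The residual term is normally distributed.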

Linear Relationship

A linear regression model is not appropriate when the underlying relationship between \(X\) and \(Y\) is nonlinear. In Panel A of Figure 10.3, we illustrate a regression line fitted to a nonlinear relationship. Note that the prediction errors (vertical distances from the dots to the line) are positive for low values of \(X\), become increasingly negative for higher values of \(X\), and then turn positive again for still-greater values of \(X\). One way of checking for linearity is to examine the model residuals (prediction errors) in relation to the independent regression variable. In Panel B, we show the pattern of residuals over the range of the independent variable: positive, negative, then positive.
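The positive, negative, positive residual pattern is easy to reproduce. As a rough illustration, the sketch below fits a least squares line to a hypothetical quadratic relationship (\(Y = X^2\), chosen only for illustration and not the data behind Figure 10.3) and prints the sign of each residual in order of \(X\).

```python
# Hypothetical nonlinear relationship: Y = X^2 for X = 0, 1, ..., 10
x = [float(i) for i in range(11)]
y = [xi ** 2 for xi in x]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept
b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_hat = y_bar - b1_hat * x_bar

# Residual = actual Y minus fitted Y, ordered by X
residuals = [yi - (b0_hat + b1_hat * xi) for xi, yi in zip(x, y)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(signs)  # ++-------++
```

A systematic run of signs like this, rather than a random scatter around zero, is the kind of signal a residual plot is meant to reveal.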

Homoskedasticity

Homoskedasticity refers to the case where prediction errors all have the same variance. Heteroskedasticity refers to the situation when the assumption of homoskedasticity is violated. Figure 10.4, Panel A shows a scatter plot of observations around a fitted regression line where the residuals (prediction errors) increase in magnitude with larger values of the independent variable \(X\). Panel B shows the residuals plotted versus the value of the independent variable, and it also illustrates that the variance of the error terms is not likely constant for all observations.

Another type of heteroskedasticity results if the variance of the error term changes over time (rather than with the magnitude of the independent variable). We could observe this by plotting the residuals from a linear regression model versus the dates of each observation and finding that the magnitude of the errors exhibits a pattern of changing over time.

Independence

Suppose we collect a company's monthly sales and plot them against monthly GDP as in Figure 10.5, Panel A, and observe that some prediction errors (the unfilled dots) are noticeably larger than others. To investigate this, we plot the residuals versus time, as in Panel B. The residuals plot illustrates that there are large prediction errors every 12 months (in December). This suggests that there is seasonality in sales such that December sales (the unfilled dots in Figure 10.5) are noticeably farther from their predicted values than sales for the other months. If the relationship between \(X\) and \(Y\) is not independent, the residuals are not independent, and our estimates of the model parameters' variances will not be correct.

Normality

When the residuals (prediction errors) are normally distributed, we can conduct hypothesis testing for evaluating the goodness of fit of the model (discussed later). With a large sample size, based on the central limit theorem, our parameter estimates may be valid, even when the residuals are not normally distributed.

Outliers are observations (one or a few) that are far from our regression line (have large prediction errors or \(X\) values that are far from the others). Outliers will influence our parameter estimates so that the OLS model will not fit the other observations well.

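Summary

Linear relationship: if \(X\) and \(Y\) are nonlinearly related, forcing a linear regression onto the data produces residuals with a systematic pattern (for example, positive, then negative, then positive), which a residual plot can reveal.

Homoskedasticity: the variance of the residuals is constant across all observations. When this is violated (heteroskedasticity), the dispersion of the residuals changes with the size of the independent variable or over time.

Independence: the residuals should not be correlated with one another. If they exhibit seasonality or autocorrelation, the estimated variances of the model parameters will be incorrect.

Normality: normally distributed residuals are the basis for hypothesis testing. With a large sample, by the central limit theorem, the parameter estimates may still be valid even when the residuals are not normal. Outliers distort OLS estimates.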

Module Quiz 10.1

1. What is the most appropriate interpretation of a slope coefficient estimate equal to 10.0?

A. The predicted value of the dependent variable when the independent variable is 0 is 10.0.
B. For every 1-unit change in the independent variable, the model predicts that the dependent variable will change by 10 units.
C. For every 1-unit change in the independent variable, the model predicts that the dependent variable will change by 0.1 units.

B is correct. The slope coefficient is best interpreted as the predicted change in the dependent variable for a 1-unit change in the independent variable; if the slope coefficient estimate is 10.0 and the independent variable changes by 1 unit, the dependent variable is expected to change by 10 units. The intercept term is best interpreted as the value of the dependent variable when the independent variable is equal to zero. (LOS 10.a)

2. Which of the following is least likely a necessary assumption of simple linear regression analysis?

A. The residuals are normally distributed.
B. There is a constant variance of the error term.
C. The dependent variable is uncorrelated with the residuals.

C is correct. The model does not assume that the dependent variable is uncorrelated with the residuals. It does assume that the independent variable is uncorrelated with the residuals. (LOS 10.b)

MODULE 10.2: ANALYSIS OF VARIANCE (ANOVA) AND GOODNESS OF FIT

LOS 10.c

Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression.

LOS 10.d

Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error of estimate in a simple linear regression.

Analysis of variance (ANOVA) is a statistical procedure for analyzing the total variability of the dependent variable. Let's define some terms before we move on to ANOVA tables:

The total sum of squares (SST) measures the total variation in the dependent variable. SST is equal to the sum of the squared differences between the actual \(Y\)-values and the mean of \(Y\): \[\text{SST} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2\]
The sum of squares regression (SSR) measures the variation in the dependent variable that is explained by the independent variable. SSR is the sum of the squared distances between the predicted \(Y\)-values and the mean of \(Y\): \[\text{SSR} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\]
The mean square regression (MSR) is the SSR divided by the number of independent variables. A simple linear regression has only one independent variable, so in this case, MSR = SSR.

Professor's Note

Multiple regression (i.e., with more than one independent variable) is addressed in the Level II CFA curriculum.

The sum of squared errors (SSE) measures the unexplained variation in the dependent variable. It's also known as the sum of squared residuals or the residual sum of squares. SSE is the sum of the squared vertical distances between the actual \(Y\)-values and the predicted \(Y\)-values on the regression line: \[\text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\]
The mean squared error (MSE) is the SSE divided by the degrees of freedom, which is \(n - 1\) minus the number of independent variables. A simple linear regression has only one independent variable, so in this case, degrees of freedom are \(n - 2\).

You probably will not be surprised to learn the following:

\[\text{SST} = \text{SSR} + \text{SSE}\]
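The decomposition can be verified on any fitted regression. The following sketch uses a small hypothetical data set (illustrative values only) to compute SST, SSR, SSE, MSR, and MSE and to confirm that the explained and unexplained pieces add up to the total.

```python
# Hypothetical paired observations (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, k = len(x), 1  # k = number of independent variables

x_bar, y_bar = sum(x) / n, sum(y) / n
b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_hat = y_bar - b1_hat * x_bar
y_hat = [b0_hat + b1_hat * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

msr = ssr / k              # equals SSR when k = 1
mse = sse / (n - k - 1)    # degrees of freedom = n - 2

print(sst, ssr, sse)       # about 6.0, 3.6, and 2.4
print(msr, mse)            # about 3.6 and 0.8
```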

The output of the ANOVA procedure is an ANOVA table, which is a summary of the variation in the dependent variable. A generic ANOVA table for a simple linear regression (one independent variable) is presented in Figure 10.7.

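Summary

Analysis of variance (ANOVA) is a statistical procedure for analyzing the total variation in the dependent variable. Key terms:

SST (total sum of squares): total variation in the dependent variable \(= \sum(Y_i - \bar{Y})^2\).
SSR (sum of squares regression): variation in the dependent variable explained by the independent variable \(= \sum(\hat{Y}_i - \bar{Y})^2\).
MSR (mean square regression) = SSR / number of independent variables; in a simple regression (\(k=1\)), MSR = SSR.
SSE (sum of squared errors): unexplained variation in the dependent variable \(= \sum(Y_i - \hat{Y}_i)^2\).
MSE (mean squared error) = SSE / (\(n-2\)), with \(n-2\) degrees of freedom.

Key identity: SST = SSR + SSE (total variation = explained variation + unexplained variation).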

Figure 10.7: ANOVA Table for a Simple Linear Regression

Source of Variation	Degrees of Freedom	Sum of Squares	Mean Sum of Squares
Regression (explained)	1	SSR	MSR = SSR / k = SSR / 1 = SSR
Error (unexplained)	n − 2	SSE	MSE = SSE / (n − 2)
Total	n − 1	SST

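Summary

ANOVA table (simple linear regression): the regression row has 1 degree of freedom, the error row has \(n-2\), and the total row has \(n-1\). MSR = SSR (because \(k=1\)), and MSE = SSE/(\(n-2\)).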

Standard Error of Estimate (SEE)

The SEE for a regression is the standard deviation of its residuals. The lower the SEE, the better the model fit:

\[\text{SEE} = \sqrt{\text{MSE}}\]

Coefficient of Determination (R²)

The coefficient of determination (\(R^2\)) is defined as the percentage of the total variation in the dependent variable explained by the independent variable. For example, an \(R^2\) of 0.63 indicates that the variation of the independent variable explains 63% of the variation in the dependent variable:

\[R^2 = \text{SSR} / \text{SST}\]

Professor's Note

For simple linear regression (i.e., with one independent variable), the coefficient of determination, \(R^2\), may be computed by simply squaring the correlation coefficient, \(r\). In other words, \(R^2 = r^2\) for a regression with one independent variable.
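A quick numerical check of this shortcut, using a small hypothetical data set (illustrative values only): the sketch computes the correlation coefficient directly, then computes \(R^2\) as SSR / SST from the fitted line, and compares the two.

```python
import math

# Hypothetical paired observations (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # this is SST
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Correlation coefficient between X and Y
r = s_xy / math.sqrt(s_xx * s_yy)

# R-squared from the fitted regression line
b1_hat = s_xy / s_xx
b0_hat = y_bar - b1_hat * x_bar
ssr = sum((b0_hat + b1_hat * xi - y_bar) ** 2 for xi in x)
r_squared = ssr / s_yy

print(r ** 2, r_squared)   # both about 0.6
```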

Example: Using the ANOVA table

Calculate R² and SEE from the ANOVA table

Given the following ANOVA table based on 36 observations, calculate the \(R^2\) and the standard error of estimate (SEE).

Completed ANOVA table for ABC regression

Source of Variation	Degrees of Freedom	Sum of Squares	Mean Sum of Squares
Regression (explained)	1	0.0076	0.0076
Error (unexplained)	34	0.0406	0.0012
Total	35	0.0482

Answer:

\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{0.0076}{0.0482} = 0.158 \text{ or } 15.8\%\] \[\text{SEE} = \sqrt{\text{MSE}} = \sqrt{0.0012} = 0.035\]
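The same figures can be recomputed from the sums of squares in the table. This sketch derives MSE from SSE and the degrees of freedom rather than reading the rounded table entry, so intermediate values differ slightly before rounding.

```python
import math

# Values from the completed ANOVA table (36 observations)
n = 36
ssr, sse, sst = 0.0076, 0.0406, 0.0482

r_squared = ssr / sst     # about 0.158
mse = sse / (n - 2)       # about 0.0012
see = math.sqrt(mse)      # about 0.035

print(round(r_squared, 3), round(mse, 4), round(see, 3))  # 0.158 0.0012 0.035
```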

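Summary

Standard error of estimate (SEE) = \(\sqrt{\text{MSE}}\), the standard deviation of the residuals. The smaller the SEE, the better the model fit.

Coefficient of determination (\(R^2\)) = SSR / SST, the proportion of total variation in the dependent variable explained by the independent variable. Professor's note: in simple linear regression, \(R^2 = r^2\) (the square of the correlation coefficient).

Example calculation: \(R^2 = 0.0076/0.0482 = 15.8\%\); SEE \(= \sqrt{0.0012} = 0.035\).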

The F-Statistic

An F-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable.

The F-statistic is calculated as follows:

\[F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/k}{\text{SSE}/(n - k - 1)}\]

where MSR = mean regression sum of squares and MSE = mean squared error.

Important: This is always a one-tailed test!

For simple linear regression, there is only one independent variable, so the F-test is equivalent to a \(t\)-test of the statistical significance of the slope coefficient:

\[H_0: b_1 = 0 \text{ versus } H_a: b_1 \neq 0\]

To determine whether \(b_1\) is statistically significant using the F-test, the calculated F-statistic is compared with the critical F-value, \(F_c\), at the appropriate level of significance. The degrees of freedom for the numerator and denominator with one independent variable are as follows:

\[df_{\text{numerator}} = k = 1\] \[df_{\text{denominator}} = n - k - 1 = n - 2\]

where \(n\) = number of observations. The decision rule for the F-test is to reject \(H_0\) if \(F > F_c\).

Rejecting the null hypothesis that the value of the slope coefficient equals zero at a stated level of significance indicates that the independent variable and the dependent variable have a significant linear relationship.

Example: Calculating and interpreting the F-statistic

Test H₀: b₁ = 0 at 5% significance using the F-test

Use the ANOVA table from the previous example to calculate and interpret the F-statistic. Test the null hypothesis at the 5% significance level that the slope coefficient is equal to 0.

Answer:

\[F = \frac{\text{MSR}}{\text{MSE}} = \frac{0.0076}{0.0012} = 6.33\] \[df_{\text{numerator}} = k = 1\] \[df_{\text{denominator}} = n - k - 1 = 36 - 1 - 1 = 34\]

The null and alternative hypotheses are \(H_0: b_1 = 0\) versus \(H_a: b_1 \neq 0\). The critical F-value for 1 and 34 degrees of freedom at a 5% significance level is approximately 4.1. (Remember, it's a one-tailed test, so we use the 5% F-table.) Because the calculated F-statistic of 6.33 exceeds the critical value of 4.1, we reject the null hypothesis and conclude that the slope coefficient is significantly different from zero.
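The test reduces to one division and one comparison. The sketch below takes the mean squares from the ANOVA table and the approximate critical value quoted above as inputs; the critical value is typed in by hand rather than looked up from a statistical library.

```python
# Inputs from the ANOVA table (36 observations, one independent variable)
n, k = 36, 1
msr = 0.0076
mse = 0.0012        # rounded value from the table
f_critical = 4.1    # approximate 5% critical value for (1, 34) df, as quoted in the text

f_stat = msr / mse
df_numerator = k
df_denominator = n - k - 1

# One-tailed decision rule: reject H0 if F exceeds the critical value
reject_h0 = f_stat > f_critical

print(round(f_stat, 2), df_numerator, df_denominator, reject_h0)  # 6.33 1 34 True
```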

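Summary

The F-statistic tests whether the independent variables, as a group, significantly explain the variation in the dependent variable: \(F = \text{MSR}/\text{MSE}\).

Important: the F-test is always one-tailed. Decision rule: reject \(H_0\) if \(F > F_c\).

In simple linear regression (\(k=1\)), the F-test is equivalent to a \(t\)-test of the slope: \(H_0: b_1 = 0\) vs. \(H_a: b_1 \neq 0\).

Degrees of freedom: numerator df = 1, denominator df = \(n - 2\).

Example: \(F = 0.0076/0.0012 = 6.33\) and the critical value \(F_c(1, 34) \approx 4.1\). Because 6.33 > 4.1, we reject the null hypothesis and conclude that the slope coefficient is significantly different from zero.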

Hypothesis Test of a Regression Coefficient

A \(t\)-test may also be used to test the hypothesis that the true slope coefficient, \(b_1\), is equal to a hypothesized value. Letting \(\hat{b}_1\) be the point estimate for \(b_1\), the appropriate test statistic with \(n - 2\) degrees of freedom is:

\[t_{\hat{b}_1} = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}\]

The decision rule for tests of significance for regression coefficients is:

\[\text{Reject } H_0 \text{ if } t > +t_{\text{critical}} \text{ or } t < -t_{\text{critical}}\]

Rejection of the null supports the alternative hypothesis that the slope coefficient is different from the hypothesized value of \(b_1\). To test whether an independent variable explains the variation in the dependent variable (i.e., it is statistically significant), the null hypothesis is that the true slope is zero (\(b_1 = 0\)). The appropriate test structure for the null and alternative hypotheses is:

\[H_0: b_1 = 0 \text{ versus } H_a: b_1 \neq 0\]

Example: Hypothesis test for significance of regression coefficients

Determine if the estimated slope is significantly different from zero

The estimated slope coefficient from the ABC example is 0.64 with a standard error equal to 0.26. Assuming that the sample has 36 observations, determine if the estimated slope coefficient is significantly different than zero at a 5% level of significance.

Answer:

The calculated test statistic is:

\[t = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}} = \frac{0.64 - 0}{0.26} = 2.46\]

The critical two-tailed \(t\)-values are ±2.03 (from the \(t\)-table with \(df = 36 - 2 = 34\)). Because \(t > t_{\text{critical}}\) (i.e., 2.46 > 2.03), we reject the null hypothesis and conclude that the slope is different from zero.

Note that the \(t\)-test for a simple linear regression is equivalent to a \(t\)-test for the correlation coefficient between \(x\) and \(y\):

\[t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}\]
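These equivalences can be illustrated numerically. The sketch below uses a small hypothetical data set and, as an assumption not stated in this reading, the standard textbook expression for the slope's standard error, \(s_{\hat{b}_1} = \text{SEE} / \sqrt{\sum (X_i - \bar{X})^2}\). It shows that the slope \(t\)-statistic matches the correlation \(t\)-statistic and that its square matches the F-statistic.

```python
import math

# Hypothetical paired observations (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # SST
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1_hat = s_xy / s_xx
b0_hat = y_bar - b1_hat * x_bar
sse = sum((yi - (b0_hat + b1_hat * xi)) ** 2 for xi, yi in zip(x, y))
see = math.sqrt(sse / (n - 2))

# Assumed textbook formula for the slope's standard error
se_b1 = see / math.sqrt(s_xx)
t_slope = (b1_hat - 0.0) / se_b1            # H0: b1 = 0

# t-statistic for the correlation coefficient
r = s_xy / math.sqrt(s_xx * s_yy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# F-statistic with k = 1, using SSR = SST - SSE
ssr = s_yy - sse
f_stat = (ssr / 1) / (sse / (n - 2))

print(round(t_slope, 4), round(t_corr, 4), round(f_stat, 4))
```

With rounded inputs, as in the ABC example, the three statistics agree only approximately.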

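Summary

\(t\)-test of the slope coefficient: \(t = (\hat{b}_1 - b_1) / s_{\hat{b}_1}\), with \(n - 2\) degrees of freedom.

Decision rule (two-tailed): reject \(H_0\) if \(|t| > t_{\text{critical}}\).

Example: \(t = 0.64/0.26 = 2.46\) and the critical values are ±2.03 (df = 34). Because 2.46 > 2.03, we reject \(H_0\); the slope is significantly different from zero.

Note: the \(t\)-statistic in a simple regression is equivalent to the \(t\)-test of the correlation coefficient: \(t = r\sqrt{n-2}/\sqrt{1-r^2}\).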

Indicator Variables

An indicator variable or dummy variable is a time series that takes on a value of 1 in periods when some condition holds, and 0 in periods when that condition does not hold. In a simple regression, an analyst can use an indicator as the independent variable to test whether a condition has a significant effect on the dependent variable.

For example, let's say an analyst wants to decide whether to classify a company's quarterly earnings as cyclical. One way to do so is with an indicator variable for economic contractions, giving it a value of 1 in quarters when the economy was in recession and 0 in all other quarters. A simple linear regression, with the company's quarterly earnings as the dependent variable and this indicator (Recession) as the independent variable, is as follows:

\[\text{Earnings}_i = b_0 + b_1 (\text{Recession})_i + \varepsilon_i\]

The analyst can test the hypothesis that the slope, \(b_1\), is equal to zero. If the analyst can reject the hypothesis, this would suggest the company's earnings are significantly different in recessionary quarters than they are in expansionary quarters.

Professor's Note

In this example we would expect the slope coefficient to be negative, because earnings would likely be lower in recessionary periods if the business cycle affects them.

This approach is equivalent to grouping the quarterly earnings into recessionary periods and non-recessionary periods and performing a difference-in-means test. In fact, the slope coefficient \(b_1\) should equal the difference in means. It represents how different earnings are, on average, in recessionary periods.
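The link between the slope and the difference in means can be seen with a few made-up numbers. In this sketch the earnings figures and recession flags are hypothetical; the estimated slope equals the mean of recessionary-quarter earnings minus the mean of the other quarters, and the intercept equals the non-recession mean.

```python
# Hypothetical quarterly earnings and a recession indicator (1 = recession)
earnings = [5.0, 6.0, 2.0, 3.0, 7.0, 1.0, 6.0, 5.0]
recession = [0, 0, 1, 1, 0, 1, 0, 0]
n = len(earnings)

x_bar = sum(recession) / n
y_bar = sum(earnings) / n

# Least squares slope and intercept with the indicator as X
b1_hat = (sum((d - x_bar) * (e - y_bar) for d, e in zip(recession, earnings))
          / sum((d - x_bar) ** 2 for d in recession))
b0_hat = y_bar - b1_hat * x_bar

# Group means computed directly
n_rec = sum(recession)
mean_recession = sum(e for d, e in zip(recession, earnings) if d == 1) / n_rec
mean_other = sum(e for d, e in zip(recession, earnings) if d == 0) / (n - n_rec)

print(b1_hat, mean_recession - mean_other)   # both about -3.8
print(b0_hat, mean_other)                    # both about 5.8
```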

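Summary

An indicator variable, also called a dummy variable, takes a value of 1 when a condition holds and 0 otherwise.

Application: regress a company's quarterly earnings on a recession indicator (recession = 1, otherwise 0). If the slope \(b_1\) is significantly different from zero, earnings in recessionary quarters differ significantly from those in non-recessionary quarters.

Professor's note: if the business cycle affects earnings, we expect \(b_1 < 0\) (lower earnings in recessions).

This approach is equivalent to splitting earnings into two groups and performing a difference-in-means test; the slope \(b_1\) equals the difference between the two group means.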

Module Quiz 10.2

1. Consider the following statement: "In a simple linear regression, the appropriate degrees of freedom for the critical \(t\)-value used to calculate a confidence interval around both a parameter estimate and a predicted \(Y\)-value is the same as the number of observations minus two." This statement is:

A. justified.
B. not justified, because the appropriate degrees of freedom used to calculate a confidence interval around a parameter estimate is the number of observations.
C. not justified, because the appropriate degrees of freedom used to calculate a confidence interval around a predicted \(Y\)-value is the number of observations.

A is correct. In simple linear regression, the appropriate degrees of freedom for both confidence intervals is the number of observations in the sample (\(n\)) minus two. (LOS 10.c)

2. What is the appropriate alternative hypothesis to test the statistical significance of the intercept term in the following regression? \[Y = a_1 + a_2(X) + \varepsilon\]

A. \(H_A: a_1 \neq 0\).
B. \(H_A: a_1 > 0\).
C. \(H_A: a_2 \neq 0\).

A is correct. In this regression, \(a_1\) is the intercept term. To test the statistical significance means to test the null hypothesis that \(a_1\) is equal to zero, versus the alternative that \(a_1\) is not equal to zero. (LOS 10.c)

3. The variation in the dependent variable explained by the independent variable is measured by the:

A. mean squared error.
B. sum of squared errors.
C. regression sum of squares.

C is correct. The regression sum of squares (SSR) measures the amount of variation in the dependent variable explained by the independent variable (i.e., the explained variation). The sum of squared errors (SSE) measures the variation in the dependent variable not explained by the independent variable. The mean squared error (MSE) is equal to the SSE divided by its degrees of freedom. (LOS 10.d)

MODULE 10.3: PREDICTED VALUES AND FUNCTIONAL FORMS OF REGRESSION

LOS 10.e

Calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model and a value for the independent variable.

Predicted values are values of the dependent variable based on the estimated regression coefficients and a prediction about the value of the independent variable. They are the values that are predicted by the regression equation, given an estimate of the independent variable.

For a simple regression, this is the predicted (or forecast) value of \(Y\):

\[\hat{Y} = \hat{b}_0 + \hat{b}_1 X_p\]

where \(\hat{Y}\) = predicted value of the dependent variable, and \(X_p\) = forecasted value of the independent variable.

Example: Predicting the dependent variable

Calculate the predicted ABC excess return

Given the ABC regression equation as follows:

\[\widehat{\text{ABC}} = -2.3\% + (0.64)(\widehat{\text{S\&P 500}})\]

Calculate the predicted value of ABC excess returns if forecast S&P 500 excess returns are 10%.

Answer:

The predicted value for ABC excess returns is determined as follows:

\[\widehat{\text{ABC}} = -2.3\% + (0.64)(10\%) = 4.1\%\]

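Summary

Predicted values are estimates of the dependent variable based on the estimated regression coefficients and a forecast of the independent variable:

\(\hat{Y} = \hat{b}_0 + \hat{b}_1 X_p\)

Example: if the forecast S&P 500 excess return is 10%, the predicted ABC excess return = −2.3% + (0.64)(10%) = 4.1%.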

Confidence Intervals for Predicted Values

This is the equation for the confidence interval for a predicted value of \(Y\):

\[\hat{Y} \pm (t_c \times s_f) \Rightarrow [\hat{Y} - (t_c \times s_f) < Y < \hat{Y} + (t_c \times s_f)]\]

where:

\(t_c\) = two-tailed critical \(t\)-value at the desired level of significance with \(df = n - 2\)
\(s_f\) = standard error of the forecast

The challenge with computing a confidence interval for a predicted value is calculating \(s_f\). On the Level I exam, it's highly unlikely that you will have to calculate the standard error of the forecast (it will probably be provided if you need to compute a confidence interval for the dependent variable). However, if you do need to calculate \(s_f\), it can be done with the following formula for the variance of the forecast:

\[s_f^2 = \text{SEE}^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{(n - 1)s_x^2} \right]\]

where \(\text{SEE}^2\) = variance of the residuals, \(s_x^2\) = variance of the independent variable, and \(X\) = value of the independent variable for which the forecast was made.
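To see how the formula behaves, the sketch below wraps it in a small helper (the function name and all input values are hypothetical, chosen for illustration rather than taken from the ABC example). It shows that the standard error of the forecast always exceeds the SEE and grows as the forecast \(X\) moves away from \(\bar{X}\).

```python
import math


def forecast_std_error(see, n, x, x_bar, var_x):
    """Standard error of the forecast from the variance-of-forecast formula."""
    variance_f = see ** 2 * (1 + 1 / n + (x - x_bar) ** 2 / ((n - 1) * var_x))
    return math.sqrt(variance_f)


# Hypothetical inputs (not the ABC figures)
see, n, x_bar, var_x = 2.0, 25, 5.0, 4.0

at_mean = forecast_std_error(see, n, x=5.0, x_bar=x_bar, var_x=var_x)
far_out = forecast_std_error(see, n, x=15.0, x_bar=x_bar, var_x=var_x)

print(round(at_mean, 4))   # slightly above the SEE of 2.0
print(round(far_out, 4))   # larger still, because X is far from its mean
```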

Example: Confidence interval for a predicted value

Calculate a 95% prediction interval for ABC excess returns

Calculate a 95% prediction interval for the predicted value of ABC excess returns from the previous example. Assume the standard error of the forecast is 3.67%, and the forecast value of S&P 500 excess returns is 10%.

Answer:

This is the predicted value for ABC excess returns:

\[\widehat{\text{ABC}} = -2.3\% + (0.64)(10\%) = 4.1\%\]

The 5% two-tailed critical \(t\)-value with 34 degrees of freedom is 2.03. This is the prediction interval at the 95% confidence level:

\[\widehat{\text{ABC}} \pm (t_c \times s_f) \Rightarrow [4.1\% \pm (2.03 \times 3.67\%)] = 4.1\% \pm 7.5\%\]

Or, −3.4% to 11.6%.

We can interpret this range to mean that, given a forecast value for S&P 500 excess returns of 10%, we can be 95% confident that the ABC excess returns will be between −3.4% and 11.6%.
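The same interval can be checked numerically. The sketch below takes the critical value and \(s_f\) as given inputs, exactly as the example does; the function name `prediction_interval` is an illustrative choice.

```python
def prediction_interval(y_hat: float, t_crit: float, s_f: float) -> tuple[float, float]:
    """Return (lower, upper) bounds of y_hat +/- t_crit * s_f."""
    margin = t_crit * s_f
    return y_hat - margin, y_hat + margin

# Inputs from the example, in percent: predicted value, critical t (df = 34), s_f.
lower, upper = prediction_interval(y_hat=4.1, t_crit=2.03, s_f=3.67)
print(f"95% prediction interval: {lower:.1f}% to {upper:.1f}%")
```

Rounded to one decimal place, the bounds agree with the −3.4% to 11.6% range in the worked answer.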

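Summary

Confidence interval for a predicted value: \(\hat{Y} \pm (t_c \times s_f)\), where \(t_c\) is the two-tailed critical value with \(df = n-2\) and \(s_f\) is the standard error of the forecast.

Variance of the forecast error: \(s_f^2 = \text{SEE}^2\left[1 + \frac{1}{n} + \frac{(X-\bar{X})^2}{(n-1)s_x^2}\right]\). The Level I exam usually provides \(s_f\) directly.

Example: Predicted value 4.1%, \(s_f = 3.67\%\), \(t_c = 2.03\) (df = 34). Prediction interval = 4.1% ± 7.5% = −3.4% to 11.6%. Interpretation: at the 95% confidence level, when the S&P 500 excess return is 10%, the ABC excess return falls between −3.4% and 11.6%.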

LOS 10.f

Describe different functional forms of simple linear regressions.

One of the assumptions of linear regression is that the relationship between \(X\) and \(Y\) is linear. What if that assumption is violated? Consider \(Y\) = EPS for a company and \(X\) = time index. Suppose that EPS is growing at approximately 10% annually. EPS then follows an exponential rather than a linear path over time, so a straight line fitted to the raw data would fit poorly. In such a situation, transforming one or both of the variables can produce a linear relationship. The appropriate transformation depends on the relationship between the two variables. One often-used transformation is to take the natural log of one or both of the variables. Here are some examples:

Log-lin model. The dependent variable is logarithmic, while the independent variable is linear.
Lin-log model. The dependent variable is linear, while the independent variable is logarithmic.
Log-log model. Both the dependent variable and the independent variable are logarithmic.

Selecting the correct functional form involves determining the nature of the variables and evaluating the goodness-of-fit measures (e.g., \(R^2\), SEE, F-stat).

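Summary

If the relationship between \(X\) and \(Y\) is nonlinear, taking the natural log of one or both variables can produce a linear relationship. Three common functional forms:

Log-lin model: the dependent variable is logged; the independent variable is linear.
Lin-log model: the dependent variable is linear; the independent variable is logged.
Log-log model: both the dependent and the independent variable are logged.

Selecting the correct functional form requires considering both the nature of the variables and the goodness-of-fit measures (\(R^2\), SEE, F-statistic).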

Log-Lin Model

Taking the natural logarithm of the dependent variable, our model now becomes this:

\[\ln Y_i = b_0 + b_1 X_i + \varepsilon_i\]

In this model, the slope coefficient is interpreted as the relative change in the dependent variable for an absolute change in the independent variable.

Lin-Log Model

Taking the natural logarithm of the independent variable, our model now becomes this:

\[Y_i = b_0 + b_1 \ln(X_i) + \varepsilon_i\]

In this model, the slope coefficient is interpreted as the absolute change in the dependent variable for a relative change in the independent variable.

Log-Log Model

Taking the natural logarithm of both variables, our model now becomes this:

\[\ln Y_i = b_0 + b_1 \ln(X_i) + \varepsilon_i\]

In this model, the slope coefficient is interpreted as the relative change in the dependent variable for a relative change in the independent variable.
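To make the log-lin case concrete, the sketch below returns to the EPS illustration: a series growing at exactly 10% per period is curved in levels but exactly linear in logs. The EPS figures (a starting value of 2.0 compounding at 10% for 10 periods) are hypothetical, and the helper `ols` is an illustrative pure-Python implementation of the least squares estimates, not a library routine.

```python
import math

def ols(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (intercept, slope) of a simple least squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical EPS series: starts from 2.0 and grows 10% per period.
t = [float(i) for i in range(1, 11)]
eps = [2.0 * 1.10**ti for ti in t]

# Log-lin model: regress ln(EPS) on the time index.
b0, b1 = ols(t, [math.log(e) for e in eps])
growth = math.exp(b1) - 1  # exact per-period growth implied by the slope

print(f"slope b1 = {b1:.4f}, implied growth = {growth:.2%}")
```

The fitted slope equals \(\ln(1.10)\), roughly 0.095, which is the approximate relative change in EPS per one-period change in the time index; exponentiating the slope recovers the exact 10% growth rate.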

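Summary

Log-lin model: \(\ln Y_i = b_0 + b_1 X_i + \varepsilon_i\). Slope interpretation: for an absolute one-unit increase in the independent variable, the dependent variable changes by approximately \(b_1 \times 100\%\) in relative (percentage) terms. (Logging the dependent variable corresponds to an exponential growth relationship.)

Lin-log model: \(Y_i = b_0 + b_1 \ln(X_i) + \varepsilon_i\). Slope interpretation: for a relative 1% increase in the independent variable, the dependent variable changes by approximately \(b_1/100\) in absolute terms.

Log-log model: \(\ln Y_i = b_0 + b_1 \ln(X_i) + \varepsilon_i\). Slope interpretation: for a relative 1% increase in the independent variable, the dependent variable changes by approximately \(b_1\)% in relative terms (the elasticity interpretation; \(b_1\) is the elasticity coefficient).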

Module Quiz 10.3

1. For a regression model of \(Y = 5 + 3.5X\), the analysis (based on a large data sample) provides the standard error of the forecast as 2.5 and the standard error of the slope coefficient as 0.8. A 90% confidence interval for the estimate of \(Y\) when the value of the independent variable is 10 is closest to:

A. 35.1 to 44.9.
B. 35.6 to 44.4.
C. 35.9 to 44.1.

C is correct. The estimate of \(Y\), given \(X = 10\), is \(Y = 5 + 3.5(10) = 40\). The critical value for a 90% confidence interval with a large sample size (z-statistic) is approximately 1.65. Given the standard error of the forecast of 2.5, the confidence interval for the estimated value of \(Y\) is \(40 \pm 1.65(2.5) = 35.875\) to \(44.125\). (LOS 10.e)

2. The appropriate regression model for a linear relationship between the relative change in an independent variable and the absolute change in the dependent variable is a:

A. log-lin model.
B. lin-log model.
C. lin-lin model.

B is correct. The appropriate model would be a lin-log model, in which the values of the dependent variable (\(Y\)) are regressed on the natural logarithms of the independent variable (\(X\)): \(Y = b_0 + b_1 \ln(X)\). (LOS 10.f)

Key Concepts

LOS 10.a

Linear regression provides an estimate of the linear relationship between an independent variable (the explanatory variable) and a dependent variable (the predicted variable).

The general form of a simple linear regression model: \(Y_i = b_0 + b_1 X_i + \varepsilon_i\)

The least squares criterion selects the coefficient estimates that minimize the sum of squared errors (SSE):

\(\hat{b}_0\) = fitted intercept = \(\bar{Y} - \hat{b}_1 \bar{X}\)
\(\hat{b}_1\) = fitted slope coefficient = Cov(\(X, Y\)) / Var(\(X\))

The estimated intercept, \(\hat{b}_0\), represents the value of the dependent variable when the independent variable is zero. The estimated slope coefficient, \(\hat{b}_1\), is the change in the dependent variable for a one-unit change in the independent variable.
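The coefficient formulas above can be sketched in plain Python as follows. The five data points are hypothetical and chosen only so the arithmetic is easy to follow; the covariance and variance are the sample versions (divided by \(n - 1\)).

```python
# Hypothetical sample, used only to illustrate the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance of X and Y, and sample variance of X.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

b1_hat = cov_xy / var_x           # fitted slope = Cov(X, Y) / Var(X)
b0_hat = y_bar - b1_hat * x_bar   # fitted intercept = Y-bar - slope * X-bar

print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}")
```

Because the intercept is computed from the two sample means, the fitted line always passes through the point \((\bar{X}, \bar{Y})\).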

LOS 10.b

Assumptions of simple linear regression:

A linear relationship exists between the dependent and the independent variable.
The variance of the residual term is constant (homoskedasticity).
The residual term is independently distributed (residuals are uncorrelated).
The residual term is normally distributed.

LOS 10.c

Key decomposition: SST = SSR + SSE

SST = \(\sum(Y_i - \bar{Y})^2\) — total variation
SSR = \(\sum(\hat{Y}_i - \bar{Y})^2\) — explained variation
SSE = \(\sum(Y_i - \hat{Y}_i)^2\) — unexplained variation

Coefficient of determination: \(R^2 = \text{SSR}/\text{SST} = (\text{SST} - \text{SSE})/\text{SST}\)

In simple linear regression, the F-test and the t-test of \(b_1\) test the same hypothesis: \(H_0: b_1 = 0\) versus \(H_a: b_1 \neq 0\). F-stat = MSR/MSE with 1 and \(n - 2\) degrees of freedom.

An indicator variable (dummy variable) takes a value of 1 when a specified condition holds and 0 otherwise; it can be used to test whether a dependent variable is significantly different between the two conditions.

LOS 10.d

ANOVA Table for Simple Linear Regression (k = 1)

Source of Variation	Degrees of Freedom (df)	Sum of Squares	Mean Sum of Squares
Regression (explained)	1	SSR	MSR = SSR / k = SSR
Error (unexplained)	n − 2	SSE	MSE = SSE / (n − 2)
Total	n − 1	SST

Standard error of the estimate: \(\text{SEE} = \sqrt{\text{SSE}/(n - 2)} = \sqrt{\text{MSE}}\)
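The ANOVA quantities can be computed directly from a fitted line. The sketch below uses a small hypothetical data set (not the ABC data) and illustrative variable names; it fits the line, builds the three sums of squares, and derives \(R^2\), the F-statistic, and SEE from them.

```python
import math

# Hypothetical sample, used only to illustrate the calculations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
k = 1  # one independent variable in simple linear regression

# Least squares fit.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

# Mean squares and summary statistics.
msr = ssr / k
mse = sse / (n - 2)
f_stat = msr / mse
see = math.sqrt(mse)
r2 = ssr / sst

print(f"SST={sst:.2f} SSR={ssr:.2f} SSE={sse:.2f}")
print(f"R^2={r2:.2f} F={f_stat:.2f} SEE={see:.4f}")
```

For this sample, SSR and SSE sum to SST, which is a quick check that the fit and the decomposition were computed consistently.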

LOS 10.e

A predicted value of the dependent variable: \(\hat{Y} = \hat{b}_0 + \hat{b}_1 X_p\)

Confidence interval for a predicted \(Y\)-value: \(\hat{Y} \pm (t_c \times s_f)\), where \(s_f\) is the standard error of the forecast and \(df = n - 2\).

LOS 10.f

Dependent Variable	Independent Variable	Model	Slope Interpretation
Logarithmic	Linear	Log-lin	Relative change in dependent variable for an absolute change in independent variable
Linear	Logarithmic	Lin-log	Absolute change in dependent variable for a relative change in independent variable
Logarithmic	Logarithmic	Log-log	Relative change in dependent variable for a relative change in independent variable