Reading 9
MODULE 9.1: TESTS FOR INDEPENDENCE
LOS 9.a: Explain parametric and nonparametric tests of the hypothesis that the population correlation coefficient equals zero, and determine whether the hypothesis is rejected at a given level of significance.
Correlation measures the strength of the linear relationship between two variables. If the correlation between two variables is zero, there is no linear relationship between them. When the sample correlation coefficient for two variables is different from zero, we must test whether the true population correlation coefficient (\(\rho\)) is equal to zero. The appropriate test statistic for the hypothesis that the population correlation equals zero, when the two variables are normally distributed, is as follows:
\[\frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}\]
where:
- \(r\) = sample correlation
- \(n\) = sample size
This test statistic follows a t-distribution with \(n - 2\) degrees of freedom. Note that, for a given nonzero sample correlation, the magnitude of the test statistic increases with the sample size as well as with the magnitude of the sample correlation coefficient.
EXAMPLE: Test of the hypothesis that the population correlation coefficient equals zero
A researcher computes the sample correlation coefficient for two normally distributed random variables as 0.35, based on a sample size of 42. Determine whether to reject the hypothesis that the population correlation coefficient is equal to zero at a 5% significance level.
Answer:
Our test statistic is \(\dfrac{0.35\sqrt{42 - 2}}{\sqrt{1 - 0.35^2}} = 2.363\).
Using the t-table with \(42 - 2 = 40\) degrees of freedom for a two-tailed test at a 5% significance level, we find a critical value of 2.021. Because our computed test statistic of 2.363 is greater than 2.021, we reject the hypothesis that the population correlation coefficient is zero and conclude that it is not equal to zero. That is, the two variables are correlated, in this case positively.
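The arithmetic in this example can be checked with a minimal Python sketch. The function name `correlation_t_stat` is a hypothetical helper introduced here for illustration, the critical value is simply the 2.021 quoted from the t-table above, and the second call uses an assumed smaller sample (n = 20) only to show how the statistic shrinks when the sample size falls.

```python
import math


def correlation_t_stat(r: float, n: int) -> float:
    """t-statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Values from the example: r = 0.35, n = 42
t_stat = correlation_t_stat(0.35, 42)

# Two-tailed 5% critical value for 40 degrees of freedom, as quoted in the text
critical_value = 2.021

# Two-tailed test: compare the absolute value of the statistic with the critical value
reject_null = abs(t_stat) > critical_value
print(f"t = {t_stat:.3f}, reject H0: {reject_null}")

# Same sample correlation with a smaller, hypothetical sample size (illustration only)
t_small = correlation_t_stat(0.35, 20)
print(f"t with n = 20: {t_small:.3f}")
```

Holding the sample correlation at 0.35, the statistic computed from the smaller sample is lower, which is the sample-size effect noted earlier.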
PROFESSOR'S NOTE: The correlation coefficient we refer to here is the Pearson correlation coefficient, which is a measure of the strength of the linear relationship between two variables. There are other correlation coefficients that better measure the strength of any nonlinear relationship between two variables.
The Spearman rank correlation test, a nonparametric test, can be used to test whether two sets of ranks are correlated. Ranks are simply ordered values. If there is a tie (equal values), the tied observations share the average of the ranks they occupy: if the second- and third-ranked values are equal, each gets a rank of \((2 + 3) / 2 = 2.5\).
The Spearman rank correlation, \(r_s\) (when all ranks are integer values), is calculated as follows:
\[r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}\]
where:
- \(r_s\) = rank correlation
- \(n\) = sample size
- \(d_i\) = difference between two ranks
We can test the significance of the Spearman rank correlation using the same test statistic we used for the parametric correlation coefficient:
\[\frac{r_s \sqrt{n - 2}}{\sqrt{1 - r_s^2}}\]
When the sample size is greater than 30, the test statistic follows a t-distribution with \(n - 2\) degrees of freedom.
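The ranking rule and the rank correlation formula can be sketched in Python as follows. The helper names `average_ranks` and `spearman_rank_correlation` and the two return series are hypothetical, made up purely for illustration. The sketch applies the formula above as written, so it is meant for data where all ranks are integers (no ties).

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # average of the 1-based positions in the run
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def spearman_rank_correlation(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), using the differences between paired ranks."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Tie handling: the second and third values share rank (2 + 3) / 2 = 2.5
print(average_ranks([10, 20, 20, 30]))

# Hypothetical returns for two funds over five periods (illustration only)
fund_a = [0.10, 0.02, 0.07, -0.03, 0.05]
fund_b = [0.08, 0.01, 0.09, 0.00, 0.03]
r_s = spearman_rank_correlation(fund_a, fund_b)
print(f"r_s = {r_s:.2f}")
```

With only five observations, the t-distribution result described above (sample sizes greater than 30) would not apply, so the sketch stops at the rank correlation itself.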
LOS 9.b: Explain tests of independence based on contingency table data.
A contingency or two-way table shows the number of observations from a sample that have a combination of two characteristics. Figure 9.1 is a contingency table where the characteristics are earnings growth (low, medium, or high) and dividend yield (low, medium, or high). We can use the data in the table to test the hypothesis that the two characteristics, earnings growth and dividend yield, are independent of each other.
| Earnings Growth | Dividend Yield – Low | Dividend Yield – Medium | Dividend Yield – High | Total |
|---|---|---|---|---|
| Low | 28 | 53 | 42 | 123 |
| Medium | 42 | 32 | 39 | 113 |
| High | 49 | 25 | 14 | 88 |
| Total | 119 | 110 | 95 | 324 |
We index our three categories of earnings growth from low to high with \(i = 1, 2, 3\), and our three categories of dividend yield from low to high with \(j = 1, 2, 3\). From the table, we see in Cell 1,1 that 28 firms have both low earnings growth and low dividend yield. We see in Cell 3,2 that 25 firms have high earnings growth and medium dividend yield.
For our test, we are going to compare the actual table values to what the values would be if the two characteristics were independent. The test statistic is a chi-square test statistic calculated as follows:
\[X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
where:
- \(O_{ij}\) = number of observations in Cell \(i,j\): Row \(i\) and Column \(j\) (i.e., observed frequency)
- \(E_{ij}\) = expected number of observations for Cell \(i,j\)
- \(r\) = number of row categories
- \(c\) = number of column categories
The degrees of freedom are \((r - 1) \times (c - 1)\), which is 4 in our example for dividend yield and earnings growth.
\(E_{ij}\), the expected number of observations in Cell \(i,j\), is:
\[\frac{\text{total for Row } i \times \text{total for Column } j}{\text{total for all columns and rows}}\]
The expected number of observations for Cell 2,2 is:
\[\frac{113 \times 110}{324} = 38.4\]
In calculating our test statistic, the term for Cell 2,2 is:
\[\frac{(32 - 38.4)^2}{38.4} = 1.0667\]
Figure 9.2 shows the expected frequencies for each pair of categories in our earnings growth and dividend yield contingency table.
| Earnings Growth | Dividend Yield – Low | Dividend Yield – Medium | Dividend Yield – High |
|---|---|---|---|
| Low | 45.2 | 41.8 | 36.1 |
| Medium | 41.5 | 38.4 | 33.1 |
| High | 32.3 | 29.9 | 25.8 |
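The expected frequencies in Figure 9.2 can be reproduced from the row and column totals of Figure 9.1. The following is a minimal Python sketch; the variable names are illustrative choices, and the observed counts are copied from Figure 9.1.

```python
# Observed counts from Figure 9.1
# Rows: earnings growth (low, medium, high); columns: dividend yield (low, medium, high)
observed = [
    [28, 53, 42],
    [42, 32, 39],
    [49, 25, 14],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
```

Rounded to one decimal place, the printed rows should match the values shown in Figure 9.2.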
For our test statistic, we sum, for all nine cells, the squared difference between the expected frequency and observed frequency, divided by the expected frequency. The resulting sum is 27.47. Figure 9.3 shows the results for each cell in calculating the test statistic.
| Earnings Growth | Dividend Yield – Low | Dividend Yield – Medium | Dividend Yield – High |
|---|---|---|---|
| Low | 6.5451 | 3.0010 | 0.9643 |
| Medium | 0.0060 | 1.0667 | 1.0517 |
| High | 8.6344 | 0.8030 | 5.3969 |
| Sum | | | 27.4691 |
Our degrees of freedom are \((3 - 1) \times (3 - 1) = 4\). The critical value for a significance level of 5% (from the chi-square table in the Appendix) with 4 degrees of freedom is 9.488. Because our test statistic of 27.47 is greater than 9.488, we reject, based on our sample data, the hypothesis that the earnings growth and dividend yield categories are independent.
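The whole test can be sketched end to end in Python. The function name `chi_square_independence` is a hypothetical helper, and the critical value is the 9.488 quoted above rather than something computed here. Because the sketch keeps the expected frequencies unrounded, its statistic comes out slightly below the 27.47 obtained from the one-decimal values in Figure 9.2; the conclusion is the same.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence for Cell i,j
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Observed counts from Figure 9.1
observed = [
    [28, 53, 42],
    [42, 32, 39],
    [49, 25, 14],
]

chi2_stat, dof = chi_square_independence(observed)

# 5% critical value for 4 degrees of freedom, as quoted in the text
critical_value = 9.488

reject_independence = chi2_stat > critical_value
print(f"X^2 = {chi2_stat:.2f}, df = {dof}, reject independence: {reject_independence}")
```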
- A. t-distribution.
- B. normal distribution.
- C. chi-square distribution.
- A. a null hypothesis that rank correlations are equal to zero.
- B. whether multiple characteristics of a population are independent.
- C. the number of p-values from multiple tests that are less than adjusted critical values.
- A. degrees of freedom are \(n - 1\).
- B. the test statistic follows a t-distribution.
- C. the test statistic increases with a greater sample size.
To test a hypothesis that a population correlation coefficient equals zero, the appropriate test statistic is a t-statistic with \(n - 2\) degrees of freedom:
\[\frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}\]
where \(r\) is the sample correlation coefficient.
A nonparametric test of correlation can be performed when we have only ranks (e.g., deciles of investment performance). The Spearman rank correlation is:
\[r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}\]
where \(d_i^2\) is the squared difference between a pair of ranks and \(n\) is the sample size. The significance test also uses \(\dfrac{r_s\sqrt{n-2}}{\sqrt{1-r_s^2}}\), which follows a t-distribution with \(n - 2\) degrees of freedom for sample sizes greater than 30.
A contingency table can be used to test the hypothesis that two characteristics (categories) of a sample are independent. The test statistic follows a chi-square distribution:
\[X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
where \(O_{ij}\) is the observed frequency, \(E_{ij} = \dfrac{\text{Row } i \text{ total} \times \text{Column } j \text{ total}}{\text{Grand total}}\) is the expected frequency, \(r\) is the number of row categories, and \(c\) is the number of column categories. Degrees of freedom are \((r-1)(c-1)\). Reject independence if the test statistic exceeds the critical chi-square value.
1. A — The test statistic for a Spearman rank correlation test follows a t-distribution. (LOS 9.a)
2. B — A contingency table is used to determine whether two characteristics of a group are independent. (LOS 9.b)
3. A — Degrees of freedom are \(n - 2\) for a test of the hypothesis that correlation is equal to zero. The test statistic increases with sample size (degrees of freedom increase) and follows a t-distribution. (LOS 9.a)