We will discuss the statistics commonly used for assessing the statistical significance and the strength of association of cross-tabulated variables. The statistical significance of the observed association is commonly measured by the chi-square statistic. The strength, or degree, of association is important from a practical or substantive perspective; generally, the strength of association is of interest only if the association is statistically significant. The strength of the association can be measured by the phi correlation coefficient, the contingency coefficient, Cramer's V, and the lambda coefficient.
The chi-square statistic (χ²) is used to test the statistical significance of the observed association in a cross-tabulation. It assists us in determining whether a systematic association exists between the two variables. The null hypothesis is that there is no association between the variables. The test is conducted by computing the cell frequencies that would be expected if no association were present between the variables, given the existing row and column totals. These expected cell frequencies, denoted fe, are then compared to the actual observed frequencies, fo, found in the cross-tabulation to calculate the chi-square statistic. The greater the discrepancies between the expected and observed frequencies, the larger the value of the statistic. Assume that a cross-tabulation has r rows and c columns and a random sample of n observations. Then the expected frequency for each cell can be calculated by using a simple formula:
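The standard formulas, with nr and nc denoting the row and column totals for a given cell, can be sketched as:

```latex
f_e = \frac{n_r \, n_c}{n},
\qquad
\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}
```

The second expression makes the description above concrete: the statistic sums, over all cells, the squared discrepancy between observed and expected frequencies, scaled by the expected frequency.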
For the Internet usage data in Table 15.3, the expected frequencies for the cells, going from left to right and from top to bottom, are:
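The expected frequencies themselves do not appear here. Assuming the 2 × 2 table has row and column totals of 15 each (an inference consistent with the chi-square value of 3.333 and n = 30 reported below, not stated in this passage), every cell would have the same expected frequency:

```latex
f_e = \frac{15 \times 15}{30} = 7.5
```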
To determine whether a systematic association exists, the probability of obtaining a value of chi-square as large as or larger than the one calculated from the cross-tabulation is estimated. An important characteristic of the chi-square statistic is the number of degrees of freedom (df) associated with it. In general, the number of degrees of freedom is equal to the number of observations less the number of constraints needed to calculate a statistical term. In the case of a chi-square statistic associated with a cross-tabulation, the number of degrees of freedom is equal to the product of the number of rows (r) less one and the number of columns (c) less one. That is, df = (r − 1) × (c − 1). The null hypothesis of no association between the two variables will be rejected only when the calculated value of the test statistic is greater than the critical value of the chi-square distribution with the appropriate degrees of freedom, as shown in Figure 15.8. The chi-square distribution is a skewed distribution whose shape depends solely on the number of degrees of freedom. As the number of degrees of freedom increases, the chi-square distribution becomes more symmetrical. Table 3 in the Statistical Appendix contains upper-tail areas of the chi-square distribution for different degrees of freedom. In this table, the value at the top of each column indicates the area in the upper portion (the right side, as shown in Figure 15.8) of the chi-square distribution. To illustrate, for 1 degree of freedom, the value for an upper-tail area of 0.05 is 3.841. This indicates that for 1 degree of freedom, the probability of exceeding a chi-square value of 3.841 is 0.05. In other words, at the 0.05 level of significance with 1 degree of freedom, the critical value of the chi-square statistic is 3.841.
For the cross-tabulation given in Table 15.3, there are (2 − 1) × (2 − 1) = 1 degree of freedom. The calculated chi-square statistic has a value of 3.333. Because this is less than the critical value of 3.841, the null hypothesis of no association cannot be rejected, indicating that the association is not statistically significant at the 0.05 level. Note that this lack of significance is mainly due to the small sample size (30). If, instead, the sample size were 300 and each entry of Table 15.3 were multiplied by 10, the value of the chi-square statistic would also be multiplied by 10 and would be 33.33, which is significant at the 0.05 level.
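The calculation above can be sketched in a few lines of Python. The cell counts [[5, 10], [10, 5]] are an assumption, not given in this passage, but they are consistent with the chi-square value of 3.333 and sample size of 30 reported in the text:

```python
# Chi-square statistic for a two-way table of counts.
# Assumed Table 15.3 counts: [[5, 10], [10, 5]] (consistent with
# chi-square = 3.333 and n = 30 as reported in the text).

def chi_square(observed):
    """Return the chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2

table = [[5, 10], [10, 5]]
chi2 = chi_square(table)
df = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi2, 3), df)   # 3.333 1
print(chi2 > 3.841)         # False: cannot reject H0 at the 0.05 level

# With every entry multiplied by 10, the statistic also scales by 10:
big = [[10 * f for f in row] for row in table]
print(round(chi_square(big), 2))   # 33.33: significant at the 0.05 level
```

Note how multiplying every cell by 10 leaves the proportions unchanged but scales the statistic past the critical value, illustrating the sample-size point made above.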
The chi-square statistic can also be used in goodness-of-fit tests to determine whether certain models fit the observed data. These tests are conducted by calculating the significance of sample deviations from assumed theoretical (expected) distributions, and can be performed on cross-tabulations as well as on frequencies (one-way tabulations). The calculation of the chi-square statistic and the determination of its significance are the same as illustrated here.
The chi-square statistic should be estimated only on counts of data. When the data are in percentage form, they should first be converted to absolute counts or numbers. In addition, an underlying assumption of the chi-square test is that the observations are drawn independently. As a general rule, chi-square analysis should not be conducted when the expected or theoretical frequency in any of the cells is less than five. If the number of observations in any cell is less than 10, or if the table has two rows and two columns (a 2 × 2 table), a correction factor should be applied. With the correction factor, the value of chi-square for Table 15.3 is 2.133, which is not significant at the 0.05 level. The calculation of the correction factor is complex but can be conveniently done using appropriate software. In the case of a 2 × 2 table, the chi-square statistic is related to the phi coefficient.
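The text does not name the correction factor, but for a 2 × 2 table the commonly applied one is Yates's continuity correction, which subtracts 0.5 from each absolute deviation before squaring. A minimal sketch, again assuming cell counts of [[5, 10], [10, 5]] for Table 15.3:

```python
# Chi-square for a 2 x 2 table with Yates's continuity correction
# (assumed to be the correction factor the text refers to).

def chi_square_yates(observed):
    """Chi-square statistic with 0.5 subtracted from each |f_o - f_e|."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n
            chi2 += (abs(f_o - f_e) - 0.5) ** 2 / f_e
    return chi2

print(round(chi_square_yates([[5, 10], [10, 5]]), 3))  # 2.133, matching the text
```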
The phi coefficient (φ) is used as a measure of the strength of association in the special case of a table with two rows and two columns (a 2 × 2 table). The phi coefficient is proportional to the square root of the chi-square statistic. For a sample of size n, this statistic is calculated as:
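The standard definition, consistent with the surrounding description, is:

```latex
\phi = \sqrt{\frac{\chi^2}{n}}
```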
It takes the value of 0 when there is no association, which would be indicated by a chi-square value of 0 as well. When the variables are perfectly associated, phi assumes the value of 1 and all the observations fall just on the main or minor diagonal. (In some computer programs, phi assumes a value of −1 rather than +1 when there is perfect negative association.) In our case, because the association was not significant at the 0.05 level, we would not normally compute the phi value. However, for the purpose of illustration, we show how the values of phi and other measures of the strength of association would be computed. The value of phi is:
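Using the chi-square value of 3.333 and the sample size n = 30 from the text:

```latex
\phi = \sqrt{\frac{3.333}{30}} = \sqrt{0.111} \approx 0.333
```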
Thus, the association is not very strong. In the more general case involving a table of any size, the strength of association can be assessed by using the contingency coefficient.
Whereas the phi coefficient is specific to a 2 × 2 table, the contingency coefficient (C) can be used to assess the strength of association in a table of any size. This index is also related to chi-square, as follows:
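The standard definition of the contingency coefficient is:

```latex
C = \sqrt{\frac{\chi^2}{\chi^2 + n}}
```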
The contingency coefficient varies between 0 and 1. The 0 value occurs in the case of no association (i.e., the variables are statistically independent), but the maximum value of 1 is never achieved. Rather, the maximum value of the contingency coefficient depends on the size of the table (number of rows and number of columns). For this reason, it should be used only to compare tables of the same size. The value of the contingency coefficient for Table 15.3 is:
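Again using the values χ² = 3.333 and n = 30 from the text:

```latex
C = \sqrt{\frac{3.333}{3.333 + 30}} = \sqrt{0.1} \approx 0.316
```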
This value of C indicates that the association is not very strong. Another statistic that can be calculated for any table is Cramer’s V.
Cramer’s V is a modified version of the phi correlation coefficient, φ, and is used in tables larger than 2 × 2. When phi is calculated for a table larger than 2 × 2, it has no upper limit. Cramer’s V is obtained by adjusting phi for either the number of rows or the number of columns in the table, based on which of the two is smaller. The adjustment is such that V will range from 0 to 1. A large value of V merely indicates a high degree of association; it does not indicate how the variables are associated. For a table with r rows and c columns, the relationship between Cramer’s V and the phi correlation coefficient is expressed as:
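The standard relationship, using the smaller of (r − 1) and (c − 1) as described above, is:

```latex
V = \sqrt{\frac{\phi^2}{\min(r-1,\; c-1)}}
  = \sqrt{\frac{\chi^2 / n}{\min(r-1,\; c-1)}}
```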
The value of Cramer’s V for Table 15.3 is:
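For a 2 × 2 table, min(r − 1, c − 1) = 1, so with the values from the text:

```latex
V = \sqrt{\frac{3.333/30}{\min(2-1,\; 2-1)}} = \sqrt{0.111} \approx 0.333
```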
Thus, the association is not very strong. As can be seen, in this case V = φ. This is always the case for a 2 × 2 table. Another statistic commonly estimated is the lambda coefficient.
Lambda assumes that the variables are measured on a nominal scale. Asymmetric lambda measures the percentage improvement in predicting the value of the dependent variable, given the value of the independent variable. Lambda also varies between 0 and 1. A value of 0 means no improvement in prediction. A value of 1 indicates that the prediction can be made without error. This happens when each independent variable category is associated with a single category of the dependent variable.
Asymmetric lambda is computed for each of the variables (treating it in turn as the dependent variable). In general, the two asymmetric lambdas are likely to be different because the marginal distributions are not usually the same. A symmetric lambda is also computed, which is a kind of average of the two asymmetric values. The symmetric lambda does not make an assumption about which variable is dependent; it measures the overall improvement when prediction is done in both directions. The value of asymmetric lambda in Table 15.3, with usage as the dependent variable, is 0.333. This indicates that knowledge of sex increases our predictive ability by the proportion of 0.333, that is, a 33.3 percent improvement. The symmetric lambda is also 0.333.
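The asymmetric lambda calculation can be sketched as follows. As before, the cell counts are an assumption consistent with the reported value of 0.333; the rows are taken to be the independent variable (sex) and the columns the dependent variable (usage):

```python
# Goodman-Kruskal asymmetric lambda, with the column variable as dependent.

def asymmetric_lambda(table):
    """table[i][j]: count for independent category i, dependent category j."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    # Errors when always predicting the dependent variable's modal category:
    errors_without = n - max(col_totals)
    # Errors when predicting the modal category within each independent category:
    errors_with = n - sum(max(row) for row in table)
    return (errors_without - errors_with) / errors_without

# Assumed Table 15.3 counts (rows: sex; columns: light/heavy usage):
print(round(asymmetric_lambda([[5, 10], [10, 5]]), 3))  # 0.333
```

With these counts, knowing sex cuts prediction errors from 15 to 10, a 33.3 percent improvement, matching the value reported in the text.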
Note that in the calculation of the chi-square statistic, the variables are treated as being measured only on a nominal scale. Other statistics, such as tau b, tau c, and gamma, are available to measure association between two ordinal-level variables. All these statistics use information about the ordering of categories of variables by considering every possible pair of cases in the table. Each pair is examined to determine if its relative ordering on the first variable is the same as its relative ordering on the second variable (concordant), if the ordering is reversed (discordant), or if the pair is tied. The manner in which the ties are treated is the basic difference between these statistics. Both tau b and tau c adjust for ties. Tau b is most appropriate with square tables, in which the number of rows and the number of columns are equal. Its value varies between +1 and −1. Thus, both the direction (positive or negative) and the strength (how close the value is to 1) of the relationship can be determined. For a rectangular table in which the number of rows is different from the number of columns, tau c should be used. Gamma does not make an adjustment for either ties or table size. Gamma also varies between +1 and −1 and generally has a higher numerical value than tau b or tau c. For the data in Table 15.3, because sex is a nominal variable, it is not appropriate to calculate ordinal statistics. All these statistics can be estimated by using the appropriate computer programs for cross-tabulation. Other statistics for measuring the strength of association include the product moment correlation and nonmetric correlation.