Chi-Square Test of Independence – Fresh Take

Lumen Learning

Chi-Square Test of Independence – Fresh Take

Complete a chi-square test of independence
Write the conclusion of a chi-square test of independence in context of the problem

Conditional and Marginal Distributions

The Pew Research Center is a non-partisan, social science research think tank. One of the surveys they conduct periodically is called the Core Trends Survey, in which they poll a representative sample of American adults on a multitude of variables. The contingency table detailing the observed counts for the variables Number of books read in the last year and Type of residence is given below.^[1]

		Type of residence
		Urban	Suburban	Rural	Total
Number of books read	None	133	144	81	358
	1–4	146	149	53	348
	5–9	76	74	33	183
	10+	194	216	76	486
	Total	549	583	243	1,375

conditional distribution

The conditional distribution of one variable with respect to a value of a second variable gives the counts or the relative frequencies of the first variable restricted to only that value of the second variable. In terms of the table, this means we will restrict ourselves to either one row or one column of the interior part of the table.

If we consider the conditional distribution of Number of books read in the last year for people who live in urban residences (as shown in the following table), we are restricting ourselves to the “Urban” column of the table and looking at the distribution of Number of books read in the last year for just the urban dwellers.

		Urban
Number of books read	None	133
	1–4	146
	5–9	76
	10+	194
	Total	549

Often, when we discuss the conditional distribution, we’re more interested in the relative frequencies, or the proportion corresponding to each value of the variable of interest.

For example, among all the people living in an urban setting, the relative frequency of individuals who read no books in the last year is:

[latex]\dfrac{133}{549}=0.2423=24.23\%[/latex]

marginal distribution

The marginal distribution of a variable gives the distribution of one of the variables with no regard to the other variable whatsoever. In the table, this will be either the total row or the total column. One way to remember this is that the “margins” are on the outsides of a piece of paper (sides, top, and bottom), and the total row and column are the outside row and column of the table (on the side and bottom).

If we are considering the marginal distribution of Number of books read in the last year, we will look only at the totals in the far right column of the table because those give us the counts for each category of the variable Number of books read in the last year, with no regard to the other variable.

Number of books read	None	358
	1–4	348
	5–9	183
	10+	486
	Total	1,375

As before, we are often interested in the relative frequencies of the marginal distribution. For example, the relative frequency of individuals who read no books last year is:

[latex]\dfrac{358}{1375}=0.2604=26.04\%[/latex]

Note: Sometimes the percentages will not sum exactly to [latex]100\%[/latex]. This is due to a rounding error when you compute and round each percentage.

In the chi-square test of independence, we will be considering whether two variables are independent or not.

Two variables are independent if knowing the value of one does not affect the likelihood of any value of the other.

For example, if our two variables are independent, then knowing that someone lives in an urban area should not affect the probability that they fall into any one category of Number of books read in the last year.

Consider the following contingency table again. If knowing the Type of residence should not affect the likelihood of Number of books read in the last year, each column in our contingency table should have approximately the same distribution of Number of books read in the last year. In other words, the conditional distribution of Number of books read in the last year for each value of Type of residence should match the marginal distribution of Number of books read in the last year.

For example, the relative frequencies for the conditional distribution of Number of books read in the last year for urban dwellers should match the marginal distribution you found in Question 2. The relative frequencies of Number of books read in the last year for rural dwellers should also match that marginal distribution.

		Type of residence
		Urban	Suburban	Rural	Total
Number of books read	None	133	144	81	358
	1–4	146	149	53	348
	5–9	76	74	33	183
	10+	194	216	76	486
	Total	549	583	243	1,375

Let’s look again at the marginal distribution for the number of books read, but this time, we’ll include more decimal places so we can avoid rounding errors in our next calculation.

Relative frequency of number of books read as a percentage	None	0.26036364
	1–4	0.25309091
	5–9	0.13309091
	10+	0.35345455
	Total	1

Let’s imagine that the conditional distribution of Number of books read in the last year for urban dwellers had relative frequencies that matched the marginal distribution. Note that there are [latex]549[/latex] total urban dwellers, so [latex]0.26036364 (26.036364\%)[/latex] of them would have read no books, or[latex]26.036364\% \text{ of } 549 = 0.26036364 \times 549 = 142.940[/latex] urban dwellers would have read no books.

Pew Research Center. (2019). Core trends survey - Mobile technology and home broadband 2019. https://www.pewresearch.org/internet/dataset/core-trends-survey/ ↵