In this practical, we will use R to investigate the sampling distribution of Pearson’s chi-squared statistic for goodness of fit. We will do a sampling experiment for rolling a die, first in the case where the die is fair and then in the case that it is biased.

  • You will need the following skills from previous practicals:
    • Use of the functions sample, sapply and replicate.
    • Use of lotting functions such asplot, hist, lines and abline.
  • New R techniques:
    • Use of pchisq/dchisq for the chi-square distribution.
    • Use tabulate to count the number of instances of each integer.
    • Use of the pchisq, qchisq and chisq.test functions.

1 Rolling a fair die

We can simulate the rolling of a fair die using the sample function that we have seen before

sample(1:6, 1)

This samples a number between 1 and 6 with the probability of each number being equal. Actually, the function sample has 2 default values so that the above is really

sample(x=1:6,size=1, replace=FALSE, prob=c(1/6,1/6,1/6,1/6,1/6,1/6))

though you don’t have to type all of this. Type ?sample to recall how this function works. Here, the default is for each number in x to have an equal probability.

  • By changing the default value of replace how do you simulate 10 rolls of a fair die using sample?
  • Create a vector called rolls that contains 1200 simulated rolls of a fair die.

2 A chi-squared test by hand

We refer to the hypothesis that the die we have rolled is fair as the null hypothesis, \(H_0\). In this case we know \(H_0\) is true, however, the purpose of this experiment is to compare how a chi-squared test for examining \(H_0\) performs when we know that \(H_0\) is true.

To compute Pearson’s chi-squared statistic we need to identify the discrete set of possible outcomes (in our case the integers 1 to 6) and the expected number of each outcome in our experiment under \(H_0\). The statistic is

\[X^2=\sum_i\frac{(O_i-E_i)^2}{E_i}\]

where \(O_i\) is the observed number of instances of outcome \(i\) in the sample and \(E_i\) is the expected number of instances of outcome \(i\) in the sample under \(H_0\).

The repeat command: If we want to repeat a value multiple times in a vector we use rep(a ,n ) where a is the value and n is the number of repeats.

Counting integers: The function tabulate takes a large vector of integers and computes the number of instances of each integer. Suppose we have a large vector of integers between 1 and a called intVec , then we can compute the number of 1’s, 2’s, …, a’s using

tabulate(intVec, nbins = a)

Under \(H_0\), the probability of each outcome is 1/6, so we can get the vector E easily using rep via

 E <- rep(1200/6, 6)

We can count the number of 1’s, 2’s etc in our sample using the tabulate function.

 O <- tabulate(rolls, nbins=6)

(Note O here is the letter, not a zero!).

  • What is your value of X2?

In order to see if your value of X2 is unusually large under H0 we need to know the distribution of X2 under \(H_0\). We investigate this with a sampling experiment.

3 A repeated sampling experiment

We will write a function that simulates the rolling of 1200 dice and computes the chi-squared statistic. We will then use this function to do a repeated sampling experiment and look at the properties of the chi-squared distribution.

Functions with no arguments. So far, we have written our own R programs as functions of a number of inputs that we specify. We can write programs that do a certain task every time, without any input variable by simply not including an input variable at the beginning. Here is an example of a function that takes a random integer from 1 to 10

RanInt <- function(){ sample(1:10,1) }

The important thing to note is that when calling a function with no arguments, we still must include the brackets at the end, To use the function above, for example, we type RanInt().

Functions with default arguments. Default arguments are very useful in R, and most of the functions we have used up to now contained hidden defaults. For example, the function sample has the default argument replace=FALSE, amongst others.

A default allows us to change an argument if we wish to, but to not specify it when we call the function if we don’t. For example, sample(1:10,4) will sample 4 values from 1 to 10 without replacement by default. If we want to sample with replacement we write sample(1:10,4,replace=TRUE) in order to change the default.

To add defaults to our own functions we simply add the default value after an = sign in the inputs to the function. E.g.

RanInt <- function(N=10){ sample(1:N,1)}

Now RanInt draws a sample from the natural numbers up to N. If we don’t specify an N by typing RanInt(), we get a sample from the numbers 1 to 10, else we could specify RanInt(30) to get a number from 1 to 30.

  • Write a function, called rollFun, with no arguments that simulates 1200 rolls of a fair die and computes the chi-squared statistic in the way we computed it above.
  • When we call rollFun, we should get a single number back from the function. Call rollFun to check this is the case.
  • Use replicate to perform 1000 sampling experiments with rollFun and call the resulting vector of \(X^2\)’s, roll1200.

Plot your \(X^2\)’s as a histogram using a density scale on the vertical axis using the command

hist(roll1200, freq=FALSE)

The result is an approximation to the distribution of \(X^2\) under \(H_0\).

Overlaying distributions on histograms: We often want to apply theory that says a random sample behaves as a draw from a certain distribution. One useful sanity check for this is to draw a histogram of the sample and overlay the shape of the distribution to see if the sample is consistent with our assumptions. The function lines(x,y) allows us to draw lines on plots, including histograms.

To overlay a particular distribution onto a histogram, we require a large sequence x that that runs from the minimum to the maximum of the values in the histogram, and a y of the distribution function in question applied to each value in x.

We know by now how to generate sequences and we can get maxima and minima using commands given above. We can get the density function using d prefixed to one of the many distributions provided by R. The density function for the chi-squared distribution is dchisq, for example, and for the normal it is dnorm. Other distributions in R are unif, exp and pois. Don’t forget to prefix the d if you want the density function. You can read more about these functions using their help pages.

Putting this together: Suppose we have a vector x that we suppose is a random sample from a chi-squared distribution with degrees of freedom 5. The following commands will draw the histogram of the vector and then overlay the chi-squared density to check our supposition:

hist(x)

forLine <- seq(from=min(x), to = max(x), len=100)

lines(forLine, dchisq(forLine,df=5))

As introduced in the lectures, when all the \(E_i\) are greater than 5, we can approximate the distribution of \(X^2\) under \(H_0\) using a chi-squared distribution with \(k-1\) degrees of freedom, denoted by \(\chi^2_{k-1}\), where \(k\) is the number of possible outcomes (6 in our case). To see if our sample is consistent with a \(\chi^2_{5}\), we can overlay the \(\chi^2_{5}\) density on our histogram.

  • Use the commands seq and lines (see the technique tips) to overlay the \(\chi^2_{5}\) density onto your histogram. Is your sample consistent with the theory?
  • Add a vertical line showing the value of our test statistic for rolls using the command abline.

To quantify the goodness of fit of our value X2 under \(H_0\) we can use the p-value, which is the probability, under \(H_0\), of observing a value of \(X^2\) that is greater than our value, X2. We can either do this using our sampling distribution directly or through the approximate chi-square distribution.

  • Compare your sampling p-value with the p-value obtained from the \(\chi^2_{5}\) approximation (type ?pchisq). Are the two results close?

4 Using a biased die

We are now going to generate a sample from a biased die, and see what happens to the \(X^2\) statistic. To do this we will modify rollFun to accept an argument containing probabilities for the different values of the die, but with a default that the die is fair.

  • Edit rollFun so that it has an input called probs with default value rep(1/6,6), and so that the sample of 1200 die rolls uses the vector probs for the probability of rolling 1,…,6.

4.1 A die with a small bias

To see what happens to the \(X^2\) statistic using slightly biased dice, we are going to redraw the histogram of our sampling distribution of \(X^2\) under \(H_0\) using a larger \(x\)-axis to allow us to view large values. Type the following before moving on.

hist(roll1200, breaks=seq(from=0,to=50,by=5), freq=FALSE)

We will now create a die that is slightly biased away from 6 towards 1 using the following vector of probabilities:

 bp <- c(1/6 + 0.02, 1/6, 1/6, 1/6, 1/6, 1/6 - 0.02)
  • Create a function rollFun1 with an argument probs, which calculates the \(X^2\) statistic for the biased die. You can do this by calling rollFun(probs=bp) inside the function. Create the function in such a way that it returns the value of \(X^2\) and that it also adds this value as a blue vertical line on the existing histogram. You many define the colour of the line by letting the function have a second argument with default option colour = 'blue' Apply your function to the histogram.
  • Repeat this procedure 10 times, using replicate.
  • What do you notice about the blue vertical lines on the histogram?
  • Would the results of this experiment lead you to suspect that the die was biased if you didn’t know?

4.2 A die with a larger bias

  • Repeat the above experiment with bp <- c(1/6 + 0.04, 1/6, 1/6, 1/6, 1/6, 1/6 - 0.04), this time changing the colour of the vertical line to red. What do you notice?
  • Each experiment, corresponding to each line on our histogram, represents 1200 recorded rolls of a die. Receiving 1 of these experiments in real life would normally be considered a large data set. Would any of the 20 or so biased die experiments you have done, if seen alone, lead you not to suspect that the die was biased, based on a chi-squared test, if seen in isolation?
  • For what values of \(X^2\) would we reject the \(H_0\) at a 5% level of significance? Hint: type ?qchisq to see how to find the corresponding critical value of the test.

5 Further exercises

5.1 Example 1

We can use the pre-defined function supported by the stats package in R, in order to apply chi-square test to the example in the lecture notes.

Suppose we wish to know whether a die is fair or not. If it is fair then each face is equally likely to occur, i.e. 1/6. To test this claim, a die is rolled 120 times and the number of times each face appeared was recorded.

DieValue 1 2 3 4 5 6
Observed 10 15 16 25 24 30
  • Enter the data into R, then use the chisq.test to perform chi-square test for goodness of fit. Type ?chisq.test for help.

5.2 Example 2

A random sample of 1000 residents of a major city to gather information on the alcohol consumption yielded the data displayed below.

Less than 1 1-50 Over 50
Single 38 120 42
Married 232 357 73
Widowed 48 29 4
Divorced 15 34 8
  • Enter the data into R in the form of a 4 by 3 matrix, then use the chisq.test to perform chi-square test for independence.