# # PRACTICAL 2-4: Chi-square tests # --------------------------------- # # In this practical, we will use R to investigate # the sampling distribution of Pearson's chi-squared statistic for goodness of # fit. We will do a sampling experiment for rolling a die, first in the case # where the die is fair and then in the case that it is biased. # 1. Rolling a fair die ------------------------------------------------- # We can simulate the rolling of a fair die using the sample function that we # have seen before # sample(1:6, 1) # # This samples a number between 1 and 6 with the probability of each number # being equal. Actually, the function sample has 2 default values so that the # above is really # sample(x=1:6,size=1, replace=FALSE, prob=c(1/6,1/6,1/6,1/6,1/6,1/6)) # # though you don't have to type all of this. You can see this for yourself in # the help file for sample. Here, the default is for each number in x # (1,2,3,4,5,6) to have an equal probability. # # Question: By changing the default value of replace how do you simulate 10 # rolls of a fair die using sample? # Question: Create a vector called rolls that contains 1200 simulated rolls of a # fair die. # 2. A chi-squared test by hand ------------------------------------------- # We refer to the hypothesis that the die we have rolled is fair as the null # hypothesis, H0. In this case we know H0 is true, however, the purpose of this # experiment is to compare how a chi-squared test for examining H0 performs when # we know that H0 is true. # # To compute Pearson's chi-squared statistic we need # to identify the discrete set of possible outcomes (in our case the integers # 1 to 6) and the expected number of each outcome in our experiment under # H0. The statistic is # _____ 2 # 2 \ /O - E \ # X = \ \ i i/ # / ------- # /____ E # i i # where O is the observed number of instances of outcome i in the sample and # i # E is the expected number of instances of outcome i in the sample under H0. # i # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ # ^ TECHNIQUE: # ^ # The **repeat** command: If we want to repeat a value multiple times in a vector we use # ```rep(a ,n )``` # where `a` is the value and `n` is the number of repeats. # **Maxima and minima**: To find the maximum and minimum of a vector `x` , use the commands `max(x)` and `min(x)`. # **Counting integers**: # The function `tabulate` takes a large vector of integers and computes the number of instances of each integer. Suppose we have a large vector of # integers between 1 and `a` called `intVec` , then we can compute the number of `1`’s, `2`’s, ..., `a`’s using # #```tabulate(intVec, a)``` # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # Under H0, the probability of each outcome is 1/6, so we can get the vector # E easily using rep via # E = rep(1200/6, 6) # # We can count the number of 1's, 2's etc in our sample using the tabulate # function. See the techniques sheet or the online help for more details. # O = tabulate(rolls, nbins=6) # # (Note O here is an `Oh', not a zero!). Finally, to get the test statistic # 2 # value, we can simply create the vector (O-E) /E and sum its values via # # # Question: What is your value of X2? # In order to see if your value of X2 is unusually large under H0 we need to # know the distribution of X2 under H0. We investigate this with a sampling # experiment. # 3. A repeated sampling experiment -------------------------------------- # We will write a program that simulates the rolling of 1200 dice and computes # the chi-squared statistic. We will then use this program to do a repeated # sampling experiment and look at the properties of the chi-squared # distribution. # # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ # ^ TECHNIQUE: # ^ # ***Functions with no arguments*** # So far, we have written our own _R_ programs as functions of a number of inputs that we specify. We can write programs that do a certain task every # time, without any input variable by simply not including an input variable at the beginning. Here is an example of a function that takes a random #integer from 1 to 10 #` #RanInt <- function(){ sample(1:10,1) #} #` # The important thing to note is that when *calling* a function with no arguments, we still **must include the brackets at the end!** To use the function # above, for example, we type `RanInt()`. # **Functions with default arguments** # Default arguments are very useful in _R_, and most of the functions we have used up to now contained hidden defaults. For example, the function # `sample` has the default argument `replace=FALSE`, amongst others. # A default allows us to change an argument if we wish to, but to not specify it when we call the function if we don’t. For example, `sample(1:10,4)` # will sample 4 values from 1 to 10 without replacement by default. If we want to sample with replacement we write `sample(1:10,4,replace=TRUE)` in # order to change the default. # To add defaults to our own functions we simply add the default value after an `=` sign in the inputs to the function. E.g. #`RanInt <- function(N=10){ sample(1:N,1) #}` #Now `RanInt` draws a sample from the natural numbers up to `N`. If we don’t specify an `N` by typing `RanInt()`, we get a sample from the numbers 1 to #10, else we could specify `RanInt(30)` to get a number from 1 to 30. # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # Question: Write a function, called rollFun, with no arguments that simulates # 1200 rolls of a fair die and computes the chi-squared statistic in the way we # computed it above. Write your script below. # Question: When we call rollFun, we should get a single number back from the # function. Call rollFun to check this is the case. Don't forget the brackets! # Question: Use replicate to perform 1000 sampling experiments with rollFun and # call the resulting vector of chi-squared statistics, roll1200. # Plot your statistics as a histogram using a density scale on the vertical axis # using the command # hist(roll1200, freq=FALSE) # The result is an approximation to the distribution of the chi-squared # statistic under H0. # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ # ^ TECHNIQUE: # ^ #***Overlaying distributions on histograms***: #We often want to apply theory that says a random sample behaves as a draw from a certain distribution. One useful sanity check for this is to draw # a histogram of the sample and overlay the shape of the distribution to see if the sample is consistent with our assumptions. The function # `lines(x,y)` allows us to draw lines on plots, including histograms. #To overlay a particular distribution onto a histogram, we require a large sequence `x` that that runs from the minimum to the maximum of #the values in the histogram, and a `y` of the distribution function in question applied to each value in `x`. #We know by now how to generate sequences and we can get maxima and minima using commands given above. We can get the density function using # `d` prefixed to one of the many distributions provided by _R_. The density function for the chi-squared distribution is `dchisq`, for example, # and for the normal it is `dnorm`. Other distributions in _R_ are `unif`, `exp` and `pois`. Don’t forget to prefix the `d` if you want the #density function. You can read more about these functions using their help pages. # Putting this together: Suppose we have a vector `x` that we suppose is a random sample from a chi-squared distribution with degrees of freedom 5. The # following commands will draw the histogram of the vector and then overlay the chi-squared density to check our supposition: #```hist(x)``` #```forLine <- seq(from=min(x), to = max(x), len=100) ``` #```lines(forLine, dchisq(forLine,df=5))``` # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ #As introduced in the lectures, when all the E are # 2 i # greater than 5, we can approximate the distribution of X under H0 using a # chi-squared distribution with k-1 degrees of freedom, where k is the number of # possible outcomes (6 in our case). To see if our sample is consistent with a # chi-squared(5) distribution, we can overlay the chi-squared(5) density on our # histogram. # # Question: Use the commands `seq` and `lines` (see the technique tips) to overlay the # $\chi^2_{5}$ density onto your histogram. Is your sample consistent with the # theory? # Question: Add a vertical line showing the value of our test statistic for # `rolls` using the command `abline`. # 3.1 Computing p-values --------------------------------------------------- # To quantify the `goodness of fit' of our value X2 under H0 we can use the # 2 # p-value, which is the probability, under H0, of observing a value of X # that is greater than our value, X2. We can either do this using our sampling # distribution directly via # # Question: Compare your sampling p-value with the p-value obtained from the # $\chi^2_{5}$ approximation. Are the two results close? # 4. Using a biased die -------------------------------------------------- # We are now going to generate a sample from a biased die, and see what happens # to the chi-squared statistic. To do this we will modify rollFun to accept an # argument containing probabilities for the different values of the die, but # with a default that the die is fair. # Question: Edit rollFun so that it has an input called probs with default value # rep(1/6,6), and so that the sample of 1200 die rolls uses the vector probs for # the probability of rolling 1,..., 6. # To see what happens to the chi-squared statistic using slightly biased dice, # 2 # we are going to redraw the histogram of our sampling distribution of X # under H0 using a larger x-axis to allow us to view large values. # Type the following before moving on. # hist(roll1200, breaks=seq(from=0,to=50,by=2.5), freq=FALSE) hist(roll1200,breaks=seq(0,50,2.5),freq=FALSE) #4.1 A die with a small bias ----------------------------------------------------- # We will create a die that is slightly biased away from 6 towards one using the # following vector of probabilities: # # bp = c(1/6 + 0.02, 1/6, 1/6, 1/6, 1/6, 1/6 - 0.02) # # Create a function `rollFun1` with an argument `probs`, which calculates the $X^2$ statistic for the biased die. # You can do this by calling `rollFun(probs)` inside the function. # Create the function in such a way that it returns the value of $X^2$ and that it also # adds this value as a blue vertical line on the existing histogram. You many define the colour # of the line by letting the function have a default argument `col = 'blue'` bp = c(1/6 + 0.02, 1/6, 1/6, 1/6, 1/6, 1/6 - 0.02) # Question: What do you notice about the blue vertical lines on the histogram? # Would the results of this experiment lead you to suspect that the # die was biased if you didn't know? #4.2 A die with a larger bias ------------------------------------------------- # Question: Repeat the above experiment 10 times with # # bp = c(1/6 + 0.04, 1/6, 1/6, 1/6, 1/6, 1/6 - 0.04) # # this time changing the colour of the vertical line to red. What do you notice? # Question: Each experiment, corresponding to each line on our histogram, # represents 1200 recorded rolls of a die. Receiving 1 of these experiments in # real life would normally be considered a large data set. Would any of the 20 # or so biased die experiments you have done, if seen alone, lead you not to # suspect that the die was biased, based on a chi-squared test, if seen in # isolation? #Question: For what values of $X^2$ would we reject the $H_0$ at a 5% level of significance? # 5. Further exercises ## Example 1 # We can use the pre-defined function supported by the stats package in _R_, in order to apply chi-square test to the example in the lecture notes. # Suppose we wish to know whether a die is fair or not. If it is fair then each face is equally likely to occur, i.e. 1/6. To test this claim, # a die is rolled 120 times and the number of times each face appeared was recorded. # Question: Enter the data into _R_, then use the `chisq.test` to perform chi-square test # for goodness of fit. Type `?chisq.test` for help. ## Example 2 # A random sample of 1000 residents of a major city to gather information on the alcohol consumption # yielded the data displayed below. # Question: Enter the data into _R_ in the form of a 4 by 3 matrix, then use the `chisq.test` to perform chi-square test for independence.