In this practical, we will use R to investigate the sampling distribution of Pearson’s chi-squared statistic for goodness of fit. We will do a sampling experiment for rolling a die, first in the case where the die is fair and then in the case that it is biased.
The R functions we will use include:

- sample, sapply and replicate
- plot, hist, lines and abline
- pchisq and dchisq for the chi-squared distribution
- tabulate to count the number of instances of each integer
- pchisq, qchisq and chisq.test
We can simulate the rolling of a fair die using the sample function that we have seen before:

sample(1:6, 1)
This samples a number between 1 and 6 with the probability of each number being equal. In fact, the function sample has two further arguments with default values, so that the above is really

sample(x=1:6, size=1, replace=FALSE, prob=c(1/6,1/6,1/6,1/6,1/6,1/6))

though you don't have to type all of this. Type ?sample to recall how this function works. Here, the default is for each number in x to have an equal probability.
By considering the replace argument, how would you simulate 10 rolls of a fair die using sample?

Create a vector called rolls that contains 1200 simulated rolls of a fair die.
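One possible way of building this vector (a sketch, assuming we sample with replacement so that the rolls are independent):

rolls <- sample(1:6, size = 1200, replace = TRUE)   # 1200 independent rolls of a fair die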
We refer to the hypothesis that the die we have rolled is fair as the null hypothesis, \(H_0\). In this case we know \(H_0\) is true; the purpose of this experiment is to see how a chi-squared test of \(H_0\) performs when we know that \(H_0\) is true, so that we can later compare this with the case of a biased die.
To compute Pearson’s chi-squared statistic we need to identify the discrete set of possible outcomes (in our case the integers 1 to 6) and the expected number of each outcome in our experiment under \(H_0\). The statistic is
\[X^2=\sum_i\frac{(O_i-E_i)^2}{E_i}\]
where \(O_i\) is the observed number of instances of outcome \(i\) in the sample and \(E_i\) is the expected number of instances of outcome \(i\) in the sample under \(H_0\).
The rep function: If we want to repeat a value multiple times in a vector we use rep(a, n), where a is the value and n is the number of repeats.
Counting integers: The function tabulate takes a large vector of integers and computes the number of instances of each integer. Suppose we have a large vector of integers between 1 and a called intVec; then we can compute the number of 1's, 2's, …, a's using

tabulate(intVec, nbins = a)
Under \(H_0\), the probability of each outcome is 1/6, so each expected count is 1200/6 = 200 and we can get the vector E easily using rep via

E <- rep(1200/6, 6)

We can count the number of 1's, 2's, etc. in our sample using the tabulate function:

O <- tabulate(rolls, nbins=6)

(Note that O here is the letter, not a zero!)
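With O and E in place, the statistic itself is a single line of R (a sketch, storing the result in a variable called X2 to match the notation used below):

X2 <- sum((O - E)^2 / E)   # Pearson's chi-squared statistic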
In order to see if your value of \(X^2\) is unusually large under \(H_0\), we need to know the distribution of \(X^2\) under \(H_0\). We investigate this with a sampling experiment.
We will write a function that simulates the rolling of 1200 dice and computes the chi-squared statistic. We will then use this function to do a repeated sampling experiment and look at the properties of the chi-squared distribution.
Functions with no arguments. So far, we have written our own R programs as functions of a number of inputs that we specify. We can write programs that do a certain task every time, without any input variable by simply not including an input variable at the beginning. Here is an example of a function that takes a random integer from 1 to 10
RanInt <- function(){ sample(1:10,1) }
The important thing to note is that when calling a function with no arguments, we must still include the brackets at the end. To use the function above, for example, we type RanInt().
Functions with default arguments. Default
arguments are very useful in R, and most of the functions we
have used up to now contained hidden defaults. For example, the function
sample
has the default argument replace=FALSE
,
amongst others.
A default allows us to change an argument if we wish to, but to leave it out of the call if we don't. For example,
sample(1:10,4)
will sample 4 values from 1 to 10 without
replacement by default. If we want to sample with replacement we write
sample(1:10,4,replace=TRUE)
in order to change the
default.
To add defaults to our own functions we simply add the default value
after an =
sign in the inputs to the function. E.g.
RanInt <- function(N=10){ sample(1:N,1)}
Now RanInt draws a sample from the natural numbers up to N. If we don't specify an N and just type RanInt(), we get a sample from the numbers 1 to 10; alternatively, we could type RanInt(30) to get a number from 1 to 30.
Write a function, rollFun, with no arguments, that simulates 1200 rolls of a fair die and computes the chi-squared statistic in the way we computed it above.

Each time we call rollFun, we should get a single number back from the function. Call rollFun to check this is the case.
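A minimal sketch of what such a function might look like, reusing the calculations above:

rollFun <- function(){
  rolls <- sample(1:6, size = 1200, replace = TRUE)   # simulate 1200 rolls of a fair die
  O <- tabulate(rolls, nbins = 6)                     # observed counts
  E <- rep(1200/6, 6)                                 # expected counts under H0
  sum((O - E)^2 / E)                                  # Pearson's chi-squared statistic
}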
Use replicate to perform 1000 sampling experiments with rollFun and call the resulting vector of \(X^2\)'s roll1200.
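For example, assuming rollFun is defined as above:

roll1200 <- replicate(1000, rollFun())   # 1000 simulated values of the statistic under H0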
Plot your \(X^2\)'s as a histogram, with a density scale on the vertical axis, using the command
hist(roll1200, freq=FALSE)
The result is an approximation to the distribution of \(X^2\) under \(H_0\).
Overlaying distributions on histograms: We
often want to apply theory that says a random sample behaves as a draw
from a certain distribution. One useful sanity check for this is to draw
a histogram of the sample and overlay the shape of the distribution to
see if the sample is consistent with our assumptions. The function
lines(x,y)
allows us to draw lines on plots, including
histograms.
To overlay a particular distribution onto a histogram, we require a large sequence x that runs from the minimum to the maximum of the values in the histogram, and a vector y containing the density function in question evaluated at each value in x.
We know by now how to generate sequences and we can get maxima and
minima using commands given above. We can get the density function using
d
prefixed to one of the many distributions provided by
R. The density function for the chi-squared distribution is
dchisq
, for example, and for the normal it is
dnorm
. Other distributions in R are
unif
, exp
and pois
. Don’t forget
to prefix the d
if you want the density function. You can
read more about these functions using their help pages.
Putting this together: Suppose we have a vector x
that
we suppose is a random sample from a chi-squared distribution with
degrees of freedom 5. The following commands will draw the histogram of
the vector and then overlay the chi-squared density to check our
supposition:
hist(x, freq=FALSE)
forLine <- seq(from=min(x), to = max(x), len=100)
lines(forLine, dchisq(forLine,df=5))
As introduced in the lectures, when all the \(E_i\) are greater than 5, we can approximate the distribution of \(X^2\) under \(H_0\) using a chi-squared distribution with \(k-1\) degrees of freedom, denoted by \(\chi^2_{k-1}\), where \(k\) is the number of possible outcomes (6 in our case). To see if our sample is consistent with a \(\chi^2_{5}\), we can overlay the \(\chi^2_{5}\) density on our histogram.
Use seq and lines (see the technique tips) to overlay the \(\chi^2_{5}\) density onto your histogram. Is your sample consistent with the theory?
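A sketch of the overlay, assuming the histogram of roll1200 has been drawn with freq=FALSE as above:

forLine <- seq(from = min(roll1200), to = max(roll1200), len = 100)
lines(forLine, dchisq(forLine, df = 5))   # chi-squared density with 5 degrees of freedom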
Add a vertical line to your histogram at the value of \(X^2\) that you computed from rolls, using the command abline.
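For example, if the statistic computed from rolls is stored in a variable called X2 (an assumed name):

abline(v = X2)   # vertical line at the observed value of the statistic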
To quantify the goodness of fit of our value of \(X^2\) under \(H_0\) we can use the p-value, which is the probability, under \(H_0\), of observing a value of \(X^2\) that is greater than the value we observed. We can compute this either from our sampling distribution directly or through the approximate chi-squared distribution.
Compute the p-value in both ways (see ?pchisq). Are the two results close?
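A sketch of both calculations, assuming the observed statistic is stored in X2 and the simulated values in roll1200:

mean(roll1200 > X2)                      # p-value estimated from the sampling distribution
pchisq(X2, df = 5, lower.tail = FALSE)   # p-value from the chi-squared approximation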
We are now going to generate a sample from a biased die, and see what
happens to the \(X^2\) statistic. To do
this we will modify rollFun
to accept an argument
containing probabilities for the different values of the die, but with a
default that the die is fair.
Modify rollFun so that it has an input called probs with default value rep(1/6,6), and so that the sample of 1200 die rolls uses the vector probs for the probabilities of rolling 1, …, 6.
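One possible version of the modified function (a sketch; note that the expected counts E are still those under \(H_0\), i.e. a fair die):

rollFun <- function(probs = rep(1/6, 6)){
  rolls <- sample(1:6, size = 1200, replace = TRUE, prob = probs)
  O <- tabulate(rolls, nbins = 6)
  E <- rep(1200/6, 6)        # expected counts under H0 (fair die)
  sum((O - E)^2 / E)
}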
To see what happens to the \(X^2\) statistic using slightly biased dice, we are going to redraw the histogram of our sampling distribution of \(X^2\) under \(H_0\) using a larger \(x\)-axis to allow us to view large values. Type the following before moving on.
hist(roll1200, breaks=seq(from=0,to=50,by=5), freq=FALSE)
We will now create a die that is slightly biased away from 6 towards 1 using the following vector of probabilities:
bp <- c(1/6 + 0.02, 1/6, 1/6, 1/6, 1/6, 1/6 - 0.02)
Write a function rollFun1, with an argument probs, which calculates the \(X^2\) statistic for the biased die. You can do this by passing probs on to rollFun inside the function. Create the function in such a way that it returns the value of \(X^2\) and also adds this value as a blue vertical line on the existing histogram. You may define the colour of the line by letting the function have a second argument with default option colour = 'blue'. Apply your function to the histogram.
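A sketch of such a function, under the assumptions above:

rollFun1 <- function(probs, colour = 'blue'){
  X2 <- rollFun(probs = probs)    # chi-squared statistic for a die with these face probabilities
  abline(v = X2, col = colour)    # mark the value on the existing histogram
  X2                              # return the value of the statistic
}
rollFun1(probs = bp)              # e.g. applied with the biased probabilities bp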
Repeat this experiment several times using replicate.
Now make the die more strongly biased by setting

bp <- c(1/6 + 0.04, 1/6, 1/6, 1/6, 1/6, 1/6 - 0.04)

and apply your function again, this time changing the colour of the vertical line to red. What do you notice?
Type ?qchisq to see how to find the corresponding critical value of the test.
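For instance, at a 5% significance level (an assumed level, for illustration) the critical value would be

qchisq(0.95, df = 5)   # values of X^2 above this lead us to reject H0 at the 5% level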
We can use the pre-defined function in the stats package in R to apply the chi-squared test to the example in the lecture notes.

Suppose we wish to know whether a die is fair or not. If it is fair, then each face is equally likely to occur, i.e. with probability 1/6. To test this claim, a die is rolled 120 times and the number of times each face appeared was recorded.
| Die value | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Observed | 10 | 15 | 16 | 25 | 24 | 30 |
Use chisq.test to perform a chi-squared test for goodness of fit. Type ?chisq.test for help.
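A sketch of this test, assuming the observed counts are entered in the order shown in the table (by default chisq.test tests against equal probabilities):

observed <- c(10, 15, 16, 25, 24, 30)
chisq.test(observed)   # goodness-of-fit test against probabilities of 1/6 for each face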
A random sample of 1000 residents of a major city, taken to gather information on alcohol consumption, yielded the data displayed below.
| Marital status | Less than 1 | 1-50 | Over 50 |
|---|---|---|---|
| Single | 38 | 120 | 42 |
| Married | 232 | 357 | 73 |
| Widowed | 48 | 29 | 4 |
| Divorced | 15 | 34 | 8 |
Use chisq.test to perform a chi-squared test for independence.
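A sketch of this test, assuming the table is entered as a matrix with the rows and columns in the order shown above:

alcohol <- matrix(c(38, 120, 42,
                    232, 357, 73,
                    48, 29, 4,
                    15, 34, 8),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("Single", "Married", "Widowed", "Divorced"),
                                  c("Less than 1", "1-50", "Over 50")))
chisq.test(alcohol)   # test of independence between marital status and consumption category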