In this practical, we illustrate frequentist techniques for comparing two samples under the assumption that the populations are normal. Our objective is to assess whether two samples can be regarded as coming from the same population.
Along the way we will make use of a number of familiar R functions: c, seq, length, mean, sd, var, min, and max; drawing a plot with plot, a histogram with hist, and a boxplot with boxplot; and the functions qqnorm, qnorm, and pnorm for working with the Normal distribution.
Eriksen, Björnstad and Götestam (1986) studied a social skills training programme for alcoholics. Twenty-three alcohol-dependent male inpatients at an alcohol treatment centre were randomly assigned to two groups. The 12 control group patients were given a traditional treatment programme. The 11 treatment group patients were given the traditional treatment, plus a class in social skills training (“SST”). The patients were monitored for one year at 2-week intervals, and their total alcohol intake over the year was recorded (in cl pure alcohol).
A: Control | 1042 | 1617 | 1180 | 973 | 1552 | 1251 | 1151 | 1511 | 728 | 1079 | 951 | 1391 |
B: SST | 874 | 389 | 612 | 798 | 1152 | 893 | 541 | 741 | 1064 | 862 | 213 |
Use the c function to enter these data into R as two vectors called A for the control group, and B for the treatment group.

summary(A)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 728 1025 1166 1202 1421 1617
summary(B)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 213.0 576.5 798.0 739.9 883.5 1152.0
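For reference, here is one way to enter the data (the values are transcribed directly from the table above):

A <- c(1042, 1617, 1180, 973, 1552, 1251, 1151, 1511, 728, 1079, 951, 1391)  # control group
B <- c(874, 389, 612, 798, 1152, 893, 541, 741, 1064, 862, 213)              # SST group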
boxplot(A,B)
par(mfrow=c(1,2))
hist(A)
hist(B)
A better graphical way to assess whether the data are normally distributed is to draw a normal quantile plot (or quantile-quantile, QQ plot). We can do this in R using the qqnorm function: to draw the Normal quantile plot for a vector of data x, we use the command

qqnorm(x)
With this technique, the quantiles of the data (i.e. the ordered data values) are plotted against the quantiles which would be expected of a matching normal distribution. If the data are normally distributed, then the points of the QQ plot should lie on an approximately straight line. Deviations from a straight line suggest departures from the normal distribution.
The straight line can be superimposed on the plot by using the following command
qqline(x)
Draw Normal quantile plots of both samples using qqnorm. Do the samples look approximately Normally distributed? Add the lines using qqline. Do your conclusions agree with those made on the basis of the histograms?
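For example, the two QQ plots can be drawn side by side:

par(mfrow = c(1, 2))
qqnorm(A); qqline(A)   # control group
qqnorm(B); qqline(B)   # SST group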
We will use the independent sample \(t\)-test to compare these two samples. First, let’s recap the underlying theory (skip this if you remember the setup).
We assume that we have two random quantities \(X\sim\mathcal{N}\left({\mu_X, \sigma^2}\right)\) and \(Y \sim \mathcal{N}\left({\mu_Y, \sigma^2}\right)\). We obtain an i.i.d. sample of \(n\) observations of \(X\): \(X_1, \dots, X_n\); and a sample of size \(m\) from \(Y\): \(Y_1,\dots, Y_m\). We assume that each \(X_i\) is independent of each \(Y_j\). Note that we have assumed different means for \(X\) and \(Y\), but the same variances.
The question we are asking is: are the means of the two populations equal, i.e. is \(\mu_X=\mu_Y\), given the evidence we see in our samples?
Under the above assumptions, we have the following definitions and results:
The independent sample \(t\)-statistic is: \[ t = \frac{(\bar{x}-\bar{y}) - (\mu_X - \mu_Y)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}, \]
where \(s_p^2\) is the pooled sample variance, which is used to estimate the common variance \(\sigma^2\) using data from both samples: \[ s_p^2 = \frac{(n-1)s_X^2 + (m-1)s_Y^2}{n+m-2} \]
Under \(H_0: \mu_X=\mu_Y\), \(t\) has a \(t\)-distribution with \(m+n-2\) degrees of freedom.
The \(t\)-test rejects \(H_0\) at level \(\alpha\) in favour of the alternative \(H_1: \mu_X\neq\mu_Y\) if \[ |t| > t^*_{(m+n-2), \alpha/2} \]
The corresponding \(100(1-\alpha)\%\) confidence interval for \(\mu_X - \mu_Y\) is then \[ (\bar{x}-\bar{y}) \pm t^*_{(m+n-2), \alpha/2} \; s_p \sqrt{\frac{1}{n} + \frac{1}{m}} \]
To compute the \(t\)-test statistic, we first need some simple summaries of our two samples.
Use the mean, var, and length functions to find the sample mean, sample variance, and sample size for the group A (control) sample. Save these as abar (\(\bar{a}\)), sa2 (\(s_a^2\)), and n. Do the same for group B, saving the results as bbar (\(\bar{b}\)), sb2 (\(s_b^2\)), and m.

Combine these to compute the pooled sample variance, saved as sp2 (\(s_p^2\)), and hence the test statistic, saved as t. Check you get the same value as shown below. Finally, compute the degrees of freedom for the test and save this as df.
## [1] 4.00757
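A minimal sketch of these computations, assuming the vectors A and B entered earlier:

abar <- mean(A); sa2 <- var(A); n <- length(A)        # group A summaries
bbar <- mean(B); sb2 <- var(B); m <- length(B)        # group B summaries
sp2 <- ((n - 1) * sa2 + (m - 1) * sb2) / (n + m - 2)  # pooled sample variance
t <- (abar - bbar) / (sqrt(sp2) * sqrt(1/n + 1/m))    # test statistic
df <- n + m - 2                                       # degrees of freedom
t   # should reproduce the value shown above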
R provides a range of functions to support calculations with standard probability distributions. In the previous practical, we encountered the normal density function dnorm, as well as the random number generation functions for the uniform (runif) and normal (rnorm) distributions.
For every distribution there are four functions. The functions for each distribution begin with a particular letter to indicate the functionality (see table below) followed by the name of the distribution:
Letter | e.g. | Function |
---|---|---|
“d” | dnorm | evaluates the probability density (or mass) function, \(f(x)\) |
“p” | pnorm | evaluates the cumulative distribution function, \(F(x)=P[X \leq x]\), hence finds the probability that the specified random variable is less than or equal to the given argument |
“q” | qnorm | evaluates the inverse cumulative distribution function (quantiles), \(F^{-1}(q)\), i.e. the value \(x\) such that \(P[X \leq x] = q\); used to obtain critical values associated with particular probabilities \(q\) |
“r” | rnorm | generates random numbers |
The appropriate functions for Normal, \(t\) and \(\chi^2\) distributions are given below, along with the optional parameter arguments.
Normal: dnorm, pnorm, qnorm, rnorm. Parameters: mean (\(\mu\)) and sd (\(\sigma\)).

\(t\): dt, pt, qt, rt. Parameter: df.

Chi-squared (\(\chi^2\)): dchisq, pchisq, qchisq, rchisq. Parameter: df.

For a list of all the supported distributions, run the command
help(Distributions)
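For instance, for the standard Normal distribution (mean 0 and sd 1 are the defaults):

dnorm(0)      # density at zero, 1/sqrt(2*pi), about 0.399
pnorm(1.96)   # P[X <= 1.96], about 0.975
qnorm(0.975)  # the 97.5% quantile, about 1.96
rnorm(3)      # three random draws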
Use the qt function to find the critical \(t\) value for the test of \(H_0: \mu_X=\mu_Y\) against \(H_1: \mu_X\neq\mu_Y\) at the 5% level of significance when we have df degrees of freedom. Check you get the same value as below.
## [1] 2.079614
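This value can be reproduced with qt; a sketch, assuming df holds the 21 degrees of freedom computed earlier:

qt(1 - 0.05/2, df)   # two-sided test at the 5% level: upper 2.5% point
## [1] 2.079614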
Compare this critical value with your test statistic t, and hence perform the significance test.

If we abandon the assumption of equal variances, then our first sample is i.i.d. \(\mathcal{N}\left({\mu_X, \sigma_X^2}\right)\) and the second sample i.i.d. \(\mathcal{N}\left({\mu_Y, \sigma_Y^2}\right)\), with \(\sigma_X\) not necessarily equal to \(\sigma_Y\). Clearly, there is no single variance parameter to estimate, and the test which uses the pooled sample variance would not be appropriate. We won’t cover the detail of the theory for this in lectures (nor is it examinable), but the idea is straightforward and easy to apply. Without the equal-variance assumption, our test statistic becomes \[ t = \frac{(\bar{x}-\bar{y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}}, \] which still has (approximately) a \(t\) distribution. The problem arises in determining the appropriate degrees of freedom for that distribution: the degrees of freedom is no longer \(n+m-2\), but is instead approximated by \[ \nu \approx \frac{\left(\frac{s^2_X}{n}+\frac{s^2_Y}{m}\right)^2}{\frac{s^4_X}{n^2(n-1)} + \frac{s^4_Y}{m^2(m-1)}}, \] though often in practice we take the lazier route of using \[\nu=\min(n,m)-1\] as the degrees of freedom (this simpler choice gives a conservative version of the test).
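As a sketch, reusing the summaries abar, sa2, n, bbar, sb2, and m computed earlier, the unequal-variance statistic and its approximate degrees of freedom are:

tw <- (abar - bbar) / sqrt(sa2/n + sb2/m)        # test statistic without pooling
nu <- (sa2/n + sb2/m)^2 /
  ((sa2/n)^2 / (n - 1) + (sb2/m)^2 / (m - 1))    # Welch approximation to the df
min(n, m) - 1                                    # the simpler, conservative choice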
The stats package

Thankfully, tests such as \(t\)-tests are supported by the stats package in R, which allows us to pass over the problem of computing the test line by line; we can then simply run the test and interpret the results.
Use the library function to load the R package stats.

Look up the t.test function and apply it to your two samples A and B. Use the optional argument var.equal to perform an equal-variance test (TRUE), or FALSE to test without this assumption. Compare with your results from Section 2.
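For example (note that var.equal = FALSE is the default in t.test):

library(stats)                    # usually already loaded
t.test(A, B, var.equal = TRUE)    # equal-variance (pooled) test
t.test(A, B, var.equal = FALSE)   # Welch test, no equal-variance assumption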
Write your own function to carry out the two-sample \(t\)-test, behaving like the t.test function. Give it an equalvariance argument that can be TRUE or FALSE, adjusting the test that is performed. You will need to use an if statement to handle the different cases.
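One possible sketch, returning the statistic, degrees of freedom, and critical value (the alpha argument is an extra not required by the exercise):

my.t.test <- function(x, y, equalvariance = TRUE, alpha = 0.05) {
  n <- length(x); m <- length(y)
  if (equalvariance) {
    # pooled-variance test with n + m - 2 degrees of freedom
    sp2 <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)
    t <- (mean(x) - mean(y)) / sqrt(sp2 * (1/n + 1/m))
    df <- n + m - 2
  } else {
    # Welch test with approximate degrees of freedom
    se2 <- var(x)/n + var(y)/m
    t <- (mean(x) - mean(y)) / sqrt(se2)
    df <- se2^2 / ((var(x)/n)^2 / (n - 1) + (var(y)/m)^2 / (m - 1))
  }
  crit <- qt(1 - alpha/2, df)
  list(t = t, df = df, critical = crit, reject = abs(t) > crit)
}
my.t.test(A, B, equalvariance = FALSE)   # compare with t.test(A, B)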
Finally, explore the immer data set from the MASS package, which contains pairs of measurements of the barley yield from the same fields in years 1931 (Y1) and 1932 (Y2). Compare the yields in the two years by applying the t.test function using the argument paired=TRUE.
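A sketch of the paired analysis:

library(MASS)
t.test(immer$Y1, immer$Y2, paired = TRUE)   # paired t-test on the 1931 vs 1932 yields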