In this practical, we illustrate frequentist techniques for comparing two samples under the assumption that the populations are normal. Our objective is to assess whether the two samples can be regarded as coming from the same population.

  • You will need the following skills from previous practicals:
    • Basic R skills with arithmetic, functions, etc.
    • Manipulating and creating vectors: c, seq, length
    • Calculating data summaries: mean, sd, var, min, max
    • Plotting a scatterplot with plot, a histogram with hist, and a boxplot with boxplot
  • New R techniques:
    • Normal quantile plots for assessing normality using qqnorm
    • Quantile and cumulative distribution functions for standard distributions, e.g. qnorm and pnorm

1 The data

Eriksen, Björnstad and Götestam (1986) studied a social skills training program for alcoholics. Twenty-three alcohol-dependent male inpatients at an alcohol treatment centre were randomly assigned to two groups. The 12 control group patients were given a traditional treatment programme. The 11 treatment group patients were given the traditional treatment, plus a class in social skills training (“SST”). The patients were monitored for one year at 2-week intervals, and their total alcohol intake over the year was recorded (in cl pure alcohol).

A: Control 1042 1617 1180 973 1552 1251 1151 1511 728 1079 951 1391
B: SST 874 389 612 798 1152 893 541 741 1064 862 213
  • Use the c function to enter these data into R as two vectors called A for the control group, and B for the treatment group (an example is given below).
  • To ensure the data are correct, check your summary statistics match the ones provided below.
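For example, the data can be entered as follows:

A <- c(1042, 1617, 1180, 973, 1552, 1251, 1151, 1511, 728, 1079, 951, 1391)
B <- c(874, 389, 612, 798, 1152, 893, 541, 741, 1064, 862, 213)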
summary(A)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     728    1025    1166    1202    1421    1617
summary(B)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   213.0   576.5   798.0   739.9   883.5  1152.0
  • Compare the distributions of A and B using a side-by-side boxplot. What conclusions can you draw about the two groups?
  • Draw histograms (hist) of both samples. On the basis of these plots, do the samples look approximately Normally distributed?
boxplot(A, B)         # side-by-side boxplots of the two groups
par(mfrow = c(1, 2))  # split the plotting window into 1 row, 2 columns
hist(A)
hist(B)

A better graphical way to assess whether the data are normally distributed is to draw a normal quantile plot (also called a quantile-quantile, or QQ, plot). We can do this in R using the qqnorm function; to draw the Normal quantile plot for a vector of data x we use the command

qqnorm(x)

With this technique, the quantiles of the data (i.e. the ordered data values) are plotted against the quantiles which would be expected of a matching normal distribution. If the data are normally distributed, then the points of the QQ plot should lie on an approximately straight line. Deviations from a straight line suggest departures from the normal distribution.

The straight line can be superimposed on the plot by using the following command

qqline(x)

  • Draw Normal quantile plots of both samples using qqnorm. Do the samples look approximately Normally distributed? Add the lines using qqline. Do your conclusions agree with those made on the basis of the histograms? (One way to draw these plots is sketched below.)
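For example, to produce both plots with their reference lines side by side:

par(mfrow = c(1, 2))  # two plots side by side
qqnorm(A); qqline(A)  # QQ plot and reference line for the control group
qqnorm(B); qqline(B)  # QQ plot and reference line for the SST group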

We will use the independent sample \(t\)-test to compare these two samples. First, let’s recap the underlying theory (skip this if you remember the setup).

2 Theory: the independent sample \(t\)-test

We assume that we have two random quantities \(X\sim\mathcal{N}\left({\mu_X, \sigma^2}\right)\) and \(Y \sim \mathcal{N}\left({\mu_Y, \sigma^2}\right)\). We obtain an i.i.d. sample of \(n\) observations of \(X\): \(X_1, \dots, X_n\), and a sample of size \(m\) from \(Y\): \(Y_1,\dots, Y_m\). We assume that each \(X_i\) is independent of each \(Y_j\). Note that we have allowed possibly different means for \(X\) and \(Y\), but assumed a common variance \(\sigma^2\).

The question we are asking is: are the means of the two populations equal, i.e. is \(\mu_X=\mu_Y\), given the evidence we see in our samples?

Under the above assumptions, we have the following definitions and results:

  • The independent sample \(t\)-statistic is: \[ t = \frac{(\bar{x}-\bar{y}) - (\mu_X - \mu_Y)}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}, \]

  • \(s_p^2\) is the pooled sample variance, which estimates the common variance \(\sigma^2\) using data from both samples: \[ s_p^2 = \frac{(n-1)s_X^2 + (m-1)s_Y^2}{n+m-2} \]

  • Under \(H_0: \mu_X=\mu_Y\), \(t\) has a \(t\)-distribution with \(m+n-2\) degrees of freedom.

  • The \(t\)-test rejects \(H_0\) at level \(\alpha\) in favour of the alternative \(H_1: \mu_X\neq\mu_Y\) if \[ |t| > t^*_{(m+n-2), \alpha/2} \]

  • The corresponding \(100(1-\alpha)\%\) confidence interval for \(\mu_X - \mu_Y\) is then \[ (\bar{x}-\bar{y}) \pm t^*_{(m+n-2), \alpha/2} \, s_p \sqrt{\frac{1}{n} + \frac{1}{m}} \]

3 Applying the independent sample \(t\)-test

To compute the \(t\)-test statistic, we first need some simple summaries of our two samples.

  • Use the mean, var, and length functions to find the sample mean, sample variance, and sample size for the control group A. Save these as abar (\(\bar{a}\)), sa2 (\(s_a^2\)), and n.
  • Repeat the process for the treatment group B, creating variables bbar (\(\bar{b}\)), sb2 (\(s_b^2\)), and m.
  • Use the formula in Section 2 and the variables computed in the previous exercise to calculate the pooled sample variance; save this to sp2 (\(s_p^2\)).
  • Under the null hypothesis of no difference in population means, calculate the value of the test statistic, \(t\), as defined in the theory section above and save it as t. Check you get the same value as shown below.
  • How many degrees of freedom do we associate with the distribution of \(t\)? Save this value to a variable df. (One way to carry out all of these steps is sketched after the check value below.)
## [1] 4.00757
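One possible way to carry out these steps (a sketch; the variable names follow the exercises above):

abar <- mean(A); sa2 <- var(A); n <- length(A)  # summaries for group A
bbar <- mean(B); sb2 <- var(B); m <- length(B)  # summaries for group B
sp2 <- ((n - 1) * sa2 + (m - 1) * sb2) / (n + m - 2)  # pooled sample variance
t <- (abar - bbar) / sqrt(sp2 * (1/n + 1/m))  # test statistic under H0
df <- n + m - 2  # degrees of freedom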

R provides a range of functions to support calculations with standard probability distributions. In the previous practical, we encountered the normal density function dnorm, as well as the random number generation functions for the uniform (runif) and normal (rnorm) distributions.

For every distribution there are four functions. The functions for each distribution begin with a particular letter to indicate the functionality (see table below) followed by the name of the distribution:

Letter  e.g.    Function
“d”     dnorm   evaluates the probability density (or mass) function, \(f(x)\)
“p”     pnorm   evaluates the cumulative distribution function, \(F(x)=P[X \leq x]\), hence finds the probability that the specified random variable is less than or equal to the given argument
“q”     qnorm   evaluates the inverse cumulative distribution function (quantiles), \(F^{-1}(q)\), i.e. the value \(x\) such that \(P[X \leq x] = q\); used to obtain critical values associated with particular probabilities \(q\)
“r”     rnorm   generates random numbers

The appropriate functions for Normal, \(t\) and \(\chi^2\) distributions are given below, along with the optional parameter arguments.

  • Normal distribution: dnorm, pnorm, qnorm, rnorm. Parameters: mean (\(\mu\)) and sd (\(\sigma\)).
  • \(t\) distribution: dt, pt, qt, rt. Parameter: df.
  • \(\chi^2\) distribution: dchisq, pchisq, qchisq, rchisq. Parameter: df.

For a list of all the supported distributions, run the command help(Distributions)
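For example, for the standard normal distribution the “p” and “q” functions are inverses of one another:

pnorm(1.96)   # P[Z <= 1.96] for standard normal Z
## [1] 0.9750021
qnorm(0.975)  # the value z such that P[Z <= z] = 0.975
## [1] 1.959964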

  • How would you use the qt function to find the critical \(t\) value for the test of \(H_0: \mu_X=\mu_Y\) against \(H_1: \mu_X\neq\mu_Y\) at the 5% level of significance when we have df degrees of freedom? Check you get the same value as below.
## [1] 2.079614
  • Compare this to your value of t, and hence perform the significance test.
  • Use this information to construct a 95% confidence interval for the difference in means between the two populations.
  • What is the probability, under the null hypothesis, of observing a value of the test statistic whose absolute value is at least as large as the one observed? i.e. what is \(P[|T_{df}|\geq |t|] = 2\,P[T_{df}\geq |t|]\)? (A sketch of these calculations appears after this list.)
  • On the basis of your calculations above, what would you conclude about the population means for the two groups \(A\) and \(B\)?
  • Do you think the assumption of equal variances holds here? If not, would this affect your conclusions?
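A sketch of these calculations, reusing the variables computed earlier:

tcrit <- qt(1 - 0.05/2, df)  # two-sided 5% critical value
(abar - bbar) + c(-1, 1) * tcrit * sqrt(sp2 * (1/n + 1/m))  # 95% confidence interval
2 * pt(abs(t), df, lower.tail = FALSE)  # two-sided p-value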

4 Independent \(t\)-test: Relaxing the equal variance assumption

If we abandon the assumption of equal variances, then our first sample is i.i.d. \(\mathcal{N}\left({\mu_X, \sigma_X^2}\right)\) and the second sample i.i.d. \(\mathcal{N}\left({\mu_Y, \sigma_Y^2}\right)\), with \(\sigma_X\) not necessarily equal to \(\sigma_Y\). Clearly, there is no longer a single variance parameter to estimate, so the test which uses the pooled sample variance would not be appropriate. We won’t cover the detail of the theory for this in lectures (nor is it examinable), but the idea is straightforward and easy to apply. Without the equal variance assumption, our test statistic becomes \[ t = \frac{(\bar{x}-\bar{y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}}, \] which is still approximately \(t\)-distributed. The problem arises in determining the appropriate degrees of freedom for the distribution of \(t\). The degrees of freedom is no longer \(n+m-2\), but is instead approximated by \[ \nu \approx \frac{\left(\frac{s^2_X}{n}+\frac{s^2_Y}{m}\right)^2}{\frac{s^4_X}{n^2(n-1)} + \frac{s^4_Y}{m^2(m-1)}}, \] though in practice we often take the lazier route of using \[\nu=\min(n,m)-1\] as the degrees of freedom (this simpler choice gives a conservative version of the test).

  • Apply the independent sample \(t\)-test with unequal variances to compare the two groups (one way to do this is sketched after this list).
  • What do you conclude? Do the results agree with or contradict the equal-variance test?
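A minimal sketch, reusing the summaries from Section 3 and the Welch approximation to the degrees of freedom:

tw <- (abar - bbar) / sqrt(sa2/n + sb2/m)  # unequal-variance test statistic
nu <- (sa2/n + sb2/m)^2 / ((sa2/n)^2/(n - 1) + (sb2/m)^2/(m - 1))  # approximate df
2 * pt(abs(tw), nu, lower.tail = FALSE)  # two-sided p-value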

5 Using the stats package

Thankfully, tests such as \(t\)-tests are supported by the stats package in R, which spares us from computing the test line by line: we can simply call the appropriate function and interpret the results.

  • Use the library function to load the R package stats.
  • Read the techniques page on the t.test function and apply it to your two samples A and B. Use the optional argument var.equal to perform an equal-variance test (TRUE), and FALSE to test without this assumption (see the example after this list). Compare with your results from Sections 3 and 4.
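For example:

t.test(A, B, var.equal = TRUE)   # pooled (equal-variance) two-sample t-test
t.test(A, B, var.equal = FALSE)  # unequal variances; this is also R's default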

6 Creating your own functions

  • Use your code to construct your own version of the t.test function.
    • Add an optional equalvariance argument that can be TRUE or FALSE to adjust which test is performed. You will need to use an if statement to handle the two cases.
    • Use this to show the test statistic, degrees of freedom, and p-value. (A possible skeleton is sketched at the end of this section.)
  • Write a function to perform a paired sample t-test.
    • Try it out on the immer data set from the MASS package, which contains pairs of measurements of the barley yield from the same fields in years 1931 (Y1) and 1932 (Y2).
    • Check your results with the t.test function using the argument paired=TRUE.
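One possible skeleton for the first exercise (a sketch only: the name my_t_test and the returned list are our own choices, and your version may be structured differently):

my_t_test <- function(x, y, equalvariance = TRUE) {
  n <- length(x); m <- length(y)
  if (equalvariance) {
    sp2 <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)  # pooled variance
    t <- (mean(x) - mean(y)) / sqrt(sp2 * (1/n + 1/m))
    df <- n + m - 2
  } else {
    se2 <- var(x)/n + var(y)/m  # squared standard error, unequal variances
    t <- (mean(x) - mean(y)) / sqrt(se2)
    df <- se2^2 / ((var(x)/n)^2/(n - 1) + (var(y)/m)^2/(m - 1))  # Welch df
  }
  list(statistic = t, df = df, p.value = 2 * pt(abs(t), df, lower.tail = FALSE))
}

For the paired test, recall that a paired sample \(t\)-test is equivalent to a one-sample test on the within-pair differences (here Y1 - Y2); you can check your function against the built-in one:

library(MASS)  # provides the immer barley data
t.test(immer$Y1, immer$Y2, paired = TRUE)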