#
# PRACTICAL 2-2: Comparing Two Samples
# ---------------------------------
#
#
# In this practical, we illustrate frequentist techniques for comparing two samples
# under the assumption that the populations are normal.
# Our objective will be to assess whether or not two samples can be regarded as being
# from the same population.
#
# You will need the following skills from previous practicals:
# * Basic R skills with arithmetic, functions, etc.
# * Manipulating and creating vectors: `c`, `seq`, `length`
# * Calculating data summaries: `mean`, `sd`, `var`, `min`, `max`
# * Plotting a scatterplot with `plot`, a histogram with `hist`, and a boxplot with `boxplot`
#
# New R techniques:
# * Normal quantile plots for assessing normality using `qqnorm`
# * Quantile and inverse-quantile functions for standard distributions, e.g. `qnorm` and `pnorm`
#
# ==================================================================================
#
# 1. The data
#
#
# Eriksen, Björnstad and Götestam (1986) studied a social skills training programme for
# alcoholics. Twenty-three alcohol-dependent male inpatients at an alcohol treatment centre
# were randomly assigned to two groups. The 12 control group patients were given a
# traditional treatment programme. The 11 treatment group patients were given the
# traditional treatment, plus a class in social skills training ("SST"). The patients were
# monitored for one year at 2-week intervals, and their total alcohol intake over the year
# was recorded (in cl pure alcohol).
#
#   A: Control  1042 1617 1180  973 1552 1251 1151 1511  728 1079  951 1391
#   B: SST       874  389  612  798 1152  893  541  741 1064  862  213
#
#
# Exercise 1.1:
# ~~~~~~~~~~~~~
#
# * Use the `c` function to enter these data into R as two vectors called `A` for the control
#   group, and `B` for the treatment group.
#
# * Check your summary statistics match mine below to ensure the data are correct.

## summary(A)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     728    1025    1166    1202    1421    1617
## summary(B)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   213.0   576.5   798.0   739.9   883.5  1152.0
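#
# For reference, here is a minimal sketch of Exercise 1.1 (any equivalent `c(...)` call
# will do):
#
#     # Enter the control (A) and treatment (B) data as numeric vectors
#     A <- c(1042, 1617, 1180, 973, 1552, 1251, 1151, 1511, 728, 1079, 951, 1391)
#     B <- c(874, 389, 612, 798, 1152, 893, 541, 741, 1064, 862, 213)
#
#     # Compare with the summary statistics shown above
#     summary(A)
#     summary(B)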
#
# Exercise 1.2:
# ~~~~~~~~~~~~~
#
# * Compare the distributions of `A` and `B` using a side-by-side `boxplot`. What conclusions
#   can you draw about the two groups?
# * Draw histograms (`hist`) of both samples. On the basis of these plots, do the samples
#   look approximately symmetric and Normally distributed?
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^
# ^ TECHNIQUE:
# ^
# ^ A better graphical way to assess whether data are Normally distributed is to draw a "Normal
# ^ quantile plot" (or quantile-quantile, or QQ, plot). We can do this in _R_ using the `qqnorm`
# ^ function: to draw the Normal quantile plot for a vector of data `x`, we use the command
# ^
# ^     qqnorm(x)
# ^
# ^ With this technique, the quantiles of the data (i.e. the ordered data values) are plotted
# ^ against the quantiles which would be expected of a matching Normal distribution. If the data
# ^ are Normally distributed, then the points of the quantile plot should lie on an
# ^ approximately straight line. Deviations from a straight line suggest departures from the
# ^ Normal distribution.
# ^
# ^ The straight line can be superimposed on the plot by using the following command:
# ^
# ^     qqline(x)
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise 1.3:
# ~~~~~~~~~~~~~
#
# * Draw Normal quantile plots of both samples using `qqnorm`, and add reference lines with
#   `qqline`. Do the samples look approximately Normally distributed? Do your conclusions
#   agree with those made on the basis of the histograms?
#
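# A minimal sketch for Exercises 1.2 and 1.3 (the plot titles are illustrative choices):
#
#     # Side-by-side boxplots of the two groups
#     boxplot(A, B, names = c("Control", "SST"))
#
#     # Histograms of each sample
#     hist(A, main = "A: Control")
#     hist(B, main = "B: SST")
#
#     # Normal quantile plots with superimposed reference lines
#     qqnorm(A); qqline(A)
#     qqnorm(B); qqline(B)
#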
# ==================================================================================
#
# 3. Applying the independent sample t-test
#
#
# To compute the t-test statistic, we begin by finding some simple summaries of our two
# samples.
#
# Exercise 3.1:
# ~~~~~~~~~~~~~
#
# * Use the `mean`, `var`, and `length` functions to find the sample mean, sample variance, and
#   sample size for group A (control). Save these as `abar`, `sa2`, and `n`.
#
# * Repeat the process for group B (treatment), creating variables `bbar`, `sb2`, and `m`.
#
#
# Exercise 3.2:
# ~~~~~~~~~~~~~
#
# * Use the formula given above and the variables you have just created to calculate the
#   pooled sample variance; save this to `sp2`.
#
#
# Exercise 3.3:
# ~~~~~~~~~~~~~
#
# * Under the null hypothesis of no difference in population means, calculate the value of the
#   test statistic, t, as defined in the theory section above and save it as `t`. Check you
#   get the same value as shown below.

## [1] 4.00757

# * How many degrees of freedom do we associate with the distribution of t? Save this value to
#   a variable `df`.
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^
# ^ TECHNIQUE:
# ^
# ^ R provides a range of functions to support calculations with standard probability
# ^ distributions. In the previous practical, we encountered the Normal density function `dnorm`,
# ^ as well as the random number generation functions for the uniform (`runif`) and Normal
# ^ (`rnorm`) distributions.
# ^
# ^ For every distribution there are four functions. The functions for each distribution begin
# ^ with a particular letter to indicate the functionality (see table below), followed by the
# ^ name of the distribution:
# ^
# ^ | Letter | e.g.    | Function                                                              |
# ^ |--------|---------|-----------------------------------------------------------------------|
# ^ | "d"    | `dnorm` | evaluates the probability density (or mass) function, $f(x)$          |
# ^ | "p"    | `pnorm` | evaluates the cumulative distribution function, $F(x)=P[X \leq x]$,   |
# ^ |        |         | and hence finds the probability that the specified random variable is |
# ^ |        |         | at most the given argument.                                           |
# ^ | "q"    | `qnorm` | evaluates the inverse cumulative distribution function (quantiles),   |
# ^ |        |         | $F^{-1}(q)$, i.e. the value $x$ such that $P[X \leq x] = q$. Used to  |
# ^ |        |         | obtain critical values associated with particular probabilities $q$.  |
# ^ | "r"    | `rnorm` | generates random numbers                                              |
# ^
# ^ The appropriate functions for the Normal, $t$ and $\chi^2$ distributions are given below,
# ^ along with the optional parameter arguments.
# ^
# ^ + Normal distribution: `dnorm`, `pnorm`, `qnorm`, `rnorm`.
# ^   Parameters: `mean` (μ) and `sd` (σ).
# ^ + $t$ distribution: `dt`, `pt`, `qt`, `rt`.
# ^   Parameter: `df`.
# ^ + $\chi^2$ distribution: `dchisq`, `pchisq`, `qchisq`, `rchisq`.
# ^   Parameter: `df`.
# ^
# ^ For a list of all the supported distributions, run the command `help(Distributions)`.
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise 3.4:
# ~~~~~~~~~~~~~
#
# * How would you use the `qt` function to find the appropriate critical value for a test of
#   H_0: μ_X = μ_Y against H_1: μ_X ≠ μ_Y at the 5% level of significance? Check you get the
#   same value as below.

## [1] 2.079614

#
# Exercise 3.5:
# ~~~~~~~~~~~~~
#
# * Compare this to your value of `t`, and hence perform the significance test.
# * Use this information to construct a 95% confidence interval for the difference in means
#   between the two populations.
# * What is the probability, under the null hypothesis, of observing a value of the test
#   statistic whose absolute value is at least as large as the one observed? That is, what is
#   P[|T_{df}| >= |t|]?
# * On the basis of your calculations above, what would you conclude about the population
#   means for the two groups A and B?
# * Do you think the assumption of equal variances holds here? If not, would this affect your
#   conclusions?
#
# ==================================================================================
#
# 4. Independent $t$-test: Relaxing the equal variance assumption
#
#
# If we abandon the assumption of equal variances, then our first sample is i.i.d.
# N(mu_X, sigma_X^2) and the second sample is i.i.d. N(mu_Y, sigma_Y^2), with sigma_X not
# necessarily equal to sigma_Y. Clearly, there is no longer a single variance parameter to
# estimate, and the test which uses the pooled sample variance would not be appropriate.
# *We won't cover the detail of the theory for this in lectures (nor is it examinable)*,
# but the idea is straightforward and easy to apply. Without the equal variance assumption,
# our test statistic becomes
#
#          (xbar - ybar) - (mu_X - mu_Y)
#     t = ------------------------------- ,
#             sqrt(s_X^2/n + s_Y^2/m)
#
# which still (approximately) follows a t-distribution. The problem arises in determining the
# appropriate degrees of freedom for the distribution of t. The degrees of freedom are no
# longer n + m - 2, but instead are approximated by
#
#                  (s_X^2/n + s_Y^2/m)^2
#     nu  ≈  ------------------------------------- ,
#             s_X^4/(n^2(n-1)) + s_Y^4/(m^2(m-1))
#
# though often in practice we take the lazier route of using nu = min(n, m) - 1 as the degrees
# of freedom (this simpler choice corresponds to a conservative version of the test).
#
# Exercise 4.1:
# ~~~~~~~~~~~~~
#
# * Apply the independent sample t-test with unequal variances to compare the two groups.
#   What do you conclude? Do the results agree with or contradict the equal-variance test?
#
# ==================================================================================
#
# 5. Doing it the easy way: Using the `stats` package
#
#
# Thankfully, tests such as t-tests are supported by the `stats` package in R, which spares us
# the trouble of computing the test statistic line by line: we can simply call the appropriate
# function and interpret the results.
#
# Exercise 5.1:
# ~~~~~~~~~~~~~
#
# * Use the `library` function to load the R package `stats`.
# * Read the [techniques page on the `t.test` function](r_10_statsmethods.html#t.test)
#   and apply it to your two samples `A` and `B`. Use the optional argument `var.equal`:
#   `TRUE` performs the equal-variance test, and `FALSE` performs the test without this
#   assumption. Compare with your results from Sections 3 and 4.
#
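# Pulling Sections 3-5 together, here is a minimal worked sketch. It assumes the standard
# pooled variance formula sp2 = ((n-1)*sa2 + (m-1)*sb2)/(n+m-2); check this against the
# formula from the theory section before using it:
#
#     # Exercise 3.1: sample summaries
#     abar <- mean(A); sa2 <- var(A); n <- length(A)
#     bbar <- mean(B); sb2 <- var(B); m <- length(B)
#
#     # Exercise 3.2: pooled sample variance
#     sp2 <- ((n - 1) * sa2 + (m - 1) * sb2) / (n + m - 2)
#
#     # Exercise 3.3: test statistic and its degrees of freedom
#     t  <- (abar - bbar) / sqrt(sp2 * (1/n + 1/m))
#     df <- n + m - 2
#
#     # Exercise 3.4: two-sided 5% critical value
#     qt(0.975, df)
#
#     # Exercise 3.5: 95% confidence interval and p-value
#     (abar - bbar) + c(-1, 1) * qt(0.975, df) * sqrt(sp2 * (1/n + 1/m))
#     2 * (1 - pt(abs(t), df))
#
#     # Exercise 4.1: unequal-variance version with approximate degrees of freedom
#     tu <- (abar - bbar) / sqrt(sa2/n + sb2/m)
#     nu <- (sa2/n + sb2/m)^2 / (sa2^2/(n^2*(n-1)) + sb2^2/(m^2*(m-1)))
#     2 * (1 - pt(abs(tu), nu))
#
#     # Exercise 5.1: the same two tests via t.test
#     t.test(A, B, var.equal = TRUE)
#     t.test(A, B, var.equal = FALSE)
#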
# ==================================================================================
#
# 6. Creating your own functions
#
#
# Exercise 6.1:
# ~~~~~~~~~~~~~
#
# * Use your code to construct your own version of the `t.test` function. Add an optional
#   `equalvariance` argument that can be `TRUE` or `FALSE`, adjusting which test is performed.
#   You will need to use an `if` statement to handle the two cases.
# * Use your function to report the test statistic, degrees of freedom, and p-value.
#
# Exercise 6.2:
# ~~~~~~~~~~~~~
#
# * Write a function to perform a paired sample t-test.
# * Try it out on the `immer` data set from the `MASS` package, which contains pairs of
#   measurements of the barley yield from the same fields in years 1931 (`Y1`) and 1932 (`Y2`).
#   Check your results with the `t.test` function using the argument `paired=TRUE`.
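#
# One possible shape for these functions (a sketch: the function names `my.t.test` and
# `my.paired.t.test`, and the named-vector return format, are illustrative choices):
#
#     # Exercise 6.1: two-sample t-test with optional equal-variance assumption
#     my.t.test <- function(x, y, equalvariance = TRUE) {
#       n <- length(x); m <- length(y)
#       if (equalvariance) {
#         sp2 <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)
#         t   <- (mean(x) - mean(y)) / sqrt(sp2 * (1/n + 1/m))
#         df  <- n + m - 2
#       } else {
#         se2 <- var(x)/n + var(y)/m
#         t   <- (mean(x) - mean(y)) / sqrt(se2)
#         df  <- se2^2 / ((var(x)/n)^2/(n - 1) + (var(y)/m)^2/(m - 1))
#       }
#       c(statistic = t, df = df, p.value = 2 * (1 - pt(abs(t), df)))
#     }
#
#     my.t.test(A, B, equalvariance = TRUE)    # compare with t.test(A, B, var.equal = TRUE)
#     my.t.test(A, B, equalvariance = FALSE)   # compare with t.test(A, B, var.equal = FALSE)
#
#     # Exercise 6.2: a paired t-test is a one-sample t-test on the differences
#     my.paired.t.test <- function(x, y) {
#       d  <- x - y
#       t  <- mean(d) / sqrt(var(d) / length(d))
#       df <- length(d) - 1
#       c(statistic = t, df = df, p.value = 2 * (1 - pt(abs(t), df)))
#     }
#
#     library(MASS)
#     my.paired.t.test(immer$Y1, immer$Y2)
#     t.test(immer$Y1, immer$Y2, paired = TRUE)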