# PRACTICAL 1-3: Likelihood Inference
# ---------------------------------
#
#
# We have seen in lectures that the MLEs for the parameters of several common
# distributions can be found in closed form. However, in general, the problem is
# not guaranteed to be so easy, and there may be no simple closed-form result.
# Therefore, in these situations, we will have to find and maximise the likelihood
# numerically.
#
# * You will need the following skills from previous practicals:
#     * Basic R skills with arithmetic, functions, etc.
#     * Manipulating and creating vectors: `c`, `seq`
#     * Calculating data summaries: `mean`, `sum`, `length`
#     * Drawing a histogram with `hist`, and adding simple lines to plots with `abline`
#     * Customising plots using `col`our
# * New R techniques:
#     * Drawing a plot of a simple function using `curve`
#     * Optimising a function with `optim`
#     * Extracting named elements of a list using the `$` operator
#
#
# 1. The Poisson Likelihood
# ==================================================================================
#
# Our likelihood calculations begin with writing down the likelihood function of our
# data. For a single observation $X=x$ from a Poisson distribution, the likelihood is
# the probability of observing $X=x$, written as a function of the parameter lambda:
#
#    l(lambda) = P[X=x | lambda] = (e^{-lambda} lambda^x)/(x!).
#
# To begin with, suppose we observe a count of X=5.
#
# Exercise:
# ~~~~~~~~~~~~~
#  * Write a simple function called `poisson` that takes a single argument `lambda`
#    and evaluates the Poisson probability P[X=5 | lambda].
#    Hint: You will need the `exp` and `factorial` functions.
#  * Evaluate the Poisson probability for X=5 when lambda=7 - you should get a
#    value of 0.1277167.
#
# What we have just created is the likelihood function, l(lambda), for the Poisson
# parameter lambda given a single observation of X=5.
#
# Before we consider a larger data problem, let's briefly explore this simplest case
# of a Poisson likelihood. Let's begin with a plot of the likelihood function itself.
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^ **Using `curve` to plot a function**
# ^
# ^ Given an expression for a function f(x), we can plot the values of f for various
# ^ values of x in a given range. This can be accomplished using an R function called
# ^ `curve`. The main syntax for `curve` is as follows:
# ^
# ^    curve(expr, from = NULL, to = NULL, add = FALSE, ...)
# ^
# ^ The arguments are:
# ^  * `expr`: an expression or function in a variable x which evaluates to the function
# ^    to be drawn. Examples include `sin(x)` or `x+3*x^2`
# ^  * `from`, `to`: the range of `x` over which the function will be plotted
# ^  * `add`: if `TRUE` the curve will be added to an existing plot; if `FALSE` a new
# ^    plot will be created
# ^  * `...`: any of the standard plot customisation arguments can be given here
# ^    (e.g. `col` for colour, `lty` for line type, `lwd` for line width)
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# For example, evaluating the following code produces a plot of the cubic x^3 - 5x:
# curve(x^3 - 5*x, from = -4, to = 4)
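# `curve` also accepts a function we have defined ourselves, and further curves can be
# superimposed on the same plot with `add = TRUE`. As a quick illustration (the function
# `f` below is arbitrary and is only there to show the syntax):
f <- function(x) exp(-x) * x^2                 # an arbitrary user-defined function
curve(f, from = 0, to = 10)                    # plot f(x) over [0, 10]
curve(exp(-x) * x, from = 0, to = 10,          # a second expression in x, drawn dashed
      add = TRUE, col = 'red', lty = 2)        # and in red on the existing plot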
# Exercise:
# ~~~~~~~~~~~~~
#  * Try out using `curve` to draw a quick plot of the following functions:
#     * f(x)=3x^2+x over [0,10]
#     * f(x)=sin(x)+cos(x) over [0,2pi] as a blue curve.
#       Note: R defines `pi` as the constant pi
#     * Use the `add=TRUE` option to superimpose g(x)=0.5sin(x)+cos(x) as a red
#       curve over [0,2pi] on the previous plot.
#  * Now, use `curve` and your `poisson` function to draw a plot of your Poisson
#    likelihood for lambda in [0.1, 20].
#  * By eye, what is the value lambdahat of lambda which maximises the
#    likelihood?
#
# Typically, we work with the log-likelihood L(lambda) = log l(lambda),
# rather than the likelihood itself. This can often make the mathematical
# maximisation simpler and the numerical computations more stable, and it is also
# intrinsically connected to the idea of *information* and variance (which we will
# consider later).
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * Modify your `poisson` function to compute the logarithm of the Poisson
#    probability, and call this new function `logPoisson`. You will need to use the
#    `log` function.
#  * Re-draw the plot to show the log-likelihood against lambda. Do you find the same
#    maximum value lambdahat?
#
#
# 2. Likelihood methods for real data
# ==================================================================================
#
#
# The Poisson distribution has been used by traffic engineers as a model for light
# traffic. This assumption is based on the rationale that if the rate is approximately
# constant and the traffic is light (so the individual cars move independently of
# each other), then the distribution of counts of cars in a given time interval or
# space area should be nearly Poisson (Gerlough and Schuhl 1955).
#
# The data below record the number of right turns during three hundred 3-minute
# intervals at a specific junction.
traffic <- rep(0:12, c(14,30,36,68,43,43,30,14,10,6,4,1,1))
# Note that we interpret these data as 14 observations of the value 0,
# 30 observations of the value 1, and so forth.
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * Evaluate the piece of code above to create the vector `traffic` corresponding
#    to this data set.
#  * Plot a `hist`ogram of the data. Does an assumption of a Poisson distribution
#    appear reasonable?
#
# If we suppose a Poisson distribution might be a good choice of distribution for
# this dataset, we still need to find out _which_ particular Poisson distribution
# is best by estimating the parameter lambda.
#
# Let's begin by constructing the likelihood and log-likelihood functions for this
# data set of 300 observations. Assuming that the observations x_1, ..., x_300
# can be treated as 300 i.i.d. values from Po(lambda), the log-likelihood
# function for the entire sample is:
#
#    L(lambda) = -n lambda + (sum_{i=1}^n x_i) log lambda,
#
# up to an additive constant from the factorial terms (which we know we can ignore
# for the purposes of these calculations).
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * Write a new function `logLike` with a single parameter `lambda`, which
#    evaluates the log-likelihood expression above for the `traffic` data and
#    the supplied values of `lambda`.
#    _Hint_: Use `length` and `sum` to compute the summary statistics.
#  * Create a vector `lambdavals` of candidate values for lambda (using `seq`),
#    evaluate the data log-likelihood for all lambda in `lambdavals`, and plot the
#    log-likelihood against lambda. Without further calculations, use the plot to
#    have a guess at the maximum likelihood estimate lambdahat --
#    we'll check this later! (One possible sketch is given below.)
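# If you get stuck, here is a minimal sketch of one possibility (the range chosen for
# `lambdavals` is an arbitrary but reasonable choice for these data):
logLike <- function(lambda) {
  n <- length(traffic)                         # number of observations (300)
  sumX <- sum(traffic)                         # total of the observed counts
  -n * lambda + sumX * log(lambda)             # log-likelihood, up to an additive constant
}
lambdavals <- seq(0.1, 10, by = 0.1)           # a grid of candidate values for lambda
plot(lambdavals, logLike(lambdavals), type = 'l',
     xlab = 'lambda', ylab = 'log-likelihood')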
#
# 2.1 Maximising the likelihood
# ----------------------------------------------------------------------------------
#
#
# The maximum likelihood estimate (MLE) lambdahat for the lambda parameter of the
# Poisson distribution is the value of lambda which maximises the likelihood:
#
#    lambdahat = argmax_lambda l(lambda) = argmax_lambda L(lambda).
#
# Therefore, to find the MLE, we must maximise l or L. We know we can do this
# analytically for certain common distributions, but in general we'll have to
# optimise the function numerically instead. For such problems, `R` provides
# various functions to perform a numerical optimisation of a given function
# (even one we have defined ourselves). The function we will focus on is `optim`:
#
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^
# ^ `R` offers several optimisation functions; the one we shall be using today is
# ^ `optim`, which is one of the most general. This generality comes at the price of
# ^ having a lot of optional parameters that need to be specified.
# ^
# ^ For our purposes, we're doing a 1-dimensional optimisation of a given function
# ^ named `logLike`. So, to _maximise_ this function, we call `optim` as follows:
# ^
# ^    optim(start, logLike, lower=lwr, upper=upr, method='Brent', control=list(fnscale=-1))
# ^
# ^ The arguments are:
# ^  * `start` : an initial value for lambda at which to start the optimisation.
# ^    *Replace this with a sensible value.*
# ^  * `logLike` : this is the function we're optimising. In general, we can
# ^    replace this with any other function.
# ^  * `lower=lwr, upper=upr` : lower and upper bounds to our search for lambdahat.
# ^    *Replace `lwr` and `upr` with appropriate numerical values*;
# ^    an obvious choice for this problem is the min and max of our data.
# ^  * `method='Brent'` : this selects an appropriate optimisation algorithm for a 1-D
# ^    optimisation between specified bounds. Other options are available
# ^    for other classes of optimisation problem.
# ^  * `control=list(fnscale=-1)` : this looks rather strange, but is simply telling
# ^    `optim` to *max*imise the given function, rather than *min*imise it,
# ^    which is the default (it's instructing R to scale the function
# ^    by a factor of -1 and minimise the result).
# ^
# ^ `optim` returns a number of values in the form of a list. The relevant components are:
# ^
# ^  * `par` : the optimal value of the parameter
# ^  * `value` : the optimum of the function being optimised
# ^  * `convergence` : if `0`, this indicates that the optimisation has completed successfully
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * Use `optim` to maximise the `logLike` function. You should choose an
#    appropriate initial value for lambda, as well as sensible `upper` and
#    `lower` bounds.
#  * What value of lambdahat do you obtain? How does this compare to
#    your guess at the MLE?
#  * We know from lectures that lambdahat = xbar -- evaluate the
#    sample mean of the `traffic` data. Does this agree with your results from
#    directly maximising the log-likelihood?
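# For example, a call along the following lines should do the job (the start value of 1
# is arbitrary, and the bounds follow the suggestion above of using the min and max of
# the data):
results <- optim(1, logLike,                       # start the search at lambda = 1
                 lower = min(traffic), upper = max(traffic),
                 method = 'Brent',                 # 1-D optimisation between bounds
                 control = list(fnscale = -1))     # maximise rather than minimise
results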
# You should see output something like this:
## $par
## [1] 3.893333
##
## $value
## [1] 419.6223
##
## $counts
## function gradient
##       NA       NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ^ TECHNIQUE:
# ^
# ^
# ^ The output of `optim` is in the form of a `list`. Unlike vectors, `list`s can
# ^ combine many different variables together (numbers, strings, matrices, etc) and
# ^ each element of a list can be given a name to make it easier to access.
# ^
# ^ For example, we can create our own lists using the `list` function:
# ^
# ^    test <- list(Name='Donald', IQ=-10, Vegetable='Potato')
# ^
# ^ Once created, we can access the named elements of the list using the `$` operator.
# ^ For example,
# ^
# ^    test$IQ
# ^
# ^ will return the value saved to `IQ` in the list, i.e. `-10`. RStudio will show
# ^ auto-complete suggestions for the elements of the list after you press `$`, which
# ^ can help make life easier.
# ^
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# We can use this `$` operator to extract the MLE value from `optim` by first
# saving the results to a variable, called `results` say, and then extracting the
# list element named `par`, which corresponds to the maximising value of lambda.
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * Extract the optimal value of lambda from the output of `optim` and save it
#    to a new variable called `mle`.
#  * Re-draw your plot of the log-likelihood function, and use `abline` to:
#     * Add a red solid vertical line at lambda=lambdahat.
#       _Hint:_ pass the `col='red'` argument to `abline`.
#     * Add a blue dashed vertical line at lambda=xbar (`lty=2` gives a dashed line).
#  * Do the results agree with each other, and with your guess at lambdahat?
#
# Let's return to the original distribution of the data. The results of our
# log-likelihood analysis suggest that a Poisson distribution with parameter
# lambdahat is the best choice of Poisson distributions for this data set.
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * Re-draw the `hist`ogram of the `traffic` data.
#  * Modify your `poisson` function to evaluate the Poisson probability
#    P[X=x] when X ~ Po(lambdahat). Save the function as `poissonProb`.
#  * The expected number of Poisson counts under this Po(lambdahat)
#    distribution will be 300 P[X=x], for x=0,...,16. Use the
#    `curve` function with `add=TRUE` to draw this function over your histogram in red.
#    (We will ignore the fact that X should be discrete for the time being.)
#  * Does this look like a good fit to the data?
#    (A rough sketch of these steps is given below.)
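# For reference, a rough sketch of how these steps might look, assuming the `results`,
# `logLike` and `lambdavals` objects created earlier (your own solution may well differ):
mle <- results$par                             # extract the MLE of lambda
mle
mean(traffic)                                  # sample mean, for comparison

plot(lambdavals, logLike(lambdavals), type = 'l',
     xlab = 'lambda', ylab = 'log-likelihood') # re-draw the log-likelihood
abline(v = mle, col = 'red')                   # solid red line at the MLE
abline(v = mean(traffic), col = 'blue', lty = 2)  # dashed blue line at the sample mean

hist(traffic)                                  # histogram of the raw counts
poissonProb <- function(x) exp(-mle) * mle^x / factorial(x)  # P[X = x] under Po(lambdahat)
curve(300 * poissonProb(x), from = 0, to = 16, # expected counts, overlaid in red
      add = TRUE, col = 'red')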
# We will return to the question of fitting distributions to data and the
# "goodness of fit" next term.
#
# 2.2 Information and inference
# ----------------------------------------------------------------------------------
#
# For an iid sample X_1,...,X_n from a sufficiently nicely-behaved ("regular") density
# function f(x_i | theta) with unknown parameter theta, the sampling distribution
# of the MLE thetahat is such that
#
#    thetahat ~~> N(theta, 1/I_n(theta)),
#
# as n -> infinity. For large samples, we can also replace the expected Fisher
# information in the variance with the observed information, I(thetahat) = -L''(thetahat).
# Whilst we may be used to finding Normal distributions in unusual places, this result
# is noteworthy as we never assumed any particular distribution for our data here.
# This result therefore provides us with a general method for making
# inferences about parameters of distributions and their MLEs for problems involving
# large samples!
#
# In particular, we can apply this to our Poisson data problem above to construct a
# large-sample approximate 100(1-alpha)% confidence interval for lambda of the form:
#
#    lambdahat +/- z_{alpha/2} / sqrt(-L''(lambdahat)),
#
# where z_{alpha/2} is the alpha/2 critical value of a standard Normal distribution
# and we have estimated the expected Fisher information I_n(lambda) by the observed
# information I(lambdahat) = -L''(lambdahat).
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * We can request that `R` computes the second derivative at the maximum,
#    L''(lambdahat), as part of the maximisation process. To do this, re-run your
#    optimisation, but add the additional argument `hessian=TRUE`.
#
#    You should now obtain R output that looks like this:
#
## $par
## [1] 3.893333
##
## $value
## [1] 419.6223
##
## $counts
## function gradient
##       NA       NA
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
##           [,1]
## [1,] -77.05481
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * The output from `optim` will now contain an additional element named `hessian`.
#    Extract this element, multiply by -1, and save it as `obsInfo`.
#  * To get the variance of lambdahat we need to invert the information matrix. We
#    can achieve this using R's `solve` function. Try this now on `obsInfo`, and save
#    the result as `varMLE`.
#  * As `obsInfo` is just a scalar (or a 1x1 matrix), the inverse of the matrix
#    is the same as the reciprocal 1/I(lambdahat) - check this now!
#  * Finally, `varMLE` will be in the form of a 1x1 matrix rather than a
#    scalar (as the Hessian matrix is a matrix of the second-order partial derivatives).
#    To extract the value from within the matrix, run the following code:
#
#    varMLE <- varMLE[1,1]
#
# Exercise:
# ~~~~~~~~~~~~~
#
#  * Use your standard Normal tables (or R's built-in function `qnorm`) to find the
#    critical value z_{alpha/2} for a 95% confidence interval. Save this as `zcrit`.
#  * Combine `mle`, `zcrit`, and `varMLE` to evaluate an approximate 95% Wald
#    confidence interval for lambda.
#  * Return to (and possibly re-draw) the plot of your log-likelihood function, with
#    the MLE indicated as a vertical line. Add the limits of the confidence interval
#    to your plot as vertical green lines.
#  * What does the confidence interval suggest about the reliability of your estimate?
#  * Does the log-likelihood look reasonably quadratic near the MLE? Does this affect
#    your conclusions, and how would you investigate this further?
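# If you want to check your working, here is a minimal sketch of these last steps.
# It assumes the `mle`, `logLike` and `lambdavals` objects from earlier; the name
# `fitH` for the re-run optimisation is just a suggestion:
fitH <- optim(1, logLike, lower = min(traffic), upper = max(traffic),
              method = 'Brent', control = list(fnscale = -1),
              hessian = TRUE)                  # also return the second derivative

obsInfo <- -fitH$hessian                       # observed information, -L''(lambdahat)
varMLE <- solve(obsInfo)                       # invert to get the approximate variance
varMLE[1, 1] - 1 / obsInfo[1, 1]               # same as the reciprocal (difference is 0)
varMLE <- varMLE[1, 1]                         # reduce the 1x1 matrix to a scalar

zcrit <- qnorm(0.975)                          # z_{alpha/2} for alpha = 0.05, approx 1.96
ci <- mle + c(-1, 1) * zcrit * sqrt(varMLE)    # approximate 95% Wald interval
ci

plot(lambdavals, logLike(lambdavals), type = 'l',
     xlab = 'lambda', ylab = 'log-likelihood')
abline(v = mle, col = 'red')                   # the MLE
abline(v = ci, col = 'green')                  # confidence limits in green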