# # PRACTICAL 1-2: Sampling and Simulation # --------------------------------- # # # In this practical, we introduce the ideas of statistical simulation as a mechanism # to investigate the behaviour of sampling distributions in a case study problem # of jury selection. We'll expand our R techniques to see how to write custom # functions, to sample and generate random numbers, and how to repeat straightforward # calculations an arbitrary number of times. # * You will need the following skills from previous practicals: # * Basic R skills with arithmetic, functions, etc # * Manipulating and creating vectors: `c`, `seq`, # * Calculating data summaries: `mean`, `sd`, `var`, `min`, `max` # * Plotting a histogram with `hist` # * New R techniques: # * Sampling from a vector using `sample` # * Generating random numbers from a binomial distribution using `rbinom` # * Creating new functions with `function` # * Using `replicate` to repeatedly call a function with no arguments # * Adding straight lines to plots with `abline` # # # 1. Jury selection: Swain vs. Alabama (1965) # ================================================================================== # # The context of our problem today is concerns the selection of a jury in a trial # in 1965 Alabama, USA. In the early 1960s, in Talladega County in Alabama, a Black # man called Robert Swain was convicted and was sentenced to death. At the time, # only men aged 21 or older were allowed to serve on juries in Talladega County, # where 26% of the eligible jurors were Black. Of the 100 jurors available in the # jury panel, only 8 were Black and no Black man was selected for the jury of the # actual trial itself. # Robert Swain appealed his sentence, citing among other factors the fact the jury # at his trial was all White. Moreover, he pointed out that all Talladega County # jury panels for the past 10 years had contained only a small percent of Black # panelists. Robert Swain was represented in the U.S. Supreme Court by Constance # Baker Motley (https://en.wikipedia.org/wiki/Constance_Baker_Motley), the first # African-American woman to argue a case in that Court. She argued 10 cases in the # Supreme Court and lost only one – Robert Swain's. The U.S. Supreme Court concluded, # "the overall percentage disparity has been small" and that there was insufficient # evidence of "invidious discrimination". # # But was this assertion reasonable? If jury panelists were selected at random # from the county’s eligible population, there would be some chance variation. We # wouldn’t get exactly 26 Black panelists on every 100-person panel. But would we # expect as few as eight? In this practical, we will use statistical simulation to # investigate how plausible such an outcome would be, given the composition of the # wider population of potential jurors. # # # 2. Simulation and sampling # ================================================================================== # # Statistical simulation - or the Monte Carlo method(https://en.wikipedia.org/wiki/Monte_Carlo_method) # is a statistical technique where we artificially generate ("simulate") data in # order to explore sampling distributions, investigate the behaviour of statistical # methods, and test hypotheses. These methods are particularly useful when the # corresponding mathematical calculations are difficult or even impossible. Of # course, since our results are a product of how we set up the simulation we must # take particular care in setting up the problem and interpreting what we find. # # For this practical, we will use R and a bit of statistical thinking to examine # the disparity between the 8 out of 100 Black men in the jury panel, the # distribution of the actual jury, and the distribution in the population. # # 2.1 Simulating a jury # ---------------------------------------------------------------------------------- # # # We're going to approach the problem as follows: # # * Let $X_i$ be one of the $n=100$ potential jurors in the panel # * Then $X_i=1$ if that juror is Black and $X_i=0$ otherwise # * The jury panel comprises $n=100$ individuals are selected at random and # representatitvely from the population. # * Thus we assume that $X_i$ is drawn from a distribution where $P[X_i=1]=0.26$ to # represent the 26% of eligible Black people in the local population. # * $10$ people from the 100-person jury panel are then selected to be on the trial # jury. # # Under these assumptions, we can *simulate* or *randomly generate* possible jury # panels consistent with random selection from this population. If the panel were # truly selected at random then our simulated results should be close to those # observed in Swain's case. If the results of our simulation are not consistent # with the composition of the panel in the trial, that will be evidence against # our model assumptions (ie the model of random selection) and potential evidence # of bias. # # To begin with, we will simulate a jury using the `sample` function: # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ TECHNIQUE: # ^ # ^ The `sample` function allows us to take a sample of a specified size from a # ^ vector of values which is treated as the population. The sample function has the # ^ following syntax, which you can find by typing `?sample`: # ^ # ^ `sample(x, size, replace = FALSE, prob = NULL)` # ^ # ^ The arguments to this function are: # ^ # ^ * `x` a vector of values representing the population to be sampled # ^ * `size` the size of sample to take - in our case this is $n$ # ^ * `replace` - whether to sample with replacement (`TRUE`) or not (`FALSE`) # ^ * `prob` - a vector of probabilities for the selection of each element of `x`. # ^ If probabilities aren't specified here, then the elements of `x` are assumed # ^ equally likely. # ^ # ^ Note that the arguments `replace` and `prob` are given default values of `FALSE` # ^ and `NULL`. This means that if we don't specify a value, R will assume they take # ^ the defaults specified in the function definition. # ^ # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # # Exercise: # ~~~~~~~~~~~~~ # * Use the `sample` function to: # * Take a sample of size $n=100$ # * From a population of $[0,1]$ values # * With a probabilities $[0.74,0.26]$ # * Using replacement # * _Hint:_ last time we saw we can use the `c` function to create vectors from two or more constants. # Your code should return a vector of length 100, that will look something like the one below. # # [1] 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 1 # [56] 1 1 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 # # Exercise: # ~~~~~~~~~~~~~ # * What do the values $0$ and $1$ represent? # * Re-run your code from the previous exercise, but now save it as a variable `x` # * Compute the `sum` of the elements of `x` to find out how many Black jurors would be in your simulated jury pool. # * Are you close to the value of $8$ that was seen in the Swain case? # * How would you modify your code to compute the proportion of Black jurors in the panel? # # 2.2 Using a function to repeat your calculations # ---------------------------------------------------------------------------------- # # # As with other programming languages (such as Python), we can combine multiple # commands in `R` into our own custom functions that perform more complicated tasks # or calculations. Throughout the year, we will be writing our own functions to # solve specific problems and today we will learn the basic syntax of how to create # a function. # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ TECHNIQUE: # ^ # ^ When creating a new function, it needs to have a name, probably at least one # ^ argument (although it doesn’t have to), and a body of code that does something. # ^ At the end it usually should (although doesn’t have to) return a value or object # ^ out of the function. # ^ # ^ The general syntax for writing your own function is # ^ name.of.function <- function(arg1, arg2, arg3=2) { # function code to do some useful stuff return(something) # return value } # ^ # ^ + name.of.function: is the function’s name. This can be any valid variable name, # ^ but you should avoid using names that are used elsewhere in R, such as `mean`, # ^ `function`, `plot`, etc. # ^ + arg1, arg2, arg3: these are the arguments of the function. You can write a # ^ function with any number of arguments, or none at all. Essentially, this is the # ^ list of everything that is needed for the `name.of.function` function to run. # ^ Some arguments have default values specified, such as `arg3` in our example, # ^ which is set to `2` unless otherwise specified. Arguments without a default # ^ must have a value supplied for the function to run. You do not need to provide # ^ a value for those arguments with a default as they are considered as optional, # ^ and when omitted the function will simply use the default value in its definition. # ^ + **Function body**: The function code between the `{}` brackets is run every # ^ time the function is called. Note that unlike Python where the code _inside_ # ^ the function is _indented_, with `R` the code inside the function must be # ^ enclosed in curly braces `{}`. # ^ + **Return value**: The last line of the code is the value that will be returned # ^ by the function. It is not necessary that a function return anything, for # ^ example a function that makes a plot might not return anything (and so either # ^ state `return()` or omitting the `return` statement entirely), whereas a function # ^ that does a mathematical operation might return a number, or a vector. # ^ # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # For example, we can write a function to compute the sum of squares of two numbers as sum.of.squares <- function(x,y) { return(x^2 + y^2) } # and we can then evaluate sum.of.squares(3,4) # # Exercise: # ~~~~~~~~~~~~~ # * Write a function called `simulatePanel` which has no arguments, simulates a # 100 person jury panel, and returns the number of Black jurors it contains. # # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ TECHNIQUE: # ^ # ^ `replicate` is a function that repeatedly executes a fragment of code (an # ^ *expression*) a certain number of times. The function and the number of times # ^ its called are its two arguments: # ^ # ^ `replicate(n, expr)` # ^ # ^ evaluates the code `expr` repeatedly `n` times. For example, the following line # ^ of code will print `"Hello World"` to the console ten times: # ^ replicate(10, print("Hello world")) # ^ # ^ If the expression `expr` returns a value, then `replicate` combines the results # ^ together and returns them as a single object. # ^ # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # # Exercise: # ~~~~~~~~~~~~~ # * Use `replicate` with your `simulatePanel` function to simulate the number of # Black jurors on 5000 simulated jury panels. Save the results to a variable # called `panels`. # * What does each number in the vector `panels` represent? # * Draw a `hist`ogram of `panels` - how does this compare to the observation of $x=8$? # # 2.3 Using a binomial distribution # ---------------------------------------------------------------------------------- # # # A different (and simpler) way of viewing this problem is to recognise that we # have all of the conditions for a binomial distribution here. If we let the # number of Black jurors on the 100-person panel be $X$, then if we're sampling # independently and randomly with fixed $p=0.26$, then we should have that # X ~ Bin(100, 0.26). # # As you might expect from statistical software, R can directly generate random # numbers from common distributions such as the binomial. # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ TECHNIQUE: # ^ # ^ The `rbinom` function allows us to generate random observations from a # ^ binomial distribution. Its syntax is # ^ # ^ rbinom(n, size, prob) # ^ # ^ The arguments to this function are: # ^ # ^ * n - how many random numbers to generate. Note this is not the binomial sample size parameter n! # ^ * size - the $n$ parameter of the binomial distribution to use # ^ * prob - the probability of success p. # ^ # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # # Exercise: # ~~~~~~~~~~~~~ # * How would you use the `rbinom` function to simulate $X$, the number of Black # jurors in a jury panel of 100 individuals? # * How would you modify your code to simulate 5000 values of $X$? # # 2.4 Exploring the results # ---------------------------------------------------------------------------------- # # We've now randomly simulated a large number of potential jury panels consistent # with this area in Alabama. Let's do a little statistical analysis on our results. # You can use either the results from your own function or `rbinom` - it shouldn't # matter which. # # Exercise: # ~~~~~~~~~~~~~ # * Compute the `mean` and `var`iance of your simulated Xs. # * If X ~ Bin(100, 0.26) what are the theoretical mean and variance of X? # Do these values agree with your simulations? # * Does the simulated distribution of X the values appear to be consistent # with the observation x=8? # # Exercise: # ~~~~~~~~~~~~~ # # * What distribution would make a good approximation to that of X? Hint: # we're summing a large number of independent random variables. # * Re-draw/go back to your`hist`ogram of the values of $X$. Does the shape of # the histogram agree with your expectation? # Let's add the observed value of x=8 for reference. To do this, we will likely # need to modify our x-axis to have a larger range to ensure that 8 is visible # within the plot range, and learn how to add lines to an existing plot. # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ TECHNIQUE: # ^ # ^ To control the ranges of the horizontal and vertical axes, we can add the # ^ xlim and ylim arguments to our original plotting command. To set the horizontal # ^ axis limits, we pass a vector of two numbers to represent the lower and upper # ^ limits, xlim = c(lower, upper), and repeat the same for `ylim` to customise # ^ the vertical axis. For example, # ^ plot(x=1:10, y=1:10, xlim=c(-10,10), ylim=c(0,20)) # ^ # ^ We don't have to specify axis limits, and when omitted R will figure out something # ^ sensible for us. # ^ # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # # Exercise: # ~~~~~~~~~~~~~ # # * Re-draw your histogram using $x$ axis limits that go from 0 up to 50. # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # ^ TECHNIQUE: # ^ # ^ It is often useful to add simple straight lines to lines to plots, which can be # ^ achieved using the `abline` function. `abline` can be used in three different ways: # ^ # ^ + Draw a horizontal line: pass a value to the `h` argument, `abline(h=3)` draws # ^ a horizontal line at y=3 # ^ + Draw a vertical line: pass a value to the `v` argument, `abline(v=5)` draws # ^ a vertical line at x=5 # ^ + Draw a line with given intercept and slope: pass value to the `a` and `b` # ^ arguments representing the intercept and slope respectively; `abline(a=1,b=2)` # ^ draws the line at y=1+2x # ^ # ^ `abline` can be customised using any of the usual colour and line modifications # ^ using colour, line types and widths . # ^ # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ # # Exercise: # ~~~~~~~~~~~~~ # # * Add a vertical line at $x=8$ to represent the observation in the Swain jury # selection case. # * You can also customise your axis labels to be more readable - see the online help # for details. # * You can customise your line (and your histogram) using colour: # * Read the online help page on using colour in plots and experiment with redrawing # your histogram and vertical line using some custom colour. # * What do you conclude about the plausibility of observing $x=8$ at random from # this population? # You should end up with a plot that looks a little like the one on the webpage. # # Exercise: # ~~~~~~~~~~~~~ # # * Without using the normal distribution, how would you use your simulations to estimate P[X <= 8]? # * *Hints*: # * Think about how you might determine how many of your simulated jury panels have 8 or fewer Black jurors? # * You can use the `<=` operator to test every element of a vector. # * `sum` will treat all the values of `TRUE` in a vector as `1` and `FALSE` as `0` # * You may need to make many more simulations to get a non-zero estimate. # * Ask for help if you're stuck! # # This is evidence that the model of random selection of the jurors in the panel # is not consistent with the data from the panel. While it is possible that the # panel could have been generated by chance, our simulation demonstrates that it # is hugely unlikely. # # The reality of the trial panel is very much at odds with the model’s assumption # of random selection from the eligible population. When the data and a model are # inconsistent, then the model is hard to justify. After all, the data are real # but the model is just a set of assumptions. When assumptions are at odds with # reality, we must question those assumptions. # # Therefore the most reasonable conclusion is that the either the proportion of # eligible Black jurors was far smaller than stated or the assumption of random # selection is unjustified for this jury panel. As the proportion was not in doubt, # the most reasonable conclusion is that the jury panel was not selected by random # sampling from the population of eligible jurors. Notwithstanding the opinion of # the Supreme Court, the difference between 26% and 8% is not so small as to be # explained well by chance alone. # # # 3. Further Exercises: # ================================================================================== # # Exercise: # ~~~~~~~~~~~~~ # # * How would you use the techniques above to simulate the selection (without # replacement) of 10 trial jurors from the 100-person jury panel? Do your results # indicate whether it is plausible to obtain no Black jurors purely at random? # * Link up your two simulations! First simulate the selection of a representative # 100-person panel using `rbinom`, and then use your simulated panel to `sample` # a simulated trial jury without replacement. Revisit your analysis to see # whether this affects your conclusions.