This practical will introduce you to working with R
as a
statistical programming and analysis language, as well as going through
the basics of how to use the RStudio
editor which is a
convenient and helpful interface to R
.
You will be familiar with the basics of programming with Python, but there are some important differences between R and Python. Additionally, R has a lot more support for statistics in terms of the functions, data plotting, and packages that are available.
If you’re also taking Data Science & Statistical Computing 2, then you may already be familiar with much of this introductory material - feel free to skip over sections that you already know and have a go at the ‘Additional Exercises’ at the end.
You should make sure you complete up to and including Section 5 before moving on.
c
,
seq
,mean
, sd
,
var
, min
, max
plot
, a histogram with
hist
, and a boxplot with boxplot
.R is an open source programming language for statistical computing and graphics that is widely used among statisticians for data analysis. As R is a programming language and not a program itself (like Excel) we will need some programming skills, as even the simplest data analysis will require writing and evaluating small fragments of code to carry out the analysis we require. To make life easier, we will be using RStudio as an interface to write, edit, and evaluate our R code.
Hopefully you will have installed your own copy of R and RStudio, or have opened it using the AppHub or AppsAnywhere. If not, check the instructions here.
R
file.
You’ll now see that the default RStudio environment is divided in 4 panels, with the default arrangement illustrated below. Going anticlockwise, the four panels are: Code; Console; Plots, Help, and Files; and Workspace and History.
If you want to rearrange which panes go where, go to the Options dialog Tools > Options menu on Windows (RStudio > Preferences on a Mac). You’ll find a ‘Panes’ tab where you can rearrange the layout to your preference.
Code Editor – the main editor for your
R code. This should contain the practical script file that you
have just downloaded and opened. This pane is hidden at first, however
when you load or start a new R script file it will be
displayed. R code can be entered here, and it provides support such as
auto-complete using [Tab]
, colour highlighting, and
additional buttons and menu items to help edit and evaluate your code.
In particular, a single line of code (or any selected block of code) can
be evaluated in one go by typing [Ctrl]+[Enter]
([Cmd]+[Enter]
for a Mac), and the whole file can be
evaluated with [Ctrl]+[Shift]+[S]
([Cmd]+[Shift]+[S]
for a Mac).
R Console – this pane is where your code from the Editor is evaluated by R. You can also use the console here to execute quick calculations that you don’t need to save. Commands entered in the Console tab are immediately executed by R, and the results displayed on the following line. In this way, R can be used as a simple calculator by typing directly into the Console window. However, for more complex calculations with many steps it is preferable to write the code in a script file using the Code pane first, and then evaluate it in the Console. Note: when in the Console pane, you can use the (up arrow) and (down arrow) keys to navigate through previous commands (e.g. to correct mistakes).
Plots, Help, and Files – this pane has multiple roles indicated by the tabs along its top. The Plots tab will show the results of any plots you produce in R, as well as buttons to save, clear, and backtrack through previous plots. The Help tab is where RStudio will display any built-in help files. The Files tab provides a simple file viewer to quickly navigate between and open files.
Workspace and History – this pane has two functions, also indicated by tabs. The Environment tab lists all the variables you have currently available in this session of R, along with their types and values. The History tab shows a list of all the R commands you have evaluated in the console.
For more detailed help on the RStudio programming environment, see the RStudio cheatsheet.
When we just want to do small and quick calculations, we can type R commands directly into the console window. These commands will be evaluated immediately and the answer returned on the next line.
Let’s try some simple arithmetic:At the >
prompt in the console you can type numerical
expressions as you would into a calculator, hit [Enter] and
R
will print the answer.
R can be used as a calculator to perform the usual simple arithmetic operations. The operators are:
+
addition-
subtraction*
multiplication/
division^
raise to the power (alternatively **
also works)%%
modulus, e.g. 5 %% 3
is
2
%/%
integer division, e.g. 5 %/% 3
is
1
Many standard mathematical functions are also available:
abs(x)
- the absolute value of x
sqrt(x)
- the square root of x
log(x)
- the natural logarithm of x
(use
log10
for base-10)exp(x)
- the exponential of x
, i.e. \(e^x\).sin(x)
, cos(x)
, tan(x)
- the
sine, cosine, and tangentWhen we want to save the code for our calculations or the calculation is long and requires many steps, its better to write the code in a script file in the Editor pane. Then, when the code is ready, we can evaluate it either line-by-line, or all in one go.
R
command to find \(\sum_{i=1}^5
i^i\).RStudio
gives us
some shortcuts.Run
button at the top of the Editor window to
execute the current line[Ctrl]+[Enter]
In both cases, the line of code is copied to the console and
evaluated, and the cursor advances to the next line of executable code.
Pressing [Ctrl]+[Enter]
will step through your code one
line at a time, which is useful to find any bugs. Alternatively, to run
multiple lines at once simply highlight the lines you want to run and
click Run
or press [Ctrl]+[Enter]
. We can also
save and evaluate an entire script file by pressing
[Ctrl]+[Shift]+[S]
. This is useful when you’ve finished a
long calculation and don’t want to step through it line-by-line.
During an R session we work with and create variables. Variables can
be a scalars, vectors, matrices, functions, or lists. In order to create
a new object we use the assign symbol <-
, although one
can often use =
instead. For example, to create the object
a
with a value of 2 we can type:
a <- 2
or
a = 2
R has many variables and functions available by default, and many many more that can be downloaded using special libraries.
For each of these standard objects R provides online help, which can
be dislpayed in the Help window. To access this help type ?
followed immediately by the name of the object in question.
pi
.
pi
in the console.?
to bring up the help for pi
(you
don’t need to read it)
Vectors are the most basic type of variable in R, and all numerical
variables are created as vectors - even scalars are vectors of length 1.
Formally, R
stores a vector as an ordered list numbers, and
R
contains many linear algebra tools that allow us to
perform basic vector calculus (one of the many reasons why
R
is preferred to Excel).
The object you have just created, a
, is a vector of
length 1. When you type a
and press [Enter]
you will see the following output (without the ##
):
## [1] 2
The [1] is a number indicating which element of the vector starts the current line. In this example this is unnecessary as we only have a vector of length one. However, if we consider a bigger vector, such as the integers from 1 to 50:
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
we can see that the vector covers multiple lines, the second line begins with element 24 of the vector, and the final line begins at element 47.
The most basic way to create a vector is to use th c
function to ‘c
ombine’ several values of the same type into
a vector. For example,
x <- c(1,2,3)
R provides some simple functions for quickly creating numerical vectors.
We can use the colon :
operator to create integer
sequences between two values and return the result as a vector:
y <- 1:10
Vectors can also be combined with the c
function:
z <- c(x,y)
The seq
function also generates a vector containing a
sequence between its two arguments from
and
to
, but is more sophisticated than :
. We can
dictate the length of the sequence by supplying the optional
length
argument, or the step size in the sequence by
passing a value to the by
argument. If we supply neither
length
nor by
, then seq
gives an
integer sequence like :
.
y <- seq(1,9) ## same as 1:9
seq(1,10,length=25) ## sequence of given length
seq(15,45,by=3) ## sequence of given step
See the RHelp on vectors for more information.
Using the techniques above:
R
and call it v1
.a
that you
defined before, and without typing the numbers 4 or 6. Hint:
multiplication also works for evey element in a vector.Assigning names to the objects you create is very important so that they can be re-used. To edit or repeat previous commands in the console, we can simply press the (up arrow) and the (down arrow) keys to navigate through previous commands (e.g. to correct mistakes). However, it is often much easier to use the Editor, rather than the console, to debug any problems.
Most real data will have to be imported into R
from an
external data file, or will be provided by a library or package. For
these practicals, we have already tidied and processed the data, so we
just need to load the data file.
Download data: hospital
First, right-click and save the data from the link above to the same place as your practical script file. Then you can load the file by any of the following methods (in descending order of difficulty):
load
function and give it the path to the
file as an argumentThis will create a hospital
data set object in R for you
to use in the following.
You should now view the data either in the console by typing
hospital
or in a spreadsheet layout with the View
function:
View(hospital)
To see a brief description of the data set and its variables, see this page.
A data frame which is a two-dimensional table of data (like
a matrix) where each column contains values of one variable, and each
row contains one set of values from each observation. Each column is
labelled with a variable name, and the values within each column must be
of the same type but the types of data held in each column can differ
according to the type of variable it represents. The data set
hospital
that we have just loaded is a data frame.
The rows and columns of data frames can be extracted as vectors
allowing us to easily perform calculations of summary statistics, for
example. In the hospital
data, we see that the data
contains 3 columns each with different names.
We can extract individual columns from a data frame by using the
variable name and the dollar-sign $
. For example, to
extract the beds
column from hospital
we would
type:
hospital$beds
We can extract a vector containing all of the column names for a
given data frame using the names
function:
names(hospital)
names
function to extract the names of the
variables in the hospital data.beds
and discs
containing the data for the number of beds, and number of discharges in
the hospital data respectively.
When a vector represents numerical data, there are a number of standard functions that will be useful for any statistical calculations:
sum
sums all values in the vectormean
computes the sample mean, i.e. \(\frac{1}{n} \sum_{i=1}^n x_i\). Note that
if given a data matrix, mean
will average all
elements rather than computing the sample mean vector.median
computes the sample median valuemin
and max
compute the sample minimum and
maximumrange
computes the min and max values.var
and sd
compute the sample
variance and standard deviation, i.e. \(s^2=\frac{1}{n-1} \sum_{i=1}^n
(x_i-\bar{x})^2\) for single variables. var
will
produce the sample variance matrix if supplied with a data matrix.quantile
computes the min, max, median, and lower and
upper quartiles. Other quantiles can be computed using the
probs
argumentsummary
calculates the min, max, mean, median, and
quantiles.
One of R
’s greatest strengths is the facility it
provides for producing effective graphics. In the following tasks we
will use some of the key plotting tools to study the
hospital
data.
The plot
function produces a scatterplot of its two
arguments. Suppose we have saved our \(x\) coordinates in a vector a
,
and our \(y\) coordinates in a vector
b
, then to draw a scatterplot of \((x,y)\) we type
plot(x=a, y=b)
If the argument labels x
and y
are not
supplied, R will assume the first argument is x
and the second is y
. If only one vector of data is
supplied, this will be taken as the \(y\) value and will be plotted against the
integers 1:length(y)
, i.e. in the sequence in which they
appear in the data.
All of the standard plot functions can be customised by passing additional arguments to the function. For instance, we can add a plot title and axis labels by supplying optional arguments:
main
- provides a title to display at the top of the
plotxlab
- provides a label for the horizontal axisylab
- provides a label for the vertical axisFor example,
plot(x=a,y=b, xlab='A', ylab='B', main='Plot of B vs A')
Note: Once a plot has been drawn, it is not possible to
erase any features from it - we can only add extra lines or points to
it. So, if you make a mistake drawing your plot then you’ll need to
start over with a fresh one by calling plot
again.
plot
. The scatterplot will appear in
the ‘Plots’ sub-window.Beds
and
Discharges
.A histogram
consists of parallel vertical bars that graphically shows the frequency
distribution of a quantitative variable. The area of each bar is
proportional to the frequency of items found in each class. To plot a
histogram, we use the hist
function and apply it to a
single vector of data
hist(x)
As with plot
, we can use main
and
xlab
to set the plot title and horizontal axis label.
Histogram also takes a number of arguments specific to the plotting of histograms:
breaks
- allows us to control the number of bars in the
histogram. If breaks
is set to a single number, this will
be used to (suggest) the number of bars in the histogram. If
breaks
is set to a vector, the values will be used to
indicate the endpoints of the bars of the histogram. Note:
R
interprets this number as a suggestion only.freq
- if TRUE
the histogram shows the
simple frequencies or counts within each bar; if FALSE
then
the histogram shows frequency densities rather than counts.discharges
from the
hospital
data.beds
data. What do you
notice about the two variables?A boxplot provides a graphical view of the median, quartiles,
maximum, and minimum of a data set. A box plot is a bit like a histogram
seen from above. The full name is ‘box and whisker’ plot. The ‘box’
shows the middle 50% of the observations, from the first to the third
quartile. Inside the box, the line shows the median (which is the second
quartile). The ‘whiskers’ show the full range of the data, although in
R
there is the possibility of excluding outliers, which
appear as individual points.
To draw a boxplot of a single variable we simply pass the data
directly to the boxplot
function:
boxplot(x)
To draw multiple boxplots within the same plot, we can pass multiple
vectors separated by commas as arguments
(e.g. boxplot(x,y,z)
). We can even pass an entire dataframe
to the boxplot function, and it will draw boxplots of all the columns
within a single plot.
When performing a statistical analysis it is important that you make your graphics informative as possible. This will aid you, as the statistician, in spotting relation- ships and performing the analysis, and will aid those to whom the analysis is going to be communicated. Often this involves adding simple things like colour or tweak- ing some of the default settings of the plotting region. In other cases we might plot things together to facilitate easy comparison. We’ll now explore some simple ways of improving the plots we constructed earlier.
par
function to combine multiple plots.
hospital
data. Does this make comparison of
the distributions easier?
Download data: movies
Multivariate statistics inevitably involves working with large data
sets. The effort involved in preparing and making a large dataset usable
for analysis should not be underestimated, but thankfully we’re going to
look at a dataset that has been “prepared earlier”. This
movies
data set is reasonably large(ish), containing 24
different attributes of 28819 movies gathered from IMDB. One of the variables is the movie
length in minutes, and it is interesting to look at this variable in
some detail.
In this exercise we’ll use the basic techniques above to investigate the behaviour of the data, and see if we can find (and explain) any interesting features. This process is known as data exploration or exploratory analysis, and is usually the precursor to more formal statistical analyses.
load
function by specifying the file location.movies
data set, load it into R, and
draw a histogram of the length
variable.rug
function. Add a rugplot to
the histogram - what problems does this highlight?Although it’s tempting to dismiss these as simple errors, it is worth checking if possible.
r1
to r10
give the
percentage of reviews which rated the movie as a 1
up to a
10
out of 10
. Are these movies particularly
popular?Incidentally, this data set is no longer up-to-date and there are some even longer films now (though it’s a mystery why.)
Clearly, the extreme values are not representative of most movies and so - for not - let’s ignore them. For exploring the main distribution of movie lengths it makes sense to set some kind of upper limit. Over 99% of the data are less than three hours in length, so let’s restrict ourselve to those.
Useful context: