--- title: "Miscada: ASML R Tutorial" author: "Jochen Einbeck" output: html_document: df_print: paged pdf_document: default html_notebook: df_print: paged word_document: default --- ## Software requirements If you work on a CIS computer, please simply open RStudio from the App Hub. If you work on your own laptop, you will need recent versions of R and RStudio installed. R is the actual programming engine. RStudio is simply an editor which is tailored to the use of R. Both [R](https://cran.r-project.org/) and [Rstudio](https://rstudio.com/products/rstudio/download/) are free software and can be downloaded and installed from the links given in this sentence. Note that R needs to be installed first, then RStudio. ## Resources The full material for this tutorial includes * this html document (this is "knitted" R notebook); * a [raw](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLTutorial.Rmd) version of this notebook, `ASMLTutorial.Rmd`, which you will use for your practical work; * a [complete](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLTutorial-Sol.Rmd) version of this notebook with solutions (to be provided at the end of this meeting); also in [HTML](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLTutorial-Sol.html). * a [Handout](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLHandout.pdf) (PDF) giving basics of R programming. ## Objectives of this tutorial * using R Notebooks; * workspace handling; * reading in data; * working with vectors, matrices, and data frames; * basic programming devices (such as if...then, for, while, apply, functions); * application to real data sets. ## Preparation Please create somewhere on your computer an empty directory with title `ASMLTutorial`. This will be our *working directory*. [Download](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLTutorial.Rmd) the file `ASMLTutorial.Rmd` (use the right mouse button), and place it in the directory that you have just created. Then open it in RStudio, by double-clicking on that file. If you had already opened this file without having created the directory, please close it again, create the directory, place this file in there, and open the file as described. We are now continuing our work using this .Rmd file. You can of course still use the html file alongside. ## Using R Notebooks This is a R Notebook. It is based on R Markdown and can be used for parallel typesetting and coding. It is best used in RStudio. *Note*: For the use of R Notebooks we will need the packages `rmarkdown` and `knitr`. In principle, RStudio should automatically install these if required. If this does not happen on your system for any reason (and you encounter any problems below), you can do this manually, via `install.packages("rmarkdown", "knitr")`. There are (basically) three ways of executing (`rendering') code from a R Notebook. * Chunk-by-chunk: A chunk of R code takes the shape (some random example code!) ```{r} x<-3 x D<- date() D DayofWeek<- substr(D, 1,x) # extracts the first x letters from date object cat("Today's day of the week is:", DayofWeek) ``` which you can execute by clicking on the green triangle at the top right of the chunk. **DO THIS**. * Row-wise. Click on any row of the code and then CTRL-ENTER. **DO THIS** for any of the rows in the chunk above. * You can render (`knit') the entire document, it produces a nice and tidy summary of all text, code, and results. **TRY THIS NOW**, that is click on the 'Knit' button at the top of this window. You can choose any of PDF, HTML, Word, or Preview, as output options. [*Note*: Preview does not actually execute any chunks, it just shows pre-existing output. My recommendation would be to set this option to `Knit to HTML`. This will produce a .html file in your workspace that you can open separately.] You can, of course, also edit this document yourself. Specifically, you can also create your own chunk, by clicking on the Insert icon (towards the right of top menu bar of this window). **DO THIS.** The syntax of R Markdown is largely self-explaining, detailed explanations are available at https://rmarkdown.rstudio.com/authoring_basics.html. There is no need to save or copy outputs of today's work into Latex, MS Word, etc. This document itself will be able to reproduce all your work done today. The best way to work through this tutorial is to go chunk-by-chunk. ## Workspace handling Every R session uses a *working directory*. This is some directory on your hard drive (or USB key, etc.) which R will use in order to save images, data, and your *workspace* (see below). R will also assume by default that any data sets that you attempt to read in are stored in this directory. By default, R Notebooks will use the location of the .Rmd file as working directory. Check the content ("workspace") and location of the current working directory via ```{r} ls() getwd() ``` This should return you the directory `ASMLTutorial` that you have created above. If this is not right, then the easiest way of fixing this is to close this .Rmd file and start again. *Notes*: * The working directory used within R chunks can be different from the working directory used by your current R console. For us, only the directory used by the chunks is of relevance. * If you are already somewhat familiar with R then you will know that the R working directory can, in principle, be changed via `setwd(``pathname'')`. However, this will not work for R Notebooks, as it will only change the directory of the current chunk! More experienced users can follow these [instructions](https://philmikejones.me/tutorials/2015-05-20-set-root-directory-knitr/) to change the working directory for all R chunks. You can, at any time, save the entire workspace for later use, by using the command `save.image(``filename'')`. Let's do this. Render ```{r} save.image("asmltut.RData") ``` then close RStudio and open it again. Then load the saved workspace back via ```{r} load("asmltut.RData") ls() ``` and check whether everything is there! (In our case it should obviously just be `x`, `D` and `DayofWeek`.) *Important (but rather confusing)*: RStudio opens a new R session to knit your R Notebook file. That is, even if you have some other objects (for instance from previous work) in the global environment (see top right RStudio window) then those objects will not be available when you knit your notebook. To see this, type for instance `test<-1` in your R console. You will see it shows up directly in the Global Environment. But, if you create a chunk which refers to `test` and then knit the notebook, you get an error. ## Reading in Data The first data set that we are going to investigate give the energy use (kg of oil equivalent per capita) over 135 countries from 1960 to 2010. *Energy use* is defined as the use of primary energy before transformation to other end-use fuels, which is equal to indigenous production plus imports and stock changes, minus exports and fuels supplied to ships and aircraft engaged in international transport. Source: [Worldbank](http://data.worldbank.org/indicator/EG.USE.PCAP.KG.OE) You can read the data in via ```{r} energy.use <-read.csv("http://www.maths.dur.ac.uk/~dma0je/Data/energy.csv", header=TRUE) ``` Alternatively, you can download the data from the given web address, place them in your working directory, end then call `energy.use <-read.csv("energy.csv", header=TRUE)`. Check now whether things have gone right. Try ```{r} dim(energy.use) ``` which should give you the dimension $135 \times 12$. Also try ```{r} head(energy.use) ``` in order to see the first six rows. ## Working with vectors, matrices and data frames The object `energy.use` is a *data frame*. You can check whether or not an object is a data frame by typing `class(object)` or `is.data.frame(object)`. Try this for the object `energy.use` in the chunk below. ```{r} class(energy.use) is.data.frame(energy.use) ``` It is easy to access individual rows, columns, or elements of a data frame. For instance, ```{r} energy.use[127,] energy.use[,49] energy.use[127,49] ``` will give you the 127th row; 49th column; and the entry of the 127th row and the 49th column, respectively (this is the UK energy consumption in 2007). You can also access columns directly through their column names, such as ```{r} energy.use$X2007 ``` Data frames are very important as they are the standard form in which data are expected by many R functions, such as `lm`, `glm`,.... Let us now simplify the data frame a little bit, so that it is easier to use for the applied work. We reduce our interest to the energy consumption in the years 2001 and 2007. We do this via ```{r} energy <- energy.use[,c("X2001", "X2007")] ``` Also, we would like to give the rows and columns of the new data frame meaningful names. Please type ```{r} rownames(energy)<- energy.use[, 1] colnames(energy)<- c("use01", "use07") ``` in order to specify row and column names, respectively. Then type `energy` to look at your final data frame. This data frame allows to access information quickly. For instance, ```{r} energy["United Kingdom",] ``` gives you the UK values of energy consumption. **DO THIS** for a couple of countries. ```{r} energy["Spain",] energy["China",] ``` ## Basic programming devices One may be interested in looking at these data in a form in which they are *ordered* by their energy consumption. This can be done using ```{r} order(energy$use07) ``` which gives you a list of numbers. The first number tells you the index (here: 39) of the country with the smallest per-capita energy consumption (here: Eritrea), and typing `energy[order(energy$use07),]` gives you the full ordered list. In the chunk below, save this ordered data frame into a new data frame `senergy'. ```{r} senergy <- energy[order(energy$use07),] ``` Next, we wish to identify the nations with extremely large energy consumption, say, more than 10000 kg of oil per capita (Intuitively, what do you think, which countries will this be?). Calling ```{r} energy$use07 > 10000 ``` will give you a vector of logical values, with a `TRUE` for each country for which this condition is met. The command ```{r} sum(energy$use07 > 10000) ``` will tell you how many these are, and ```{r} which(energy$use07 > 10000) ``` will give you the index numbers of these countries. From this, we would get the data rows corresponding to these countries via ```{r} energy[which(energy$use07 > 10000),] ``` We would like to compare the energy use in 2001 and 2007. Do the same as above but now use the condition `energy$use01 > energy$use07` instead. Observe and understand the information that you gain at each step. ```{r} energy$use01> energy$use07 sum(energy$use01> energy$use07) which(energy$use01> energy$use07) energy[which(energy$use01> energy$use07),] ``` A very useful tool to carry out repeated operations is the `for` command (see [Handout](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLHandout.pdf)!). Task: Implement a loop which, for all 135 countries, writes a text like
In 2007, the energy use in *country* was equivalent to *value* kg oil per capita.
  ```{r} for (i in 1:135){ cat("In 2007, the energy use in ", rownames(energy)[i], " was equivalent to", energy[i,2], "kg oil per capita.", "\n") } ``` Another command for repeated operations is `while`. It does not have a fixed number of loops, but proceeds until a certain condition is met. For instance, consider the ordered frame `senergy` created above. Assume we are interested in the following question: If we take exactly one person from each of the countries with the smallest energy use, i.e. one person from Eritrea, one person from Bangladesh, etc., then how many persons are needed in order to achieve the same use of energy as a single person in Qatar? To answer this, create objects `i` and `sum07` and assign them the initial value 0. Then use the `while` function (see [Handout](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLHandout.pdf)) with *condition* `sum07< senergy["Qatar",2]` and *action* `i <- i+1; sum07 <- sum07+ senergy[i,2]`. Make it clear to yourself what each row does. Also, interpret the result. ```{r} energy["Qatar",] i <-0 sum07 <-0 while(sum07< senergy["Qatar",2] ){ i=i+1 sum07<- sum07+ senergy[i,2] } i sum07 # So individuals from the 41 least-consuming countries use less energy per captita than one single individual in Qatar! ``` Use `apply` to compute a vector which contains, for each country, the larger of the two energy consumption values given for 2001 and 2007. Consult the see [Handout](http://www.maths.dur.ac.uk/~dma0je/PG/Mix/MSc/ASMLHandout.pdf) and the corresponding help file (via `help(apply)` or `?apply`) if you are unsure how to do this. ```{r} apply(energy,1,max) ``` Use `hist` and `boxplot` to create histograms and boxplots of the variables `use01` and `use07`. Comment on the distributional shape. ```{r} boxplot(energy$use01, energy$use07) par(mfrow=c(2,1)) hist(energy$use01) hist(energy$use07) ``` Next, add logarithmic versions of these variables, say `luse01` and`luse07`, to the data frame via ```{r} energy$luse01<- log(energy$use01) ``` and for`use07` analogously. Repeat the previous question using the transformed variables. What can we say about the distribution of these transformed variables, compared to the original ones? ```{r} energy$luse07 <- log(energy$use07) boxplot(energy$luse01, energy$luse07) par(mfrow=c(2,1)) hist(energy$luse01) hist(energy$luse07) ``` ## Simple clustering via kmeans Next, we consider a data set featuring $n=82$ observations of galaxy velocities. Load the `galaxies` data, read the associated help file, and create a histogram using the option `breaks =18` in function `hist`. ```{r} data(galaxies, package="MASS") ?galaxies hist(galaxies, breaks=18) ``` For both data sets, the dominating feature is the presence of multiple modes or `clusters'. It is a relevant problem in Statistics and Machine Learning to identify such clusters, and also find the corresponding cluster centers. A simple method to do this is the **k-means algorithm**. See for instance [this resource](https://bigdata-madesimple.com/possibly-the-simplest-way-to-explain-k-means-algorithm/) for a quick introduction into this algorithm. In R, this algorithm is implemented in the function `kmeans`. The algorithm requires the specification of the number of clusters in advance, through the argument `centers`. Study the help file of `kmeans` and then apply this function onto the `luse01` and `galaxies` data. You will need only the first two arguments of `kmeans`. For the choice of the number of clusters, you can use visual inspection as a guide for your choice. ```{r} ?kmeans kmeans(energy$luse01, centers=2) kmeans(galaxies, centers=5) ``` Have a look at the produced output and try to understand and interpret it in the light of the graphical representations of the data presented earlier. This was an example for the application a very simple clustering technique, in the one-dimensional (univariate) case. In one of the lectures of ASML2, we will pick up from here and consider a more elaborated clustering method (mixture models). Finally, note that clustering is an `unsupervised' learning technique, since the algorithm needs to make decisions without having seen (not even for training samples) the true allocation of samples to clusters/classes. Such methods are called *unsupervised learning*. Most of the material dealt with in ASML will be set in the world of *supervised learning*, where training samples with true and known class labels (or output values) are available. ## Further resources If you would like to do do a bit more to improve your R skills, we recommend the following resources: * [https://education.rstudio.com/](https://education.rstudio.com/) * [https://www.jaredknowles.com/r-bootcamp](https://www.jaredknowles.com/r-bootcamp) ## Thanks for your participation, and see you in February! #### The ASML Lecturers ASML1: Ian H. Jermyn, i.h.jermyn@durham.ac.uk ASML2: Jochen Einbeck, jochen.einbeck@durham.ac.uk ASML3: Louis Aslett, louis.aslett@durham.ac.uk