As R is a statistical programming language it also provides the standard programming constructs such as loops, if statements, and the ability to write functions.
A function needs to have a name, probably at least one argument (although it doesn’t have to), and a body of code that does something. At the end it usually should (although doesn’t have to) return an object out of the function. The general syntax for writing your own function is
name.of.function <- function(arg1, arg2, arg3=2, ...) {
# function code to do some useful stuff
return(something) # return value
}
name.of.function: the function’s name. This can be any valid variable name, but you should avoid using names that are used elsewhere in R, such as mean, function, plot, etc.
arg1, arg2, arg3: these are the arguments of the function. You can write a function with any number of arguments, or none at all. These can be any R object: numbers, strings, arrays, data frames, or even other functions; anything that is needed for the name.of.function function to run. Some arguments have default values specified, such as arg3 in our example. Arguments without a default must have a value supplied for the function to run. You do not need to provide a value for arguments with a default, as they are optional; when they are omitted, the function simply uses the default value in its definition.
... argument: the ..., or ellipsis, element in the function definition allows other unspecified optional arguments to be passed into the function, which are usually passed on to another function. This technique is often used in plotting, but has uses in many other places.
Function body: the code within the {} brackets is run every time the function is called. This code might be very long or very short. Ideally functions are short and do just one thing; problems are rarely too small to benefit from some abstraction. Sometimes a large function is unavoidable, but usually it can in turn be constructed from a bunch of small functions. More on that below.

For example, we can write a function to compute the sum of squares of two numbers as
sum.of.squares <- function(x,y) {
return(x^2 + y^2)
}
and we can then evaluate
sum.of.squares(3,4)
## [1] 25
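The default values and the ... element described in the general syntax can be illustrated with a small variation on this function. This is only a sketch: power.sum and its power argument are hypothetical names invented for this example, and the extra arguments are simply forwarded to the built-in sum().

```r
# power.sum is a hypothetical helper, not part of the original text
power.sum <- function(x, y, power = 2, ...) {
  # power has a default value, so callers may leave it out;
  # anything captured by ... is passed unchanged to sum()
  sum(x^power, y^power, ...)
}

power.sum(3, 4)                  # power falls back to its default of 2
## [1] 25
power.sum(3, 4, power = 3)       # the default is overridden
## [1] 91
power.sum(3, NA, na.rm = TRUE)   # na.rm travels through ... to sum()
## [1] 9
```

Because na.rm is not a named argument of power.sum, it only reaches sum() through the ellipsis.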
Now, it’s not necessarily the case that you must use return() at the end of your function. If we don’t explicitly return something, then R will automatically return the results of evaluating the last statement inside the function. The reason you return an object (aside from making your code more readable) is if you’ve saved the value of your statements into an object inside the function. Variables created inside a function only exist within that function, and won’t appear outside in your workspace. See how it works in the following two examples:
fun1 <- function(x) {
3 * x - 1
}
fun1(5)
## [1] 14
fun2 <- function(x) {
y <- 3 * x - 1
}
fun2(5)
In the first function, I just evaluate the statement 3*x-1 without saving it anywhere inside the function. So when I run fun1(5), the result comes popping out of the function. However, when I call fun2(5), nothing is printed. That’s because the last statement in fun2 is an assignment, whose value R returns invisibly, and the object y that I saved my result into doesn’t exist outside the function. I haven’t used return(y) to pass the value of y outside the function. If I try to use y outside of the function, I will encounter errors because it only exists within the local environment of the function. I can return the value of y using return(y) at the end of the function fun2, but I can’t return the object itself; it’s stuck inside the function.
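As a minimal sketch of that last point, the hypothetical function fun3 below (a name introduced only for this illustration) is fun2 with an explicit return(y). The value comes back to the caller, where it can be stored under any name we like.

```r
# fun3 is a hypothetical variant of fun2 that returns its local result
fun3 <- function(x) {
  y <- 3 * x - 1   # y exists only while the function is running
  return(y)        # hand the value of y back to the caller
}

result <- fun3(5)  # keep the returned value under a new name
result
## [1] 14
```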
Conceptually, a loop is a way to repeat a sequence of instructions under certain conditions. Loops allow you to automate parts of your code that are in need of repetition. Typically, there are two types of loops. Loops which execute a prescribed number of times, as controlled by a counter or an index incremented at each iteration, are represented as for loops in R. Loops that are based on the testing of a logical condition at each iteration are while or repeat loops.
In R a for loop takes the following general form
for (variable in sequence) {
## code to repeat goes here
}
where variable is the name given to the iteration variable, which takes each value in the vector sequence in turn at each pass through the loop. Here is a quick trivial example, printing the square root of the integers one to ten:
for (x in 1:10) {
print(sqrt(x))
}
## [1] 1
## [1] 1.414214
## [1] 1.732051
## [1] 2
## [1] 2.236068
## [1] 2.44949
## [1] 2.645751
## [1] 2.828427
## [1] 3
## [1] 3.162278
As the example above suggests, there is often no need to explicitly write for loops to repeat calculations in R code, as most built-in functions and arithmetic can be evaluated for vector arguments anyway (and usually more efficiently). For the example above, simply evaluating sqrt(1:10) would give the answer we need for rather less typing:
sqrt(1:10)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
A break statement is used inside a loop (for, while) to stop the iterations and jump to the code outside of the loop. In a nested looping situation, where there is a loop inside another loop, this statement exits from the innermost loop that is being evaluated.
x <- 1:5
for (val in x) {
if (val == 3){
break
}
print(val)
}
## [1] 1
## [1] 2
Conversely, a next statement is useful when we want to skip the rest of the code in the current iteration of a loop without terminating the loop. On encountering next, R skips any further evaluation and starts the next iteration of the loop.
x <- 1:5
for (val in x) {
if (val == 3){
next
}
print(val)
}
## [1] 1
## [1] 2
## [1] 4
## [1] 5
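The condition-based loops mentioned earlier, while and repeat, follow the same pattern. The sketch below is illustrative only; the variable names count and total are made up for the example.

```r
# while: the condition is tested before every pass through the loop
count <- 1
while (count <= 3) {
  print(count)
  count <- count + 1
}
## [1] 1
## [1] 2
## [1] 3

# repeat: runs indefinitely until a break statement is reached
total <- 0
repeat {
  total <- total + 5
  if (total >= 12) {
    break
  }
}
total
## [1] 15
```

A repeat loop has no condition of its own, so it needs a break somewhere in its body to terminate.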
R Help: for
We saw above how to use a for loop to apply the same code to a collection of objects described by the sequence over which we are looping. However, we can achieve the same results by writing a function to perform the code within the body of the loop, and then applying that function to every element of sequence. Handily, R has a family of functions, the ‘apply’ family, which do exactly that. We will use the following two members of the apply function family:
sapply - applies a function to every element of a vector and returns a vector formed from the results
apply - applies a function to either the rows or the columns of a matrix (or data frame)
Each of these has an argument FUN which takes a function to apply to each element of the object. So, to replicate the simple example above using sapply, we would write
sapply(1:10, FUN=sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
Unlike a loop, sapply automatically returns its results as a vector (or whatever form is most natural) without us having to write code for that. Therefore, if we combine this technique with the ability to write our own functions then we have a very flexible way of re-writing a standard loop in a vectorised way. In general, using an apply-type function is to be preferred to a for loop, particularly when we want to keep the results of the calculations from each iteration. However, for loops are still useful and more natural in certain cases (where we do not want the output values, or where each iteration has a dependency on the calculations at the previous step).
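Both cases can be sketched briefly. In the example below, cube.plus.one, cubes, and balance are hypothetical names chosen for illustration: the first part passes a function of our own to sapply, and the second shows a calculation where each step needs the previous result, which reads more naturally as a for loop.

```r
# cube.plus.one is a hypothetical helper written for this sketch
cube.plus.one <- function(v) {
  v^3 + 1
}

# sapply collects one result per element into a vector for us
cubes <- sapply(1:4, FUN = cube.plus.one)
cubes
## [1]  2  9 28 65

# each value depends on the one before it, so a for loop fits well
balance <- numeric(4)
balance[1] <- 100
for (i in 2:4) {
  balance[i] <- balance[i - 1] * 2
}
balance
## [1] 100 200 400 800
```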
The apply function can be used to evaluate the same function for either every row or every column of a given matrix (or data frame). To apply the function over rows we supply the argument MARGIN=1, and to apply to each column we set MARGIN=2. We must also provide the function we wish to apply in the FUN argument.
For example, to calculate the means of each column in the mtcars data set, we could write
data(mtcars)
apply(mtcars, MARGIN=2, FUN=mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
Thus apply is very useful for quickly computing summaries and calculations across entire data sets.
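To see the difference between the two MARGIN settings side by side, here is a small sketch on a made-up 2 x 3 matrix (m, row.totals, and col.totals are names invented for this example).

```r
m <- matrix(1:6, nrow = 2)   # values are filled in column by column
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

row.totals <- apply(m, MARGIN = 1, FUN = sum)  # one result per row
row.totals
## [1]  9 12

col.totals <- apply(m, MARGIN = 2, FUN = sum)  # one result per column
col.totals
## [1]  3  7 11
```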
A standard programming construct is the if statement, which is used to tell R that we want to make a choice based on the results of a test.
if(test){
## do this code if TRUE
} else{
## do this code if FALSE
}
If the test is TRUE, then the code inside the if statement (i.e., the lines in the curly braces underneath it) is executed. If the test is FALSE, the body of the else is executed instead. Only one or the other is ever executed:
x <- -5
if (x >= 0) {
print("Non-negative number")
} else {
print("Negative number")
}
## [1] "Negative number"
We can chain a sequence of if and else statements together to consider a sequence of alternative test conditions:
x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else {
print("Zero")
}
## [1] "Zero"
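A chain like this is often convenient to wrap in a function so that it can be reused. The following is a sketch only: describe.sign and labels are hypothetical names, and the function relies on R returning the value of whichever branch is evaluated last.

```r
# describe.sign is a hypothetical helper, not part of the original text
describe.sign <- function(x) {
  if (x < 0) {
    "Negative number"
  } else if (x > 0) {
    "Positive number"
  } else {
    "Zero"
  }
}

describe.sign(-5)
## [1] "Negative number"
describe.sign(0)
## [1] "Zero"

# apply the same test to several values, one element at a time
labels <- sapply(c(-5, 0, 3), FUN = describe.sign)
```

Here labels holds one description per input value, in the same order as the inputs.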