Basic R - Introduction - Data Types and Structures
November 04, 2015
Introduction
Hello guys,
Let’s start talking about R here. For those who don’t know it, R is a programming language designed for data analysis, statistics, data mining and machine learning, among other uses. It was created by Ross Ihaka and Robert Gentleman and is based on the S language (hence the name).
To install R, check the link: link
It’s also recommended to use RStudio, a great IDE for R that you can find here: link
With both installed, you are ready to open RStudio and start writing your R scripts and commands.
Basic functionality
When you open RStudio, this is what it looks like:
You can start writing your code in the console. Type "Hello, world" (with the quotes) and see what happens:
"Hello, world"
## [1] "Hello, world"
R will show what you wrote, along with this [1]. We’ll soon see what this [1] means, but as we can see, R automatically prints the result of your commands. Try a sum:
3 + 5
## [1] 8
Now R shows the result of the sum. R has the following basic operations:
Sign or function | Operation |
---|---|
+ | Addition |
- | Subtraction |
/ | Division |
* | Multiplication |
^ | Power |
sqrt() | Square root (a function, not a sign) |
Let’s see some examples of them:
4+5
## [1] 9
7-2
## [1] 5
6*8
## [1] 48
9/3
## [1] 3
4^3
## [1] 64
sqrt(25)
## [1] 5
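As a small sketch of how these operations combine, the expressions below rely on the usual precedence rules that R follows: powers first, then multiplication and division, then addition and subtraction, with parentheses overriding that order. The numbers are arbitrary and chosen only for illustration.

```r
2 + 3 * 4^2        # 4^2 is evaluated first, then 3 * 16, then 2 + 48
(2 + 3) * 4^2      # the parentheses are evaluated first: 5 * 16
sqrt(16) / 2 - 1   # sqrt() is a function call: 4 / 2 - 1
```

Typing each line in the console prints its result, just like the single operations above.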
Variables and basic data types
To store a value in a variable in R, we use the <- operator. In this case, the automatic print does not occur; to check what is stored in the variable, you just type its name, or use print(name_of_variable). Let’s see:
a <- 5 + 3
a
## [1] 8
print(a)
## [1] 8
This is a numeric variable, as the class() function will tell us. If you want to work with integers, you have to define them as integer with the as.integer() function:
var1 <- 3
var2 <- as.integer(3)
class(var1)
## [1] "numeric"
class(var2)
## [1] "integer"
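A minimal sketch of the numeric/integer distinction follows. It assumes the `L` suffix, which is R's shorthand for writing an integer literal and is not used elsewhere in this post; the variable names are made up for the example.

```r
x_double <- 3                # plain numbers are stored as "numeric" (double)
x_int    <- 3L               # the L suffix creates an integer directly
x_conv   <- as.integer(3.9)  # as.integer() drops the fractional part

class(x_double)
class(x_int)
x_conv
is.integer(x_int)
```

Note that as.integer() truncates instead of rounding, so it is worth checking the values before converting.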
In R, data can also be of type character, which represents text (known as a string in other programming languages), and of type logical (also known as boolean), which takes the values TRUE or FALSE (or the abbreviations T and F). Examples:
b <- "Hello world"
c <- TRUE
class(b)
## [1] "character"
class(c)
## [1] "logical"
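As a short illustration of where logical values usually come from, the sketch below builds one from a comparison. It assumes the base function nchar(), which counts the characters of a string; the variable names are hypothetical.

```r
greeting <- "Hello world"
is_long  <- nchar(greeting) > 5   # a comparison returns a logical value

class(greeting)
class(is_long)
is_long

# Logical values behave like 1 (TRUE) and 0 (FALSE) in arithmetic
TRUE + TRUE
```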
To finish this part, R also has factors. Factors are used to represent categories. This means that the values of a factor variable come from a limited set of options. Think, for example, about military ranks. There is a defined number of ranks, and each value will be one of these predefined ranks. A person’s name, by contrast, has unlimited possibilities: anybody can create a new name. To create a factor, let’s use the military rank example with the factor() function:
military <- c("Private","Private", "Colonel","General", "Lieutenant",
"Lieutenant", "Sergeant","Private","Private","Sergeant",
"Sergeant","Private")
military <- factor(military)
print(military)
## [1] Private Private Colonel General Lieutenant Lieutenant
## [7] Sergeant Private Private Sergeant Sergeant Private
## Levels: Colonel General Lieutenant Private Sergeant
table(military)
## military
## Colonel General Lieutenant Private Sergeant
## 1 1 2 5 3
When we print the vector, we see that R shows us the levels. These are the unique values present in the vector. Using the table function, we can check the count for each category (rank, in this case). Notice that the levels are in alphabetical order by default. We can rearrange this order if there is one that makes more sense, like this:
military <- factor(military, levels=c("Private","Sergeant","Lieutenant",
"Colonel","General"))
table(military)
## military
## Private Sergeant Lieutenant Colonel General
## 5 3 2 1 1
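To see what a factor actually stores, the sketch below builds a smaller rank vector (the name `ranks` is hypothetical) and inspects it with levels(), nlevels() and as.integer(). Each value is kept as an integer code that points to one of the levels.

```r
ranks <- factor(c("Private", "General", "Sergeant", "Private"),
                levels = c("Private", "Sergeant", "General"))

levels(ranks)       # the allowed categories, in the order we defined
as.integer(ranks)   # the integer code of each element
nlevels(ranks)      # how many categories exist
table(ranks)        # counts follow the level order, not the alphabet
```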
Data Structures in R
Now, let’s talk about data structures in R.
Vectors
Vectors are a simple sequence of elements of the same type. When we define a variable as we did on the previous examples, a vector with one element is created. Observe:
var1 <- 3
is.vector(var1)
## [1] TRUE
That is why there is a [1] next to the value of the variable you defined: it is the index of the first element shown on that line of output, and here the vector has only one element.
To create vectors with more than one element, we need to put them inside the c() function. See the following example:
var1 <- c(3,6, 7.8, 332)
print(var1)
## [1] 3.0 6.0 7.8 332.0
var2 <- c("Hello","how","are")
print(var2)
## [1] "Hello" "how" "are"
var3 <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
print(var3)
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
If you try to mix the types, R will coerce some elements so that they are all of the same type:
var4 <- c(3, 6, 9, "lettuce")
print(var4)
## [1] "3" "6" "9" "lettuce"
class(var4)
## [1] "character"
var5 <- c(TRUE, FALSE, 1, 3)
print(var5)
## [1] 1 0 1 3
class(var5)
## [1] "numeric"
In the first case, R converted the numeric values into character values, because character is the only type that can represent all the values. In the second case, TRUE and FALSE can also be represented as 1 and 0, respectively, so R adopted these values and the vector is numeric.
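The coercion described above can be sketched as a chain: logical values are promoted to numeric, and both are promoted to character when text is present. The vector names are made up for this example.

```r
v_logical   <- c(TRUE, FALSE)
v_numeric   <- c(TRUE, 2.5)        # the logical value is promoted to numeric
v_character <- c(TRUE, 2.5, "a")   # everything is promoted to character

class(v_logical)
class(v_numeric)
class(v_character)
v_character
```

Checking class() after building a vector is a quick way to notice an unintended coercion.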
Subsetting: to select one or more items from a vector, you need to use square brackets ([ ]). To select a range of consecutive items, separate the first and last positions with a colon (:):
var4[1]
## [1] "3"
var5[2:4]
## [1] 0 1 3
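Square brackets accept more than a single position or a range. The sketch below, using a hypothetical vector called `scores`, shows a few other index forms that base R supports.

```r
scores <- c(10, 20, 30, 40, 50)

scores[c(1, 4)]      # specific positions: the first and the fourth
scores[2:3]          # a consecutive range
scores[-1]           # a negative index drops that position
scores[scores > 25]  # a logical condition keeps the matching elements
```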
Matrices
Matrices are structures that correspond to their mathematical counterparts: groups of elements organized in rows and columns. Like vectors, all of their elements have to be of the same type. There are a few ways to create a matrix:
mat1 <- matrix(
c(1,5,10,30,15,8),
nrow=3,
ncol=2,
byrow=TRUE)
print(mat1)
## [,1] [,2]
## [1,] 1 5
## [2,] 10 30
## [3,] 15 8
vec1 <- c(3,4,5)
vec2 <- c(9,10,11)
mat2 <- rbind(vec1, vec2)
print(mat2)
## [,1] [,2] [,3]
## vec1 3 4 5
## vec2 9 10 11
class(mat2)
## [1] "matrix" "array"
In the first one, we use the matrix function. The nrow and ncol arguments indicate the number of rows and columns, and byrow defines whether the matrix is filled by row (TRUE) or by column (FALSE).
In the second one, using the rbind function, note that R named the matrix rows after the vectors that were used to create it, instead of numbering them.
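To make the effect of byrow concrete, this sketch fills two matrices from the same values and compares their first rows. It also uses cbind(), the column-wise counterpart of rbind(), which is an addition to the examples in this post; all names are hypothetical.

```r
values <- c(1, 5, 10, 30, 15, 8)

by_row <- matrix(values, nrow = 3, ncol = 2, byrow = TRUE)
by_col <- matrix(values, nrow = 3, ncol = 2, byrow = FALSE)

by_row[1, ]   # first row when filling row by row: 1 5
by_col[1, ]   # first row when filling column by column: 1 30

# cbind() binds vectors as columns instead of rows
mat3 <- cbind(c(3, 4, 5), c(9, 10, 11))
dim(mat3)     # 3 rows, 2 columns
```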
To select an item from a matrix, you need to use square brackets, supplying first the row and then the column that you want, separated by a comma:
mat1[2, 1]
## [1] 10
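The same row-and-column notation also selects whole rows or columns when one side of the comma is left empty. A minimal sketch, rebuilding the first matrix under the hypothetical name `mat`:

```r
mat <- matrix(c(1, 5, 10, 30, 15, 8), nrow = 3, ncol = 2, byrow = TRUE)

mat[2, 1]     # row 2, column 1
mat[2, ]      # the whole second row
mat[, 2]      # the whole second column
mat[1:2, ]    # the first two rows, still a matrix
```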
Lists
Lists are a special kind of vector that can hold elements of different types, including other vectors. To show what we are talking about, I will create a list based on some vectors.
a <- c(3,6,9)
b <- c("a","b","c","d")
c <- c(TRUE, FALSE, TRUE, TRUE)
list1 <- list(a,b,c)
print(list1)
## [[1]]
## [1] 3 6 9
##
## [[2]]
## [1] "a" "b" "c" "d"
##
## [[3]]
## [1] TRUE FALSE TRUE TRUE
We can also retrieve a slice of the list with square brackets:
list1[2]
## [[1]]
## [1] "a" "b" "c" "d"
class(list1[2])
## [1] "list"
But, as we can see, what is returned is a new list. To retrieve the element itself, we should use double square brackets ([[ ]]):
list1[[1]]
## [1] 3 6 9
class(list1[[1]])
## [1] "numeric"
Now we got the element itself, as we can see from its class. To retrieve an item inside a list element, we only need to use square brackets again, as we do with vectors. This way, we can even modify an item inside the list:
list1[[2]][1]
## [1] "a"
list1[[2]][1] <- "j"
list1[[2]][1]
## [1] "j"
One last interesting feature of lists is that you can name their items instead of referencing them by number. In these cases, selection with square brackets and double square brackets still works. How? Like this:
list2 <- list(alpha=c(1,2,3), beta=c(10,20,30))
list2
## $alpha
## [1] 1 2 3
##
## $beta
## [1] 10 20 30
list2["alpha"]
## $alpha
## [1] 1 2 3
list2[["alpha"]]
## [1] 1 2 3
With named lists, a second way to reference elements is through the $ symbol. It is equivalent to double square brackets:
list2$beta
## [1] 10 20 30
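A short sketch of working with a named list follows. The list `person` and its contents are invented for the example; names(), length() and identical() are base R functions.

```r
person <- list(name = "Ana", scores = c(7, 9))

names(person)        # the names of the items
person$age <- 30     # assigning to a new name adds an item
person$scores[2]     # $ returns the item itself, so [ ] works on it
identical(person$scores, person[["scores"]])
length(person)       # the list now has three items
```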
Data Frames
A data frame is the structure used to store information in the shape of a table, organized in rows and columns. Rows and columns can be named, and columns can be of different types. You can create a data frame with the data.frame function:
df1 <- data.frame(c(1,2,3),c("low","medium","high"),c(TRUE, TRUE, FALSE))
print(df1)
## c.1..2..3. c..low.....medium....high.. c.TRUE..TRUE..FALSE.
## 1 1 low TRUE
## 2 2 medium TRUE
## 3 3 high FALSE
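The column names above are generated automatically from the expressions passed to data.frame, which makes them hard to read. A minimal sketch of the same table with explicit column names (the names `id`, `level` and `passed` are assumptions chosen for this example):

```r
df2 <- data.frame(
  id     = c(1, 2, 3),
  level  = c("low", "medium", "high"),
  passed = c(TRUE, TRUE, FALSE)
)

names(df2)   # the column names we supplied
df2$level    # a single column, selected by name
```

Naming the columns at creation time also makes it possible to select them with $, as with named lists.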
But, to make things a little easier, R comes with some data sets organized in data frames, so we can test and learn. One of them is mtcars:
print(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
To select an item from the data frame, we also utilize square brackets, where you can use the number or the name of the row or column:
mtcars[1,2]
## [1] 6
mtcars["Mazda RX4","gear"]
## [1] 4
mtcars["Cadillac Fleetwood",3]
## [1] 472
Besides that, we can also select part of the data frame in different ways:
# Defining column and interval of rows
mtcars[1:5,1]
## [1] 21.0 21.0 22.8 21.4 18.7
# If you don't supply a value for the rows, all of them are selected.
# The same is true for columns.
mtcars[,2]
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[3,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
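Columns can also be selected by name, and rows by a condition. The sketch below uses head(), a base function that returns the first elements of an object and is not covered elsewhere in this post; the variable `efficient` is hypothetical.

```r
# A single column by name, as a vector (first three values only)
head(mtcars$mpg, 3)

# Several columns by name, first three rows
mtcars[1:3, c("mpg", "cyl")]

# Rows that satisfy a condition: cars with more than 30 miles per gallon
efficient <- mtcars[mtcars$mpg > 30, ]
rownames(efficient)
```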
Some basic functions that are useful to work with data frames are nrow, which shows the number of rows in a data frame, ncol, which shows the number of columns, and dim, which shows both:
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
dim(mtcars)
## [1] 32 11
To finish this introduction, the str and summary functions show some very useful information about data frames. It’s very common to use them after importing a data set into R, to get an idea of what you are dealing with. str reports the number of rows and columns and shows each column’s type and some example values. summary calculates and shows the minimum, maximum, mean, median and quartiles of each column, plus the number of missing values when there are any:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
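The same statistics can be computed for a single column. A brief sketch using the mpg column of mtcars; mean(), median() and is.na() are base R functions that were not introduced earlier in this post.

```r
summary(mtcars$mpg)       # the same statistics, for one column only
mean(mtcars$mpg)          # individual statistics are also available
median(mtcars$mpg)
sum(is.na(mtcars$mpg))    # number of missing values in the column
```

The results should agree with the mpg column of the summary output above.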
For a basic R introduction, this is enough. Soon we’ll talk more about it.
If you have any comments, you can leave them here.
Regards