Basic R - Introduction - Data Types and Structures

November 04, 2015

Introduction

Hello guys,

Let’s start talking about R here. For those who don’t know it, R is a programming language designed for data analysis, statistics, data mining and machine learning, among other things. Created by Ross Ihaka and Robert Gentleman, it is based on the S language (hence the name).

To install R, check the link: link

It’s also recommended to use RStudio, a great IDE for R, which you can find here: link

With both installed, you are ready to open RStudio and start writing your R scripts and commands.

Basic functionality

When you open RStudio, this is what it looks like:

[Screenshot of the RStudio window]

You can start writing your code in the console. Type "Hello, world" in it and see what happens:

"Hello, world"
## [1] "Hello, world"

R will show what you wrote, along with this [1]. We’ll soon see what this [1] means, but as we can see, R automatically prints the result of your commands. Try a sum:

3 + 5
## [1] 8

Now R shows the result of the sum. R has the following basic operations:

Sign or function	Operation
+	Addition
-	Subtraction
/	Division
*	Multiplication
^	Power
sqrt()	Square root

Let’s see some examples of them:

4+5
## [1] 9
7-2
## [1] 5
6*8
## [1] 48
9/3
## [1] 3
4^3
## [1] 64
sqrt(25)
## [1] 5

Variables and basic data types

To store a value in a variable in R, we use <-. In this case, nothing is printed automatically. If you want to check what is stored in the variable, just type its name, or use print(name_of_variable). Let’s see:

a <- 5 + 3
a
## [1] 8
print(a)
## [1] 8
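As a small sketch of the behaviour just described, the lines below store values, reuse them in a later expression, and use one extra trick: wrapping an assignment in parentheses, which assigns and prints in a single step. The variable names (price, quantity, total, total_sqrt) are made up for illustration.

```r
# Assignment stores the value silently; nothing is printed here.
price <- 4^2        # hypothetical variable names, for illustration only
quantity <- 3

# Variables can be reused in later expressions.
# ^ and * are evaluated before +, so this is (16 * 3) + 2.
total <- price * quantity + 2

# Wrapping an assignment in parentheses assigns and prints in one step.
(total_sqrt <- sqrt(total + 14))
```

Typing total on its own line afterwards would print the stored value, exactly as with a above.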

This is a numeric variable, as the class() function will tell us. If you want to work with integers, you have to define them explicitly as integers with the as.integer() function:

var1 <- 3
var2 <- as.integer(3)
class(var1)
## [1] "numeric"
class(var2)
## [1] "integer"
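A short sketch of two related details, assuming a standard R installation: the L suffix is another way to write an integer, and as.integer() drops the fractional part instead of rounding. The names int_a and int_b are only illustrative.

```r
int_a <- 3L                # the L suffix creates an integer directly
int_b <- as.integer(3.9)   # as.integer() drops the fractional part

class(int_a)               # "integer"
is.integer(int_b)          # TRUE
int_b                      # 3, not 4

# Division returns a numeric value even when both inputs are integers.
class(int_a / 2L)
```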

In R, data can also be of type character, which represents text (known as a string in other programming languages), or logical (also known as boolean), which takes the values TRUE or FALSE (abbreviated T or F). Examples:

b <- "Hello world"
c <- TRUE
class(b)
## [1] "character"
class(c)
## [1] "logical"
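Logical values rarely need to be typed by hand, since they are usually the result of a comparison. The sketch below shows this with a character variable; greeting and is_long are hypothetical names.

```r
greeting <- "Hello world"

# nchar() counts the characters in a string; the comparison yields a logical.
is_long <- nchar(greeting) > 5

class(is_long)     # "logical"
is_long            # TRUE, because the text has 11 characters
!is_long           # negation: FALSE
```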

To finish this part, R also has factors. Factors are used to represent categories, which means that a factor variable can only take values from a limited set of options. Think, for example, about military ranks: there is a defined number of ranks, and each value will be one of them. A person’s name, on the contrary, has unlimited possibilities, since anybody can create a new name. To create a factor, let’s use the military rank example with the function factor():

military <- c("Private","Private", "Colonel","General", "Lieutenant",
               "Lieutenant", "Sergeant","Private","Private","Sergeant",
               "Sergeant","Private")
military <- factor(military)
print(military)
##  [1] Private    Private    Colonel    General    Lieutenant Lieutenant
##  [7] Sergeant   Private    Private    Sergeant   Sergeant   Private
## Levels: Colonel General Lieutenant Private Sergeant
table(military)
## military
##    Colonel    General Lieutenant    Private   Sergeant
##          1          1          2          5          3

When we print the vector, we see that R shows us the levels. These are the unique values present in the vector. Using the table function, we can check the count for each category (rank, in this case). Notice that the levels are sorted alphabetically. We can rearrange them if another order makes more sense, like this:

military <- factor(military, levels=c("Private","Sergeant","Lieutenant",
                                        "Colonel","General"))
table(military)
## military
##    Private   Sergeant Lieutenant    Colonel    General
##          5          3          2          1          1
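To look a little closer at what the levels do, here is a minimal sketch on a smaller, made-up vector (ranks and ranks_ord are hypothetical names). It shows that each value is stored as a position within the levels, that unused levels are still counted, and that an ordered factor also allows comparisons between ranks.

```r
ranks <- factor(c("Private", "General", "Sergeant", "Private"),
                levels = c("Private", "Sergeant", "Lieutenant",
                           "Colonel", "General"))

levels(ranks)       # all allowed categories, in the order we defined
as.integer(ranks)   # position of each value within the levels: 1 5 2 1
table(ranks)        # unused levels still appear, with a count of 0

# ordered = TRUE additionally makes comparisons between levels meaningful.
ranks_ord <- factor(ranks, levels = levels(ranks), ordered = TRUE)
ranks_ord[2] > ranks_ord[3]   # TRUE: General comes after Sergeant
```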

Data Structures in R

Now, let’s talk about data structures in R.

Vectors

A vector is a simple sequence of elements of the same type. When we define a variable as we did in the previous examples, a vector with one element is created. Observe:

var1 <- 3
is.vector(var1)
## [1] TRUE

That is why there is a [1] next to the value of the variable you defined: it marks the position, within the vector, of the first element shown on that line.

To create vectors with more than one element, we combine the elements with the c() function. See the following example:

var1 <- c(3,6, 7.8, 332)
print(var1)
## [1]   3.0   6.0   7.8 332.0
var2 <- c("Hello","how","are")
print(var2)
## [1] "Hello" "how"   "are"
var3 <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
print(var3)
## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

If you try to mix types, R will coerce some elements so that they are all of the same type:

var4 <- c(3, 6, 9, "lettuce")
print(var4)
## [1] "3"       "6"       "9"       "lettuce"
class(var4)
## [1] "character"
var5 <- c(TRUE, FALSE, 1, 3)
print(var5)
## [1] 1 0 1 3
class(var5)
## [1] "numeric"

In the first case, R converted the numeric values to character, because character is the only type that can represent all of the values. In the second case, TRUE and FALSE can also be represented as 1 and 0, respectively, so R adopted these values and the vector is numeric.
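The coercion rule can be sketched as a chain: logical becomes integer, integer becomes numeric, and anything becomes character when a character value is present. The example below is only an illustration with made-up names (mixed, mixed_chr, nums). It also shows an explicit conversion back to numbers, where text that is not a number turns into NA.

```r
mixed <- c(TRUE, 2L, 3.5)          # logical -> integer -> numeric
class(mixed)                       # "numeric"

mixed_chr <- c(mixed, "lettuce")   # anything plus character -> character
class(mixed_chr)                   # "character"

# Explicit conversion back; values that are not numbers become NA.
# R normally warns about this, so the warning is silenced here.
nums <- suppressWarnings(as.numeric(c("3", "6", "lettuce")))
nums

# Because TRUE counts as 1 and FALSE as 0, sum() counts the TRUE values.
sum(c(TRUE, FALSE, TRUE))
```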

Subsetting: to select one or more items from a vector, use square brackets ([ ]). To select a range of consecutive items, separate the first and last positions with a colon (:):

var4[1]
## [1] "3"
var5[2:4]
## [1] 0 1 3
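Square brackets accept more than single positions and ranges. As a hedged sketch on a made-up vector called scores, the lines below pick specific positions, drop a position with a negative index, and keep only the elements that satisfy a condition.

```r
scores <- c(3, 6, 7.8, 332)   # hypothetical example vector

scores[c(1, 4)]      # specific positions: 3 and 332
scores[-1]           # a negative index drops that position
scores[scores > 5]   # a logical condition keeps the matching elements
length(scores)       # number of elements: 4
```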

Matrices

Matrices correspond to their mathematical counterparts: groups of elements organized in rows and columns. As with vectors, all of their elements have to be of the same type. There are several ways to create a matrix:

mat1 <- matrix(
               c(1,5,10,30,15,8),
               nrow=3,
               ncol=2,
               byrow=TRUE)
print(mat1)
##      [,1] [,2]
## [1,]    1    5
## [2,]   10   30
## [3,]   15    8
vec1 <- c(3,4,5)
vec2 <- c(9,10,11)
mat2 <- rbind(vec1, vec2)
print(mat2)
##      [,1] [,2] [,3]
## vec1    3    4    5
## vec2    9   10   11
class(mat2)
## [1] "matrix"
# Note: R 4.0.0 and later print "matrix" "array" here

In the first one, we use the matrix function. The nrow and ncol arguments indicate the number of rows and columns, and byrow defines whether the matrix is filled by row (TRUE) or by column (FALSE).
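To make the effect of byrow concrete, this sketch builds two matrices from the same values and compares their first rows. The names values, by_row and by_col are illustrative only.

```r
values <- c(1, 5, 10, 30, 15, 8)

by_row <- matrix(values, nrow = 3, ncol = 2, byrow = TRUE)
by_col <- matrix(values, nrow = 3, ncol = 2)   # byrow defaults to FALSE

by_row[1, ]   # first row when filled by row: 1 5
by_col[1, ]   # first row when filled by column: 1 30
dim(by_col)   # 3 rows and 2 columns in both cases
```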

In the second one, using the rbind function, note that R named the matrix rows after the vectors that were used to create it, instead of numbering them.

To select an item from a matrix, use square brackets, supplying first the row and then the column, separated by a comma:

mat1[2,1]
## [1] 10
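The same bracket notation can return whole rows or columns when one of the two positions is left empty. The sketch below rebuilds the matrix under the hypothetical name mat so that it runs on its own, and also shows what a single index without a comma does.

```r
mat <- matrix(c(1, 5, 10, 30, 15, 8), nrow = 3, ncol = 2, byrow = TRUE)

mat[2, ]    # whole second row: 10 30
mat[, 2]    # whole second column: 5 30 8
mat[3, 2]   # single element at row 3, column 2: 8

# A single index, without a comma, counts down the columns one after another
# (1, 10, 15, 5, 30, 8), so position 4 is the first element of column 2.
mat[4]
```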

Lists

Lists are a special kind of vector that can hold elements of different types, including other vectors. To show what we are talking about, I will create a list from some vectors.

a <- c(3,6,9)
b <- c("a","b","c","d")
c <- c(TRUE, FALSE, TRUE, TRUE)
list1 <- list(a,b,c)
print(list1)
## [[1]]
## [1] 3 6 9
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE  TRUE

We can also retrieve a slice of the list with square brackets:

list1[2]
## [[1]]
## [1] "a" "b" "c" "d"
class(list1[2])
## [1] "list"

But, as we can see, what is returned is a new list. To retrieve the element itself, we should use double square brackets ([[ ]]):

list1[[1]]
## [1] 3 6 9
class(list1[[1]])
## [1] "numeric"

Now we got the element itself, as we can see from its class. To retrieve an item inside a list element, we only need to use square brackets again, as we do with vectors. In the same way, we can even modify an item inside the list:

list1[[2]][1]
## [1] "a"
list1[[2]][1] <- "j"
list1[[2]][1]
## [1] "j"

One last interesting feature of lists is that you can name their items instead of referencing them by number. Selection with single and double square brackets still works, now using the names. Like this:

list2 <- list(alpha=c(1,2,3), beta=c(10,20,30))
list2
## $alpha
## [1] 1 2 3
## 
## $beta
## [1] 10 20 30
list2["alpha"]
## $alpha
## [1] 1 2 3
list2[["alpha"]]
## [1] 1 2 3

With named lists, a second way to reference elements is the $ symbol. It works much like double square brackets:

list2$beta
## [1] 10 20 30
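Names also make it easy to inspect and extend a list. This is a small sketch with made-up data (the list person and its item names are hypothetical): names() lists the item names, and assigning to a name that does not exist yet appends a new item.

```r
person <- list(name = "Ana", scores = c(10, 20, 30))   # hypothetical data

names(person)            # "name" "scores"

person$passed <- TRUE    # assigning to a new name appends an item
length(person)           # now 3

person[["scores"]][2]    # second score: 20
str(person)              # compact overview of every item and its type
```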

Data Frames

A data frame is the structure used to store information in the shape of a table, organized in rows and columns. Rows and columns can be named, and columns can be of different types. You can create a data frame with the data.frame function:

df1 <- data.frame(c(1,2,3),c("low","medium","high"),c(TRUE, TRUE, FALSE))
print(df1)
##   c.1..2..3. c..low.....medium....high.. c.TRUE..TRUE..FALSE.
## 1          1                         low                 TRUE
## 2          2                      medium                 TRUE
## 3          3                        high                FALSE
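In the output above, R built the column names from the expressions themselves, which is hard to read. A minimal sketch of the usual alternative is to name the columns when creating the data frame. The column names id, level and active are assumptions chosen for this example, and stringsAsFactors = FALSE is set explicitly so the text column stays character in any R version.

```r
df2 <- data.frame(id = c(1, 2, 3),
                  level = c("low", "medium", "high"),
                  active = c(TRUE, TRUE, FALSE),
                  stringsAsFactors = FALSE)   # keep text as character

names(df2)              # "id" "level" "active"
df2$level               # one column, selected by name
df2[df2$active, "id"]   # ids of the rows where active is TRUE: 1 2
```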

But, to make things a little easier, R comes with some data sets organized in data frames, so we can test and learn. One of them is mtcars:

print(mtcars)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

To select an item from a data frame, we also use square brackets, giving the number or the name of the row and of the column:

mtcars[1,2]
## [1] 6
mtcars["Mazda RX4","gear"]
## [1] 4
mtcars["Cadillac Fleetwood",3]
## [1] 472

Besides that, we can also select part of a data frame in different ways:

# Defining column and interval of rows
mtcars[1:5,1]
## [1] 21.0 21.0 22.8 21.4 18.7

# If you don't supply a value for the rows, all of them are selected.
# The same is true for columns.
mtcars[,2]
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[3,]
##             mpg cyl disp hp drat   wt  qsec vs am gear carb
## Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1
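Two more selection styles are common in practice and can be sketched with the same mtcars data: picking a column with $, as with named lists, and filtering rows with a logical condition. The names four_cyl and efficient are hypothetical.

```r
mtcars$mpg[1:3]    # a column by name, returned as a vector

# Rows where cyl equals 4; the empty position after the comma keeps all columns.
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)

# Rows with mpg above 30, keeping only two columns.
efficient <- mtcars[mtcars$mpg > 30, c("mpg", "cyl")]
rownames(efficient)
```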

Some basic functions that are useful when working with data frames are nrow, which shows the number of rows in a data frame, ncol, which shows the number of columns, and dim, which shows both:

nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
dim(mtcars)
## [1] 32 11

To finish this introduction, the str and summary functions show some very useful information about data frames. It’s very common to use them right after importing a data set into R, to get an idea of what you are dealing with. str shows the number of rows and columns, the type of each column and some example values. summary calculates the minimum, maximum, mean, median and quartiles of each numeric column, and also reports the number of missing values when there are any:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
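When the full summary is more than you need, the same figures can be computed one column at a time. The sketch below is only an illustration of that idea using base R functions on mtcars.

```r
head(mtcars, 3)          # first three rows only, instead of the whole table

mean(mtcars$mpg)         # about 20.09, matching the summary above
median(mtcars$hp)        # 123

# Number of missing values per column (mtcars has none).
colSums(is.na(mtcars))
```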

For a basic R introduction, this is enough. Soon we’ll talk more about it.

If you have any comments, you can leave them here.

Regards

Felipe Galvao