R Introduction - Importing Data into R

November 17, 2015

Introduction

Today we will talk about importing data into R. R can read many different types of data for analysis, and each type has its own way of being imported. Some of the most common types are:

  • CSV (comma-separated values)
  • Tabular data that uses a separator other than the comma
  • Tab-separated data
  • XLS files (Microsoft Excel)
  • Lines of text from a file
  • HTML, XML, JSON
  • And many others (HDF5, SPSS, Stata)

Let’s see how we can import them into R.

CSV and other tabular formats

CSV, or comma-separated values, is one of the most widely used formats for storing tabular data. In this format, each line is a row and commas separate the columns. The table may or may not have a header with the column names.

Let’s use a CSV file with information on air passengers, which you can download by clicking here.

After that, let’s import this CSV into a data frame in R, using the read.csv function:

airline_dataframe <- read.csv("AirPassengers1.csv", header=TRUE)
head(airline_dataframe)
##   X     time AirPassengers
## 1 1 1949.000           112
## 2 2 1949.083           118
## 3 3 1949.167           132
## 4 4 1949.250           129
## 5 5 1949.333           121
## 6 6 1949.417           135

We use the header=TRUE argument because the first line of this CSV file is its header and holds the column names. (For read.csv, header=TRUE is also the default.) With header=FALSE, the first row would be read as data rather than as column names, and R would assign generic names to the columns. We can also use read.table to read tabular data, including CSVs. With this function, we specify the separator through the sep argument, as a quoted string:

airline_dataframe <- read.table("AirPassengers1.csv", header=TRUE, sep=",")
head(airline_dataframe)
##   X     time AirPassengers
## 1 1 1949.000           112
## 2 2 1949.083           118
## 3 3 1949.167           132
## 4 4 1949.250           129
## 5 5 1949.333           121
## 6 6 1949.417           135
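
To see the effect of header for yourself without downloading anything, here is a minimal sketch. It writes a tiny CSV to a temporary file first; the column names and values in it are made up for illustration and are not the AirPassengers file used above.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The column names and values are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("time,passengers", "1949.000,112", "1949.083,118"), csv_path)

# header = TRUE: the first line supplies the column names.
with_header <- read.csv(csv_path, header = TRUE)
print(names(with_header))
print(nrow(with_header))

# header = FALSE: the first line becomes a data row,
# and R generates the column names itself.
without_header <- read.csv(csv_path, header = FALSE)
print(names(without_header))
print(nrow(without_header))

# read.table needs the separator spelled out to read the same file.
same_data <- read.table(csv_path, header = TRUE, sep = ",")
print(all.equal(with_header, same_data))
```

Notice that without the header the data frame has one extra row, because the line with the column names is treated as data.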

Now, let’s see another example, where the separator is a tab and the file is a .txt. You can download this file by clicking here.

test_dataframe <- read.table("test1.txt", header=TRUE, sep="\t")
head(test_dataframe)
##    X t1 t2 t3 t4 t5 t6 t7 t8
## 1 r1  1  0  1  0  0  1  0  2
## 2 r2  1  2  2  1  2  1  2  1
## 3 r3  0  0  0  2  1  1  0  1
## 4 r4  0  0  1  1  2  0  0  0
## 5 r5  0  2  1  1  1  0  0  0
## 6 r6  2  2  0  1  1  1  0  0

Sometimes the file that contains the data has instructions or other information in its first rows. In these cases, you can use the skip argument to indicate the number of rows you want to skip. Let’s use the same file from the previous example, edited to include some lines of text at the beginning. Download it by clicking here.

test_dataframe <- read.table("test21.txt", skip=4, header=TRUE, sep="\t")
head(test_dataframe)
##    X t1 t2 t3 t4 t5 t6 t7 t8
## 1 r1  1  0  1  0  0  1  0  2
## 2 r2  1  2  2  1  2  1  2  1
## 3 r3  0  0  0  2  1  1  0  1
## 4 r4  0  0  1  1  2  0  0  0
## 5 r5  0  2  1  1  1  0  0  0
## 6 r6  2  2  0  1  1  1  0  0
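
If you don’t know how many rows to skip, one possible approach is to peek at the first raw lines with readLines and locate the header line before calling read.table. The sketch below assumes we know a column name that appears in the header ("t1" here). The file is written to a temporary location and its contents are made up for illustration.

```r
# Build a small tab-separated file with two lines of notes at the top.
# The contents are invented for illustration.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("Exported for testing",
             "Values are counts",
             "id\tt1\tt2",
             "r1\t1\t0",
             "r2\t1\t2"), txt_path)

# Peek at the first few raw lines to see where the table starts.
first_lines <- readLines(txt_path, n = 4)
print(first_lines)

# Find the header line by a column name we expect in it,
# then skip every line that comes before it.
header_line <- grep("t1", first_lines, fixed = TRUE)[1]
peek_dataframe <- read.table(txt_path, skip = header_line - 1,
                             header = TRUE, sep = "\t")
print(peek_dataframe)
```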

There are other useful arguments for read.table. With na.strings, you define which strings should be interpreted as NA. With row.names and col.names, you can define the names of the rows and columns. With stringsAsFactors, you define whether strings should be converted to factors. For the complete list of arguments, click here.
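
Here is a short sketch that puts three of these arguments together. The file, the column names, and the choice of "-" as the missing-value marker are all assumptions made for this example.

```r
# A small comma-separated file without a header; "-" marks a missing value.
# File contents and column names are invented for illustration.
raw_path <- tempfile(fileext = ".csv")
writeLines(c("Ana,23,-", "Bruno,-,SP", "Carla,31,RJ"), raw_path)

people <- read.table(raw_path,
                     header = FALSE,
                     sep = ",",
                     na.strings = "-",                       # read "-" as NA
                     col.names = c("name", "age", "state"),  # our own column names
                     stringsAsFactors = FALSE)               # keep text as character

str(people)

# Because "-" became NA, the age column is read as numbers, not text.
print(sum(is.na(people$age)))
```

Setting stringsAsFactors explicitly is a good habit, because its default is not the same in every R version (it changed from TRUE to FALSE in R 4.0.0).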

Reading Excel files

For Excel files, there are a few options. The first is to use Excel to save the tabular data in CSV format and import that.

But to read the Excel file directly, we can also use a package named xlsx.

First, install it with install.packages("xlsx"). Then, load it with library(xlsx). To load the data, use the read.xlsx function, passing the name of the file to be read and the index or name of the sheet.

I created a test file that you can download by clicking here.

library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
xl_data <- read.xlsx("test_excel1.xlsx", "Plan1")
print(xl_data)
##     Name Age             Email
## 1 Felipe  70 felipe@felipe.com
## 2   Jose  30     jose@jose.com
## 3  Maria  20   maria@maria.com
## 4 Rafael  34 rafael@rafael.com
## 5  Luiza  25   luiza@luiza.com

With this package, you can also define the starting row (startRow), the ending row (endRow), whether the table has a header (header), and the class of each column (colClasses), among others. To see the complete list of arguments, you can access the package manual by clicking here.
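
As a rough sketch of startRow and endRow, the example below first writes a small workbook with write.xlsx (also from the xlsx package) and then reads only part of it. The data are made up, and the code only runs if the xlsx package and a working Java installation are available on your machine.

```r
# Runs only if the xlsx package (and the Java it depends on) can be loaded.
if (requireNamespace("xlsx", quietly = TRUE)) {
  # Write a small workbook to a temporary file; the data are invented.
  xl_path <- tempfile(fileext = ".xlsx")
  people <- data.frame(Name = c("Ana", "Bruno", "Carla"),
                       Age = c(23, 40, 31))
  xlsx::write.xlsx(people, xl_path, sheetName = "Plan1", row.names = FALSE)

  # Read sheet rows 1 to 3 only: the header row plus the first two data rows.
  xl_subset <- xlsx::read.xlsx(xl_path, sheetName = "Plan1",
                               startRow = 1, endRow = 3, header = TRUE)
  print(xl_subset)
}
```

Keep in mind that startRow and endRow count rows of the sheet, so the header row is included in that range.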

Reading Text Files

You may also want to analyze lines of text from a file. A text file can contain, for example, tweets or Facebook posts. To get this text into R, use the readLines function. Just supply the name of the file or a connection. You can also limit the number of lines to be read with the n argument. You can download the test file to try this out by clicking here:

text_vector <- readLines("text_lines.txt")
## Warning in readLines("text_lines.txt"): incomplete final line found on
## 'text_lines.txt'
print(text_vector)
## [1] "This is an example with text lines. This is the first one"
## [2] "The second one, with a little more text"
## [3] "Line of text 3 incoming"
## [4] "Just one more to finish"
text_vector_2 <- readLines("text_lines.txt", n = 2)
print(text_vector_2)
## [1] "This is an example with text lines. This is the first one"
## [2] "The second one, with a little more text"
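
The difference between passing a file name and passing a connection shows up when you read a file in pieces. The following sketch uses a temporary file with made-up lines to illustrate it.

```r
# Create a small text file to read; the lines are invented for illustration.
lines_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), lines_path)

# Passing a file name: every call starts again from the top of the file.
from_name <- readLines(lines_path, n = 1)

# Passing an open connection: each call continues where the last one stopped.
con <- file(lines_path, open = "r")
chunk_1 <- readLines(con, n = 1)   # the first line
chunk_2 <- readLines(con, n = 2)   # the next two lines
close(con)                         # close any connection you open yourself

print(chunk_1)
print(chunk_2)
```

Since writeLines ends the file with a newline, the "incomplete final line" warning from the earlier example does not appear here. That warning only means that the last line of the file has no line break at the end, and the text is still read.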

Reading a webpage

You can also read a webpage to extract useful information from it. First, create a connection using the url function. Then, use readLines just as we did to read a text file in the previous example:

connection <- url("https://felipegalvao.com.br/")
webpage_lines <- readLines(connection)
print(head(webpage_lines))
## [1] "<!DOCTYPE html>"              "<html lang=\"pt\">"
## [3] ""                             "<head>"
## [5] ""                             "    <meta charset=\"utf-8\">"
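
Once the lines are in a character vector, you can pull information out of them with the usual text functions. Below is a minimal sketch that extracts the page title. To avoid depending on the network, it uses a made-up HTML snippet and a textConnection in place of url. Matching tags with patterns like this is fragile and only suits simple cases; for real pages a proper HTML parser is likely a better choice.

```r
# Stand-in for the lines fetched from a web page; this HTML is invented.
html_text <- c("<!DOCTYPE html>",
               "<html lang=\"pt\">",
               "<head>",
               "    <meta charset=\"utf-8\">",
               "    <title>Example page</title>",
               "</head>")

# textConnection lets readLines treat the vector like a file or a URL.
con <- textConnection(html_text)
page_lines <- readLines(con)
close(con)

# Keep the line that holds the <title> tag, then strip the tags and padding.
title_line <- grep("<title>", page_lines, fixed = TRUE, value = TRUE)
page_title <- trimws(gsub("<[^>]+>", "", title_line))
print(page_title)
```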

There are other ways to import data into R, but for now, we will stop here.

If you want me to talk about any specific subject, leave a comment.

Regards