R Introduction - Importing Data into R
November 17, 2015
Introduction
Today we will talk about importing data into R. R can read many types of data for analysis, and each type is imported in its own way. Some of the most common types are:
- CSV (Comma separated values)
- Tabular data using separators other than the comma
- Tab-separated data
- XLS/XLSX files (Microsoft Excel)
- Text lines from a file
- HTML, XML, JSON
- And many others (HDF5, SPSS, Stata)
Let’s look at how we can import them into R.
CSV and other tabular formats
CSV, or Comma Separated Values, is one of the most widely used formats for storing data in tabular form. In this format, commas separate the columns and each line is a row. The table may or may not have a header with the column names.
Let’s use a CSV file with information on air passengers, which you can download by clicking here.
After that, let’s import this CSV into a data frame in R, using the read.csv function:
airline_dataframe <- read.csv("AirPassengers1.csv", header=TRUE)
head(airline_dataframe)
## X time AirPassengers
## 1 1 1949.000 112
## 2 2 1949.083 118
## 3 3 1949.167 132
## 4 4 1949.250 129
## 5 5 1949.333 121
## 6 6 1949.417 135
We use the header=TRUE argument because, in this CSV file, the first line holds the column names, that is, the header. If we used header=FALSE, the values in the first row would be treated as data rather than as column names, and R would give the columns generic names (V1, V2, and so on).
We can also use read.table to read data in tabular format, including CSVs. With this function, we need to specify the separator through the sep argument, as a quoted string:
airline_dataframe <- read.table("AirPassengers1.csv", header=TRUE, sep=",")
head(airline_dataframe)
## X time AirPassengers
## 1 1 1949.000 112
## 2 2 1949.083 118
## 3 3 1949.167 132
## 4 4 1949.250 129
## 5 5 1949.333 121
## 6 6 1949.417 135
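To see the effect of the header argument without downloading anything, here is a minimal sketch. The file contents and the names csv_path, with_header and without_header are made up for illustration; they are not part of the AirPassengers file.

```r
# Hypothetical sample data written to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("time,passengers", "1949.000,112", "1949.083,118"), csv_path)

# First line treated as column names
with_header <- read.csv(csv_path, header = TRUE)
print(names(with_header))

# First line treated as data; R assigns generic names V1, V2, ...
without_header <- read.csv(csv_path, header = FALSE)
print(names(without_header))

# The header row now counts as a data row
print(nrow(with_header))
print(nrow(without_header))
```

Note that, with header=FALSE, the text in the first row also forces both columns to be read as text instead of numbers.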
Now, let’s see another example, where the separator is a tab and the file is a .txt file. You can download the file by clicking here.
test_dataframe <- read.table("test1.txt", header=TRUE, sep="\t")
head(test_dataframe)
## X t1 t2 t3 t4 t5 t6 t7 t8
## 1 r1 1 0 1 0 0 1 0 2
## 2 r2 1 2 2 1 2 1 2 1
## 3 r3 0 0 0 2 1 1 0 1
## 4 r4 0 0 1 1 2 0 0 0
## 5 r5 0 2 1 1 1 0 0 0
## 6 r6 2 2 0 1 1 1 0 0
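The same idea works for any other separator. As a small sketch, assuming a hypothetical file that uses semicolons (the data and the names semi_path and semi_dataframe are invented for this example):

```r
# Hypothetical semicolon-separated data written to a temporary file
semi_path <- tempfile(fileext = ".txt")
writeLines(c("name;age", "Ana;31", "Bruno;27"), semi_path)

# Only the sep argument changes compared to the CSV and tab examples
semi_dataframe <- read.table(semi_path, header = TRUE, sep = ";")
print(semi_dataframe)
```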
Sometimes the file will have instructions or other information in its first rows. In these cases, you can use the skip argument to indicate the number of rows to skip before reading starts. Let’s use the same file from the previous example, edited to include a few lines of text at the beginning. Download it by clicking here.
test_dataframe <- read.table("test21.txt", skip=4, header=TRUE, sep="\t")
head(test_dataframe)
## X t1 t2 t3 t4 t5 t6 t7 t8
## 1 r1 1 0 1 0 0 1 0 2
## 2 r2 1 2 2 1 2 1 2 1
## 3 r3 0 0 0 2 1 1 0 1
## 4 r4 0 0 1 1 2 0 0 0
## 5 r5 0 2 1 1 1 0 0 0
## 6 r6 2 2 0 1 1 1 0 0
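If you want to try skip without the download, a minimal sketch could look like this. The two comment lines and the data are hypothetical, as are the names skip_path and skip_dataframe.

```r
# Hypothetical file: two lines of notes, then a tab-separated table
skip_path <- tempfile(fileext = ".txt")
writeLines(
  c("Exported by a hypothetical tool",
    "Do not edit by hand",
    "id\tvalue",
    "r1\t10",
    "r2\t20"),
  skip_path
)

# skip = 2 drops the two note lines; the next line becomes the header
skip_dataframe <- read.table(skip_path, skip = 2, header = TRUE, sep = "\t")
print(skip_dataframe)
```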
There are other useful arguments for read.table. With na.strings, you define which strings should be interpreted as NA. With row.names and col.names, you can define the names of the rows and columns. You can also define whether strings should be converted to factors, with the stringsAsFactors argument (its default is FALSE since R 4.0.0; earlier versions defaulted to TRUE). For the complete list of arguments, click here.
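Here is a short sketch combining some of these arguments. The file has no header, and the word "missing" marks an absent value; both the data and the names args_path and people are assumptions made for this example.

```r
# Hypothetical headerless CSV where "missing" marks an absent value
args_path <- tempfile(fileext = ".csv")
writeLines(c("Ana,31,SP", "Bruno,missing,RJ", "Carla,45,SP"), args_path)

people <- read.table(
  args_path,
  sep = ",",
  header = FALSE,
  col.names = c("name", "age", "state"),  # names supplied by us
  na.strings = "missing",                 # read "missing" as NA
  stringsAsFactors = FALSE                # keep text columns as character
)

str(people)
print(is.na(people$age))
```

Because "missing" is turned into NA, the age column can still be read as a number instead of text.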
Reading Excel files
For Excel files, there are a few options. The first is to use Excel to save the tabular data in CSV format and then read it like any other CSV.
To read the Excel file directly, we can use a package named xlsx. It depends on rJava, so Java must be installed on your machine.
First, install it with install.packages("xlsx"). Then, load it with library(xlsx). To load the data, use the read.xlsx function, passing the name of the file to be read and the index or name of the sheet.
I created a test file that you can download by clicking here.
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
xl_data <- read.xlsx("test_excel1.xlsx", sheetName = "Plan1")
print(xl_data)
## Name Age Email
## 1 Felipe 70 felipe@felipe.com
## 2 Jose 30 jose@jose.com
## 3 Maria 20 maria@maria.com
## 4 Rafael 34 rafael@rafael.com
## 5 Luiza 25 luiza@luiza.com
With this package, you can also define the starting row (startRow), the ending row (endRow), whether the table has a header (header), and the class of each column (colClasses), among others. To see the complete list of arguments, you can access the package manual by clicking here.
Reading Text Files
You may also want to analyze lines of text from a file. A text file can contain, for example, tweets or Facebook posts. To get this text data into R, use the readLines function. Just supply the name of the file or a connection. You can also limit the number of lines to be read with the n argument. You can download the test file to try this out by clicking here:
text_vector <- readLines("text_lines.txt")
## Warning in readLines("text_lines.txt"): incomplete final line found on
## 'text_lines.txt'
print(text_vector)
## [1] "This is an example with text lines. This is the first one"
## [2] "The second one, with a little more text"
## [3] "Line of text 3 incoming"
## [4] "Just one more to finish"
text_vector_2 <- readLines("text_lines.txt", n = 2)
print(text_vector_2)
## [1] "This is an example with text lines. This is the first one"
## [2] "The second one, with a little more text"
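readLines also accepts a connection instead of a file name. As a sketch (the file contents and the names lines_path, con, first_two and remaining are made up here), an open connection remembers its position, so you can read a file in pieces:

```r
# Hypothetical text file written to a temporary location
lines_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), lines_path)

# Open the file as a connection so successive reads continue where
# the previous one stopped
con <- file(lines_path, open = "r")
first_two <- readLines(con, n = 2)
remaining <- readLines(con)   # no n given: reads everything that is left
close(con)

print(first_two)
print(remaining)
```

This can be handy for large files, when you want to process a few lines at a time.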
Reading a webpage
You can also read a webpage to extract useful information from it. First, create a connection using the url function. Then, use readLines just as we did to read a text file in the previous example:
connection <- url("https://felipegalvao.com.br/")
webpage_lines <- readLines(connection)
print(head(webpage_lines))
## [1] "<!DOCTYPE html>" "<html lang=\"pt\">"
## [3] "" "<head>"
## [5] "" " <meta charset=\"utf-8\">"
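Once the lines are in a character vector, base R string functions can pull pieces out of them. The sketch below uses a small hand-written vector called page_lines as a stand-in for the result of readLines, so it runs without an internet connection; the page content is hypothetical. For real pages, a proper HTML parser is likely to be more robust than pattern matching.

```r
# Hypothetical page content standing in for the result of readLines(connection)
page_lines <- c(
  "<!DOCTYPE html>",
  "<html lang=\"pt\">",
  "<head>",
  "  <title>Example page</title>",
  "</head>"
)

# Keep only the lines that contain a <title> element
title_lines <- grep("<title>", page_lines, fixed = TRUE, value = TRUE)

# Strip the tags and the surrounding whitespace to leave just the text
page_title <- trimws(gsub("<[^>]+>", "", title_lines))
print(page_title)
```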
There are other ways to import data into R, but for now, we will stop here.
If you want me to talk about any specific subject, leave a comment.
Regards