Data Science with Python – Pandas Basics – Reading Dataframes
February 18, 2016
Introduction
In a previous post, we showed how to install the Anaconda distribution so you could utilize Pandas for Data Analysis and Science (click here to see it). Now, with that out of the way, we’ll make an introduction about Pandas, talking about one of its most important structures, the DataFrame
.
DataFrames are structures that store data in a tabular format. DataFrames are composed of rows and columns, with the column being one field of the Dataset and the row being a record. Each column holds data of one type (Integer, String, etc). You can imagine an Excel spreadsheet, but in this case, each column is limited to one data type.
Creating a DataFrame
First, a simple way of creating a DataFrame, using the good old Dictionary from Python (let’s import Pandas first, since this will be the first time the code will be executed in this post):
import pandas as pd
df_data = {'country': ['Brazil', 'Argentina','Argentina', 'Brazil', 'Chile','Chile'],
'year': [2005, 2006, 2005, 2006, 2007, 2008],
'population': [170.1, 30.5, 32.2, 172.6, 40.8, 42.0]}
df = pd.DataFrame(df_data)
print(df)
country population year
0 Brazil 170.1 2005
1 Argentina 30.5 2006
2 Argentina 32.2 2005
3 Brazil 172.6 2006
4 Chile 40.8 2007
5 Chile 42.0 2008
Another way, way more used, is to import data from a file (in a lot of different formats) or a database directly to a DataFrame.
We will use the Titanic example from Kaggle, as an example. For those who don’t know what Kaggle is, it is a website of Data Science competitions. Normally hosted by companies, the competitions usually involve predictions throughs Datasets of training and test, but there are also competitions with another tasks, like image recognition and creation of nice data visualizations with Datasets provided by Kaggle. Introduction made, let’s follow.
Sign up for an account on Kaggle and click on “Competitions”. Scroll down a bit and find the competition “Titanic: Machine Learning from Disaster”. Click on it and then you can read details about this competition. This is a permanent “competition” on Kaggle, that exists for people to train and learn about Data Science and Machine Learning, applying their newly found knowledge to some data that is ready to use.
For this competition, several informations are provided in a training Dataset regarding passenger from the Titanic, like age, gender, cabin, ticket value, among others. Besides that, the train Dataset also shows the target variable, if a passenger survived (value == 1) or not (value == 0) to the wreckage. On the test Dataset, the same informations are shown, except for the target variable (survival). For the passengers on the test Dataset, you will have to predict if they survived or not, based on all the information provided, comparing the information on the training set with the information on the test set. On the competition page you can also see another details, like what each variable on the Dataset means, the Leaderboard for the competition, some tutorial related to the competition.
In the competition page, click the “Data” link, on the sidebar on the left, and download the train and test files. These are CSV files. Save them somewhere on your computer and then create the Python script on the same folder, so we can read them without having to specify the entire path. So, finally, the command read_csv
from Pandas will read a csv file and make it in a DataFrame. We will use it to store the Dataset in a variable, and then we will use the head()
command of a DataFrame, to print just the first 5 rows, just to show that the command worked. Let’s see (from now, we will consider that Pandas is already imported as pd):
import pandas as pd
train_dataset = pd.read_csv('train.csv')
print(train_dataset.head())
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1
2 Heikkinen, Miss. Laina female 26 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
4 Allen, Mr. William Henry male 35 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Reading DataFrame information
To retrieve information from a column, there are two most common ways. The first one is to call the column as if the DataFrame was a Dictionary. The second is analogous to calling an object attribute. With the first way, you can also retrieve multiple columns:
print(train_dataset['Age'])
0 22
1 38
2 26
3 35
4 35
...
886 27
887 19
888 NaN
889 26
890 32
print(train_dataset.Sex)
0 male
1 female
2 female
3 female
4 male
...
886 male
887 female
888 female
889 male
890 male
print(train_dataset[['Sex','Age','Fare']])
Sex Age Fare
0 male 34 7.2500
1 female 10 71.2833
2 female 30 7.9250
3 female 5 53.1000
4 male 22 8.0500
...
886 male 59 13.0000
887 female 30 30.0000
888 female 19 23.4500
889 male 57 30.0000
890 male 7 7.7500
To retrieve the contents of a row, we use the ix
attribute, passing to it the row position. If a List
of values is passed, all the rows will be shown:
print(train_dataset.ix[3])
PassengerId 4
Survived 1
Pclass 1
Name Futrelle, Mrs. Jacques Heath (Lily May Peel)
Sex female
Age 35
SibSp 1
Parch 0
Ticket 113803
Fare 53.1
Cabin C123
Embarked S
print(train_dataset.ix[[0,10,50]])
PassengerId Survived Pclass Name Sex \
0 1 0 3 Braund, Mr. Owen Harris male
10 11 1 3 Sandstrom, Miss. Marguerite Rut female
50 51 0 3 Panula, Master. Juha Niilo male
Age SibSp Parch Ticket Fare Cabin Embarked
0 22 1 0 A/5 21171 7.2500 NaN S
10 4 1 1 PP 9549 16.7000 G6 S
50 7 4 1 3101295 39.6875 NaN S
As we already saw, head()
shows the 5 first lines of the DataFrame. If we pass an integer number to it, it will retrieve this same number of rows. Just like it, tails()
will show the last rows of a DataFrame. To get the name of each column, we use columns()
. And finally, we can use shape
to retrieve the number of columns and rows. dataframe.shape[0]
is the number of rows and [1]
is the number of columns:
print(train_dataset.tail())
PassengerId Survived Pclass Name \
886 887 0 2 Montvila, Rev. Juozas
887 888 1 1 Graham, Miss. Margaret Edith
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
889 890 1 1 Behr, Mr. Karl Howell
890 891 0 3 Dooley, Mr. Patrick
Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 male 27 0 0 211536 13.00 NaN S
887 female 19 0 0 112053 30.00 B42 S
888 female NaN 1 2 W./C. 6607 23.45 NaN S
889 male 26 0 0 111369 30.00 C148 C
890 male 32 0 0 370376 7.75 NaN Q
print(train_dataset.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
print(train_dataset.shape)
(891, 12)
A really interesting method of a DataFrame is describe()
. Describe calculates summary statistics for each numerical column in the DataFrame, like counting of values, sum, mean, median, etc. It’s interesting when you want to have an idea of what the data being worked looks like:
print(train_dataset.describe())
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
One last important thing to know about reading DataFrames is filtering. We will filter the DataFrame by the value of the “Sex” column. We will select only rows where the “Sex” columns equals “female”, by making an assertion:
print(train_dataset[train_dataset.Sex == "female"])
PassengerId Survived Pclass \
1 2 1 1
2 3 1 3
3 4 1 1
8 9 1 3
9 10 1 2
.. ... ... ...
880 881 1 2
882 883 0 3
885 886 0 3
887 888 1 1
888 889 0 3
Name Sex Age SibSp \
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1
2 Heikkinen, Miss. Laina female 26 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0
9 Nasser, Mrs. Nicholas (Adele Achem) female 14 1
.. ... ... ... ...
880 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0
882 Dahlberg, Miss. Gerda Ulrika female 22 0
885 Rice, Mrs. William (Margaret Norton) female 39 0
887 Graham, Miss. Margaret Edith female 19 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
Parch Ticket Fare Cabin Embarked
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
8 2 347742 11.1333 NaN S
9 0 237736 30.0708 NaN C
.. ... ... ... ... ...
880 1 230433 26.0000 NaN S
882 0 7552 10.5167 NaN S
885 5 382652 29.1250 NaN Q
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
I believe this is a good start in the reading part. In the next post I’ll talk about data manipulation, with creation of columns, how to change column values, among other stuff.
Stay tuned! :)