Data Science with Python – Pandas Basics – Reading Dataframes

February 18, 2016

Introduction

In a previous post, we showed how to install the Anaconda distribution so you could utilize Pandas for Data Analysis and Science (click here to see it). Now, with that out of the way, we’ll make an introduction about Pandas, talking about one of its most important structures, the DataFrame.

DataFrames are structures that store data in a tabular format. DataFrames are composed of rows and columns, with the column being one field of the Dataset and the row being a record. Each column holds data of one type (Integer, String, etc). You can imagine an Excel spreadsheet, but in this case, each column is limited to one data type.

Creating a DataFrame

First, a simple way of creating a DataFrame, using the good old Dictionary from Python (let’s import Pandas first, since this will be the first time the code will be executed in this post):

import pandas as pd

df_data = {'country': ['Brazil', 'Argentina','Argentina', 'Brazil', 'Chile','Chile'],
           'year': [2005, 2006, 2005, 2006, 2007, 2008],
           'population': [170.1, 30.5, 32.2, 172.6, 40.8, 42.0]}
df = pd.DataFrame(df_data)
print(df)

     country  population  year
0     Brazil       170.1  2005
1  Argentina        30.5  2006
2  Argentina        32.2  2005
3     Brazil       172.6  2006
4      Chile        40.8  2007
5      Chile        42.0  2008

Another way, way more used, is to import data from a file (in a lot of different formats) or a database directly to a DataFrame.

We will use the Titanic example from Kaggle, as an example. For those who don’t know what Kaggle is, it is a website of Data Science competitions. Normally hosted by companies, the competitions usually involve predictions throughs Datasets of training and test, but there are also competitions with another tasks, like image recognition and creation of nice data visualizations with Datasets provided by Kaggle. Introduction made, let’s follow.

Sign up for an account on Kaggle and click on “Competitions”. Scroll down a bit and find the competition “Titanic: Machine Learning from Disaster”. Click on it and then you can read details about this competition. This is a permanent “competition” on Kaggle, that exists for people to train and learn about Data Science and Machine Learning, applying their newly found knowledge to some data that is ready to use.

For this competition, several informations are provided in a training Dataset regarding passenger from the Titanic, like age, gender, cabin, ticket value, among others. Besides that, the train Dataset also shows the target variable, if a passenger survived (value == 1) or not (value == 0) to the wreckage. On the test Dataset, the same informations are shown, except for the target variable (survival). For the passengers on the test Dataset, you will have to predict if they survived or not, based on all the information provided, comparing the information on the training set with the information on the test set. On the competition page you can also see another details, like what each variable on the Dataset means, the Leaderboard for the competition, some tutorial related to the competition.

In the competition page, click the “Data” link, on the sidebar on the left, and download the train and test files. These are CSV files. Save them somewhere on your computer and then create the Python script on the same folder, so we can read them without having to specify the entire path. So, finally, the command read_csv from Pandas will read a csv file and make it in a DataFrame. We will use it to store the Dataset in a variable, and then we will use the head() command of a DataFrame, to print just the first 5 rows, just to show that the command worked. Let’s see (from now, we will consider that Pandas is already imported as pd):

import pandas as pd
train_dataset = pd.read_csv('train.csv')
print(train_dataset.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
4                           Allen, Mr. William Henry    male   35      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

Reading DataFrame information

To retrieve information from a column, there are two most common ways. The first one is to call the column as if the DataFrame was a Dictionary. The second is analogous to calling an object attribute. With the first way, you can also retrieve multiple columns:

print(train_dataset['Age'])
0      22
1      38
2      26
3      35
4      35
...
886    27
887    19
888   NaN
889    26
890    32

print(train_dataset.Sex)
0        male
1      female
2      female
3      female
4        male
...
886      male
887    female
888    female
889      male
890      male

print(train_dataset[['Sex','Age','Fare']])
        Sex  Age      Fare
0      male   34    7.2500
1    female   10   71.2833
2    female   30    7.9250
3    female    5   53.1000
4      male   22    8.0500
...
886    male   59   13.0000
887  female   30   30.0000
888  female   19   23.4500
889    male   57   30.0000
890    male    7    7.7500

To retrieve the contents of a row, we use the ix attribute, passing to it the row position. If a List of values is passed, all the rows will be shown:

print(train_dataset.ix[3])
PassengerId                                               4
Survived                                                  1
Pclass                                                    1
Name           Futrelle, Mrs. Jacques Heath (Lily May Peel)
Sex                                                  female
Age                                                      35
SibSp                                                     1
Parch                                                     0
Ticket                                               113803
Fare                                                   53.1
Cabin                                                  C123
Embarked                                                  S

print(train_dataset.ix[[0,10,50]])
    PassengerId  Survived  Pclass                             Name     Sex  \
0             1         0       3          Braund, Mr. Owen Harris    male   
10           11         1       3  Sandstrom, Miss. Marguerite Rut  female   
50           51         0       3       Panula, Master. Juha Niilo    male   

    Age  SibSp  Parch     Ticket     Fare Cabin Embarked 
0    22      1      0  A/5 21171   7.2500   NaN        S  
10    4      1      1    PP 9549  16.7000    G6        S 
50    7      4      1    3101295  39.6875   NaN        S 

As we already saw, head() shows the 5 first lines of the DataFrame. If we pass an integer number to it, it will retrieve this same number of rows. Just like it, tails() will show the last rows of a DataFrame. To get the name of each column, we use columns(). And finally, we can use shape to retrieve the number of columns and rows. dataframe.shape[0] is the number of rows and [1] is the number of columns:

print(train_dataset.tail())
     PassengerId  Survived  Pclass                                      Name  \
886          887         0       2                     Montvila, Rev. Juozas   
887          888         1       1              Graham, Miss. Margaret Edith   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex  Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
886    male   27      0      0      211536  13.00   NaN        S  
887  female   19      0      0      112053  30.00   B42        S  
888  female  NaN      1      2  W./C. 6607  23.45   NaN        S  
889    male   26      0      0      111369  30.00  C148        C  
890    male   32      0      0      370376   7.75   NaN        Q  

print(train_dataset.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

print(train_dataset.shape)
(891, 12)

A really interesting method of a DataFrame is describe(). Describe calculates summary statistics for each numerical column in the DataFrame, like counting of values, sum, mean, median, etc. It’s interesting when you want to have an idea of what the data being worked looks like:

print(train_dataset.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

One last important thing to know about reading DataFrames is filtering. We will filter the DataFrame by the value of the “Sex” column. We will select only rows where the “Sex” columns equals “female”, by making an assertion:

print(train_dataset[train_dataset.Sex == "female"])
     PassengerId  Survived  Pclass  \
1              2         1       1   
2              3         1       3   
3              4         1       1   
8              9         1       3   
9             10         1       2   
..           ...       ...     ...   
880          881         1       2   
882          883         0       3   
885          886         0       3   
887          888         1       1   
888          889         0       3   

                                                  Name     Sex  Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                               Heikkinen, Miss. Laina  female   26      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female   27      0   
9                  Nasser, Mrs. Nicholas (Adele Achem)  female   14      1   
..                                                 ...     ...  ...    ...   
880       Shelley, Mrs. William (Imanita Parrish Hall)  female   25      0   
882                       Dahlberg, Miss. Gerda Ulrika  female   22      0   
885               Rice, Mrs. William (Margaret Norton)  female   39      0   
887                       Graham, Miss. Margaret Edith  female   19      0   
888           Johnston, Miss. Catherine Helen "Carrie"  female  NaN      1   

     Parch            Ticket      Fare Cabin Embarked  
1        0          PC 17599   71.2833   C85        C  
2        0  STON/O2. 3101282    7.9250   NaN        S  
3        0            113803   53.1000  C123        S  
8        2            347742   11.1333   NaN        S  
9        0            237736   30.0708   NaN        C  
..     ...               ...       ...   ...      ...  
880      1            230433   26.0000   NaN        S  
882      0              7552   10.5167   NaN        S  
885      5            382652   29.1250   NaN        Q  
887      0            112053   30.0000   B42        S  
888      2        W./C. 6607   23.4500   NaN        S  

I believe this is a good start in the reading part. In the next post I’ll talk about data manipulation, with creation of columns, how to change column values, among other stuff.

Stay tuned! :)