Data Manipulation with Python (Pandas)

February 29, 2016

Introduction

Having covered the basics of reading data into Pandas (this post), we will now go over the basics of data manipulation: changing values, creating and removing columns, and more. We will use the Titanic dataset again for our examples. If you are not familiar with it, check the post mentioned at the beginning of this paragraph.

Removing information from the DataFrame

Let’s start with removing information from a DataFrame. Data that is not useful to you at a given moment can be removed from the dataset to make your job easier.

To remove information from a DataFrame, we use drop(). With drop(), you pass the index label of the row or column to be removed. If you don’t set the axis parameter, it will remove a row; if you set axis=1, it will remove a column. You can also pass a list as the parameter, and all the rows or columns in the list will be removed (there is a small sketch of this after the row example below). It’s important to note that drop() will not remove the data directly. Instead, it creates a new object, so if you want to keep the result you have to assign the DataFrame returned by drop() to a variable. Let’s remove the "Name" column from the DataFrame; it contains very long values and that makes the table harder to read. We will keep the DataFrame like that until the end of the post ;)

import pandas as pd

train_df = pd.read_csv('train.csv')
train_df = train_df.drop('Name', axis=1)
print(train_df.head())
   PassengerId  Survived  Pclass     Sex  Age  SibSp  Parch            Ticket  \
0            1         0       3    male   22      1      0         A/5 21171   
1            2         1       1  female   38      1      0          PC 17599   
2            3         1       3  female   26      0      0  STON/O2. 3101282   
3            4         1       1  female   35      1      0            113803   
4            5         0       3    male   35      0      0            373450   

      Fare Cabin Embarked  
0   7.2500   NaN        S  
1  71.2833   C85        C  
2   7.9250   NaN        S  
3  53.1000  C123        S  
4   8.0500   NaN        S  

Now let's do the same with rows. Note that the result now starts at index 1, because we removed row 0:

print(train_df.drop(0).head())
   PassengerId  Survived  Pclass     Sex  Age  SibSp  Parch            Ticket  \
1            2         1       1  female   38      1      0          PC 17599   
2            3         1       3  female   26      0      0  STON/O2. 3101282   
3            4         1       1  female   35      1      0            113803   
4            5         0       3    male   35      0      0            373450   
5            6         0       3    male  NaN      0      0            330877   

      Fare Cabin Embarked  
1  71.2833   C85        C  
2   7.9250   NaN        S  
3  53.1000  C123        S  
4   8.0500   NaN        S  
5   8.4583   NaN        Q  
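
As mentioned above, drop() also accepts a list of labels. Here is a minimal sketch of that (the column choice is just for illustration, and since the result is not assigned back, train_df is not changed):

# Remove more than one column at once by passing a list of labels
print(train_df.drop(['Ticket', 'Cabin'], axis=1).head())

# The same works for rows: remove the first three rows by their index labels
print(train_df.drop([0, 1, 2]).head())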

Changing values and creating new columns

To change the values of a column, we do the same thing as when we retrieve a column, but assigning a value to it. You can assign a single value to the entire column, a list or array, or the result of a function (like one that generates random numbers). The list or array must have the same length as the DataFrame (there is a short sketch of this after we reload the dataset below). Let’s use the "Embarked" column for this example, and then we will load the dataset again, because we will need that column for a later example:

train_df['Embarked'] = "Port"
print(train_df['Embarked'])

0      Port
1      Port
2      Port
3      Port
4      Port
...
886    Port
887    Port
888    Port
889    Port
890    Port

train_df = pd.read_csv('train.csv')
train_df = train_df.drop('Name',axis=1)
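
Just to sketch the other assignment options mentioned above, here is what assigning an array would look like. The "Random_Column" name is hypothetical, used only for illustration, and the column is removed right afterwards so the rest of the post is unaffected:

import numpy as np

# Assign an array with the same length as the DataFrame (891 rows)
train_df['Random_Column'] = np.random.rand(len(train_df))
print(train_df['Random_Column'].head())

# Remove the illustrative column again
train_df = train_df.drop('Random_Column', axis=1)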

You can also create a new column by assigning values to a column name that does not yet exist in the DataFrame. In the Titanic dataset, for example, the "SibSp" column is the number of siblings and spouses aboard, and "Parch" is the number of parents and children aboard. We will create a "Family_Size" column, which will be the sum of these two columns:

train_df['Family_Size'] = train_df['SibSp'] + train_df['Parch']
print(train_df['Family_Size'])

0       1
1       1
2       0
3       1
4       0
...
886     0
887     0
888     3
889     0
890     0

Working with missing values (NaN)

If the dataset you are working on has missing values (NaN), Pandas has some functions to deal with them in the best way possible.

First, if you want to check where the missing values are in your DataFrame, you can use the isnull() function. In the Titanic dataset, for example, the "Cabin" column has a lot of missing values, so let's call isnull() on that column:

train_df['Cabin'].isnull()
0       True
1      False
2       True
3      False
4       True
...
886     True
887    False
888     True
889    False
890     True

However, if you want to know which columns in your dataset have missing values and there are a lot of columns, the previous approach is not very practical. For these cases, you can use isnull() together with any(), which returns, for each column, True if there is at least one missing value in it. For the opposite, you can use notnull() together with all(). Let’s see (a sketch that counts the missing values follows these outputs):

train_df.isnull().any()
PassengerId        False
Survived           False
Pclass             False
Sex                False
Age                 True
SibSp              False
Parch              False
Ticket             False
Fare               False
Cabin               True
Embarked            True
Family_Size        False

train_df.notnull().all()
PassengerId         True
Survived            True
Pclass              True
Sex                 True
Age                False
SibSp               True
Parch               True
Ticket              True
Fare                True
Cabin              False
Embarked           False
Family_Size         True
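
If, besides knowing which columns have missing values, you also want to know how many are missing in each one, a small sketch (output omitted) is to combine isnull() with sum(), which counts the True values per column:

# Count how many values are missing in each column
print(train_df.isnull().sum())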

Knowing where the missing values are, you can decide whether you want to remove that data from the dataset, not using it, or fill the missing values with an adequate strategy.

The dropna() command removes from the DataFrame any row that has at least one NaN. If we pass the how='all' parameter, it will only remove rows where all the columns are NaN (there is a small sketch of this after the examples below). To remove columns instead of rows, we use the same command with the axis=1 parameter:

print(train_df.dropna().head())
    PassengerId  Survived  Pclass     Sex  Age  SibSp  Parch    Ticket  \
1             2         1       1  female   38      1      0  PC 17599   
3             4         1       1  female   35      1      0    113803   
6             7         0       1    male   54      0      0     17463   
10           11         1       3  female    4      1      1   PP 9549   
11           12         1       1  female   58      0      0    113783   

       Fare Cabin Embarked  Family_Size  
1   71.2833   C85        C            1  
3   53.1000  C123        S            1  
6   51.8625   E46        S            0  
10  16.7000    G6        S            2  
11  26.5500  C103        S            0  

print(train_df.dropna(axis=1).head())
   PassengerId  Survived  Pclass     Sex  SibSp  Parch            Ticket  \
0            1         0       3    male      1      0         A/5 21171   
1            2         1       1  female      1      0          PC 17599   
2            3         1       3  female      0      0  STON/O2. 3101282   
3            4         1       1  female      1      0            113803   
4            5         0       3    male      0      0            373450   

      Fare  Family_Size  
0   7.2500            1  
1  71.2833            1  
2   7.9250            0  
3  53.1000            1  
4   8.0500            0  

In the first case, note the row indexes: many rows were removed for having missing values, mostly in the "Cabin" column. In the second one, the "Cabin" column itself was removed, along with "Embarked" and "Age".
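
To illustrate the how='all' parameter mentioned above, here is a minimal sketch. Since no row of the Titanic dataset has all of its columns as NaN, nothing is removed in this case, which is exactly the point of that option:

# Only removes rows where every column is NaN; here, no row qualifies
print(len(train_df.dropna(how='all')))  # still 891 rows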

If you want to fill the missing values, you use the fillna() function, passing the value that should replace the NaNs. Generally, it makes more sense to work with one column at a time. As we saw above, the "Age" and "Embarked" columns have missing values. We will fill the "Age" column with the mean of the ages and the "Embarked" column with the mode of the column, that is, the value that occurs most often in it. By default, fillna() returns a new object, but you can use the inplace=True parameter to modify the original object (a version without inplace is sketched after the check below). We will create a new DataFrame, filled_df, just to preserve the original one, and then fill the missing values:

# copy() is needed here; filled_df = train_df alone would only create a reference,
# and the inplace fills below would also modify the original train_df
filled_df = train_df.copy()
filled_df['Age'].fillna(filled_df['Age'].mean(), inplace=True)
filled_df['Embarked'].fillna(filled_df['Embarked'].mode()[0], inplace=True)

# Now we can see that the Age and Embarked columns do not have missing values anymore
print(filled_df.isnull().any())
PassengerId        False
Survived           False
Pclass             False
Sex                False
Age                False
SibSp              False
Parch              False
Ticket             False
Fare               False
Cabin               True
Embarked           False
Family_Size        False
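
Just to illustrate the default behaviour mentioned above (fillna() returning a new object instead of modifying in place), the same fill on "Age" could be written by assigning the result back, something like:

# Equivalent fill without inplace=True: assign the returned Series back to the column
filled_df['Age'] = filled_df['Age'].fillna(filled_df['Age'].mean())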

Removing duplicates

If you do not want duplicate rows in your dataset, there are two functions for that. The first one, duplicated(), indicates whether a row is a duplicate or not. By default, it only returns True when a row is completely equal to a previous one, but we can also check for duplicates based on column values. For that, we pass the column name, or a Python list of names, as a parameter; if we pass more than one column, the function returns True only when the values of all of them are equal between two rows (a multi-column sketch follows this example). In turn, the drop_duplicates() function removes the duplicated rows from the DataFrame. It works the same way as duplicated(): by default, it only removes a row that is entirely equal to a previous one, but it can also work on selected column values. Let’s see examples with the "Embarked" column. Since it only has three values, we know there will be a lot of duplicates in it:

print(train_df.duplicated("Embarked"))
0      False
1      False
2       True
3       True
4       True
5      False
...
885     True
886     True
887     True
888     True
889     True
890     True

drop_duplicate_df = train_df.drop_duplicates(['Embarked'])
print(drop_duplicate_df)
   PassengerId  Survived  Pclass     Sex        Age  SibSp  Parch     Ticket  \
0            1         0       3    male  22.000000      1      0  A/5 21171   
1            2         1       1  female  38.000000      1      0   PC 17599   
5            6         0       3    male  29.699118      0      0     330877   

      Fare Cabin Embarked  Family_Size  
0   7.2500   NaN        S            1  
1  71.2833   C85        C            1  
5   8.4583   NaN        Q            0  
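
As a quick sketch of the multi-column case described above (the choice of columns here is just for illustration), passing more than one name keeps only the first row of each combination of values in those columns:

# Keep one row for each (Pclass, Embarked) combination
print(train_df.drop_duplicates(['Pclass', 'Embarked']).head())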
 

Some Statistics on DataFrames

Pandas also has a series of functions to calculate descriptive statistics on DataFrames. The statistics can be computed along rows or along columns, and there are functions for mean, mode, median, among many others. By default, the calculation is made along columns, but it can also be made along rows by passing axis=1 as a parameter. Let’s see some examples, with a small extra sketch after them:

train_df.sum()
PassengerId        397386.0000
Survived              342.0000
Pclass               2057.0000
Age                 21205.1700
SibSp                 466.0000
Parch                 340.0000
Fare                28693.9493
Family_Size           806.0000

train_df.mean()
PassengerId        446.000000
Survived             0.383838
Pclass               2.308642
Age                 29.699118
SibSp                0.523008
Parch                0.381594
Fare                32.204208
Family_Size          0.904602

train_df.max(axis=1)
0       22.0000
1       71.2833
2       26.0000
3       53.1000
4       35.0000
...
886    887.0000
887    888.0000
888    889.0000
889    890.0000
890    891.0000
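
A minimal sketch with two of the other functions mentioned above, median() and mode(), applied to single columns:

# Median of the ages and the most frequent port of embarkation
print(train_df['Age'].median())
print(train_df['Embarked'].mode()[0])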

The complete list of functions can be found on the following link, from the official documentation: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics

Sorting DataFrames

When reading data, it can sometimes be important for the DataFrame to be sorted. To demonstrate sorting by index, we will first reindex the DataFrame so that the indexes are reversed, and then use the sort_index() function to sort it back. You can also use the ascending=False parameter to sort in descending order:

# Here we reverse the DataFrame indexes
new_train_df = train_df.reindex(list(reversed(range(891))))
new_train_df.head()
     PassengerId  Survived  Pclass     Sex  Age  SibSp  Parch      Ticket  \
890          891         0       3    male   32      0      0      370376   
889          890         1       1    male   26      0      0      111369   
888          889         0       3  female  NaN      1      2  W./C. 6607   
887          888         1       1  female   19      0      0      112053   
886          887         0       2    male   27      0      0      211536   

      Fare Cabin Embarked  Family_Size  
890   7.75   NaN        Q            0  
889  30.00  C148        C            0  
888  23.45   NaN        S            3  
887  30.00   B42        S            0  
886  13.00   NaN        S            0  

# And here we sort it back through sort_index()
new_train_df = new_train_df.sort_index()
new_train_df.head()
   PassengerId  Survived  Pclass     Sex  Age  SibSp  Parch            Ticket  \
0            1         0       3    male   22      1      0         A/5 21171   
1            2         1       1  female   38      1      0          PC 17599   
2            3         1       3  female   26      0      0  STON/O2. 3101282   
3            4         1       1  female   35      1      0            113803   
4            5         0       3    male   35      0      0            373450   

      Fare Cabin Embarked  Family_Size  
0   7.2500   NaN        S            1  
1  71.2833   C85        C            1  
2   7.9250   NaN        S            0  
3  53.1000  C123        S            1  
4   8.0500   NaN        S            0  

# You can also use sort_index to sort in descending
new_train_df.sort_index(ascending=False).head()
     PassengerId  Survived  Pclass     Sex  Age  SibSp  Parch      Ticket  \
890          891         0       3    male   32      0      0      370376   
889          890         1       1    male   26      0      0      111369   
888          889         0       3  female  NaN      1      2  W./C. 6607   
887          888         1       1  female   19      0      0      112053   
886          887         0       2    male   27      0      0      211536   

      Fare Cabin Embarked  Family_Size  
890   7.75   NaN        Q            0  
889  30.00  C148        C            0  
888  23.45   NaN        S            3  
887  30.00   B42        S            0  
886  13.00   NaN        S            0  

It is also possible to sort based on column values, through the sort_values() function. You pass the column you want to sort by through the by parameter (a descending example is sketched after the output):

new_train_df.sort_values(by='Age').head()
     PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch  Ticket  \
803          804         1       3    male  0.42      0      1    2625   
755          756         1       2    male  0.67      1      1  250649   
644          645         1       3  female  0.75      2      1    2666   
469          470         1       3  female  0.75      2      1    2666   
78            79         1       2    male  0.83      0      2  248738   

        Fare Cabin Embarked  Family_Size  
803   8.5167   NaN        C            1  
755  14.5000   NaN        S            2  
644  19.2583   NaN        C            3  
469  19.2583   NaN        C            3  
78   29.0000   NaN        S            2  
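
Just like sort_index(), sort_values() also accepts the ascending=False parameter. A quick sketch, assuming we want the oldest passengers first:

# Sort by Age in descending order (oldest passengers first)
print(new_train_df.sort_values(by='Age', ascending=False).head())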

And just like that we finish one more post talking a little about the main functionalities of Pandas. I'm still thinking about the next post, but it will probably be about data visualization with matplotlib and how to plot some charts.

Stay tuned! :)