Data Manipulation with Python (Pandas)
February 29, 2016
Introduction
Having covered the basics of reading data into Pandas (this post), we’ll now talk about the basics of data manipulation: changing values, creating and removing columns, among other things. We will use the Titanic dataset again for our examples. If you don’t know what it’s about, check the post linked at the beginning of this paragraph.
Removing information from the DataFrame
Let’s start with removing information from a DataFrame. Information that is not useful to you at a given moment can be removed from the dataset to make your job easier.
To remove information from a DataFrame, we use drop(). With drop(), you pass the label of the row or column to be removed. If you don’t define an axis parameter, it will remove a row; if you define axis=1, it will remove a column. You can also pass a list as the parameter, and all the rows or columns in the list will be removed. It’s important to note that drop() does not remove the data in place. Instead, it creates a new object, so if you want to keep the result, you have to assign the DataFrame returned by drop() to a variable. Let’s remove the "Name" column from the DataFrame. It contains very long values, which makes the table harder to read. We will keep the DataFrame like that until the end of the post ;)
train_df = pd.read_csv('train.csv')
train_df = train_df.drop('Name',axis=1)
print(train_df.head())
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
0 1 0 3 male 22 1 0 A/5 21171
1 2 1 1 female 38 1 0 PC 17599
2 3 1 3 female 26 0 0 STON/O2. 3101282
3 4 1 1 female 35 1 0 113803
4 5 0 3 male 35 0 0 373450
Fare Cabin Embarked
0 7.2500 NaN S
1 71.2833 C85 C
2 7.9250 NaN S
3 53.1000 C123 S
4 8.0500 NaN S
Now we will do the same with rows. Note that the result now starts at index 1, because we removed row 0:
print(train_df.drop(0).head())
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
1 2 1 1 female 38 1 0 PC 17599
2 3 1 3 female 26 0 0 STON/O2. 3101282
3 4 1 1 female 35 1 0 113803
4 5 0 3 male 35 0 0 373450
5 6 0 3 male NaN 0 0 330877
Fare Cabin Embarked
1 71.2833 C85 C
2 7.9250 NaN S
3 53.1000 C123 S
4 8.0500 NaN S
5 8.4583 NaN Q
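The list form mentioned above works for both rows and columns. Here is a minimal sketch with a toy DataFrame (the column names are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Drop several rows at once by passing a list of index labels
no_rows = df.drop([0, 2])  # only row 1 remains

# Drop several columns at once with axis=1
no_cols = df.drop(['B', 'C'], axis=1)  # only column 'A' remains
```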
Changing values and creating new columns
To change the values of a column, we do the same thing as when retrieving a column, but assign a value to it. You can assign a single value to the entire column, a list or array, or the result of a function (such as a random number generator). The list or array must have the same length as the DataFrame. Let’s use the “Embarked” column for this example, and then reload the dataset, because we will need the original column later:
train_df['Embarked'] = "Port"
print(train_df['Embarked'])
0 Port
1 Port
2 Port
3 Port
4 Port
...
886 Port
887 Port
888 Port
889 Port
890 Port
train_df = pd.read_csv('train.csv')
train_df = train_df.drop('Name',axis=1)
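The other assignment forms mentioned above can be sketched with a small toy DataFrame (column names here are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})

# A single value is broadcast to every row
df['Port_Flag'] = 'Port'

# A list or array must have exactly len(df) elements
df['Code'] = [1, 2, 3]

# Values can also come from a function, e.g. random number generation
df['Noise'] = np.random.rand(len(df))
```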
You can also create a new column by assigning values to a column name that does not yet exist in the DataFrame. In the Titanic dataset, for example, the "SibSp" column is the number of siblings and spouses aboard, and "Parch" is the number of parents and children aboard. We will create "Family_Size", which will be the sum of these two columns:
train_df['Family_Size'] = train_df['SibSp'] + train_df['Parch']
print(train_df['Family_Size'])
0 1
1 1
2 0
3 1
4 0
...
886 0
887 0
888 3
889 0
890 0
Working with missing values (NaN)
If the dataset you are working on has missing values (NaN), Pandas has functions to deal with them in the best way possible.
First, if you want to check where the missing values are in your DataFrame, you can use the isnull() function. In the Titanic dataset, for example, the "Cabin" column has a lot of missing values, so you can check them by calling isnull() on that column.
train_df['Cabin'].isnull()
0 True
1 False
2 True
3 False
4 True
...
886 True
887 False
888 True
889 False
890 True
However, if your dataset has many columns and you want to know which of them have missing values, the previous approach is not very practical. For these cases, you can use isnull() together with any(): this returns, for each column, True if there is at least one missing value in it. For the opposite check, you can use notnull() together with all(). Let’s see:
train_df.isnull().any()
PassengerId False
Survived False
Pclass False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare False
Cabin True
Embarked True
Family_Size False
train_df.notnull().all()
PassengerId True
Survived True
Pclass True
Sex True
Age False
SibSp True
Parch True
Ticket True
Fare True
Cabin False
Embarked False
Family_Size True
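A related idiom worth knowing: chaining isnull() with sum() counts the missing values per column instead of just flagging them, since True is treated as 1. A sketch with toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22, np.nan, 35],
                   'Cabin': [np.nan, 'C85', np.nan]})

# isnull() yields booleans; sum() adds them up, giving a count per column
missing_counts = df.isnull().sum()
print(missing_counts)  # Age: 1, Cabin: 2
```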
Knowing where the missing values are, you can decide whether to remove that data from the dataset and not use it, or fill the missing values with an adequate strategy.
The dropna() command removes from the DataFrame any row that has at least one NaN. If we pass the how='all' parameter, it will only remove rows where all the columns are NaN. To remove columns instead, we use the same command with the axis=1 parameter:
print(train_df.dropna().head())
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
1 2 1 1 female 38 1 0 PC 17599
3 4 1 1 female 35 1 0 113803
6 7 0 1 male 54 0 0 17463
10 11 1 3 female 4 1 1 PP 9549
11 12 1 1 female 58 0 0 113783
Fare Cabin Embarked Family_Size
1 71.2833 C85 C 1
3 53.1000 C123 S 1
6 51.8625 E46 S 0
10 16.7000 G6 S 2
11 26.5500 C103 S 0
print(train_df.dropna(axis=1).head())
PassengerId Survived Pclass Sex SibSp Parch Ticket \
0 1 0 3 male 1 0 A/5 21171
1 2 1 1 female 1 0 PC 17599
2 3 1 3 female 0 0 STON/O2. 3101282
3 4 1 1 female 1 0 113803
4 5 0 3 male 0 0 373450
Fare Family_Size
0 7.2500 1
1 71.2833 1
2 7.9250 0
3 53.1000 1
4 8.0500 0
In the first case, note the row indexes: many rows were removed for having missing values, mostly in the "Cabin" column. In the second, the "Cabin" column itself was removed, along with "Embarked" and "Age".
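If you only care about missing values in specific columns, dropna() also accepts a subset parameter that restricts the check to those columns. A minimal sketch with a toy DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22, np.nan, 35],
                   'Cabin': [np.nan, 'C85', 'C123']})

# Only rows with NaN in 'Age' are removed; NaN in 'Cabin' is ignored
kept = df.dropna(subset=['Age'])
```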
If you want to fill the missing values, you use the fillna() function, passing the value that should replace the NaNs. Generally, it makes more sense to work with one column at a time. As we saw above, the "Age" and "Embarked" columns have missing values. We will fill the "Age" column with the mean of the ages in it, and the "Embarked" column with its mode, i.e. the value that occurs most often in it. By default, fillna() returns a new object, but you can use the inplace=True parameter to modify the original. We will create a new DataFrame, filled_df, just to preserve the original one, and then fill the missing values:
filled_df = train_df.copy()  # copy() is needed; plain assignment would only create a reference
filled_df['Age'].fillna(filled_df['Age'].mean(), inplace=True)
filled_df['Embarked'].fillna(filled_df['Embarked'].mode()[0], inplace=True)
# Now we can see that the Age and Embarked columns do not have missing values anymore
print(filled_df.isnull().any())
PassengerId False
Survived False
Pclass False
Sex False
Age False
SibSp False
Parch False
Ticket False
Fare False
Cabin True
Embarked False
Family_Size False
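As a side note, instead of inplace=True you can simply assign fillna()’s return value back to the column. In newer Pandas versions this is the safer idiom, since modifying a selected column in place may not propagate to the DataFrame. A minimal sketch with a toy DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0],
                   'Embarked': ['S', None, 'S']})

# Assigning the result back avoids relying on inplace=True
df['Age'] = df['Age'].fillna(df['Age'].mean())            # mean of 22 and 35 is 28.5
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
```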
Removing duplicates
If you do not want duplicate rows in your dataset, we have two functions for that. The first one, duplicated(), indicates whether a row is duplicated. By default, it only returns True if a row is entirely equal to a previous one, but we can also check for duplicates based on column values. For that, we pass the column names as a Python list, which can hold one or more names; if there is more than one, the function returns True only when the values of all those columns are equal between two rows. The drop_duplicates() function, in turn, removes the duplicated rows from the DataFrame. It works the same way as duplicated(): by default it only removes a row that is entirely equal to a previous one, but it can also remove based on selected column values. Let’s see examples with the “Embarked” column. Since it only has three distinct values, we know there will be a lot of duplicates in it:
print(train_df.duplicated("Embarked"))
0 False
1 False
2 True
3 True
4 True
5 False
...
885 True
886 True
887 True
888 True
889 True
890 True
drop_duplicate_df = train_df.drop_duplicates(['Embarked'])
print(drop_duplicate_df)
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
0 1 0 3 male 22.000000 1 0 A/5 21171
1 2 1 1 female 38.000000 1 0 PC 17599
5 6 0 3 male 29.699118 0 0 330877
Fare Cabin Embarked Family_Size
0 7.2500 NaN S 1
1 71.2833 C85 C 1
5 8.4583 NaN Q 0
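drop_duplicates() also has a keep parameter that controls which occurrence survives. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q', 'C']})

first = df.drop_duplicates(['Embarked'])              # keeps the first occurrence (default)
last = df.drop_duplicates(['Embarked'], keep='last')  # keeps the last occurrence
none = df.drop_duplicates(['Embarked'], keep=False)   # drops every value that repeats
```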
Some Statistics on DataFrames
Pandas also has a series of functions to calculate descriptive statistics on DataFrames. The statistics can be calculated along rows or columns, and there are functions for mean, mode, median, among many others. By default, the calculation is made along columns, but it can also be made along rows if you pass axis=1 as a parameter. Let’s see some examples:
train_df.sum()
PassengerId 397386.0000
Survived 342.0000
Pclass 2057.0000
Age 21205.1700
SibSp 466.0000
Parch 340.0000
Fare 28693.9493
Family_Size 806.0000
train_df.mean()
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
Family_Size 0.904602
train_df.max(axis=1)
0 22.0000
1 71.2833
2 26.0000
3 53.1000
4 35.0000
...
886 887.0000
887 888.0000
888 889.0000
889 890.0000
890 891.0000
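When you want several of these statistics at once, describe() bundles count, mean, standard deviation, min, quartiles and max in a single call. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 38, 26, 35]})

# describe() returns a Series of summary statistics for the column
summary = df['Age'].describe()
print(summary)
```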
The complete list of functions can be found on the following link, from the official documentation: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics
Sorting DataFrames
When reading data, it can sometimes be important for the DataFrame to be sorted. To sort by index, we will first reindex the DataFrame so that the indexes are reversed, and then use the sort_index() function to sort it back. You can also use the ascending=False parameter to sort in descending order:
# Here we reverse the DataFrame indexes
new_train_df = train_df.reindex(list(reversed(range(891))))
new_train_df.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
890 891 0 3 male 32 0 0 370376
889 890 1 1 male 26 0 0 111369
888 889 0 3 female NaN 1 2 W./C. 6607
887 888 1 1 female 19 0 0 112053
886 887 0 2 male 27 0 0 211536
Fare Cabin Embarked Family_Size
890 7.75 NaN Q 0
889 30.00 C148 C 0
888 23.45 NaN S 3
887 30.00 B42 S 0
886 13.00 NaN S 0
# And here we sort it back through sort_index()
new_train_df = new_train_df.sort_index()
new_train_df.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
0 1 0 3 male 22 1 0 A/5 21171
1 2 1 1 female 38 1 0 PC 17599
2 3 1 3 female 26 0 0 STON/O2. 3101282
3 4 1 1 female 35 1 0 113803
4 5 0 3 male 35 0 0 373450
Fare Cabin Embarked Family_Size
0 7.2500 NaN S 1
1 71.2833 C85 C 1
2 7.9250 NaN S 0
3 53.1000 C123 S 1
4 8.0500 NaN S 0
# You can also use sort_index to sort in descending order
new_train_df.sort_index(ascending=False).head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
890 891 0 3 male 32 0 0 370376
889 890 1 1 male 26 0 0 111369
888 889 0 3 female NaN 1 2 W./C. 6607
887 888 1 1 female 19 0 0 112053
886 887 0 2 male 27 0 0 211536
Fare Cabin Embarked Family_Size
890 7.75 NaN Q 0
889 30.00 C148 C 0
888 23.45 NaN S 3
887 30.00 B42 S 0
886 13.00 NaN S 0
It is also possible to sort based on column values, through the sort_values() function. You pass the column that you want to sort by through the by parameter:
new_train_df.sort_values(by='Age').head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket \
803 804 1 3 male 0.42 0 1 2625
755 756 1 2 male 0.67 1 1 250649
644 645 1 3 female 0.75 2 1 2666
469 470 1 3 female 0.75 2 1 2666
78 79 1 2 male 0.83 0 2 248738
Fare Cabin Embarked Family_Size
803 8.5167 NaN C 1
755 14.5000 NaN S 2
644 19.2583 NaN C 3
469 19.2583 NaN C 3
78 29.0000 NaN S 2
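sort_values() also accepts a list of columns in by, with a matching list for ascending, so ties in the first column are broken by the next one. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [3, 1, 3, 1],
                   'Age': [22, 38, 4, 35]})

# Sort by class ascending, then by age descending within each class
ordered = df.sort_values(by=['Pclass', 'Age'], ascending=[True, False])
```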
And with that we wrap up one more post covering the main functionalities of Pandas. I’m still deciding on the next post, but it will probably be about data visualization with matplotlib and how to make some plots.
Stay tuned! :)