Descriptive Statistics with Python

March 31, 2016

Introduction

According to Google, Statistics is “the Mathematics branch that deals with collecting, analyzing, interpreting and presenting masses of numerical data”. Statistics is therefore very important for Data Science, where its concepts are used to analyze a wide range of situations and problems.

This post is not intended to be a complete Statistics course, but an Introduction that will teach some concepts and how to apply them in Python and Pandas.

Here, we will focus on Descriptive Statistics, the part of Statistics whose goal is to describe and summarize sets of data. It does so through measures such as the well-known mean, or average.

In the examples for this post, we will use the Titanic dataset again, but we will also look at a simpler Pandas example, so you can confirm that everything is being calculated correctly.

The measures of Descriptive Statistics are divided into measures of central tendency and measures of dispersion.

Measures of Central Tendency

Measures of Central Tendency define significant, representative and adequate values for a set of data, depending on what you want to analyze. They are the mean, median, quantiles and mode.

Mean

The mean is a measure of central tendency that indicates the value around which the data is concentrated, giving a single representative value for the set. It is calculated by adding up all the values and dividing by the number of values.

In Pandas, the mean is calculated with the mean() function, which is available on both DataFrames and Series. Let’s create a Series so we can calculate its mean, and we will also calculate the mean of the ages in the Titanic dataset:

import pandas as pd
import numpy as np

train_df = pd.read_csv('train.csv')
example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

print(train_df['Age'].mean())
29.69911764705882
print(example_series.mean())
25.11111111111111
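
To confirm the result by hand, here is a minimal sketch that recomputes the mean of the example Series as sum divided by count. The with_gap Series is a hypothetical addition, included only to show that mean() skips missing values by default, which matters for a column with gaps in it.

```python
import pandas as pd

# Same toy Series used in the post.
example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# The mean is the sum of the values divided by how many there are.
manual_mean = example_series.sum() / len(example_series)

print(manual_mean)            # 226 / 9
print(example_series.mean())  # should agree with the manual value

# Hypothetical Series with a missing value: mean() ignores NaN by default.
with_gap = pd.Series([1.0, None, 3.0])
print(with_gap.mean())        # average of 1.0 and 3.0 only
```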

Median and Quantile

The median is the value that separates the lower half of a data set from the upper half, that is, the value in the center of the sorted data. If the number of observations is odd, the median is the middle value; if it is even, the median is the average of the two middle values. Let’s see below:

print(train_df['Age'].median())
28.0
print(example_series.median())
30.0
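
The odd and even cases can be checked with a small hand-written function. This is only a sketch: manual_median and even_series are names made up for this illustration, and the extra value 20 was chosen so that the two middle values differ.

```python
import pandas as pd

def manual_median(values):
    """Median by the textbook rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits in the center.
        return float(ordered[middle])
    # Even count: average the two central values.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])       # 9 values
even_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 20])  # 10 values

print(manual_median(odd_series), odd_series.median())
print(manual_median(even_series), even_series.median())
```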

The median is less sensitive to large outliers than the mean. If the number of observations is not very large and one observation is far larger than the others, the mean becomes less representative of most of the group. For example, if you are analyzing the income of a college class in which one person is a millionaire and the rest earn an average worker’s salary, the median will probably represent the income of the group as a whole better, since the mean will be “contaminated” by the outlier.
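
A small sketch makes the income example concrete. The figures below are invented purely for illustration and do not come from any real data set.

```python
import pandas as pd

# Hypothetical yearly incomes: five ordinary salaries and one millionaire.
incomes = pd.Series([38_000, 40_000, 42_000, 45_000, 50_000, 1_000_000])

mean_income = incomes.mean()
median_income = incomes.median()

print(mean_income)    # pulled far upwards by the single outlier
print(median_income)  # stays close to what most of the group earns
```

Five of the six people earn far less than the mean, while the median sits between the third and fourth salaries.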

The quantile is a generalization of the median: it is the value below which a given fraction of the data falls. For the median, this fraction is 50%. In Pandas, quantiles are calculated with the quantile() function. The fraction is set through the q parameter, a number between 0 and 1 that defaults to 0.5 (the median):

print(train_df['Age'].quantile())
28.0
print(example_series.quantile())
30.0

print(train_df['Age'].quantile(q=0.25))
20.125
print(example_series.quantile(q=0.25))
10.0
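
When the requested position falls between two observations, quantile() interpolates linearly between them by default. The sketch below reproduces that rule by hand; manual_quantile is a hypothetical helper written for this post, not a Pandas function. It also shows that q accepts a list of fractions.

```python
import math
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

def manual_quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

for q in (0.25, 0.5, 0.9):
    print(q, manual_quantile(example_series, q), example_series.quantile(q=q))

# Several quantiles at once: the result is a Series indexed by q.
quartiles = example_series.quantile(q=[0.25, 0.5, 0.75])
print(quartiles)
```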

Mode

Mode is simple. The mode is the value that has the most occurrences in a data set. In Pandas, it is calculated through the mode() function. Let’s see:

print(train_df['Age'].mode())
0    24
dtype: float64
print(example_series.mode())
0    30
dtype: int64

The function returns the data in this format because, in the case that two values are tied for most occurrences, the function will return both of them. Let’s see:

example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])
print(example_series_2.mode())
0    30
1    45
dtype: int64
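
To see where the tie comes from, one option is to count the occurrences of each value with value_counts() and keep the values that reach the highest count. This is just a sketch of the idea, not how mode() is implemented internally.

```python
import pandas as pd

example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])

# Number of occurrences of each distinct value.
counts = example_series_2.value_counts()
highest_count = counts.max()

# Every value that reaches the highest count is a mode.
manual_modes = sorted(counts[counts == highest_count].index)

print(highest_count)
print(manual_modes)
print(list(example_series_2.mode()))
```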

Measures of Dispersion

Measures of Dispersion indicate how spread out the data is, or how much it varies. The measures of dispersion are the range, the variance, the standard deviation and the absolute deviation, or mean absolute deviation.

Range

Range is the difference between the largest and the smallest value in a data set. To calculate it, we will use two Pandas functions, max() and min(), which return the maximum and minimum values in a data set, and then subtract one from the other:

print(train_df['Age'].max() - train_df['Age'].min())
79.58
print(example_series.max() - example_series.min())
49
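
As an alternative sketch, NumPy offers np.ptp ("peak to peak"), which returns the same difference in a single call when given the underlying array.

```python
import numpy as np
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Range computed the same way as in the post.
manual_range = example_series.max() - example_series.min()

# np.ptp returns max - min of an array.
ptp_range = np.ptp(example_series.to_numpy())

print(manual_range, ptp_range)
```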

Variance

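Variance is a measure that expresses how far the data is from its expected value (the mean): it is the average of the squared differences between each value and the mean. We calculate variance in Pandas with the var() function. Note that var() computes the sample variance by default, dividing by n - 1 rather than n: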

print(train_df['Age'].var())
211.0191247463081
print(example_series.var())
325.1111111111111
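
The calculation can be reproduced step by step. The sketch below computes both the sample variance (dividing by n - 1, the var() default) and the population variance (dividing by n, which var(ddof=0) returns).

```python
import math
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])
n = len(example_series)

# Squared distance of each value from the mean.
squared_deviations = (example_series - example_series.mean()) ** 2

sample_variance = squared_deviations.sum() / (n - 1)   # what var() returns by default
population_variance = squared_deviations.sum() / n     # what var(ddof=0) returns

print(sample_variance, example_series.var())
print(population_variance, example_series.var(ddof=0))
```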

Standard Deviation

Standard deviation is also a measure of dispersion that indicates how far the data is from its mean. It is the square root of the variance, so it is expressed in the same unit as the data. A high standard deviation means that the data is more spread out, and a low standard deviation means that the values are closer to the mean. In Pandas we calculate it with the std() function:

print(train_df['Age'].std())
14.526497332334044
print(example_series.std())
18.03083778173136
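
A short sketch can confirm that std() is the square root of var(), and give a feel for the spread by counting how many values lie within one standard deviation of the mean.

```python
import math
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

mean = example_series.mean()
std = example_series.std()

# The standard deviation is the square root of the variance.
std_from_variance = math.sqrt(example_series.var())
print(std, std_from_variance)

# How many observations fall within one standard deviation of the mean?
within_one_std = example_series.between(mean - std, mean + std).sum()
print(within_one_std, "of", len(example_series))
```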

Absolute Deviation or Mean Absolute Deviation

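The Absolute Deviation is calculated like this: first, we calculate the mean of the values; then, we calculate the distance (the absolute difference) of each point from this mean; finally, we sum these distances and divide the result by the number of observations. In other words, it is the mean of the distances.

In Pandas, the mad() function calculates this. Note that mad() was deprecated in pandas 1.5 and removed in pandas 2.0; on those versions, the same value can be obtained with (s - s.mean()).abs().mean() for a Series s. Let’s see the examples: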

print(train_df['Age'].mad())
11.322944471906405
print(example_series.mad())
15.432098765432098
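
The steps can also be written out directly, which works regardless of whether the installed Pandas version still ships mad(). A minimal sketch:

```python
import math
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Step 1: the mean of the values.
mean = example_series.mean()

# Step 2: the distance of each point from the mean.
distances = (example_series - mean).abs()

# Step 3: sum the distances and divide by the number of observations.
mean_absolute_deviation = distances.sum() / len(example_series)

print(mean_absolute_deviation)
```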

Covariance and Correlation

Sometimes we may want to know if two variables in a data set are related. For these cases, we calculate the covariance and correlation.

Covariance is a numerical measure of how two variables vary together in relation to their means. A covariance close to 0 indicates that there is no linear relationship between the variables (which is not the same as saying they are independent), while a high positive covariance means that one variable tends to be large when the other is large. Analogously, a negative covariance with a high absolute value means that one variable tends to be large when the other is small. Covariance can be calculated with the cov() function. This function returns a matrix with the covariance between each pair of columns in the data set; the diagonal holds the variance of each column:

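# Keep only the numeric columns: pandas 2.0+ raises an error if text columns are included
print(train_df.select_dtypes(include='number').cov())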
              PassengerId  Survived     Pclass         Age      SibSp  \
PassengerId  66231.000000 -0.626966  -7.561798  138.696504 -16.325843   
Survived        -0.626966  0.236772  -0.137703   -0.551296  -0.018954   
Pclass          -7.561798 -0.137703   0.699015   -4.496004   0.076599   
Age            138.696504 -0.551296  -4.496004  211.019125  -4.163334   
SibSp          -16.325843 -0.018954   0.076599   -4.163334   1.216043   
Parch           -0.342697  0.032017   0.012429   -2.344191   0.368739   
Fare           161.883369  6.221787 -22.830196   73.849030   8.748734   

                Parch         Fare  
PassengerId -0.342697   161.883369  
Survived     0.032017     6.221787  
Pclass       0.012429   -22.830196  
Age         -2.344191    73.849030  
SibSp        0.368739     8.748734  
Parch        0.649728     8.661052  
Fare         8.661052  2469.436846
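
To make the definition concrete, here is a sketch that computes the covariance of two small Series by hand and compares it with Series.cov(). The x and y values are invented for illustration. The second part shows a case worth keeping in mind: y = x² depends entirely on x, yet their covariance is 0, because covariance only captures linear relationships.

```python
import math
import pandas as pd

# Hypothetical toy data.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

# Multiply the deviations from each mean, sum them, divide by n - 1.
products = (x - x.mean()) * (y - y.mean())
manual_cov = products.sum() / (len(x) - 1)

print(manual_cov, x.cov(y))

# A perfectly dependent but non-linear pair: covariance is still 0.
symmetric = pd.Series([-2, -1, 0, 1, 2])
squared = symmetric ** 2
print(symmetric.cov(squared))
```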

Covariance, however, is hard to interpret and compare, because its scale depends on the units of the variables involved. To make comparison easier, we normalize the covariance by dividing it by the product of the two standard deviations. The result is the correlation, which always lies between -1 (perfect anti-correlation) and 1 (perfect correlation). We calculate it with the corr() function. Just like the cov() function, it returns a matrix with the correlation between each pair of columns in the data set:

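# Keep only the numeric columns: pandas 2.0+ raises an error if text columns are included
print(train_df.select_dtypes(include='number').corr())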
             PassengerId  Survived    Pclass       Age     SibSp     Parch  \
PassengerId     1.000000 -0.005007 -0.035144  0.036847 -0.057527 -0.001652   
Survived       -0.005007  1.000000 -0.338481 -0.077221 -0.035322  0.081629   
Pclass         -0.035144 -0.338481  1.000000 -0.369226  0.083081  0.018443   
Age             0.036847 -0.077221 -0.369226  1.000000 -0.308247 -0.189119   
SibSp          -0.057527 -0.035322  0.083081 -0.308247  1.000000  0.414838   
Parch          -0.001652  0.081629  0.018443 -0.189119  0.414838  1.000000   
Fare            0.012658  0.257307 -0.549500  0.096067  0.159651  0.216225   

                 Fare  
PassengerId  0.012658  
Survived     0.257307  
Pclass      -0.549500  
Age          0.096067  
SibSp        0.159651  
Parch        0.216225  
Fare         1.000000
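
The relationship between the two measures can be sketched on a pair of small Series (the values are invented for illustration). Rescaling one variable changes the covariance but leaves the correlation untouched, which is what makes correlation comparable across columns.

```python
import math
import pandas as pd

# Hypothetical toy data.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

# Correlation is the covariance divided by the product of the standard deviations.
manual_corr = x.cov(y) / (x.std() * y.std())
print(manual_corr, x.corr(y))

# Changing the unit of x (for example, meters to centimeters) scales the covariance...
x_scaled = x * 100
print(x.cov(y), x_scaled.cov(y))

# ...but the correlation stays the same.
print(x.corr(y), x_scaled.corr(y))
```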

However, we have to be careful when evaluating correlations. The data may show a correlation that does not reflect any real relationship and is purely the result of chance, known as a spurious correlation. This is more likely with small samples, or when both variables happen to follow a similar pattern, such as a common trend over time.

Finally, it is also important to remember the famous saying that “correlation does not imply causation”. For our purposes, this means that when two variables are highly correlated, the first may be driving the second, the second may be driving the first, each may be influencing the other, both may be driven by a third variable, or the correlation may be a coincidence. So we need to understand what the data means in order to evaluate each case.
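
A quick simulation can illustrate how chance alone produces correlations. The sketch below draws many pairs of completely unrelated random samples and records the correlation of each pair; the sample size, the number of pairs and the seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_pairs = 1000   # how many unrelated pairs to draw
n_obs = 10       # observations per sample (deliberately small)

abs_correlations = []
for _ in range(n_pairs):
    a = rng.normal(size=n_obs)
    b = rng.normal(size=n_obs)   # generated independently of a
    abs_correlations.append(abs(np.corrcoef(a, b)[0, 1]))

abs_correlations = np.array(abs_correlations)

# Strongest correlation found among pairs that have no relationship at all.
print(abs_correlations.max())

# Share of pairs whose correlation exceeds 0.5 in absolute value.
print((abs_correlations > 0.5).mean())
```

With only 10 observations per sample, a noticeable share of unrelated pairs ends up with a correlation that would look meaningful in isolation.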

For an introduction to Descriptive Statistics, I think that this is a good start :)

Regards!

Felipe Galvao