Descriptive Statistics with Python
March 31, 2016
Introduction
According to Google, Statistics is “the Mathematics branch that deals with collecting, analyzing, interpreting and presenting masses of numerical data”. That makes Statistics essential to Data Science, where its concepts are used to analyze a wide variety of situations and problems.
This post is not intended to be a complete Statistics course, but an Introduction that will teach some concepts and how to apply them in Python and Pandas.
Here, we will focus on Descriptive Statistics, the part of Statistics that aims to describe and summarize sets of data. It does so through measures such as the well-known mean, or average.
In the examples for this post, we will use the Titanic dataset again, but we will also look at a simpler Pandas example, so you can confirm that everything is being calculated correctly.
Descriptive Statistics’ measures are divided between measures of central tendency and measures of dispersion.
Measures of Central Tendency
Measures of Central Tendency define significant, representative and adequate values for a set of data, depending on what you want to analyze. They are the mean, median, quantiles and mode.
Mean
The mean is a measure of central tendency that indicates the value around which the data is concentrated, giving a single representative value for the set.
With Pandas, the mean is calculated through the mean() function, which is available on both DataFrames and Series. Let’s create a Series so we can calculate its mean, and we will also calculate the mean of the ages in the Titanic dataset:
import pandas as pd
import numpy as np
train_df = pd.read_csv('train.csv')
example_series = pd.Series([1,5,10,30,50,30,15,40,45])
print(train_df['Age'].mean())
29.69911764705882
print(example_series.mean())
25.11111111111111
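As a quick check of what mean() is doing, the sketch below recomputes the mean of the example Series by hand. The names manual_mean and with_gap are only illustrative and are not part of the original examples. It also shows that mean() skips missing values by default, which is worth knowing when a column has gaps.

```python
import numpy as np
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# The mean is the sum of the values divided by the number of values.
manual_mean = example_series.sum() / example_series.count()
print(manual_mean)
print(example_series.mean())  # same value as manual_mean

# Illustrative Series with a missing value: mean() ignores NaN by default,
# so only the two present values are averaged.
with_gap = pd.Series([10.0, np.nan, 20.0])
print(with_gap.mean())  # 15.0
```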
Median and Quantile
The median is the value that separates the lower half of a data set from the upper half, that is, the value in the center of the sorted data. If the number of observations is odd, the median is the middle value; if the number of observations is even, it is the average of the two most central observations. Let’s see below:
print(train_df['Age'].median())
28.0
print(example_series.median())
30.0
The median is less sensitive to extreme outliers than the mean. If the number of observations is not very large and one observation is far larger than the others, the mean becomes less representative of most of the group. For example, if you are analyzing the income of a college class in which one person is a millionaire while the rest earn a typical company salary, the median will probably represent the income of the group better, since the mean will be “contaminated” by the outlier.
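To make the income example concrete, here is a small sketch with made-up numbers: nine people earning the same typical salary and one millionaire. The figures are hypothetical and chosen only to keep the arithmetic obvious.

```python
import pandas as pd

# Hypothetical yearly incomes: nine typical workers and one millionaire.
incomes = pd.Series([40_000] * 9 + [1_000_000])

# The single outlier pulls the mean far above what most people earn...
print(incomes.mean())    # 136000.0
# ...while the median stays at the typical salary.
print(incomes.median())  # 40000.0
```

Here the mean is more than three times the salary that nine out of ten people actually earn, while the median is not affected by the outlier at all.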
The quantile is a generalization of the median: it is the value below which a given percentage of the data falls. For the median, this percentage is 50%. In Pandas, quantiles are calculated through the quantile() function. Its q parameter sets the fraction, between 0 and 1, and defaults to 0.5, so by default the function returns the median. You can pass a different value to get another quantile:
print(train_df['Age'].quantile())
28.0
print(example_series.quantile())
30.0
print(train_df['Age'].quantile(q=0.25))
20.125
print(example_series.quantile(q=0.25))
10.0
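The q parameter also accepts a list of fractions, in which case quantile() returns a Series indexed by those fractions. The sketch below uses this to get the three quartiles of the example Series in one call; the name quartiles is just illustrative.

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Passing a list to q returns one value per requested fraction.
quartiles = example_series.quantile([0.25, 0.5, 0.75])

print(quartiles.loc[0.25])  # 10.0
print(quartiles.loc[0.5])   # 30.0, the median
print(quartiles.loc[0.75])  # 40.0
```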
Mode
The mode is simple: it is the value that occurs most often in a data set. In Pandas, it is calculated through the mode() function. Let’s see:
print(train_df['Age'].mode())
0 24
dtype: float64
print(example_series.mode())
0 30
dtype: int64
The function returns a Series rather than a single value because, when two or more values are tied for the most occurrences, it returns all of them. Let’s see:
example_series_2 = pd.Series([1,5,10,30,50,30,15,40,45,45])
print(example_series_2.mode())
0 30
1 45
dtype: int64
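If you want to see why both values come back, one way is to count the occurrences of each value with value_counts(). This is only a sketch for checking the result, using the same second example Series.

```python
import pandas as pd

example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])

# Count how many times each distinct value appears.
counts = example_series_2.value_counts()
print(counts)

# 30 and 45 both appear twice, more than any other value,
# so mode() reports both of them.
modes = example_series_2.mode().tolist()
print(modes)  # [30, 45]
```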
Measures of Dispersion
Measures of Dispersion indicate how spread out the data is, that is, how much the values vary. They are the range, the variance, the standard deviation and the absolute deviation, or mean absolute deviation.
Range
Range is the difference between the largest and the smallest value in a data set. To calculate it, we will use two Pandas functions, max() and min(), which return the maximum and minimum values in a data set, and then subtract one from the other:
print(train_df['Age'].max() - train_df['Age'].min())
79.58
print(example_series.max() - example_series.min())
49
Variance
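Variance is a measure that expresses how far the data lies from its expected value (the mean), based on the squared distance of each value from the mean. By default, Pandas returns the sample variance, which divides the sum of the squared distances by n - 1 rather than n. We calculate variance in Pandas with the var() function: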
print(train_df['Age'].var())
211.0191247463081
print(example_series.var())
325.1111111111111
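To confirm the value above, the sketch below recomputes the variance of the example Series by hand. It assumes the default behavior of var(), which is the sample variance (dividing by n - 1); the ddof parameter switches to the population variance when set to 0. The variable names are illustrative.

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Squared distance of each value from the mean.
squared_deviations = (example_series - example_series.mean()) ** 2
n = len(example_series)

# Sample variance: divide by n - 1 (what var() does by default).
sample_var = squared_deviations.sum() / (n - 1)
# Population variance: divide by n (same as var(ddof=0)).
population_var = squared_deviations.sum() / n

print(sample_var)      # matches example_series.var()
print(population_var)  # matches example_series.var(ddof=0)
```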
Standard Deviation
Standard deviation is the square root of the variance, so it also indicates how far the data lies from its mean, but in the same units as the data. A high standard deviation means that the data is more spread out, and a low standard deviation means that the values are closer to the mean. In Pandas we calculate it through the std() function:
print(train_df['Age'].std())
14.526497332334044
print(example_series.std())
18.03083778173136
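Since the standard deviation is the square root of the variance, the two results can be checked against each other. A minimal sketch, using the example Series:

```python
import numpy as np
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Taking the square root of the variance gives the standard deviation.
manual_std = np.sqrt(example_series.var())

print(manual_std)
print(example_series.std())  # same value as manual_std
```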
Absolute Deviation or Mean Absolute Deviation
The Absolute Deviation is calculated like this: first, we calculate the mean of the values; then, we calculate the absolute distance of each point from this mean; finally, we sum these distances and divide the result by the number of observations. In other words, it is the mean of the absolute distances.
In Pandas, the mad() function calculates this. Let’s see the examples:
print(train_df['Age'].mad())
11.322944471906405
print(example_series.mad())
15.432098765432098
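Note that mad() was deprecated in pandas 1.5 and removed in pandas 2.0, so on recent versions the calls above raise an AttributeError. The same number can be computed by hand: take the absolute distance of each value from the mean, sum those distances and divide by the number of observations. The sketch below does this for the example Series; the variable names are illustrative.

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Absolute distance of each value from the mean.
abs_distances = (example_series - example_series.mean()).abs()

# Sum of the distances divided by the number of observations,
# which is simply the mean of the distances.
manual_mad = abs_distances.sum() / len(example_series)

print(manual_mad)  # approximately 15.432, the value shown above
```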
Covariance and Correlation
Sometimes we may want to know whether two variables in a data set are related. For these cases, we calculate the covariance and the correlation.
Covariance is a numerical measure of how two variables vary together in relation to their means. A covariance of 0 indicates that there is no linear relationship between the variables (which does not, by itself, guarantee that they are independent), while a high positive covariance means that one variable tends to be large when the other is large. Analogously, a negative covariance with a high absolute value means that one variable tends to be large when the other is small. Covariance can be calculated through the cov() function. This function returns a matrix with the covariance between each pair of columns in the data set:
print(train_df.cov())
PassengerId Survived Pclass Age SibSp \
PassengerId 66231.000000 -0.626966 -7.561798 138.696504 -16.325843
Survived -0.626966 0.236772 -0.137703 -0.551296 -0.018954
Pclass -7.561798 -0.137703 0.699015 -4.496004 0.076599
Age 138.696504 -0.551296 -4.496004 211.019125 -4.163334
SibSp -16.325843 -0.018954 0.076599 -4.163334 1.216043
Parch -0.342697 0.032017 0.012429 -2.344191 0.368739
Fare 161.883369 6.221787 -22.830196 73.849030 8.748734
Parch Fare
PassengerId -0.342697 161.883369
Survived 0.032017 6.221787
Pclass 0.012429 -22.830196
Age -2.344191 73.849030
SibSp 0.368739 8.748734
Parch 0.649728 8.661052
Fare 8.661052 2469.436846
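To see where a single entry of this matrix comes from, here is a sketch that computes the covariance of two small Series by hand and compares it with the cov() method on a Series. The values of x and y are hypothetical, picked so the arithmetic is easy to follow. It also shows that the covariance of a variable with itself is its variance, which is why the Age entry on the diagonal above matches the variance we calculated earlier.

```python
import pandas as pd

# Hypothetical paired measurements.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 6, 8, 10])

# Multiply each pair of distances from the means, sum the products
# and divide by n - 1 (sample covariance, as pandas does).
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(manual_cov)  # 5.0
print(x.cov(y))    # matches manual_cov

# The covariance of a variable with itself is its variance.
print(x.cov(x))
print(x.var())
```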
Covariance, however, is hard to interpret and compare, because its scale depends on the units of the variables involved. For a better comparison, we normalize the covariance by dividing it by the product of the two standard deviations. The result is the correlation, which is always between -1, a perfect negative correlation, and 1, a perfect positive correlation. We calculate it with the corr() function. Just like the cov() function, it returns a matrix with the correlation between each pair of columns in the data set.
print(train_df.corr())
PassengerId Survived Pclass Age SibSp Parch \
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225
Fare
PassengerId 0.012658
Survived 0.257307
Pclass -0.549500
Age 0.096067
SibSp 0.159651
Parch 0.216225
Fare 1.000000
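The normalization can be sketched on two small Series: dividing the covariance by the product of the two standard deviations gives the same number as corr(). The values below are hypothetical. The second half of the sketch shows why correlation is easier to compare than covariance: rescaling one variable changes the covariance but leaves the correlation untouched.

```python
import pandas as pd

# Hypothetical paired measurements.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

# Correlation is the covariance divided by the product of the
# standard deviations of the two variables.
manual_corr = x.cov(y) / (x.std() * y.std())
print(manual_corr)
print(x.corr(y))  # matches manual_corr

# Expressing y in different units (here, multiplied by 100)
# scales the covariance by 100 but does not change the correlation.
y_scaled = y * 100
print(x.cov(y), x.cov(y_scaled))
print(x.corr(y), x.corr(y_scaled))
```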
However, we have to be careful when evaluating correlations. A data set may show a correlation that does not reflect any real relationship and is merely a coincidence. This can happen when the variables happen to share some pattern, such as a common trend, that produces the correlation.
Finally, it is also important to remember the famous saying that “correlation does not imply causation”. For our purposes, this means that when two variables are highly correlated, the first may be causing the behavior of the second, the second may be causing the behavior of the first, each may be influencing the other, or the correlation may mean nothing at all. So, we need to understand what the data means in order to evaluate each case.
For an introduction to Descriptive Statistics, I think that this is a good start :)
Regards!