# Descriptive Statistics with Python

March 31, 2016

## Introduction

According to Google, Statistics is “the Mathematics branch that deals with collecting, analyzing, interpreting and presenting masses of numerical data”. That makes Statistics very important for Data Science, where its concepts are used to analyze all kinds of situations and problems.

This post is not intended to be a complete Statistics course, but an Introduction that will teach some concepts and how to apply them in Python and Pandas.

Here, we will focus on Descriptive Statistics, the part of Statistics whose goal is to describe and summarize sets of data. It does so through measures such as the famous mean, or average.

In the examples for this post, we will use the Titanic dataset again, but we will also look at a simpler Pandas example, so you can confirm that everything is being calculated correctly.

The measures of Descriptive Statistics are divided into measures of central tendency and measures of dispersion.

## Measures of Central Tendency

Measures of Central Tendency define significant, representative and adequate values for a set of data, depending on what you want to analyze. They are the mean, median, quantiles and mode.

### Mean

The mean is a measure of central tendency that indicates the value around which the set of values is concentrated, giving a single representative value for it.

With Pandas’ DataFrames, the mean is calculated through the `mean()` function, which is present on both DataFrames and Series and skips missing values by default. Let’s create a Series so we can calculate its mean, and we will also calculate the mean of the ages on the Titanic dataset:

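```python
import pandas as pd
import numpy as np

# Titanic training set; the file name 'train.csv' is an assumption, adjust the path to your copy
train_df = pd.read_csv('train.csv')

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

print(train_df['Age'].mean())
# 29.69911764705882
print(example_series.mean())
# 25.11111111111111
```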
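To confirm the result by hand, the mean is just the sum of the values divided by how many values there are. Here is a minimal sketch using the same example Series (`manual_mean` is only an illustrative name):

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Mean = sum of the values / number of values (226 / 9 here)
manual_mean = example_series.sum() / len(example_series)

print(manual_mean)
print(example_series.mean())  # should match the manual result
```

For a column with missing values, such as `Age`, `len()` would also count the missing entries, so dividing by `count()` is the safer choice there.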

### Median and Quantile

The median is the value that separates the lower half of a data set from the upper half, that is, the value in the center of the sorted data. What this means is, if the number of observations in a data set is odd, the median is the value in the center, and if the number of observations is even, it will be the average of the two most central observations. Let’s see below:

```python
print(train_df['Age'].median())
# 28.0
print(example_series.median())
# 30.0
```

The median is less sensitive to extreme outliers than the mean. If the number of observations is not very large and one observation is far larger than the others, the mean can become less representative of most of the group. For example, if you are analysing the income of a college class and one of the students is a millionaire, while the rest earn an average worker's salary, the median will probably be a better representation of the income of the group as a whole, since the mean will be “contaminated” by the outlier.
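The income scenario can be sketched with made-up numbers. The values below are hypothetical and chosen only to show how a single outlier pulls the mean away from the rest of the group while the median stays put:

```python
import pandas as pd

# Hypothetical yearly incomes: nine ordinary salaries and one millionaire
incomes = pd.Series([30_000, 32_000, 35_000, 38_000, 40_000,
                     42_000, 45_000, 48_000, 50_000, 5_000_000])

print(incomes.mean())    # 536000.0, far above what most of the group earns
print(incomes.median())  # 41000.0, close to a typical salary
```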

The quantile is a generalization of the median: it is the value below which a certain percentage of the data falls. For the median, this percentage is 50%. In Pandas, the quantile is calculated through the `quantile()` function. The percentage is passed through the `q` parameter as a fraction between 0 and 1, and it defaults to 0.5, the median:

```python
print(train_df['Age'].quantile())
# 28.0
print(example_series.quantile())
# 30.0

print(train_df['Age'].quantile(q=0.25))
# 20.125
print(example_series.quantile(q=0.25))
# 10.0
```
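The `q` parameter also accepts a list, which is handy when you want several quantiles at once. A small sketch on the example Series (`quartiles` is just an illustrative name):

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Passing a list returns a Series indexed by the requested fractions
quartiles = example_series.quantile(q=[0.25, 0.5, 0.75])
print(quartiles)
```

The middle entry, at 0.5, is the same value that `median()` returns.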

### Mode

Mode is simple. The mode is the value that has the most occurrences in a data set. In Pandas, it is calculated through the `mode()` function. Let’s see:

```python
print(train_df['Age'].mode())
# 0    24
# dtype: float64
print(example_series.mode())
# 0    30
# dtype: int64
```

The function returns a Series rather than a single number because, when two or more values are tied for the most occurrences, it returns all of them. Let’s see:

```python
example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])
print(example_series_2.mode())
# 0    30
# 1    45
# dtype: int64
```
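To see where the tie comes from, we can count the occurrences ourselves with `value_counts()`. This is only a sketch of the idea, not how `mode()` is implemented internally:

```python
import pandas as pd

example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])

# Count how many times each value occurs
counts = example_series_2.value_counts()

# Keep every value whose count equals the highest count
modes = sorted(counts[counts == counts.max()].index.tolist())
print(modes)  # [30, 45]
```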

## Measures of Dispersion

Measures of Dispersion indicate how spread out the data is, or how much it varies. The measures of dispersion are range, variance, standard deviation and the absolute deviation, or mean absolute deviation.

### Range

Range is the difference between the biggest and the smallest value in a data set. To calculate it, we will use two Pandas functions, `max()` and `min()`, which return the maximum and minimum values in a data set, and then we will subtract one from the other:

```python
print(train_df['Age'].max() - train_df['Age'].min())
# 79.58
print(example_series.max() - example_series.min())
# 49
```

### Variance

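Variance is a measure that expresses how far the data is spread from its expected value (the mean): it is the average of the squared differences between each value and the mean. We calculate variance in Pandas with the `var()` function, which by default returns the sample variance (dividing by n - 1 instead of n):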

```python
print(train_df['Age'].var())
# 211.0191247463081
print(example_series.var())
# 325.1111111111111
```
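The calculation can be reproduced step by step on the example Series. In this sketch, `sample_var` and `population_var` are illustrative names; the `ddof` parameter of `var()` controls which divisor Pandas uses:

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Squared distance of each value from the mean
squared_deviations = (example_series - example_series.mean()) ** 2

# Sample variance divides by n - 1, the default of var()
sample_var = squared_deviations.sum() / (len(example_series) - 1)

# Population variance divides by n, the same as var(ddof=0)
population_var = squared_deviations.sum() / len(example_series)

print(sample_var, example_series.var())
print(population_var, example_series.var(ddof=0))
```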

### Standard Deviation

Standard deviation is the square root of the variance, so it also indicates how far the data is from its mean, but in the same units as the data itself. A high standard deviation means that the data is more spread out, and a low standard deviation means that the values are closer to the mean. In Pandas we calculate it through the `std()` function:

```python
print(train_df['Age'].std())
# 14.526497332334044
print(example_series.std())
# 18.03083778173136
```
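Two quick checks may help here: that `std()` really is the square root of `var()`, and that two data sets with the same mean can have very different spreads. The `tight` and `wide` Series below are hypothetical values made up for the comparison:

```python
import numpy as np
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Standard deviation is the square root of the variance
std_from_var = np.sqrt(example_series.var())
print(std_from_var)
print(example_series.std())

# Two hypothetical series with the same mean (25) but different spread
tight = pd.Series([24, 25, 26])
wide = pd.Series([0, 25, 50])
print(tight.std())  # 1.0
print(wide.std())   # 25.0
```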

### Absolute Deviation or Mean Absolute Deviation

The Absolute Deviation is calculated like this: first, we calculate the mean of the values; then, we calculate the absolute distance of each point from this mean; finally, we sum these distances and divide the result by the number of observations. In other words, it is the mean of the absolute distances.

In Pandas, the `mad()` function calculates this. Note that it was deprecated in pandas 1.5 and removed in 2.0, so it is only available in older versions. Let’s see the examples:

```python
print(train_df['Age'].mad())
# 11.322944471906405
print(example_series.mad())
# 15.432098765432098
```
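Since `mad()` is not available in pandas 2.0 and later, the same steps can be written out by hand. This is a minimal sketch on the example Series, with `manual_mad` as an illustrative name:

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# 1. distance of each value from the mean, 2. absolute value, 3. mean of those distances
manual_mad = (example_series - example_series.mean()).abs().mean()

print(manual_mad)
```

The result agrees with the 15.43... value shown for the example Series.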

### Covariance and Correlation

Sometimes we may want to know if two variables in a data set are related. For these cases, we calculate the covariance and correlation.

Covariance is a numerical measure that indicates the inter-dependency between two variables: it describes how two variables behave together in relation to their averages. A covariance of 0 indicates that the variables have no linear relationship (which is not the same as being totally independent), while a high and positive covariance value means that one variable tends to be big when the other is big. Analogously, a negative covariance with a high absolute value means that one variable tends to be big when the other is small. Covariance can be calculated through the `cov()` function. This function will return a matrix indicating the covariance between each column and the other columns in the data set:

```python
# On pandas 2.0 or later, call train_df.cov(numeric_only=True) so the text columns are skipped
print(train_df.cov())
#               PassengerId  Survived     Pclass         Age      SibSp  \
# PassengerId  66231.000000 -0.626966  -7.561798  138.696504 -16.325843
# Survived        -0.626966  0.236772  -0.137703   -0.551296  -0.018954
# Pclass          -7.561798 -0.137703   0.699015   -4.496004   0.076599
# Age            138.696504 -0.551296  -4.496004  211.019125  -4.163334
# SibSp          -16.325843 -0.018954   0.076599   -4.163334   1.216043
# Parch           -0.342697  0.032017   0.012429   -2.344191   0.368739
# Fare           161.883369  6.221787 -22.830196   73.849030   8.748734
#
#                 Parch         Fare
# PassengerId -0.342697   161.883369
# Survived     0.032017     6.221787
# Pclass       0.012429   -22.830196
# Age         -2.344191    73.849030
# SibSp        0.368739     8.748734
# Parch        0.649728     8.661052
# Fare         8.661052  2469.436846
```
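For a single pair of variables, the covariance is easy to reproduce by hand. The `x` and `y` Series below are hypothetical values chosen for illustration, and the sketch uses the n - 1 divisor, which is also what Pandas uses:

```python
import pandas as pd

# Hypothetical paired observations, for illustration only
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

# Multiply each pair of deviations from the means, then divide the sum by n - 1
products = (x - x.mean()) * (y - y.mean())
manual_cov = products.sum() / (len(x) - 1)

print(manual_cov)  # 1.5
print(x.cov(y))    # same value from pandas
```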

Covariance, however, is hard to understand and compare, because its scale changes with the scale of the variables. For a better comparison, we normalize the covariance by dividing it by the product of the two standard deviations. The result is the correlation, which is always between -1, a perfect anti-correlation, and 1, a perfect correlation. We calculate it with the `corr()` function. Just like the `cov()` function, it will return a matrix with the correlation between each pair of columns in the data set.

```python
# On pandas 2.0 or later, call train_df.corr(numeric_only=True) so the text columns are skipped
print(train_df.corr())
#              PassengerId  Survived    Pclass       Age     SibSp     Parch  \
# PassengerId     1.000000 -0.005007 -0.035144  0.036847 -0.057527 -0.001652
# Survived       -0.005007  1.000000 -0.338481 -0.077221 -0.035322  0.081629
# Pclass         -0.035144 -0.338481  1.000000 -0.369226  0.083081  0.018443
# Age             0.036847 -0.077221 -0.369226  1.000000 -0.308247 -0.189119
# SibSp          -0.057527 -0.035322  0.083081 -0.308247  1.000000  0.414838
# Parch          -0.001652  0.081629  0.018443 -0.189119  0.414838  1.000000
# Fare            0.012658  0.257307 -0.549500  0.096067  0.159651  0.216225
#
#                  Fare
# PassengerId  0.012658
# Survived     0.257307
# Pclass      -0.549500
# Age          0.096067
# SibSp        0.159651
# Parch        0.216225
# Fare         1.000000
```
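The normalization can be checked on a small pair of hypothetical Series (`x` and `y` are made-up values). The sketch also rescales `y` to show that the covariance changes with the units while the correlation does not:

```python
import pandas as pd

# Hypothetical paired observations, for illustration only
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

# Correlation = covariance divided by the product of the standard deviations
manual_corr = x.cov(y) / (x.std() * y.std())
print(manual_corr)
print(x.corr(y))

# Rescaling y changes the covariance but leaves the correlation unchanged
y_scaled = y * 100
print(x.cov(y), x.cov(y_scaled))
print(x.corr(y), x.corr(y_scaled))
```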

However, we have to be careful when evaluating correlations. Some data may show a correlation that does not correspond to any real relationship and is merely a coincidence. This can happen when the variables share some pattern, such as both growing over time, that produces the correlation.

Finally, it is also important to note the famous saying that “correlation does not imply causation”. For our studies, what this means is that, when two variables have a high correlation, the first may be causing the behavior of the second, the second may be causing the behavior of the first, each may be influencing the other, a third factor may be driving both, or it might mean nothing. So, we need to know what the data means so we can evaluate each case.

For an introduction to Descriptive Statistics, I think that this is a good start :)

Regards! 