
### Introduction

According to Google, Statistics is “the branch of Mathematics that deals with collecting, analyzing, interpreting and presenting masses of numerical data”. Statistics is therefore also very important for Data Science, where its many concepts are used to analyze all kinds of situations and problems.

This post is not intended to be a complete Statistics course, but an introduction that presents some concepts and shows how to apply them with Python and Pandas.

Here, we will focus on Descriptive Statistics, the part of Statistics whose objective is to describe and summarize sets of data. For that, measures are used, like the famous mean, or average.

In the examples for this post, we will use the Titanic dataset again, but we will also look at a simpler Pandas Series example, so you can verify by hand that everything is being calculated correctly.

Descriptive Statistics’ measures are divided into measures of central tendency and measures of dispersion.

### Measures of Central Tendency

Measures of Central Tendency define significant, representative and adequate values for a set of data, depending on what you want to analyze. They are the mean, median, quantiles and mode.

### Mean

The mean is a measure of central tendency that indicates the value around which a set of values is concentrated, serving as a representative value for it.

With Pandas, the mean is calculated through the mean() function, which is available on both DataFrames and Series. Let’s create a Series so we can calculate its mean, and we will also calculate the mean of the ages in the Titanic dataset:

```
import pandas as pd

train_df = pd.read_csv('train.csv')
example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

print(train_df['Age'].mean())
# 29.69911764705882
print(example_series.mean())
# 25.11111111111111
```

### Median and Quantile

The median is the value that separates the lower half of a data set from the upper half, i.e. the value at the center of the distribution. If the number of observations in a data set is odd, the median is the central value; if it is even, it is the average of the two most central observations. Let’s see below:

```
print(train_df['Age'].median())
# 28.0
print(example_series.median())
# 30.0
```

The median is less susceptible to large outliers than the mean. If the number of observations is not very large and one of them is much larger than the others, the mean becomes less representative of most of the group. For example, if you are analysing the income of a college class and one member is a millionaire while the rest are average company workers, the median will probably be a better representation of the income of the group as a whole, since the mean will be “contaminated” by the outlier.
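To see this effect concretely, here is a small sketch with made-up income figures (the numbers are illustrative, not from any real data set), comparing the mean and the median when one outlier is present:

```python
import pandas as pd

# hypothetical yearly incomes: nine workers and one millionaire outlier
incomes = pd.Series([30000, 32000, 31000, 29000, 33000,
                     30000, 31000, 32000, 30000, 5000000])

print(incomes.mean())    # 527800.0 — pulled far up by the outlier
print(incomes.median())  # 31000.0 — still close to the typical worker
```

The mean suggests everyone earns half a million, while the median stays representative of nine of the ten people.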

The quantile is a generalization of the median: it is the value below which a given percentage of the data lies. For the median, this percentage is 50%. In Pandas, quantiles are calculated through the quantile() function, whose “q” parameter sets the desired fraction and defaults to q=0.5 (the median). You can configure the percentage you want through the “q” parameter:

```
print(train_df['Age'].quantile())
# 28.0
print(example_series.quantile())
# 30.0
print(train_df['Age'].quantile(q=0.25))
# 20.125
print(example_series.quantile(q=0.25))
# 10.0
```

### Mode

Mode is simple. The mode is the value that has the most occurrences in a data set. In Pandas, it is calculated through the mode() function. Let’s see:

```
print(train_df['Age'].mode())
# 0    24.0
# dtype: float64
print(example_series.mode())
# 0    30
# dtype: int64
```

The function returns the data in this format because, if two or more values are tied for the most occurrences, it will return all of them. Let’s see:

```
example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])
print(example_series_2.mode())
# 0    30
# 1    45
# dtype: int64
```

### Measures of Dispersion

Measures of Dispersion indicate how spread out the data is, or how much it varies. They are the range, the variance, the standard deviation and the absolute deviation (also called mean absolute deviation).

### Range

Range is the difference between the biggest and the smallest value in a data set. To make this calculation, we will use two Pandas functions, max() and min(), which return the maximum and minimum values of a data set, and then we will subtract one from the other:

```
print(train_df['Age'].max() - train_df['Age'].min())
# 79.58
print(example_series.max() - example_series.min())
# 49
```

### Variance

Variance is a measure that expresses how far the data spreads around its expected value: it is the average of the squared deviations from the mean. We calculate variance in Pandas with the var() function:

```
print(train_df['Age'].var())
# 211.0191247463081
print(example_series.var())
# 325.1111111111111
```
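As a sanity check, note that Pandas’ var() computes the sample variance by default (it divides by n − 1, the ddof=1 default), which we can reproduce by hand on the example Series:

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# sample variance: sum of squared deviations from the mean, divided by n - 1
deviations = example_series - example_series.mean()
manual_var = (deviations ** 2).sum() / (len(example_series) - 1)

print(manual_var)            # 325.1111111111111
print(example_series.var())  # same value (ddof=1 is the default)
```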

### Standard Deviation

Standard deviation is also a measure of dispersion: it is the square root of the variance, and indicates how far the data lies from its mean. A high standard deviation means the data is more spread out, while a smaller standard deviation means the values lie nearer the mean. In Pandas we calculate it through the std() function:

```
print(train_df['Age'].std())
# 14.526497332334044
print(example_series.std())
# 18.03083778173136
```
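Since the standard deviation is just the square root of the variance, we can check the two numbers against each other on the example Series:

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# std() is the square root of var()
print(example_series.var() ** 0.5)  # 18.03083778173136
print(example_series.std())         # same value
```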

### Absolute Deviation or Mean Absolute Deviation

The Absolute Deviation is calculated like this: first, we calculate the mean of the values; then, we calculate the absolute distance of each point from this mean; finally, we take the mean of these distances (i.e. we sum them and divide by the number of observations).

In Pandas, the mad() function calculates this. Let’s see the examples:

```
print(train_df['Age'].mad())
# 11.322944471906405
print(example_series.mad())
# 15.432098765432098
```
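Note that mad() was deprecated in pandas 1.5 and removed in pandas 2.0, so on recent versions you need to compute it manually. The steps described above translate directly into one line:

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# mean absolute deviation: mean of the absolute distances from the mean
mad = (example_series - example_series.mean()).abs().mean()
print(mad)  # 15.432098765432098
```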

### Covariance and Correlation

Sometimes, we may want to know whether two variables in a data set are related. For these cases, we calculate the covariance and the correlation.

Covariance is a numerical measure that indicates the interdependency between two variables: it describes how the two variables behave together relative to their means. A covariance of 0 indicates that the variables have no linear relationship (though they are not necessarily independent), while a high positive covariance means that one variable tends to be large when the other is large. Analogously, a negative covariance with a high absolute value means that one variable tends to be large when the other is small. Covariance can be calculated through the cov() function, which returns a matrix indicating the covariance between each pair of columns in the data set:

```
print(train_df.cov())
#              PassengerId  Survived     Pclass         Age      SibSp     Parch         Fare
# PassengerId 66231.000000 -0.626966  -7.561798  138.696504 -16.325843 -0.342697   161.883369
# Survived       -0.626966  0.236772  -0.137703   -0.551296  -0.018954  0.032017     6.221787
# Pclass         -7.561798 -0.137703   0.699015   -4.496004   0.076599  0.012429   -22.830196
# Age           138.696504 -0.551296  -4.496004  211.019125  -4.163334 -2.344191    73.849030
# SibSp         -16.325843 -0.018954   0.076599   -4.163334   1.216043  0.368739     8.748734
# Parch          -0.342697  0.032017   0.012429   -2.344191   0.368739  0.649728     8.661052
# Fare          161.883369  6.221787 -22.830196   73.849030   8.748734  8.661052  2469.436846
```

Covariance, however, is hard to interpret and compare, because its scale changes as the scales of the variables change. For a better comparison, we normalize covariance to a value that always lies between -1 and 1: the correlation, obtained by dividing the covariance by the product of the two standard deviations. A correlation of -1 is a perfect anti-correlation, and 1 a perfect correlation. We calculate it with the corr() function; just like cov(), it returns a matrix with the correlations among the columns of the data set.

```
print(train_df.corr())
#              PassengerId  Survived    Pclass       Age     SibSp     Parch      Fare
# PassengerId     1.000000 -0.005007 -0.035144  0.036847 -0.057527 -0.001652  0.012658
# Survived       -0.005007  1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307
# Pclass         -0.035144 -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500
# Age             0.036847 -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067
# SibSp          -0.057527 -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651
# Parch          -0.001652  0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225
# Fare            0.012658  0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000
```
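Two practical notes, assuming pandas 1.5 or newer: cov() and corr() accept a numeric_only parameter (on pandas 2.0+ the non-numeric Titanic columns, such as Name, would otherwise raise an error), and the correlation of two columns is exactly their covariance divided by the product of their standard deviations. A small self-contained sketch (with made-up data, not the Titanic set):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 5.0, 4.0, 5.0],
    'label': list('abcde'),  # a non-numeric column
})

# skip non-numeric columns explicitly (required on pandas 2.0+)
print(df.cov(numeric_only=True))
print(df.corr(numeric_only=True))

# correlation is covariance normalized by both standard deviations
manual_corr = df['x'].cov(df['y']) / (df['x'].std() * df['y'].std())
print(manual_corr)  # same as df['x'].corr(df['y'])
```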

However, we have to be careful when evaluating correlations. Some data may show a correlation that does not exist in reality, being mere coincidence (a spurious correlation). This can happen when the variables present some sort of pattern that leads to the correlation by chance.

Finally, it is also important to note the famous saying that “correlation does not imply causation”. For our studies, this means that, when two variables have a high correlation, it may be the first variable causing the behavior of the second, the second causing the behavior of the first, both influencing each other, or it might mean nothing at all. So, we need to understand what the data means in order to evaluate each case.

For an introduction to Descriptive Statistics, I think this is a good start 🙂

Regards!
