According to Google, Statistics is “the Mathematics branch that deals with collecting, analyzing, interpreting and presenting masses of numerical data”. Statistics is therefore very important for Data Science, where its concepts are used to analyze a wide range of situations and problems.
This post is not intended to be a complete Statistics course, but an Introduction that will teach some concepts and how to apply them in Python and Pandas.
Here, we will focus on Descriptive Statistics, the part of Statistics that aims to describe and summarize sets of data. To do that, it uses measures such as the famous mean, or average.
In the examples for this post, we will use the Titanic Dataset again, but we will also look at a simpler Pandas example, so you can confirm that everything is being calculated correctly.
Descriptive Statistics’ measures are divided between measures of central tendency and measures of dispersion.
Measures of Central Tendency
Measures of Central Tendency define significant, representative and adequate values for a set of data, depending on what you want to analyze. They are the mean, median, quantiles and mode.
The mean is a measure of central tendency that indicates the value around which the data is concentrated, giving a single significant value that represents the whole set. It is the sum of the values divided by the number of values.
With Pandas, the mean is calculated through the mean() function, which is available on both DataFrames and Series. Let’s create a Series so we can calculate its mean, and we will also calculate the mean of the ages in the Titanic Dataset:
    import pandas as pd
    import numpy as np

    train_df = pd.read_csv('train.csv')
    example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

    print(train_df['Age'].mean())
    # 29.69911764705882
    print(example_series.mean())
    # 25.11111111111111
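To confirm the result for the simple Series, the mean can be reproduced by hand as the sum divided by the count. This is only a minimal sketch: the names manual_mean and with_missing are illustrative and not part of the original example. It also shows that mean() skips missing values by default, which matters for columns with missing entries.

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# The mean is the sum of the values divided by how many values there are.
manual_mean = example_series.sum() / example_series.count()

print(manual_mean)
print(example_series.mean())

# Hypothetical Series with a missing value: mean() ignores NaN by default
# (skipna=True), so only 1.0 and 3.0 are averaged here.
with_missing = pd.Series([1.0, None, 3.0])
print(with_missing.mean())
```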
Median and Quantile
The median is the value that separates the lower half of a data set from the upper half; in other words, it is the value at the center of the sorted distribution. If the number of observations is odd, the median is the value in the center; if it is even, the median is the average of the two most central observations. Let’s see below:
    print(train_df['Age'].median())
    # 28.0
    print(example_series.median())
    # 30.0
The median is less susceptible to extreme outliers than the mean. If the number of observations is not very large and one observation is much larger than the others, the mean becomes less representative of most of the group. For example, if you are analysing the income of a college class and one of the students is a millionaire, while the rest earn an average worker's salary, the median will probably be a better representation of the income of the group as a whole, since the mean will be “contaminated” by the outlier.
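The income example can be illustrated with a few numbers. The figures below are invented purely for illustration and do not come from any real data set.

```python
import pandas as pd

# Hypothetical yearly incomes for a small class (made-up numbers).
incomes = pd.Series([30_000, 32_000, 35_000, 38_000, 40_000])

# The same class with one millionaire added.
incomes_with_outlier = pd.concat(
    [incomes, pd.Series([1_000_000])], ignore_index=True
)

print(incomes.mean(), incomes.median())  # 35000.0 35000.0
print(incomes_with_outlier.mean(), incomes_with_outlier.median())
```

With the outlier, the mean jumps to almost 196,000 while the median only moves to 36,500, which is still close to what most of the group earns.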
The quantile is a generalization of the median. A quantile is the value below which a certain percentage of the data falls. For the median, this percentage is 50%. In Pandas, quantiles are calculated through the quantile() function. The percentage is set through the q parameter, given as a fraction between 0 and 1, which defaults to 0.5 (the median):
    print(train_df['Age'].quantile())
    # 28.0
    print(example_series.quantile())
    # 30.0
    print(train_df['Age'].quantile(q=0.25))
    # 20.125
    print(example_series.quantile(q=0.25))
    # 10.0
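The q parameter also accepts a list, which returns several quantiles at once. The sketch below uses the simple Series; the names quartiles and tenth are illustrative. When the requested position falls between two observations, quantile() interpolates linearly between them by default.

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Passing a list returns a Series indexed by the requested fractions.
quartiles = example_series.quantile([0.25, 0.5, 0.75])
print(quartiles)

# Sorted values: 1, 5, 10, 15, 30, 30, 40, 45, 50.
# q=0.1 falls 80% of the way between the first two values (1 and 5),
# so linear interpolation gives 1 + 0.8 * (5 - 1) = 4.2.
tenth = example_series.quantile(0.1)
print(tenth)
```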
Mode is simple. The mode is the value that has the most occurrences in a data set. In Pandas, it is calculated through the mode() function. Let’s see:
    print(train_df['Age'].mode())
    # 0    24.0
    # dtype: float64
    print(example_series.mode())
    # 0    30
    # dtype: int64
The function returns a Series because, when two or more values are tied for most occurrences, it returns all of them. Let’s see:
    example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])
    print(example_series_2.mode())
    # 0    30
    # 1    45
    # dtype: int64
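To see where the tie comes from, the occurrences can be counted explicitly with value_counts(). This is a minimal sketch; the names counts and most_common are illustrative.

```python
import pandas as pd

example_series_2 = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45, 45])

# Count how many times each value occurs.
counts = example_series_2.value_counts()
print(counts)

# Keep every value whose count equals the highest count (handles ties).
most_common = counts[counts == counts.max()].index.sort_values()

print(list(most_common))              # [30, 45]
print(list(example_series_2.mode()))  # [30, 45]
```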
Measures of Dispersion
Measures of Dispersion indicate how spread out the data is, that is, how much the values vary. They are the range, the variance, the standard deviation and the absolute deviation, or mean absolute deviation.
Range is the difference between the largest and the smallest value in a data set. To calculate it, we will use two Pandas functions, max() and min(), which return the maximum and minimum values in a data set, and then subtract one from the other:
    print(train_df['Age'].max() - train_df['Age'].min())
    # 79.58
    print(example_series.max() - example_series.min())
    # 49
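Variance is a measure that expresses how far the data is from its expected value (the mean): it is the average of the squared distances from the mean. We calculate variance in Pandas with the var() function, which by default returns the sample variance (dividing by n - 1 instead of n):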
    print(train_df['Age'].var())
    # 211.0191247463081
    print(example_series.var())
    # 325.1111111111111
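The value for the simple Series can be checked by hand. The sketch below computes both the sample variance (dividing by n - 1, the Pandas default) and the population variance (dividing by n, obtained with ddof=0); the variable names are illustrative.

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

n = example_series.count()

# Squared distance of each value from the mean.
squared_deviations = (example_series - example_series.mean()) ** 2

# Sample variance divides by n - 1; this is what var() does by default.
sample_variance = squared_deviations.sum() / (n - 1)

# Population variance divides by n; var(ddof=0) gives the same result.
population_variance = squared_deviations.sum() / n

print(sample_variance, example_series.var())
print(population_variance, example_series.var(ddof=0))
```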
Standard deviation is also a measure of dispersion that indicates how far the data is from its mean. It is the square root of the variance, so it is expressed in the same unit as the data. A high standard deviation means that the data is more spread out, and a low standard deviation means that the values are closer to the mean. In Pandas we calculate it through the std() function:
    print(train_df['Age'].std())
    # 14.526497332334044
    print(example_series.std())
    # 18.03083778173136
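As a quick check, the standard deviation is the square root of the variance. The sketch below confirms this for the simple Series; std_from_variance is an illustrative name.

```python
import math

import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Taking the square root of the variance gives the standard deviation.
std_from_variance = math.sqrt(example_series.var())

print(std_from_variance)
print(example_series.std())
```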
Absolute Deviation or Mean Absolute Deviation
The Absolute Deviation is calculated like this: first, we calculate the mean of the values; then, we calculate the absolute distance of each point from this mean; after that, we sum these distances; and finally, we divide the result by the number of observations.
In Pandas, the mad() function calculates this. Note that mad() was deprecated in pandas 1.5 and removed in pandas 2.0, so the calls below only work on older versions. Let’s see the examples:
    print(train_df['Age'].mad())
    # 11.322944471906405
    print(example_series.mad())
    # 15.432098765432098
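On recent pandas versions, where mad() is no longer available, the mean absolute deviation can be computed directly from its definition. This is a minimal sketch that follows the steps described above; deviations and mean_abs_dev are illustrative names.

```python
import pandas as pd

example_series = pd.Series([1, 5, 10, 30, 50, 30, 15, 40, 45])

# Absolute distance of each value from the mean.
deviations = (example_series - example_series.mean()).abs()

# Sum the distances and divide by the number of observations.
mean_abs_dev = deviations.sum() / deviations.count()

print(mean_abs_dev)
```

The result matches the 15.432098765432098 shown above for the simple Series, up to floating-point rounding.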
Covariance and Correlation
Sometimes we may want to know whether two variables in a data set are related. For these cases, we calculate the covariance and the correlation.
Covariance is a numerical measure of how two variables vary together in relation to their means. A covariance of 0 indicates that there is no linear relationship between the variables (which is not the same as the variables being independent), while a high positive covariance means that one variable tends to be large when the other is large. Analogously, a negative covariance with a high absolute value means that one variable tends to be large when the other is small. Covariance can be calculated through the cov() function. This function returns a matrix with the covariance between each pair of numeric columns in the data set. In pandas 2.0 and later, non-numeric columns are no longer dropped automatically, so on a DataFrame that also has text columns, as the Titanic one does, you need to call cov(numeric_only=True):
    print(train_df.cov())

                   PassengerId      Survived        Pclass           Age         SibSp  \
    PassengerId   66231.000000     -0.626966     -7.561798    138.696504    -16.325843
    Survived         -0.626966      0.236772     -0.137703     -0.551296     -0.018954
    Pclass           -7.561798     -0.137703      0.699015     -4.496004      0.076599
    Age             138.696504     -0.551296     -4.496004    211.019125     -4.163334
    SibSp           -16.325843     -0.018954      0.076599     -4.163334      1.216043
    Parch            -0.342697      0.032017      0.012429     -2.344191      0.368739
    Fare            161.883369      6.221787    -22.830196     73.849030      8.748734

                         Parch          Fare
    PassengerId      -0.342697    161.883369
    Survived          0.032017      6.221787
    Pclass            0.012429    -22.830196
    Age              -2.344191     73.849030
    SibSp             0.368739      8.748734
    Parch             0.649728      8.661052
    Fare              8.661052   2469.436846
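To make a single entry of this matrix concrete, the covariance between two columns can be computed by hand. The sketch below uses a small made-up DataFrame (columns x and y are hypothetical, not from the Titanic data) and divides by n - 1, as Pandas does.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# Distance of each value from the mean of its own column.
x_dev = df["x"] - df["x"].mean()
y_dev = df["y"] - df["y"].mean()

# Average product of the paired deviations, dividing by n - 1.
manual_cov = (x_dev * y_dev).sum() / (len(df) - 1)

print(manual_cov)             # 1.5
print(df["x"].cov(df["y"]))   # same value for a single pair of columns
print(df.cov())               # full matrix; the diagonal holds the variances
```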
Covariance, however, is hard to interpret and compare, because its scale depends on the scale of the variables involved. For a better comparison, we normalize the covariance by dividing it by the product of the two standard deviations. The result is the correlation, which is always between -1, a perfect anti-correlation, and 1, a perfect correlation. We calculate it with the corr() function. Just like the cov() function, it returns a matrix with the correlation between each pair of columns in the data set:
    print(train_df.corr())

                   PassengerId      Survived        Pclass           Age         SibSp         Parch  \
    PassengerId       1.000000     -0.005007     -0.035144      0.036847     -0.057527     -0.001652
    Survived         -0.005007      1.000000     -0.338481     -0.077221     -0.035322      0.081629
    Pclass           -0.035144     -0.338481      1.000000     -0.369226      0.083081      0.018443
    Age               0.036847     -0.077221     -0.369226      1.000000     -0.308247     -0.189119
    SibSp            -0.057527     -0.035322      0.083081     -0.308247      1.000000      0.414838
    Parch            -0.001652      0.081629      0.018443     -0.189119      0.414838      1.000000
    Fare              0.012658      0.257307     -0.549500      0.096067      0.159651      0.216225

                          Fare
    PassengerId       0.012658
    Survived          0.257307
    Pclass           -0.549500
    Age               0.096067
    SibSp             0.159651
    Parch             0.216225
    Fare              1.000000
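The link between covariance and correlation can be sketched with a small made-up DataFrame (columns x and y are hypothetical, not from the Titanic data). Dividing the covariance by the product of the two standard deviations gives the same number as corr(), and rescaling one variable changes the covariance but not the correlation.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# Correlation = covariance / (std of x * std of y).
manual_corr = df["x"].cov(df["y"]) / (df["x"].std() * df["y"].std())

print(manual_corr)
print(df["x"].corr(df["y"]))

# Rescale x (e.g. a change of unit): covariance grows 100 times,
# while the correlation stays the same.
scaled_x = df["x"] * 100
print(scaled_x.cov(df["y"]), scaled_x.corr(df["y"]))
```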
However, we have to be careful when evaluating correlations. Some data may show a correlation that does not correspond to any real relationship and is a mere coincidence. This can happen when the variables happen to follow similar patterns, for example when both simply increase over time.
Finally, it is also important to remember the famous saying that “correlation does not imply causation”. For our purposes, this means that, when two variables are highly correlated, the first may be driving the second, the second may be driving the first, each may be influencing the other, or the correlation may mean nothing at all. So, we need to know what the data means in order to evaluate each case.
For an introduction to Descriptive Statistics, I think that this is a good start :)