Descriptive Statistics in Python

You have probably come across statistics at school, where you found the minimum, maximum, average, and so on of a given set of values. Doing these calculations by hand is fine for a small amount of data. But what if you have a large dataset to analyze?

Don't worry, Python can help. In this article, we will discuss the statistics module and then see how to get statistical insights from pandas data frames. Let us begin with a formal introduction to descriptive statistics.

What is Descriptive Statistics?

Before we can operate on data and produce the desired outputs, we need to learn something about it. This process is called data analysis. When we analyze data to get statistical information, there are two main approaches:

1. Descriptive statistics: Here, tools such as the mean and standard deviation are applied to a given data sample to summarize it.

2. Inferential statistics: Here, we use a sample to draw conclusions about a larger population, accounting for random variation due to observational error, sampling differences, etc.

In this article, we will cover descriptive statistics using Python, in which the basic features of the data are described and summarized. These can be further classified into:

1. Central tendency: As the name suggests, we find the central value of the data. Examples include the mean, mode, and median.

2. Dispersion: Here too, the name speaks for itself. We measure how far the values are spread from the central value and from each other. Variance and standard deviation fall under this category.

We will learn about both in the next sections. But first, let us look at the module we will be using. Python provides a separate module for these statistical methods, named 'statistics', which is part of the Python Standard Library. We can import it using the statement below.

import statistics as st
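As a quick preview of central tendency versus dispersion, the sketch below uses two made-up lists (the names narrow and wide are purely illustrative) that share the same central value but differ in how far the values spread around it.

```python
import statistics as st

# Two illustrative datasets with the same center but different spread
narrow = [9, 10, 11]
wide = [0, 10, 20]

# Central tendency: both lists have the same mean
print(st.mean(narrow), st.mean(wide))    # 10 10

# Dispersion: the sample standard deviation tells them apart
print(st.stdev(narrow), st.stdev(wide))  # 1.0 10.0
```

The mean alone cannot distinguish the two lists, which is why a measure of dispersion is usually reported alongside it.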

Finding Central Tendencies using Python

In this section, we will be discussing the central tendencies using Python. Let us see each of them with an example.

1. Mean:

Mean is the average value of the given data: the sum of the values divided by the number of values.
Example of finding the mean of a list of values:

import statistics as st

data=[2,3,5,-2,8,-4,6,7]
st.mean(data)

Output:

3.125

But we cannot apply this function to empty data. Doing so raises a StatisticsError, as shown below.

Example of finding mean of empty data:

st.mean([])

Output:

StatisticsError                           Traceback (most recent call last)
<ipython-input-7-d8abf62323e6> in <module>
      2 from fractions import Fraction as fr
      3
----> 4 st.mean([])

~\anaconda3\lib\statistics.py in mean(data)
    313     n = len(data)
    314     if n < 1:
--> 315         raise StatisticsError('mean requires at least one data point')
    316     T, total, count = _sum(data)
    317     assert count == n
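If the data may be empty, one option is to catch the exception. The sketch below assumes a hypothetical helper named safe_mean (it is not part of the statistics module) that returns a default value instead of raising.

```python
import statistics as st

def safe_mean(data, default=None):
    """Return the mean of data, or `default` when data is empty.

    safe_mean is a hypothetical helper used only for illustration.
    """
    try:
        return st.mean(data)
    except st.StatisticsError:
        # st.mean() raises StatisticsError when there are no data points
        return default

print(safe_mean([2, 3, 5, -2, 8, -4, 6, 7]))  # 3.125
print(safe_mean([]))                          # None
```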

We can compute the mean of float as well as fractional values.

Example of finding mean of float values:

data=[3.4,8,9,5.3,-1.4]
st.mean(data)

Output:

4.86

Example of finding mean of fractional values:

from fractions import Fraction as fr

st.mean((fr(5,4),fr(2,3),fr(6,3)))

Output:

Fraction(47, 36)

We can apply this method to dictionaries too. Iterating over a dictionary yields its keys, so the output is the average of the numerical keys.

Example of finding mean of dictionaries:

st.mean({1:'a',2:'b',3:'c',4:'d',5:'e'})

Output:

3
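If the numbers of interest are stored as the dictionary's values rather than its keys, they have to be passed explicitly. A minimal sketch, assuming a made-up dictionary named scores:

```python
import statistics as st

# Hypothetical data: the numbers we care about are the values, not the keys
scores = {'a': 10, 'b': 20, 'c': 60}

# Passing the dict itself would iterate over the string keys,
# so we pass the values instead
print(st.mean(scores.values()))  # 30
```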

2. Mode:

This function returns the most frequently occurring value.

Example of finding mode:

st.mode([1,3,4,1,2,1,3,4,3,3,5,3])

Output:

3

Since this function does not involve any arithmetic operation to give the result, this method can be applied even to characters.

Example of finding mode of characters:

st.mode(['a','c','b','c','a','b','a','c','a'])

Output:

'a'
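When several values tie for the highest count, it can help to see all of them. The sketch below assumes Python 3.8 or later, where the statistics module provides multimode() and where mode() returns the first of the tied values; the list is made up for illustration.

```python
import statistics as st

# Illustrative data in which 1 and 2 both occur twice
data = [1, 1, 2, 2, 3]

# multimode() returns every most-common value, in order of first appearance
print(st.multimode(data))  # [1, 2]

# mode() returns only the first of the tied values (Python 3.8+)
print(st.mode(data))       # 1
```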

3. Median:

This gives the middle value when the data is arranged in ascending or descending order. If the number of values is even, the average of the two middle values is returned.

Example of finding median:

st.median([2,3,5,-2,8,-4,6,7,5,3])

Output:

4.0
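To see where the median comes from, the same calculation can be sketched by hand: sort the values, then take the middle one, or the average of the two middle ones when the count is even. The variable names here are illustrative.

```python
import statistics as st

data = [2, 3, 5, -2, 8, -4, 6, 7, 5, 3]
ordered = sorted(data)   # [-4, -2, 2, 3, 3, 5, 5, 6, 7, 8]
n = len(ordered)
mid = n // 2

if n % 2 == 1:
    # Odd count: a single middle value
    manual = ordered[mid]
else:
    # Even count: average of the two middle values
    manual = (ordered[mid - 1] + ordered[mid]) / 2

print(manual)            # 4.0
print(st.median(data))   # 4.0
```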

4. Harmonic Mean in Python:

This returns the harmonic mean of the given data. It is the reciprocal of the arithmetic mean of the reciprocals of the values. For example, the harmonic mean of two values a and b is 2/(1/a + 1/b).

Example of finding harmonic mean:

st.harmonic_mean([2,3,5,4])

Output:

3.116883116883117

For comparison, the ordinary (arithmetic) mean of the same data is 3.5.
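The definition generalizes from two values to n values as n divided by the sum of the reciprocals. A minimal sketch that checks this against the library function (the name manual is illustrative):

```python
import math
import statistics as st

data = [2, 3, 5, 4]

# Harmonic mean by definition: n / (1/x1 + 1/x2 + ... + 1/xn)
manual = len(data) / sum(1 / x for x in data)

print(manual)
print(math.isclose(manual, st.harmonic_mean(data)))  # True

# For positive values that are not all equal, the harmonic mean
# is smaller than the arithmetic mean
print(manual < st.mean(data))                        # True
```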

5. Median Low using Python:

We saw above that when the length of the data is even, the median value we get is the average of the two center values. What if we want the lower value as the output? We have the function for this too.

Example of finding median low:

st.median_low([2,3,1])

Output:

2

Example of finding median low of an even number of values:

st.median_low([2,3,1,4])

Output:

2

6. Median High in Python:

Similarly, we have a function to get the larger value of the two center values and that is median_high().

Example of finding median high:

st.median_high([2,3,1])

Output:

2

Example of finding median high of an even number of values:

st.median_high([2,3,1,4])

Output:

3
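For an even number of values, the three median functions differ only in how they treat the two middle values of the sorted data. A small sketch using the same four values (the names low and high are illustrative):

```python
import statistics as st

data = [2, 3, 1, 4]
ordered = sorted(data)   # [1, 2, 3, 4]
mid = len(ordered) // 2

# The two middle values of the sorted data
low, high = ordered[mid - 1], ordered[mid]
print(low, high)         # 2 3

# median_low picks the smaller, median_high the larger, median averages them
print(st.median_low(data), st.median_high(data), st.median(data))  # 2 3 2.5
```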

7. Median Grouped:

This function calculates the median of grouped continuous data, computed as the 50th percentile. The data is assumed to be grouped into intervals of width interval, and the median is calculated by interpolation within the median interval (the interval containing the median value). This assumes that the true values within the median interval are distributed uniformly. The formula is

median = L + interval * (N / 2 - CF) / F

where
L = the lower limit of the median interval
N = total number of data points
CF = number of data points below the median interval
F = number of data points that lie in the median interval

Example of finding grouped median:

st.median_grouped([1,2,3,5,6,7],interval=1)

Output:

4.5

The ordinary median of the above data, as returned by st.median(), is 4.0. Let us see the result with a different interval.

Example of finding grouped median:

st.median_grouped([1,2,3,5,6,7],interval=3)

Output:

3.5
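To make the formula concrete, the sketch below applies it directly and compares the result with st.median_grouped(). It assumes that each data value is the midpoint of its interval, so that L is the middle value minus half the interval width; grouped_median_by_hand is a hypothetical helper, not a library function.

```python
import statistics as st

def grouped_median_by_hand(data, interval=1):
    """Apply median = L + interval * (N / 2 - CF) / F directly.

    Hypothetical helper; assumes each value is the midpoint of its interval.
    """
    ordered = sorted(data)
    n = len(ordered)                        # N
    x = ordered[n // 2]                     # midpoint of the median interval
    lower = x - interval / 2                # L
    cf = sum(1 for v in ordered if v < x)   # CF: points below the median interval
    f = ordered.count(x)                    # F: points inside the median interval
    return lower + interval * (n / 2 - cf) / f

data = [1, 2, 3, 5, 6, 7]
for width in (1, 3):
    print(width, grouped_median_by_hand(data, width),
          st.median_grouped(data, interval=width))
```

For this data the middle value is 5, with CF = 3 and F = 1, so the interpolation term is zero and the result is simply 5 minus half the interval width.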

Finding Dispersions using Python

We have discussed different functions that give us information about the central value. Let us now see some measures of dispersion.

1. Variance:

This returns the sample variance, which measures how far the values are spread from the mean. Use it when the data is a sample of the population.

Example of finding variance:

st.variance([1,2,3,5,6,7])

Output:

5.6

2. Standard Deviation:

This function returns the sample standard deviation, which is the square root of the sample variance.

Example of finding standard deviation:

st.stdev([1,2,3,5,6,7])

Output:

2.3664319132398464

3. Population Variance:

This is used when the data represents the entire population rather than a sample of it.

Example of finding pvariance:

st.pvariance([1,2,3,5,6,7])

Output:

4.666666666666667

4. Population Standard Deviation:

Like the population variance, this returns the standard deviation of the whole population data.
Example of finding population standard deviation:

st.pstdev([1,2,3,5,6,7])

Output:

2.160246899469287
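The sample and population versions differ only in the divisor: the sum of squared deviations from the mean is divided by n - 1 for a sample and by n for a population. A minimal sketch of both calculations by hand (the variable names are illustrative):

```python
import math
import statistics as st

data = [1, 2, 3, 5, 6, 7]
n = len(data)
mean = st.mean(data)                       # 4

# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in data)    # 28

sample_var = ss / (n - 1)       # divisor n - 1, as in st.variance()
population_var = ss / n         # divisor n, as in st.pvariance()
print(sample_var, population_var)          # 5.6 4.666666666666667

# The standard deviations are the square roots of the variances
print(math.isclose(math.sqrt(sample_var), st.stdev(data)))       # True
print(math.isclose(math.sqrt(population_var), st.pstdev(data)))  # True
```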

Statistical Description of Data Frames in Python

Now, let us discuss the statistical analysis of data frames. Did you know that we need to write just one line to get statistical information about a data frame? Interesting, right? Let us apply this to the data frame below.
Example of creating data frame:

# import required modules
import pandas as pd

# create a 2D list of student details
stdData = [[1, 'M', 13, 123, 46],
           [2, 'M', 12, 134, 82],
           [3, 'F', 14, 114, 77],
           [4, 'M', 13, 136, 73],
           [5, 'F', 13, 107, 56],
           [6, 'F', 12, 121, 80],
           [7, 'M', 14, 113, 76],
           [8, 'F', 15, 123, 95],
           [9, 'F', 14, 112, 78],
           [10, 'M', 15, 100, 60]]

# create the data frame from the above data
df = pd.DataFrame(stdData, columns=['ID', 'Gender', 'Age', 'Height(cm)', 'Marks'])
print(df)

Output:

   ID Gender  Age  Height(cm)  Marks
0   1      M   13         123     46
1   2      M   12         134     82
2   3      F   14         114     77
3   4      M   13         136     73
4   5      F   13         107     56
5   6      F   12         121     80
6   7      M   14         113     76
7   8      F   15         123     95
8   9      F   14         112     78
9  10      M   15         100     60

1. Mean

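Because the Gender column holds strings, we pass numeric_only=True so that only the numeric columns are used. Recent pandas versions (2.0 and later) raise a TypeError without it.

Example of finding the mean of the data frame:

df.mean(numeric_only=True)

Output:

ID              5.5
Age            13.5
Height(cm)    118.3
Marks          72.3
dtype: float64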

We can also find the statistical information for one particular column.
Example of finding the mean of a column of the data frame:

df['Marks'].mean()

Output:

72.3

2. Standard Deviation

Example of finding the standard deviation of the numeric columns of the data frame:

df.std(numeric_only=True)

Output:

ID             3.027650
Age            1.080123
Height(cm)    11.353414
Marks         14.322089
dtype: float64
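By default, pandas computes the sample standard deviation (ddof=1), which corresponds to st.stdev() from the statistics module. Passing ddof=0 gives the population version, corresponding to st.pstdev(). The sketch below checks this on the Marks values from the data frame; the names marks and s are illustrative.

```python
import statistics as st
import pandas as pd

# Marks column from the student data frame above
marks = [46, 82, 77, 73, 56, 80, 76, 95, 78, 60]
s = pd.Series(marks)

# Default ddof=1: sample standard deviation, comparable to st.stdev()
print(s.std(), st.stdev(marks))

# ddof=0: population standard deviation, comparable to st.pstdev()
print(s.std(ddof=0), st.pstdev(marks))
```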

3. Skew

Example of finding the skew of the numeric columns of the data frame:

df.skew(numeric_only=True)

Output:

ID             0.000000
Age            0.000000
Height(cm)     0.151149
Marks         -0.518947
dtype: float64

Describe Function

This function gives a summary of the statistical information of the data frame.
Example of describing the data frame:

print(df.describe())

Output:

             ID        Age  Height(cm)      Marks
count  10.00000  10.000000   10.000000  10.000000
mean    5.50000  13.500000  118.300000  72.300000
std     3.02765   1.080123   11.353414  14.322089
min     1.00000  12.000000  100.000000  46.000000
25%     3.25000  13.000000  112.250000  63.250000
50%     5.50000  13.500000  117.500000  76.500000
75%     7.75000  14.000000  123.000000  79.500000
max    10.00000  15.000000  136.000000  95.000000

Here we see information only for the numerical columns. To get the statistical details for all of the columns, including information such as the most frequent value (top), we can pass the include parameter to the describe() function as shown in the example below.

Example of describing all columns of the data frame:

print(df.describe(include='all'))

Output:

              ID  Gender        Age  Height(cm)      Marks
count   10.00000      10  10.000000   10.000000  10.000000
unique       NaN       2        NaN         NaN        NaN
top          NaN       M        NaN         NaN        NaN
freq         NaN       5        NaN         NaN        NaN
mean     5.50000     NaN  13.500000  118.300000  72.300000
std      3.02765     NaN   1.080123   11.353414  14.322089
min      1.00000     NaN  12.000000  100.000000  46.000000
25%      3.25000     NaN  13.000000  112.250000  63.250000
50%      5.50000     NaN  13.500000  117.500000  76.500000
75%      7.75000     NaN  14.000000  123.000000  79.500000
max     10.00000     NaN  15.000000  136.000000  95.000000

Conclusion

We have reached the end of the article. We began with an introduction to descriptive statistics, then learned the different functions in the statistics module, and finally saw statistical analysis of data frames. Hope you enjoyed reading this article and learned something new. Happy learning!
