Statistics with Python


The statistics module is built into Python, and we can use it to calculate statistics for numeric data. In this tutorial, we will learn more about this module, alongside a few equivalent operations in pandas.

What is Statistics?

Statistics is a branch of mathematics that deals with collecting, organizing and interpreting numerical data. It tabulates and summarizes the data so that we can draw conclusions from it. Based on these conclusions, we can, for example, estimate the impact of a business decision on a company.

Understanding descriptive statistics

Descriptive statistics summarize a dataset. The dataset can represent an entire population or just a sample of it. The mean, median and mode are known as the measures of central tendency. Measures of variability include the range, variance and standard deviation.

Measures of central tendency describe the central values in the dataset, and measures of variability describe how the data is spread out.

Descriptive statistics are broadly classified into two types:

Measures of central tendency
Measures of variability

Measures of central tendency

These measures focus on the middle or central values in your data. They are often presented alongside graphs and other visual representations to help readers understand the data.

We start by calculating the frequency of each point in the distribution and describe it with the help of mean, mode and median.
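As a rough illustration of this idea, the sketch below counts how often each value occurs and then summarizes the same data with the mean, median and mode. The scores list is made-up sample data, and collections.Counter is just one convenient way to build the frequency table.

```python
from collections import Counter
import statistics

# Made-up sample data for illustration
scores = [6, 6, 6, 3, 6, 4, 6, 5, 5, 6]

# Frequency of each distinct value, most common first
frequency = Counter(scores)
print(frequency.most_common())  # [(6, 6), (5, 2), (3, 1), (4, 1)]

# Summaries of the same data
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)

print(mean_score)    # 5.3
print(median_score)  # 6.0
print(mode_score)    # 6
```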

Calculating mean and median using Python Pandas

We can calculate the mean and median with the help of the pandas library using the following code:

import pandas as pd

# Create a dictionary of Series
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

# Create a DataFrame
df = pd.DataFrame(d)

# numeric_only=True skips the non-numeric Name column
print("Mean Values in the Distribution")
print(df.mean(numeric_only=True))
print("*******************************")
print("Median Values in the Distribution")
print(df.median(numeric_only=True))

Output

Mean Values in the Distribution
Age       31.833333
Rating     3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age       29.50
Rating     3.79
dtype: float64
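The same two measures are also available in the built-in statistics module, without pandas. The following minimal sketch reuses the Age values from the example above as a plain list; the variable name ages is only illustrative.

```python
import statistics

# Age values from the pandas example, as a plain list
ages = [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]

mean_age = statistics.mean(ages)      # sum of the values divided by their count
median_age = statistics.median(ages)  # middle of the sorted data (average of the two middle values here)

print(round(mean_age, 2))  # 31.83
print(median_age)          # 29.5
```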

Calculating mode

The mode is the value that appears most often in your data. The statistics module provides a built-in mode() function that returns the most commonly occurring value in the dataset. Consider the following example:

import statistics
set1 = [6, 6, 6, 3, 6, 4, 6, 5, 5, 6]
print(statistics.mode(set1))

Output

6

Consider another example:

import pandas as pd

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','Chanchal','Gasper','Naviya','Andres']),
   'Age':pd.Series([25,26,25,23,30,25,23,34,40,30,25,46])}
#Create a DataFrame
df = pd.DataFrame(d)

print(df.mode())

Output

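        Name   Age
0     Andres  25.0
1   Chanchal   NaN
2     Gasper   NaN
3       Jack   NaN
4      James   NaN
5        Lee   NaN
6     Naviya   NaN
7      Ricky   NaN
8      Smith   NaN
9      Steve   NaN
10       Tom   NaN
11       Vin   NaN

Every name occurs exactly once, so all twelve names tie as modes of the Name column. Age has a single mode, 25, and the remaining rows of that column are padded with NaN.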
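When several values tie for the highest count, as the names do above, the statistics module offers multimode(), which returns every tied value as a list (available from Python 3.8). A small sketch with made-up data:

```python
import statistics

# Made-up data in which 1 and 2 both occur twice
values = [1, 1, 2, 2, 3]

# mode() returns a single value; in Python 3.8+ a tie resolves to the first mode encountered
single_mode = statistics.mode(values)
print(single_mode)  # 1

# multimode() returns all of the most common values, in order of first appearance
all_modes = statistics.multimode(values)
print(all_modes)    # [1, 2]
```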

Measures of variability

These measures help us understand the distribution and dispersion of the given data.

The most used measures of variability are:

Range
Variance
Standard deviation

For example, a dataset whose mean lies between 55 and 60 could still contain values anywhere from 1 to 100. The mean alone does not reveal this, so the measures of variability help us understand how the data is spread.
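To make this concrete, here is a small sketch with two made-up datasets that share the same mean but are spread very differently. The names tight and wide are purely illustrative.

```python
import statistics

# Two made-up datasets with the same mean
tight = [55, 56, 57, 58, 59]
wide = [14, 35, 57, 79, 100]

mean_tight = statistics.mean(tight)
mean_wide = statistics.mean(wide)
print(mean_tight, mean_wide)  # 57 57

# The spread (largest value minus smallest value) tells them apart
spread_tight = max(tight) - min(tight)
spread_wide = max(wide) - min(wide)
print(spread_tight, spread_wide)  # 4 86
```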

1. variance()

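The variance is calculated by subtracting the mean from each data point and squaring the result. Summing these squared differences and dividing by the number of data points gives the population variance; dividing by one less than the number of data points gives the sample variance.

The variance() function computes the sample variance, so we use it when our dataset is a sample drawn from a larger population. For data that covers the entire population, the statistics module provides pvariance().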

Example:

import statistics as st
nums = [1, 2, 3, 5, 7, 9]
print(st.variance(nums))

Output:

9.5
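The result 9.5 can be checked by hand. The sketch below walks through the calculation step by step and shows that variance() divides by n - 1 (the sample variance), while pvariance() divides by n (the population variance). The intermediate variable names are illustrative.

```python
import statistics

nums = [1, 2, 3, 5, 7, 9]
n = len(nums)

# Step 1: the mean of the data
mean = sum(nums) / n  # 4.5

# Step 2: squared difference of each point from the mean
squared_diffs = [(x - mean) ** 2 for x in nums]

# Step 3: divide the sum by n - 1 (sample) or by n (population)
sample_variance = sum(squared_diffs) / (n - 1)
population_variance = sum(squared_diffs) / n

print(sample_variance)            # 9.5
print(statistics.variance(nums))  # 9.5
print(population_variance)        # about 7.92, same as statistics.pvariance(nums)
```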

2. Standard deviation

The standard deviation is the square root of the variance. We saw how to calculate the variance in the code above.

In Python's statistics module, the stdev() function calculates the sample standard deviation of the given dataset.

Example:

import statistics as st
nums = [1, 2, 3, 5, 7, 9]
print(st.stdev(nums))

Output:

3.082207001484488
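As a quick check of the relationship between the two measures, the following sketch compares stdev() with the square root of variance(), and does the same for the population versions pstdev() and pvariance(). math.isclose is used because floating-point results may differ in the last digit.

```python
import math
import statistics

nums = [1, 2, 3, 5, 7, 9]

sample_sd = statistics.stdev(nums)       # based on the sample variance (n - 1)
population_sd = statistics.pstdev(nums)  # based on the population variance (n)

# Each standard deviation matches the square root of its variance
sample_matches = math.isclose(sample_sd, math.sqrt(statistics.variance(nums)))
population_matches = math.isclose(population_sd, math.sqrt(statistics.pvariance(nums)))

print(sample_matches)      # True
print(population_matches)  # True
```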

3. Range

The range is the difference between the largest and smallest values in the data. The larger the range, the more widely the data is spread.

range = highest value in the dataset - smallest value in the dataset

We can find the highest and smallest values using Python's built-in max() and min() functions.

Example:

arr = [1, 2, 3, 4, 5]

maximum = max(arr)
minimum = min(arr)
value_range = maximum - minimum

print("Maximum = {}, Minimum = {} and Range = {}".format(
    maximum, minimum, value_range))

Output:

Maximum = 5, Minimum = 1 and Range = 4
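If the range is needed in several places, the calculation can be wrapped in a small helper. The function name data_range below is hypothetical (it is not part of the statistics module), and the ages list reuses the Age values from the pandas example earlier in this tutorial.

```python
def data_range(data):
    """Return the largest value minus the smallest value (hypothetical helper)."""
    values = list(data)
    if not values:
        raise ValueError("data_range() requires at least one value")
    return max(values) - min(values)


# Age values from the pandas example
ages = [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]
print(data_range(ages))  # 28
```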

Summary

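In this tutorial, we covered descriptive statistics with Python: calculating the mean, median and mode as measures of central tendency, and the range, variance and standard deviation as measures of variability.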
