Statistics with Python
The statistics module is part of Python's standard library, and we can use it to calculate basic statistics for numeric data. In this tutorial, we will learn more about this module and see how pandas handles the same calculations.
What is Statistics?
Statistics is a branch of mathematics that deals with collecting, summarizing and interpreting numerical data. It tabulates and analyzes the data in order to draw conclusions from it. Based on these conclusions, we can, for example, estimate the impact of a business decision on a company.
Understanding descriptive statistics
Descriptive statistics summarize a dataset. The dataset can represent the entire population or just a sample of it. The mean, median and mode are known as the measures of central tendency. The measures of variability include the range, variance and standard deviation.
Measures of central tendency describe the central values of the dataset, and measures of variability describe how the data is spread around them.
Descriptive statistics are broadly classified into two types:
- Measures of central tendency
- Measures of variability
Measures of central tendency
These measures focus on the middle or central values of your data. Graphs and other visual representations are often used alongside them to help readers understand the data.
We start by calculating the frequency of each point in the distribution and describe it with the help of mean, mode and median.
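As a minimal sketch of this step, the snippet below counts how often each value occurs and then computes the three measures with the standard library. The list `scores` is hypothetical data made up for illustration; it does not come from the tutorial's dataset.

```python
from collections import Counter
import statistics

# Hypothetical sample data, chosen only for illustration
scores = [2, 3, 3, 5, 7, 3, 5]

# Frequency of each distinct value in the distribution
frequency = Counter(scores)
print(frequency)

# Measures of central tendency
mean_value = statistics.mean(scores)      # 28 / 7 = 4
median_value = statistics.median(scores)  # middle of 2, 3, 3, 3, 5, 5, 7 is 3
mode_value = statistics.mode(scores)      # 3 occurs three times
print(mean_value, median_value, mode_value)
```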
Calculating mean and median using Python Pandas
We can calculate the mean and median with the help of the pandas library:
import pandas as pd
# Create a dictionary of Series
d = {'Name': pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
                        'Lee','Chanchal','Gasper','Naviya','Andres']),
     'Age': pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating': pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
# Create a DataFrame
df = pd.DataFrame(d)
print("Mean Values in the Distribution")
# numeric_only=True skips the non-numeric 'Name' column
print(df.mean(numeric_only=True))
print("*******************************")
print("Median Values in the Distribution")
print(df.median(numeric_only=True))
Output
Mean Values in the Distribution
Age       31.833333
Rating     3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age       29.50
Rating     3.79
dtype: float64
Calculating mode
The mode is the value that appears most often in the data. The statistics module provides the mode() function, which returns the most commonly occurring value in a dataset. Consider the following example:
import statistics
set1 = [6, 6, 6, 3, 6, 4, 6, 5, 5, 6]
print(statistics.mode(set1))
Output
6
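When two or more values are tied for most common, a single mode does not tell the whole story. The sketch below, which assumes Python 3.8 or later, uses statistics.multimode() to list every tied value. The list `tied` is hypothetical data made up for illustration.

```python
import statistics

# Hypothetical data in which 1 and 2 are equally common
tied = [1, 1, 2, 2, 3]

# multimode() returns all most-common values, in order of first appearance
modes = statistics.multimode(tied)
print(modes)  # [1, 2]

# On Python 3.8+, mode() returns the first of the tied values it encounters
first_mode = statistics.mode(tied)
print(first_mode)  # 1
```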
Consider another example:
import pandas as pd
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','Chanchal','Gasper','Naviya','Andres']),
'Age':pd.Series([25,26,25,23,30,25,23,34,40,30,25,46])}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.mode())
Output
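        Name   Age
0     Andres  25.0
1   Chanchal   NaN
2     Gasper   NaN
3       Jack   NaN
4      James   NaN
5        Lee   NaN
6     Naviya   NaN
7      Ricky   NaN
8      Smith   NaN
9      Steve   NaN
10       Tom   NaN
11       Vin   NaN
Every name occurs exactly once, so all twelve names are modes of the Name column, while 25 is the only mode of Age. pandas pads the shorter column with NaN.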
Measures of variability
These measures help us understand the distribution and dispersion of the given data.
The most used measures of variability are:
- Range
- Variance
- Standard deviation
For example, the average of a dataset may lie between 55 and 60 while the individual values range anywhere from 1 to 100. The average alone does not reveal this, so the measures of variability help us understand how the data is spread.
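To make this concrete, here is a small sketch comparing two datasets that share the same mean but differ widely in spread. Both lists, `narrow` and `wide`, are hypothetical values made up for illustration.

```python
import statistics

# Two hypothetical datasets with the same mean (57) but different spread
narrow = [55, 56, 57, 58, 59]
wide = [1, 30, 57, 97, 100]

for name, data in (("narrow", narrow), ("wide", wide)):
    mean_value = statistics.mean(data)
    value_range = max(data) - min(data)
    std_dev = statistics.stdev(data)
    print(f"{name}: mean={mean_value}, range={value_range}, stdev={std_dev:.2f}")
```

The means are identical, yet the range and standard deviation of the second list are far larger, which is exactly the information the mean hides.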
1. variance()
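The variance is calculated by subtracting the mean from each data point, squaring each difference and adding the squares together. Dividing this sum by the number of data points gives the population variance; dividing it by one less than the number of data points gives the sample variance.
The variance() function computes the sample variance, so we use it when our data is a sample drawn from a larger population. When the data covers the entire population, use pvariance() instead.
Example:
import statistics as st
nums = [1, 2, 3, 5, 7, 9]
print(st.variance(nums))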
Output:
9.5
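The calculation can also be done by hand, which shows where the 9.5 comes from. The sketch below reuses the same numbers and compares the manual result with statistics.variance() (sample variance, divides by n - 1) and statistics.pvariance() (population variance, divides by n). The variable names are chosen for illustration.

```python
import math
import statistics as st

nums = [1, 2, 3, 5, 7, 9]
n = len(nums)
mean = sum(nums) / n  # 27 / 6 = 4.5

# Sum of squared differences from the mean
total = sum((x - mean) ** 2 for x in nums)  # 47.5

sample_variance = total / (n - 1)  # 47.5 / 5 = 9.5
population_variance = total / n    # 47.5 / 6, roughly 7.92

print(sample_variance, st.variance(nums))
print(population_variance, st.pvariance(nums))
```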
2. Standard deviation
The standard deviation is the square root of the variance. We saw how to calculate the variance in the code above.
In Python's statistics module, the stdev() function calculates the sample standard deviation of the given dataset.
Example:
import statistics as st
nums = [1, 2, 3, 5, 7, 9]
print(st.stdev(nums))
Output:
3.082207001484488
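As a quick check of the relationship between the two measures, the sketch below takes the square root of the sample variance and compares it with stdev(). It also computes pstdev(), the population counterpart, which is slightly smaller because it divides by n rather than n - 1.

```python
import math
import statistics as st

nums = [1, 2, 3, 5, 7, 9]

# Standard deviation is the square root of the variance
sample_sd = math.sqrt(st.variance(nums))
print(sample_sd, st.stdev(nums))

# Population standard deviation, for data covering the whole population
population_sd = st.pstdev(nums)
print(population_sd)
```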
3. Range
The range is the difference between the highest and the smallest value in the data. The larger the range, the more widely the data is spread, although a single extreme value is enough to inflate it.
range = highest value in the dataset - smallest value in the dataset
You can find the highest and smallest values using Python's built-in max() and min() functions.
Example:
arr = [1, 2, 3, 4, 5]
Maximum = max(arr)
Minimum = min(arr)
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
Maximum, Minimum, Range))
Output:
Maximum = 5, Minimum = 1 and Range = 4
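If the range is needed in several places, the same calculation can be wrapped in a small helper. This is only a sketch: the function name `value_range` is hypothetical, and raising ValueError for empty input is a design choice made here, not something the tutorial prescribes.

```python
def value_range(data):
    """Return the difference between the largest and smallest value."""
    values = list(data)
    if not values:
        # max() and min() fail on empty input, so report it explicitly
        raise ValueError("value_range() requires at least one data point")
    return max(values) - min(values)


print(value_range([1, 2, 3, 4, 5]))  # 4
```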
Summary
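In this tutorial, we covered descriptive statistics with Python: the mean, median and mode as measures of central tendency, and the range, variance and standard deviation as measures of variability.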
