Data Cleansing using Python
Here we are again with an article on handling data, which plays an important role in every domain. Raw data usually needs to be cleansed to remove repeated values, missing values, and so on.
In this article, we will learn to clean data using the Python modules NumPy and Pandas. First, let us see more about data cleaning.
What is Data Cleansing?
Data cleansing is the process of detecting and correcting raw data by identifying its incomplete, wrong, repeated, or irrelevant parts. For example, given a data set, we may need to remove null values, drop the parts of the data that the application does not need, and so on. Besides this, there are many applications where we need to process the information we obtain.
Installing required Modules
As mentioned above, we will learn data cleansing using the NumPy and Pandas modules. We can use the statements below to install them.
pip install numpy
pip install pandas
Data Cleansing using NumPy
Before learning about the operations we can perform using NumPy, let us look at the ways of creating NumPy arrays.
Creating NumPy array
There are many ways of creating NumPy arrays with the np.array() method by specifying different parameters. Let us see each of them.
1. Creating a one dimensional numpy array
Example of creating a one dimensional numpy array:
import numpy as np
np.array([1,2,3,4,5])
Output:
array([1, 2, 3, 4, 5])
2. Creating a multi dimensional numpy array
Example of creating a multi dimensional numpy array:
import numpy as np
np.array([['a','b','c','d','e'],[1,2,3,4,5]])
Output:
array([['a', 'b', 'c', 'd', 'e'],
       ['1', '2', '3', '4', '5']], dtype='<U11')
3. Specifying the dimension
Example of creating a numpy array specifying dimension:
import numpy as np
np.array([1,2,3,4,5],ndmin=2)
Output:
array([[1, 2, 3, 4, 5]])
We can see that we got a 2 dimensional array rather than a 1D one.
4. Specifying the data type
Example of creating a numpy array specifying data type:
import numpy as np
np.array([1,2,3,4,5],dtype=float)
Output:
array([1., 2., 3., 4., 5.])
We can see that because we gave the 'dtype' parameter as float, the values of the array are float values (they have decimal points).
Getting Properties of the array
We can read different properties of a NumPy array through its attributes, such as dtype, ndim, shape, and size. Let us see them in the example below.
Example of numpy arrays:
import numpy as np
arr=np.array([1,2,3,4,5])
print("array:",arr)
print("Type of arr:",type(arr))
print("Data type:",arr.dtype)
print("Dimension:",arr.ndim)
print("Shape:",arr.shape)
print("Size:",arr.size)
Output:
array: [1 2 3 4 5]
Type of arr: <class 'numpy.ndarray'>
Data type: int32
Dimension: 1
Shape: (5,)
Size: 5
Applying operations on NumPy array
Now we will see different operations we can perform on these arrays to modify the arrays.
1. Arithmetic Operations on array
We can do different operations like addition, subtraction, multiplication, and division. Let us see some examples.
Example of addition operation:
import numpy as np
a1=np.array([1,2,3,4,5])
a2=np.array([3,4,5,1,2])
print('a1+a2:')
print(a1+a2)
print('a1+2:')
print(a1+2)
Output:
a1+a2:
[4 6 8 5 7]
a1+2:
[3 4 5 6 7]
Example of subtraction operation:
import numpy as np
a1=np.array([1,2,3,4,5])
a2=np.array([3,4,5,1,2])
print('a1-a2:')
print(a1-a2)
print('a1-3:')
print(a1-3)
Output:
a1-a2:
[-2 -2 -2  3  3]
a1-3:
[-2 -1  0  1  2]
Example of multiplication operation:
import numpy as np
a1=np.array([[1,2,3],[4,5,6]])
print('a1*-1:')
print(a1*-1)
Output:
a1*-1:
[[-1 -2 -3]
 [-4 -5 -6]]
Example of division operation:
import numpy as np
a1=np.array([[1,2,3],[4,5,6]])
print('a1/2:')
print(a1/2)
Output:
a1/2:
[[0.5 1.  1.5]
 [2.  2.5 3. ]]
2. Reshaping the array
We can reshape an array using the reshape() method. The new shape must hold the same number of values as the original array.
Example of reshaping the array:
import numpy as np
a1=np.array([[1,2,3],[4,5,6]])
a2=a1.reshape(3,2)
print(a2)
Output:
[[1 2]
 [3 4]
 [5 6]]
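The size rule for reshape() can be checked directly. The following is a minimal sketch, with illustrative variable names that are not part of the original example. It uses -1 to let NumPy work out one dimension, and shows the ValueError raised when the requested shape does not match the number of values.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])  # 2 x 3 = 6 values

# -1 asks NumPy to infer that dimension from the total number of values
a2 = a1.reshape(-1, 2)
print(a2.shape)

# 6 values cannot fill a 4 x 2 array, so NumPy raises ValueError
try:
    a1.reshape(4, 2)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)
```

Using -1 is handy when only the number of columns is known in advance.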
3. Flattening the array
Flattening is the process of converting a multi dimensional array into a one dimensional array.
Example of flattening the array:
import numpy as np
a1=np.array([[1,2,3],['a','b','c']])
a2=a1.flatten()
print(a2)
Output:
['1' '2' '3' 'a' 'b' 'c']
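Two details of flattening are easy to miss: an array that mixes numbers and text stores every value as a string, and flatten() returns a copy rather than a view. The sketch below illustrates both. The variable names are only for illustration.

```python
import numpy as np

mixed = np.array([[1, 2, 3], ['a', 'b', 'c']])
flat = mixed.flatten()

# Mixing numbers and text makes NumPy store every value as a string
print(flat.dtype.kind)  # 'U' stands for a Unicode string dtype

# flatten() returns a copy, so changing it leaves the original untouched
flat[0] = 'x'
print(mixed[0, 0])
```

When cleaning data, this means numeric columns should be kept in their own array if arithmetic is needed later.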
4. Sorting the array
Example of sorting the array:
import numpy as np
a1=np.array([4,21,2,5,10,6,2])
a1.sort()
print(a1)
Output:
[ 2  2  4  5  6 10 21]
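The sort() method changes the array in place. If the original order has to be kept, or repeated values have to be removed, np.sort() and np.unique() are possible alternatives. This is a small sketch with illustrative variable names, not part of the original example.

```python
import numpy as np

values = np.array([4, 21, 2, 5, 10, 6, 2])

sorted_copy = np.sort(values)      # new sorted array; values keeps its order
unique_values = np.unique(values)  # sorted, with the repeated 2 removed

print(values)
print(sorted_copy)
print(unique_values)
```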
Data Cleansing using Pandas
When we use pandas, we work with data frames. Let us first see how to load a data frame.
Example of loading CSV file as data frame:
import pandas as pd
data=pd.read_csv('data.csv')
print(data)
Output:
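The contents of data.csv are not shown in this article. To try the examples without that file, a small stand-in can be read from a string. In the sketch below, the column names Name, Surname, YOB, and Locality are taken from the later examples, while the rows themselves are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv; the real file is not shown in this article
csv_text = """Name,Surname,YOB,Locality
Asha,Rao,1990,Loc1
Ben,Lee,,Loc2
Asha,Rao,1990,Loc1
Chen,,1985,Loc3
"""

# read_csv accepts a file-like object as well as a file path
data = pd.read_csv(io.StringIO(csv_text))
print(data)
print(data.dtypes)
```

Empty fields are read as missing values, so this sample has one missing YOB, one missing Surname, and one repeated row to practise on.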
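Now let us get information about the data using the describe() and rank() functions. describe() returns summary statistics for the numeric columns, and rank() returns the rank of each value within its column.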
Example of describe() function:
data.describe()
Output:
Example of rank() function:
data.rank()
Output:
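What describe() and rank() return is easier to see on a tiny data frame. The following sketch uses a hypothetical scores table that is not part of the article's data.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the two functions
scores = pd.DataFrame({"maths": [70, 90, 80], "physics": [65, 85, 85]})

summary = scores.describe()  # count, mean, std, min, quartiles, max per column
ranks = scores.rank()        # rank within each column; tied values share the average rank

print(summary.loc["mean"])
print(ranks)
```

The two equal physics scores both receive rank 2.5, which is the default way rank() handles ties.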
Now let us see different operations we can use on the data frame.
1. Finding and Removing Missing Values
We can find the missing values using isnull() function.
Example of finding missing values:
data.isnull()
Output:
Example of removing missing values:
data.dropna()
Output:
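On a large data frame, the table of True and False values from isnull() is hard to read, so it is common to count the missing values per column instead. The sketch below also shows dropna() restricted to one column through the subset parameter. The people data frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values
people = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "YOB": [1990, np.nan, 1985],
    "Locality": ["Loc1", "Loc2", None],
})

missing_per_column = people.isnull().sum()  # number of missing values in each column
complete_rows = people.dropna()             # keep only rows with no missing value
with_yob = people.dropna(subset=["YOB"])    # only require YOB to be present

print(missing_per_column)
print(len(complete_rows), len(with_yob))
```

dropna() returns a new data frame, so people itself still has all three rows.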
2. Replacing Missing Values
We have different options for replacing the missing values. We can use the replace() function or the fillna() function to replace them with a constant value.
Example of replacing missing values using replace():
import numpy as np
data.replace({np.nan:0.00})
Output:
Example of replacing missing values using fillna():
data.fillna(3)
Output:
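A single constant rarely suits every column. fillna() also accepts a dictionary that maps column names to fill values, so each column can get its own replacement. This is a minimal sketch on hypothetical data, filling a numeric column with its mean and a text column with a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
people = pd.DataFrame({
    "YOB": [1990, np.nan, 1986],
    "Locality": ["Loc1", None, "Loc3"],
})

# Map each column to its own fill value
filled = people.fillna({"YOB": people["YOB"].mean(), "Locality": "Unknown"})
print(filled)
```

Whether the mean is a sensible replacement depends on the data; it is used here only to show the mechanism.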
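We can fill forward and fill backward as well. Older versions of pandas did this with fillna(method='pad') and fillna(method='backfill'); recent versions deprecate the method argument in favour of the ffill() and bfill() methods.
Example of replacing missing values by filling forward:
data.ffill()
Output:
Example of replacing missing values by filling backward:
data.bfill()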
Output:
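The difference between filling forward and filling backward is easiest to see on a short Series. The sketch below uses the ffill() and bfill() methods, which do the same job as the 'pad' and 'backfill' options of fillna(). The readings Series is hypothetical.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = readings.ffill()   # each gap takes the last value seen before it
backward = readings.bfill()  # each gap takes the next value seen after it

print(forward.tolist())
print(backward.tolist())
```

Forward fill leaves a gap at the very start of the data unfilled, and backward fill does the same for a gap at the very end, because there is no neighbouring value to copy.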
3. Removing Repeated Values
We can remove the repeated values by using the drop_duplicates() method.
Example of removing repeated values:
data.drop_duplicates()
Output:
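By default drop_duplicates() removes only rows that repeat in every column. The subset and keep parameters narrow the comparison to chosen columns and decide which copy survives. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly
people = pd.DataFrame({
    "Name": ["Asha", "Ben", "Asha", "Asha"],
    "Locality": ["Loc1", "Loc2", "Loc1", "Loc3"],
})

no_repeats = people.drop_duplicates()  # drops the exact repeat only
one_per_name = people.drop_duplicates(subset=["Name"], keep="last")  # last row per Name

print(no_repeats)
print(one_per_name)
```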
4. Removing Irrelevant Data
We can remove an irrelevant column by using the del statement.
Example of removing irrelevant data:
del data['YOB']
print(data)
Output:
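The del statement changes the data frame permanently. When the original should be kept, the drop() method is one alternative, because it returns a new data frame. The sketch below uses a hypothetical people data frame.

```python
import pandas as pd

# Hypothetical data
people = pd.DataFrame({
    "Name": ["Asha", "Ben"],
    "YOB": [1990, 1992],
    "Locality": ["Loc1", "Loc2"],
})

trimmed = people.drop(columns=["YOB"])  # new data frame without YOB

print(list(trimmed.columns))
print(list(people.columns))  # people still has the YOB column
```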
5. Renaming Columns
We have a function rename() to rename the columns.
Example of renaming columns:
print(data.rename(columns={'Name':'FirstName','Surname':'LastName'}))
Output:
Interview Questions on Data Cleansing using Python
1. Write a program to do the operation A*3+2 on matrix A.
Example of arithmetic operation on matrix:
import numpy as np
ar=np.array([[1,2,3],[4,5,6]])
print(ar*3+2)
Output:
[[ 5  8 11]
 [14 17 20]]
2. Write a program to reshape a 2×4 matrix into one having 4 rows.
Example of reshaping the matrix:
import numpy as np
ar=np.array([[1,3,6,2],[4,8,3,9]])
ar.reshape(4,2)
Output:
array([[1, 3],
       [6, 2],
       [4, 8],
       [3, 9]])
3. Write a program to remove the rows with null values.
Example of removing the null data:
data.dropna()
4. Write a program to fill the null values with 0 and make the changes reflect on the original data frame.
Example of replacing null values and affecting the original data frame:
data.fillna(0,inplace=True)
5. Write a program to replace the locality ‘Loc3’ of the above data frame with ‘Loc1’.
Example of replacing the data:
data['Locality']=data['Locality'].str.replace('Loc3','Loc1')
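One caveat with str.replace() is that it replaces the text wherever it occurs inside a value, so a locality such as 'Loc30' would also be changed. When only whole values should be replaced, the replace() method of the column is an option. The sketch below compares the two on a hypothetical Series.

```python
import pandas as pd

# Hypothetical values; "Loc30" contains "Loc3" as a substring
localities = pd.Series(["Loc3", "Loc30", "Loc1"])

substring = localities.str.replace("Loc3", "Loc1", regex=False)  # also changes "Loc30"
whole_value = localities.replace("Loc3", "Loc1")                 # only exact matches

print(substring.tolist())
print(whole_value.tolist())
```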
Conclusion
We have reached the end of the article. In this article, we learned different operations for cleansing data using the NumPy and Pandas modules. Hope you enjoyed reading this article. Happy learning!