Data Cleansing using Python

Here we are again with an article on handling data, which plays an important role in every domain. The raw data we receive usually needs to be cleansed to remove repeated values, missing values, and so on.

In this article, we will learn to clean data using the Python modules NumPy and Pandas. First, let us look more closely at data cleaning.

What is Data Cleansing?

Data Cleansing is the process of detecting and correcting raw data by identifying its incomplete, wrong, repeated, or irrelevant parts. For example, when we take a data set, we may need to remove null values, remove the parts of the data that our application does not need, and so on. Besides this, there are many applications where we need to process the obtained information before using it.

Installing required Modules

As said above we will be learning data cleansing using NumPy and Pandas modules. We can use the below statements to install the modules.

pip install numpy
pip install pandas

Data Cleansing using NumPy

Before learning about the operations we can perform using NumPy, let us look at the ways of creating NumPy arrays.

Creating NumPy array

There are many ways of creating numpy arrays with the np.array() function by specifying different parameters. Let us see each of them.

1. Creating a one dimensional numpy array

Example of creating a one dimensional numpy array:

import numpy as np
np.array([1,2,3,4,5])

Output:

array([1, 2, 3, 4, 5])

2. Creating a multi dimensional numpy array

Example of creating a multi dimensional numpy array:

import numpy as np
np.array([['a','b','c','d','e'],[1,2,3,4,5]])

Output:

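array([['a', 'b', 'c', 'd', 'e'],
       ['1', '2', '3', '4', '5']], dtype='<U11')

Note that the numbers were converted to strings, because a NumPy array holds values of a single data type. The string width shown in dtype can differ between platforms.

3. Specifying the dimension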

Example of creating a numpy array specifying dimension:

import numpy as np
np.array([1,2,3,4,5],ndmin=2)

Output:

array([[1, 2, 3, 4, 5]])

The ndmin parameter sets the minimum number of dimensions, so we got a 2 dimensional array rather than a 1D one.

4. Specifying the data type

Example of creating a numpy array specifying data type:

import numpy as np
np.array([1,2,3,4,5],dtype=float)

Output:

array([1., 2., 3., 4., 5.])

We can see that because we gave the 'dtype' parameter as float, the values of the array are float values (they have decimal points).
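
The same idea helps when cleaning: numbers that arrive as text can be converted after the array is created. The following is a minimal sketch, assuming a made-up array of numeric strings; astype() is the NumPy method that returns a converted copy.

```python
import numpy as np

# Hypothetical raw readings that arrived as text, e.g. from a CSV column.
raw = np.array(['1.5', '2', '3.25'])

# astype() returns a new array with the requested data type;
# the original array is left unchanged.
values = raw.astype(float)

print(raw.dtype.kind)   # 'U' means a Unicode string array
print(values)
print(values.sum())     # 6.75
```

If any entry cannot be parsed as a number, astype(float) raises a ValueError, so such entries need to be handled first.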

Getting Properties of the array

We can get different properties of a numpy array by using its attributes. Let us see them in the below example.
Example of getting array properties:

import numpy as np

arr=np.array([1,2,3,4,5])

print("array:",arr)
print("Type of arr:",type(arr))
print("Data type:",arr.dtype)
print("Dimension:",arr.ndim)
print("Shape:",arr.shape)
print("Size:",arr.size)

Output:

array: [1 2 3 4 5]
Type of arr: <class 'numpy.ndarray'>
Data type: int32
Dimension: 1
Shape: (5,)
Size: 5

Applying operations on NumPy array

Now we will see different operations we can perform on these arrays to modify the arrays.

1. Arithmetic Operations on array

We can do different operations like addition, subtraction, multiplication, and division. Let us see some examples.
Example of addition operation:

import numpy as np

a1=np.array([1,2,3,4,5])
a2=np.array([3,4,5,1,2])
print('a1+a2:')
print(a1+a2)

print('a1+2:')
print(a1+2)

Output:

a1+a2:
[4 6 8 5 7]
a1+2:
[3 4 5 6 7]

Example of subtraction operation:

import numpy as np

a1=np.array([1,2,3,4,5])
a2=np.array([3,4,5,1,2])
print('a1-a2:')
print(a1-a2)

print('a1-3:')
print(a1-3)

Output:

a1-a2:
[-2 -2 -2  3  3]
a1-3:
[-2 -1  0  1  2]

Example of multiplication operation:

import numpy as np

a1=np.array([[1,2,3],[4,5,6]])

print('a1*-1:')
print(a1*-1)

Output:

a1*-1:
[[-1 -2 -3]
 [-4 -5 -6]]

Example of division operation:

import numpy as np

a1=np.array([[1,2,3],[4,5,6]])

print('a1/2:')
print(a1/2)

Output:

a1/2:
[[0.5 1.  1.5]
 [2.  2.5 3. ]]

2. Reshaping the array

We can reshape an array using the reshape() method. The new shape must hold exactly the same number of values as the original array.
Example of reshaping the array:

import numpy as np

a1=np.array([[1,2,3],[4,5,6]])

a2=a1.reshape(3,2)
print(a2)

Output:

[[1 2]
 [3 4]
 [5 6]]
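
To make the size rule concrete, here is a small sketch using the same 2×3 array. It shows -1, which lets NumPy work out one dimension for us, and what happens when the requested shape does not match the number of values.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])   # 6 values in total

# -1 asks NumPy to work out that dimension from the array size.
a2 = a1.reshape(-1, 2)
print(a2.shape)   # (3, 2)

# A shape that needs 8 values cannot hold 6, so reshape() raises ValueError.
try:
    a1.reshape(4, 2)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("reshape failed:", err)
```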

3. Flattening the array

Flattening is the process of converting a multi dimensional array into a one dimensional array.
Example of flattening the array:

import numpy as np

a1=np.array([[1,2,3],['a','b','c']])

a2=a1.flatten()
print(a2)

Output:

['1' '2' '3' 'a' 'b' 'c']

4. Sorting the array

Example of sorting the array:

import numpy as np

a1=np.array([4,21,2,5,10,6,2])

a1.sort()
print(a1)

Output:

[ 2  2  4  5  6 10 21]
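
The operations above deal with array mechanics. As a sketch of how NumPy can also handle the missing and repeated values mentioned at the start of the article, the example below uses a made-up array of readings together with np.isnan() and np.unique().

```python
import numpy as np

# Hypothetical sensor readings with one missing value and some repeats.
readings = np.array([4.0, np.nan, 2.0, 4.0, 10.0, 2.0])

# np.isnan() marks the missing entries; ~ inverts the mask to keep the rest.
present = readings[~np.isnan(readings)]

# np.unique() drops repeated values and returns the result sorted.
cleaned = np.unique(present)

print(present.size)   # 5
print(cleaned)
```

Note that np.unique() also sorts the values, so it is not suitable when the original order matters.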

Data Cleansing using Pandas

When we use pandas, we work with data frames. Let us first see how to load a data frame.
Example of loading CSV file as data frame:

import pandas as pd

data =pd.read_csv('data.csv')
print(data)

Output:

[Image: the loaded data frame]
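
The data.csv file itself is not included in this article. To try the examples without it, the sketch below reads a small made-up table from a string instead. The column names follow the ones used later in the article, but the values are purely illustrative.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv; the real file is not shown in this article.
csv_text = """Name,Surname,YOB,Locality
Asha,Rao,1990,Loc1
Ben,Cole,,Loc3
Asha,Rao,1990,Loc1
Dina,,1985,
"""

# read_csv() accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # (4, 4)
print(data.dtypes)   # YOB becomes float64 because it contains a missing value
```

Empty fields in the text are read as missing values (NaN), which is what the following sections work with.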

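Now let us get information about the data using the describe() and rank() methods. describe() returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns, and rank() replaces each value with its rank within its column.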
Example of describe() function:

data.describe()

Output:

[Image: output of describe()]

Example of rank() function:

data.rank()

Output:

[Image: output of rank()]

Now let us see different operations we can use on the data frame.

1. Finding and Removing Missing Values

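We can find the missing values using the isnull() method, which returns a data frame of True/False values marking the missing entries. The dropna() method then removes the rows that contain missing values.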
Example of finding missing values:

data.isnull()

Output:

[Image: output of isnull()]

Example of removing missing values:

data.dropna()

Output:

[Image: the data frame after removing rows with missing values]
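
On a large data frame, a full True/False table is hard to read. The sketch below, built on made-up sample data, counts the missing values per column and shows the subset parameter of dropna(), which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the ones used in this article.
data = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Dina'],
    'YOB': [1990, np.nan, 1985],
    'Locality': ['Loc1', 'Loc3', None],
})

# isnull() gives a True/False table; sum() counts the True values per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# dropna() returns a new data frame; 'data' itself is unchanged.
complete_rows = data.dropna()

# subset= limits the check to the listed columns.
rows_with_yob = data.dropna(subset=['YOB'])

print(len(data), len(complete_rows), len(rows_with_yob))   # 3 1 2
```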

2. Replacing Missing Values

We have different options for replacing the missing values. We can use the replace() function or fillna() function to replace it with a constant value.

Example of replacing missing values using replace():

import numpy as np
data.replace({np.nan:0.00})

Output:

[Image: the data frame with missing values replaced by 0.00]

Example of replacing missing values using fillna():

data.fillna(3)

Output:

[Image: the data frame with missing values replaced by 3]

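We can fill forward and fill backward as well. The method parameter of fillna() is deprecated in recent pandas releases, so we use the ffill() and bfill() methods, which do the same job.
Example of replacing missing values by filling forward:

data.ffill()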

Output:

[Image: the data frame after filling forward]

Example of replacing missing values by filling backward:

data.bfill()

Output:

[Image: the data frame after filling backward]
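
Since the outputs above depend on the contents of data.csv, here is a self-contained sketch on made-up data. It uses ffill() and bfill(), the dedicated pandas methods for filling forward and backward, and also shows that fillna() accepts a dict to give each column its own replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
data = pd.DataFrame({
    'YOB': [1990.0, np.nan, 1985.0, np.nan],
    'Locality': ['Loc1', None, 'Loc3', 'Loc1'],
})

# A dict passed to fillna() sets a different replacement per column.
per_column = data.fillna({'YOB': data['YOB'].mean(), 'Locality': 'Unknown'})

# ffill() copies the last valid value downwards,
# bfill() copies the next valid value upwards.
forward = data.ffill()
backward = data.bfill()

print(per_column)
print(forward)
print(backward)
```

In the backward-filled result the last YOB value stays missing, because there is no later value to copy from.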

3. Removing Repeated Values

We can remove the repeated values by using the drop_duplicates() method.
Example of removing repeated values:

data.drop_duplicates()

Output:

[Image: the data frame after removing repeated rows]
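
By default, drop_duplicates() treats a row as repeated only when all of its columns match an earlier row. The sketch below, on made-up data, shows how to inspect the repeats first with duplicated() and how the subset and keep parameters change what counts as a repeat.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical,
# rows 0 and 3 share a name but differ in locality.
data = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Asha', 'Asha'],
    'Locality': ['Loc1', 'Loc3', 'Loc1', 'Loc2'],
})

# duplicated() flags every row that repeats an earlier one.
print(data.duplicated().tolist())   # [False, False, True, False]

# By default all columns must match and the first occurrence is kept.
unique_rows = data.drop_duplicates()

# subset= compares only the listed columns; keep='last' keeps the final match.
unique_names = data.drop_duplicates(subset=['Name'], keep='last')

print(unique_rows.index.tolist())    # [0, 1, 3]
print(unique_names.index.tolist())   # [1, 3]
```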

4. Removing Irrelevant Data

We can remove an irrelevant column by using the del statement. Note that del changes the data frame itself.

Example of removing irrelevant data:

del data['YOB']
print(data)

Output:

[Image: the data frame without the YOB column]
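
An alternative to del is the drop() method, which returns a new data frame instead of changing the original. The following is a minimal sketch on made-up data; the 'Age' column is a hypothetical name used only to show the errors parameter.

```python
import pandas as pd

# Hypothetical sample data; column names follow the ones used in this article.
data = pd.DataFrame({
    'Name': ['Asha', 'Ben'],
    'YOB': [1990, 1985],
    'Locality': ['Loc1', 'Loc3'],
})

# drop() returns a new data frame and leaves 'data' untouched.
without_yob = data.drop(columns=['YOB'])

# errors='ignore' skips names that are not present instead of raising KeyError.
trimmed = data.drop(columns=['YOB', 'Age'], errors='ignore')

print(list(without_yob.columns))   # ['Name', 'Locality']
print(list(data.columns))          # ['Name', 'YOB', 'Locality']
```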

 

5. Renaming Columns

We have a function rename() to rename the columns.

Example of renaming columns:

print(data.rename(columns={'Name':'FirstName','Surname':'LastName'}))

Output:

[Image: the data frame with the columns renamed to FirstName and LastName]
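
One detail worth seeing in code is that rename() returns a renamed copy, so the original data frame keeps its old column names unless we store the result. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical one-row data frame using the column names from the example above.
data = pd.DataFrame({'Name': ['Asha'], 'Surname': ['Rao']})

# rename() returns a renamed copy; assign it to keep the new names.
renamed = data.rename(columns={'Name': 'FirstName', 'Surname': 'LastName'})

print(list(data.columns))      # ['Name', 'Surname']
print(list(renamed.columns))   # ['FirstName', 'LastName']
```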

 

Interview Questions on Data Cleansing using Python

1. Write a program to do the operation A*3+2 on matrix A.

Example of arithmetic operation on matrix:

import numpy as np

ar=np.array([[1,2,3],[4,5,6]])

print(ar*3+2)

Output:

[[ 5  8 11]
 [14 17 20]]

2. Write a program to reshape a 2×4 matrix into one having 4 rows.

Example of reshaping the matrix:

import numpy as np

ar=np.array([[1,3,6,2],[4,8,3,9]])

ar.reshape(4,2)

Output:

array([[1, 3],
       [6, 2],
       [4, 8],
       [3, 9]])

3. Write a program to remove the rows with null values.

Example of removing the null data:

data.dropna()

4. Write a program to fill the null values with 0 and make the changes reflect on the original data frame.

Example of replacing null values and affecting the original data frame:

data.fillna(0,inplace=True)

5. Write a program to replace the locality ‘Loc3’ of the above data frame with ‘Loc1’.

Example of replacing the data:

data['Locality']=data['Locality'].str.replace('Loc3','Loc1')
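
One thing to keep in mind is that str.replace() works on substrings, so it would also change a value such as 'Loc30'. The sketch below, using a made-up Locality column, compares it with Series.replace(), which swaps only the cells that match exactly.

```python
import pandas as pd

# Hypothetical localities; 'Loc30' is included to show the difference.
data = pd.DataFrame({'Locality': ['Loc3', 'Loc1', 'Loc30']})

# str.replace() substitutes the text wherever it occurs inside a value.
by_substring = data['Locality'].str.replace('Loc3', 'Loc1', regex=False)

# Series.replace() with plain strings swaps only cells equal to 'Loc3'.
by_value = data['Locality'].replace('Loc3', 'Loc1')

print(by_substring.tolist())   # ['Loc1', 'Loc1', 'Loc10']
print(by_value.tolist())       # ['Loc1', 'Loc1', 'Loc30']
```

Which one fits depends on the data: use str.replace() to fix part of a value and replace() to swap whole values.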

Conclusion

We have reached the end of the article. In this article, we learned different operations for cleansing data using the NumPy and Pandas modules. Hope you enjoyed reading this article. Happy learning!
