Regular Expressions in Python

Have you ever needed to find a pattern or search for a word in a piece of text? Python provides a module for exactly these tasks. In this article, we will learn about this module, called 're', and its different methods. Let us start with an introduction to the re module and regular expressions in Python.

Introduction to re module in Python

A regular expression is a sequence of characters that defines a search pattern. It helps in finding and replacing strings. Python provides a module called re (short for regular expression) for this purpose. We can import this module by writing the code below.

import re

This module contains different methods in it to do different operations of regular expressions in Python. Before we discuss these methods, let us first discuss the metacharacters we need to give as inputs to these methods.

Metacharacters in Python Regex

In Python Regex, a character is either a metacharacter or a regular character. Metacharacters are interpreted in a special way, that is, they have a special meaning, while regular characters simply match themselves.

There are 11 metacharacters and these are: [], (), ., ^, $, *, +, ?, {}, \, and |. Let us discuss each of these in detail.

1. Square brackets []

This matches a single character from the set of characters provided in the brackets.
For example,

a. [abc] checks a match for ‘a’,’b’,’c’.

String Match
trw No match
b 1 match
abc 3 matches
abc def cb 5 matches

b. [e-j] is the same as [efghij] and it checks a match for ‘e’,’f’,’g’,’h’,’i’,’j’.
c. [1-5] is the same as [12345] and it checks for ‘1’,’2’,’3’,’4’,’5’
d. [a-ct] is the same as [abct] and checks for ‘a’,’b’,’c’,’t’
e. [c-fx-z] is the same as [cdefxyz] and checks for ‘c’,’d’,’e’,’f’,’x’,’y’,’z’
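
The character classes above can be tried out with re.findall(), which is covered later in this article. The following is a minimal sketch; the first sample string comes from the table above, and the second ('cat tub') is made up for illustration.

```python
import re

# [abc] matches any single 'a', 'b' or 'c'
abc_hits = re.findall(r'[abc]', 'abc def cb')
print(abc_hits)      # ['a', 'b', 'c', 'c', 'b']

# A range and a single character can be mixed in one class
mixed_hits = re.findall(r'[a-ct]', 'cat tub')
print(mixed_hits)    # ['c', 'a', 't', 't', 'b']
```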

2. Group ()

Parentheses are used to group subpatterns. They mark a block of the pattern that is matched as a unit. For example,

a. (s|t)ion checks for either s or t followed by ion

String Match
ion No match
operation 1 match (tion)
expansion 1 match (sion)
action under supervision 2 matches (tion, sion)
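
One detail worth knowing when trying this pattern in Python: with a capturing group, re.findall() returns only the group contents, not the whole match. The sketch below shows this, and also uses the non-capturing form (?:...), which is not covered in this article, to get the full matches.

```python
import re

text = 'action under supervision'

# With a capturing group, findall() returns the group contents only
group_only = re.findall(r'(s|t)ion', text)
print(group_only)    # ['t', 's']

# (?:...) groups without capturing, so findall() returns the whole match
whole_matches = re.findall(r'(?:s|t)ion', text)
print(whole_matches) # ['tion', 'sion']
```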

3. Period .

The period (.) matches any single character except the newline. When it is used inside square brackets, it matches a literal dot.

4. Caret ^

This is used to check if a string starts with the specified character or a set of characters.
For example,
a. ^a checks if the string starts with ‘a’. And when we check with

String Match
abc 1 match
bca No match

b. ^xy checks if the string starts with ’xy’. On checking with

String Match
xyz 1 match
xxy No match

But when we use ^ in the square brackets ([^]), then it checks for all other characters except the ones specified in the brackets. For example,

a. [^a-c] checks for all characters other than ‘a’,’b’,’c’
b. [^1234] checks for all the characters other than ‘1’,’2’,’3’,’4’
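
The two roles of the caret can be compared side by side. This is a small sketch using re.search() and re.findall(), both discussed later; the string 'abcdef' is made up for illustration.

```python
import re

# ^ outside brackets anchors the pattern to the start of the string
starts_1 = bool(re.search(r'^xy', 'xyz'))
starts_2 = bool(re.search(r'^xy', 'xxy'))

# ^ as the first character inside brackets negates the class
not_abc = re.findall(r'[^a-c]', 'abcdef')

print(starts_1, starts_2, not_abc)   # True False ['d', 'e', 'f']
```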

5. Dollar $

It is used to check if a string ends with the specified character(s). For example,

a. s$ checks if the string ends with ‘s’. When checked with

String Match
Geeks  1 match
abc No match
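
Note that the dollar sign goes after the character it anchors. A minimal sketch of the table above:

```python
import re

# s$ matches an 's' only when it is the last character of the string
ends_geeks = re.search(r's$', 'Geeks') is not None
ends_abc = re.search(r's$', 'abc') is not None

print(ends_geeks, ends_abc)   # True False
```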

6. Star *

This is used to match zero or more occurrences of the character preceding it. For example,

a. in*g matches ‘ig’,’ing’,’inng’,’innng’ and so on. And on checking with

String Match
dig 1 match
doing 1 match
pinng 1 match
ingredients  1 match
inaugurate Zero matches (as ‘au’ lies between in and g)

b. [xy]* matches the empty string, ‘x’,’y’,’xy’,’yx’,’xyx’ and so on
c. (xy)* matches the empty string, ‘xy’,’xyxy’,’xyxyxy’ and so on.

7. Plus +

This matches one or more occurrences of the character preceding it.
For example,
a. in+g matches ’ing’,’inng’,’innng’ and so on, but not ‘ig’. And on checking with

String Match
dig No match (there is no ‘n’ between ‘i’ and ‘g’)
doing 1 match
pinng 1 match
ingredients  1 match
inaugurate Zero matches (as ‘au’ lies between in and g)

8. Question Mark ?

This matches zero or one occurrence of the character preceding it.
For example,
a. in?g matches ‘ig’ and ’ing’. And on checking with

String Match
swig 1 match
saying 1 match
pinng No match
ingine 1 match
instagram Zero matches (as ‘sta’ lies between in and g)
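
The difference between *, + and ? is easiest to see by running the three patterns against the same words. The sketch below does that for three of the words from the tables above; the variable names are only for illustration.

```python
import re

words = ['dig', 'doing', 'pinng']

# For each pattern, record whether it is found in each word
quantifier_results = {
    pattern: [bool(re.search(pattern, w)) for w in words]
    for pattern in (r'in*g', r'in+g', r'in?g')
}

for pattern, found in quantifier_results.items():
    print(pattern, found)
# in*g [True, True, True]
# in+g [False, True, True]
# in?g [True, True, False]
```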

9. Braces {}

If we give two numbers {m,n}, then it means it checks for at least m and at most n repetitions of the character preceding.
For example,
a. l{2,4} checks for ‘ll’,’lll’,’llll’.

String Match
living No match
illegal 1 match
lilly llli 2 matches (ll in lilly, lll in llli)
llllls 1 match (llll)

We can also put a character class in square brackets to the left of the curly brackets. For example, [1-7]{2,3} matches a run of at least 2 and at most 3 digits, each between 1 and 7.

String Match
abc1xyz2 No match
12pqr 1 match (12)
12 345671 3 matches (12, 345, 671)
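
The exact text matched by a {m,n} pattern can be checked with re.findall(). A small sketch, using strings from the two tables above:

```python
import re

# l{2,4} takes as many l's as it can, up to four
l_runs = re.findall(r'l{2,4}', 'lilly llli llllls')
print(l_runs)       # ['ll', 'lll', 'llll']

# [1-7]{2,3} takes runs of two or three digits between 1 and 7
digit_runs = re.findall(r'[1-7]{2,3}', '12 345671')
print(digit_runs)   # ['12', '345', '671']
```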

10. Alternation |

This is used for OR operation, that is, it checks for the expression either before or after the metacharacter. For example,
a. ab|cd checks for either ab or cd.

String Match
adc No match
abc 1 match
abcdefab 3 matches (ab, cd, ab)

11. Backslash \

The backslash is used to escape characters, including metacharacters. If a metacharacter follows the backslash, it loses its special meaning and is matched literally.
For example, \*b searches if the string contains * followed by b.
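
As a short sketch of escaping in practice: the first search uses \* by hand, and the second uses re.escape(), a helper in the re module that adds the backslashes automatically. The sample strings are made up for illustration.

```python
import re

# \* matches a literal asterisk instead of acting as a quantifier
literal_star = re.search(r'\*b', 'a*b')
print(literal_star.group())   # *b

# re.escape() escapes the metacharacters in a string for us
escaped = re.escape('1+1=2')
found_sum = re.search(escaped, 'so 1+1=2 holds')
print(found_sum.group())      # 1+1=2
```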

Special Sequences

There are various sequences present in Regex by using the backslash. The following table shows some of the important ones.

Sequence  Description
\t,\n,\r,\f Tab, newline, return, form feed
\A Matches only at the start of the string
\b Matches at a word boundary, that is, at the beginning or end of a word
\B Matches at a position that is not a word boundary
\d Matches a single decimal digit
\D Matches any character that is not a decimal digit
\s Matches any whitespace character
\S Matches any non-whitespace character
\w Matches any alphanumeric character or underscore
\W Matches any character that is not alphanumeric or underscore
\Z Matches only at the end of the string
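
A few of these sequences in action, as a rough sketch. The sample text is made up, and raw strings (r'...') are used so that Python passes the backslashes through to the re module unchanged.

```python
import re

text = 'cat scatter 42 cats'

whole_word = re.findall(r'\bcat\b', text)   # 'cat' as a complete word
inside_word = re.findall(r'\Bcat', text)    # 'cat' not at the start of a word
numbers = re.findall(r'\d+', text)          # runs of digits
tokens = re.findall(r'\w+', text)           # runs of word characters

print(whole_word)    # ['cat']
print(inside_word)   # ['cat']  (the one inside 'scatter')
print(numbers)       # ['42']
print(tokens)        # ['cat', 'scatter', '42', 'cats']
```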

Now let us see the functions in this module.

Python Regex Function compile()

This function compiles a regular expression string into a pattern object that can be reused for different operations. For example,
Example of compile() function:

# the class [1234] matches any one of '1', '2', '3', '4'
patrn=re.compile('[1234]')
print(patrn.findall("abcd123"))

Output:

['1', '2', '3']

In this example, we created a pattern that matches '1', '2', '3' or '4'. Then we used the findall() function, which we will discuss later in this article. It finds all the matches and returns them as a list.

Let us see some more examples.

Example of compile() function:

# matches every alphanumeric character in the string
patrn=re.compile(r'\w')
print(patrn.findall("abc@123"))

Output:

['a', 'b', 'c', '1', '2', '3']

Example of compile() function:

# matches 'x','xy','xyy', and so on with the string 
p = re.compile('xy*')
print(p.findall("xyxxyyxyyy"))

Output:

['xy', 'x', 'xyy', 'xyyy']
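
A compiled pattern object offers the same operations as the module-level functions (findall, sub, split and so on), so one pattern can be reused many times. The sketch below assumes a made-up pattern name, vowel, and a made-up sample word.

```python
import re

# Compile once, with a flag, then reuse the pattern object
vowel = re.compile(r'[aeiou]', flags=re.IGNORECASE)

found = vowel.findall('Apple')
masked = vowel.sub('*', 'Apple')
pieces = vowel.split('Apple')

print(found)    # ['A', 'e']
print(masked)   # *ppl*
print(pieces)   # ['', 'ppl', '']
```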

Python Regex Function match()

This function takes two arguments, namely, the pattern and the string. It tries to match the pattern only at the beginning of the string and returns a match object if successful; otherwise it returns None. Its syntax is

re.match(pattern, string, flags=0)

Here, flags is optional and modifies how the pattern is matched, for example re.IGNORECASE.
For example,
Example of match() function:

print(re.match('on',"one"))

Output:

<re.Match object; span=(0, 2), match='on'>

Example of match() function:

print(re.match('[123]',"centre"))

Output:

None
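
Because match() only looks at the beginning of the string, it can return None even when the pattern occurs later on. The following sketch contrasts it with search(), discussed next, and with re.fullmatch(), which requires the pattern to cover the whole string. The sample text is made up.

```python
import re

text = 'one two'

at_start = re.match(r'two', text)       # None: 'two' is not at the start
anywhere = re.search(r'two', text)      # found later in the string
whole = re.fullmatch(r'one', text)      # None: the pattern must cover everything
ignore_case = re.match(r'ONE', text, flags=re.IGNORECASE)

print(at_start, anywhere.span(), whole, ignore_case.group())
# None (4, 7) None one
```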

Python Regex Function search()

This function also takes two arguments, namely, the pattern and the string. It scans the string for the pattern and returns a match object for the first match found, wherever it occurs. If no match is found, it returns None. Its syntax is

re.search(pattern, string, flags=0)

Let us see some examples.

Example of search() function:

print(re.search('aa?abc','aaabc'))

Output:

<re.Match object; span=(0, 5), match='aaabc'>

Example of search() function:

print(re.search('aa?bca?','aabca aabca').group())

Output:

aabca

In this example, we are using the group() function to get the part of the string where the match was found. We can observe that although there are two possible matches here, we got only the first one.

Let us search the occurrence of ‘oo’ in the word.

Example of search() function:

print(re.search(r'\w+o{2}\w*','snoopy').group())

Output:

snoopy

In this example, we are searching for two consecutive occurrences of 'o'. The \w+ means there must be one or more alphanumeric characters before 'oo', and \w* means any number of alphanumeric characters, including none, may follow.

If there is no match, search() returns None, and calling group() on it raises an error because a None object has no group() method. For example,

Example of search() function giving error on grouping None:

print(re.search('123','aabca aabca').group())

Output:

AttributeError Traceback (most recent call last)
<ipython-input-35-da154ee63ccb> in <module>
----> 1 print(re.search('123','aabca aabca').group())
AttributeError: 'NoneType' object has no attribute 'group'
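
One way to avoid this error is to check the result before calling group(). The helper below, first_match, is a hypothetical name used only for this sketch; it is not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the matched text, or `default` when nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m is not None else default

missing = first_match(r'123', 'aabca aabca')
present = first_match(r'aa?bca?', 'aabca aabca')

print(missing)   # None
print(present)   # aabca
```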

Python Regex Function split()

This function splits the string at each occurrence of the pattern and returns the pieces as a list. Its syntax is

re.split(pattern, string, maxsplit=0, flags=0)

Here, if maxsplit is not given, it defaults to zero, which means there is no limit on the number of splits. If a positive number is provided, at most that many splits are made.
For example,
Example of split() function:

print(re.split(r'\s','Hello. How are you?'))

Output:

['Hello.', 'How', 'are', 'you?']

In this example, we are splitting based on space. Let us see the output when we give the maxsplit as 1.
Example of split() function:

print(re.split(r'\s','Hello. How are you?',maxsplit=1))

Output:

['Hello.', 'How are you?']

We can see that the splitting happened only once.

Let us see one more example.
Example of split() function:

print(re.split('[aeiou]','Hello. How are you?',flags=re.IGNORECASE))

Output:

['H', 'll', '. H', 'w ', 'r', ' y', '', '?']

In this example, we are splitting on the vowels. We are also using the re.IGNORECASE flag so that both uppercase and lowercase vowels are matched.

Python Regex Function sub()

The name of this function stands for substitute. It finds the text matching the pattern and replaces it with the given replacement expression. Its syntax is

re.sub(pattern, repl, string, count=0, flags=0)

This function searches for the pattern in the string and replaces each match with the argument 'repl'. The count argument is the maximum number of replacements to make; the default of 0 replaces every match.
For example,
Example of sub() function:

print(re.sub('a','e','Walcoma!'))

Output:

Welcome!

Example of sub() function:

print(re.sub('&','and','Bread & Butter'))

Output:

Bread and Butter
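
The effect of the count argument, and of using a pattern rather than a fixed string, can be seen in this small sketch. The sample strings are made up for illustration.

```python
import re

text = 'a-b-c-d'

replace_all = re.sub(r'-', '+', text)             # count=0 replaces every match
replace_two = re.sub(r'-', '+', text, count=2)    # only the first two matches

# A pattern can match runs of characters, here any run of whitespace
collapsed = re.sub(r'\s+', ' ', 'too    many   spaces')

print(replace_all)   # a+b+c+d
print(replace_two)   # a+b+c-d
print(collapsed)     # too many spaces
```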

Python Regex Function subn()

This function is similar to the sub() function. The only difference is that it returns a tuple containing the new string and the number of replacements made. Its syntax is

re.subn(pattern, repl, string, count=0, flags=0)

Let us see an example.
Example of subn() function:

tup=re.subn(' ','_','This is a string')
print(tup)
print("New string:",tup[0])
print('Number of matchings is:',tup[1])

Output:

('This_is_a_string', 3)
New string: This_is_a_string
Number of matchings is: 3

Python Group Expression in Regex

We have seen the group() function previously. Called without arguments, it returns the whole text matched by the pattern. For example,
Example of group() function:

matcher=re.search(r'([\w.-]+)+((\d+))','His Phone No is +91 12345678')
matcher.group()

Output:

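'91'

The pattern contains three pairs of parentheses, so it has three subgroups (the third is nested inside the second and captures the same text). We get the text matched by each subgroup by passing its index, as shown below.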
Example of group() function:

matcher.group(1)

Output:

'9'

Example of group() function:

matcher.group(2)

Output:

'1'
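
The pattern above splits '91' between its groups, which is probably not what one wants from a phone number. As a sketch only, here is a simpler pattern (an assumption, not the pattern used above) that captures the country code and the number as two separate groups, together with the groups() method, which returns all subgroups at once.

```python
import re

# Hypothetical alternative pattern: '+', country code, whitespace, number
m = re.search(r'\+(\d+)\s(\d+)', 'His Phone No is +91 12345678')

print(m.group(0))   # +91 12345678  (same as m.group())
print(m.group(1))   # 91
print(m.group(2))   # 12345678
print(m.groups())   # ('91', '12345678')
```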

Python Regex findall() Function

This function is similar to the search() function, except that it does not stop at the first match. It finds all the non-overlapping matches and returns them as a list.
For example,

Example of findall() function:

matcher=re.findall(r'item[1234]','item1 item4 item 2 Item3 item6')
print(type(matcher))
for i in matcher:
    print(i)

Output:

<class 'list'>
item1
item4

We can divide the match into groups by giving the pattern using parentheses. For example,

Example of findall() function:

matcher=re.findall(r'(\d+)\s(\d+)','91 12345,91 345,91 23456)')

for i in matcher:
    print(i)

Output:

('91', '12345')
('91', '345')
('91', '23456')

We can also give options like re.IGNORECASE, re.MULTILINE, re.DOTALL, etc. For example,
Example of findall() function with options:

matcher=re.findall(r'!','Hello! \nWelcome to PythonGeels!\n Happy learning!\t',re.MULTILINE)

for i in matcher:
    print(i)

Output:

!
!
!
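
The flag makes no visible difference for a pattern like '!', so here is a sketch where the flags do change the result. It reuses the item string from the earlier findall() example; the three-line string is made up for illustration.

```python
import re

text = 'item1 item4 item 2 Item3 item6'

# re.IGNORECASE also accepts 'Item3'
any_case = re.findall(r'item[1234]', text, re.IGNORECASE)
print(any_case)     # ['item1', 'item4', 'Item3']

lines = 'Hello!\nWelcome!\nHappy learning!'

# Without re.MULTILINE, ^ matches only at the start of the whole string
first_only = re.findall(r'^\w+', lines)
# With re.MULTILINE, ^ also matches at the start of every line
every_line = re.findall(r'^\w+', lines, re.MULTILINE)

print(first_only)   # ['Hello']
print(every_line)   # ['Hello', 'Welcome', 'Happy']
```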

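The metacharacters *, + and ? are called greedy because they match as many characters as possible. A greedy pattern such as <.*> would therefore treat the whole string '<Python> and <c> and <Java>' as a single match. Adding ? after the quantifier, as in the example below, makes it stop at the first '>' instead.
Example of a non-greedy quantifier: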

matcher=re.findall(r'(<.*?>)','<Python> and <c> and <Java>')

matcher

Output:

['<Python>', '<c>', '<Java>']

Example of greedy metacharacter:

matcher=re.findall(r'</?\w+>','<Python> and <c> and <Java>')

matcher

Output:

['<Python>', '<c>', '<Java>']

When ? is placed directly after a quantifier such as *, that quantifier becomes non-greedy and matches as few characters as possible, as shown below.
Example of non-greedy metacharacter:

matcher=re.findall(r'(x*?)y','xxxyx')

matcher

Output:

['xxx']
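
To make the contrast concrete, the sketch below runs a greedy and a non-greedy version of the same pattern on the tag string used earlier.

```python
import re

text = '<Python> and <c> and <Java>'

# Greedy: .* runs to the last '>' in the string
greedy = re.findall(r'<.*>', text)
# Non-greedy: .*? stops at the first '>' it can
lazy = re.findall(r'<.*?>', text)

print(greedy)   # ['<Python> and <c> and <Java>']
print(lazy)     # ['<Python>', '<c>', '<Java>']
```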

Applications of Regular Expressions in Python

We learned a lot about Python regular expressions and different commands we can apply to them to find different patterns. Let us see some applications where these are used.

1. They are used in search engines

2. They are used in text processing tools like sed and AWK

3. They are used to find and replace words in word processors and text editors

4. They are also used in lexical analysis

Interview Questions on Python Regular Expressions

Q1. Write a program that builds a pattern for the vowels, finds every vowel in a text, and prints them.
Ans. Below is an example of finding all the vowels:

patrn=re.compile('[aeiou]')
print(re.findall(patrn,'all the vowels'))

Output:

['a', 'e', 'o', 'e']

Q2. Write a function to split the text into a maximum of 3 parts of digits based on the occurrence of 0.
Ans. Below is the example of splitting string:

print(re.split('0','12043234065043260',maxsplit=2))

Output:

['12', '43234', '65043260']

Q3. Write a program to substitute the first 3 occurrences of ‘or’ by a comma.
Ans. Below is the example of substituting characters in a string:

print(re.sub('or',',','a or b or c or d or e or t ',count=3))

Output:

a , b , c , d or e or t

Q4. Write a function to count all the special characters in the string.
Ans. Below is the example of counting the special characters in a string:

matcher=re.findall(r'[^\w]','abc@123#tre$')
len(matcher)

Output:

3

Q5. Write a program to find the mail id from the given text.
Ans. The important characters in an email id are '@' and '.'. So, the pattern is a variable number of characters at the start, followed by '@', then another run of characters, then '.', followed by some more characters.
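
As a rough sketch of this idea, the snippet below applies such a pattern to a made-up placeholder address, user@example.com, and checks for None before calling group(). Real email addresses can be more varied than this simple pattern allows.

```python
import re

# user@example.com is a made-up placeholder address
text = 'The mail id is user@example.com'

m = re.search(r'[\w.-]+@[\w-]+\.\w+', text)
address = m.group() if m is not None else None

print(address)   # user@example.com
```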

Example of mail id from the string:

matcher=re.search(r'[\w.-]+@[\w-]+\.[\w]+','The mail id is [email protected]')
matcher.group()

Output:

Conclusion

In this article, we learned about regular expressions, metacharacters, and different methods in Python re module.

Hope you enjoyed reading this article. Happy learning!
