Regular Expression in Python - Complete Tutorial

Introduction

Have you ever seen an error message when logging in to a mail server because you typed the wrong mail ID? Or been advised to choose a strong password when creating an account? Every internet user, me included, has faced this at least once. A question may arise in your mind: how does the computer know whether a password is strong, or whether a mail ID is valid?

This is where Regular Expressions come in. But how, right? You'll get the answer shortly. In this tutorial, you'll learn about Regular Expressions in Python.

What is Regular Expression in Python?

A Regular Expression, or RE, is a sequence of characters that defines a pattern; the pattern specifies the set of strings that match it. It helps you search text for a specific pattern.

For example, a Gmail ID should look like this: username@gmail.com. As per Gmail Help, a username can contain letters (a-z), numbers (0-9), and periods (.). Usernames cannot contain an equal sign (=), apostrophe ('), ampersand (&), underscore (_), dash (-), comma (,), plus sign (+), brackets (<,>), or more than one period (.) in a row. Usernames can begin or end with non-alphanumeric characters except periods (.).

You have to follow these rules when creating a username on Gmail. In such cases, Regular Expressions are a helpful time saver for checking that an email ID is valid, because a single pattern can express all of these rules. We'll discuss each piece one by one in the upcoming sections.
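
As a small preview, here is a rough sketch of how those username rules could be written as one pattern. It is only an illustration under simplifying assumptions: the name is_gmail_id is made up for this example, the pattern accepts lowercase letters and digits only, it requires the username to start and end with a letter or digit, and it ignores any length limits. The symbols it uses are explained in the later sections.

```python
import re

# Hypothetical, simplified pattern (not Gmail's real validation):
# runs of lowercase letters/digits separated by single periods,
# followed by the literal text "@gmail.com".
gmail_regex = re.compile(r'^[a-z0-9]+(\.[a-z0-9]+)*@gmail\.com$')

def is_gmail_id(text):
    """Return True if text fits the simplified rules above."""
    return gmail_regex.search(text) is not None

print(is_gmail_id('john.smith@gmail.com'))   # True
print(is_gmail_id('john..smith@gmail.com'))  # False (two periods in a row)
print(is_gmail_id('john_smith@gmail.com'))   # False (underscore not allowed)
```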

Finding Patterns Without Regular Expressions

Imagine you want to find all the phone numbers in a string with a program. A US phone number looks like (535) 555-1348 or 535-555-1348. Let's see the steps you would follow to find a phone number in the second format.

👉Steps:

  1. Check that the text is exactly 12 characters long, as in the second format mentioned above
  2. Then check that the first three characters (the area code) are digits
  3. Now check for the hyphen after the area code
  4. Again check for three numeric digits
  5. Next, check for the hyphen again
  6. At last, check for four more numeric digits

If all the checks pass, the function returns True and the program prints the matched phone number; otherwise, the function returns False.

Code


def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

if __name__ == '__main__':
    msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341'
    for i in range(len(msg)):
        block = msg[i:i+12]
        if isPhoneNumber(block):
            print('Phone number found: ' + block)

Output🖥

Phone number found: 415-444-1049
Phone number found: 415-444-2341

Finding Patterns with Regular Expressions

In the previous section, we extracted phone numbers from a string, and it worked. But if we wanted to find both formats of the phone number, the code would get more complicated. This is where Regular Expressions come in handy. Let's see how.

Follow these simple steps👇:

  1. Import the regular expression (regex for short) module with import re,
  2. Create a regex object with the re.compile() function (use a raw string),
  3. Pass the string you want to search to the regex object's search() method. It returns a match object,
  4. Call the match object's group() method to get the actual matched text.

Code


import re

# By putting an r before the first quote of the string value,
# you mark the string as a raw string, in which Python does not
# treat backslashes as escape characters.
phoneNum = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341'
mo = phoneNum.search(msg)
print('Phone number found: ' + mo.group())

Output🖥

Phone number found: 415-444-1049


You can write the above pattern more compactly: \d{3}-\d{3}-\d{4} is the same as \d\d\d-\d\d\d-\d\d\d\d. It's your choice which one you would like to use🙆.
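
If you'd like to convince yourself that the two spellings behave the same, a quick check like the following could help. The variable names are only for this illustration.

```python
import re

msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341'

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

long_match = long_form.search(msg).group()
short_match = short_form.search(msg).group()

print(long_match)                 # 415-444-1049
print(long_match == short_match)  # True
```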

I think by now you have understood that Regular Expressions let us solve complex problems with less code😊.

In the last example, you may have noticed that I put two phone numbers in the string but the output shows only one. That's because the search() method returns only the first match. We'll see how to get every match with the findall() method in a later section. Let's try more examples.

Parentheses

We can create different groups by adding parentheses in the regex. Example: (\d\d\d)-(\d\d\d-\d\d\d\d). You can use the group() match object method to grab the matching text from just one group. Let's have a look at the code for a better understanding.

Code


import re

phoneNum1 = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
msg = 'Contact me at 415-444-2341'
mo1 = phoneNum1.search(msg)
# mo1.groups() returns a tuple of multiple values
areaCode, mainNum = mo1.groups()
print('Tuple: ', mo1.groups())
print('AreaCode: ', areaCode)
print('Actual Number: ', mainNum)

# Another Method
print('~Another Method~')
print('AreaCode: ', mo1.group(1))
print('Actual Number: ', mo1.group(2))
print('Compact Form: ', mo1.group())

Output🖥

Tuple:  ('415', '444-2341')
AreaCode:  415
Actual Number:  444-2341
~Another Method~
AreaCode:  415
Actual Number:  444-2341
Compact Form:  415-444-2341
 

The escape(\) characters

Suppose someone enters a phone number like this: (415) 444-2341. Parentheses have a special meaning in a regex (they create groups), so to match literal parentheses we have to escape them as \( and \) in the raw string passed to the re.compile() function.

Code


import re

phoneNum2 = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
msg = 'Contact me at (415) 444-1049.'
mo2 = phoneNum2.search(msg)
print('Phone number found: ', mo2.group())

Output🖥

Phone number found:  (415) 444-1049

The Pipe(|) Character

The pipe is the | character on your keyboard. It is used to match one of several alternative expressions. For example, the regular expression 'Tom Cruise|Kevin Bacon' will match either 'Tom Cruise' or 'Kevin Bacon'.

If both names occur in the search string, the one that occurs first is returned as the match object.

Code


import re

movieCharacter = re.compile(r'Tom Cruise|Kevin Bacon')
mo1 = movieCharacter.search('Tom Cruise is my favourite and Kevin Bacon too')
msg = 'Kevin Bacon and Tom Cruise both starred in the movie A Few Good Men'
mo2 = movieCharacter.search(msg)
print(mo1.group())
print(mo2.group())

Output🖥

Tom Cruise
Kevin Bacon


Note
You can match all the matching occurrences with the help of the
findall() method
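
A minimal sketch of that note, using names chosen just for this example:

```python
import re

actor_regex = re.compile(r'Tom Cruise|Kevin Bacon')
msg = 'Kevin Bacon and Tom Cruise both starred in A Few Good Men'

# findall() returns every match, in the order they appear
all_names = actor_regex.findall(msg)
print(all_names)  # ['Kevin Bacon', 'Tom Cruise']
```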

Suppose you want to match any of Combat, Common, Commando, Computer, or Commodity in a search string. All of these words start with Com, so you can write that prefix only once and put the alternatives in parentheses. Let's see how it is done.

Code


import re

character = re.compile(r'Com(bat|mon|mando|puter|modity)')
mo1 = character.search('Four things are Common in a Computer')
print(mo1.group())

Output🖥

Common
 

Note
If you want to match an actual pipe character then use a backslash before it. 
e.g. \|.
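
For instance, a pattern with an escaped pipe might look like this (the sample text is invented for the illustration):

```python
import re

# \| matches a literal pipe character instead of meaning "or"
pipe_regex = re.compile(r'\d+\|\d+')
mo = pipe_regex.search('Score board: 12|7 at half time')
print(mo.group())  # 12|7
```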

The question(?) mark

Sometimes you may want to match a pattern optionally. To do this, write the pattern as (pattern)? inside the re.compile() function. The question mark makes the group inside the parentheses optional.

For example, take the two words "Impossible" and "possible". If you make the "Im" optional, the regex matches whether or not "Im" is present in the search string.

Let's see an example for a better understanding.

Code


import re

missionRegex = re.compile(r'(Im)?possible')
mo1 = missionRegex.search('Mission Impossible')
print(mo1.group())

mo2 = missionRegex.search('Mission possible')
print(mo2.group())

Output🖥

Impossible
possible


Let's see another example.

Code


import re

contactRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = contactRegex.search('Contact me at 415-444-1049')
print(mo1.group())

mo2 = contactRegex.search('Contact me at 444-1049')
print(mo2.group())

Output🖥

415-444-1049
444-1049


In the above example, I've made the area code optional.
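
Combining the pipe and the escaped parentheses from the earlier sections, one pattern can cover both phone formats mentioned at the start of this tutorial. This is just one possible sketch; the name bothFormats is made up for the example.

```python
import re

# Area code is either "(ddd) " or "ddd-", followed by ddd-dddd
bothFormats = re.compile(r'(\(\d{3}\) |\d{3}-)\d{3}-\d{4}')

mo1 = bothFormats.search('Call (535) 555-1348 today')
mo2 = bothFormats.search('Call 535-555-1348 today')
print(mo1.group())  # (535) 555-1348
print(mo2.group())  # 535-555-1348
```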

The Star(*)

Star(*) matches zero or more repetitions of the preceding RE. The name zero-or-more suggests that the group that comes before the Star(*) can occur at least zero times and at most any number of times.

Code


import re

batRegex = re.compile(r'com(bat)*')
mo1 = batRegex.search('A com zone')
print(mo1.group())

mo2 = batRegex.search('A combat zone')
print(mo2.group())

mo3 = batRegex.search('A combatbatbatbat zone')
print(mo3.group())

Output🖥

com
combat
combatbatbatbat


For 'com', the (bat)* part of the regex matches zero instances of 'bat' in the string; for 'combat', the (bat)* matches one instance of bat in the string; and for 'combatbatbatbat', the (bat)* matches four instances of 'bat'.

The Plus(+) Sign

In Python regular expressions, the plus (+) sign means one or more: the group that comes before the + must occur at least once and can repeat any number of times.

First, we'll retry the previous example with + in place of *.

Code


import re

batRegex = re.compile(r'com(bat)+')
mo1 = batRegex.search('A com zone')
print(mo1.group())

mo2 = batRegex.search('A combat zone')
print(mo2.group())

mo3 = batRegex.search('A combatbatbatbat zone')
print(mo3.group())

Output🖥

Traceback (most recent call last):
  File "c:\Users\SUKHENDU\Desktop\Python\hello.py", line 5, in <module>
    print(mo1.group())
AttributeError: 'NoneType' object has no attribute 'group'
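
The traceback appears because 'A com zone' contains no 'bat' after 'com', so search() returns None, and None has no group() method. One way to guard against this is to check the result before using it. The helper name first_match below is made up for this illustration.

```python
import re

batRegex = re.compile(r'com(bat)+')

def first_match(regex, text):
    """Return the matched text, or None when nothing matches."""
    mo = regex.search(text)
    if mo is None:
        return None
    return mo.group()

print(first_match(batRegex, 'A com zone'))     # None
print(first_match(batRegex, 'A combat zone'))  # combat
```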

 

Now let's try it only with strings that contain a match.

Code


import re

batRegex = re.compile(r'com(bat)+')

mo1 = batRegex.search('A combat zone')
print(mo1.group())

mo2 = batRegex.search('A combatbatbatbat zone')
print(mo2.group())

Output🖥

combat
combatbatbatbat

Greedy and Non-Greedy Matching

By default, Python's regular expressions are greedy, which means that in ambiguous situations they match the longest string possible. For example, (Bat){3,5} will match three, four, or five instances of Bat. Now suppose the search string is 'BatBatBatBatBat'. The result will be 'BatBatBatBatBat' rather than 'BatBatBat' or 'BatBatBatBat', even though the shorter ones are also valid matches.

This is where non-greedy matching comes in. It matches the shortest string possible, and you ask for it by putting a question mark after the quantifier.

Code


import re

greedyRegex = re.compile(r'(Bat){3,5}')
mo1 = greedyRegex.search('BatBatBatBatBat')
print(mo1.group())

# Putting ? sign at the end
nongreedyRegex = re.compile(r'(Bat){3,5}?')
mo2 = nongreedyRegex.search('BatBatBatBatBat')
print(mo2.group())

Output🖥

BatBatBatBatBat
BatBatBat
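
The same trailing ? also works after + and *. Here is a small sketch with an invented sample string:

```python
import re

text = 'Order 12345 shipped'

# Greedy: take as many digits as possible
greedy = re.compile(r'\d+').search(text).group()
# Non-greedy: stop as soon as one digit has matched
lazy = re.compile(r'\d+?').search(text).group()

print(greedy)  # 12345
print(lazy)    # 1
```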

The findall() Method

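As we discussed earlier, the search() method returns only the first match in the search string, whereas the findall() method returns every match. If the regex has no groups, findall() returns a list of strings; if it has groups, it returns a list of tuples with one string per group.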

Code


import re

findallRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

msg1 = 'Contact me at 415-444-1049. 415-444-2341 is another one'

mo1 = findallRegex.findall(msg1)
mo2 = phoneNumRegex.findall(msg1)

print(mo1)
print(mo2)

Output🖥

['415-444-1049', '415-444-2341']
[('415', '444', '1049'), ('415', '444', '2341')]
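
Because each tuple holds one string per group, you can unpack it directly in a loop. The following is a small sketch; the message and the variable names are made up for the illustration.

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
msg = 'Contact me at 415-444-1049. 212-555-2341 is another one'

area_codes = []
for area, prefix, line in phoneNumRegex.findall(msg):
    # Each loop pass receives the three groups of one match
    area_codes.append(area)

print(area_codes)  # ['415', '212']
```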

Make Your Character Classes

In Python regular expressions, you can make your own character classes. Write the characters you want between square brackets, like this: [aeiouAEIOU]. It will match every vowel in the search string.

Code


import re

vowelRegex = re.compile('[aeiouAEIOU]')
# Making a negative character class by putting a caret (^)
# right after the opening bracket. It'll match non-vowel characters.
consonantRegex = re.compile('[^aeiouAEIOU]')

msg = 'Have you seen the Money Heist?'

mo1 = vowelRegex.findall(msg)
mo2 = consonantRegex.findall(msg)
print(mo1)
print(mo2)

Output🖥

['a', 'e', 'o', 'u', 'e', 'e', 'e', 'o', 'e', 'e', 'i']
['H', 'v', ' ', 'y', ' ', 's', 'n', ' ', 't', 'h', ' ', 'M', 'n', 'y', ' ', 'H', 's', 't', '?']

The Caret(^) and Dollar($)

The caret (^) symbol at the start of a regex indicates that the match must occur at the beginning of the search text. Likewise, the dollar ($) symbol at the end of a regex indicates that the search text must end with this pattern. Using ^ and $ together indicates that the entire string must match the regex.

Let's have a look at this example for a better understanding.

Code

 
import re

beginsWithMoney = re.compile(r'^Money')
endsWithNumber = re.compile(r'\d$')
wholeStringIsNum = re.compile(r'^\d+$')

msg = 'Money Heist Season 5'

mo1 = beginsWithMoney.search(msg)
print(mo1.group())

mo2 = endsWithNumber.search(msg)
print(mo2.group())

mo3 = wholeStringIsNum.search('1234567890')
mo4 = wholeStringIsNum.search('123 4567890')
print(mo3.group())
# mo4 is None, so the next line would raise an AttributeError
# print(mo4.group())

Output🖥

Money
5
1234567890
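
Using ^ and $ together is handy for validating a whole input. Here is a minimal sketch, assuming a made-up rule that a PIN is exactly four digits; the names pinRegex and is_valid_pin are hypothetical.

```python
import re

# Hypothetical rule: a PIN is exactly four digits and nothing else
pinRegex = re.compile(r'^\d{4}$')

def is_valid_pin(text):
    """Return True when the whole text is four digits."""
    return pinRegex.search(text) is not None

print(is_valid_pin('1234'))   # True
print(is_valid_pin('12345'))  # False
print(is_valid_pin('12a4'))   # False
```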

The Wildcard(.) Character

In Python regular expressions, the . (or dot) is known as the wildcard character. It matches any character except a newline. If the DOTALL flag has been specified, it matches any character including a newline.

Code


import re

atRegex = re.compile(r'.at')
msg = 'a cat was sat on a flat mat and watching a bat.'
mo1 = atRegex.findall(msg)
print(mo1)

Output🖥

['cat', 'sat', 'lat', 'mat', 'wat', 'bat']


Keep in mind that the dot matches just one character, which is why we got 'lat' for 'flat'. To match an actual dot, escape it with a backslash: \.
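
To see the difference between the wildcard and an escaped dot, you could compare the two on the same text. The sample string is invented for this sketch.

```python
import re

text = 'Versions 1.5 and 1x5 look alike'

# Unescaped dot: any single character may sit between the digits
any_char = re.compile(r'\d.\d').findall(text)
# Escaped dot: only a real period may sit between the digits
literal_dot = re.compile(r'\d\.\d').findall(text)

print(any_char)     # ['1.5', '1x5']
print(literal_dot)  # ['1.5']
```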

Matching Everything with Dot-Star

The name "Matching Everything" suggests that you can match anything at all. Suppose you want to match the string 'Name: ', followed by anything, followed by ' Surname: ', and then followed by anything again. For this we'll use dot-star (.*). Remember, the dot stands for any single character except the newline, and the star means zero or more of the preceding character.

We'll try both the greedy and non-greedy modes here.

Code


import re

nameRegex = re.compile(r'Name: (.*) Surname: (.*)')
mo1 = nameRegex.search('Name: Kane Surname: Williamson')
print(mo1.group())

msg = '(Williamson is the captain) of the NZ team)'

# The parentheses are escaped so that they match literal ( and )
# Non greedy Mode
nongreedyRegex = re.compile(r'\(.*?\)')
mo2 = nongreedyRegex.search(msg)
print(mo2.group())

# Greedy Mode
greedyRegex = re.compile(r'\(.*\)')
mo3 = greedyRegex.search(msg)
print(mo3.group())

Output🖥

Name: Kane Surname: Williamson
(Williamson is the captain)
(Williamson is the captain) of the NZ team)
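
Since each (.*) sits inside parentheses, the two pieces of "anything" are also available as groups. A short sketch of pulling them out:

```python
import re

nameRegex = re.compile(r'Name: (.*) Surname: (.*)')
mo = nameRegex.search('Name: Kane Surname: Williamson')

# groups() returns the text captured by each (.*)
first, last = mo.groups()
print(first)  # Kane
print(last)   # Williamson
```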

Matching New Lines with the Dot Character

The dot-star matches everything except a newline. You can make the dot match all characters, including a newline, by passing re.DOTALL as the second argument to the re.compile() function.

Code


import re

noNewlineRegex = re.compile('.*')

msg = 'Create a life plan. \nMaster a difficult skill'

mo1 = noNewlineRegex.search(msg)
print(mo1.group())

print('-'*20)
newlineRegex = re.compile('.*', re.DOTALL)
mo2 = newlineRegex.search(msg)
print(mo2.group())

Output🖥

Create a life plan.
--------------------
Create a life plan.
Master a difficult skill

Character Classes

Shorthand    Represents
\d           Any numeric digit from 0 to 9.
\D           Any character that's not a numeric digit from 0 to 9.
\w           Any letter, numeric digit, or underscore. In short, a word character.
\W           Opposite of \w.
\s           Any space, tab, or newline character.
\S           Any character that's not a space, tab, or newline.
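
A small sketch that puts three of these shorthands together; the sample sentence is invented for the illustration.

```python
import re

# One or more digits, one whitespace character, one or more word characters
countRegex = re.compile(r'\d+\s\w+')
found = countRegex.findall('12 drummers, 11 pipers, 10 lords')
print(found)  # ['12 drummers', '11 pipers', '10 lords']
```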


Review of Regex Symbol

Let's take a quick look👁 at what you've learned so far.

  • ? matches zero or one of the preceding group.
  • * matches zero or more of the preceding group.
  • + matches one or more of the preceding group.
  • {n} matches exactly n of the preceding group. Fewer than n causes the entire regex not to match. For example, m{5} will match exactly five 'm' characters, not four.
  • {n,} matches n or more of the preceding group.
  • {,m} matches 0 to m of the preceding group.
  • {n,m} matches at least n and at most m of the preceding group. For example, a{4,8} will match 4 to 8 'a' characters.
  • {n,m}? or *? or +? performs a non-greedy match of the preceding group.
  • ^caret means the search string must begin with the word "caret".
  • caret$ means the search string must end with the word "caret".
  • . (dot) matches any character except the newline character.
  • \d, \w, or \s match a digit, word, or space character, respectively.
  • \D, \W, or \S match anything except a digit, word, or space character, respectively.
  • [abc] matches any character between the square brackets. In this case, a, b, or c.
  • [a-m] matches any character between a and m.
  • [0-8] matches any digit between 0 and 8.
  • [^abc] matches any character that's not between the square brackets.
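
A few of the symbols above have not appeared in a code example yet, so here is a quick sketch of them. The sample strings and variable names are made up for the illustration.

```python
import re

# {n,} : two or more 'a' characters
two_or_more = re.compile(r'a{2,}').findall('a aa aaaa')

# {,m} : zero to two 'o' characters between 'g' and 'd'
up_to_two = re.compile(r'go{,2}d').findall('gd god good goood')

# [a-m] and [0-8] : ranges inside a character class
letters = re.compile(r'[a-m]').findall('python')
digits = re.compile(r'[0-8]').findall('2019')

print(two_or_more)  # ['aa', 'aaaa']
print(up_to_two)    # ['gd', 'god', 'good']
print(letters)      # ['h']
print(digits)       # ['2', '0', '1']
```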

Case-Insensitive Match

The examples you've seen so far all match the exact letter case of the pattern. But if we want a case-insensitive match, we need a different approach.

For example, these regexes are all different from each other and match different strings.

👉reg1 = re.compile('Master')

👉reg2 = re.compile('master')

👉reg3 = re.compile('mAster')

👉reg4 = re.compile('MasteR')

In this case, instead of creating so many regexes, you can solve the issue in just one line: simply pass re.IGNORECASE or re.I as the second argument to the re.compile() function.

Code


import re

robocop = re.compile(r'money', re.I)
mo1 = robocop.search('Have you seen the Money Heist?')
print("Match1: ", mo1.group())

mo2 = robocop.search('Have you seen the moneY Heist?')
print("Match2: ", mo2.group())

Output🖥

Match1:  Money
Match2:  moneY

Substitute String with sub() Method

You may be familiar with the find-and-replace feature of MS-Word. Regular Expressions in Python can not only find text matching a pattern but also substitute new text in place of the matches.

We use the sub() method to perform the substitution. It takes two arguments: the first is the replacement string for any matches, and the second is the string to search.

Code


import re

nameRegex = re.compile(r'Professor \w+', re.I)
msg = 'Professor Alberto told Lisbon the way to get out the gold.'
mo1 = nameRegex.sub('Boogy', msg)
print(mo1)

# In the replacement string below, \1 is replaced by whatever text
# was matched by group 1, that is, the (\w\w) group of the regular
# expression.

agentNamesRegex = re.compile(r'Agent (\w\w)\w+', re.I)
msg2 = 'Agent Berlin told agent Denver to kill Agent Monica.'
mo1 = agentNamesRegex.sub(r'\1****', msg2)
print(mo1)

Output🖥

Boogy told Lisbon the way to get out the gold.
Be**** told De**** to kill Mo****.


You can type \1, \2, \3, and so on in the first argument to the sub() method. Each one means "insert the text of group 1, 2, 3, and so on here".
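
For example, with two groups you could swap their order in the replacement. This is only a small sketch with an invented pattern:

```python
import re

# Group 1 is the first word, group 2 is the second word
nameRegex = re.compile(r'(\w+) (\w+)')
swapped = nameRegex.sub(r'\2, \1', 'Kane Williamson')
print(swapped)  # Williamson, Kane
```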

Managing Complex Regular Expression

Regular Expressions are excellent when the text pattern you want to match is straightforward. But matching complex text patterns might require long, tangled regular expressions. Don't worry, there is a way to mitigate this: you can tell Python to ignore whitespace and comments inside the regular expression string by passing re.VERBOSE as the second argument to the re.compile() function.

Let's have a look at a complex expression.

phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')

I think that went over your head. OK, here is a more readable version for you.

Code


import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code, with or without parentheses
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)?                     # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341.'

# Returning a list
mo1 = phoneRegex.findall(msg)
print(mo1)

Output🖥

[('415-444-1049', '415', '-', '-', '', ''), ('415-444-2341', '415', '-', '-', '', '')]
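
Each tuple in that list holds one string per group, and the outermost parentheses are group 1, so the first item of every tuple is the complete number. Here is a sketch of collecting just those, using a shortened pattern without the extension part and a sample message invented for the illustration:

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?   # area code
    (\s|-|\.)?           # separator
    \d{3}                # first 3 digits
    (\s|-|\.)?           # separator
    \d{4}                # last 4 digits
    )''', re.VERBOSE)

msg = 'Contact 1: 415-444-1049. Contact 2: (415) 444-2341.'

# Index 0 of each tuple is group 1, which wraps the whole pattern
numbers = [groups[0] for groups in phoneRegex.findall(msg)]
print(numbers)  # ['415-444-1049', '(415) 444-2341']
```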

Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE together

Normally, re.compile() takes only a single value as its second argument. But suppose you want to use re.IGNORECASE and re.DOTALL together. The solution is to combine them with the pipe character (|), which in this context is Python's bitwise OR operator. You can combine more than two flags this way.

Example


regex = re.compile('python', re.IGNORECASE | re.DOTALL | re.VERBOSE)
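
To see all three flags doing some work at once, here is a small sketch. The pattern, the sample text, and the name blockRegex are invented for this illustration.

```python
import re

# IGNORECASE lets 'start' match 'START', DOTALL lets '.' cross the
# newline, and VERBOSE allows the layout and comments in the pattern.
blockRegex = re.compile(r'''
    start   # opening keyword, any letter case
    .*      # anything in between, including newlines
    end     # closing keyword
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

text = 'START first line\nsecond line END'
mo = blockRegex.search(text)
print(mo.group() == text)  # True
```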

References

👉 Automate the Boring Stuff with Python by Al Sweigart.

👉 Gmail Help

👉 Regular Expression - Wikipedia

 

That's all for this tutorial. Feel free to drop your comment below. You'll get a reply soon.

Thanks for reading!💚

Subhankar Rakshit

Meet Subhankar Rakshit, a Computer Science postgraduate (M.Sc.) and the creator of PySeek. Subhankar is a programmer who specializes in the Python language. With several years of experience under his belt, he has developed a deep understanding of software development. He enjoys writing blogs on various topics related to Computer Science, Python Programming, and Software Development.
