Introduction
Have you ever received an error message while logging in to a mail server because you typed the wrong mail ID? Or been advised to choose a stronger password when creating an account? Every internet user, myself included, has faced this at least once. A question may arise in your mind: how can a computer tell whether a password is strong, or whether a mail ID is valid?
Here comes the use of Regular Expressions. But how, right? You'll get the answer shortly. In this tutorial, you'll learn about Regular Expressions in Python.
What is Regular Expression in Python?
A Regular Expression, or RE, is a sequence of characters that defines a pattern; the pattern specifies the set of strings that match it. It helps you search text for a specific pattern.
For example, a Gmail ID should look like this: username@gmail.com. As per Gmail Help, a username can contain letters (a-z), numbers (0-9), and periods (.). Usernames cannot contain an equal sign (=), apostrophe ('), ampersand (&), underscore (_), dash (-), comma (,), plus sign (+), brackets (<,>), or more than one period (.) in a row. Usernames can begin or end with non-alphanumeric characters except periods (.).
You have to follow these rules when creating a username on Gmail. To check whether an email ID follows such rules, Regular Expressions are helpful and save time, because a single pattern can replace many lines of hand-written checks. We'll discuss them in more depth, one by one, in the upcoming sections.
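As a rough preview, the username rules above can be condensed into one pattern. This is only a sketch: the pattern and the helper name looks_like_gmail_id are assumptions made up for this example, not Gmail's actual validation, and it ignores other rules such as length limits. Every symbol used here is explained in the sections that follow.

```python
import re

# Hypothetical pattern: lowercase letters, digits, and single periods only;
# no period at the start or end, and never two periods in a row.
gmail_regex = re.compile(r'[a-z0-9]+(\.[a-z0-9]+)*@gmail\.com')

def looks_like_gmail_id(address):
    """Return True if the whole address fits the simplified rules above."""
    return gmail_regex.fullmatch(address) is not None

print(looks_like_gmail_id('john.smith42@gmail.com'))  # fits the rules
print(looks_like_gmail_id('john..smith@gmail.com'))   # two periods in a row
print(looks_like_gmail_id('john_smith@gmail.com'))    # underscore is not allowed
```

The first call prints True and the other two print False.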
Finding Patterns Without Regular Expressions
Imagine you want to find all the phone numbers in a string with a program. A US phone number looks like (535) 555-1348 or 535-555-1348. Let's see the simple steps you would follow to find a phone number in the second format.
Steps:
- Check that the text is exactly 12 characters long (the length of the second format above).
- Check that the first three characters, the area code, are digits.
- Check for a hyphen after the area code.
- Check for three more digits.
- Check for another hyphen.
- Finally, check for four more digits.
If every check passes, the text is a phone number and the program prints it; otherwise the program moves on to the next piece of text.
Code
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True
if __name__ == '__main__':
    msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341'
    for i in range(len(msg)):
        block = msg[i:i+12]
        if isPhoneNumber(block):
            print('Phone number found: ' + block)
Output🖥
Phone number found: 415-444-1049
Phone number found: 415-444-2341
Finding Patterns with Regular Expressions
In the previous section, we extracted phone numbers from a string, and it worked. But if we wanted to find both formats of the phone number, the code would get much more complicated. This is where Regular Expressions come in. Let's see how to do it.
Follow these simple steps:
- Import the regular expression (regex for short) module with "import re".
- Create a regex object with the re.compile() function (use a raw string).
- Pass the string you want to search to the regex object's search() method. It returns a match object.
- Call the match object's group() method to get the actual matched text.
Code
import re
# Putting an r before the first quote of the string value
# marks it as a raw string, in which a backslash is kept
# as a literal character instead of starting an escape sequence.
phoneNum = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341'
mo = phoneNum.search(msg)
print('Phone number found: ' + mo.group())
Output🖥
Phone number found: 415-444-1049
You can write the above pattern more compactly: "\d{3}-\d{3}-\d{4}" is the same as "\d\d\d-\d\d\d-\d\d\d\d", because {3} means "repeat the preceding item three times". It's your choice which one you would like to use🙆.
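To see that the two spellings behave the same, here is a small sketch that runs both against the message from the previous example. The variable names long_form and short_form are made up for this illustration.

```python
import re

msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341'

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')  # {3} repeats \d three times

# Both patterns describe the same text, so search() finds the same match.
print(long_form.search(msg).group())
print(short_form.search(msg).group())
```

Both lines print 415-444-1049.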
By now you have probably seen that Regular Expressions let us solve complex problems with less code.
In the last example, you may have noticed that I put two phone numbers in the string, but the output showed only one result. That is because the search() method returns only the first match. We'll see how to get every match with the findall() method in a later section. Let's try more examples.
Parentheses
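We can create groups by adding parentheses to the regex. Example: "(\d\d\d)-(\d\d\d-\d\d\d\d)". You can then pass a group number to the match object's group() method to grab the text matched by just that group: group(1) is the first pair of parentheses, group(2) is the second, and group() with no argument is the whole match. Let's have a look at the code for a better understanding.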
Code
import re
phoneNum1 = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
msg = 'Contact me at 415-444-2341'
mo1 = phoneNum1.search(msg)
# mo1.groups() returns a tuple with the text of every group
areaCode, mainNum = mo1.groups()
print('Tuple: ', mo1.groups())
print('AreaCode: ', areaCode)
print('Actual Number: ', mainNum)
# Another Method
print('~Another Method~')
print('AreaCode: ', mo1.group(1))
print('Actual Number: ', mo1.group(2))
print('Compact Form: ', mo1.group())
Output🖥
Tuple:  ('415', '444-2341')
AreaCode:  415
Actual Number:  444-2341
~Another Method~
AreaCode:  415
Actual Number:  444-2341
Compact Form:  415-444-2341
The escape(\) characters
Suppose someone enters a phone number like this: (415) 444-2341. Parentheses normally create groups, so to match literal parentheses we have to escape them as "\(" and "\)" in the raw string passed to the re.compile() function.
Code
import re
phoneNum2 = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
msg = 'Contact me at (415) 444-1049.'
mo2 = phoneNum2.search(msg)
print('Phone number found: ', mo2.group())
Output🖥
Phone number found:  (415) 444-1049
The Pipe(|) Character
Pipe means the | character on your keyboard. It is used to match one of several expressions. For example, the regular expression "Tom Cruise|Kevin Bacon" will match either 'Tom Cruise' or 'Kevin Bacon'.
If both names occur in the search string, the one that occurs first is returned as the match object.
Code
import re
movieCharacter = re.compile(r'Tom Cruise|Kevin Bacon')
mo1 = movieCharacter.search('Tom Cruise is my favourite and Kevin Bacon too')
msg = 'Kevin Bacon and Tom Cruise both starred in the movie A Few Good Men'
mo2 = movieCharacter.search(msg)
print(mo1.group())
print(mo2.group())
Output🖥
Tom Cruise
Kevin Bacon
Suppose you want to match any of Combat, Common, Commando, Computer, or Commodity in a search string. All of these words start with Com, so you can write that prefix only once and put the alternatives in parentheses. Let's see how it is done.
Code
import re
character = re.compile(r'Com(bat|mon|mando|puter|modity)')
mo1 = character.search('Four things are Common in a Computer')
print(mo1.group())
Output🖥
Common
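The pipe also combines nicely with the escaped parentheses from the previous section. Here is a minimal sketch of one way to match both phone number formats mentioned earlier, assuming the two formats are 415-444-1049 and (415) 444-1049. The name bothFormats and the sample sentences are made up for this example.

```python
import re

# The area code is either "(ddd) " or "ddd-", followed by ddd-dddd.
bothFormats = re.compile(r'(\(\d\d\d\) |\d\d\d-)\d\d\d-\d\d\d\d')

mo1 = bothFormats.search('Call (415) 444-1049 today')
mo2 = bothFormats.search('Call 415-444-2341 today')
print(mo1.group())
print(mo2.group())
```

The first line prints (415) 444-1049 and the second prints 415-444-2341.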
The question(?) mark
Sometimes you want part of a pattern to be optional. To do this, write the pattern like this, "(pattern)?", inside the re.compile() function. The question mark makes the group inside the parentheses optional.
For example, take the two words "Impossible" and "possible". If you make the "Im" optional, the regex matches whether or not "Im" is present in the search string.
Let's see an example for a better understanding.
Code
import re
missionRegex = re.compile(r'(Im)?possible')
mo1 = missionRegex.search('Mission Impossible')
print(mo1.group())
mo2 = missionRegex.search('Mission possible')
print(mo2.group())
Output🖥
Impossible
possible
Another example.
Code
import re
contactRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = contactRegex.search('Contact me at 415-444-1049')
print(mo1.group())
mo2 = contactRegex.search('Contact me at 444-1049')
print(mo2.group())
Output🖥
415-444-1049
444-1049
In the above example, I've made the area code optional.
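One detail worth knowing: when an optional group does not take part in the match, asking for that group returns None rather than an empty string. A short sketch using the same regex; the names with_code and without_code are made up for this example.

```python
import re

contactRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_code = contactRegex.search('Contact me at 415-444-1049')
without_code = contactRegex.search('Contact me at 444-1049')

print(with_code.group(1))     # text captured by the optional group
print(without_code.group(1))  # the group did not take part in the match
```

The first line prints 415- and the second prints None.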
The Star(*)
Star (*) matches zero or more repetitions of the preceding RE. The name zero-or-more suggests that the group that comes before the Star(*) can occur at least zero times and at most any number of times.
Code
import re
batRegex = re.compile(r'com(bat)*')
mo1 = batRegex.search('A com zone')
print(mo1.group())
mo2 = batRegex.search('A combat zone')
print(mo2.group())
mo3 = batRegex.search('A combatbatbatbat zone')
print(mo3.group())
Output🖥
com
combat
combatbatbatbat
For 'com', the (bat)* part of the regex matches zero instances of 'bat' in the string; for 'combat', the (bat)* matches one instance of bat in the string; and for 'combatbatbatbat', the (bat)* matches four instances of 'bat'.
The Plus(+) Sign
In Python Regular Expressions, the plus (+) sign means one or more. The group that comes before the plus must occur at least once and may repeat any number of times.
First, we'll try the previous example with + in place of *.
Code
import re
batRegex = re.compile(r'com(bat)+')
mo1 = batRegex.search('A com zone')
print(mo1.group())
mo2 = batRegex.search('A combat zone')
print(mo2.group())
mo3 = batRegex.search('A combatbatbatbat zone')
print(mo3.group())
Output🖥
Traceback (most recent call last):
File "c:\Users\SUKHENDU\Desktop\Python\hello.py", line 5, in <module>
print(mo1.group())
AttributeError: 'NoneType' object has no attribute 'group'
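The error occurs because 'A com zone' has no 'bat' after 'com', so search() returns None, and None has no group() method. Let's try again without that string.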
Code
import re
batRegex = re.compile(r'com(bat)+')
mo1 = batRegex.search('A combat zone')
print(mo1.group())
mo2 = batRegex.search('A combatbatbatbat zone')
print(mo2.group())
Output🖥
combat
combatbatbatbat
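In real programs you usually cannot know in advance whether the text will match, so it is safer to check the result of search() before calling group(). A minimal sketch of such a guard; the helper name first_match is made up for this example.

```python
import re

batRegex = re.compile(r'com(bat)+')

def first_match(text):
    """Return the matched text, or None when the pattern is not found."""
    mo = batRegex.search(text)
    if mo is None:  # search() found nothing, so there is no group() to call
        return None
    return mo.group()

print(first_match('A com zone'))
print(first_match('A combat zone'))
```

This prints None for the first string and combat for the second, without raising an AttributeError.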
Greedy and Non-Greedy Matching
By default, Python's regular expressions are greedy, which means that in ambiguous situations they match the longest string possible. For example, (Bat){3,5} can match three, four, or five instances of Bat. Now suppose the search string is 'BatBatBatBatBat'. The match will be 'BatBatBatBatBat' rather than 'BatBatBat' or 'BatBatBatBat', even though those shorter strings also satisfy the pattern.
Here comes non-greedy matching. It matches the shortest string possible, and you ask for it by putting a question mark after the quantifier.
Code
import re
greedyRegex = re.compile(r'(Bat){3,5}')
mo1 = greedyRegex.search('BatBatBatBatBat')
print(mo1.group())
# Putting ? sign at the end
nongreedyRegex = re.compile(r'(Bat){3,5}?')
mo2 = nongreedyRegex.search('BatBatBatBatBat')
print(mo2.group())
Output🖥
BatBatBatBatBat
BatBatBat
The findall() Method
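As we discussed earlier, the search() method returns only the first match in the search string, whereas the findall() method returns all the matches. If the regex has no groups, findall() returns a list of strings; if it has groups, it returns a list of tuples with one string per group.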
Code
import re
findallRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
msg1 = 'Contact me at 415-444-1049. 415-444-2341 is another one'
mo1 = findallRegex.findall(msg1)
mo2 = phoneNumRegex.findall(msg1)
print(mo1)
print(mo2)
Output🖥
['415-444-1049', '415-444-2341']
[('415', '444', '1049'), ('415', '444', '2341')]
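findall() gives you only strings. If you also need match objects, for example to know where each match starts, the re module offers finditer(), which is not covered elsewhere in this tutorial. A small sketch on the same message; the name found is made up for this example.

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
msg1 = 'Contact me at 415-444-1049. 415-444-2341 is another one'

# finditer() yields one match object per match, from left to right.
found = [(mo.group(), mo.group(1), mo.start())
         for mo in phoneNumRegex.finditer(msg1)]

for whole, area, position in found:
    print(whole, area, position)
```

Each printed line shows the full number, its area code, and the index in msg1 where the match begins.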
Make Your Character Classes
In Python Regular Expressions, you can make your own character classes. Write the characters you want between square brackets, like this: [aeiouAEIOU]. It matches any single vowel, so findall() returns every vowel in the search string.
Code
import re
vowelRegex = re.compile('[aeiouAEIOU]')
# Make a negative character class by putting a caret (^)
# right after the opening bracket. It matches every non-vowel character.
consonantRegex = re.compile('[^aeiouAEIOU]')
msg = 'Have you seen the Money Heist?'
mo1 = vowelRegex.findall(msg)
mo2 = consonantRegex.findall(msg)
print(mo1)
print(mo2)
Output🖥
['a', 'e', 'o', 'u', 'e', 'e', 'e', 'o', 'e', 'e', 'i']
['H', 'v', ' ', 'y', ' ', 's', 'n', ' ', 't', 'h', ' ', 'M', 'n', 'y', ' ', 'H', 's', 't', '?']
The Caret(^) and Dollar($)
The caret (^) symbol at the start of a regex indicates that the match must occur at the beginning of the search text. Likewise, the dollar ($) symbol at the end of a regex indicates that the search text must end with the pattern. Using ^ and $ together indicates that the entire string must match the regex.
Let's have a look at this example for a better understanding.
Code
import re
beginsWithMoney = re.compile(r'^Money')
endsWithNumber = re.compile(r'\d$')
wholeStringIsNum = re.compile(r'^\d+$')
msg = 'Money Heist Season 5'
mo1 = beginsWithMoney.search(msg)
print(mo1.group())
mo2 = endsWithNumber.search(msg)
print(mo2.group())
mo3 = wholeStringIsNum.search('1234567890')
mo4 = wholeStringIsNum.search('123 4567890')
print(mo3.group())
# mo4 is None because of the space in the string,
# so the next line would raise an AttributeError
# print(mo4.group())
Output🖥
Money
5
1234567890
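As a side note, regex objects also have a fullmatch() method, which succeeds only when the entire string fits the pattern, so it behaves much like wrapping the pattern in ^ and $. A small sketch for comparison; the name digitsOnly is made up for this example.

```python
import re

digitsOnly = re.compile(r'\d+')

# fullmatch() returns None unless the entire string fits the pattern.
whole_ok = digitsOnly.fullmatch('1234567890') is not None
whole_bad = digitsOnly.fullmatch('123 4567890') is not None

print(whole_ok)
print(whole_bad)
```

This prints True and then False, because the space in the second string is not a digit.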
The Wildcard(.) Character
In Python Regular Expressions, the . (or dot) is known as the wildcard character. It matches any character except a newline. If the re.DOTALL flag is specified, it matches any character, including a newline.
Code
import re
atRegex = re.compile(r'.at')
msg = 'a cat was sat on a flat mat and watching a bat.'
mo1 = atRegex.findall(msg)
print(mo1)
Output🖥
['cat', 'sat', 'lat', 'mat', 'wat', 'bat']
Keep in mind that the dot matches exactly one character, which is why we got 'lat' from 'flat'. To match an actual dot, escape it with a backslash: '\.'.
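The difference between the wildcard dot and the escaped dot is easy to see side by side. A short sketch; the sample sentence and the names anyChar and literalDot are made up for this example.

```python
import re

msg = 'Version 3.5 replaced version 305'

anyChar = re.compile(r'\d.\d')      # unescaped dot: any character in the middle
literalDot = re.compile(r'\d\.\d')  # escaped dot: only a real "."

print(anyChar.findall(msg))
print(literalDot.findall(msg))
```

The first pattern finds both '3.5' and '305', while the second finds only '3.5'.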
Matching Everything with Dot-Star
As the name suggests, dot-star lets you match anything at all. Suppose you want to match the string 'Name: ', followed by anything, followed by ' Surname: ', and then followed by anything again. To do this we'll use dot-star (.*). Remember, the dot stands for any single character except a newline, and the star means zero or more of the preceding item.
We'll try both the greedy and non-greedy modes here.
Code
import re
nameRegex = re.compile(r'Name: (.*) Surname: (.*)')
mo1 = nameRegex.search('Name: Kane Surname: Williamson')
print(mo1.group())
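msg = '(Williamson is the captain) of the NZ team)'
# Non-greedy mode: stop at the first closing parenthesis.
# The parentheses are escaped so that they match literally.
nongreedyRegex = re.compile(r'\(.*?\)')
mo2 = nongreedyRegex.search(msg)
print(mo2.group())
# Greedy mode: run on to the last closing parenthesis.
greedyRegex = re.compile(r'\(.*\)')
mo3 = greedyRegex.search(msg)
print(mo3.group())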
Output🖥
Name: Kane Surname: Williamson
(Williamson is the captain)
(Williamson is the captain) of the NZ team)
Matching New Lines with the Dot Character
The dot-star matches everything except a newline. You can match all characters, including newlines, by passing re.DOTALL as the second argument to the re.compile() function.
Code
import re
noNewlineRegex = re.compile('.*')
msg = 'Create a life plan. \nMaster a difficult skill'
mo1 = noNewlineRegex.search(msg)
print(mo1.group())
print('-'*20)
newlineRegex = re.compile('.*', re.DOTALL)
mo2 = newlineRegex.search(msg)
print(mo2.group())
Output🖥
Create a life plan.
--------------------
Create a life plan.
Master a difficult skill
Character Classes
Shorthand | Represents |
---|---|
\d | Any numeric digit from 0 to 9. |
\D | Any character that is not a numeric digit from 0 to 9. |
\w | Any letter, numeric digit, or underscore (a "word" character). |
\W | Any character that is not a letter, numeric digit, or underscore. |
\s | Any space, tab, or newline character (a "space" character). |
\S | Any character that is not a space, tab, or newline. |
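Here is a short sketch that puts some of these shorthand classes to work. The sample sentence and the names xmasRegex, pairs, and gaps are made up for this example.

```python
import re

msg = '12 drummers, 11 pipers, 10 lords, 9 ladies'

# One or more digits, one whitespace character, one or more word characters.
xmasRegex = re.compile(r'\d+\s\w+')
pairs = xmasRegex.findall(msg)
print(pairs)

# \D+ grabs the runs of non-digit characters between the numbers.
gaps = re.findall(r'\D+', msg)
print(gaps)
```

The first list holds each number together with its word, such as '12 drummers'. The second holds everything that is not a digit, such as ' drummers, '.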
Review of Regex Symbol
Let's review what you've learned so far.
- ? matches zero or one of the preceding group.
- * matches zero or more of the preceding group.
- + matches one or more of the preceding group.
- {n} matches exactly n of the preceding group. Fewer than n causes the entire regex not to match. For example, m{5} matches exactly five 'm' characters, not four.
- {n,} matches n or more of the preceding group.
- {,m} matches 0 to m of the preceding group.
- {n,m} matches at least n and at most m of the preceding group. For example, a{4,8} matches 4 to 8 'a' characters.
- {n,m}? or *? or +? performs a non-greedy match of the preceding group.
- ^caret means the search string must begin with the text "caret".
- caret$ means the search string must end with the text "caret".
- . (dot) matches any character except the newline character.
- \d, \w, and \s match a digit, word, or space character, respectively.
- \D, \W, and \S match anything except a digit, word, or space character, respectively.
- [abc] matches any one character between the square brackets. In this case, a, b, or c.
- [a-m] matches any one letter from a to m.
- [0-8] matches any one digit from 0 to 8.
- [^abc] matches any character that is not between the square brackets.
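A few of the counting and range symbols from this list have not appeared in a runnable example yet, so here is a quick sketch. All the sample strings and variable names are made up for this illustration.

```python
import re

exactly_five = re.findall(r'm{5}', 'mmmm mmmmm')  # {n}: exactly five m's
two_or_more = re.findall(r'a{2,}', 'a aa aaaa')   # {n,}: two or more a's
up_to_two = re.findall(r'a{,2}b', 'b ab aab')     # {,m}: zero to two a's, then b
a_to_m = re.findall(r'[a-m]+', 'family')          # only the letters a to m
zero_to_eight = re.findall(r'[0-8]+', '1989')     # only the digits 0 to 8

print(exactly_five)
print(two_or_more)
print(up_to_two)
print(a_to_m)
print(zero_to_eight)
```

Notice that the run of four m's, the single a, the letter y, and the digit 9 are all left out of the results.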
Case-Insensitive Match
The examples you've seen so far all match the exact case of the pattern. To match text regardless of case, we need something more.
For example, each of these regexes matches a completely different string.
👉reg1 = re.compile('Master')
👉reg2 = re.compile('master')
👉reg3 = re.compile('mAster')
👉reg4 = re.compile('MasteR')
Instead of creating so many regexes, you can solve the issue in one line. Simply pass re.IGNORECASE or re.I as the second argument to the re.compile() function.
Code
import re
robocop = re.compile(r'money', re.I)
mo1 = robocop.search('Have you seen the Money Heist?')
print("Match1: ", mo1.group())
mo2 = robocop.search('Have you seen the moneY Heist?')
print("Match2: ", mo2.group())
Output🖥
Match1: Money
Match2: moneY
Substitute String with sub() Method
You may know the find-and-replace feature of MS Word. Regular Expressions in Python work in a similar way: they don't only find text that matches a pattern, they can also substitute new text in place of those matches.
We use the sub() method to perform the substitution. We have to pass two arguments to this method. The first is the replacement string, and the second is the string to search.
Code
import re
nameRegex = re.compile(r'Professor \w+', re.I)
msg = 'Professor Alberto told Lisbon the way to get out the gold.'
mo1 = nameRegex.sub('Boogy', msg)
print(mo1)
# In the replacement string, \1 is replaced by whatever text was
# matched by group 1, that is, the (\w\w) group of the regular
# expression.
agentNamesRegex = re.compile(r'Agent (\w\w)\w+', re.I)
msg2 = 'Agent Berlin told agent Denver to kill Agent Monica.'
mo1 = agentNamesRegex.sub(r'\1****', msg2)
print(mo1)
Output🖥
Boogy told Lisbon the way to get out the gold.
Be**** told De**** to kill Mo****.
You can type \1, \2, \3, and so on in the first argument to the sub() method. In the substitution, each one is replaced by the text of group 1, 2, 3, and so on.
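Because each group can be referenced separately, sub() can also rearrange text. A minimal sketch that turns year-month-day dates into day/month/year; the dates and the names dateRegex and reordered are made up for this example.

```python
import re

dateRegex = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
msg = 'Start: 2021-09-03, end: 2021-12-03.'

# \3, \2 and \1 insert the text of groups 3, 2 and 1 in a new order.
reordered = dateRegex.sub(r'\3/\2/\1', msg)
print(reordered)
```

This prints Start: 03/09/2021, end: 03/12/2021.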
Managing Complex Regular Expression
Regular Expressions are excellent when the text pattern you want to match is straightforward. But matching complex text patterns might require long, tangled regular expressions. Don't worry, there is a way to manage this. You can tell Python to ignore whitespace and comments inside the regular expression string by passing re.VERBOSE as the second argument to the re.compile() function.
Let's have a look at a complex expression.
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')
That is hard to read. Here is a version of it spread over several lines with comments.
Code
import re
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))? # area code, with or without parentheses
    (\s|-|\.)? # separator
    \d{3} # first 3 digits
    (\s|-|\.)? # separator
    \d{4} # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})? # extension
    )''', re.VERBOSE)
msg = 'Contact 1: 415-444-1049. Contact 2: 415-444-2341.'
# Returning a list
mo1 = phoneRegex.findall(msg)
print(mo1)
Output🖥
[('415-444-1049', '415', '-', '-', '', ''), ('415-444-2341', '415', '-', '-', '', '')]
Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE together
The re.compile() function takes only a single value as its second argument. But suppose you want to use re.IGNORECASE and re.DOTALL together. The solution is to combine the flags with the pipe character (|), which in this context is Python's bitwise OR operator. You can combine more than two flags this way.
Example
regex = re.compile('python', re.IGNORECASE | re.DOTALL | re.VERBOSE)
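To see all three flags doing something at once, here is a small sketch. The pattern and the sample text are made up for this example: re.VERBOSE allows the comments, re.IGNORECASE lets the lowercase words match uppercase text, and re.DOTALL lets .* cross the newlines.

```python
import re

regex = re.compile(r'''
    first    # a literal word; case is ignored
    .*       # anything in between, including newlines
    last     # another literal word
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

mo = regex.search('FIRST line\nsecond line\nLAST line')
print(mo.group())
```

The match starts at FIRST and runs across both newlines up to and including LAST.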
References
👉 Automate the Boring Stuff with Python by Al Sweigart.
👉 Gmail Help
👉 Regular Expression - Wikipedia
That's all for this tutorial. Please leave your comments below in case of any doubts or problems related to this topic. You'll get a reply soon.
Thanks for reading!💙
PySeek