Extract Emails and Phone Numbers from a Webpage Using Python


Introduction

Have you ever had to pick out email addresses and phone numbers one by one from a large text document or a webpage? Doing it manually is boring and can eat up a lot of your valuable time. What if a program could find the email addresses and phone numbers in a text automatically? Sounds great, right?

In this tutorial, we will create a Python program that extracts emails and phone numbers from a text or webpage using regular expressions.

You don't have to do much. Select all the text on a webpage or in a document by pressing Ctrl+A, copy it with Ctrl+C, and then run this Python program to get the extracted emails and phone numbers.


Requirement

We have to install the pyperclip module. It lets us work with the clipboard (copy and paste) from Python.

👉Install pyperclip: pip install pyperclip

Important Note
This Python program finds only US-style phone numbers and Indian numbers written with the +91 country code.

Code


import pyperclip, re

IndianNumber = re.compile(r'''(
([+]\d{1,2})
(\d{3,10})
)''',re.VERBOSE)

phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? # Area Code(Optional)
(\s|-|\.) # Separator
(\d{3}) # First Three Digits
(\s|-|\.) # Separator
(\d{4}) # Last Four Digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # Extension
)''', re.VERBOSE)

emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+ # Username
@ # @Symbol
[a-zA-Z0-9.-]+ # Domain Name
(\.[a-zA-Z]{2,4}) # dot-something
)''', re.VERBOSE)

# Find Matches in Clipboard Text
text = str(pyperclip.paste())
phone_groups = phoneRegex.findall(text)
email_groups = emailRegex.findall(text)
Indian_Contacts = IndianNumber.findall(text)

matched = []

for group in phone_groups:
    matched.append(group[0])
    # phoneNum = '-'.join([group[1], group[3], group[5]])
    # if group[8] != '':
    #     phoneNum += ' x' + group[8]
    # matched.append(phoneNum)

for group in Indian_Contacts:
    if group[1] == '+91':
        phoneNum = group[1] + group[2]
        matched.append(phoneNum)

for group in email_groups:
    matched.append(group[0])

if len(matched) > 0:
    pyperclip.copy('\n'.join(matched))
    print('Copied to clipboard!\n')
    print('\n'.join(matched))
else:
    print('No Phone Numbers or Emails found')

Output

Copied to clipboard!

535-555-1348
(535) 555-1348
+919876543210
pyseek89@gmail.com
example123@outlook.com
my_email@yahoo.com
subhankar.rakshit@pyseek.com

Important Note
To follow the code more easily, it helps to be familiar with regular expressions in Python first.


Illustration of the above Code

In this section, I'll illustrate every single part of the above code.

Regex for Indian🇮🇳 Phone Numbers


import pyperclip, re

IndianNumber = re.compile(r'''(
([+]\d{1,2}) # Country Code
(\d{3,10}) # Actual Number
)''',re.VERBOSE)

Example: +919876543210.

This is an example of a telephone number in India. The plus sign (+) followed by the first two digits (91) is the country code.

The next four digits indicate the operator code and the remaining digits are the subscriber's unique number.

In the code, the regex 'IndianNumber' matches a '+' sign followed by a one- or two-digit country code and then 3 to 10 more digits. For the example above, that is '+91' followed by the 10-digit number.
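To see what this pattern actually captures, here is a minimal sketch that runs it on a made-up sentence. The sample text and the `indian_only` name are assumptions added for illustration; the pattern itself is the one from the tutorial.

```python
import re

# Same pattern as in the tutorial: '+' and a 1-2 digit country code, then 3-10 digits.
IndianNumber = re.compile(r'''(
    ([+]\d{1,2})   # Country code
    (\d{3,10})     # Actual number
)''', re.VERBOSE)

# Made-up sample text (not from the tutorial).
sample = 'Call +919876543210 or +442079460958 today.'

matches = IndianNumber.findall(sample)

# Each match is a tuple: (whole number, country code, rest of the number).
# The regex itself accepts any country code, so we filter for '+91' afterwards.
indian_only = [whole for whole, code, rest in matches if code == '+91']
print(indian_only)
```

Note that `\d{1,2}` is greedy, so it takes two digits whenever two are available. A number with a one-digit country code would therefore be split in the wrong place, which is one reason the program later keeps only the '+91' matches.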

Regex for US🇺🇸 Phone Numbers


phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? # Area Code(Optional)
(\s|-|\.) # Separator
(\d{3}) # First Three Digits
(\s|-|\.) # Separator
(\d{4}) # Last Four Digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # Extension(Optional)
)''', re.VERBOSE)

Let's see the format of a US telephone number: 535-555-1348 or (535) 555-1348 or 535 555 1348, etc. 

The purpose of each line of the regex is noted in the comment on its right. Suppose a phone number has an extension, like 535-555-1348 x99. In this case, the last line of 'phoneRegex', the optional extension group, matches the ' x99' part.
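The following short sketch shows the pattern at work on two of the formats mentioned above, including one with an extension. The sample string and the `found` name are assumptions for illustration only.

```python
import re

# Same pattern as in the tutorial.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # Area code (optional)
    (\s|-|\.)                       # Separator
    (\d{3})                         # First three digits
    (\s|-|\.)                       # Separator
    (\d{4})                         # Last four digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # Extension (optional)
)''', re.VERBOSE)

# Made-up sample text (not from the tutorial).
sample = 'Office: 535-555-1348 x99, mobile: (535) 555-1348'

found = phoneRegex.findall(sample)
for groups in found:
    # groups[0] is the complete match; groups[8] holds the extension digits, if any.
    print(groups[0], '| extension:', groups[8] or 'none')
```

One detail worth knowing: in the alternative `ext.` the dot is not escaped, so it stands for any character rather than a literal period. If you only want to accept "ext." with a real dot, `ext\.` would be the stricter choice.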

Regex for Email addresses


emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+ # Username
@ # @Symbol
[a-zA-Z0-9.-]+ # Domain Name
(\.[a-zA-Z]{2,4}) # dot-something
)''', re.VERBOSE)

The username of an email address contains one or more of the following characters: lowercase and uppercase letters, digits (0-9), an underscore (_), a dot (.), a plus sign (+), a percent sign (%), or a hyphen (-).

The username and domain name are separated by a @ symbol.

After the @ symbol comes the domain name, which contains one or more of the following: lowercase and uppercase letters, digits, a dot, or a hyphen. Last comes the dot-something part, such as "dot-com" or "dot-net". The regex accepts two to four letters here.
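Here is a small sketch of the email pattern on made-up addresses, including one case that shows a limit of the two-to-four-letter ending. The sample strings and variable names are assumptions for illustration.

```python
import re

# Same pattern as in the tutorial.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # Username
    @                    # @ symbol
    [a-zA-Z0-9.-]+       # Domain name
    (\.[a-zA-Z]{2,4})    # dot-something
)''', re.VERBOSE)

# Made-up sample text (not from the tutorial).
sample = 'Write to pyseek89@gmail.com or first.last+news@mail.example.org.'
emails = [groups[0] for groups in emailRegex.findall(sample)]
print(emails)

# Endings longer than four letters are cut short by the {2,4} limit.
truncated = emailRegex.findall('info@example.technology')[0][0]
print(truncated)
```

The second result comes out as 'info@example.tech', so if you expect longer endings you may want to widen `{2,4}` to something like `{2,}`.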

Find Matches in the Clipboard Text


text = str(pyperclip.paste())
phone_groups = phoneRegex.findall(text)
email_groups = emailRegex.findall(text)
Indian_Contacts = IndianNumber.findall(text)

matched = []

The variable 'text' stores the text you copied from a webpage or a text document. Because each regex contains groups, the .findall(text) method returns a list of tuples, one tuple per match, with one item for each group in the pattern.

Later we add those matched phone numbers and email addresses to the list 'matched'.
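If the shape of the `findall` result is unclear, this tiny sketch may help. It uses a deliberately simple, made-up pattern (not one from the tutorial) to contrast a regex without groups and a regex with groups.

```python
import re

text = 'a1 b2 c3'  # tiny made-up input

# No groups in the pattern: findall returns a list of matched strings.
plain = re.findall(r'[a-z]\d', text)
print(plain)   # ['a1', 'b2', 'c3']

# Several groups in the pattern: findall returns a list of tuples,
# one tuple per match and one item per group (outer group first).
pairs = re.findall(r'(([a-z])(\d))', text)
print(pairs)   # [('a1', 'a', '1'), ('b2', 'b', '2'), ('c3', 'c', '3')]
```

The tutorial's patterns wrap everything in one outer group, which is why item 0 of every tuple is the complete match.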

The For Loops


for group in phone_groups:
    matched.append(group[0])
    # phoneNum = '-'.join([group[1], group[3], group[5]])
    # if group[8] != '':
    #     phoneNum += ' x' + group[8]
    # matched.append(phoneNum)

phoneRegex.findall(text): It returns a list with one tuple for every match of the phoneRegex pattern, which looks like the following.

[('535-555-1348', '535', '-', '555', '-', '1348', '', '', ''), ('(535) 555-1348', '(535)', ' ', '555', '-', '1348', '', '', '')].

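Now, look at the first line inside the loop. We take only the 0th item, the complete phone number, from every tuple in the returned list. The commented-out lines show an alternative that rebuilds the number from its individual groups.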
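As a rough sketch of the group-based alternative, the helper below rebuilds every number in one uniform style from the tuple items. The `normalize` function, the sample text, and the choice to strip the parentheses from the area code are assumptions added here, not part of the original program.

```python
import re

# Same pattern as in the tutorial.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # Area code (optional)
    (\s|-|\.)                       # Separator
    (\d{3})                         # First three digits
    (\s|-|\.)                       # Separator
    (\d{4})                         # Last four digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # Extension (optional)
)''', re.VERBOSE)


def normalize(groups):
    """Hypothetical helper: build 'AAA-PPP-LLLL xEXT' from one findall tuple."""
    area = groups[1].strip('()')              # drop parentheses around the area code
    parts = [p for p in (area, groups[3], groups[5]) if p]
    number = '-'.join(parts)                  # join takes ONE iterable of strings
    if groups[8] != '':
        number += ' x' + groups[8]            # groups[8] holds the extension digits
    return number


# Made-up sample text (not from the tutorial).
sample = '(535) 555-1348 and 535.555.1348 ext 204'
normalized = [normalize(g) for g in phoneRegex.findall(sample)]
print(normalized)
```

Appending `group[0]` keeps each number exactly as it was written, while this approach trades that for a consistent format in the output.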


for group in Indian_Contacts:
    if group[1] == '+91':
        phoneNum = group[1] + group[2]
        matched.append(phoneNum)

for group in email_groups:
    matched.append(group[0])

Here the logic is the same as before, except that the first loop keeps only the numbers whose country code is '+91'.

Examples of the returned lists: [('+919876543210', '+91', '9876543210')] for Indian_Contacts and [('pyseek89@gmail.com', '.com'), ('example123@outlook.com', '.com'), ('my_email@yahoo.com', '.com'), ('subhankar.rakshit@pyseek.com', '.com')] for email_groups.

Printing the Final Result


if len(matched) > 0:
    pyperclip.copy('\n'.join(matched))
    print('Copied to clipboard!\n')
    print('\n'.join(matched))
else:
    print('No Phone Numbers or Emails found')

Here, the code checks whether the list 'matched' is empty. If it is not empty, the program has found matching content in the text, so it copies the result to the clipboard and prints it. Otherwise, it prints this message: "No Phone Numbers or Emails found".
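If you want to try the whole flow without touching the clipboard, one option is to wrap the steps in a function that takes a plain string. This is only a sketch: the `extract_contacts` name and the sample text are assumptions, and the three patterns are copied unchanged from the program above.

```python
import re

# The three patterns from the tutorial, unchanged.
IndianNumber = re.compile(r'''(
    ([+]\d{1,2})
    (\d{3,10})
)''', re.VERBOSE)

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
)''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9.-]+
    (\.[a-zA-Z]{2,4})
)''', re.VERBOSE)


def extract_contacts(text):
    """Hypothetical helper: return US phones, +91 numbers, then emails found in text."""
    matched = [groups[0] for groups in phoneRegex.findall(text)]
    matched += [g[1] + g[2] for g in IndianNumber.findall(text) if g[1] == '+91']
    matched += [groups[0] for groups in emailRegex.findall(text)]
    return matched


# Made-up sample text (not from the tutorial).
sample = 'Contact: 535-555-1348, +919876543210, pyseek89@gmail.com'
result = extract_contacts(sample)
print('\n'.join(result) if result else 'No Phone Numbers or Emails found')
```

With a function like this, the clipboard part becomes a thin wrapper: pass `pyperclip.paste()` in and hand the joined result to `pyperclip.copy()`.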

Summary

In this Python tutorial, we saw how to extract email addresses and phone numbers from a webpage or text using a Python program.

We performed this task with the help of Python regular expressions. For phone numbers, we covered US and Indian examples only.

If you are from a different country, you need to create a different regex for it. Do try it. If there is any difficulty, let me know. I will guide you.
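As a starting point for other countries, here is one possible sketch that builds a pattern from a list of country codes you choose. The `build_number_regex` helper, the 6 to 12 digit length range, and the sample numbers are all assumptions for illustration; real numbering rules differ from country to country, so adjust them to your case.

```python
import re


def build_number_regex(country_codes):
    """Hypothetical helper: compile a pattern for '+<code><digits>' limited to the given codes."""
    # Try longer codes first so a short code cannot shadow a longer one.
    ordered = sorted(country_codes, key=len, reverse=True)
    alternatives = '|'.join(re.escape(code) for code in ordered)
    # Assumption: 6 to 12 digits after the country code.
    return re.compile(r'\+(' + alternatives + r')(\d{6,12})')


numberRegex = build_number_regex(['91', '44'])

# Made-up sample text (not from the tutorial).
sample = 'India: +919876543210, UK: +447911123456, other: +33123456789'
found = ['+' + code + rest for code, rest in numberRegex.findall(sample)]
print(found)
```

Listing the accepted country codes explicitly avoids guessing where the code ends and the subscriber number begins, which a generic `\d{1,2}` cannot know.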


Thanks for reading!💙

PySeek

Subhankar Rakshit

Meet Subhankar Rakshit, a Computer Science postgraduate (M.Sc.) and the creator of PySeek. Subhankar is a programmer who specializes in the Python language. With several years of experience under his belt, he has developed a deep understanding of software development. He enjoys writing blogs on various topics related to Computer Science, Python Programming, and Software Development.
