Working with PDF Files in Python - Complete Tutorial


Introduction

Python is a versatile programming language, in large part because of the many libraries, modules, and packages available for different fields. One of them is PyPDF2, a module that provides many functions for working with PDF files in Python.

PyPDF2 is the updated version of pyPdf, which was released in 2005. PyPDF2 adds some extra features to pyPdf; version 1.26.0, whose API this tutorial uses, was released in 2016.

In this tutorial, you will learn how to:

  1. Extract data from a PDF file.
  2. Rotate only one page of a PDF file.
  3. Rotate only specific pages of a PDF file.
  4. Split a PDF file.
  5. Merge more than one PDF file.
  6. Add a watermark to the PDFs.
  7. Encrypt a PDF file.
  8. Decrypt a PDF file.


Requirements

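Install PyPDF2: pip install "PyPDF2<2.0"

The examples below use the PyPDF2 1.x names (PdfFileReader, PdfFileWriter, getPage, and so on). Later major versions deprecate or remove these names, so pin a 1.x release to run the code as written.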

How to read data from a PDF file

We will extract all the text from one page of a PDF file using the Python program below.

Code


'''Extract the data'''

from PyPDF2 import PdfFileReader

# Open the PDF file in binary read mode ('rb')
pdfFile = open('LinuxForDevelopers.pdf', 'rb')

pdfReader = PdfFileReader(pdfFile)

# Page index 20 (indexes start at 0, so this is the 21st page)
thePage = pdfReader.getPage(20)

print(thePage.extractText())
print("Total Pages: ", pdfReader.numPages)

pdfFile.close()

Output

Extracted data from a PDF file using a Python program.

Look at the getPage(20) call in the code above. It selects the page from which the data is extracted. Page indexes start at 0, so index 20 is the 21st page of the file.
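Because getPage() counts pages from 0, it is easy to be off by one. The sketch below is one way to guard against that. The helper names page_index and extract_page_text are hypothetical (they are not part of PyPDF2), and extract_page_text assumes a reader object with the PyPDF2 1.x numPages, getPage, and extractText members used above.

```python
def page_index(page_number, total_pages):
    """Convert a 1-based page number into the 0-based index getPage() expects."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(f"page {page_number} is outside 1..{total_pages}")
    return page_number - 1


def extract_page_text(reader, page_number):
    """Return the text of one page, addressed by its 1-based page number.

    reader is assumed to be a PyPDF2 1.x PdfFileReader (hypothetical helper).
    """
    index = page_index(page_number, reader.numPages)
    return reader.getPage(index).extractText()
```

With a PdfFileReader named pdfReader, extract_page_text(pdfReader, 21) would read the same page as pdfReader.getPage(20).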

How to Rotate one page of a PDF file

Nowadays, we can easily turn multiple photos into a PDF file by scanning them with a mobile camera. Many applications are available for this task.

For instance, suppose you are merging some pre-scanned images into one PDF file, but you captured one photo in landscape mode instead of portrait. That single sideways page can spoil the look of the whole PDF file.

In such a situation, it is tedious to fix just one or a few pages by hand. But don't worry, Python can do it in the blink of an eye. We'll create a Python program to fix this issue.

First, we will learn how to rotate a single page of a PDF file. To rotate more than one page, we need a loop that iterates through the page numbers, which the next section covers.

Code


'''Rotate only one page'''

from PyPDF2 import PdfFileReader, PdfFileWriter

pdfFile = 'LinuxForDevelopers.pdf'

pdfReader = PdfFileReader(pdfFile)
pdfWriter = PdfFileWriter()
resultPdf = open("result.pdf", 'wb')

# Page index 0 (the first page)
thePage = pdfReader.getPage(0)

# Alternative: rotate clockwise instead
# thePage.rotateClockwise(90)
thePage.rotateCounterClockwise(90)

pdfWriter.addPage(thePage)
pdfWriter.write(resultPdf)
resultPdf.close()

Output

Rotated a single page of a PDF file using a Python program

I've rotated the page 90 degrees counter-clockwise (see the image above). To rotate clockwise instead, use the commented-out rotateClockwise(90) line in the code.
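The two directions are interchangeable: turning a page 90 degrees counter-clockwise leaves it in the same orientation as turning it 270 degrees clockwise. The small sketch below works out that equivalence. clockwise_equivalent is a hypothetical helper name, and the multiple-of-90 check reflects the fact that PDF page rotation only works in quarter turns.

```python
def clockwise_equivalent(counter_clockwise_angle):
    """Return the clockwise angle (0, 90, 180 or 270) that gives the same
    orientation as rotating counter-clockwise by the given angle."""
    if counter_clockwise_angle % 90 != 0:
        raise ValueError("page rotation must be a multiple of 90 degrees")
    return (-counter_clockwise_angle) % 360


print(clockwise_equivalent(90))  # 270
```

So thePage.rotateCounterClockwise(90) and thePage.rotateClockwise(270) should leave the page looking the same.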

Rotate more than one page of a PDF file

As mentioned in the earlier section, a situation may arise where we need to rotate more than one page of a PDF file.

Rotate multiple pages of a PDF file using a Python program

Look closely at the image above. Three pages (0, 3, and 4) need to be fixed. Let's do it with the help of a Python program.

Code


'''Rotate only a few pages'''

from PyPDF2 import PdfFileReader, PdfFileWriter

# Zero-based indexes of the pages that need rotating
need_to_fix = [0, 3, 4]

pdfReader = PdfFileReader('merged_file.pdf')
pdfWriter = PdfFileWriter()

fixed_file = open('fixed_file.pdf', 'wb')

for page in range(pdfReader.getNumPages()):
    thePage = pdfReader.getPage(page)
    if page in need_to_fix:
        thePage.rotateClockwise(90)

    pdfWriter.addPage(thePage)

pdfWriter.write(fixed_file)
print("Done!")
fixed_file.close()

Output

Rotation has been performed on multiple PDF pages using a Python program

Now, all pages are oriented correctly.
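If the misaligned pages need different corrections, the list of page indexes can become a mapping from index to angle. The following is a minimal sketch, assuming reader and writer objects with the PyPDF2 1.x methods used above; rotate_pages and its rotations parameter are hypothetical names.

```python
def rotate_pages(reader, writer, rotations):
    """Copy every page from reader to writer, rotating the listed ones.

    rotations maps a 0-based page index to a clockwise angle in degrees,
    for example {0: 90, 3: 270}. Pages not listed are copied unchanged.
    """
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        angle = rotations.get(index)
        if angle:
            page.rotateClockwise(angle)
        writer.addPage(page)
```

Calling rotate_pages(pdfReader, pdfWriter, {0: 90, 3: 90, 4: 90}) would reproduce the loop above; the writer still has to be saved with pdfWriter.write(...).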

How to split the pages of a PDF file

Here, we are going to split a PDF file (with many pages) into several single-page PDFs using a Python program. The code is below.

Code


'''Split a PDF file'''

from PyPDF2 import PdfFileReader, PdfFileWriter

pdfFile = 'LinuxForDevelopers.pdf'

pdfReader = PdfFileReader(pdfFile)

# Split off the pages with indexes 0 to 9, one output file per page.
for page in range(0, 10):
    pdfWriter = PdfFileWriter()
    pdfWriter.addPage(pdfReader.getPage(page))

    splitPage = f'{page}.pdf'
    resultPdf = open(splitPage, 'wb')
    pdfWriter.write(resultPdf)

    resultPdf.close()

Output

A PDF file has been split into several single-page PDF files using a Python program
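Splitting does not have to produce one file per page. As a sketch, the helper below computes page-index ranges for fixed-size chunks; each range could then be fed to its own PdfFileWriter in the same way as the loop above. chunk_ranges is a hypothetical helper, not a PyPDF2 function, and the part_N.pdf names are only illustrative.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return a list of range objects that together cover every page index,
    each spanning at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        range(start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


# Example: a 10-page file split into chunks of 4 pages
for part, pages in enumerate(chunk_ranges(10, 4)):
    print(f"part_{part}.pdf gets page indexes {list(pages)}")
```

The last chunk is simply shorter when the page count is not a multiple of the chunk size.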

Merge PDF files

In the previous step, we split a multi-page PDF file into several single-page PDFs. In this section, we will merge multiple PDFs (in our case, each single-page PDF file) into a single PDF file using a python program.

Code


'''Merge PDF files'''

import PyPDF2

# List of pdf files that are going to be merged.
files = ['0.pdf', '1.pdf', '2.pdf', '3.pdf']

pdfWriter = PyPDF2.PdfFileWriter()

for file in files:
    pdfReader = PyPDF2.PdfFileReader(file)
    pdfWriter.addPage(pdfReader.getPage(0))

mergePdf = open('merged_file.pdf', 'wb')
pdfWriter.write(mergePdf)

mergePdf.close()

Output

Multiple PDF pages are merged into a single PDF file using a Python program

We successfully merged multiple PDFs into a single one.
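The list of files above is typed by hand. If it is built from a directory listing instead, note that plain alphabetical sorting places '10.pdf' before '2.pdf'. The sketch below sorts by the number in the file name, assuming the names follow the '<number>.pdf' pattern produced by the split step; sort_numbered_pdfs is a hypothetical helper.

```python
from pathlib import Path


def sort_numbered_pdfs(names):
    """Sort file names such as '0.pdf', '10.pdf', '2.pdf' by their number."""
    return sorted(names, key=lambda name: int(Path(name).stem))


files = sort_numbered_pdfs(['10.pdf', '2.pdf', '0.pdf', '1.pdf'])
print(files)  # ['0.pdf', '1.pdf', '2.pdf', '10.pdf']
```

The sorted list can then be passed to the merge loop in place of the hand-written one.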

Add Watermark in PDF using Python

To accomplish this task, we need to create a watermark of our choice on a separate single-page PDF and then merge that PDF with each page of the original PDF file. Here, we will perform this task using a Python program.

Sample PDF file

Code


'''Add watermark to a PDF'''

from PyPDF2 import PdfFileReader, PdfFileWriter

pdfFile = 'sample_file.pdf'
watermarkFile = 'watermark.pdf'
result = open('watermarked_file.pdf', 'wb')

pdfReader = PdfFileReader(pdfFile)
pdfWriter = PdfFileWriter()
wmarkReader = PdfFileReader(watermarkFile)

for index in range(pdfReader.getNumPages()):
    page = pdfReader.getPage(index)
    page.mergePage(wmarkReader.getPage(0))
    pdfWriter.addPage(page)

pdfWriter.write(result)
result.close()

Output

A watermark has been added to all the pages of a PDF file using a Python program

As you can see, every page is now watermarked.
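Sometimes a few pages, such as a cover, should stay unmarked. The following sketch adapts the watermark loop to skip chosen page indexes. It assumes reader and writer objects with the PyPDF2 1.x methods used in this tutorial; add_watermark and its skip parameter are hypothetical names.

```python
def add_watermark(reader, writer, watermark_page, skip=()):
    """Copy all pages to writer, merging watermark_page onto each one
    except the 0-based page indexes listed in skip."""
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        if index not in skip:
            # mergePage draws the watermark's content on top of this page
            page.mergePage(watermark_page)
        writer.addPage(page)
```

For example, add_watermark(pdfReader, pdfWriter, wmarkReader.getPage(0), skip={0}) would leave the first page untouched and watermark the rest.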

How to encrypt a PDF file using python

It's wise to keep a personal file encrypted at all times to prevent unauthorized access. In this section, you'll learn to encrypt a PDF file with a password using a few lines of python code.

Code


'''Encrypt a PDF file'''

from PyPDF2 import PdfFileReader, PdfFileWriter

# Read the PDF file
pdfFile = PdfFileReader('sample_file.pdf')
# Create a PdfFileWriter object
pdfWriter = PdfFileWriter()
# The Result file: "encrypted.pdf"
result = open('encrypted.pdf', 'wb')

password = '00001111'

for page in range(pdfFile.getNumPages()):
    pdfWriter.addPage(pdfFile.getPage(page))

# Call the encrypt function
pdfWriter.encrypt(user_pwd=password)
pdfWriter.write(result)
result.close()

Output

A PDF file has been encrypted with a password using a Python program
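A short, hard-coded password like the one in the code above is fine for a demo but easy to guess. As a sketch, Python's standard secrets module can generate a random one instead; generate_password is a hypothetical helper, and the default length is an arbitrary choice.

```python
import secrets


def generate_password(num_bytes=16):
    """Return a random, URL-safe text password built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


password = generate_password()
print("Store this password safely:", password)
```

The result can be passed to pdfWriter.encrypt(user_pwd=password) in place of the fixed string. Keep a copy of it, since the file cannot be opened without it.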

Decrypt a password-protected PDF file using python

In the previous section, we learned how to encrypt a PDF file with a password using a Python program. In this section, we will learn the reverse process: decrypting a password-protected file.

Code


'''Decrypt a PDF file'''

from PyPDF2 import PdfFileReader, PdfFileWriter

# Read the PDF file
pdfFile = PdfFileReader('encrypted.pdf')
pdfWriter = PdfFileWriter()
# The Result file: "decrypted.pdf"
result = open('decrypted.pdf', 'wb')

password = '00001111'

# Call the decrypt function
pdfFile.decrypt(password=password)

for page in range(pdfFile.getNumPages()):
    pdfWriter.addPage(pdfFile.getPage(page))

pdfWriter.write(result)
result.close()

Output

A PDF file has been decrypted using a Python program
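The code above assumes the password is correct. In PyPDF2 1.x, decrypt() returns 0 when the password does not match, and the reader's isEncrypted attribute tells whether a password is needed at all. The sketch below uses both; unlock is a hypothetical helper name, and reader is assumed to be a PdfFileReader.

```python
def unlock(reader, password):
    """Decrypt reader in place if needed; raise ValueError on a wrong password."""
    if not reader.isEncrypted:
        return reader  # nothing to do for an unprotected file
    if reader.decrypt(password) == 0:
        raise ValueError("wrong password: the PDF could not be decrypted")
    return reader
```

Calling unlock(pdfFile, password) before the page loop would stop early with a clear error instead of failing later while copying pages.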

Summary

In this tutorial, you've learned several ways to work with PDF files in Python: extracting data from a PDF, rotating pages, splitting, merging, adding a watermark, and encrypting and decrypting a PDF file.

I hope you loved this tutorial. If you have any doubts, just leave a comment below. You will get a reply soon.

Thanks for reading!💙

PySeek

Subhankar Rakshit

Meet Subhankar Rakshit, a Computer Science postgraduate (M.Sc.) and the creator of PySeek. Subhankar is a programmer who specializes in Python. With several years of experience under his belt, he has developed a deep understanding of software development. He enjoys writing blogs on various topics related to Computer Science, Python Programming, and Software Development.
