Working with PDF Files in Python - Complete Tutorial

This tutorial covers working with PDF files in Python: extracting data, rotating, splitting, and merging pages, adding a watermark, and encrypting and decrypting files.

Introduction

Python is a versatile language, in large part because of its many libraries, modules, and packages for different fields. One of them is PyPDF2, a module that provides functions for working with PDF files in Python.

PyPDF2 is the successor to pyPdf, which was first released in 2005. It builds on pyPdf and adds some extra features. The examples in this tutorial use the PyPDF2 1.x API (PdfFileReader, PdfFileWriter, getPage, and so on), whose 1.26.0 release dates from 2016.

In this tutorial, you will learn how to:

  1. Extract data from a PDF file.
  2. Rotate only one page of a PDF file.
  3. Rotate only specific pages of a PDF file.
  4. Split a PDF file.
  5. Merge more than one PDF file.
  6. Add a watermark to PDFs.
  7. Encrypt a PDF file.
  8. Decrypt a PDF file.

Requirements

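Install PyPDF2: pip install PyPDF2==1.26.0

The code in this tutorial uses the PyPDF2 1.x API. PyPDF2 3.0 and later removed these names (PdfFileReader, getPage, and others), so pin the older release to run the examples as written.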

How to read data from a PDF file

Read data from a PDF file using python

We'll extract the text from one page of a sample file, LinuxForDevelopers.pdf.

Code


'''Extract the data'''

from PyPDF2 import PdfFileReader

# Open the PDF file in binary read mode ('rb')
with open('LinuxForDevelopers.pdf', 'rb') as pdfFile:
    pdfReader = PdfFileReader(pdfFile)

    # Page indices start at 0, so index 20 is the 21st page
    thePage = pdfReader.getPage(20)

    print(thePage.extractText())
    print("Total Pages: ", pdfReader.numPages)

Output

Extract data from a PDF file using python.

The call getPage(20) in the code above selects the page the text is extracted from. Page indices start at 0, so index 20 is the 21st page of the document.
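Because getPage() counts from 0 while PDF viewers number pages from 1, it is easy to be off by one. Below is a minimal sketch of a conversion helper; the name page_index is hypothetical and not part of PyPDF2.

```python
def page_index(page_number, total_pages):
    """Convert a 1-based page number to a 0-based page index."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(
            f"page {page_number} is outside the range 1..{total_pages}"
        )
    return page_number - 1


# Page 21 of a 100-page document is index 20.
print(page_index(21, 100))  # prints 20
```

With a reader open, the result could be passed straight on, for example pdfReader.getPage(page_index(21, pdfReader.numPages)).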

How to Rotate one page of a PDF file

Nowadays, we can easily turn photos into PDF files by scanning them with a phone camera, and many apps are available for the task.

Suppose you are combining some scanned images into a single PDF, but one photo was captured in landscape mode instead of portrait. That one sideways page can spoil the look of the whole file.

Fixing just one or a few pages by hand is tedious, but Python can do it in a few lines. We'll write a program to fix this.

First, we'll rotate a single page of a PDF file. Rotating more than one page requires a loop over the page numbers, which the next section covers.

Code


'''Rotate only one page'''

from PyPDF2 import PdfFileReader, PdfFileWriter

pdfFile = 'LinuxForDevelopers.pdf'

pdfReader = PdfFileReader(pdfFile)
pdfWriter = PdfFileWriter()

# Page indices start at 0, so this is the first page
thePage = pdfReader.getPage(0)

# Rotate 90 degrees counter-clockwise.
# For a clockwise rotation, use this line instead:
# thePage.rotateClockwise(90)
thePage.rotateCounterClockwise(90)

# Only the rotated page is added, so result.pdf contains a single page
pdfWriter.addPage(thePage)
with open("result.pdf", 'wb') as resultPdf:
    pdfWriter.write(resultPdf)

Output

Rotate a single page of a PDF file using Python

I've rotated the page 90 degrees counter-clockwise. To rotate clockwise instead, call rotateClockwise(90), as in the commented-out line of the code.
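The two directions are interchangeable: a counter-clockwise turn is the same as a clockwise turn by the complementary angle, and PDF pages can only be rotated in multiples of 90 degrees. The sketch below shows the arithmetic; clockwise_equivalent is a hypothetical helper, not a PyPDF2 function.

```python
def clockwise_equivalent(angle, counter_clockwise=False):
    """Return the clockwise angle (0, 90, 180 or 270) for a rotation."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    if counter_clockwise:
        angle = -angle
    # The modulo folds negative and oversized angles into 0..359.
    return angle % 360


print(clockwise_equivalent(90, counter_clockwise=True))  # prints 270
```

So rotateCounterClockwise(90) and rotateClockwise(270) should leave the page in the same orientation.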

Rotate more than one page of a PDF file

As I mentioned in the earlier section, suppose a situation arises where we need to rotate more than one page of a PDF file.

Rotate multiple pages of a PDF file using python

Let's solve this issue.

Code


'''Rotate only a few pages'''

from PyPDF2 import PdfFileReader, PdfFileWriter

# Zero-based indices of the pages that need rotating
need_to_fix = [0, 3, 4]

pdfReader = PdfFileReader('merged_file.pdf')
pdfWriter = PdfFileWriter()

for page in range(pdfReader.getNumPages()):
    thePage = pdfReader.getPage(page)
    if page in need_to_fix:
        thePage.rotateClockwise(90)

    # Every page is written, rotated or not
    pdfWriter.addPage(thePage)

with open('fixed_file.pdf', 'wb') as fixed_file:
    pdfWriter.write(fixed_file)
print("Done!")

Output

After performing rotation operation on PDF file in python
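The need_to_fix list holds zero-based indices, which are awkward to type when you are reading page numbers off a PDF viewer. As a rough sketch, a page specification such as "1, 4-5" could be converted into that list like this; parse_pages and its spec format are assumptions made for illustration, not part of PyPDF2.

```python
def parse_pages(spec):
    """Turn a spec such as "1, 4-5" into sorted 0-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Page numbers are 1-based; indices are 0-based.
        indices.update(range(first - 1, last))
    return sorted(indices)


need_to_fix = parse_pages("1, 4-5")
print(need_to_fix)  # prints [0, 3, 4]
```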

How to split the pages of a PDF file

Now let's split a multi-page PDF file into several single-page PDFs.

Code


'''Split a PDF file'''

from PyPDF2 import PdfFileReader, PdfFileWriter

pdfFile = 'LinuxForDevelopers.pdf'

pdfReader = PdfFileReader(pdfFile)

# Split off the first ten pages (indices 0 to 9),
# writing each one to its own file: 0.pdf, 1.pdf, ...
for page in range(0, 10):
    pdfWriter = PdfFileWriter()
    pdfWriter.addPage(pdfReader.getPage(page))

    splitPage = f'{page}.pdf'
    with open(splitPage, 'wb') as resultPdf:
        pdfWriter.write(resultPdf)

Output

Split a PDF file page by page using python

Merge PDF files

Code


'''Merge PDF files'''

import PyPDF2

# List of PDF files to merge, in output order
files = ['0.pdf', '1.pdf', '2.pdf', '3.pdf']

pdfWriter = PyPDF2.PdfFileWriter()

for file in files:
    pdfReader = PyPDF2.PdfFileReader(file)
    # Each input here has a single page, so only page 0 is copied
    pdfWriter.addPage(pdfReader.getPage(0))

with open('merged_file.pdf', 'wb') as mergePdf:
    pdfWriter.write(mergePdf)

Output

Merge multiple PDF pages to one PDF file using python
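The merged file follows the order of the files list, so the list has to be in page order. If the names are collected from a folder rather than typed by hand, plain alphabetical sorting puts '10.pdf' before '2.pdf'. A small sketch of a numeric sort, assuming every name is a number followed by '.pdf' as in the split example (numeric_order is a hypothetical helper):

```python
from pathlib import Path


def numeric_order(files):
    """Sort names like '10.pdf' by their numeric stem, not alphabetically."""
    return sorted(files, key=lambda name: int(Path(name).stem))


files = numeric_order(["10.pdf", "2.pdf", "0.pdf", "1.pdf"])
print(files)  # prints ['0.pdf', '1.pdf', '2.pdf', '10.pdf']
```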

Add Watermark in PDF using Python

To perform this task, create a watermark of your choice as a separate single-page PDF. Then merge that page with every page of the PDF file you want to watermark.

Sample PDF file

Code


'''Add watermark to a PDF'''

from PyPDF2 import PdfFileReader, PdfFileWriter

pdfFile = 'sample_file.pdf'
watermarkFile = 'watermark.pdf'

pdfReader = PdfFileReader(pdfFile)
pdfWriter = PdfFileWriter()
wmarkReader = PdfFileReader(watermarkFile)

for pageNum in range(pdfReader.getNumPages()):
    page = pdfReader.getPage(pageNum)
    # Draw the watermark page on top of the current page
    page.mergePage(wmarkReader.getPage(0))
    pdfWriter.addPage(page)

with open('watermarked_file.pdf', 'wb') as result:
    pdfWriter.write(result)

Output

Add watermark to a PDF file using python

How to encrypt a PDF file using Python

It's wise to keep personal files encrypted to prevent unauthorized access. In this section, you'll learn to encrypt a PDF file using a few lines of Python code.

Code


'''Encrypt a PDF file'''

from PyPDF2 import PdfFileReader, PdfFileWriter

# Read the PDF file
pdfFile = PdfFileReader('sample_file.pdf')
# Create a PdfFileWriter object
pdfWriter = PdfFileWriter()

password = '00001111'

for page in range(pdfFile.getNumPages()):
    pdfWriter.addPage(pdfFile.getPage(page))

# Encrypt the output; the password is required to open it
pdfWriter.encrypt(user_pwd=password)

# The result file: "encrypted.pdf"
with open('encrypted.pdf', 'wb') as result:
    pdfWriter.write(result)

Output

Encrypt a PDF file using python
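The password '00001111' in the example is only a placeholder and would be easy to guess. One possible way to generate a stronger one is Python's standard secrets module, as sketched below; make_password, the default length of 16, and the letters-and-digits alphabet are assumptions for illustration.

```python
import secrets
import string


def make_password(length=16):
    """Build a random password from ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    # secrets.choice draws from a cryptographically secure source.
    return "".join(secrets.choice(alphabet) for _ in range(length))


password = make_password()
print(len(password))  # prints 16
```

The generated string can then be passed to encrypt() in place of the hard-coded value. Store it somewhere safe, because it is needed again to open the file.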

Decrypt a password-protected PDF file using Python

Now we will decrypt the PDF file that we've encrypted in the previous section.

Code


'''Decrypt a PDF file'''

from PyPDF2 import PdfFileReader, PdfFileWriter

# Read the encrypted PDF file
pdfFile = PdfFileReader('encrypted.pdf')
pdfWriter = PdfFileWriter()

password = '00001111'

# Unlock the file with the password used for encryption
pdfFile.decrypt(password)

for page in range(pdfFile.getNumPages()):
    pdfWriter.addPage(pdfFile.getPage(page))

# The result file: "decrypted.pdf"
with open('decrypted.pdf', 'wb') as result:
    pdfWriter.write(result)

Output

Decrypt a PDF file using python
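The example above ignores the value that decrypt() returns. In the PyPDF2 1.x documentation, that value is 0 when the password fails, 1 when it matches the user password, and 2 when it matches the owner password. A minimal sketch of checking it before copying pages follows; describe_decrypt and its messages are hypothetical.

```python
# Return codes as described in the PyPDF2 1.x documentation for decrypt().
DECRYPT_OUTCOMES = {
    1: "user password accepted",
    2: "owner password accepted",
}


def describe_decrypt(result):
    """Raise on a failed decrypt() result, otherwise describe the match."""
    if result == 0:
        raise ValueError("wrong password: the PDF could not be decrypted")
    return DECRYPT_OUTCOMES[result]


print(describe_decrypt(1))  # prints user password accepted
```

In the decryption script, this could wrap the call as describe_decrypt(pdfFile.decrypt(password)), so that a wrong password stops the program early instead of failing later when the pages are read.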

Conclusion

In this tutorial, you've learned several methods for working with PDF files in Python: extracting data, rotating, splitting, and merging pages, adding a watermark, and encrypting and decrypting a PDF file.

I hope you loved this tutorial. Please share your love❤️ and do comment below.

Thanks for reading!💙

Subhankar Rakshit

Meet Subhankar Rakshit, a Computer Science postgraduate (M.Sc.) and the creator of PySeek. Subhankar is a programmer who specializes in Python. With several years of experience under his belt, he has developed a deep understanding of software development. He enjoys writing blog posts on various topics related to Computer Science, Python Programming, and Software Development.
