How to remove a watermark from a PDF using Python?
July 11, 2023, 5:39 a.m.
Digital documents come in many formats, and PDF (Portable Document Format) is one of the most common. PDF files sometimes carry watermarks that interfere with readability. A watermark is usually a faint imprint added to a document to protect it and establish ownership. There are nonetheless legitimate reasons to remove one, and Python, a high-level general-purpose programming language, has several libraries that can help.
## Install Required Python Libraries
Before we proceed, make sure the required Python libraries are installed in your environment. The core libraries used in this tutorial are PyPDF2 and PDFMiner. The maintained PDFMiner distribution is published on PyPI as `pdfminer.six`, and it provides the `pdfminer.high_level` module used later in this tutorial.
You can install both with pip, the package installer for Python, by running the following commands in your terminal:
pip install pypdf2
pip install pdfminer.six
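To confirm that the installation worked, a short check like the one below can help. It is a minimal sketch that assumes the PyPI distribution names `PyPDF2` and `pdfminer.six`; `installed_version` is a hypothetical helper name, not part of either library.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as assumed on PyPI; the import names differ
    # (PyPDF2 and pdfminer respectively).
    for name in ("PyPDF2", "pdfminer.six"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

If either line reports "not installed", rerun the corresponding pip command in the same environment you use to run your scripts.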
## Removing Watermarks Using PyPDF2
PyPDF2 is one of the most popular Python libraries for working with PDF files. It can extract document information and split, merge, crop, and transform PDF pages. More importantly for us, it can rebuild a document page by page, which makes it possible to remove certain kinds of watermarks.
The first step is to import the reader and writer classes. PyPDF2 3.0 removed the older `PdfFileReader` and `PdfFileWriter` names, so the current `PdfReader` and `PdfWriter` classes are used here.
from PyPDF2 import PdfReader, PdfWriter
Next, we construct a `PdfReader` object for the PDF from which we intend to remove the watermark by passing it the name of our file.
reader = PdfReader('watermarked_file.pdf')
Then we create a `PdfWriter`, which can be thought of as an empty document to which we'll add the pages of our watermarked file.
writer = PdfWriter()
PyPDF2 treats each page of the PDF as a separate object, so we loop over the pages and add each one to the new document. Note that `add_page()` copies a page as it is, including its annotations and content stream. A watermark disappears only if the object that carries it is removed from the page before the page is added.
for page in reader.pages:
    writer.add_page(page)
Finally, we write the result to a new PDF file:
with open('unwatermarked_file.pdf', 'wb') as out:
    writer.write(out)
While PyPDF2 is quite versatile, this page-level approach is limited to watermarks that sit in a separate layer or object on top of the original content, such as those added with Adobe Acrobat. A watermark merged into the page content itself needs a different approach.
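As one concrete case, the sketch below drops watermark annotations while copying pages. It assumes PyPDF2 2.0 or later and assumes the watermark is stored as a page annotation with `/Subtype /Watermark`. Watermarks drawn in the page content stream are not affected by it. The function names `is_watermark_annotation` and `strip_watermark_annotations` are hypothetical helpers, not part of PyPDF2.

```python
def is_watermark_annotation(annotation):
    """Return True if an annotation dictionary has the /Watermark subtype."""
    return annotation.get("/Subtype") == "/Watermark"


def strip_watermark_annotations(src_path, dst_path):
    """Copy src_path to dst_path, leaving out /Watermark annotations.

    Returns the number of annotations removed.
    """
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter
    from PyPDF2.generic import ArrayObject, NameObject

    reader = PdfReader(src_path)
    writer = PdfWriter()
    removed = 0
    for page in reader.pages:
        if "/Annots" in page:
            # Indexing the page dictionary resolves an indirect reference.
            annotations = page["/Annots"]
            kept = [a for a in annotations
                    if not is_watermark_annotation(a.get_object())]
            removed += len(annotations) - len(kept)
            page[NameObject("/Annots")] = ArrayObject(kept)
        writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)
    return removed
```

Calling `strip_watermark_annotations('watermarked_file.pdf', 'unwatermarked_file.pdf')` returns the number of annotations it removed. A result of 0 suggests the watermark is embedded some other way.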
## Removing Watermarks Using PDFMiner
For more complex watermark types, such as those integrated into the content layer of the PDF, we need to delve deeper, to the point of manipulating the PDF's underlying structure. In situations like these, a library like PDFMiner might be more suitable.
PDFMiner can explore the actual structure of a PDF and its content. Its `pdf2txt.py` command-line tool converts PDF files to plain text, following the layout of the original PDF. Because it extracts text and leaves out graphics, PDFMiner lets us pull out the content we want without image-based watermark artefacts.
Here is how we can remove watermarks from a PDF with the help of PDFMiner:
from pdfminer.high_level import extract_text
text = extract_text('watermarked_file.pdf')
with open('unwatermarked_file.txt', 'w', encoding='utf-8') as out:
    out.write(text)
This produces a text file that contains the document's text but not its original formatting. While this might not be ideal for every scenario, it can be a practical solution when all you need is the raw text and you can afford to lose the original layout and images.
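A watermark that is itself text (for example a "CONFIDENTIAL" stamp) can still show up in the extracted output, typically as a repeated line. The sketch below filters such lines after extraction. It assumes you know the watermark string and that it appears on a line of its own; `drop_watermark_lines` is a hypothetical helper, not part of PDFMiner.

```python
def drop_watermark_lines(text, watermark):
    """Remove lines that consist only of the watermark string (case-insensitive)."""
    target = watermark.strip().casefold()
    if not target:
        # An empty watermark would match every blank line, so leave the text alone.
        return text
    kept = [line for line in text.split("\n")
            if line.strip().casefold() != target]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Title\nCONFIDENTIAL\nBody text\n confidential \nEnd"
    print(drop_watermark_lines(sample, "Confidential"))
```

Applied to the `text` returned by `extract_text`, this keeps every other line untouched. A watermark that is interleaved with body text on the same line would need a more careful rule.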
PDF watermark removal in Python tends to rely on tools that ignore or sidestep the watermark rather than cleanly separate it from the page content. As a result, none of the currently available libraries can guarantee perfect, seamless watermark removal, especially when you want to retain elaborate formatting and graphics in the output file. However, the techniques covered here provide a good starting point.
Remember, when you're removing watermarks using Python, always respect copyright laws and individual ownership rights. Utilize these techniques responsibly and ethically. Happy coding!