How to remove a watermark from a PDF using Python?

July 11, 2023, 5:39 a.m.
Digital documents come in many formats, and PDF (Portable Document Format) is one of the most common. Some PDF files carry watermarks that interfere with the readability of the document. A watermark is usually a faint imprint placed on a document to protect it and establish ownership of the material. Still, there can be legitimate reasons to remove one, and Python, a high-level general-purpose programming language, has several libraries that can help.
This article discusses methods of removing watermarks from a PDF file using Python. We'll look at packages such as PyPDF2 and PDFMiner, which can be useful for this task. Please bear in mind that this guide is for educational purposes, and any use of these techniques should comply with the privacy laws and copyright rules in your region and with the terms of use of the document in question.

## Install Required Python Libraries

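Before we proceed, make sure the required Python libraries are installed in your environment. The core libraries used in this tutorial are PyPDF2 and PDFMiner.
You can install them with pip, the package installer for Python. The PyPDF2 class names used in this tutorial (`PdfFileReader`, `PdfFileWriter`) were removed in PyPDF2 3.0, so the command below pins an earlier release. The `pdfminer.high_level` module used later is provided by the `pdfminer.six` package. Run the following commands in your terminal:

```bash
pip install "PyPDF2<3.0"
pip install pdfminer.six
```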
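To confirm that the installation worked, a quick check along these lines can help. This is a minimal sketch that relies only on the standard library; the helper name `installed_version` is made up for illustration, and the distribution names `PyPDF2` and `pdfminer.six` are assumed to be the ones you installed.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to match what you installed.
for name in ("PyPDF2", "pdfminer.six"):
    print(name, installed_version(name) or "not installed")
```

If either line reports "not installed", rerun the matching pip command in the same environment that runs your script.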

## Removing Watermarks Using PyPDF2

PyPDF2 is one of the most popular Python libraries for working with PDF files. It's used to extract document information, split, merge, crop, and decompose PDF pages. But importantly for us, it can also be used to remove watermarks from our PDF files.

The first step is to import the necessary library.

```python
from PyPDF2 import PdfFileWriter, PdfFileReader
```

After importing the PyPDF2 library, the next step is to construct a `PdfFileReader` object representing the PDF from which we intend to remove the watermark. To do this, we call the `PdfFileReader()` constructor with the name of our file.

```python
# Open the watermarked source document
pdf = PdfFileReader('watermarked_file.pdf')
```

Next, we initialize `PdfFileWriter()`, which can be thought of as a "blank page" to which we'll import the contents of our initial watermarked document.

```python
pdf_writer = PdfFileWriter()
```

It's important to note that PyPDF2 treats each page of the PDF as an individual object. As such, we need to loop through all pages in the PDF, extract each one, and add it to the new PDF file. This copies the page content from the original PDF into the new file; the intent is to leave behind a watermark that is stored as a layer separate from that content.

```python
# Copy every page of the source document into the writer
for page_number in range(pdf.getNumPages()):
    page = pdf.getPage(page_number)
    pdf_writer.addPage(page)
```

Finally, we can write the content to a new PDF file, resulting in a similar PDF without a watermark:

```python
# Save the copied pages to a new file
with open('unwatermarked_file.pdf', 'wb') as out:
    pdf_writer.write(out)
```
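The steps above can be collected into one reusable function. The following is a sketch rather than a definitive implementation: the names `copy_all_pages` and `rewrite_pdf` are hypothetical, and the code assumes a PyPDF2 release older than 3.0, where `PdfFileReader`, `PdfFileWriter`, `getNumPages()`, `getPage()` and `addPage()` are still available.

```python
def copy_all_pages(reader, writer):
    """Append every page of `reader` to `writer` and return the page count."""
    total = reader.getNumPages()
    for page_number in range(total):
        writer.addPage(reader.getPage(page_number))
    return total


def rewrite_pdf(src_path, dst_path):
    """Copy the PDF at src_path, page by page, into a new file at dst_path."""
    # Imported here so copy_all_pages stays usable without PyPDF2 installed.
    # These class names assume PyPDF2 < 3.0.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        copy_all_pages(reader, writer)
        # Write while the source is still open: page data is read lazily.
        with open(dst_path, "wb") as out:
            writer.write(out)
```

A call such as `rewrite_pdf('watermarked_file.pdf', 'unwatermarked_file.pdf')` reproduces the walkthrough in a single step.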

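While PyPDF2 is quite versatile, this method is limited to watermarks that are kept apart from the original page content, which the approach assumes is the case for watermarks added with Adobe Acrobat. Anything drawn as part of a page's own content is copied along with the page, so check the output file to confirm that the watermark is actually gone.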
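When a watermark does end up inside a page's content stream, one possible next step is to look for the marked-content block that wraps it. The PDF specification defines an `/Artifact` tag whose property list can carry `/Subtype /Watermark`, and this sketch assumes the watermarking tool wrote such a block. The function name `strip_watermark_artifacts` is hypothetical, it works on an already decoded content stream given as bytes, and it does not handle nested marked content or an `EMC` token inside a string. How the bytes are read from and written back to a page depends on the library and is not shown.

```python
import re

# One marked-content block tagged as a watermark artifact:
#   /Artifact << ... /Watermark ... >> BDC ... EMC
# The (?:(?!>>).) parts keep the match inside a single property list.
_WATERMARK_BLOCK = re.compile(
    rb"/Artifact\s*<<(?:(?!>>).)*?/Watermark(?:(?!>>).)*?>>\s*BDC.*?EMC\s*",
    re.DOTALL,
)


def strip_watermark_artifacts(content):
    """Return the content stream with watermark artifact blocks removed."""
    return _WATERMARK_BLOCK.sub(b"", content)


# Hand-written sample stream, not taken from a real document.
sample = (
    b"q 1 0 0 1 0 0 cm /Im0 Do Q\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC\n"
    b"q 0.5 g /Fm0 Do Q\n"
    b"EMC\n"
    b"BT /F1 12 Tf (Body text) Tj ET\n"
)
cleaned = strip_watermark_artifacts(sample)
```

Other artifact blocks, such as headers and footers, are left in place because their property lists do not name `/Watermark`.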

## Removing Watermarks Using PDFMiner

For more complex watermark types, such as those integrated into the content layer of the PDF, we need to go deeper and work with the PDF's underlying structure. In situations like these, a library like PDFMiner might be more suitable.
PDFMiner can explore the actual structure of a PDF and its content. Its `pdf2txt.py` command-line script converts PDF files to plain text that follows the layout of the original PDF. By processing the text and leaving out graphics, PDFMiner lets us extract the content we want without image-based watermark artefacts. A watermark that is itself text will still appear in the extracted output.

Here is how we can remove watermarks from a PDF with the help of PDFMiner:

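```python
from pdfminer.high_level import extract_text

# Extract the text of every page; graphics are ignored
text = extract_text('watermarked_file.pdf')

# Save the extracted text to a plain-text file
with open('unwatermarked_file.txt', 'w', encoding='utf-8') as out:
    out.write(text)
```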

This produces a text file that contains the document's text but lacks the original formatting. While this might not be ideal for every scenario, it can be a practical solution when all you need is the raw text of the document and you can afford to lose the original layout and images.
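If the watermark is itself text, it usually shows up as the same line on every page of the extracted output. One way to filter it, sketched here under stated assumptions, is to drop lines that recur on most pages. The function name `drop_repeated_lines` and the `min_share` threshold are made up for illustration, and the code assumes pages in the extracted text are separated by a form feed character (`\f`). Note that running headers and footers that repeat on every page would be removed as well.

```python
from collections import Counter


def drop_repeated_lines(text, min_share=0.8):
    """Remove lines that recur on at least `min_share` of the pages."""
    pages = [page for page in text.split("\f") if page.strip()]
    if len(pages) < 2:
        return text  # a single page gives nothing to compare against

    # Count each distinct non-empty line once per page.
    seen_on = Counter()
    for page in pages:
        seen_on.update({line.strip() for line in page.splitlines() if line.strip()})

    cutoff = min_share * len(pages)
    repeated = {line for line, count in seen_on.items() if count >= cutoff}

    cleaned_pages = [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]
    return "\f".join(cleaned_pages)


# Synthetic three-page example with a text watermark on every page.
demo = (
    "CONFIDENTIAL\nFirst page body\n\f"
    "CONFIDENTIAL\nSecond page body\n\f"
    "CONFIDENTIAL\nThird page body\n"
)
result = drop_repeated_lines(demo)
```

The result can then be written to the output file in place of the unfiltered text.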

## Conclusion

PDF watermark removal using Python tends to rely on tools that ignore or sidestep the graphics in which the watermark is embedded. As such, none of the readily available libraries can guarantee perfect, seamless watermark removal, especially when you wish to retain elaborate formatting and graphics in your output file. However, the techniques covered provide a good starting point.
Remember, when you're removing watermarks using Python, always respect copyright laws and individual ownership rights. Use these techniques responsibly and ethically. Happy coding!
