
Reading A Pdf Portfolio In Python?

I have a pdf portfolio which is comprised of an email thread, each email containing attachments. I would like to read the text from each email and extract the attachments. However,

Solution 1:

The program pdfdetach from the poppler utilities can extract attachments.

Most UNIX-like operating system distributions have a poppler-utils package available. You can find an MS Windows version on SourceForge.

You can use the subprocess module to call this program from Python.
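As a minimal sketch of that subprocess call, assuming pdfdetach is on your PATH: the helper names build_detach_command and extract_attachments are made up for illustration, and the -saveall and -o options are the ones pdfdetach documents for saving every attachment into a directory.

```python
import shutil
import subprocess
from pathlib import Path


def build_detach_command(pdf_path, out_dir):
    """Return the pdfdetach argument list that saves every attachment to out_dir."""
    return ["pdfdetach", "-saveall", "-o", str(out_dir), str(pdf_path)]


def extract_attachments(pdf_path, out_dir):
    """Run pdfdetach and return the paths of the files found in out_dir afterwards."""
    if shutil.which("pdfdetach") is None:
        raise RuntimeError("pdfdetach not found on PATH; install poppler-utils")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfdetach exits with an error
    subprocess.run(build_detach_command(pdf_path, out_dir), check=True)
    return sorted(out_dir.iterdir())
```

Calling extract_attachments("portfolio.pdf", "attachments") should leave the embedded files in the attachments folder. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.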

Solution 2:

You could use python-poppler.

from poppler import load_from_file

pdf_document = load_from_file("portfolio.pdf")

if pdf_document.has_embedded_files():
    for attachment in pdf_document.embedded_files():
        print(attachment.data)
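
If you want the attachments on disk rather than printed, something along these lines might work. It is only a sketch: it assumes each embedded file object exposes name and data (as bytes), which is how I read the python-poppler documentation, and safe_name and save_attachments are hypothetical helpers, not part of the library.

```python
from pathlib import Path


def safe_name(name, index):
    """Keep only the final path component; fall back to a generated name."""
    base = Path(name).name if name else ""
    return base or f"attachment_{index}"


def save_attachments(pdf_path, out_dir):
    """Write every embedded file of pdf_path into out_dir and return the paths."""
    # Imported here so the helper above can be used without python-poppler installed
    from poppler import load_from_file

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    document = load_from_file(pdf_path)
    saved = []
    if document.has_embedded_files():
        for index, attachment in enumerate(document.embedded_files()):
            target = out_dir / safe_name(attachment.name, index)
            # Assumption: attachment.data holds the raw file content as bytes
            target.write_bytes(attachment.data)
            saved.append(target)
    return saved
```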

Solution 3:

It took me quite a bit of work to extract the embedded files from a portfolio using the answers from @Roland Smith and @ikreb. python-poppler has a fairly cryptic API, and the python-poppler example only gets the raw data, not the PDF files themselves. The following steps detail how to get the documents out of a portfolio using poppler and Python's subprocess module:

  1. You will need poppler installed. On a Mac it can be installed with Homebrew (or conda). You might also need cmake (also available through Homebrew). Here are various ways to install it on Windows: How to install Poppler on Windows?

  2. Poppler ships with command line utilities, so you don't necessarily have to use Python to solve your problem. From the command line:

# this will pull the files from the portfolio and save them to the same directory
pdfdetach -saveall <file_name, no quotes>

# example:
pdfdetach -saveall my_portfolio.pdf
  3. Within Python, use subprocess as follows:
import os
import subprocess

file_name = 'my_portfolio.pdf'

# pdfdetach will save all files from the portfolio to the same directory
subprocess.run(['pdfdetach', '-saveall', file_name])

# if you want to get a list of the files, use -list (see note below)
subprocess.run(['pdfdetach', '-list', file_name])

# it is also useful, within a script, to save to another folder using -o:
out_dir = os.path.join(os.getcwd(), 'out')
os.makedirs(out_dir, exist_ok=True)  # make sure the target folder exists
subprocess.run(['pdfdetach', '-saveall', '-o', out_dir, os.path.join(os.getcwd(), file_name)])


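Note on the -list output: subprocess.run returns a CompletedProcess object, not a list. To get the file names, capture the output (for example with capture_output=True and text=True) and parse the stdout attribute into a Python list. This post has several ways to do that: python subprocess output to list or file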
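
Here is a rough sketch of that parsing step. It assumes the -list output is a header line followed by lines of the form "1: name", which is what I have seen from pdfdetach but may differ between versions. parse_attachment_list and list_attachments are names I made up for the example.

```python
import re
import subprocess


def parse_attachment_list(listing):
    """Extract file names from lines that look like '1: name'."""
    names = []
    for line in listing.splitlines():
        match = re.match(r"\s*\d+:\s*(.+)", line)
        if match:
            names.append(match.group(1).strip())
    return names


def list_attachments(pdf_path):
    """Run 'pdfdetach -list' and return the attachment names as a list."""
    result = subprocess.run(
        ["pdfdetach", "-list", str(pdf_path)],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise if pdfdetach fails
    )
    return parse_attachment_list(result.stdout)
```

The parsing is kept in its own function so it can be checked against a sample listing without running pdfdetach at all.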
