Reading A Pdf Portfolio In Python?
Solution 1:
The program pdfdetach
from the poppler utilities can extract attachments.
Most UNIX-like operating system distributions have a poppler-utils
package available. You can find a ms-windows version on SourceForge.
You can use the subprocess
module to call this program from Python.
Solution 2:
You could use python-poppler
.
from poppler import load_from_file
pdf_document = load_from_file("portfolio.pdf")
if pdf_document.has_embedded_files():
for attachment in pdf_document.embedded_files():
print(attachment.data)
Solution 3:
It took me quite a bit of work to extract the embedded files from a portfolio using @Roland Smith and @ikreb answers. python-poppler has a fairly cryptic API and the instructions above just get the data, not the pdf. The following steps detail how to get the documents out of a portfolio using poppler and python subprocess:
You will need poppler installed. It can be installed with homebrew (or condo) on a Mac. You might also need cmake (also installed with homebrew). Here are various ways to install on Windows: How to install Poppler on Windows?
Poppler is a command line program so you don't necessarily have to use python to solve your problem. From command line:
# this will pull the files from the portfolio and save them to the same directory
pdfdetach -saveall <file_name, no quotes>
# example:
pdfdetach -saveall my_portfolio.pdf
- Within python, use subprocess as follows:
import subprocess
# pdfdetach will save all files from the portfolio to the same directory
subprocess.run(['pdfdetach', '-saveall', file_name.pdf])
# if you want to get a list of the files, use -list (see note below)
subprocess.run(['pdfdetach', '-list', file_name])
# it is also useful, within a script, to save to another folder using -o:
subprocess.run(['pdfdetach', '-saveall', os.path.join(os.path.join(os.getcwd(), my_portfolio.pdf), '-o', os.path.join(os.getcwd(), './out')])
note on list output: the output will be a subprocess object that must be parsed to make a python list of file names. This post has several ways to do that: python subprocess output to list or file
Post a Comment for "Reading A Pdf Portfolio In Python?"