Code Challenge 60 - Working With PDF Files in Python

PyBites, Tue 08 January 2019, Challenge

challenges, code challenge, data cleaning, data mining, PDF, PDFMiner, pdftables, PyPDF2, PyPI, text parsing

There is an immense amount to be learned simply by tinkering with things. - Henry Ford

Hey Pythonistas, in this challenge you will learn how to work with PDF documents. Enjoy!

The Challenge

For the NLTK challenge (PCC58/59) we hit a hurdle: the transcripts of Tim Ferriss' episodes 1-150 are PDF files. And we're not alone; in the comments somebody stated:

These are much appreciated. I do wonder, however, why they are all not a downloadable PDF and only the first 150. Perhaps just a marketing thing, but it would be nice to be able to grab them all to have an easily searchable database. Ah well, you have to work for what you want!

Challenge accepted! You can try this too or use another data set; it's up to you!

While Googling for this challenge we stumbled upon a PyCon proposal, "Liberating tabular data from the clutches of PDFs":

Budget Documents are moral documents that represent the priorities and values of the states and its governing bodies. Unfortunately these documents are published in unstructured PDF formats which makes it difficult for researchers, economists and general public to analyse and use this crucial data. In this session will delve into how we can create a data pipeline and leverage computer vision techniques to parse these documents into clean machine-readable formats by leveraging libraries like OpenCV, numpy, pandas, PyPDF2, tabula and poppler-pdf-to-text

Which goes to show that:

  1. There are a lot of interesting resources still locked up in PDF format, waiting to be converted ...
  2. In this Data Science age a lot of the focus goes to algorithms and visualization (the fun stuff), but none of that works without clean data first, so data cleaning is a relevant skill to have.
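To make the data cleaning point concrete, here is a minimal sketch of the kind of tidying that text pulled out of a PDF often needs. The helper name `clean_extracted_text` and the sample string are made up for illustration, and the rules are assumptions about typical extraction output rather than a fix for any particular file:

```python
import re


def clean_extracted_text(raw):
    """Tidy up text extracted from a PDF (hypothetical helper)."""
    # Rejoin words split by a hyphen at a line break: "Fer-\nriss" -> "Ferriss".
    # Caveat: this also glues genuinely hyphenated words such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # A single line break inside a paragraph becomes a space;
    # runs of two or more line breaks are left alone as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Split into paragraphs, trim each one, and drop empty ones.
    paragraphs = [p.strip() for p in re.split(r"\n{2,}", text)]
    return "\n\n".join(p for p in paragraphs if p)


sample = "Tim Fer-\nriss  talks\nabout habits.\n\n\nNext   para-\ngraph here.\n"
print(clean_extracted_text(sample))
```

Real transcripts will need more rules than this (page headers, speaker labels, odd characters), which is exactly the part of the challenge worth tinkering with.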

If you can't find a use case for data extraction, feel free to do the inverse: generate a nice looking PDF file from a bunch of data sources.

You probably want to use a third-party package for this: PyPDF2, pdftables (if you need to extract tables), and/or PDFMiner. Or search the cheese shop (PyPI) ...
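If you go the PyPDF2 route, a minimal sketch of the extraction step could look like the code below. It assumes PyPDF2 2.0 or later, where the reader class is `PdfReader` and pages expose `extract_text()` (older releases call these `PdfFileReader` and `extractText()`). The function names `find_pdfs` and `extract_text` and the `transcripts` directory are hypothetical, and the quality of the extracted text will vary from PDF to PDF:

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files in a directory, sorted by name (hypothetical helper)."""
    return sorted(Path(directory).glob("*.pdf"))


def extract_text(pdf_path):
    """Return the text of all pages in one PDF, joined by newlines."""
    # Imported here so the rest of the script works without PyPDF2 installed.
    # Assumes PyPDF2 >= 2.0 (pip install PyPDF2).
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can come back empty for scanned or image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


if __name__ == "__main__":
    # "transcripts" is a placeholder directory name; point it at your own files.
    for pdf in find_pdfs("transcripts"):
        text = extract_text(pdf)
        print(f"{pdf.name}: {len(text)} characters")
```

Scanned PDFs that contain only images will yield little or no text this way; those need OCR, which is a different challenge altogether.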

Have fun and use Python!

Ideas and feedback

If you have ideas for a future challenge or find any issues, open a GitHub issue or reach out via Twitter, Slack or email.

Last but not least: there is no best solution, only learning more and better Python. Good luck!

>>> from pybites import Bob, Julian

Keep Calm and Code in Python!