Next Time I Will Use Pandas to Parse HTML Tables

Bob, Thu 08 June 2017, Concepts

BeautifulSoup, csv, data, data cleaning, energy, html, json, Pandas, parsing, regex

Last week I parsed some HTML tables for our Electricity Cost Calculation App challenge, using BeautifulSoup and regex. It turns out there is an easier way to do this: Pandas.

Parsing HTML tables

Take 1: BeautifulSoup and regex

For our challenge I wanted to include a table of the wattage used by standard devices. I could not find an API for this, so I ended up with Wholesale Solar's power table.

However, even with a great library like BeautifulSoup available, parsing the HTML was still a pain (see get_appliance_wattages() here).

Take 2: Pandas

Luckily I stumbled upon this article, which shows how to use Pandas' read_html() to grab tabular data from HTML pages. Very useful! Here is a Jupyter notebook applying it to the power table problem.
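To give a feel for how little code read_html() needs, here is a minimal sketch. The SAMPLE_HTML string and the read_tables() helper are made up for illustration and are not taken from the notebook; the real page would be fetched by URL instead. Note that read_html() depends on an HTML parser being installed (lxml, or BeautifulSoup plus html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real power table page (values are invented).
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Appliances</th><th>Watts</th><th>Appliances</th><th>Watts</th></tr>
  </thead>
  <tbody>
    <tr><td>Coffee Machine</td><td>1,000</td><td>Blender</td><td>500</td></tr>
    <tr><td>Toaster</td><td>850</td><td>Microwave</td><td>1,200</td></tr>
  </tbody>
</table>
"""


def read_tables(html):
    """Return every <table> found in the HTML as a list of DataFrames."""
    # Wrapping the markup in StringIO avoids the deprecation warning that
    # newer Pandas versions raise for literal HTML strings.
    return pd.read_html(StringIO(html))
```

Calling read_tables(SAMPLE_HTML)[0] should give a DataFrame with two rows and four columns. By default read_html() treats "," as a thousands separator, so a cell like 1,000 is parsed as a number.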

Although read_html() is easy to use, I still had to do some data conversion in Pandas, because the table came with duplicated column names: 3 columns of Appliances and 3 columns of Watts.
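One possible way to deal with such repeated column pairs is to slice the table by position and stack the pieces. This is only a sketch under assumptions: the wide DataFrame, its values and the stack_column_pairs() helper are hypothetical, and the notebook may well do the conversion differently.

```python
import pandas as pd

# Hypothetical wide table with the Appliances/Watts pair repeated three times.
wide = pd.DataFrame(
    [
        ["Coffee Machine", 1000, "Blender", 500, "Fan", 75],
        ["Toaster", 850, "Microwave", 1200, "Laptop", 60],
    ],
    columns=["Appliances", "Watts"] * 3,
)


def stack_column_pairs(df, width=2, names=("appliance", "watts")):
    """Stack consecutive groups of `width` columns on top of each other."""
    pieces = []
    for start in range(0, df.shape[1], width):
        # Select by position, so duplicated column names do not matter.
        piece = df.iloc[:, start:start + width].copy()
        piece.columns = list(names)
        pieces.append(piece)
    # Drop rows that are empty in every column (padding in ragged tables).
    return pd.concat(pieces, ignore_index=True).dropna(how="all")


long = stack_column_pairs(wide)
```

Because the helper works on column positions, it does not care whether the duplicated headers were kept as-is or renamed along the way.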

So I did end up spending time on both methods, but the Pandas way is more extensible: once you have the data in a DataFrame, you have a rich API at your disposal for all kinds of data manipulation, like grouping, filtering and format conversion (to CSV/JSON).
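As a rough illustration of that API, the sketch below filters, groups and exports a small DataFrame. The data is invented, and the category column in particular is an assumption added for the example; it is not part of the power table.

```python
import pandas as pd

# Invented sample data; "category" is a hypothetical column for grouping.
df = pd.DataFrame({
    "appliance": ["Coffee Machine", "Toaster", "Blender", "Laptop"],
    "category": ["kitchen", "kitchen", "kitchen", "office"],
    "watts": [1000, 850, 500, 60],
})

# Filtering: keep only the power-hungry devices.
heavy = df[df["watts"] >= 800]

# Grouping: average wattage per category.
mean_by_category = df.groupby("category")["watts"].mean()

# Format conversion: with no path given, both calls return a string.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")
```

Passing a file path to to_csv() or to_json() writes the result to disk instead of returning it.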

The manual part: Data cleaning

The takeaway is to use specialized libraries as much as possible. They have most of the common use cases figured out.

However, be it BeautifulSoup, regex or Pandas, there is always some manual data manipulation and cleaning involved.

As you can see in the notebook, although Pandas took care of stripping the thousands separators, I still needed to manually manipulate/clean values like: 80-150 (average), 400-1000+ (strip), or 1080 watt-hours /day* (normalize).
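A small helper along the following lines could cover those three cases. The rules are assumptions made for this sketch, not necessarily what the notebook does: ranges are averaged, trailing markers such as + and * are ignored, and a watt-hours per day figure is divided by 24 to get an average wattage. The clean_watts() name is hypothetical.

```python
import re


def clean_watts(raw):
    """Turn a messy wattage cell into a single float (assumed rules)."""
    text = str(raw).lower().replace(",", "").strip()
    # Assumption: "watt-hours /day" means energy per day, so divide by 24.
    per_day = "watt-hours" in text
    # Pull out every number; "-" acts as a range separator, and trailing
    # markers like "+" or "*" are simply not matched.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return None
    # A single number averages to itself; a range averages its endpoints.
    value = sum(numbers) / len(numbers)
    if per_day:
        value /= 24
    return value
```

With these rules 80-150 becomes 115.0, 400-1000+ becomes 700.0 and 1080 watt-hours /day* becomes 45.0. On a DataFrame column the helper could be applied with something like df["watts"].map(clean_watts).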

If you have a magic method for that, let me know. And if you want to share your data parsing story, do so in the comments below, especially if it involved a lot of nasty manipulation and cleaning :)

I realize this would be an ideal code challenge too. If you agree, feel free to suggest one here.

Keep Calm and Code in Python!

-- Bob
