
Next Time I Will Use Pandas to Parse HTML Tables

Posted by Bob on Thu 08 June 2017 in Concepts • 2 min read

Last week I did some HTML table parsing for our Electricity Cost Calculation App challenge, using BeautifulSoup and regex. It turns out there is an easier way to do this: Pandas.

Parsing HTML tables

Take 1: BeautifulSoup and regex

For our challenge I wanted to include a table of wattage uses of standard devices. I did not find an API for this, so I ended up scraping Wholesale Solar's power table.

However, even with a great library like BeautifulSoup available, parsing the HTML was still a pain (see get_appliance_wattages() here).
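To give a feel for the manual approach, here is a minimal sketch of BeautifulSoup plus regex parsing. The HTML snippet and the naive number extraction are assumptions for illustration; the real power table (and the challenge's get_appliance_wattages()) is more involved.

```python
import re
from bs4 import BeautifulSoup

# A tiny stand-in for the power table; the real page has many
# more rows and messier values.
html = """
<table>
  <tr><th>Appliance</th><th>Watts</th></tr>
  <tr><td>Toaster</td><td>1,200</td></tr>
  <tr><td>Fridge</td><td>80-150</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
wattages = {}
for row in soup.find_all("tr")[1:]:  # skip the header row
    appliance, watts = (td.get_text(strip=True) for td in row.find_all("td"))
    # Naively grab the first number, dropping thousands separators;
    # ranges like "80-150" lose their upper bound here.
    match = re.search(r"[\d,]+", watts)
    if match:
        wattages[appliance] = int(match.group().replace(",", ""))

print(wattages)  # {'Toaster': 1200, 'Fridge': 80}
```

Even on this toy table you need to hand-roll the header skipping, cell extraction and number parsing, which is exactly the boilerplate Pandas removes.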

Take 2: Pandas

Luckily I stumbled upon this article, which shows how to use Pandas' read_html() to grab tabular data from HTML pages. Very useful! Here is a Jupyter notebook applying it to the power table problem.
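A minimal sketch of the read_html() approach, using a made-up table in place of the real page (read_html() needs an HTML parser backend like lxml installed):

```python
from io import StringIO

import pandas as pd

# Stand-in for the Wholesale Solar power table; with a URL or file
# read_html() works the same way, returning one DataFrame per <table>.
html = """
<table>
  <tr><th>Appliance</th><th>Watts</th></tr>
  <tr><td>Toaster</td><td>1,200</td></tr>
  <tr><td>Coffee maker</td><td>800</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html), header=0, thousands=",")
df = tables[0]  # first (and here only) table on the page
print(df)
```

Note how `thousands=","` already converts "1,200" to the integer 1200, one of the cleanups that had to be done by hand in the BeautifulSoup version.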

Although read_html() was easy to use, I still had to do some data conversion in Pandas, because the table came with duplicated column names: three columns of Appliances and three columns of Watts.
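One way that conversion can be sketched: stack the side-by-side column pairs into one long two-column frame. The column names below are an assumption about how read_html() deduplicates the repeated headers; see the notebook for the actual code.

```python
import pandas as pd

# Simulated read_html() result: the page lays the data out side by
# side, giving three Appliance/Watts column pairs.
df = pd.DataFrame(
    [["Toaster", 1200, "Fridge", 150, "Laptop", 50],
     ["Blender", 300, "TV", 100, "Router", 10]],
    columns=["Appliance", "Watts", "Appliance.1", "Watts.1",
             "Appliance.2", "Watts.2"],
)

# Slice out each two-column pair, give them uniform names, then
# concatenate vertically into one tidy table.
pairs = [df.iloc[:, i:i + 2].set_axis(["Appliance", "Watts"], axis=1)
         for i in range(0, df.shape[1], 2)]
long_df = pd.concat(pairs, ignore_index=True)
print(long_df)
```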

So I did end up spending time on both methods, but the Pandas way is more extensible: once you have the data in a DataFrame you have a rich API at your disposal for all kinds of data manipulations, like grouping, filtering and format conversion (to csv/json).
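For example, with the wattages in a DataFrame those manipulations become one-liners (sample data made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Appliance": ["Toaster", "Fridge", "Laptop", "TV"],
    "Watts": [1200, 150, 50, 100],
})

heavy = df[df["Watts"] > 100]            # filtering
total = df["Watts"].sum()                # aggregation
as_json = df.to_json(orient="records")   # format conversion to json
csv_text = df.to_csv(index=False)        # ...or to csv

print(heavy)
print(total)
```

None of this is available when the data lives in a hand-built dict from a BeautifulSoup pass.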

The manual part: Data cleaning

The takeaway is to use specialized libraries as much as possible; they have most of the common use cases figured out.

However, be it BeautifulSoup, regex or Pandas, there is always some manual data manipulation and cleaning involved.

As you can see in the notebook, although Pandas took care of stripping the thousands separators, I still needed to manually manipulate/clean values like: 80-150 (average), 400-1000+ (strip), or 1080 watt-hours /day* (normalize).
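That kind of cleaning can be sketched with a small helper. clean_wattage() is a hypothetical function, not the notebook's code: it pulls every number out of the string and averages them, which handles ranges, trailing "+" and unit suffixes in one pass.

```python
import re

def clean_wattage(value):
    """Normalize a messy wattage string into a single int.

    Ranges like '80-150' are averaged; junk like '+' or
    ' watt-hours /day*' is simply ignored by the regex.
    """
    numbers = [int(n.replace(",", ""))
               for n in re.findall(r"[\d,]+", value)]
    if not numbers:
        return None
    return sum(numbers) // len(numbers)

print(clean_wattage("80-150"))                 # 115
print(clean_wattage("400-1000+"))              # 700
print(clean_wattage("1080 watt-hours /day*"))  # 1080
```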

If you have a magic method for that, let me know. Or if you want to share your data parsing story, do so in the comments below, especially if it involved a lot of nasty manipulation and cleaning :)

I realize this would make an ideal code challenge too; if you agree, feel free to suggest one here.

Keep Calm and Code in Python!

-- Bob

See an error in this post? Please submit a pull request on GitHub.