Last week I did some HTML table parsing for our Electricity Cost Calculation App challenge. I used BeautifulSoup and regex. It turns out there is an easier way to do this: Pandas.
Parsing HTML tables
Take 1: BeautifulSoup and regex
For our challenge I wanted to include a table of the wattage used by standard devices. I did not find an API for this, so I ended up using Wholesale Solar's power table.
Take 2: Pandas
Luckily I stumbled upon this article, which shows how to use Pandas' read_html() to grab tabular data from HTML pages. Very useful! Here is a Jupyter notebook applying it to the power table problem.
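For readers who have not used it before, here is a minimal sketch of read_html() on an inline HTML snippet. The table markup and the wattage values are made-up placeholders, not the real power table, and the snippet assumes an HTML parser that Pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Placeholder HTML: the structure and values are invented for illustration.
html = """
<table>
  <tr><th>Appliance</th><th>Watts</th></tr>
  <tr><td>Coffee Maker</td><td>1,000</td></tr>
  <tr><td>Toaster</td><td>850</td></tr>
</table>
"""

# read_html() returns a list of DataFrames, one per <table> it finds.
# Wrapping the string in StringIO avoids the deprecation warning that
# newer Pandas versions raise for literal HTML strings.
tables = pd.read_html(StringIO(html), header=0)
df = tables[0]

print(df)
print(df.dtypes)
```

Note that "1,000" comes back as a number: read_html() treats the comma as a thousands separator by default.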
Although read_html() is easy to use, I still had to do some data conversion in Pandas, because the table came with duplicated column names: 3 columns of Appliances and 3 columns of Watts.
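One way to handle such a layout is to slice the column pairs by position and stack them into one long table. This is only a sketch under assumptions: the DataFrame below is a hand-built stand-in with invented values, and the notebook may well solve it differently.

```python
import pandas as pd

# Hypothetical wide layout: three (Appliance, Watts) pairs side by side,
# so the column names repeat. All values are placeholders.
wide = pd.DataFrame(
    [
        ["Blender", 300, "Toaster", 850, "Laptop", 50],
        ["Fan", 75, "Kettle", 1200, "Router", 10],
    ],
    columns=["Appliance", "Watts"] * 3,
)

# Labels are ambiguous here, so select each pair by position with iloc,
# give it clean column names, and stack the pairs vertically.
pairs = [
    wide.iloc[:, i:i + 2].set_axis(["Appliance", "Watts"], axis=1)
    for i in range(0, wide.shape[1], 2)
]
long = pd.concat(pairs, ignore_index=True)

print(long)
```

Selecting by position rather than by name sidesteps the duplicated labels entirely, at the cost of assuming the pairs always appear in the same order.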
So I did end up spending time on both methods, but the Pandas way is more extensible: once you have the data in a DataFrame, you have a rich API at your disposal for all kinds of data manipulation, like grouping, filtering and format conversion (to csv/json).
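To make that concrete, here is a small sketch of filtering, grouping and exporting. The data is invented, and the Category column is a hypothetical addition that does not exist in the original power table.

```python
import pandas as pd

# Placeholder data; "Category" is a hypothetical column added for the demo.
df = pd.DataFrame(
    {
        "Appliance": ["Toaster", "Kettle", "Laptop", "Router"],
        "Category": ["Kitchen", "Kitchen", "Office", "Office"],
        "Watts": [850, 1200, 50, 10],
    }
)

# Filtering: keep only the power-hungry devices.
heavy = df[df["Watts"] >= 500]

# Grouping: average wattage per category.
mean_by_category = df.groupby("Category")["Watts"].mean()

# Format conversion: with no path given, both methods return a string.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

print(heavy)
print(mean_by_category)
print(csv_text)
print(json_text)
```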
The manual part: Data cleaning
The takeaway is to use specialized libraries as much as possible. They have most of the common use cases figured out.
However, be it BeautifulSoup, regex or Pandas, there is always some manual data manipulation and cleaning involved.
As you can see in the notebook, although Pandas took care of stripping the thousand separators, I still needed to manually manipulate/clean values like:
400-1000+ (strip), or
1080 watt-hours /day* (normalize).
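No magic here, but a small regex helper covers both cases. This is a sketch and not necessarily what the notebook does: first_number is a hypothetical name, and taking the first number (the lower bound of a range) is an assumption you may want to change. It also leaves the unit question open, so a watt-hours-per-day value still needs converting separately.

```python
import re
from typing import Optional

# A digit followed by any mix of digits and thousands separators.
_NUMBER = re.compile(r"\d[\d,]*")


def first_number(raw: str) -> Optional[int]:
    """Return the first integer found in raw, or None if there is none."""
    match = _NUMBER.search(raw)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group().replace(",", ""))


for raw in ["400-1000+", "1080 watt-hours /day*", "1,500", "n/a"]:
    print(repr(raw), "->", first_number(raw))
```

With a DataFrame you could apply it to a whole column, for example df["Watts"].astype(str).map(first_number).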
If you have a magic method for that, let me know. And if you want to share your data parsing story, do so in the comments below, especially if it involved a lot of nasty manipulation and cleaning :)
I realize this would make an ideal code challenge too. If you agree, feel free to suggest one here.
Keep Calm and Code in Python!