Create a Simple Web Scraper with BeautifulSoup4

Julian, Wed 11 January 2017, Tools

beautifulsoup, bs4, code, namedtuples, pybites, python, tips, tricks, webscraping

I absolutely loved the idea of web scraping when Bob explained what it was (it sounded so spy-like and hackery!). It did, however, sound like something that, coding-wise, was completely out of my grasp. Once I dove in and tried to create one, though, I realised it was actually quite simple!


Create a web scraper that probes a site for the latest headlines.

For my example, I'm going to scrape Wowhead, a World of Warcraft database site, for its latest news headlines.

Head to the Wowhead page and you'll see their home page is just a series of news/blog posts. What we want to do is pull the title of each blog post and output it to text.

(You can follow along with this or, of course, you can use your own site).

The Setup

# pwd
# ls

The Code

The final code for this simple scraper can be found in the PyBites Code Repo, subdirectory BeautifulSoup.

Disclaimer: I've lumped everything under the main() function. This is a really simple program and I wanted to keep it as readable as possible, thus it's not all split into different functions.

import bs4
import requests

URL = ""
header_list = []

def main():
    raw_site_page = requests.get(URL)
    raw_site_page.raise_for_status()  # Confirm the site was pulled; raise an error if not

The get() function of the requests module pulls the page from the site and returns a Response object, which we assign to the variable raw_site_page. The HTML itself is available as raw_site_page.text.

As the comment implies, the .raise_for_status() method checks whether the data was pulled successfully. If, for example, your URL points to a page that doesn't exist, the server answers with an error status and this call raises an exception, which stops your program and tells you what went wrong.
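If you would rather handle a failed request yourself than let the program stop, one option is to catch the exception. This is only a minimal sketch and not part of the original scraper: page_was_pulled is a made-up helper name, and the two Response objects are built by hand (an assumption for illustration) so the example runs without touching the network.

```python
import requests


def page_was_pulled(response):
    """Return True if the response has a success status, False otherwise.

    Hypothetical helper for illustration; not part of the original scraper.
    """
    try:
        response.raise_for_status()  # raises HTTPError on 4xx/5xx statuses
    except requests.exceptions.HTTPError as err:
        print(f"Could not pull the page: {err}")
        return False
    return True


# Hand-built responses stand in for real ones (assumption: no network needed).
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(page_was_pulled(ok_response))
print(page_was_pulled(missing_response))
```

The first call should report success and the second should print the error message before reporting failure.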

soup = bs4.BeautifulSoup(raw_site_page.text, 'html.parser')

This code takes the text of the Response object (the raw HTML). BS4 parses it with Python's built-in html.parser and creates a BeautifulSoup object, which we assign to the variable soup.
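To get a feel for what the soup object gives you, here is a small sketch that parses a hard-coded HTML string instead of a live page. The sample_html content is made up for illustration and simply stands in for raw_site_page.text.

```python
import bs4

# Made-up HTML standing in for raw_site_page.text (assumption for illustration).
sample_html = """
<html>
  <head><title>Sample News</title></head>
  <body>
    <h1 class="heading-size-1"><a href="/first-post">First post</a></h1>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string   # text inside the <title> tag
first_heading = soup.find("h1")  # first <h1> tag in the document

print(page_title)
print(first_heading)
```

Once the HTML is parsed, you can walk to tags by name (soup.title) or search for them (soup.find), rather than picking the raw text apart by hand.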

    html_header_list = soup.select('.heading-size-1')
    for headers in html_header_list:
        print(headers)

We need to use the .select() function within BS4 to find what we want in the site HTML code. This is where you'll need to view the page source of the site (or use Inspect!) to find something unique about the data you want to pull.

You can see that I've specified the CSS class selector ".heading-size-1". On the Wowhead page I found that each post heading had this class and that it was unique to them as well.

.select() returns a list of every tag that matches the selector, and we store that list in html_header_list.
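Here is a small sketch of how .select() behaves with a class selector. The HTML below is made up for illustration (the sidebar-title class in particular is hypothetical); it only mimics the shape of the headings described above.

```python
import bs4

# Made-up HTML: two headlines plus one tag we do NOT want to match.
sample_html = """
<h1 class="heading-size-1"><a href="/first-post">First post</a></h1>
<h2 class="sidebar-title">Not a headline</h2>
<h1 class="heading-size-1"><a href="/second-post">Second post</a></h1>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# The leading dot means "match on the class attribute".
html_header_list = soup.select(".heading-size-1")

for headers in html_header_list:
    print(headers)  # prints each matching tag, HTML and all
```

Only the two tags carrying the heading-size-1 class are returned; the h2 is skipped because its class differs.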

$ python3 
<h1 class="heading-size-1"><a href="/patch-7-1-5-survival-guide">Patch 7.1.5 Survival Guide: Class Guides, New Legendaries, Brawler's Guild, Artifact Knowledge Catch Up and More!</a></h1>

What's happening here is that I'm getting not just the header text of the post but the whole tag, including the URL in the "a href" attribute. We don't need that extra data for this exercise.

    html_header_list = soup.select('.heading-size-1')
    for headers in html_header_list:
        header_list.append(headers.getText())

Using .getText() we can then pull the plain text and append it to the header_list list.
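To see the difference between a tag and its text, here is a short sketch on made-up HTML. It uses get_text(), which is the snake_case spelling of the same method as .getText().

```python
import bs4

# Made-up HTML that mimics the headings described above.
sample_html = """
<h1 class="heading-size-1"><a href="/first-post">First post</a></h1>
<h1 class="heading-size-1"><a href="/second-post">Second post</a></h1>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

header_list = []
for headers in soup.select(".heading-size-1"):
    # get_text() drops the <h1> and <a> markup and keeps only the words.
    header_list.append(headers.get_text())

for headers in header_list:
    print(headers)
```

The nested "a" tag doesn't get in the way: get_text() collects the text of everything inside the heading.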

for headers in header_list:
    print(headers)

$ python3
Patch 7.1.5 Survival Guide: Class Guides, New Legendaries, Brawler's Guild, Artifact Knowledge Catch Up and More!
Official Patch Notes for World of Warcraft 7.1.5
Kirin Tor Quest Fix, World Quest Reset in 7.1.5, Live Developer Q&A Thursday
The Story of Aviana - Lore Collaboration with Nobbel87
All The Demon Hunter Class and Legendary Changes in Patch 7.1.5
Wowhead Weekly #106 and Blizzard Gear Shop Diablo Sale

More examples (Bob)

Here is another example that scrapes a page, parses the HTML table holding the Scrabble tile distribution, and loads it into a data structure (a list of namedtuples).

The Titans books kata also used BeautifulSoup to scrape the page; see the code here.
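The table-to-namedtuples idea can be sketched roughly as follows. This is not the linked code: the three-row table, the Letter namedtuple and its field names are all assumptions made for illustration, and the real page will be laid out differently.

```python
from collections import namedtuple

import bs4

# Hypothetical record type; the field names are an assumption.
Letter = namedtuple("Letter", "name amount value")

# Made-up excerpt of a tile distribution table.
sample_table = """
<table>
  <tr><th>Letter</th><th>Amount</th><th>Value</th></tr>
  <tr><td>A</td><td>9</td><td>1</td></tr>
  <tr><td>B</td><td>2</td><td>3</td></tr>
  <tr><td>C</td><td>2</td><td>3</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(sample_table, "html.parser")

distribution = []
for row in soup.select("tr"):
    cells = [cell.get_text(strip=True) for cell in row.select("td")]
    if len(cells) != 3:
        continue  # skip the header row, which has <th> cells instead of <td>
    name, amount, value = cells
    distribution.append(Letter(name, int(amount), int(value)))

for letter in distribution:
    print(letter)
```

Namedtuples keep each row readable later on: letter.amount says more than letter[1].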

Areas for Expansion

Again, this is web scraping at its simplest. There are heaps of improvements and additions that can be made with these coming to mind right away:


This was a pretty satisfying project for me. Web scraping has endless possibilities - you just need to figure out what you want and from where!

This example is as simple as they come but hopefully now you can see just how easy it really is.

Oh and if anyone tries to say, "Isn't that what the RSS feed or Subscribe button is for?", ignore them. This is way more satisfying!

Keep Calm and Code in Python!

-- Julian
