I absolutely loved the idea of web scraping when Bob explained what it was (it sounded so spy-like and hackery!). It did, however, sound like something that, coding-wise, was completely out of my grasp. Once I dove in and tried to create a scraper, though, I realised it was actually quite simple!
Create a web scraper that probes a site for the latest headlines.
For my example, I'm going to scrape wowhead.com, a World of Warcraft database site, for their latest news headlines.
Head to the Wowhead page and you'll see their home page is just a series of news/blog posts. What we want to do is pull the title of each blog post and output it to text.
(You can follow along with this or, of course, you can use your own site).
# pwd
wowhead
# ls
venv  wowhead_scraper.py
Disclaimer: I've lumped everything under the main() function. This is a really simple program and I wanted to keep it as readable as possible, thus it's not all split into different functions.
import bs4
import requests

URL = "http://www.wowhead.com"
header_list = []  # will hold the plain-text headlines

def main():
    raw_site_page = requests.get(URL)
    raw_site_page.raise_for_status()  # Confirm site was pulled. Error if not
The get() function of the requests module pulls the page from the site and returns a Response object, which holds the HTML along with the status code and headers. We assign this object to the variable raw_site_page.
As the comment implies, the .raise_for_status() method checks whether the page was pulled successfully. If, for example, your URL points to a page that doesn't exist and the server answers with a 404, this will error your program out and tell you about it.
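If you'd rather deal with a failed request yourself than let the program crash, one option is to catch the exception. This is only a minimal sketch: the fetch_html name and the 10-second timeout are my own assumptions for illustration, not part of the script in this post.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request failed.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    except requests.RequestException as err:
        # RequestException is the base class for HTTPError, ConnectionError,
        # Timeout and the other errors that requests raises.
        print(f"Could not fetch {url}: {err}")
        return None
    return response.text
```

Calling fetch_html(URL) would then hand back either the HTML string or None, so the rest of the program can decide what to do next.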
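    # Parse the response body (a string) with Python's built-in HTML parser
    soup = bs4.BeautifulSoup(raw_site_page.text, 'html.parser')
This code uses raw_site_page.text to read the body of the Response object as plain text. BS4 parses it with Python's built-in html.parser and creates a BeautifulSoup object, which we're assigning to the variable soup.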
    html_header_list = soup.select('.heading-size-1')
    for headers in html_header_list:
        print(headers)
We need to use the .select() function within BS4 to find what we want in the site HTML code. This is where you'll need to view the page source of the site (or use Inspect!) to find something unique about the data you want to pull.
You can see that I've specified the CSS class selector ".heading-size-1" (the leading dot means "match on class"). On the Wowhead page I found that each post heading had this class and that it was unique to the headings as well.
We then take this selected data and create html_header_list with it.
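To get a feel for what .select() hands back without hitting a live site, you can try it on a small HTML string. The markup below is made up to imitate the structure described above; it is not copied from the real Wowhead page.

```python
import bs4

# Made-up HTML that imitates the structure described above (an assumption,
# not the real Wowhead markup).
SAMPLE_HTML = """
<div class="news-post">
  <h1 class="heading-size-1"><a href="/first-post">First headline</a></h1>
  <p class="summary">Some teaser text.</p>
</div>
<div class="news-post">
  <h1 class="heading-size-1"><a href="/second-post">Second headline</a></h1>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# The leading dot selects by class, so this finds both <h1> tags
html_header_list = soup.select(".heading-size-1")
print(len(html_header_list))  # 2

# select() gives back a list-like result even when nothing matches
print(soup.select(".no-such-class"))  # []
```

An empty result is usually the first sign that the class name you picked from the page source isn't quite right.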
$ python3 wowhead_scraper.py
<h1 class="heading-size-1"><a href="/patch-7-1-5-survival-guide">Patch 7.1.5 Survival Guide: Class Guides, New Legendaries, Brawler's Guild, Artifact Knowledge Catch Up and More!</a></h1>
What's happening here is that I'm getting not just the header text of the post but the whole h1 element, including the link URL held in the a tag's href attribute. We don't need this data for this exercise.
    html_header_list = soup.select('.heading-size-1')
    for headers in html_header_list:
        header_list.append(headers.getText())
Using .getText() we can then pull the plain text and append it to the header_list list.
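Here is a small sketch of the difference between a tag and its text, again on a made-up heading rather than real Wowhead markup. It uses get_text(), which is the snake_case spelling of the same BS4 method as .getText(); the strip=True argument is an optional extra that trims surrounding whitespace.

```python
import bs4

# Made-up heading in the same shape as the output shown earlier
html = '<h1 class="heading-size-1"><a href="/some-post">  Some headline  </a></h1>'
soup = bs4.BeautifulSoup(html, "html.parser")
heading = soup.select(".heading-size-1")[0]

# Plain text only: the <h1> and <a> tags are gone
print(repr(heading.get_text()))            # '  Some headline  '
print(repr(heading.get_text(strip=True)))  # 'Some headline'

# The link is still reachable on the tag if you ever want it
print(heading.a["href"])                   # /some-post
```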
    for headers in header_list:
        print(headers)

# Run main() when the script is executed directly
if __name__ == "__main__":
    main()
$ python3 wowhead_scraper.py
Patch 7.1.5 Survival Guide: Class Guides, New Legendaries, Brawler's Guild, Artifact Knowledge Catch Up and More!
Official Patch Notes for World of Warcraft 7.1.5
Kirin Tor Quest Fix, World Quest Reset in 7.1.5, Live Developer Q&A Thursday
The Story of Aviana - Lore Collaboration with Nobbel87
All The Demon Hunter Class and Legendary Changes in Patch 7.1.5
Wowhead Weekly #106 and Blizzard Gear Shop Diablo Sale
$
Again, this is web scraping at its simplest. There are heaps of improvements and additions that can be made with these coming to mind right away:
This was a pretty satisfying project for me. Web scraping has endless possibilities - you just need to figure out what you want and from where!
This example is as simple as they come but hopefully now you can see just how easy it really is.
Oh and if anyone tries to say, "Isn't that what the RSS feed or Subscribe button is for?", ignore them. This is way more satisfying!
Keep Calm and Code in Python!