Decorators, Scrapers and Generators

Whoever said scraping web pages can’t be fun never tried it using Python decorators and generators! We’ll use a mini-framework to fetch all the upcoming comic book releases from one of my favorite online comic book stores.

A couple of interesting things are going on here programmatically: 1) we use decorators that are static methods of a class, and that class maintains some state for the decorated operations; and 2) we decorate class methods that are generators, something the typical decorator can't handle, since it never sees the yielded results.
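To see why, remember that calling a generator function only builds a generator object; the body doesn't run until someone iterates. An ordinary wrapping decorator is therefore handed a lazy generator, not the yielded values. Here's a quick illustration (the names below are purely for demonstration):

```python
def typical(fn):
    """An ordinary wrapping decorator that inspects the return value."""
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        # If fn is a generator function, `result` is an unstarted generator
        # object: none of the yields have happened yet, so there is nothing
        # here to inspect, cache, or post-process.
        print('handler returned: %r' % result)
        return result
    return wrapper


@typical
def titles():
    yield 'Saga'
    yield 'East of West'


print(list(titles()))   # the yields only happen once someone iterates
```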

Here’s a class that defines some static methods for us: scrape.route and scrape.tag. We’ll use these as decorators in another class that’s specific to our scraping task. These decorators will declare URL patterns that our decorated methods will handle when called using scrape.nav().
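The real implementation lives in the Gist linked at the end; here's a minimal sketch of the shape of that class, assuming requests and pyquery are installed. The method bodies are simplified stand-ins, not the actual code:

```python
import re

import requests
from pyquery import PyQuery


class scrape(object):
    """Static-method decorators register route handlers; nav() dispatches to them."""

    # class-level state shared by the decorated operations
    _routes = {}   # compiled URL pattern -> handler method

    def __init__(self, context):
        # the task-specific instance (e.g. NewComics) the handlers are called on
        self.context = context

    @staticmethod
    def route(pattern):
        """Declare the URL pattern a decorated method will handle."""
        def register(fn):
            scrape._routes[re.compile(pattern)] = fn
            return fn
        return register

    @staticmethod
    def tag(name):
        """Declare a named page fragment a decorated method will handle
        (stubbed here; see the Gist for the real behavior)."""
        def register(fn):
            return fn
        return register

    def nav(self, url):
        """Fetch url, wrap the document in PyQuery, and call the matching handler."""
        d = PyQuery(requests.get(url).text)
        for pattern, handler in scrape._routes.items():
            if pattern.search(url):
                # handlers may be generators; list() drains the yields
                return list(handler(self.context, url, d))
        raise LookupError('no route declared for %r' % url)
```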

Now we’ll create a class that’s specific to our scraping task. We’ll use the scrape.route decorator to register a handler for a specific URL pattern with: @scrape.route('/newreleases'). When we navigate to a URL matching this pattern, the class method weekly will be called.

Notice that when weekly is first called, it captures the total number of pages. We’ll use this information on subsequent round-trips.
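Put together, the task class might look something like this sketch (the CSS selectors are illustrative guesses, not MyComicShop's actual markup):

```python
class NewComics(object):
    """Scraping task: collect titles of upcoming comic book releases."""

    def __init__(self):
        self.pages = None   # total page count, captured on the first visit

    @scrape.route('/newreleases')
    def weekly(self, url, d):
        # First time through, remember how many pages the listing spans
        # so subsequent round-trips know when to stop.
        if self.pages is None:
            self.pages = len(d('div.pagination a'))

        # Yield each release title; scrape.nav() drains the generator.
        for link in d('a.title').items():
            yield link.text()
```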

Now for the main event! We instantiate NewComics and pass the instance to the scrape constructor. This instance is our context object when we call the scrape.nav(url) method.
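Under the assumptions in the sketches above, wiring it all up looks roughly like this (the URL is illustrative):

```python
comics = NewComics()
s = scrape(comics)   # the NewComics instance is our context object

# Matches the '/newreleases' route and dispatches to NewComics.weekly
titles = s.nav('http://www.mycomicshop.com/newreleases')
```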

When nav(url) is called and the URL matches a route we declared with the scrape.route decorator, the handler weekly(url, d) is called. d is our PyQuery object, which we can use to traverse the document.
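PyQuery gives d a jQuery-like API, so traversal inside a handler reads like this (again, the selectors are illustrative):

```python
# `d` works like jQuery: CSS selectors, chaining, text/attribute access.
d('a.title').eq(0).text()              # text of the first matching link
d('div.item').find('img').attr('src')  # descend and read an attribute
len(d('div.pagination a'))             # count pagination links on the page
```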

Route handlers like weekly can be generators, or they can return anything suitable for the list() constructor. In this example, we yield the titles of upcoming comic book releases, compliments of MyComicShop.com. Go grab the full source on GitHub!
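One last note: a handler doesn't have to be a generator. A hypothetical variant that simply returns a list works just as well:

```python
class NewComicsList(object):
    """Hypothetical variant of the task class with a non-generator handler."""

    @scrape.route('/newreleases')
    def weekly(self, url, d):
        # No yields here: any return value that list() accepts works too.
        return [link.text() for link in d('a.title').items()]
```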

TL;DR: Go right to the Gist on GitHub.
