Whoever said scraping web pages can’t be fun never tried it using Python decorators and generators! We’ll use this mini-framework to fetch all the upcoming comic book releases from one of my favorite online comic book stores.
A couple of interesting things going on here programmatically: 1) using decorators that are static methods of a class that maintains some state for the decorated operations and 2) decorating class methods that are generators – the typical decorator can’t handle a yielded result…
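To make that second point concrete, here's a minimal sketch of one way a typical decorator trips over generators: calling a generator function runs none of its body, so a wrapper that just passes the result along hands back a generator that hasn't done anything yet. The names naive, collected, titles and calls are made up for illustration and are not part of the framework.

```python
import functools

calls = []  # records when a handler body actually starts running


def naive(func):
    """A typical decorator: call the function, pass the result through."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)  # for a generator, no body has run yet
    return wrapper


def collected(func):
    """Generator-aware variant: iterate the result so the body executes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return list(func(*args, **kwargs))
    return wrapper


def titles():
    calls.append("ran")
    yield "Issue #1"
    yield "Issue #2"


lazy = naive(titles)()       # a generator object; calls is still empty
eager = collected(titles)()  # ['Issue #1', 'Issue #2']; the body has now run
```

Wrapping the call in list() is what lets a single wrapper serve both generators and plain functions that return a list or tuple.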
Here’s a class that defines some static methods for us, including scrape.tag. We’ll use these as decorators in another class that’s specific to our scraping task. These decorators declare the URL patterns that our decorated methods will handle when we navigate to a matching URL.
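If you'd like to see the shape of the idea without the full source, here's a rough sketch of a class whose static method acts as a decorator and records routes in class-level state. This is my guess at a minimal version, not the actual class: the routes list, the fetch argument, and the Demo class are all assumptions, and scrape.tag is left out because its behavior isn't described here. The real thing hands handlers a PyQuery object; a fetch callable stands in for that so the sketch runs offline.

```python
import re


class scrape:
    """Hypothetical sketch of a route registry; not the author's actual class."""

    # Class-level state shared by every use of the decorator.
    routes = []

    def __init__(self, context, fetch):
        self.context = context  # instance whose methods were decorated
        self.fetch = fetch      # assumed callable: url -> parsed document

    @staticmethod
    def route(pattern):
        """Register the decorated method as the handler for `pattern`."""
        def decorator(func):
            scrape.routes.append((re.compile(pattern), func.__name__))
            return func  # leave the method itself untouched
        return decorator

    def nav(self, url):
        """Call the first handler whose pattern matches `url`."""
        for regex, name in scrape.routes:
            if regex.search(url):
                handler = getattr(self.context, name)
                # list() drives generators and accepts any other iterable
                return list(handler(url, self.fetch(url)))
        return []


class Demo:
    @scrape.route(r"/newreleases")
    def weekly(self, url, d):
        yield d.upper()


site = scrape(Demo(), fetch=lambda url: "fake page")
results = site.nav("http://example.com/newreleases")  # ['FAKE PAGE']
```

Storing the method name rather than the function lets nav look the handler up on the context instance, so it gets called as a bound method with self filled in.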
Now we’ll create a class that’s specific to our scraping task. We’ll use the scrape.route decorator to capture a web page for a specific URL pattern with: @scrape.route('/newreleases'). When we navigate to a URL matching this pattern, the class method weekly will be called.
Notice that when weekly is first called, it captures the total number of pages. We’ll use this information later on subsequent round trips.
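Here's a stripped-down sketch of that "remember it on the first call" pattern. To keep it runnable on its own, there's no decorator and the d argument is a plain dict standing in for the parsed page; the "pages" and "titles" keys and the ?p= query strings are invented for the example.

```python
class NewComics:
    """Sketch of a stateful handler class; the dict-based `d` is a stand-in."""

    def __init__(self):
        self.total_pages = None  # unknown until the first page is handled

    def weekly(self, url, d):
        # Only the first call records the page count; later calls reuse it.
        if self.total_pages is None:
            self.total_pages = d["pages"]
        for title in d["titles"]:
            yield title


comics = NewComics()
first = list(comics.weekly("/newreleases?p=1", {"pages": 3, "titles": ["A", "B"]}))
second = list(comics.weekly("/newreleases?p=2", {"pages": 0, "titles": ["C"]}))
```

Because weekly is a generator, the page count is only stored once something iterates over the result, which is one more reason the caller needs to consume it.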
Now for the main event! We instantiate NewComics and pass the instance to the scrape constructor. This instance is our context object. When nav(url) is called and matches a route we declared with the scrape.route decorator, the handler weekly(url, d) is called. Here, d is our PyQuery object, which can be used to traverse the document.
Route handlers like weekly can be generators or return anything that is suitable for the list() constructor. In this example, we yield the titles of upcoming comic book releases, courtesy of MyComicShop.com. Go grab the full source at GitHub!
TL;DR: Go right to the Gist on GitHub.