
Writing a Web Scraper

Okay, so I know everyone’s written some kind of web scraper in their time, but I’m still proud of myself for my own take on the subject.

Recently, I saw an interesting post on Reddit that provided some source code for scraping images based on the similarity of their names. This can be useful for downloading all of the images in a gallery (for example, if you want to keep a copy of something posted on Imgur) without having to follow all of the links by hand. However, the code was incomplete, it relied on some random third-party site, and it handled common use cases poorly: thumbnails, for example.
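For the curious, the core idea is easy to sketch. This is my own illustration using Python's difflib, not the code from the Reddit post, and the 0.8 threshold is an arbitrary choice:

# Sketch of name-similarity matching; the threshold is an assumption,
# and this is not the original Reddit code.
from difflib import SequenceMatcher
from urllib.parse import urlsplit
from os.path import basename

def similar_names(urls, reference, threshold=0.8):
    """Return the URLs whose filenames resemble the reference filename."""
    ref_name = basename(urlsplit(reference).path)
    matches = []
    for url in urls:
        name = basename(urlsplit(url).path)
        if SequenceMatcher(None, ref_name, name).ratio() >= threshold:
            matches.append(url)
    return matches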

So I rewrote it. And expanded it. Massively. The current usage output is:

Usage:
  galleryscraper.py URL DIR [--threads N --log-level N -q -s]
  galleryscraper.py -h | --help | --version

Options:
      --threads N        the number of threads to use [default: 4]
  -V, --log-level N      the level of info logged to the console, which can be
                         one of INFO, DEBUG, or WARNING [default: INFO]
  -s, --skip-duplicates  ignore files that have been downloaded already
  -q, --quiet            suppress output to console
  -v, --version          show program's version number and exit
  -h, --help             show this help message and exit

In particular, I worked out a clever way of dealing with thumbnails. Often, the thumbnails in a gallery each link to a page full of ads and other clutter, with one large image (the one you actually want) in the centre. So I figured that the image on that page with the largest Content-Length is probably the one in question. This heuristic seems to work extremely well.
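In sketch form, the heuristic looks something like this. The helper name and the use of HEAD requests to read sizes are mine, so treat it as an illustration rather than the script's exact code:

# Sketch: pick the <img> on a page with the largest Content-Length.
# The function name and HEAD-request approach are assumptions of mine.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def largest_image(page_url, session=None):
    """Return the URL of the biggest image linked from a thumbnail page."""
    session = session or requests.Session()
    soup = BeautifulSoup(session.get(page_url).text, 'html.parser')
    best_url, best_size = None, -1
    for img in soup.find_all('img', src=True):
        img_url = urljoin(page_url, img['src'])
        # HEAD request: read the size without downloading the whole image
        head = session.head(img_url, allow_redirects=True)
        size = int(head.headers.get('Content-Length', 0))
        if size > best_size:
            best_url, best_size = img_url, size
    return best_url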

Mostly, this was an opportunity to fill some gaps in my programming experience. First and foremost is how HTTP requests actually work (success), but I also learned a lot about parallelization and Python decorators. I made use of the excellent (and pretty standard at this point) Python libraries Requests and BeautifulSoup.
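As a rough illustration of the parallel side of things, the --threads option boils down to a worker pool. The helper names here are hypothetical, and I'm not claiming this is the script's exact structure:

# Sketch of the worker-pool shape behind a --threads option.
# download_image is a hypothetical helper, not from the script itself.
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

import requests

def download_image(url, directory):
    """Fetch one image and write it to the target directory."""
    response = requests.get(url, timeout=10)
    name = os.path.basename(urlsplit(url).path) or 'image'
    with open(os.path.join(directory, name), 'wb') as f:
        f.write(response.content)

def download_all(urls, directory, threads=4):
    """Hand each URL to a pool of worker threads (cf. --threads N)."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        pool.map(lambda u: download_image(u, directory), urls)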

The script itself has a couple of interesting features:

My favourite decorator at the moment:

from functools import wraps

import requests
from requests.adapters import HTTPAdapter

def sessional(func):
    """
    Decorator that maintains the same session for URL requests within the given
    function. It relies on the wrapped function taking a ``session`` keyword
    argument. At the moment this is only used for the safe_request function,
    but if timeouts are not a problem it can be used to wrap all of the other
    functions that make requests instead.
    """
    # One session shared by every call to the wrapped function, so connections
    # are reused and the retry and header settings apply throughout.
    session = requests.Session()
    session.mount('http://', HTTPAdapter(max_retries=5))
    session.headers['User-Agent'] = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    @wraps(func)
    def newfunc(*args, **kwargs):
        # Inject the shared session as a keyword argument
        kwargs['session'] = session
        return func(*args, **kwargs)

    return newfunc
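For context, here is roughly how it gets applied. The body of safe_request is a guess based on the docstring above, since I haven't reproduced the real function here:

# How the decorator might be applied; this body is my guess, not the
# actual safe_request from the script.
@sessional
def safe_request(url, session=None, timeout=10):
    """Fetch a URL using the shared session injected by @sessional."""
    return session.get(url, timeout=timeout)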

Of course, there are still some edge cases. The script works very well most of the time, but a few things still trip it up. Usually the culprit is that the images you want have dissimilar names, for example when some kind of random hash is appended to each one.

In the future, I may make the script accept a file of URLs and directories, similar to how crontab works. I'm also looking into detecting when the page linked to by a thumbnail offers multiple sizes of the image – i.e. the Small - Medium - Large - Original header – and grabbing the image at the desired size.
