Okay, so I know everyone’s written some kind of web scraper in their time, but I’m still proud of myself for my own take on the subject.
Recently, I saw an interesting post on Reddit that provided some source code for scraping images based on the similarity of their names. This might be useful for downloading all of the images in a gallery (for example, if you want to keep something posted on Imgur) without having to follow all of the links by hand. However, quite a lot of the code was missing, it relied on some random site, and it handled all kinds of common cases badly – thumbnails, for example.
So I rewrote it. And expanded it. Massively. The current usage output is:
Usage:
    galleryscraper.py URL DIR [--threads N --log-level N -q -s]
    galleryscraper.py -h | --help | --version

Options:
    --threads N            the number of threads to use [default: 4]
    -V, --log-level N      the level of info logged to the console, which can
                           be one of INFO, DEBUG, or WARNING [default: INFO]
    -s, --skip-duplicates  ignore files that have been downloaded already
    -q, --quiet            suppress output to console
    -v, --version          show program's version number and exit
    -h, --help             show this help message and exit
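Incidentally, that help text matches the docopt convention (the Usage/Options layout and the [default: ...] markers), in which case parsing it is as simple as handing over the docstring. This is a minimal sketch under that assumption – the script may well parse its arguments differently:

from docopt import docopt

# Assumes the usage text above lives in the module docstring (__doc__)
# and that docopt is the parser in use, which is my guess, not a given.
if __name__ == '__main__':
    args = docopt(__doc__, version='galleryscraper 1.0')
    print(args['URL'], args['DIR'], args['--threads'])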
In particular, I worked out a clever way of dealing with thumbnails. Often, the thumbnails in a gallery each link to a page full of ads and so on, with one large image (the one you actually want) in the centre. So I figured that if one found the image on the page with the largest Content-Length, that would be the image in question. This technique seems to work extremely well.
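In code, the heuristic boils down to something like this – a minimal sketch rather than the script's actual implementation (pick_main_image and the HEAD-request approach are illustrative):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def pick_main_image(page_url):
    """Return the URL of the image on the page with the largest Content-Length."""
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    best_url, best_size = None, -1
    for img in soup.find_all('img', src=True):
        url = urljoin(page_url, img['src'])
        # A HEAD request reads the Content-Length header without
        # downloading every candidate image in full
        head = requests.head(url, allow_redirects=True)
        size = int(head.headers.get('Content-Length', 0))
        if size > best_size:
            best_url, best_size = url, size
    return best_url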
Mostly, this was an opportunity to fill some gaps in my programming experience. First and foremost is how HTTP requests actually work (success), but I also learned a lot about parallelization and Python decorators. I made use of the excellent (and pretty standard at this point) Python libraries Requests and BeautifulSoup.
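The parallel part is conceptually simple: a pool of worker threads shares the downloading. This is a rough sketch of the shape of it (download and download_all are placeholder names, not the script's own):

import os
from concurrent.futures import ThreadPoolExecutor

import requests

def download(url, directory):
    """Fetch one image and write it into the target directory."""
    response = requests.get(url)
    filename = os.path.join(directory, url.rsplit('/', 1)[-1])
    with open(filename, 'wb') as f:
        f.write(response.content)

def download_all(image_urls, directory, threads=4):
    # The pool size corresponds to the --threads option
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for url in image_urls:
            pool.submit(download, url, directory)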
The script itself has a couple of interesting features:
- I write really, really well-documented code. It helps me keep track of what goes on, explain (and then improve) my approach, and it often comes in handy when I go back and try to figure out what's happening. This should make the code itself really easy to read.
- The script contains no classes. Everything is a function, and functions are designed to have manageable side effects. I certainly haven't been converted to functional programming, but I have become pretty irritated by the everything-must-be-a-class paradigm. This script is essentially one big function that turns a web page into a directory of images. If anyone's interested in the things that go into accomplishing that, they might well benefit from the functions in the script itself. I've also seen the light that is namedtuples, which make the return values of more complex functions much more manageable (there's a short illustrative sketch after the decorator below).
- I learned to love decorators. No, seriously. I had never really thought of a use for them before, but here such uses turned up in abundance. I'm under the impression that this will have a big impact on how I think about writing code in Python going forward.
My favourite decorator at the moment:
from functools import wraps

import requests
from requests.adapters import HTTPAdapter

def sessional(func):
    """
    Decorator that maintains the same session for URL requests within the given
    function. It relies on the wrapped function taking a ``session`` keyword
    argument. At the moment this is only used for the safe_request function,
    but if timeouts are not a problem it can be used to wrap all of the other
    functions that make requests instead.
    """
    session = requests.Session()
    # Retry failed connections, and present a browser-like User-Agent
    session.mount('http://', HTTPAdapter(max_retries=5))
    session.headers['User-Agent'] = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    @wraps(func)
    def newfunc(*args, **kwargs):
        # Inject the shared session into the wrapped function's keyword arguments
        kwargs['session'] = session
        return func(*args, **kwargs)
    return newfunc
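Using it is then just a matter of decorating a function that takes a session keyword argument. The docstring mentions safe_request; its body isn't shown here, so this is only a guess at its shape:

@sessional
def safe_request(url, session=None, timeout=10):
    # The decorator swaps in its shared, retrying session for us
    return session.get(url, timeout=timeout)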
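And since I mentioned namedtuples above: the point is that a function returning several values can name them instead of relying on tuple order. An illustrative example (ScrapeResult and its fields are mine, not the script's):

from collections import namedtuple

import requests

ScrapeResult = namedtuple('ScrapeResult', ['url', 'status', 'content'])

def fetch(url):
    response = requests.get(url)
    return ScrapeResult(url=url, status=response.status_code,
                        content=response.content)

result = fetch('http://example.com')
print(result.status)  # fields are accessed by name, not by index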
Of course, there are still some edge cases. The script works very well most of the time, but a few things still trip it up. Usually this is because the images you want have different names, for example when some kind of random hash is appended to them.
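To make that concrete: the whole approach leans on gallery images having similar names, and a random hash defeats any similarity measure. One simple way to score filename similarity – difflib here is my illustration, not necessarily what the script does:

from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """True if two filenames are at least `threshold` similar."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

similar('photo_001.jpg', 'photo_002.jpg')    # True: sequential gallery names
similar('photo_001.jpg', 'photo_8f3c2d.jpg') # False: the random hash defeats it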
In the future, I may make the script accept a file of URLs and directories, similar to how crontab works. I'm also looking into working out when the page linked to by the thumbnail has multiple sizes of the image – i.e. the Small - Medium - Large - Original header – and getting the image at the desired size.