Get Scraping With Scrapy

This is one job you’ll be happy to give to the robots

Quick Overview

In this article, I’m going to take you on a quick journey through Scrapy. As the name implies, Scrapy is an open-source Python web-scraping package. Many folks in the data world are familiar with web scraping and with other packages like Beautiful Soup. If this is your first time encountering web scraping, sit tight.

Web scraping is a powerful tool for data collection. The idea is just as it sounds: go through the web and copy the data you want from web pages. This is generally easier said than done. The content we explore through a browser’s graphical interface is a narrow window onto the underlying structure that makes it all possible. It starts with the Document Object Model, or DOM for short. There are technical definitions out there, but in essence, the DOM is the tree structure a web browser builds so it can use and display HTML (or other) content on your monitor. When scraping the web for information, we will use the DOM to pull exactly the pieces of information we want, rather than grabbing the whole page.
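To make the tree idea concrete, here’s a minimal sketch using only Python’s standard-library html.parser. The tiny HTML fragment is made up for illustration; the parser records each tag with its nesting depth, which is a crude picture of the DOM tree a browser builds:

```python
from html.parser import HTMLParser

# A made-up fragment standing in for a real page.
PAGE = "<html><body><div><ul><li><a href='/ring'>Ring</a></li></ul></div></body></html>"

class TreeSketch(HTMLParser):
    """Record (depth, tag) pairs to visualize the DOM's nesting."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

sketch = TreeSketch()
sketch.feed(PAGE)
for depth, tag in sketch.nodes:
    print("  " * depth + tag)
```

Each line of output is indented one level deeper than its parent, so the anchor element sits five levels inside the document. That nesting is exactly what scraping tools navigate.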

Let’s check out the DOM

As a quick exercise, follow along with the following instructions to get an understanding of what all this DOM and HTML stuff means. If you’re practiced at this kind of thing go ahead and skip to the next section.

  1. Click anywhere on this page
  2. On your keyboard, press F12. You should see a sidebar window pop up with a bunch of code-looking stuff. This isn’t the DOM itself; what you are looking at are the web development tools for the browser you’re currently using. Don’t worry, you can’t break anything by making changes: this is a local copy of my article that’s been retrieved from the web.
  3. At the top of the window, click the “Elements” tab or button: This is what my article looks like in HTML form. Feel free to expand sections of code by clicking the triangles at the beginning of the lines. If you continue this for a while you’ll see how complicated the structure of modern web content can be. For the most part, what you see are things added by Medium to make their content searchable and interactive, and to collect user data.
  4. On the top bar of the dev tools window, click the image of a mouse pointer over a square: Remember this step for all your future web-scraping endeavors. I personally find the selector tool to be the most useful aspect of the dev tools. When you select this icon (it should change color to let you know it’s ready to use), you can click the content on the web page itself and it will take you directly to said element in the Elements tab of the dev tools.

Ultimately, the goal of web scraping is to pull down only what you need. Doing this requires specifying exactly where the content you want sits in the HTML version of the webpage.

Why Scrapy?

Web scrapers are more or less a “feel” kind of thing. Which one you use will depend on the style you like and the power of the tool. In my opinion, Scrapy has everything you could ever want in terms of power. It also can be simplified if a project doesn't require extensive customization. As Michael Scott would say, “It’s a win-win-win.”

Here are some of the specific features I think are worth mentioning.

  • Asynchronous requests: This is a big deal if you are trying to parse large amounts of content quickly, as the bot (called a spider) will move on to pulling more information down even if the previous request isn’t complete.
  • Custom Scrapy classes are well defined, easy to use, and make parsing simple(ish).
  • The docs are AWESOME.
  • The tutorials come with many examples.
  • The Scrapy shell allows you to test scraping operations without running the full spider. You can also insert the shell into the spider to inspect results mid-crawl!
  • Has some serious customization chops: Middleware, serialization, throttling, and integrated data processing at runtime.
  • Has native support for regular expressions and XML.

All in all, Scrapy can be used by beginners and experts. It has a stable community and is regularly supported. If you begin with Scrapy today, you can grow into it as you become more adept at scraping.

Scrapy Example:

I’m a dungeon and dragons (D&D) person. I’m sick and tired of spending hours reading docs to try and create cool items for my players. Ultimately, all this stuff exists on D&D websites, but I would like it in a local database so I can pull this stuff up randomly with the click of a button. Looks like a perfect opportunity for scraping.

Let’s start the way every Python project does.

pip install scrapy

I personally recommend using conda or creating a virtual environment for any specialized Python project. If you’re unfamiliar, the Python venv docs and the conda user guide are good starting points (the conda route only applies if you have Anaconda installed).
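For instance, with the standard-library venv module (the environment name scrapy-env here is just an example; pick your own):

```shell
# Create an isolated environment and activate it before installing Scrapy.
python3 -m venv scrapy-env
. scrapy-env/bin/activate
python -m pip --version   # pip now points inside the new environment
```

On Windows, the activation script is scrapy-env\Scripts\activate instead.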

Woohoo! We’ve installed Scrapy. Go ahead and open up your favorite terminal or code editor and navigate to somewhere on your computer where you won’t mind creating some tutorial folders and files. Creating a Scrapy project is as easy as

scrapy startproject dditems

This will create the following items in the working directory.

dditems/                  <- project folder
    __pycache__/
    spiders/
        __pycache__/
        __init__.py
        [This is where we will create a spider file]
    items.py
    middlewares.py
    pipelines.py
    settings.py
items.json
scrapy.cfg

There’s a fair amount of stuff here, but don’t be overwhelmed. The only thing we are going to focus on for this example is the line I inserted under the spiders folder. The spider file doesn’t exist just yet, but we’ll create one in a minute. As for everything else, these are files for customizing how the spider works. I highly recommend checking out the docs on these features, as they are incredibly powerful tools for processing the data into a desired form.

The first thing we are going to do is create the spider. Create a file called items_spider.py in the spiders folder. FYI, Scrapy discovers spiders by the name attribute you give the class, not by the file name, so the file just needs to live in the spiders folder. Insert the following code into the new file.

import scrapy
import time


class ItemSpider(scrapy.Spider):
    name = "items"

    def start_requests(self):
        urls = [
            'https://www.d20pfsrd.com/magic-items/rings',
            'https://www.d20pfsrd.com/magic-items/staves',
            'https://www.d20pfsrd.com/magic-items/wands',
            'https://www.d20pfsrd.com/magic-items/wondrous-items',
            'https://www.d20pfsrd.com/magic-items/artifacts',
            'https://www.d20pfsrd.com/magic-items/intelligent-items',
            'https://www.d20pfsrd.com/magic-items/cursed-items',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for href in response.css(
                'div.ogn-childpages '
                'ul.ogn-childpages '
                'li a::attr(href)').getall():
            yield response.follow(href, callback=self.chaseLinks)
            time.sleep(.1)

    def chaseLinks(self, res):
        itemName = res.css('main h1::text').get()
        itemInfo = res.css('div.page-center p::text').getall()
        yield {
            'name': itemName,
            'info': itemInfo,
        }

Now let’s go through this line by line. The first two pull the Scrapy and time modules into the file. Next, we create a spider class. Scrapy has several base classes we can choose to extend (borrow the underlying class structure from); we’ll use the base scrapy.Spider class for this example.

As part of this base spider class, we implement a method called start_requests(). This method is called when the crawl begins and serves as the origin of our crawling operation. The urls variable inside this method is a simple Python list housing the pages we want to start with when scraping our D&D site. I suggest going to the site and checking out the structure of things. I am going to build the structure of the spider for you, but in general, it helps to browse around your target site. Why, you might ask? Well, it’s generally understood that developers are lazy, and for good reason! Making content without some sort of easy-to-use template is a recipe for a lot of extra work as a developer. For this reason, the structure of websites is often predictable. That’s good news when scraping for information, as it makes tailoring our spider to the info we want much simpler.
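Note that start_requests() is a generator: it yields one request at a time, and Scrapy’s scheduler consumes them lazily. Here’s the same pattern in plain Python, with a hypothetical Request class standing in for scrapy.Request:

```python
class Request:
    """Hypothetical stand-in for scrapy.Request; it just records its arguments."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def start_requests(urls, callback):
    # Nothing here runs until something iterates over this generator,
    # which is how Scrapy schedules requests without blocking.
    for url in urls:
        yield Request(url=url, callback=callback)

requests = list(start_requests(
    ['https://www.d20pfsrd.com/magic-items/rings',
     'https://www.d20pfsrd.com/magic-items/wands'],
    callback=None,
))
```

The list() call is what actually drives the generator; in a real crawl, Scrapy’s engine plays that role.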

The last section of this method is a for loop that iterates through our urls list and sends an HTTP request for each. This request object is a custom class from Scrapy that integrates cleanly with its spider and processing classes. In reality it’s not dissimilar from a standard HTTP GET request, but it has parameters that allow for chaining callbacks to the response object. In this case, the callback we will implement is the main method of the spider, the parse() method.

Parse is another required method for any spider class and will be the primary workhorse of the scraping operation. For this reason, we will spend some more time on it. If you look at the parameters of the parse method, you should see the non-self parameter is called response. When the start_requests() method sends requests to the various URLs we defined earlier, each one returns a response object that gets fed into the parse method. So what does the parse method do with the response?

def parse(self, response):
    for href in response.css(
            'div.ogn-childpages '
            'ul.ogn-childpages '
            'li a::attr(href)').getall():
        yield response.follow(href, callback=self.chaseLinks)
        time.sleep(.1)

The for loop iterates over the result of this expression.

response.css('div.ogn-childpages ul.ogn-childpages li a::attr(href)').getall()

This is where a healthy working knowledge of CSS comes in handy. Scrapy response classes have a built-in .css() method which uses the structure of the DOM to find HTML elements matching certain patterns. For those of you who’ve spent some time writing CSS files, this should look pretty familiar. It’s too much for this blog to go through the basics of CSS rules, so I’ll give the short and sweet explanation.

div.ogn-childpages

We start by looking for all div elements that have a class name of ogn-childpages.

ul.ogn-childpages

Under those div elements, we look for all ul (unordered list) elements with a class name of ogn-childpages.

li 

Then we find all li (list item) elements under those ul elements.

a::attr(href)

And finally, we pull all the a (anchor/link) elements under the li elements; more specifically, the value of the href attribute of each anchor. You might have noticed another method chained onto the back.

.getall()

This very handy method grabs every anchor that fits the criteria and returns the matches as a list. It’s not shown here, but there are related methods, like .get(), that pull different subsets of the matched elements if you don’t want all of them.
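Outside of Scrapy, you can mimic what this selector chain does with the standard library alone. This sketch (using html.parser, with a made-up HTML fragment echoing the site’s structure) keeps an href only when its anchor sits inside li, inside ul.ogn-childpages, inside div.ogn-childpages — roughly what .getall() returns:

```python
from html.parser import HTMLParser

# Made-up fragment mirroring the site's container structure.
PAGE = """
<div class="ogn-childpages">
  <ul class="ogn-childpages">
    <li><a href="https://example.com/ring-a">Ring A</a></li>
    <li><a href="https://example.com/ring-b">Ring B</a></li>
  </ul>
</div>
<div class="sidebar"><a href="https://example.com/ad">Ad</a></div>
"""

class HrefCollector(HTMLParser):
    """Collect hrefs from anchors nested in div.ogn-childpages ul.ogn-childpages li."""
    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, class) for each currently open element
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs and self._in_childpages_list():
            self.hrefs.append(attrs['href'])
        self.stack.append((tag, attrs.get('class', '')))

    def handle_endtag(self, tag):
        # Pop elements back to (and including) the matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def _in_childpages_list(self):
        return (('div', 'ogn-childpages') in self.stack
                and ('ul', 'ogn-childpages') in self.stack
                and any(t == 'li' for t, _ in self.stack))

collector = HrefCollector()
collector.feed(PAGE)
```

The sidebar link is filtered out because its ancestors don’t match, which is exactly the point of working backwards up the DOM. Scrapy’s .css() does all of this bookkeeping for you in one line.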

OK, how on earth did I know to look for these? Go ahead and open one of the URLs from the urls variable in your browser. Now, pulling out your dev tools skills, use the selector tool to click on one of the links to an item we’re interested in scraping. For example, at https://www.d20pfsrd.com/magic-items/rings, if we go down to the bottom of the page where all the various rings are listed, we can use the selector to click on a link. Doing so should lead to something like this.

<a href="https://www.d20pfsrd.com/magic-items/rings/aquatic-ring-of-resistance/">Aquatic Ring of Resistance</a>

Now we have the type of link that we want. Time to work backwards to make sure we get ONLY what we want. These anchor elements don’t have any specific tags that make them easy to find, and the anchor tag by itself isn’t nearly specific enough to pull only these links. To solve this problem we need to move farther up the DOM to find something we can use to filter down to what we want. I’m not going to go through every bit, but if you look at what these anchor elements fall under, you find div, ul, and li elements as the nearest containers. In the example, I use their class attributes to pull the specific container elements and whittle it down.

This example is slightly more advanced than a basic one because the landing URLs don’t actually have all the item information we want. To get all the details we will need to pass through the links we’ve found up to this point and get to the individual item pages.

Fortunately, the structure of gathering information from subsequent pages uses the exact same process! Now if only we knew how to make the spider go to these new links….

Enter stage right the .follow() method. Describing this method in scientific terms: it’s TOTALLY AWESOME. Let me explain. I want the spider to send another request to each URL we found and return the same type of response object we already have. Here’s how we do that.

yield response.follow(href, callback = self.chaseLinks)
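A side note worth knowing before moving on: unlike a raw scrapy.Request, response.follow() also accepts relative URLs and resolves them against the current page, much like the standard library’s urljoin:

```python
from urllib.parse import urljoin

# A relative href like this is common in the wild; follow() resolves it
# against the page it was found on, the same way urljoin does.
base = 'https://www.d20pfsrd.com/magic-items/rings/'
print(urljoin(base, 'aquatic-ring-of-resistance/'))
# → https://www.d20pfsrd.com/magic-items/rings/aquatic-ring-of-resistance/
```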

That’s it. It’s exactly the same as our original start_requests() method down to the callback keyword. All that’s left is to define the new callback which will go gather the info from the actual item pages. Here’s the chaseLinks method that will do that.

def chaseLinks(self, res):
    itemName = res.css('main h1::text').get()
    itemInfo = res.css('div.page-center p::text').getall()
    yield {
        'name': itemName,
        'info': itemInfo,
    }

This method yields a dictionary with “name” and “info” keys which house the info we pull down from the site. You may notice that I’m taking all the text from the paragraph elements under div elements with a class name of “page-center”. To actually use this info, the text would need substantially more preprocessing. That’s beyond the scope of this example, but as an FYI, this is where some of the more powerful features of Scrapy come in handy.
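As a taste of that preprocessing, here’s a sketch of the kind of cleanup the yielded dictionaries would need (the sample item is made up to mirror the spider’s output shape):

```python
def clean_item(item):
    """Strip whitespace, drop empty paragraphs, and join the rest into one string."""
    paragraphs = [p.strip() for p in item['info'] if p.strip()]
    return {
        'name': (item['name'] or '').strip(),
        'info': ' '.join(paragraphs),
    }

# Made-up sample mirroring what chaseLinks yields.
raw = {
    'name': 'Aquatic Ring of Resistance\n',
    'info': ['  Aura faint abjuration; ', '\n', 'CL 5th ', ''],
}
cleaned = clean_item(raw)
# cleaned['info'] == 'Aura faint abjuration; CL 5th'
```

In a real project this logic would live in an item pipeline (pipelines.py), so every scraped item gets cleaned automatically at runtime.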

At this point, we start wondering whether what we’ve built will actually accomplish what we want. We could go ahead and run the full spider, and in this case that’s not the worst idea, considering the scope of our scrape is pretty limited. However, if you’re trying to pull down a couple hundred thousand entries, a limited test run makes much more sense. This is a great use case for the Scrapy shell.

In the environment that has Scrapy installed, run the following command.

scrapy shell https://www.d20pfsrd.com/magic-items/

This command will create a session in the shell and return a response from the URL we provide. This response will live in the shell such that we can run the code from our spider against it and see what comes back. Go ahead and throw our first bit into the shell.

response.css('div.ogn-childpages ul.ogn-childpages li a::attr(href)').getall()

You should see a whole bunch of hyperlinks run through the terminal. Sweet, that means the first bit works!

Go ahead and close the shell and reopen it on one of the links we are chasing. Then run the code from the chaseLinks method. One gotcha: in the shell, the response object is always named response, so swap response in for res.

response.css('main h1::text').get()
response.css('div.page-center p::text').getall()

Everything looks like it’s working! Now let’s run the full spider. Don’t worry, this is one of the easiest parts. Go to the root of the project folder and run this command. It will write all our scraped information to a file named dditems.json on the same level as the project folder.

scrapy crawl dditems -O dditems.json
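Once the crawl finishes, the original goal of pulling up a random item with the click of a button is a few lines of standard-library Python. (The sample data here stands in for real scraped output so the sketch is self-contained.)

```python
import json
import random

# Stand-in for the real dditems.json the crawl produces.
sample = [
    {'name': 'Ring of Sample', 'info': ['Aura faint; CL 3rd']},
    {'name': 'Wand of Placeholder', 'info': ['Aura moderate; CL 7th']},
]
with open('dditems.json', 'w') as f:
    json.dump(sample, f)

def random_item(path='dditems.json'):
    """Load the scraped items and return one at random."""
    with open(path) as f:
        items = json.load(f)
    return random.choice(items)

item = random_item()
print(item['name'])
```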

An Important Side Note: The Scrape Didn’t Work

This is an important time to bring up a critical component of web scraping: ethics. Web scraping is a very powerful technology that allows a single person to harvest tons of data with relatively little effort. For obvious reasons this can be problematic: intellectual property theft, privacy invasion, and potential denial of service, to name a few. This isn’t a hypothetical issue either. In the example I chose, the domain d20pfsrd.com publishes rules that tell web-scraping bots what they may and may not do. If you go to https://www.d20pfsrd.com/robots.txt you will see this information.

# This virtual robots.txt file was created by the Virtual Robots.txt WordPress plugin: https://www.wordpress.org/plugins/pc-robotstxt/
User-agent: *
Disallow: /wp-json/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Allow: /wp-includes/js/
Allow: /wp-includes/images/
Disallow: /trackback/
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /staging/
Disallow: /staging__trashed/
Allow: /sitemap*.xml
Allow: /sitemap*.xml.gz

crawl-delay: 10

Sitemap: http://www.d20pfsrd.com/sitemap.xml

Each Disallow line names a path that compliant bots must not crawl (the first one, /wp-json/, is a WordPress API path, not a rule about JSON output), and the crawl-delay line asks bots to wait ten seconds between requests. Scrapy checks this file before crawling.
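You can check these rules programmatically with the standard library’s urllib.robotparser before pointing a spider anywhere:

```python
import urllib.robotparser

# A few of the rules above, pasted in directly; in practice you'd call
# rp.set_url('https://www.d20pfsrd.com/robots.txt') and rp.read().
RULES = """\
User-agent: *
Disallow: /wp-json/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch('*', 'https://www.d20pfsrd.com/magic-items/rings'))  # True
print(rp.can_fetch('*', 'https://www.d20pfsrd.com/wp-json/'))           # False
```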

Where does this leave our project?

Ultimately, that’s up to you. In addition to all the other cool things, Scrapy allows spiders to be configured to ignore these types of rules. In the project directory you will see the settings.py file we listed earlier, which comes preconfigured to look like this.

# Scrapy settings for d20pfsrd_scape project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'd20pfsrd_scape'

SPIDER_MODULES = ['d20pfsrd_scape.spiders']
NEWSPIDER_MODULE = 'd20pfsrd_scape.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'd20pfsrd_scape (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

...

This file is home to the ROBOTSTXT_OBEY setting. By default, it is set to True, and under that setting your Scrapy spider will honor the rules of the websites it visits. However, if you change this variable to False, Scrapy will ignore the rules in robots.txt and scrape the site anyway.
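If you do decide a crawl is justified, Scrapy’s settings also let you be a considerate guest. These are all real settings from the Scrapy docs; the values are just reasonable examples for this site:

```python
# settings.py -- throttle the spider instead of hammering the server.
ROBOTSTXT_OBEY = True                 # honor robots.txt
DOWNLOAD_DELAY = 10                   # match the site's crawl-delay of 10 seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time per domain
AUTOTHROTTLE_ENABLED = True           # back off automatically under load
```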

I’m including this information to give you a sense of the power of this technology. Yes, websites aren’t totally defenseless against bots that choose to ignore the rules, but with some more clever tweaks, the vast majority of sites on the web are completely incapable of stopping your spiders.

When you go about scraping sites for information, think about why these rules exist. Think about how you would feel if someone scraped a bunch of your hard work, or compiled your personal information from social media sites without regard for generally understood principles of good faith.

There’s no one-size-fits-all solution to these questions. Sometimes, the use of the data may justify bending the rules. For my pet project, I’m not going to distribute my database for anything other than the games that I play. I’m also a donor to the org. At the end of the day, I personally don’t feel as though I’ve abused the technology. I’ll leave it to you to decide.

References

  1. Zyte (2021) Scrapy V2.5 [docs] https://docs.scrapy.org/en/latest/
  2. Python Software Foundation (2021) Python V3.9.5 [docs] https://www.python.org/doc/
  3. Henry, Matthew (2021) Peeling Paint Wall Texture Photo [photo]. Retrieved from https://burst.shopify.com/photos/peeling-paint-wall-texture
  4. Magic Items, https://www.d20pfsrd.com/magic-items

I love life, family, math and the internet. I’ve done everything from academic research to digging holes. I can be stubborn but always try to keep an open mind.