The Python Scrapy library is a popular software package for web scraping, the process of programmatically extracting data from web pages. Using this technique, it's possible to scrape data from a single page or to crawl across multiple pages, scraping data from each one as you go. This second approach, in which the software bot follows links to find new data to scrape, is referred to as web crawling. Scrapy makes it possible to set up these web crawlers in an automated way, and we'll learn how to get started with Scrapy now.
Install Scrapy
Installing Scrapy is very easy and can be done right from the terminal.
pip install Scrapy
Once that is complete, you can check on the installation by viewing the help menu using this command.
scrapy $scrapy --help
Scrapy 2.4.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Notice the shell command. We’ll be looking at the Scrapy shell in the next tutorial.
Start Scrapy Project
Scrapy is a feature-rich framework, and as such, you begin projects similar to how you do in Django. The command below instructs Scrapy to build a project named scrapy_testing.
scrapy $scrapy startproject scrapy_testing
New Scrapy project 'scrapy_testing' created in:
    C:\python\scrapy\scrapy_testing

You can start your first spider with:
    cd scrapy_testing
    scrapy genspider example example.com
Scrapy In PyCharm
After Scrapy generates the folders and files to hold the Scrapy project, we can open it up in a great IDE like PyCharm or Visual Studio Code.
Scrapy Project Files
A new Scrapy project creates a scaffold of all the needed files for you. Those files are listed here with relevant links to helpful documentation.
- spiders holds the Spider classes you create, which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from its pages
- items.py defines the models for your scraped items: the objects or entities that we're scraping. Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler, as there is no concept of different field types.
- middlewares.py defines the models for your spider middleware, or Scrapy hooks. When sending a request to a website, the request can be updated or modified on the fly, and the same holds true for responses. For example, if you wanted to add a proxy to all requests, you could do so in middleware.
- pipelines.py defines your item pipelines: functions that process and filter items. Pipelines are used for cleansing HTML data, validating scraped data, checking for duplicates (and dropping them), and storing the scraped item in a database if desired.
- settings.py holds the project settings. For simplicity, this file contains only settings considered important or commonly used. In the settings file, you can configure the bot name; the BOT_NAME variable is automatically set to the name of your Scrapy project when you create it. A custom USER_AGENT can also be set here if you like.
- scrapy.cfg holds configuration information
Spiders
A Scrapy project can be thought of as primarily a collection of spiders. Here we can create a new web spider in the Scrapy project. The command below instructs Scrapy to create a new testing spider that crawls data from scrapethissite.com.
cd scrapy_testing/spiders
spiders $scrapy genspider testing scrapethissite.com
Created spider 'testing' using template 'basic' in module:
  scrapy_testing.spiders.testing
spiders/testing.py
The default boilerplate code gets created for you when you run the genspider command. We can see that the generated class uses Python Inheritance to inherit all of the power of the Spider class. You can create a Scrapy spider class manually, but it is much faster and less prone to errors if you make use of that genspider command.
import scrapy


class TestingSpider(scrapy.Spider):
    name = 'testing'
    allowed_domains = ['scrapethissite.com']
    start_urls = ['http://scrapethissite.com/']

    def parse(self, response):
        pass
The parse() function is passed a response object by Scrapy, and we want to fill it in with something that returns an object containing the data scraped from our site. In other words, the response variable holds the entire source markup and content of the URL that the request was made to. It is inside this parse() method that we need to define code that narrows down the response contents to the data we are actually interested in.
XPath or CSS
Before we start filling in the parse() method, we need to look at some details about XPath and CSS selectors. In Scrapy you can extract data from the source webpage using either XPath or CSS selectors. CSS selectors tend to be very popular with front-end developers, while XPath is a query language designed for navigating XML and HTML documents. Both are perfectly valid approaches to selecting the needed data, though XPath is known to be a little more robust, so that is what we will look at now.
Scraping A Page
The page we are going to scrape lives at https://scrapethissite.com/pages/.
In order to scrape this page successfully, we need to update the start_urls variable along with the parse() function in our spider class. Note that the XPath expression we are using below simply says, "find every h3 tag that has a class of 'page-title', then look at the text content of the child anchor tag". At the end of the XPath expression, we chain the Scrapy .get() method to fetch only the first result.
import scrapy


class TestingSpider(scrapy.Spider):
    name = 'testing'
    allowed_domains = ['scrapethissite.com']
    start_urls = ['https://scrapethissite.com/pages/']

    def parse(self, response):
        title = response.xpath('//h3[@class="page-title"]/a/text()').get()
        return {'title': title}
Running Your Spider
To run your spider, Scrapy provides the runspider command that you can use like so.
spiders $scrapy runspider testing.py
The output is quite verbose, but if you inspect it you will find the data you wanted to scrape. It worked!
{'title': 'Countries of the World: A Simple Example'}
The crawl command
Another way you can run your spider that might be a little cleaner is to use the crawl command.
scrapy crawl testing
Python Scrapy Introduction Summary
There you have it, a nice introduction to the powerful Python Scrapy library. We learned how to use Scrapy to define a new project, create a new web spider, and fetch some data from a web page.