Python Scrapy Introduction

The Python Scrapy library is a very popular software package for web scraping. Web scraping is the process of programmatically extracting key data from web pages. Using this technique, it’s possible to scrape data from a single page or to crawl across multiple pages, scraping data from each one as you go; this second approach is referred to as web crawling, since the software bot follows links to find new data to scrape. Scrapy makes it possible to set up these web crawlers in an automated way, and we’ll learn how to get started with Scrapy now.


Install Scrapy

Installing Scrapy is very easy and can be done right at the terminal.

pip install Scrapy

Once that is complete, you can check on the installation by viewing the help menu using this command.

scrapy $scrapy --help
Scrapy 2.4.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Notice the shell command. We’ll be looking at the Scrapy shell in the next tutorial.


Start Scrapy Project

Scrapy is a feature-rich framework, so you begin projects much as you do in Django. The command below instructs Scrapy to build a project named scrapy_testing.

scrapy $scrapy startproject scrapy_testing
New Scrapy project 'scrapy_testing' created in:
    C:\python\scrapy\scrapy_testing

You can start your first spider with:
    cd scrapy_testing
    scrapy genspider example example.com

Scrapy In PyCharm

After Scrapy generates the folders and files to hold the Scrapy project, we can open it up in a great IDE like PyCharm or Visual Studio Code.


Scrapy Project Files

A new Scrapy project creates a scaffold of all the needed files for you. Those files are listed here with relevant links to helpful documentation.

  • spiders holds the Spider classes you create. Each spider defines how a certain site (or group of sites) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from its pages.
  • items.py defines the models for your scraped items, the objects or entities that you are scraping. Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler, as there is no concept of different field types.
  • middlewares.py defines the models for your spider middleware, or Scrapy hooks. When sending a request to a website, the request can be updated or modified on the fly, and the same holds true for responses. For example, if you wanted to add a proxy to all requests, you can do so in middleware.
  • pipelines.py defines your item pipelines, the functions that process and filter items. Pipelines are used for cleansing HTML data, validating scraped data, checking for duplicates (and dropping them), and storing the scraped item in a database if desired.
  • settings.py holds the project settings. For simplicity, this file contains only settings considered important or commonly used. Here you can configure the bot name; the BOT_NAME variable is automatically set to the name of your Scrapy project when you create it. A custom USER_AGENT can also be set here if you like.
  • scrapy.cfg holds configuration information
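The pipelines.py responsibilities described above can be sketched in plain Python. This is a simplified, standalone illustration rather than Scrapy's actual API: a real pipeline class raises scrapy.exceptions.DropItem from its process_item(self, item, spider) method, and here DropItem is a local stand-in so the sketch runs without Scrapy installed.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class DuplicatesPipeline:
    """Validates scraped items and drops duplicates, mirroring the
    shape of a Scrapy item pipeline."""

    def __init__(self):
        self.titles_seen = set()

    def process_item(self, item, spider=None):
        # Validate: every scraped item must carry a title.
        if not item.get('title'):
            raise DropItem('Missing title in %r' % item)
        # Check for duplicates and drop repeats.
        if item['title'] in self.titles_seen:
            raise DropItem('Duplicate item: %r' % item)
        self.titles_seen.add(item['title'])
        return item


pipeline = DuplicatesPipeline()
print(pipeline.process_item({'title': 'Countries of the World'}))
```

In a real project, the pipeline class would live in pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py.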

Spiders

A Scrapy project can be thought of as primarily a collection of spiders. Here we can create a new web spider in the Scrapy project. The command below instructs Scrapy to create a new testing spider that crawls data from scrapethissite.com.

cd scrapy_testing/spiders
spiders $scrapy genspider testing scrapethissite.com
Created spider 'testing' using template 'basic' in module:
  scrapy_testing.spiders.testing

spiders/testing.py
The default boilerplate code gets created for you when you run the genspider command. We can see that the generated class uses Python Inheritance to inherit all of the power of the Spider class. You can create a Scrapy spider class manually, but it is much faster and less prone to errors if you make use of that genspider command.

import scrapy


class TestingSpider(scrapy.Spider):
    name = 'testing'
    allowed_domains = ['scrapethissite.com']
    start_urls = ['http://scrapethissite.com/']

    def parse(self, response):
        pass

Scrapy passes a response object to the parse() function, and we want to fill this method in with code that returns an object containing the data scraped from our site. The response variable holds the entire source markup and content of the URL that the request was made to; inside parse() we narrow those contents down to the data we are actually interested in. Here are some additional details about the Spider.

  • scrapy.Spider The Base class for all Scrapy spiders. Spiders you create must inherit from this class.
  • name This string sets the name for the spider. Scrapy uses this to instantiate a new spider so it needs to be unique.
  • allowed_domains This is an optional list of domains that the spider is allowed to crawl.
  • start_urls This is where the spider begins crawling from.

XPath or CSS

Before we start filling in the parse() method, we need to look at some details about XPath and CSS selectors. In Scrapy you can extract data from the source webpage using either one. CSS selectors tend to be very popular with front-end developers, while XPath is often favored by those comfortable writing more complex query expressions. Both are perfectly valid approaches to selecting the needed data, though XPath is known to be a little more robust, so that is what we will use here.
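As a quick, standalone illustration of how an XPath expression narrows down a document, here is a sketch using Python's built-in xml.etree.ElementTree, which supports only a subset of XPath (Scrapy itself supports full XPath via the parsel package). The markup is a hypothetical snippet modeled on the example page, not the page's actual source.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the markup on the example page (hypothetical snippet).
html = """
<div>
    <h3 class="page-title"><a href="/pages/simple/">Countries of the World: A Simple Example</a></h3>
    <h3 class="page-title"><a href="/pages/frames/">Turtles All the Way Down</a></h3>
</div>
"""

root = ET.fromstring(html)
# Same shape as the Scrapy expression: every h3 with class "page-title",
# then the child anchor tag.
links = root.findall(".//h3[@class='page-title']/a")
first_title = links[0].text  # .get() in Scrapy likewise returns the first match
print(first_title)
```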


Scraping A Page

The page we are going to scrape lives at https://scrapethissite.com/pages/.

In order to scrape this page successfully, we need to update the start_urls variable along with the parse() function in our spider class. The XPath expression we are using below simply says, “find the h3 tags that have a class of ‘page-title’, then look at the text content of the child anchor tag”. At the end of the XPath expression, we append the Scrapy .get() method to fetch the first result.

import scrapy


class TestingSpider(scrapy.Spider):
    name = 'testing'
    allowed_domains = ['scrapethissite.com']
    start_urls = ['https://scrapethissite.com/pages/']

    def parse(self, response):
        title = response.xpath('//h3[@class="page-title"]/a/text()').get()
        return {'title': title}

Running Your Spider

To run your spider, Scrapy provides the runspider command, which you can use like so.

spiders $scrapy runspider testing.py

The output is quite verbose, but if you inspect it you will find the data you wanted to scrape. It worked!

{'title': 'Countries of the World: A Simple Example'}

The crawl command

Another way to run your spider, which may be a little cleaner, is to use the crawl command from within the project directory.

scrapy crawl testing

Python Scrapy Introduction Summary

There you have it, a nice introduction to the powerful Python Scrapy library. We learned how to use Scrapy to define a new project, create a new web spider, and fetch some data from a web page.