How To Create A Python Scrapy Project


Before creating a project in Scrapy, make sure you have worked through an introduction to the framework so that Scrapy is installed and ready to go. In this tutorial we'll create a new Python Scrapy project and walk through what to do once it is created. The process is the same for every Scrapy project, and it makes a good first exercise in web scraping with Scrapy.

startproject

To begin the project, run the scrapy startproject command followed by the name you want to give the project. The target website is located at https://books.toscrape.com, so we'll call the project bookstoscrape.

scrapy $scrapy startproject bookstoscrape
New Scrapy project 'bookstoscrape', using template directory 
'\python\python39\lib\site-packages\scrapy\templates\project', created in:
    C:\python\scrapy\bookstoscrape

You can start your first spider with:
    cd bookstoscrape
    scrapy genspider example example.com

We can open the project in PyCharm, and the project folder structure should look familiar to you at this point.
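For reference, a freshly generated project typically has a layout like this (names taken from the startproject output above; exact contents may vary slightly between Scrapy versions):

```text
bookstoscrape/
    scrapy.cfg            # deploy configuration
    bookstoscrape/        # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py
```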

genspider

Once a project has been created, you want to generate one or more Spiders for the project. This is done with the scrapy genspider command.

bookstoscrape $scrapy genspider books books.toscrape.com
Created spider 'books' using template 'basic' in module:
  bookstoscrape.spiders.books



books.py

Here is the default boilerplate code for a freshly generated Spider in Scrapy. It's convenient to have the structure of the code set up for us.

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass

Test XPath and CSS Selectors

Before adding code to the Spider that has been created for us, you first need to figure out which selectors will get you the data you want. This is done with the Scrapy shell, by inspecting the target page's source markup, and by testing selectors in the browser console.

bookstoscrape $scrapy shell 'https://books.toscrape.com/'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000001F2C93E31F0>
[s]   item       {}
[s]   request    <GET https://books.toscrape.com/>
[s]   response   <200 https://books.toscrape.com/>
[s]   settings   <scrapy.settings.Settings object at 0x000001F2C93E3430>
[s]   spider     <BooksSpider 'books' at 0x1f2c98485b0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Inspect HTML Source

A right click on the page will allow you to Inspect any element you like.

We are interested in each book and its associated data, all of which are contained in an article element.

Test XPath and CSS Selectors In Browser Console

Both Firefox and Chrome provide XPath and CSS Selector tools you can use in the console.

$x('the xpath')

Based on what we found by inspecting the source above, we know that each book item on the page lives inside of an <article> tag that has a class of product_pod. If we are using XPath, then the expression $x('//article') gets us all 20 book items on this first page.

$$('the css selector')

If you would rather use the CSS Selector version, which returns the same results, then $$('.product_pod') does the trick.

Test Selectors In Scrapy Shell

Once we have XPath or CSS Selectors that seem to work in the browser console, we can try them in the Scrapy shell, which is a great tool for this. Typing response.xpath('//article') or response.css('.product_pod') at the Scrapy shell returns 20 Selector objects in both cases, which makes sense because there are 20 book items on the page being scraped.

From Shell To Spider

It makes sense to try out those XPath and CSS Selectors both in the console of the browser and in Scrapy shell. This gives a good idea of what will work once it is time to start adding your own custom code to the Spider boilerplate code that the Scrapy framework provided.

Building The parse() Method

The purpose of the parse() method is to look at the response that is returned and, well, parse it. There are many ways to construct this part of the Spider, ranging from very basic to more advanced once you start adding Items and Item Loaders. Initially, the only goal is to return or yield a Python dictionary from that method. We'll look at an example of using yield here, with our custom code added to the boilerplate.

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get()
            }

scrapy crawl

We can now run the Spider using the scrapy crawl command.

bookstoscrape $scrapy crawl books

There will be a lot of output in the console, but you should be able to find all book titles.

{'booktitle': 'A Light in the ...'}
{'booktitle': 'Tipping the Velvet'}
{'booktitle': 'Soumission'}
{'booktitle': 'Sharp Objects'}
{'booktitle': 'Sapiens: A Brief History ...'}
{'booktitle': 'The Requiem Red'}
{'booktitle': 'The Dirty Little Secrets ...'}
{'booktitle': 'The Coming Woman: A ...'}
{'booktitle': 'The Boys in the ...'}
{'booktitle': 'The Black Maria'}
{'booktitle': 'Starving Hearts (Triangular Trade ...'}
{'booktitle': "Shakespeare's Sonnets"}
{'booktitle': 'Set Me Free'}
{'booktitle': "Scott Pilgrim's Precious Little ..."}
{'booktitle': 'Rip it Up and ...'}
{'booktitle': 'Our Band Could Be ...'}
{'booktitle': 'Olio'}
{'booktitle': 'Mesaerion: The Best Science ...'}
{'booktitle': 'Libertarianism for Beginners'}
{'booktitle': "It's Only the Himalayas"}

My yield statement is not iterating!

Important! The example above uses a yield statement instead of a return statement. Also note that we are running XPath sub-queries inside that loop. When you are inside of a loop and using XPath to complete sub-queries, you must include a leading period on the XPath selector so that it is relative to the current node. If you omit the leading period, the query searches the whole document, and you will get the first result back as many times as the loop runs.



Start Big Then Narrow Down

As you play with the XPath and CSS Selectors, it is tempting to look at the target page and write a brand-new query for every piece of information you want to scrape. Instead, start big and then narrow down. Our initial query selected 20 article elements, and we can drill into each one individually from there. If you want the title, rating, price, and availability for every book on the page, you are not going to use 80 different selectors. You grab the 20 books at the top level, then pull 4 pieces of data from each book. The code below shows how to build these sub-queries on top of the original XPath query.

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get(),
                'bookrating': book.xpath('.//p').attrib['class'],
                'bookprice': book.xpath('.//div[2]/p/text()').get(),
                'bookavailability': book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get().strip()
            }

The bookavailability selector was a little tricky. We are trying to get the text that comes after the <i> tag; however, that text is kind of in no man's land. For this, we can use the following-sibling::text() selector. We also call strip() to get rid of the surrounding whitespace, but we'll learn how Item Loaders can handle this more cleanly soon.

<p class="instock availability">
    <i class="icon-ok"></i>
    
        In stock
    
</p>

Scrapy Output

To actually output the data we capture, we can add the -o flag to the scrapy crawl command to output to a CSV or JSON file.

bookstoscrape $scrapy crawl books -o books.json

Once you run the command, you’ll see a new file appear in the Scrapy project which holds all the data you just collected.
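If you would rather not pass -o on every run, recent versions of Scrapy (2.1+) also let you configure the output as a feed in settings.py. A minimal sketch; the file name here is just an example:

```python
# settings.py — roughly equivalent to `scrapy crawl books -o books.json`,
# using the FEEDS setting (available since Scrapy 2.1)
FEEDS = {
    'books.json': {
        'format': 'json',
        'overwrite': True,  # start fresh each run instead of appending
    },
}
```

Note that -o appends to an existing file; in Scrapy 2.1+ the capital -O flag overwrites it instead.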


books.json result
The final result is a JSON file that has 20 objects, each having 4 attributes for the title, rating, price, and availability. You can now practice your data science skills on various sets of data that you collect.

[
  {
    "booktitle": "A Light in the ...",
    "bookrating": "star-rating Three",
    "bookprice": "£51.77",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Tipping the Velvet",
    "bookrating": "star-rating One",
    "bookprice": "£53.74",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Soumission",
    "bookrating": "star-rating One",
    "bookprice": "£50.10",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Sharp Objects",
    "bookrating": "star-rating Four",
    "bookprice": "£47.82",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Sapiens: A Brief History ...",
    "bookrating": "star-rating Five",
    "bookprice": "£54.23",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "The Requiem Red",
    "bookrating": "star-rating One",
    "bookprice": "£22.65",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "The Dirty Little Secrets ...",
    "bookrating": "star-rating Four",
    "bookprice": "£33.34",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "The Coming Woman: A ...",
    "bookrating": "star-rating Three",
    "bookprice": "£17.93",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "The Boys in the ...",
    "bookrating": "star-rating Four",
    "bookprice": "£22.60",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "The Black Maria",
    "bookrating": "star-rating One",
    "bookprice": "£52.15",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Starving Hearts (Triangular Trade ...",
    "bookrating": "star-rating Two",
    "bookprice": "£13.99",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Shakespeare's Sonnets",
    "bookrating": "star-rating Four",
    "bookprice": "£20.66",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Set Me Free",
    "bookrating": "star-rating Five",
    "bookprice": "£17.46",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Scott Pilgrim's Precious Little ...",
    "bookrating": "star-rating Five",
    "bookprice": "£52.29",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Rip it Up and ...",
    "bookrating": "star-rating Five",
    "bookprice": "£35.02",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Our Band Could Be ...",
    "bookrating": "star-rating Three",
    "bookprice": "£57.25",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Olio",
    "bookrating": "star-rating One",
    "bookprice": "£23.88",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Mesaerion: The Best Science ...",
    "bookrating": "star-rating One",
    "bookprice": "£37.59",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Libertarianism for Beginners",
    "bookrating": "star-rating Two",
    "bookprice": "£51.33",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "It's Only the Himalayas",
    "bookrating": "star-rating Two",
    "bookprice": "£45.17",
    "bookavailability": "In stock"
  }
]