How To Use Scrapy Items

An Item in Scrapy is a logical grouping of extracted data points from a website that represents a real-world thing. You do not have to use Scrapy Items right away, as we saw in earlier Scrapy tutorials: you can simply yield page elements as plain dictionaries as they are extracted and do with the data as you wish. Items give the data you scrape a defined structure, and they let you clean that data with Item Loaders rather than directly in the Spider’s default parse() method. Let’s learn a little more about Scrapy Items here.

Define Scrapy Items

If we open items.py from the Scrapy Project, there will be some starter code to help us out. It might look something like the code below, though the class name depends on what you named your project.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BookstoscrapeItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    pass

To understand how to define the Items, let’s look at the original code of our Scrapy Project.

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get(),
                'bookrating': book.xpath('.//p').attrib['class'],
                'bookprice': book.xpath('.//div[2]/p/text()').get(),
                'bookavailability': book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get().strip()
            }

        # .attrib is an empty dict when nothing matches, so .get() avoids a KeyError on the last page
        next_page = response.css('.next a').attrib.get('href')
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

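We want to declare the four keys of the dictionary yielded in parse() (booktitle, bookrating, bookprice, and bookavailability) as fields of an Item in items.py. We can update items.py to reflect this by writing the following code.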

items.py

import scrapy


class BookstoscrapeItem(scrapy.Item):
    booktitle = scrapy.Field()
    bookrating = scrapy.Field()
    bookprice = scrapy.Field()
    bookavailability = scrapy.Field()

One point about declaring Items: declaring a field does not mean we must fill it in on every spider, or even use it at all. We can add whatever fields seem appropriate and can always correct them later if needed.

Update Spider Code

Defining those Fields on our Item class means we can use an object of that class to hold data in the Spider. The syntax is a little different than when not using Items, and the newly defined class needs to be imported into the Spider module. The code below shows how to import the new Item class, instantiate an item object from it, populate each field, then yield the populated item object.

import scrapy
from bookstoscrape.items import BookstoscrapeItem


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            # Create a fresh item for each book so yielded items do not share state
            item = BookstoscrapeItem()
            item['booktitle'] = book.xpath('.//a/text()').get()
            item['bookrating'] = book.xpath('.//p').attrib['class']
            item['bookprice'] = book.xpath('.//div[2]/p/text()').get()
            item['bookavailability'] = book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get().strip()

            yield item

        # .attrib is an empty dict when nothing matches, so .get() avoids a KeyError on the last page
        next_page = response.css('.next a').attrib.get('href')
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running The Spider

Running the spider now with scrapy crawl books produces exactly the same result as when we first began this project, before any Items were defined. So why do all this work to define Items if the program runs the same? By putting the Item class in place and using it, we unlock further Scrapy functionality, such as using Item Loaders to clean and sanitize the data before output or storage.