How To Use Scrapy Item Loaders

The Python Scrapy framework has a concept known as Item Loaders. Item Loaders populate Scrapy Items once those Items have been defined, and during that process you can apply input processors and output processors that clean up the extracted data in various ways. With the ItemLoader class and a few small but useful functions, you can strip unwanted characters, trim whitespace, or otherwise modify the collected data however you see fit. Let’s look at a Scrapy Item Loader example now.
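
Before wiring processors into a project, it can help to see what they do in isolation. Here is a minimal sketch, runnable in a plain Python shell, assuming the itemloaders package that ships with Scrapy is installed:

from itemloaders.processors import MapCompose, TakeFirst

# MapCompose passes every extracted value through each function, in order
clean = MapCompose(str.strip, str.title)
print(clean([' a light in the attic ', ' sharp objects ']))
# ['A Light In The Attic', 'Sharp Objects']

# TakeFirst returns the first non-null, non-empty value
first = TakeFirst()
print(first(['', None, 'In stock']))
# 'In stock'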

Add Input/Output Processors

To use an Item Loader, first open the items.py file in your Scrapy project. The items.py file is where you import the processors the loader will apply, and it is through these processors that the Item Loader changes the data as it passes through the Scrapy process. Here is our updated items.py file.

import scrapy
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags


def remove_whitespace(value):
    # Strip leading and trailing whitespace, including newlines
    return value.strip()


def to_dollars(value):
    # Swap the pound sign for a dollar sign (symbol only, no conversion)
    return value.replace('£', '$')


class BookstoscrapeItem(scrapy.Item):
    # Input processors run on every extracted value as it is added;
    # the output processor runs once when load_item() is called.
    booktitle = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst()
    )
    bookrating = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst()
    )
    bookprice = scrapy.Field(
        input_processor=MapCompose(remove_tags, to_dollars),
        output_processor=TakeFirst()
    )
    bookavailability = scrapy.Field(
        input_processor=MapCompose(remove_tags, remove_whitespace),
        output_processor=TakeFirst()
    )
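
As a quick sanity check, the processor chain for bookprice can be exercised directly, for example in a Python shell with the definitions above imported. The HTML snippet here is a hypothetical raw value, standing in for what a selector might hand to the loader:

raw = ['<p class="price_color">£51.77</p>']
values = MapCompose(remove_tags, to_dollars)(raw)  # input processing
print(TakeFirst()(values))                         # output processing: $51.77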

Item Loader In Spider

For the scraped data to actually pass through the input and output processors we just set up, the ItemLoader class must be imported into the main Spider. Inside the Spider, we use the loader's add_xpath() and load_item() methods to send data through those processors. The important changes are shown here in books.py.

import scrapy
from bookstoscrape.items import BookstoscrapeItem
from scrapy.loader import ItemLoader


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each <article> element on the page represents one book
        for book in response.xpath('//article'):
            l = ItemLoader(item=BookstoscrapeItem(), selector=book)
            l.add_xpath('booktitle', './/a/text()')
            l.add_xpath('bookrating', './/p/@class')
            l.add_xpath('bookprice', './/div[2]/p/text()')
            l.add_xpath('bookavailability', './/div[2]/p[2]/i/following-sibling::text()')

            # load_item() runs the output processors and returns the populated Item
            yield l.load_item()

        # ::attr(href) with .get() returns None on the last page, so the guard
        # below actually ends the crawl (.attrib['href'] would raise a KeyError
        # there instead)
        next_page = response.css('.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
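
The loader is not limited to add_xpath(); ItemLoader also provides add_css() and add_value(), and the three can be mixed freely. Below is a sketch of an equivalent loop body; the CSS selectors are assumptions about the page's markup rather than selectors taken from the tutorial. On books.toscrape.com the anchor text itself is truncated (hence "A Light in the ..." in the output below), so reading the a element's title attribute is one way to capture the full title:

l = ItemLoader(item=BookstoscrapeItem(), selector=book)
l.add_css('booktitle', 'h3 a::attr(title)')    # full title from the attribute
l.add_css('bookprice', '.price_color::text')   # assumed class name
l.add_value('bookavailability', 'In stock')    # literal values still run through processors
yield l.load_item()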

Running The Spider
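
One way to run the spider from the project directory is the crawl command with a feed export; the output file name here is just an example:

scrapy crawl books -o books.json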

When the spider runs, the data comes out processed just as we wanted. HTML tags have been removed, whitespace and newlines stripped, and the pound symbol swapped for a dollar sign (remember that to_dollars changes only the symbol, not the amount).

[
  {
    "booktitle": "A Light in the ...",
    "bookrating": "star-rating Three",
    "bookprice": "$51.77",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Tipping the Velvet",
    "bookrating": "star-rating One",
    "bookprice": "$53.74",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Soumission",
    "bookrating": "star-rating One",
    "bookprice": "$50.10",
    "bookavailability": "In stock"
  },
  {
    "booktitle": "Sharp Objects",
    "bookrating": "star-rating Four",
    "bookprice": "$47.82",
    "bookavailability": "In stock"
  }
]

How To Use Scrapy Item Loaders Summary

The syntax of the Item Loader approach makes the parse() method in the main Spider cleaner and easier to read. It states the spider's intent plainly, which keeps the code maintainable and self-documenting. Item Loaders offer many interesting ways of combining data, formatting it, and cleaning it up: they pass values from XPath/CSS expressions through processor classes, which are simple, fast functions. There are many processors to choose from; we looked at MapCompose and TakeFirst here, and we also wrote our own custom functions to clean up data as we see fit. The key concept to take away is that processors are just small, simple functions that post-process our XPath/CSS results.
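
If the per-field arguments in items.py ever feel repetitive, the processors can also be centralized in an ItemLoader subclass. The BookLoader name below is hypothetical, but default_input_processor and default_output_processor are standard ItemLoader class attributes that apply to any field without its own processors:

from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags


class BookLoader(ItemLoader):
    # Used by every field that does not declare its own processors
    default_input_processor = MapCompose(remove_tags, str.strip)
    default_output_processor = TakeFirst()

The spider would then construct BookLoader(item=BookstoscrapeItem(), selector=book), and the Field definitions could shrink accordingly.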