How To Follow Links With Python Scrapy

Following links during data extraction with Python Scrapy is pretty straightforward. The first thing we need to do is find the navigation links on the page. Often this is a link containing the text ‘Next’, but it may not always be. Then we need to construct an XPath or CSS selector query to get the value of the href attribute on the anchor element we need. Once that is in place, we can use Scrapy’s response.follow() method to automatically navigate to other pages on the website.


Find The Next Button

This example is using books.toscrape.com and we can see that on the main page there is a ‘Next’ button that links to the next page. This continues until all 50 pages are displayed.

[Screenshot: the ‘Next’ pagination button on books.toscrape.com]

Testing in the Scrapy Shell shows that response.css('.next a').attrib['href'] gives us the URL value we need.
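To make the extraction concrete without a live site, here is a stdlib-only sketch of what that CSS query does: find the href of the anchor inside the element carrying the class "next". The HTML snippet is a simplified stand-in for the site's pager markup, not a verbatim copy.

```python
from html.parser import HTMLParser

# Stdlib-only sketch of response.css('.next a').attrib['href']:
# locate the <a> inside the element with class "next" and grab its href.
class NextLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_next = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'next' in attrs.get('class', '').split():
            self.in_next = True
        elif tag == 'a' and self.in_next and self.href is None:
            self.href = attrs.get('href')

# Simplified stand-in for the books.toscrape.com pager markup
pager = '<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>'
parser = NextLinkParser()
parser.feed(pager)
print(parser.href)  # catalogue/page-2.html
```

In the real Spider, Scrapy's selector engine does all of this in a single expression.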


Implement response.follow()

Now, to give our Spider the ability to navigate to the next page, we can construct the code shown below. The first step is to extract the URL to visit from the page using the response.css('.next a').attrib['href'] selector and store the result in the next_page variable.

Once that is complete, we use an if statement to make sure that next_page holds a valid URL. If it does, we yield a call to response.follow() like so:

response.follow(next_page, callback=self.parse)

Notice the callback argument, which refers to the parse() method of this very Spider class. That tells Scrapy: scrape the current page, and when finished, follow the link to the next page and run parse() again on that new response. This process continues until no valid URL can be extracted from the current page. In other words, the last page has no anchor tag with the text ‘Next’ pointing to a further page. At that point, response.css('.next a') matches nothing, so .attrib['href'] actually raises a KeyError rather than returning None; Scrapy logs the exception (you can see it in the crawl stats below) and the Spider stops. A slightly cleaner selector is response.css('.next a::attr(href)').get(), which returns None on the last page and lets the if check handle it gracefully.
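One more convenience worth noting: response.follow() accepts relative URLs such as catalogue/page-2.html and resolves them against the URL of the page they were scraped from, much as urllib.parse.urljoin does. A quick stdlib illustration, with URLs chosen to match this site's layout:

```python
from urllib.parse import urljoin

# response.follow('page-3.html', ...) issued from page 2 resolves the
# relative href against the current page's URL, roughly like urljoin:
current_page = 'http://books.toscrape.com/catalogue/page-2.html'
next_href = 'page-3.html'
print(urljoin(current_page, next_href))
# http://books.toscrape.com/catalogue/page-3.html
```

This is why the Spider below can yield the raw href value without building the absolute URL itself.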

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get(),
                'bookrating': book.xpath('.//p').attrib['class'],
                'bookprice': book.xpath('.//div[2]/p/text()').get(),
                'bookavailability': book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get().strip()
            }

        # On the last page, '.next a' matches nothing and .attrib['href'] raises
        # a KeyError, which Scrapy logs before finishing the crawl;
        # response.css('.next a::attr(href)').get() would return None instead
        next_page = response.css('.next a').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running The Spider

This small change to our Scrapy Project has now put in place a method to recursively follow links until all pages are scraped. We can run the spider and output it to a JSON file.

bookstoscrape $ scrapy crawl books -o books.json

In the output of the Spider, we can see some impressive stats now. The spider shows that 1000 items have now been scraped in about 12 seconds. That’s the entire site, and we only added a few lines of code!

{'downloader/request_bytes': 15059,
 'downloader/request_count': 51,
 'downloader/request_method_count/GET': 51,
 'downloader/response_bytes': 291875,
 'downloader/response_count': 51,
 'downloader/response_status_count/200': 50,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 12.535962,
 'finish_reason': 'finished',
 'item_scraped_count': 1000,
 'log_count/DEBUG': 1051,
 'log_count/ERROR': 1,
 'log_count/INFO': 11,
 'request_depth_max': 49,
 'response_received_count': 51,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 50,
 'scheduler/dequeued/memory': 50,
 'scheduler/enqueued': 50,
 'scheduler/enqueued/memory': 50,
 'spider_exceptions/KeyError': 1,
}

We can inspect the generated books.json file in the Scrapy project and sure enough, it now has 1000 objects each having a title, rating, price, and availability attribute. Impressive!
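A quick way to sanity-check the export is to load it with Python's json module. Since the real books.json only exists after running the crawl, this sketch uses a two-item stand-in with the same shape (the sample values mirror actual listings on books.toscrape.com):

```python
import json

# Two-item stand-in for books.json; the real export holds 1000 such objects
sample = """[
  {"booktitle": "A Light in the Attic", "bookrating": "star-rating Three",
   "bookprice": "£51.77", "bookavailability": "In stock"},
  {"booktitle": "Tipping the Velvet", "bookrating": "star-rating One",
   "bookprice": "£53.74", "bookavailability": "In stock"}
]"""
books = json.loads(sample)
print(len(books))        # number of scraped objects
print(sorted(books[0]))  # the four attributes captured per book
```

Swap the string literal for open('books.json') to run the same check against the real export.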

Link Extractors

Scrapy also provides what are known as Link Extractors: objects that can automatically extract links from responses. They are typically used in CrawlSpiders, though they can also be used in regular Spiders like the one featured in this article. The syntax is different, but the same result can be achieved. The link-following code just above is rewritten here using a Link Extractor, and the result is the same.

import scrapy
from scrapy.linkextractors import LinkExtractor


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get(),
                'bookrating': book.xpath('.//p').attrib['class'],
                'bookprice': book.xpath('.//div[2]/p/text()').get(),
                'bookavailability': book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get().strip()
            }

        # extract_links() returns a list of Link objects; on the last page the
        # list is empty, so guard with a truthiness check before indexing
        links = LinkExtractor(restrict_css='.next a').extract_links(response)
        if links:
            yield response.follow(links[0], callback=self.parse)

How To Limit Number Of Followed Links

When this type of recursive program runs, it will keep going and going until a stop condition is met. You might not want that scenario on a very large site. You need a way to stop the spider from crawling new links in this situation and there are a couple of ways to do it.

CLOSESPIDER_PAGECOUNT
One option is to add a configuration value to settings.py, setting CLOSESPIDER_PAGECOUNT to 25.

# Scrapy settings for bookstoscrape project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bookstoscrape'

SPIDER_MODULES = ['bookstoscrape.spiders']
NEWSPIDER_MODULE = 'bookstoscrape.spiders'

CLOSESPIDER_PAGECOUNT = 25

Now when we run the spider, it stops itself after 25 pages have been scraped. You can do the same thing by limiting the number of items scraped: setting CLOSESPIDER_ITEMCOUNT = 100 stops the crawl automatically once 100 items have been retrieved. Keep these two configuration values in mind for the settings.py file when working with large data sets.
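Rather than editing settings.py, either limit can also be supplied for a single run with Scrapy's -s command-line option, which overrides a setting for that crawl only:

```shell
# One-off overrides; no change to settings.py needed
scrapy crawl books -s CLOSESPIDER_PAGECOUNT=25 -o books.json
scrapy crawl books -s CLOSESPIDER_ITEMCOUNT=100 -o books.json
```

This is handy for experimenting with limits before committing a value to the project settings.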

How To Follow Links With Python Scrapy Summary

There are several other ways to follow links in Python Scrapy, but the response.follow() method is likely the easiest to use, especially when first starting out. Other options for following links are the response.urljoin() method paired with scrapy.Request, and the LinkExtractor object.