Python Scrapy Shell Tutorial

Fetching and selecting data from websites with Python Scrapy can be tedious: you update the code, run it, and check whether you are getting the results you expect. Scrapy provides a way to make this process easier, called the Scrapy shell. You launch it from the terminal and use it to test the XPath or CSS selectors that you want to use in your Scrapy project. It's really neat, so let's take a look at it now.


Launch Scrapy Shell

python $ scrapy shell

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000187145AEA30>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x00000187145AE9A0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)        Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

Open Scrapy Shell With A URL

The goal is to work with the contents of a page in the Scrapy shell for testing. Scrapy gives you a shortcut to launch the shell while fetching a URL at the same time.

scrapy_testing $ scrapy shell https://scrapethissite.com/pages/

Now you can see the request and response right away in the Scrapy shell. Scrapy made a GET request to https://scrapethissite.com/pages/, and the request was successful, as the 200 status on the response object shows.

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000015474761190>
[s]   item       {}
[s]   request    <GET https://scrapethissite.com/pages/>
[s]   response   <200 https://scrapethissite.com/pages/>
[s]   settings   <scrapy.settings.Settings object at 0x0000015474761880>
[s]   spider     <TestingSpider 'testing' at 0x15474ba8f40>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)

Practicing XPath

Now comes the fun part. We have the page source in memory, and we can easily query the document for various elements and content using XPath. Let's first look at the navigation of the page in question. The source for the navigation is here.

<nav id="site-nav">
    <div class="container">
        <div class="col-md-12">
            <ul class="nav nav-tabs">
                <li id="nav-homepage">
                    <a href="/" class="nav-link hidden-sm hidden-xs">
                        <img src="/static/images/scraper-icon.png" id="nav-logo">
                        Scrape This Site
                    </a>
                </li>
                <li id="nav-sandbox">
                    <a href="/pages/" class="nav-link">
                        <i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                        Sandbox
                    </a>
                </li>
                <li id="nav-lessons">
                    <a href="/lessons/" class="nav-link">
                        <i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                        Lessons
                    </a>
                </li>
                <li id="nav-faq">
                    <a href="/faq/" class="nav-link">
                        <i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                        FAQ
                    </a>
                </li>
                
                <li id="nav-login" class="pull-right">
                    <a href="/login/" class="nav-link">
                        Login
                    </a>
                </li>
                
            </ul>
        </div>
    </div>
</nav>

The code snippet above is just a small part of the entire HTML markup on the source page. Selecting data and content on the page can be as broad or as focused as you like.


Query The Response

We can fetch the entire snippet of code above using XPath like so.

In [12]: response.xpath('//*[@id="site-nav"]')
Out[12]: [<Selector xpath='//*[@id="site-nav"]' data='<nav id="site-nav">\n            <div ...'>]

The above code is what is known as querying a response. Calling the .xpath() or .css() method on the response always returns a list of Selector objects (a SelectorList). The list holds one Selector when a single element is matched and several when multiple elements are matched, as the two examples that follow show.

Single Selector Object

In [9]: response.xpath('//li[1]/a')
Out[9]: [<Selector xpath='//li[1]/a' data='<a href="/" class="nav-link hidden-sm...'>]

List of Selector Objects

In [7]: response.xpath('//li/a')
Out[7]: 
[<Selector xpath='//li/a' data='<a href="/" class="nav-link hidden-sm...'>,
 <Selector xpath='//li/a' data='<a href="/pages/" class="nav-link">\n ...'>,
 <Selector xpath='//li/a' data='<a href="/lessons/" class="nav-link">...'>,
 <Selector xpath='//li/a' data='<a href="/faq/" class="nav-link">\n   ...'>,
 <Selector xpath='//li/a' data='<a href="/login/" class="nav-link">\n ...'>]

Selector Methods

Once you have a Selector object, you can use various methods to extract the data out of the selector. You’ll be using methods like .get(), .getall(), .re_first(), and .re(). You can also use the .attrib property to read the values of attributes contained in the source.


.get() vs .getall()

These are the two most commonly used methods on the selector object. The .get() method extracts the content of the first match only, even if the .xpath() or .css() query returned more than one selector, while .getall() extracts every match as a list of strings. As an example, we know that the XPath query '//li/a' returns several selector objects. Watch the difference between .get() and .getall() in this scenario.

.get()

In [14]: response.xpath('//li/a').get()
Out[14]: '<a href="/" class="nav-link hidden-sm hidden-xs">\n                                <img src="/static/images/scraper-icon.png" id="nav-logo">\n                                Scrape This Site\n                            </a>'

.getall()

In [15]: response.xpath('//li/a').getall()
Out[15]: 
['<a href="/" class="nav-link hidden-sm hidden-xs">\n                                <img src="/static/images/scraper-icon.png" id="nav-logo">\n                                Scrape This Site\n                            </a>',
 '<a href="/pages/" class="nav-link">\n                                <i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>\n                                Sandbox\n                            </a>',
 '<a href="/lessons/" class="nav-link">\n                                <i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>\n                                Lessons\n                            </a>',
 '<a href="/faq/" class="nav-link">\n                                <i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>\n                                FAQ\n                            </a>',
 '<a href="/login/" class="nav-link">\n                                Login\n                            </a>']

.re()

The .re() method can be used for extracting data using regular expressions.

In [18]: response.xpath('//li/a').re(r'[A-Z][a-z]*')
Out[18]: ['Scrape', 'This', 'Site', 'Sandbox', 'Lessons', 'F', 'A', 'Q', 'Login']

.re_first()

The .re_first() method does the same as .re() except it returns only the first regular expression match.

In [19]: response.xpath('//li/a').re_first(r'[A-Z][a-z]*')
Out[19]: 'Scrape'
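The reason 'FAQ' comes back as three separate letters is the pattern itself: one uppercase letter followed by zero or more lowercase letters. The sketch below uses Python's standard re module on a plain string (not the Selector methods) just to show how the pattern behaves. The sample text is made up from the link labels.

```python
import re

# Link labels from the navigation, joined into one plain string.
text = "Scrape This Site Sandbox Lessons FAQ Login"

# Each capital in "FAQ" starts a new match because only lowercase
# letters are allowed to follow the first character.
split_acronym = re.findall(r'[A-Z][a-z]*', text)

# Allowing uppercase letters to follow keeps "FAQ" in one piece.
kept_acronym = re.findall(r'[A-Z][A-Za-z]*', text)

print(split_acronym)
print(kept_acronym)
```

On real markup the looser pattern would also pick up any capitalized text inside tags or attributes, so it is worth narrowing the query (for example to text nodes) before applying a regular expression.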

Selecting Specific Elements

By combining .css() queries, .xpath() queries, and .get() or .getall(), you can pull out any piece of the page you like. Here is an example of getting each individual link using XPath position predicates.

In [2]: response.xpath('//li[1]/a')
Out[2]: [<Selector xpath='//li[1]/a' data='<a href="/" class="nav-link hidden-sm...'>]

In [3]: response.xpath('//li[2]/a')
Out[3]: [<Selector xpath='//li[2]/a' data='<a href="/pages/" class="nav-link">\n ...'>]

In [4]: response.xpath('//li[3]/a')
Out[4]: [<Selector xpath='//li[3]/a' data='<a href="/lessons/" class="nav-link">...'>]

In [5]: response.xpath('//li[4]/a')
Out[5]: [<Selector xpath='//li[4]/a' data='<a href="/faq/" class="nav-link">\n   ...'>]

On this page the same elements can be reached with Python list indexing on the query result, rather than with XPath itself. Note that Python indexes start at 0, while XPath positions start at 1.

In [11]: response.xpath('//li/a')[0]
Out[11]: <Selector xpath='//li/a' data='<a href="/" class="nav-link hidden-sm...'>

In [12]: response.xpath('//li/a')[1]
Out[12]: <Selector xpath='//li/a' data='<a href="/pages/" class="nav-link">\n ...'>

In [13]: response.xpath('//li/a')[2]
Out[13]: <Selector xpath='//li/a' data='<a href="/lessons/" class="nav-link">...'>

In [14]: response.xpath('//li/a')[3]
Out[14]: <Selector xpath='//li/a' data='<a href="/faq/" class="nav-link">\n   ...'>

Removing HTML Markup With text()

During web scraping, it's usually not the markup that you're interested in but the content inside the markup tags. When constructing XPath queries, you can use the text() node test to get at that content. Every item in the DOM is a node, even text, and text() selects the text nodes that are children of the matched element. Let's see some examples.

In [11]: response.xpath('//h3/a/text()')
Out[11]: 
[<Selector xpath='//h3/a/text()' data='Countries of the World: A Simple Example'>,
 <Selector xpath='//h3/a/text()' data='Hockey Teams: Forms, Searching and Pa...'>,
 <Selector xpath='//h3/a/text()' data='Oscar Winning Films: AJAX and Javascript'>,
 <Selector xpath='//h3/a/text()' data='Turtles All the Way Down: Frames & iF...'>,
 <Selector xpath='//h3/a/text()' data='Advanced Topics: Real World Challenge...'>]
In [12]: response.xpath('//h3/a/text()').get()
Out[12]: 'Countries of the World: A Simple Example'
In [13]: response.xpath('//h3/a/text()').getall()
Out[13]: 
['Countries of the World: A Simple Example',
 'Hockey Teams: Forms, Searching and Pagination',
 'Oscar Winning Films: AJAX and Javascript',
 'Turtles All the Way Down: Frames & iFrames',
 "Advanced Topics: Real World Challenges You'll Encounter"]

Dealing With Whitespace and Newlines

Many times the markup on a webpage is not pretty. It renders nicely because the browser collapses extra whitespace and newline characters, but when you are web scraping, those irregularities in the markup come right through. For example, look at this markup.

[Image: HTML markup with whitespace and newlines]

During the XPath query, all of those whitespaces and newlines come right through.

In [18]: response.xpath('//div/p/text()').get()
Out[18]: '\n                                A single page that lists information about all the countries in the world. Good for those just get started with web scraping.\n

You can chain Python's strip() string method onto the result to remove the leading and trailing whitespace.

In [19]: response.xpath('//div/p/text()').get().strip()
Out[19]: 'A single page that lists information about all the countries in the world. Good for those just get started with web scraping.'
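Two caveats apply to chaining strip() directly: .get() returns None when nothing matches, so the call would raise an AttributeError, and strip() leaves any whitespace in the middle of the text untouched. A small helper can cover both cases. This is only a sketch, and the clean() name and the sample string are made up for illustration.

```python
def clean(text):
    """Collapse runs of whitespace and trim both ends; treat None as ''."""
    if text is None:
        return ''
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, and drops leading and trailing whitespace.
    return ' '.join(text.split())


# Hypothetical raw value, similar to what a text() query might return.
raw = '\n          A single page that lists\n   information about all the countries.\n      '

print(clean(raw))
print(repr(clean(None)))
```

In the shell this could be used as clean(response.xpath('//div/p/text()').get()).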

Looping In The Shell

Even in the Scrapy shell, you can loop over response data.

In [25]: for i in response.xpath('//div/p/text()'):
    ...:     print(i.get().lstrip())
    ...: 
A single page that lists information about all the countries in the world. Good for those just get started with web scraping.

Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.

Click through a bunch of great films. Learn how content is added to the page asynchronously with Javascript and how you can scrape it.

Some older sites might still use frames to break up thier pages. Modern ones might be using iFrames to expose data. Learn about turtles as you scrape content inside frames.

Scraping real websites, you're likely run into a number of common gotchas. Get practice with spoofing headers, handling logins & session cookies, finding CSRF tokens, and other common network errors.

Changing The Working Response

You can change the page you are testing in the Scrapy shell by fetching a new page with the fetch() shortcut, which replaces the current request and response objects. Let's change the response we want to query to something else.

In [3]: fetch('https://yahoo.com')
2021-03-30 17:05:12 [scrapy.core.engine] INFO: Spider opened
2021-03-30 17:05:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yahoo.com/> from <GET https://yahoo.com>
2021-03-30 17:05:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yahoo.com/> (referer: None)

In [4]: response
Out[4]: <200 https://www.yahoo.com/>

Now we can query the text of all of the paragraph elements on the new page using XPath.

In [8]: response.xpath('//p/text()')
Out[8]: 
[<Selector xpath='//p/text()' data="New York state's highest court cleare...">,
 <Selector xpath='//p/text()' data='Trump may have to be questioned\xa0»'>,
 <Selector xpath='//p/text()' data='“What do you mean you just killed you...'>,
 <Selector xpath='//p/text()' data='Thanks to Connelly, the "Career Oppor...'>,
 <Selector xpath='//p/text()' data="Two former Texas sheriff's deputies w...">,
 <Selector xpath='//p/text()' data='When the cat first walked into the cl...'>,
 <Selector xpath='//p/text()' data='When former President Donald Trump wa...'>,
 <Selector xpath='//p/text()' data="Nobody was buying this father's side ...">,
 <Selector xpath='//p/text()' data='Something major happens late in the d...'>]

Python Scrapy Shell Tutorial Summary

The Scrapy shell is a fun test environment where you can try out and debug your scraping code very quickly, without having to run the spider. Its purpose is testing data extraction code, but because it doubles as a standard Python shell, you can also use it to test any kind of Python code.

The Scrapy shell is perfect for testing your XPath or CSS expressions to see how they work and what data they extract from the web pages you’re trying to scrape. It is a great way to interactively test your expressions while you’re writing your spider, without having to run the spider to test every change.

After some practice, you’ll find the Scrapy shell to be a wonderful tool for developing and debugging your spiders.