Python Web Scraping With Beautiful Soup

Web scraping is a common technique for fetching data from the internet for use in other applications. With the almost limitless data available online, software developers have created many tools to compile that information efficiently. During web scraping, a program sends a request to a website, and an HTML document is sent back in response. Somewhere inside that document is the information you are interested in. To get at it quickly, the document has to be parsed: by parsing it, we can isolate the specific data points we care about. Common Python libraries for this work are Requests, which fetches the page, and Beautiful Soup and lxml, which parse it. In this tutorial, we’ll put these tools to work to learn how to implement web scraping using Python.


Install Web Scraping Code

To follow along, run these three commands from the terminal. It’s also recommended to use a virtual environment to keep things clean on your system.

  • pip install lxml
  • pip install requests
  • pip install beautifulsoup4
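
Once the three installs finish, a quick check like the one below can confirm that the packages are visible to the interpreter you plan to use. This is a small sketch using only the standard library; the variable names are ours, and a package that cannot be found is reported as missing rather than raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as they were passed to pip install.
packages = ["lxml", "requests", "beautifulsoup4"]

installed = {}
for name in packages:
    try:
        installed[name] = version(name)
    except PackageNotFoundError:
        installed[name] = None  # not installed in this environment

for name, found in installed.items():
    print(f"{name}: {found or 'missing'}")
```

If any line reports missing, the most likely cause is that pip installed into a different environment than the one running the script.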

Find A Website To Scrape

To learn how to do web scraping, we can practice on a website called http://quotes.toscrape.com/, which was made for just this purpose.

From this website, maybe we would like to create a data store of all the authors, tags, and quotes from the page. How could that be done? Well, first we can look at the source of the page. This is the data that is actually returned when a request is sent to the website. So in the Firefox web browser, we can right-click on the page and choose “view page source”.

This will display the raw HTML markup of the page.

The raw markup contains a lot of data that looks all mashed together. The purpose of web scraping is to access just the parts of the web page that we are interested in. Some software developers employ regular expressions for this task, and that can work for simple, predictable markup. The Python Beautiful Soup library is a much more user-friendly way to extract the information we want.


Building The Scraping Script

In PyCharm, we can add a new file that will hold the Python code to scrape our page.

scraper.py
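
A minimal sketch of what scraper.py might look like at this stage, following the steps described in the next paragraph. The helper name fetch_soup is ours, and the timeout and raise_for_status() call are defensive additions rather than something the walkthrough requires. The request is wrapped in a function so that nothing is fetched until you call it.

```python
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/"


def fetch_soup(page_url):
    """Fetch a page and return its HTML parsed with the lxml parser."""
    response = requests.get(page_url, timeout=10)  # timeout is our addition
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return BeautifulSoup(response.text, "lxml")


# Running the scrape performs a real network request, so it is shown
# here as a two-line call rather than executed automatically:
#     soup = fetch_soup(url)
#     print(soup)
```

Uncommenting the last two lines and running the file should print the parsed page source to the console.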

The code above is the beginning of our Python scraping script. At the top of the file, the first thing to do is import the requests and BeautifulSoup libraries. Then, we store the URL we want to scrape in the url variable. This is passed to the requests.get() function, and we assign the result to the response variable. We use the BeautifulSoup() constructor to parse the response text into the soup variable, naming lxml as the parser. Last, we print out the soup variable, and you should see the page’s HTML in the PyCharm console. Essentially, the software is visiting the website, reading the data, and viewing the source of the website much as we did manually above. The only difference is that this time around, all we had to do was click a button to see the output. Pretty neat!


Traversing HTML Structures

HTML stands for HyperText Markup Language, and it works by marking up the elements of a document with specific tags. HTML has many different tags, but a general layout involves three basic ones: an html tag, a head tag, and a body tag. These tags organize the HTML document. In our case, we’ll be mostly focused on the information within the body tag.

At this point, our script is able to fetch the HTML markup from our designated URL. The next step is to focus on the specific data we are interested in. If you use the inspector tool in your browser, it is fairly easy to see exactly what HTML markup is responsible for rendering a given piece of information on the page. As we hover the mouse pointer over a particular span tag, the associated text is automatically highlighted in the browser window. It turns out that every quote is inside a span tag that also has a class of text.

This is how you work out how to scrape data: you look for patterns on the page and then write code that works on that pattern. Have a play around and notice that this works no matter where you place the mouse pointer; each quote maps to its own piece of HTML markup. Web scraping makes it possible to easily fetch all similar sections of an HTML document. That’s pretty much all the HTML we need to know to scrape simple websites.
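
The three-tag layout described above can be seen directly from Beautiful Soup. This is a small sketch on a made-up document (the title and quote text are placeholders, not taken from the real page); it parses the string and steps from the html tag down into head and body.

```python
from bs4 import BeautifulSoup

# A made-up minimal document showing the html, head, and body tags.
html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <span class="text">A sample quote.</span>
  </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")

# Tags can be reached as attributes, one level at a time.
print(soup.html.name)        # name of the outermost tag
print(soup.head.title.text)  # text of the title inside head
print(soup.body.span.text)   # text of the first span inside body
```

Attribute-style access only returns the first matching tag, which is why the later steps switch to find_all() when every quote is needed.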

Parsing Html Markup

There is a lot of information in the HTML document, but Beautiful Soup makes it really easy to find the data we want, sometimes with just one line of code. So let’s go ahead and search for all span tags that have a class of text. This should find all the quotes for us. When you want to find every occurrence of a tag on the page, you can use the find_all() method.

scraper.py
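
A minimal sketch of the find_all() search. To keep it runnable without a network connection, it parses a short hand-written stand-in for the page source instead of the live response. The two quotes and authors come from the page, but the surrounding markup is simplified and should be treated as an assumption.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page source; the real page has ten quotes.
html = """
<div class="quote">
  <span class="text">“A day without sunshine is like, you know, night.”</span>
  <span>by <small class="author">Steve Martin</small></span>
</div>
<div class="quote">
  <span class="text">“Try not to become a man of success. Rather become a man of value.”</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""

soup = BeautifulSoup(html, "lxml")

# class_ has a trailing underscore because class is a reserved word in Python.
quotes = soup.find_all("span", class_="text")
print(quotes)
```

The plain span that wraps each author has no class, so it is not matched by the search.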

When the code above runs, the quotes variable is assigned a list of all the elements in the HTML document that are span tags with a class of text. Printing out the quotes variable shows each matching element in full: the entire HTML tag is captured along with its inner contents.

Beautiful Soup text property

The extra HTML markup returned by the script is not really what we are interested in. To get only the data we want, in this case the actual quotes, we can use the .text property that Beautiful Soup makes available on each element. Here we use a for loop to iterate over all of the captured elements and print out only their contents.

scraper.py
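
A sketch of the loop over .text, again run against a simplified hand-written stand-in for the page source (an assumption, so that the example works offline) rather than the live response.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page source; the real page has ten quotes.
html = """
<div class="quote">
  <span class="text">“A day without sunshine is like, you know, night.”</span>
</div>
<div class="quote">
  <span class="text">“Try not to become a man of success. Rather become a man of value.”</span>
</div>
"""

soup = BeautifulSoup(html, "lxml")
quotes = soup.find_all("span", class_="text")

# .text drops the surrounding tag and keeps only the inner text.
for quote in quotes:
    print(quote.text)
```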

This gives us a nice output with just the quotes we are interested in.

C:\python\vrequests\Scripts\python.exe C:/python/vrequests/scraper.py
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”

Process finished with exit code 0

Neat! To find all the authors and print them alongside their quotes, we can use the code below. Following the same steps as before, we first manually inspect the page we want to scrape. We can see that each author is contained inside a <small> tag with an author class. So we follow the same format as before with the find_all() function and store the result in a new authors variable. We also need to change the for loop to make use of the range() function so we can index into both the quotes and the authors at the same time.

scraper.py
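
A sketch of the quote and author pairing using range(), as described above. The markup is a simplified hand-written stand-in for the page source (an assumption, so the example runs offline); the quotes and authors themselves appear on the page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page source; the real page has ten quotes.
html = """
<div class="quote">
  <span class="text">“A day without sunshine is like, you know, night.”</span>
  <span>by <small class="author">Steve Martin</small></span>
</div>
<div class="quote">
  <span class="text">“Try not to become a man of success. Rather become a man of value.”</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
"""

soup = BeautifulSoup(html, "lxml")
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")

# The same index picks the quote and its author from the two lists.
for i in range(len(quotes)):
    print(quotes[i].text)
    print("--" + authors[i].text)
    print()
```

An equivalent loop can be written as for quote, author in zip(quotes, authors), which avoids the index and stops at the shorter list if the two ever differ in length.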

Now we get the quotes and each associated author when the script is run.

C:\python\vrequests\Scripts\python.exe C:/python/vrequests/scraper.py
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
--Albert Einstein

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
--J.K. Rowling

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
--Albert Einstein

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
--Jane Austen

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
--Marilyn Monroe

“Try not to become a man of success. Rather become a man of value.”
--Albert Einstein

“It is better to be hated for what you are than to be loved for what you are not.”
--André Gide

“I have not failed. I've just found 10,000 ways that won't work.”
--Thomas A. Edison

“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
--Eleanor Roosevelt

“A day without sunshine is like, you know, night.”
--Steve Martin


Process finished with exit code 0

Finally, we’ll add some code to fetch all the tags for each quote as well. This one is a little trickier because we first need to fetch the outer div that wraps each collection of tags. If we skipped this first step, we could still fetch all the tags, but we wouldn’t know how to associate them with a quote and author pair. Once the outer div is captured, we can drill down further by using the find_all() function again on *that* subset. From there, we add an inner loop inside the first loop to complete the process.
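
A sketch of the nested search. The class names tags (on the wrapping div) and tag (on each link) are assumptions about the page’s markup, as is the simplified stand-in HTML used here in place of the live response; confirm both in the browser inspector before relying on them.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page source, with assumed class names.
html = """
<div class="quote">
  <span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
  <span>by <small class="author">J.K. Rowling</small></span>
  <div class="tags">
    <a class="tag" href="#">abilities</a>
    <a class="tag" href="#">choices</a>
  </div>
</div>
<div class="quote">
  <span class="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
  <span>by <small class="author">André Gide</small></span>
  <div class="tags">
    <a class="tag" href="#">life</a>
    <a class="tag" href="#">love</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")
tag_groups = soup.find_all("div", class_="tags")  # one wrapper per quote

for i in range(len(quotes)):
    print(quotes[i].text)
    print("--" + authors[i].text)
    # Search only inside this quote's wrapper, not the whole document.
    for tag in tag_groups[i].find_all("a", class_="tag"):
        print(tag.text)
    print()
```

Calling find_all() on a single wrapper rather than on soup is what keeps each tag attached to the right quote.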

This code now gives us the following result. Pretty cool, right?!

C:\python\vrequests\Scripts\python.exe C:/python/vrequests/scraper.py
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
--Albert Einstein
change
deep-thoughts
thinking
world


“It is our choices, Harry, that show what we truly are, far more than our abilities.”
--J.K. Rowling
abilities
choices


“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
--Albert Einstein
inspirational
life
live
miracle
miracles


“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
--Jane Austen
aliteracy
books
classic
humor


“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
--Marilyn Monroe
be-yourself
inspirational


“Try not to become a man of success. Rather become a man of value.”
--Albert Einstein
adulthood
success
value


“It is better to be hated for what you are than to be loved for what you are not.”
--André Gide
life
love


“I have not failed. I've just found 10,000 ways that won't work.”
--Thomas A. Edison
edison
failure
inspirational
paraphrased


“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
--Eleanor Roosevelt
misattributed-eleanor-roosevelt


“A day without sunshine is like, you know, night.”
--Steve Martin
humor
obvious
simile



Process finished with exit code 0

Practice Web Scraping

Another great resource for learning how to web scrape can be found at https://scrapingclub.com. There are many tutorials there that cover how to use another Python web scraping package called Scrapy. In addition, there are several practice web pages for scraping that we can use. We can start with this URL: https://scrapingclub.com/exercise/list_basic/?page=1

We simply want to extract the item name and price from each entry and display them as a list. So step one is to examine the source of the page to determine how we can search the HTML. It looks like we have some Bootstrap classes we can search on, among other things.

With this knowledge, here is our Python script for this scrape.
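
A sketch of how such a script could be structured. The Bootstrap class names used here (col-lg-4 for each product column, card-title for the name) and the h4/h5 layout are assumptions, as is the hand-written stand-in markup; the live page may differ, so check the inspector and adjust the selectors. The helper name parse_items is ours.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one page of the listing.
html = """
<div class="row">
  <div class="col-lg-4 col-md-6 mb-4">
    <div class="card-body">
      <h4 class="card-title"><a href="#">Short Dress</a></h4>
      <h5>$24.99</h5>
    </div>
  </div>
  <div class="col-lg-4 col-md-6 mb-4">
    <div class="card-body">
      <h4 class="card-title"><a href="#">Patterned Slacks</a></h4>
      <h5>$29.99</h5>
    </div>
  </div>
</div>
"""


def parse_items(page_html):
    """Return (name, price) pairs for every product column on a page."""
    soup = BeautifulSoup(page_html, "lxml")
    items = []
    # Matching on one class is enough; the element may carry others too.
    for card in soup.find_all("div", class_="col-lg-4"):
        name = card.find("h4", class_="card-title").text.strip()
        price = card.find("h5").text.strip()
        items.append((name, price))
    return items


for count, (name, price) in enumerate(parse_items(html), start=1):
    print(f"{count}:  {price} for the {name}")
```

To run this against the live page, pass the text of a requests.get() response to parse_items() in place of the sample string.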

C:\python\vrequests\Scripts\python.exe C:/python/vrequests/scraper.py
1:  $24.99 for the Short Dress
2:  $29.99 for the Patterned Slacks
3:  $49.99 for the Short Chiffon Dress
4:  $59.99 for the Off-the-shoulder Dress
5:  $24.99 for the V-neck Top
6:  $49.99 for the Short Chiffon Dress
7:  $24.99 for the V-neck Top
8:  $24.99 for the V-neck Top
9:  $59.99 for the Short Lace Dress

Process finished with exit code 0

Web Scraping More Than One Page

The URL above is a single page of a paginated collection. We can see that by the page=1 in the URL. We can also set up a Beautiful Soup script to scrape more than one page at a time. Here is a script that scrapes all of the linked pages from the original page. Once all those URLs are captured, the script can issue a request to each individual page and parse out the results.

scraper.py
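
A sketch of the multi-page approach: collect the numbered pagination links from the first page, then fetch and parse each one. The class names (pagination, page-link, col-lg-4, card-title) are assumptions about the site’s markup, and the helper names are ours. The demonstration at the bottom runs against two tiny fake pages so that it works offline; fetch_html is the function that would be used for a live run.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START_URL = "https://scrapingclub.com/exercise/list_basic/?page=1"


def fetch_html(url):
    """Download one page and return its HTML text (live requests only)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_items(page_html):
    """Return (name, price) pairs for every product column on a page."""
    soup = BeautifulSoup(page_html, "lxml")
    return [
        (card.find("h4", class_="card-title").text.strip(),
         card.find("h5").text.strip())
        for card in soup.find_all("div", class_="col-lg-4")
    ]


def page_urls(base_url, page_html):
    """Collect absolute URLs for the numbered pagination links."""
    soup = BeautifulSoup(page_html, "lxml")
    pagination = soup.find("ul", class_="pagination")
    if pagination is None:
        return []
    urls = []
    for link in pagination.find_all("a", class_="page-link"):
        if link.text.strip().isdigit():  # skip links such as "Next"
            # urljoin replaces the query string instead of appending to it.
            urls.append(urljoin(base_url, link.get("href")))
    return urls


def scrape_all(start_url, fetch=fetch_html):
    """Scrape every page linked from the pagination on the start page."""
    items = []
    for url in page_urls(start_url, fetch(start_url)):
        items.extend(parse_items(fetch(url)))
    return items


# Offline demonstration: two fake pages stand in for the live site.
CARD = '<div class="col-lg-4"><h4 class="card-title">{}</h4><h5>{}</h5></div>'
NAV = ('<ul class="pagination">'
       '<li><a class="page-link" href="?page=1">1</a></li>'
       '<li><a class="page-link" href="?page=2">2</a></li>'
       '<li><a class="page-link" href="?page=2">Next</a></li></ul>')
fake_site = {
    "https://scrapingclub.com/exercise/list_basic/?page=1":
        CARD.format("Short Dress", "$24.99") + NAV,
    "https://scrapingclub.com/exercise/list_basic/?page=2":
        CARD.format("Patterned Slacks", "$29.99") + NAV,
}

items = scrape_all(START_URL, fetch=fake_site.__getitem__)
for count, (name, price) in enumerate(items, start=1):
    print(f"{count}:  {price} for the {name}")
```

Calling scrape_all(START_URL) with no fetch argument would issue real requests. Building each address with urljoin matters here: simply concatenating ?page=2 onto a URL that already ends in ?page=1 would produce a malformed query string.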

Running that script then scrapes all the pages in one go and outputs a large list like so.

C:\python\vrequests\Scripts\python.exe C:/python/vrequests/scraper.py
1:  $24.99 for the Short Dress
2:  $29.99 for the Patterned Slacks
3:  $49.99 for the Short Chiffon Dress
4:  $59.99 for the Off-the-shoulder Dress
5:  $24.99 for the V-neck Top
6:  $49.99 for the Short Chiffon Dress
7:  $24.99 for the V-neck Top
8:  $24.99 for the V-neck Top
9:  $59.99 for the Short Lace Dress
1:  $24.99 for the Short Dress
2:  $29.99 for the Patterned Slacks
3:  $49.99 for the Short Chiffon Dress
4:  $59.99 for the Off-the-shoulder Dress
5:  $24.99 for the V-neck Top
6:  $49.99 for the Short Chiffon Dress
7:  $24.99 for the V-neck Top
8:  $24.99 for the V-neck Top
9:  $59.99 for the Short Lace Dress
10:  $24.99 for the Short Dress
11:  $29.99 for the Patterned Slacks
12:  $49.99 for the Short Chiffon Dress
13:  $59.99 for the Off-the-shoulder Dress
14:  $24.99 for the V-neck Top
15:  $49.99 for the Short Chiffon Dress
16:  $24.99 for the V-neck Top
17:  $24.99 for the V-neck Top
18:  $59.99 for the Short Lace Dress
19:  $24.99 for the Short Dress
20:  $29.99 for the Patterned Slacks
21:  $49.99 for the Short Chiffon Dress
22:  $59.99 for the Off-the-shoulder Dress
23:  $24.99 for the V-neck Top
24:  $49.99 for the Short Chiffon Dress
25:  $24.99 for the V-neck Top
26:  $24.99 for the V-neck Top
27:  $59.99 for the Short Lace Dress
28:  $24.99 for the Short Dress
29:  $29.99 for the Patterned Slacks
30:  $49.99 for the Short Chiffon Dress
31:  $59.99 for the Off-the-shoulder Dress
32:  $24.99 for the V-neck Top
33:  $49.99 for the Short Chiffon Dress
34:  $24.99 for the V-neck Top
35:  $24.99 for the V-neck Top
36:  $59.99 for the Short Lace Dress
37:  $24.99 for the Short Dress
38:  $29.99 for the Patterned Slacks
39:  $49.99 for the Short Chiffon Dress
40:  $59.99 for the Off-the-shoulder Dress
41:  $24.99 for the V-neck Top
42:  $49.99 for the Short Chiffon Dress
43:  $24.99 for the V-neck Top
44:  $24.99 for the V-neck Top
45:  $59.99 for the Short Lace Dress
46:  $24.99 for the Short Dress
47:  $29.99 for the Patterned Slacks
48:  $49.99 for the Short Chiffon Dress
49:  $59.99 for the Off-the-shoulder Dress
50:  $24.99 for the V-neck Top
51:  $49.99 for the Short Chiffon Dress
52:  $24.99 for the V-neck Top
53:  $24.99 for the V-neck Top
54:  $59.99 for the Short Lace Dress

Process finished with exit code 0

Python Web Scraping With Beautiful Soup Summary

Beautiful Soup is one of a few available libraries built for Web Scraping using Python. It is very easy to get started with Beautiful Soup as we saw in this tutorial. Web scraping scripts can be used to gather and compile data from the internet for various types of data analysis projects, or whatever else your imagination comes up with.