Python Urllib


Python's standard library includes modules that make working with internet data easy, and the urllib package is one of them. It can be used to fetch data from the internet and to perform common processing tasks. Inside urllib is the request module, which is for opening and reading URLs. The error module is available for dealing with errors that may come up, the parse module facilitates the parsing of URL structures, and there is also a robotparser module for working with the robots.txt files you might find on a web server. In this tutorial, we'll take a look at some of these modules in the urllib package.


How To Fetch Data

To begin, we can set up a virtual environment in Python with the virtualenv . command in the directory of our choice. Don't forget to activate the virtual environment with source ./Scripts/activate. Our virtual environment is named vurllib (meaning virtualized urllib), and our prompt now reads (vurllib) vurllib $, indicating our environment is ready.

Now let’s open the project in PyCharm and add a new file to try out some urllib examples.


Importing urllib

Before we can use the software inside of the urllib package, we need to import it. Let’s use the following line of code to import the request module of the urllib package.

urllib_examples.py
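The original listing isn't reproduced here, but the import is a single line; a minimal sketch of the file at this point:

```python
# urllib_examples.py
# Import the request module from the urllib package
import urllib.request
```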

This gives us access to the functions we’ll be testing in a bit. But first, we need some external URLs to work with.

httpbin to the rescue

Httpbin is an amazing web service for testing HTTP libraries. It has several great endpoints that can test pretty much everything you need in an HTTP library. Check it out at https://httpbin.org


Set Url and Fetch Data

Now we can specify a URL to work with, storing it in the url variable. To make the request, we pass that variable to the urlopen() function. The response is then stored in the result variable.
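A sketch of that step, assuming httpbin's /xml test endpoint (which matches the XML output shown later in this section):

```python
import urllib.request

# The URL to fetch -- httpbin's XML test endpoint (assumed here)
url = 'http://httpbin.org/xml'

# Make the request; the response object is stored in result
result = urllib.request.urlopen(url)

# Print the HTTP status code of the response
print('Result code: {0}'.format(result.status))
```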

Checking Http Response Code

HTTP response codes tell us whether a specific HTTP request has been successfully completed or not. These responses are grouped into five different classes.

  • Informational responses (100–199)
  • Successful responses (200–299)
  • Redirects (300–399)
  • Client errors (400–499)
  • Server errors (500–599)

When we run the code above, we see a 200 OK status code, which means everything went well!


Http Response Headers

The response from a server also includes HTTP headers. These are pieces of text information that a web server sends back in response to receiving an HTTP request. The response headers contain various types of information, and we can inspect that information using the getheaders() function.
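A sketch of the header inspection, reusing the same request (the /xml endpoint is assumed):

```python
import urllib.request

url = 'http://httpbin.org/xml'
result = urllib.request.urlopen(url)

# getheaders() returns all response headers as a list of (name, value) tuples
print(result.getheaders())

# getheader() returns a single header value by name
print(result.getheader('Content-Type'))
```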

Result

[('Date', 'Mon, 09 Mar 2020 16:05:38 GMT'), ('Content-Type', 'application/xml'),
 ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'),
 ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]

Above, we can see the header information that the server sent back as a result of calling the getheaders() function, which returns the headers as a list of tuples. If you want just a single header value, you can use the getheader() function instead. Here we have values for Date, Content-Type, Content-Length, Connection, Server, Access-Control-Allow-Origin, and Access-Control-Allow-Credentials. Interesting!

Reading Response Data

Now we need to read the actual returned data, or payload, contained within the HTTP response. To do so, we can use the read() and decode() functions like so.
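A sketch of that code, again assuming the /xml endpoint and a UTF-8 decode (which covers the us-ascii content httpbin sends):

```python
import urllib.request

url = 'http://httpbin.org/xml'
result = urllib.request.urlopen(url)

# read() returns the payload as bytes; decode() converts it to a string
data = result.read().decode('utf-8')

print('Returned data: ---------------------')
print(data)
```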

Result

Returned data: ---------------------
<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>

We can visit the same URL right in the web browser to see how it renders this data as well.



GET and POST with urllib

In the section above, we saw how to use urllib to fetch data from a web service. Now we want to see how to send information to web servers. Most commonly, this is done with either a GET or a POST HTTP request. A GET request uses parameters encoded directly into the URL, which is a pretty common way of issuing a query to a web service like a Bing search. If you are trying to create or update something on the web server, then you will usually be leveraging a POST HTTP request. There are other HTTP methods to learn, like PUT, PATCH, and DELETE, but GET and POST will be sufficient most of the time, and those two are what we’ll test here.


Request to GET endpoint

In the code below, we start by again setting a simple URL of http://httpbin.org/get. Then we read the HTTP status code and the returned data using read() and decode().
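A sketch of that request against the GET endpoint named above:

```python
import urllib.request

url = 'http://httpbin.org/get'
result = urllib.request.urlopen(url)

print('Result code: {0}'.format(result.status))

# Read and decode the JSON payload httpbin echoes back
data = result.read().decode('utf-8')
print('Returned data: ----------------------')
print(data)
```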

Result

C:\python\vurllib\Scripts\python.exe C:/python/vurllib/urllib_examples.py
Result code: 200
Returned data: ----------------------
{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.8", 
    "X-Amzn-Trace-Id": "Root=1-5e667d77-8282fd705e85709035d2c830"
  }, 
  "origin": "127.0.0.1", 
  "url": "http://httpbin.org/get"
}

Notice that the args key is empty in the response. That means we didn’t send any data along with the request. We can do that, however, and this is what we will do next.

Creating an args payload

To pass data in the payload, we can use a simple Python dictionary with some example data. That data first needs to be URL-encoded with the urlencode() function; the result of that operation is stored in the data variable. Finally, we make the request with the urlopen() function, appending the encoded data to the URL after a question mark character.
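A sketch of those steps; the example payload values are chosen to match the form data shown later in the POST result:

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/get'

# Example payload dictionary
args = {'color': 'Blue', 'shape': 'Circle', 'is_active': True}

# urlencode() turns the dictionary into a query string like color=Blue&shape=...
data = urllib.parse.urlencode(args)

# For a GET request, the encoded data is appended to the URL after a '?'
result = urllib.request.urlopen(url + '?' + data)

print('Result code: {0}'.format(result.status))
print(result.read().decode('utf-8'))
```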

Result

By looking at the result above, we notice two new things. The args key is now populated with the payload data we are interested in. Additionally, notice the url has all of the data encoded right into it. This is how a GET request works.

Making POST Request

POST works differently than GET does. The same args dictionary can still be used as a payload, but it needs to be encoded into bytes before making the POST request. This is done using the encode() function, one of the built-in string methods in Python, which defaults to UTF-8. For the POST request, we do not add the parameters to the URL. Instead, we use the data parameter of the urlopen() function. By passing the data directly to urlopen(), urllib automatically switches over to the POST method behind the scenes. There’s no need to tell urllib to use POST rather than GET.
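A sketch of the POST version, reusing the same example payload:

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'

args = {'color': 'Blue', 'shape': 'Circle', 'is_active': True}

# urlencode() builds the query string; encode() converts it to bytes (UTF-8 by default)
data = urllib.parse.urlencode(args).encode()

# Supplying the data parameter makes urllib send a POST instead of a GET
result = urllib.request.urlopen(url, data=data)

print('Result code: {0}'.format(result.status))
body = result.read().decode('utf-8')
print('Returned data: ----------------------')
print(body)
```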

Result

C:\python\vurllib\Scripts\python.exe C:/python/vurllib/urllib_examples.py
Result code: 200
Returned data: ----------------------
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "color": "Blue", 
    "is_active": "True", 
    "shape": "Circle"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "38", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.8", 
    "X-Amzn-Trace-Id": "Root=1-5e6683a5-777d0378401b31982e213810"
  }, 
  "json": null, 
  "origin": "127.0.0.1", 
  "url": "http://httpbin.org/post"
}

Can you spot the differences in the response we get from httpbin? That’s right, the payload data is now inside of the form key rather than args. Also, note that the url key does not have any data embedded in it. So we can see the distinction between GET and POST and how they differ with regard to carrying payload data.


Errors With urllib

Handling errors is not always the most fun thing to do, but it is needed. The web is inherently error-prone, so programs that make HTTP requests should be prepared for those situations. You might run into a problem where an HTTP error code is the response from a server. Or perhaps the URL you try to fetch data from no longer exists. Then again, there could be a network problem that causes the request to time out. Any number of things may lead to problems for the program. To mitigate these scenarios, you can wrap HTTP requests inside of a try-except block in Python. Here are a few examples of how to do that.

This first example actually has no errors, and it works great. We are using urllib to fetch the URL https://httpbin.org/html, which holds some text from the novel Moby Dick by Herman Melville. We can see this result right inside of PyCharm.


What if we change the code to fetch an invalid URL instead?
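The exact invalid URL from the original listing isn't shown, so this sketch uses a hypothetical unresolvable hostname:

```python
import urllib.request
import urllib.error

# Hypothetical invalid URL -- the original article's exact value isn't shown
url = 'http://no-such-host.invalid/html'

try:
    result = urllib.request.urlopen(url)
    print(result.read().decode('utf-8'))
except urllib.error.URLError as e:
    # URLError covers failures below the HTTP layer, such as a hostname
    # that cannot be resolved or an unreachable network
    print('Failed to reach the server.')
    print('Reason: {0}'.format(e.reason))
```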

This time, the result is quite different. Our except block handles the error gracefully and shows a user-friendly error.


Httpbin also provides a way to check for 404 status codes. We can test that error condition like so and note that we get a different error now.


Some urllib shortcomings

The urllib module is fairly easy to use, but it does have some drawbacks when compared to other libraries. One shortcoming is that urlopen() only issues GET and POST requests directly; verbs like PUT, PATCH, and DELETE are less commonly used, but it is convenient when an HTTP library makes them easy to send. A second shortcoming is that urllib does not automatically decode the returned data for you. If you’re writing an application that has to deal with unknown data sources or several encodings, that becomes cumbersome to work with. There are no built-in urllib features for working with cookies, authentication, or sessions. Working with JSON responses is a bit tough, and timeouts are tricky to deal with. An alternative to urllib we can try is Python Requests.


Python Urllib Summary

In this tutorial, we learned a little bit about fetching internet data in Python using urllib, which is part of the Python standard library. To access a URL with urllib, you can use the urlopen() function, which is part of urllib.request. Data returned from the request to the server needs to be transformed using the decode() function. To specify a POST request when you use the urlopen() function, all you need to do is include the data parameter, and urllib changes the HTTP verb under the hood. We also saw a few examples of HTTPError and URLError and how to process them. Next up, we will learn about the Python Requests library.