Python Statistics Tutorial

Python’s standard library includes a statistics module that makes it simple to calculate common statistics such as the mean, median, mode, and standard deviation. Statistics like these summarize the data you are working with, whether that is a collection of grades, a sampling of prices for an item across many retailers, or the stock prices of various public companies. They allow individuals and organizations to make decisions based on what the data shows. In this tutorial, we’ll have a look at some of the basic statistical functions we can use in Python.

To begin working with statistics in Python, the first thing you want to do is import the statistics module.
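For reference, the import is a single line. The module is part of the standard library, so there is nothing extra to install.

```python
# The statistics module ships with Python's standard library.
import statistics
```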

Now we need some data to work with. A familiar and easy dataset to understand would be that of grades in school. We’ll set up a list of grades so we can test out all of these statistical methods on the data. All of the grades are now stored in a grades variable.
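A sketch of that list follows. The values are taken from the sorted output shown later in this tutorial; the order used here is an assumption, since the original unsorted order is not shown.

```python
# Nine grades. The values match the tutorial's sorted output,
# but this particular order is assumed for illustration.
grades = [90, 100, 85, 93, 77, 80, 75, 97, 88]

print(len(grades))  # 9
```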

Statistic Definitions

Before we get too far ahead, let’s examine the definitions of these statistics that we want to work with.

  • Mean: The average of a set of numbers. Add up all of the numbers in the set, and then divide that total by the number of numbers in the set to find the mean.
  • Median: The middle number, or midpoint of the data, when the numbers are listed in ascending order. To find the median, place the numbers in value order and find the middle number. If the set has an even count of numbers, the median is the average of the two middle numbers.
  • Mode: The mode is the value that occurs most often. If no number in the list is repeated, then there is no mode for the list.
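To make these definitions concrete, here is a rough sketch of the three calculations done by hand. The order of the grades is an assumption, and the variable names are only illustrative.

```python
from collections import Counter

grades = [90, 100, 85, 93, 77, 80, 75, 97, 88]  # order is assumed

# Mean: total divided by the count.
manual_mean = sum(grades) / len(grades)

# Median: middle value of the sorted data (average the two middle
# values when the count is even).
ordered = sorted(grades)
mid = len(ordered) // 2
if len(ordered) % 2 == 1:
    manual_median = ordered[mid]
else:
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the most frequent value. Every grade here is unique, so the
# highest count is 1 and, by the definition above, there is no mode.
value, count = Counter(grades).most_common(1)[0]

print(manual_mean)
print(manual_median)  # 88
print(count)          # 1
```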

In Python, we don’t have to manually calculate any of these! We simply use the functions provided by the statistics module and we are good to go.

Mean

We calculate the mean (average) of all the grades in our list with the mean() function.
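A minimal sketch of that call, assuming the grades list described earlier (its order is an assumption):

```python
import statistics

grades = [90, 100, 85, 93, 77, 80, 75, 97, 88]  # order is assumed

# mean() adds up the values and divides by how many there are.
print(f"The mean of all the grades is {statistics.mean(grades)}")
```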

The mean of all the grades is 87.22222222222223

Median

To calculate the median, or midpoint of the grades, we’ll use the median() function.
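A minimal sketch of that call, again assuming the same grades list (its order is an assumption):

```python
import statistics

grades = [90, 100, 85, 93, 77, 80, 75, 97, 88]  # order is assumed

# median() sorts the data internally and returns the middle value.
print(f"The median of all the grades is {statistics.median(grades)}")
```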

The median of all the grades is 88

We see that the median of our grades is 88. By looking at the original list, it is not easy to decide how that result came to be. Remember the median looks at the middle of the data when the list is sorted. Let’s sort our grades and have a look at the output.
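One way to do that is with the built-in sorted() function, which returns a new sorted list and leaves the original untouched. The unsorted order below is an assumption.

```python
grades = [90, 100, 85, 93, 77, 80, 75, 97, 88]  # order is assumed

# sorted() returns a new list in ascending order.
print(sorted(grades))
```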

[75, 77, 80, 85, 88, 90, 93, 97, 100]

The output above does show us that when grades is sorted, 88 is in fact right in the middle of the data. So the median function is working perfectly!

Mode

To demonstrate the mode function, first, we are going to update the list of grades. Recall, the mode is found by looking for the value that occurs most often in a set of data. Our original grades list had all unique values. We’ll change that here so we can test out the mode.
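The updated list itself is not shown here, so the one below is a reconstruction: the original nine grades plus 75, 75, and 93. Those extra values and their order are assumptions, chosen to agree with the mode and variance reported in this tutorial.

```python
# Reconstructed list (an assumption): the original nine grades
# with 75, 75, and 93 appended.
grades = [90, 100, 85, 93, 77, 80, 75, 97, 88, 75, 75, 93]

print(grades.count(75))  # 3
print(grades.count(93))  # 2
```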

Now we can calculate the mode with the mode() function.
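A minimal sketch of that call. The list is a reconstruction (the original nine grades plus 75, 75, and 93), so treat its exact contents and order as assumptions.

```python
import statistics

# Reconstructed list (an assumption), with 75 appearing three times.
grades = [90, 100, 85, 93, 77, 80, 75, 97, 88, 75, 75, 93]

# mode() returns the most common value in the data.
print(f"The mode of all the grades is {statistics.mode(grades)}")
```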

The mode of all the grades is 75

We see that the mode of all the grades is 75. If you look at the updated list of grades, you can easily see that 75 occurs three times, while all of the others appear only once or twice. So this is accurate: 75 is the mode of our grades.

Variance

Variance is another statistic we can take a look at. It measures how spread out the data is, based on the squared differences from the mean. Note that the variance() function computes the sample variance, which divides the sum of squared differences by n - 1 rather than n; the module’s pvariance() function computes the population variance, which is the plain average of the squared differences. In other words, how varied is the data? Does it vary a lot, with one grade of 20, another of 99, and another of 50, or are the grades all fairly close together?

Before even running the code, we can tell that our grades are fairly similar. So let’s try the variance function on our current list of grades, and then we will change the grades to get a different result. To help interpret the value we calculate: a variance of zero means that all of the data values are identical, and every non-zero variance is positive.
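A minimal sketch of the variance call. The list is a reconstruction (the original nine grades plus 75, 75, and 93), so its exact contents and order are assumptions; with these values the sample variance comes out at about 83.15.

```python
import statistics

# Reconstructed list (an assumption).
grades = [90, 100, 85, 93, 77, 80, 75, 97, 88, 75, 75, 93]

# variance() is the sample variance: squared differences from the
# mean, summed and divided by n - 1.
result = statistics.variance(grades)
print(f"The grades have a variance of {result}")
print(round(result, 2))  # 83.15
```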

The grades have a variance of 83.15151515151516

Ok, that is an interesting result. Let’s change the grades to all the same value to see what happens then.
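A sketch of that change follows. The value 90 and the count of six grades are hypothetical; any list of identical values behaves the same way.

```python
import statistics

# Hypothetical list: six identical grades.
grades = [90, 90, 90, 90, 90, 90]

# Identical values have no spread, so the variance is zero.
result = statistics.variance(grades)
print(f"The grades have a variance of {result}")
```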

The grades have a variance of 0

Sure enough, that gives us a variance of zero, since all of the grades are the same. They do not vary at all. Now we’ll add just one additional grade with a different value. Let’s see what happens.
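A sketch of that step, using hypothetical values: six grades of 90 plus a single 100. With six identical values and one value 10 points away, the sample variance works out to 100 / 7, or about 14.29.

```python
import statistics

# Hypothetical list: six identical grades and one that differs by 10.
grades = [90, 90, 90, 90, 90, 90, 100]

result = statistics.variance(grades)
print(f"The grades have a variance of {result}")
print(round(result, 2))  # 14.29
```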

The grades have a variance of 14.285714285714285

With just that one change to the data, we can see the variance jump fairly quickly. We’ll do one more example of variance.

The grades have a variance of 257.35714285714283

So that gives us a pretty good idea of how variance works in Python.

Standard Deviation

Standard deviation is used to show how much variation from the mean exists. You can think of it as a typical deviation from the mean. A low standard deviation means the values tend to be close to the mean. A high standard deviation means the values are spread out over a larger range.

Grades with a low standard deviation

The grades have a standard deviation of 2.9154759474226504

Grades with a high standard deviation

The grades have a standard deviation of 31.716377022424414
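The lists behind those two results are not shown, so here is a sketch of the same contrast using hypothetical lists. The numbers it prints will differ from the ones above; the point is only that tightly clustered grades give a small standard deviation and widely spread grades give a large one.

```python
import statistics

# Hypothetical lists, not the tutorial's original data.
close_grades = [88, 90, 91, 89, 92]     # tightly clustered
spread_grades = [20, 99, 50, 75, 100]   # widely spread

low = statistics.stdev(close_grades)
high = statistics.stdev(spread_grades)

print(f"Clustered grades have a standard deviation of {low}")
print(f"Spread-out grades have a standard deviation of {high}")
print(low < high)  # True
```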

Fun fact for the math geeks: the standard deviation is the square root of the variance. We didn’t have to do that manually, since the stdev() function took care of it for us. We can verify this, however, by taking the square root of the variance ourselves. We’ll use the same grades as just above, but compute the square root of the variance instead of calling stdev().
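A sketch of that check follows, using a hypothetical list rather than the tutorial's original data. Because both results are floating-point numbers, they can in principle differ in the last digit, so the comparison uses math.isclose() rather than ==.

```python
import math
import statistics

# Hypothetical list, not the tutorial's original data.
grades = [20, 99, 50, 75, 100]

via_stdev = statistics.stdev(grades)
via_variance = math.sqrt(statistics.variance(grades))

print(f"The grades have a standard deviation of {via_variance}")
print(math.isclose(via_stdev, via_variance))  # True
```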

The grades have a standard deviation of 31.716377022424414

Ah-ha! The result is the same. We can calculate the standard deviation by taking the square root of the variance, or we could take the easier route and make use of the stdev() function in Python.

Python Statistics Tutorial Summary

So that is a good beginner-level overview of statistics in Python. Python has many modules, libraries, and packages for much more intensive scientific and statistical computing. The concepts covered here are a good stepping stone to further study of statistics in Python.