How To Use Python With Excel

How To Use Python With Excel

Microsoft Excel is arguably the most popular business software on the planet, with almost every single company making use of it in one way or another. Proficiency in Excel is a must-have skill in many jobs. Here is the rub. Managing data in Excel manually is often quite repetitive and tedious. Wouldn’t it be great if you could still leverage the power of Microsoft Excel while reducing the repetition? Enter the Python Programming Language. Just as Excel is one of the most popular computer applications, Python is one of the most popular programming languages. Once you have learned basic Python, you can integrate Python and Excel to streamline your workflow, lighten your workload, accomplish more in less time, and generally just be more effective. In this tutorial, we are going to examine how to get started using Python and Exel together.


How To Load And Save Data Using Pandas

Pandas is the popular open-source library written in Python explicitly designed for data analysis and manipulation. Pandas can work directly with Excel files by way of something called a dataframe. A dataframe is like a spreadsheet in the memory of your Python application. A dataframe has indexed rows and header columns that both store what’s called a series. The series stores all the values in the rows and columns, almost like the list. Because both these objects are iterable, Pandas makes it easy to traverse and pick up the data. Using the dataframe, we can do any kind of analysis or grouping we’d want and export it to Excel. Cool!

Install Pandas

To get started with Pandas, make sure you have it installed. The pip install pandas command should do the trick.

pip install pandas

While we’re at it, we should also run pip install openpyxl and pip install xlrdsince we will be working with these modules as well.


Loading And Saving Data Using Pandas

To start working with data in Pandas, we need to import some data from files. The first two Pandas methods to help us with this are .read_excel() and .read_csv(). You can also create data frames in Pandas from lists or objects in code.

Excel Data Frame

In this first example, we have a file named regiondata.xlsx. It just has some simple sample data to view.

read xlsx file in pandas

Now we can use Python to read that .xlsx file into a data frame, then output the results. We do this by calling import pandas as pd. We can also import the openpyxl module that will allow saving to an Excel document. Once we have these, we just need to initialize our data frame using .read_excel().

You can see that Pandas uses indices for rows and headers for columns. In this case, the indices are integers, and our headers are strings, such as region, unit, sales, and exports.

print data frame as output in pandas

CSV Data Frame

Now let’s go ahead and read a CSV file into a data frame using Pandas. For this we can use the pandas .read_csv() method. Here is the CSV file we want to open.

pandas CSV file to read

We can see that the indices are present, but the first row of data is acting as a header for the rest of the file. We don’t really want this so we need to specify that there is no header present in the CSV file.

python pandas read_csv output

To fix this all we have to do is add a header argument and set that to None. When we run this again, we can see that now the headers are integer based.

pandas read_csv header argument

Setting Column Names

Those integers could be more helpful if they were descriptive names of what each column actually is. We can take care of this by setting the columns. We simply assign a list with each element corresponding to the column as we want them named.

pandas column name via list

Pandas CSV to Excel

Now let’s go ahead and save the CSV file to an Excel file of type .xlsx. This is done using the .to_excel() method and passing in the path and name of the file.

This creates a new Excel file in the directory we specify, and when we open that file in Excel we can see that it has the indices and headers as we expect.

pandas csv to excel method

Text File Data Frame

Oddly enough, creating a data frame from a text file is done using the same .read_csv() function. This is because there is not too much of a difference between a CSV file and a text file, except for some minor formatting differences. We have a text file that has over 11,000 rows of data. Pandas can read this large amount of data super quickly.

The file itself is tab-delimited, which is why we need to specify delimiter=’\t’ as the second argument to the .read_csv() function. Pandas provides the output as we see here, with the middle chunk of rows omitted since there are so many rows in the data set.

python pandas read_csv text file


How To View And Inspect Data With Pandas

When working with pandas, getting what you want out of the data is a matter of manipulating the data frame. Let’s look at a few selection and viewing functions, as well as saving our desired values to an Excel sheet.

How To Work With Columns In Pandas

The code below is operating on the same some_names.csv file from earlier where we had assigned column names. Imagine if you had to deal with a spreadsheet with so many columns that it was hard to read the data in the terminal. You need to know which columns contain what so that you can access the data you need to proceed. In order to do this, we can use the same function we used to assign the columns. Printing out the dataframe_csv.columns variable shows us the data we are working with.

Index(['First', 'Last', 'Address', 'City', 'State', 'Zip', 'Population'], dtype='object')

This shows us how the data is split up by column, allowing us to now be able to view just one column if we like. Pandas data frames are a multi-row and multi-column data structure, like a matrix. This makes it easy to work with all data in just one column, or one row if we want. No loops are needed, and all we have to do is index by the column name. So let’s say we want First Name. Then we just take the First Name index and this prints out all the values in the First Name column and their indices.

0      Liam
1      Noah
2      Emma
3    Olivia
4       Ava
5     James
Name: First, dtype: object

If you wanted to access multiple column’s data, you can pass in a list. So let’s say we want the Address and the State columns. We just construct a list with those two values, then pass them in using bracket notation. Now we have both the Address and the State and their indices.

            Address State
0   3 Jackson Drive    MI
1    81 Emerson Way    CO
2    17 Upside Lane    TX
3  712 Front Street    CA
4   24 Seven Street    ID
5   11 Breaker Lane    OH

When a spreadsheet has hundreds of rows of data, you may want to just focus in on a subset of the data in a particular column for example. This can be done using slicing, and the syntax is the standard Python slicing approach you are used to. In this example below, we want to look at the Zip column, but only the first two results. The highlighted line shows the slice syntax to achieve that.

0    49508
1    80922
Name: Zip, dtype: int64

How To Work With Rows In Pandas

Let’s look at how to work with some rows in Pandas now that we have learned a bit about columns. Rows are indexed with integers automatically by Pandas, and we can use these indices to locate them specifically with the .iloc function. In the example below, we pass in 2 as an index and print that to the screen. The data printed out is the row that we expect.

First                   Emma
Last                Williams
Address       17 Upside Lane
City                 Watauga
State                     TX
Zip                    76148
Population            120000
Name: 2, dtype: object

If you are looking at the actual spreadsheet in Excel, the above code is selecting the row seen here.

pandas iloc function row

You can also drill down and select individual cells in the spreadsheet. Again, the in-memory data frame is like a matrix just like we see visually. Let’s consider we want to access the cell data circled here.

access specific cell in pandas

It would be possible to find this data using iteration, but it is more efficient to pass two numbers or variables to .iloc[] in the way they correspond to a row-column value or coordinate system. So in our example, we want the text circled above. In Pandas, we know that indices are numbered starting at zero and our headers are strings. What is also true is that when using the .iloc[] function, headers also correspond to integer values starting at zero. The first argument is the row, and the second is the column. Since these are zero-based, we want the integer 3 row and the integer 2 column.

712 Front Street

Saving Extracted Data

Now that we have the knowledge of how to access data by column, row, or coordinator, we can see how it is easy to drill down on only the data we want. With this in mind, let’s say we want to access a subset of the data, then store that data in its own spreadsheet. We can use code as we see here to accomplish that goal.

Opening the resulting file extracted_data.xlsx in Excel shows that it has the exact data we were drilling down on. Very cool!

pandas exract and save to new spreadsheet


How To Filter And Sort Data Using Pandas

You can filter and sort data in Pandas in almost countless ways, we’ll look at a few of the more simple examples here. This first example simply looks for all rows that have the value of Worthington in the City column.

   First    Last          Address         City State    Zip  Population
4   Zach   Price    99 Eleven Way  Worthington    OH  43085       68000
8  James  Miller  11 Breaker Lane  Worthington    OH  43085       68000

To combine two conditions when filtering data in Pandas, you use the ampersand & operator as we see below. This code checks for all rows where the City is Kentwood *and* the First column has a value of Sam.

  First    Last          Address      City State    Zip  Population
0   Sam   Smith  3 Jackson Drive  Kentwood    MI  49508       45000
6   Sam  Miller   75 High Street  Kentwood    MI  49508       45000

If you only want to work on a subset of the data, you can drop columns using the .drop() function.

    First      Last         City State    Zip
0     Sam     Smith     Kentwood    MI  49508
1   Sally    Holmes        Boise    ID  83704
2    Noah   Johnson    C Springs    CO  80922
3    Emma  Williams      Watauga    TX  76148
4    Zach     Price  Worthington    OH  43085
5  Olivia     Jones     C Valley    CA  94546
6     Sam    Miller     Kentwood    MI  49508
7     Ava     Brown        Boise    ID  83704
8   James    Miller  Worthington    OH  43085

Here we can create a new column with a default value of false. Then, we can look at the data in other columns and change that value to True based on a condition. In this case, we check to see if the State column has a value of OH, and if it does, go ahead and set the new column we defined to True.

    First      Last         City State    Zip  T or F
0     Sam     Smith     Kentwood    MI  49508   False
1   Sally    Holmes        Boise    ID  83704   False
2    Noah   Johnson    C Springs    CO  80922   False
3    Emma  Williams      Watauga    TX  76148   False
4    Zach     Price  Worthington    OH  43085    True
5  Olivia     Jones     C Valley    CA  94546   False
6     Sam    Miller     Kentwood    MI  49508   False
7     Ava     Brown        Boise    ID  83704   False
8   James    Miller  Worthington    OH  43085    True

In the below two examples, we use the .sort_values() method to sort the data on a particular column.

    First      Last           Address         City State    Zip  Population
7     Ava     Brown   24 Seven Street        Boise    ID  83704       30000
3    Emma  Williams    17 Upside Lane      Watauga    TX  76148      120000
8   James    Miller   11 Breaker Lane  Worthington    OH  43085       68000
2    Noah   Johnson    81 Emerson Way    C Springs    CO  80922       18000
5  Olivia     Jones  712 Front Street     C Valley    CA  94546       90000
1   Sally    Holmes    12 Front Drive        Boise    ID  83704       30000
0     Sam     Smith   3 Jackson Drive     Kentwood    MI  49508       45000
6     Sam    Miller    75 High Street     Kentwood    MI  49508       45000
4    Zach     Price     99 Eleven Way  Worthington    OH  43085       68000

To sort the data in the other direction, just add ascending=False as the second argument.

    First      Last           Address         City State    Zip  Population
4    Zach     Price     99 Eleven Way  Worthington    OH  43085       68000
0     Sam     Smith   3 Jackson Drive     Kentwood    MI  49508       45000
6     Sam    Miller    75 High Street     Kentwood    MI  49508       45000
1   Sally    Holmes    12 Front Drive        Boise    ID  83704       30000
5  Olivia     Jones  712 Front Street     C Valley    CA  94546       90000
2    Noah   Johnson    81 Emerson Way    C Springs    CO  80922       18000
8   James    Miller   11 Breaker Lane  Worthington    OH  43085       68000
3    Emma  Williams    17 Upside Lane      Watauga    TX  76148      120000
7     Ava     Brown   24 Seven Street        Boise    ID  83704       30000

Controlling Excel Directly With Openpyxl

Openpyxl is another popular library that is good for searching on certain data, copying and pasting to new Excel documents, or just trying to make your data look good. It is a package in Python that allows your code to interact directly with Excel documents. With Openpyxl, you can manipulate not just the data in the spreadsheet, but also the properties of the spreadsheet itself. Openpyxl does this by storing an Excel workbook as well as the worksheets as objects that can be altered with Python code. Openpyxl is the most used module for working specifically with Excel, as it allows developers to do anything from styling sheets to parsing data or creating graphs. To get started with Openpyxl, install it with the pip install openpyxl command.

.load_workbook()

We have a file named stock_options.xlsx that we can use for some examples. A quick recap of Excel terms is in order before we get started.

  • Workbook – The Excel document itself is called a Workbook.
  • .xlsx extension – A single Workbook is saved to a file with the .xlsx extension.
  • Worksheets(Sheets) – Each Workbook can have multiple Worksheets.
  • Columns – Columns are aligned vertically and use Letters for identification.
  • Rows – Rows are aligned horizontally and use Numbers for identification.
  • Cell – At the intersection of a Column and Row is what is known as a cell.

The Excel file we will work with looks like this.

Excel Data For Openpyxl

To start working with a Workbook, we can use code as we see here.

Printing out the workbook variable shows it is an object of type Workbook.

<class 'openpyxl.workbook.workbook.Workbook'>

How To Access Worksheets

A Workbook may contain multiple Worksheets. In openpyxl, we can access a worksheet object a couple of ways. If you know the name of the Worksheet you want to work with, this code will suit you.

<class 'openpyxl.worksheet.worksheet.Worksheet'>

If you don’t know the name of the worksheet you want to access, you can check what names exist with a simple call to .sheetnames.

['Sheet1', 'Sheet2', 'Sheet3']

How To Access Cells In Sheets

Once you have a sheet object to work with, you can access it’s cells and values pretty easily.

SFIX

You can also access a cell using the .cell() method and passing both the row and column as integers like so.

0.3745

The approach of using the .cell() method is good when you want to iterate over values in the sheet. For example:

CCJ
SFIX
FE
WLL
ACC

We can also use slicing to select a range of cells. Here is an example of that.

((<Cell 'Sheet1'.A1>,), (<Cell 'Sheet1'.A2>,), (<Cell 'Sheet1'.A3>,))

To select an entire column, we can simply specify the column we want. We’ll then print out the number of items in that column since actually printing out the entire data would be too many to view in such a large document. We can see there are 923 values in column A.

923

This code shows us all of the cells that have values in row 1.

(<Cell 'Sheet1'.A1>, <Cell 'Sheet1'.B1>, <Cell 'Sheet1'.C1>, <Cell 'Sheet1'.D1>, <Cell 'Sheet1'.E1>, <Cell 'Sheet1'.F1>, <Cell 'Sheet1'.G1>, <Cell 'Sheet1'.H1>, <Cell 'Sheet1'.I1>, <Cell 'Sheet1'.J1>,
 <Cell 'Sheet1'.K1>, <Cell 'Sheet1'.L1>, <Cell 'Sheet1'.M1>, <Cell 'Sheet1'.N1>, <Cell 'Sheet1'.O1>)

Cells are accessed by the combination of string headers and integers and columns are only the headers. For rows, we only use the integer and we still access them through our worksheet. Just like with other functions if we wanted to grab one row, we index by a specific number. But if we wanted to grab multiple rows we use the slicing method.

The final access function we can look at is iterating through columns or rows. When we iterate through these objects we use a specific function in our for loop. So let’s go over the rows function. The column function is exactly the same but the words are switched. In the iter_rows or iter_columns function, we have to specify the minimum and maximums for our rows and columns. We’ll set our minimum row to one, our maximum column to three, and our maximum row to two. Inside the of the loop, we iterate through the objects again. Now you’ll notice we should be printing the cells from a minimum row of one, a minimum column of one, to a maximum row of two and a maximum column of three. We can see that our minimum row one corresponds to the first row and our max row corresponds just to the second. So you see we only have A1 to C2. The maximum column equals three puts us all the way to C.

<Cell 'Sheet1'.A1>
<Cell 'Sheet1'.B1>
<Cell 'Sheet1'.C1>
<Cell 'Sheet1'.A2>
<Cell 'Sheet1'.B2>
<Cell 'Sheet1'.C2>

Creating New Workbooks and Worksheets

We don’t have to import an already existing excel file when using openpyxl. The following code creates a Workbook with three Worksheets.

['My Awesome Sheet', 'First Sheet', 'Second Sheet']

To add some data to one of the Worksheets in our Workbook, we can assign a value to the cell we are interested in. Then we will save the workbook and view it in Excel.

openpyxl create workbook worksheet save

How To Format Workbooks

When presenting large sets of data to others, it helps to format it in such a way that it is readable and understandable. If you want to create an Excel that’s stylized and professional then openpyxl provides the perfect formatting tools. In the example below, we use a few of the available formatting tools available in openpyxl to add a nice header to the spreadsheet. By inserting rows and then using the merge function, we can create a header that is above all of the actual data. In that header area, we use the Font module to add color to the text in addition to italicizing it. We know how to set a value to a cell, so we do that next. The Alignment module makes it easy to put content in a merged cell area with a great amount of accuracy. We set the text to be vertically centered and horizontally shifted to the right. The GradientFill module allows us to add a cool gradient effect to the header as well.

This code saves a new Workbook to the excel directory named stylish.xlsx. If we open up that workbook using Excel, we can see two new rows at the top of the workbook. All of the cells in that range have been merged, and there is now some text in that area. We also see the neat color, font, and italics that we applied. Cool!

How To Format Workbooks

Named Styles In Openpyxl

Openpyxl has a cool module named NamedStyles. Named styles are objects we can create that store a style so that we can use it multiple times instead of having to type all that out as we did just above. To complete this example, we import a few more modules at the top of the file. These are NamedStyle, Side, Border, and PatternFill. We want to add another effect to the top of each column for the data. So we create a new variable named highlight and assign a NamedStyle to it. Then we can simply build up that object with all of the properties we want the style to have. In this case, we add a border to the top and bottom of the cell, make it yellow in color, and set the font to bold. One the named style is in place, we set simply iterate over each cell in row 3, applying the highlight on each iteration.

openpyxl named styles example


Learn More About How To Use Python With Excel

How To Use Python With Excel Summary

Python works great with Excel, both very popular in their respective domains. Python has a growing community of developers that produce great tools to use like Pandas and Openpyxl. We had a look at the basics of using both tools to work with Excel. There is always more to learn, so the list of resources just above is a great place to keep learning.