A dataframe in Pandas is a two-dimensional, labeled data structure made up of rows and columns. It is the primary component of the popular Pandas Python library. Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. Pandas is built on top of NumPy, and in this tutorial we’ll take a look at how to get started with dataframes in Pandas.
Pandas Vs Numpy
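Before we look at dataframes in Pandas, let’s do a quick comparison of NumPy and Pandas.
NumPy: works with n-dimensional arrays in which every element shares one data type, and elements are accessed by integer position.
Pandas: builds on NumPy arrays and adds labeled rows and columns, so each column of a DataFrame can hold its own data type.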
List To Dataframe
We know what a Python list is and how to use it. Here is a simple list.
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
print(simple_list)
['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
We can load this list into a Pandas Dataframe like so.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
data = pd.DataFrame(simple_list)
print(data)
We can see the resulting data now looks a little different. You can see the list is now organized into rows and columns.
       0
0    Sam
1    Bob
2    Joe
3   Mary
4    Sue
5  Sally
Naming The Column
The number 0 is not very descriptive for the column name so let’s change that using this code.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list}
data = pd.DataFrame(named_column)
print(data)
The string in the key of the dictionary above becomes the name of the column, in this case “Name”.
    Name
0    Sam
1    Bob
2    Joe
3   Mary
4    Sue
5  Sally
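As an alternative to wrapping the list in a dictionary, the column label can be passed through the columns parameter of pd.DataFrame, or an existing column can be renamed afterwards. The sketch below shows both; the variable names unnamed and renamed are only illustrative.

```python
import pandas as pd

simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']

# Name the single column directly instead of wrapping the list in a dict.
data = pd.DataFrame(simple_list, columns=['Name'])
print(data)

# An existing frame's columns can also be renamed after the fact.
unnamed = pd.DataFrame(simple_list)            # column label is the integer 0
renamed = unnamed.rename(columns={0: 'Name'})  # returns a new DataFrame
print(renamed.columns.tolist())
```

Both approaches end up with a single column labeled “Name”, so which one to use is mostly a matter of taste.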
Adding A Column
To add a column to a Pandas Dataframe we can do something like this.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green']}
data = pd.DataFrame(named_column)
print(data)
Just like that, we now have a new “Favorite Color” column.
    Name Favorite Color
0    Sam           Blue
1    Bob            Red
2    Joe          Green
3   Mary           Blue
4    Sue            Red
5  Sally          Green
Let’s add another column like so.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
data = pd.DataFrame(named_column)
print(data)
    Name Favorite Color  Favorite Food
0    Sam           Blue        Italian
1    Bob            Red  Mediterranean
2    Joe          Green           Thai
3   Mary           Blue        Chinese
4    Sue            Red        Mexican
5  Sally          Green        Spanish
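The examples above add columns by rebuilding the dictionary before the DataFrame is created. A column can also be added to a DataFrame that already exists. Here is a minimal sketch of two common ways to do that; the name with_food is just an illustrative variable.

```python
import pandas as pd

data = pd.DataFrame({'Name': ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']})

# Assigning a list to a new column label adds that column in place.
# The list needs one entry per existing row.
data['Favorite Color'] = ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green']

# assign() returns a new DataFrame and leaves the original untouched.
# The ** dict form is needed because the column name contains a space.
with_food = data.assign(**{
    'Favorite Food': ['Italian', 'Mediterranean', 'Thai',
                      'Chinese', 'Mexican', 'Spanish'],
})

print(data.columns.tolist())
print(with_food.columns.tolist())
```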
OK, this Dataframe is looking pretty good. We have some rows and columns with useful information stored in them. Is the format of this data starting to look familiar? That’s right, it looks a lot like an Excel spreadsheet! This is a good concept to understand: a DataFrame in pandas is analogous to an Excel worksheet. While an Excel workbook can contain multiple worksheets, pandas DataFrames exist independently.
Selecting Column Data
Once you have a pandas Dataframe to work with, you can start selecting data from it as you choose. The following code will select all the values from the “Favorite Color” column.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
data = pd.DataFrame(named_column)
selected_column = data['Favorite Color']
print(selected_column)
0     Blue
1      Red
2    Green
3     Blue
4      Red
5    Green
Name: Favorite Color, dtype: object
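Selecting one column with a single label returns a Series, as the output above shows. To select several columns at once, a list of labels can be passed instead, which returns a DataFrame. A short sketch, using a trimmed-down version of the same data:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Sam', 'Bob', 'Joe'],
    'Favorite Color': ['Blue', 'Red', 'Green'],
    'Favorite Food': ['Italian', 'Mediterranean', 'Thai'],
})

one_column = data['Favorite Color']          # single label -> Series
selected = data[['Name', 'Favorite Color']]  # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(selected).__name__)
print(selected)
```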
Select a value in a Dataframe
Now suppose we want the favorite color of just one person, say Joe. We can see that Joe is in the row with index 2, so we can provide that index after selecting the column. This returns the value where the ‘Favorite Color’ column and the row with index 2 intersect.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
data = pd.DataFrame(named_column)
selected_column = data['Favorite Color'][2]
print(selected_column)
Green
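The same value can also be reached without first looking up Joe's row number by eye. The sketch below uses .loc with a row label and a column label, and then a comparison on the “Name” column to find the row; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally'],
    'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
})

# Row label and column label in a single .loc lookup.
by_label = data.loc[2, 'Favorite Color']

# Let pandas find the row: keep rows where Name equals 'Joe',
# then take the first (and here only) matching value.
joe_rows = data.loc[data['Name'] == 'Joe', 'Favorite Color']
joe_color = joe_rows.iloc[0]

print(by_label)
print(joe_color)
```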
Selecting rows with iloc
To select an entire row instead of a column, pass the integer position of the row to iloc.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
data = pd.DataFrame(named_column)
selected_row = data.iloc[2]
print(selected_row)
This provides us with all the data found in that row. We have the Name, Favorite Color, and Favorite food for Joe.
Name                Joe
Favorite Color    Green
Favorite Food      Thai
Name: 2, dtype: object
To get Sue’s information, we could do so easily by simply changing the index value passed to iloc.
selected_row = data.iloc[4]
Name                  Sue
Favorite Color        Red
Favorite Food     Mexican
Name: 4, dtype: object
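iloc is not limited to a single position. As a brief sketch, it also accepts a slice, a list of positions, or a negative position counted from the end:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally'],
    'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
})

first_three = data.iloc[0:3]   # positions 0, 1 and 2 (the end is exclusive)
picked = data.iloc[[2, 4]]     # a list of positions returns a DataFrame
last_row = data.iloc[-1]       # negative positions count from the end

print(first_three)
print(picked)
print(last_row)
```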
Selecting a row value
Just like we could provide an index to select a specific value when selecting a column, we can do the same when selecting rows. Let’s just get the Favorite Food of Sue.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
data = pd.DataFrame(named_column)
favorite_food = data.iloc[4]['Favorite Food']
print(favorite_food)
Mexican
Manipulating Dataframe Data
Just like in a spreadsheet, you can apply formulas to the data to create new columns of data based on existing data. Let’s create a formula that adds a new “About Me” column to the dataframe.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
data = pd.DataFrame(named_column)
formula_result = []
for i in range(len(data)):
    formula_result.append(f'{data.iloc[i]["Name"]} likes {data.iloc[i]["Favorite Food"]}'
                          f' food and the color {data.iloc[i]["Favorite Color"]}')
data['About Me'] = formula_result
print(data)
    Name  ...                                        About Me
0    Sam  ...       Sam likes Italian food and the color Blue
1    Bob  ...  Bob likes Mediterranean food and the color Red
2    Joe  ...         Joe likes Thai food and the color Green
3   Mary  ...      Mary likes Chinese food and the color Blue
4    Sue  ...        Sue likes Mexican food and the color Red
5  Sally  ...    Sally likes Spanish food and the color Green

[6 rows x 4 columns]
That looks pretty good! Did you notice that the dataframe looks a little different now? You see those three dots … in the rows of data? This happens because Pandas truncates the output when the dataframe is too wide to display. You can override this behavior using pd.set_option('display.max_columns', None) like so.
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
pd.set_option('display.max_columns', None)
data = pd.DataFrame(named_column)
formula_result = []
for i in range(len(data)):
    formula_result.append(f'{data.iloc[i]["Name"]} likes {data.iloc[i]["Favorite Food"]}'
                          f' food and the color {data.iloc[i]["Favorite Color"]}')
data['About Me'] = formula_result
print(data)
    Name Favorite Color  Favorite Food  \
0    Sam           Blue        Italian
1    Bob            Red  Mediterranean
2    Joe          Green           Thai
3   Mary           Blue        Chinese
4    Sue            Red        Mexican
5  Sally          Green        Spanish

                                         About Me
0       Sam likes Italian food and the color Blue
1  Bob likes Mediterranean food and the color Red
2         Joe likes Thai food and the color Green
3      Mary likes Chinese food and the color Blue
4        Sue likes Mexican food and the color Red
5    Sally likes Spanish food and the color Green
Hmm, that is kind of what we want, but notice that it prints some of the columns, then inserts a line break and prints the rest. What if you want to print the entire Dataframe with no truncated columns and no line wrapping in the output? Set both of these options:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
import pandas as pd
simple_list = ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally']
named_column = {'Name': simple_list,
'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
'Favorite Food': ['Italian', 'Mediterranean', 'Thai', 'Chinese', 'Mexican', 'Spanish']}
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
data = pd.DataFrame(named_column)
formula_result = []
for i in range(len(data)):
    formula_result.append(f'{data.iloc[i]["Name"]} likes {data.iloc[i]["Favorite Food"]}'
                          f' food and the color {data.iloc[i]["Favorite Color"]}')
data['About Me'] = formula_result
print(data)
This gives us the entire output we are looking for!
    Name Favorite Color  Favorite Food                                        About Me
0    Sam           Blue        Italian       Sam likes Italian food and the color Blue
1    Bob            Red  Mediterranean  Bob likes Mediterranean food and the color Red
2    Joe          Green           Thai         Joe likes Thai food and the color Green
3   Mary           Blue        Chinese      Mary likes Chinese food and the color Blue
4    Sue            Red        Mexican        Sue likes Mexican food and the color Red
5  Sally          Green        Spanish    Sally likes Spanish food and the color Green
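The loop used above builds the “About Me” text one row at a time. Since string columns can be joined with the + operator, the same column can probably be produced without an explicit loop. This is only a sketch of that alternative, not the approach used in the examples above:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Sam', 'Bob', 'Joe', 'Mary', 'Sue', 'Sally'],
    'Favorite Color': ['Blue', 'Red', 'Green', 'Blue', 'Red', 'Green'],
    'Favorite Food': ['Italian', 'Mediterranean', 'Thai',
                      'Chinese', 'Mexican', 'Spanish'],
})

# Adding a plain string to a column applies it to every row, and adding
# two columns joins them row by row.
data['About Me'] = (
    data['Name'] + ' likes ' + data['Favorite Food']
    + ' food and the color ' + data['Favorite Color']
)

print(data['About Me'].tolist())
```

Working on whole columns like this tends to be shorter and faster than indexing each row with iloc, which matters more as the number of rows grows.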
Save a dataframe to a file
If you would like to store the contents of your dataframe into a file now, this is easy to do with the .to_csv() method.
data.to_csv('dataframe_to_file.csv')
A new file has appeared in our project!
Our favorite Microsoft application, Excel, is also able to open the newly created file.
When saving a dataframe to a file using .to_csv(), the default delimiter is of course a comma. This can be changed if you like using the sep= parameter. Let’s create a tab-delimited version of our file now.
data.to_csv('dataframe_to_file_tabs.csv', sep='\t')
Saving pandas dataframe to text file
Even though the method we use to write a dataframe to a file is named .to_csv(), you are not limited to .csv files. In this next snippet, we save the dataframe to a text file with a .txt extension using a custom separator. Note that the sep value must be a 1-character string. Here we use the ‘+’ character.
data.to_csv('dataframe_to_text_file.txt', sep='+')
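To see what a custom separator actually writes, the sketch below saves a small frame to a temporary folder, prints the raw text, and reads it back. The temporary folder and the two-row sample frame are assumptions made so the example can run anywhere; index=False is used here to leave the row labels out of the file.

```python
import os
import tempfile

import pandas as pd

data = pd.DataFrame({'Name': ['Sam', 'Bob'],
                     'Favorite Color': ['Blue', 'Red']})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'dataframe_to_text_file.txt')

    # index=False leaves the 0, 1, 2... row labels out of the file.
    data.to_csv(path, sep='+', index=False)

    with open(path) as handle:
        raw_text = handle.read()

    # The same separator has to be given when reading the file back.
    restored = pd.read_csv(path, sep='+')

print(raw_text)
print(restored)
```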
Load Dataframe From File
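To load a file into a dataframe, you can use the pd.read_csv() function as we see below. Because the file was saved along with its index, the old index comes back as an extra “Unnamed: 0” column.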
import pandas as pd
data = pd.read_csv('dataframe_to_file.csv')
print(data)
   Unnamed: 0  ...                                        About Me
0           0  ...       Sam likes Italian food and the color Blue
1           1  ...  Bob likes Mediterranean food and the color Red
2           2  ...         Joe likes Thai food and the color Green
3           3  ...      Mary likes Chinese food and the color Blue
4           4  ...        Sue likes Mexican food and the color Red
5           5  ...    Sally likes Spanish food and the color Green

[6 rows x 5 columns]
To see the non-truncated data when reading a file into a dataframe, we can use the handy pd.set_option('display.max_columns', None) and pd.set_option('display.expand_frame_repr', False) options.
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
data = pd.read_csv('dataframe_to_file.csv')
print(data)
   Unnamed: 0   Name Favorite Color  Favorite Food                                        About Me
0           0    Sam           Blue        Italian       Sam likes Italian food and the color Blue
1           1    Bob            Red  Mediterranean  Bob likes Mediterranean food and the color Red
2           2    Joe          Green           Thai         Joe likes Thai food and the color Green
3           3   Mary           Blue        Chinese      Mary likes Chinese food and the color Blue
4           4    Sue            Red        Mexican        Sue likes Mexican food and the color Red
5           5  Sally          Green        Spanish    Sally likes Spanish food and the color Green
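The “Unnamed: 0” column in the output above is the row index that .to_csv() wrote into the file. One way to avoid it is to tell read_csv that the first column holds the index, using index_col=0. In the sketch below, csv_text is a shortened stand-in for the file contents so the example does not depend on a file on disk.

```python
import io

import pandas as pd

# Stand-in for the contents of dataframe_to_file.csv: the first, unnamed
# column holds the row labels that to_csv() wrote out.
csv_text = ',Name,Favorite Color\n0,Sam,Blue\n1,Bob,Red\n'

default_read = pd.read_csv(io.StringIO(csv_text))
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

The other option is to pass index=False to .to_csv() when saving, so the index never reaches the file in the first place.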
How to use sqlite with pandas
It’s possible to read data into pandas from an SQLite database. We can borrow a sample database from a different application to use for this purpose. To make use of this technique, we can import sqlite3, set up a connection variable, and then use the pd.read_sql() function like so.
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
print(data)
      id  ...                                              calls
0    416  ...  AMC,MRNA,TSLA,BYND,SNAP,CHPT,NCTY,GOOGL,VXRT,N...
1    418  ...  AMC,SNAP,FSR,PFE,AMD,MRNA,ZEV,AMZN,BAC,SBUX,NV...
2    419  ...  FUBO,AMC,COIN,AMD,BA,AMZN,CAT,SPCE,CHPT,RBLX,N...
3    424  ...  MRNA,IP,AMC,AMZN,MU,SONO,HYRE,ROKU,AMD,HOOD,PC...
4    425  ...  WISH,AMZN,AMD,SPCE,BABA,LAZR,EBAY,AMC,ZNGA,MRN...
..   ...  ...                                                ...
117  738  ...  INTC,TSLA,LCID,NIO,AMZN,BA,AMD,UAA,CLX,HOOD,SK...
118  740  ...  AMZN,TSLA,BA,HOOD,NIO,AMD,TWTR,AFRM,AMC,BHC,FL...
119  743  ...  AMD,AFRM,PLUG,NVDA,HOOD,TTWO,BA,UPS,TLRY,XOM,F...
120  746  ...  UPST,XOM,AMD,Z,FCX,GO,NFLX,RBLX,DWAC,AMRN,FDX,...
121  748  ...  PYPL,AMD,FB,GOOGL,RBLX,SQ,WFC,PENN,QCOM,AMGN,T...

[122 rows x 4 columns]
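If you do not have a db.sqlite3 file handy, the same technique can be tried against an in-memory SQLite database. The sketch below reuses the stockapp_call table name from the example above, but the two rows and the reduced column list are made up purely for illustration. It also shows the params argument of pd.read_sql, which passes values to the query without building the SQL string by hand.

```python
import sqlite3

import pandas as pd

# ':memory:' creates a throwaway database that lives only in RAM.
connection = sqlite3.connect(':memory:')
connection.execute(
    'create table stockapp_call '
    '(id integer primary key, created_at text, calls text)'
)
# Made-up sample rows, not data from the real application.
connection.executemany(
    'insert into stockapp_call (id, created_at, calls) values (?, ?, ?)',
    [(1, '2022-01-24 21:00:00', 'TSLA,NVDA'),
     (2, '2022-01-25 21:00:00', 'AMD,AMC')],
)
connection.commit()

everything = pd.read_sql('select * from stockapp_call', connection)

# The ? placeholder is filled in from params by the sqlite3 driver.
recent = pd.read_sql('select * from stockapp_call where id >= ?',
                     connection, params=(2,))
connection.close()

print(everything)
print(recent)
```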
Using head() and tail()
You may want to look at the first or last set of records in the dataframe. This can be accomplished using the head() or tail() methods. By default, head() displays the first 5 rows and tail() displays the last 5 rows. An integer can be passed to either method if you want to see, say, the first 7 records or the last 10 records. Here are a few examples of head() and tail().
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
print(data.head())
    id  ...                                              calls
0  416  ...  AMC,MRNA,TSLA,BYND,SNAP,CHPT,NCTY,GOOGL,VXRT,N...
1  418  ...  AMC,SNAP,FSR,PFE,AMD,MRNA,ZEV,AMZN,BAC,SBUX,NV...
2  419  ...  FUBO,AMC,COIN,AMD,BA,AMZN,CAT,SPCE,CHPT,RBLX,N...
3  424  ...  MRNA,IP,AMC,AMZN,MU,SONO,HYRE,ROKU,AMD,HOOD,PC...
4  425  ...  WISH,AMZN,AMD,SPCE,BABA,LAZR,EBAY,AMC,ZNGA,MRN...

[5 rows x 4 columns]
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
print(data.head(7))
    id  ...                                              calls
0  416  ...  AMC,MRNA,TSLA,BYND,SNAP,CHPT,NCTY,GOOGL,VXRT,N...
1  418  ...  AMC,SNAP,FSR,PFE,AMD,MRNA,ZEV,AMZN,BAC,SBUX,NV...
2  419  ...  FUBO,AMC,COIN,AMD,BA,AMZN,CAT,SPCE,CHPT,RBLX,N...
3  424  ...  MRNA,IP,AMC,AMZN,MU,SONO,HYRE,ROKU,AMD,HOOD,PC...
4  425  ...  WISH,AMZN,AMD,SPCE,BABA,LAZR,EBAY,AMC,ZNGA,MRN...
5  427  ...  TWTR,AMD,AMC,WISH,HOOD,FANG,SONO,SNAP,SPCE,BYN...
6  430  ...  PFE,MSFT,BABA,AMZN,TSLA,AAPL,MRNA,NIO,WISH,BBW...

[7 rows x 4 columns]
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
print(data.tail(10))
      id  ...                                              calls
112  724  ...  AMD,NVDA,LAZR,AFRM,BHC,MRNA,GM,AA,PTON,HZO,MAR...
113  727  ...  AMD,TSLA,NVDA,AMC,PTON,NFLX,AMZN,DISH,NRG,FB,L...
114  731  ...  TSLA,NVDA,AMD,AMC,AAPL,FB,MSFT,AAL,RBLX,AMZN,B...
115  734  ...  NVDA,TSLA,AMC,MSFT,AMD,AMZN,FB,BABA,BAC,EW,ZM,...
116  736  ...  AMC,T,MSFT,FB,CVX,NVDA,BABA,AMD,RUN,PLTR,INTC,...
117  738  ...  INTC,TSLA,LCID,NIO,AMZN,BA,AMD,UAA,CLX,HOOD,SK...
118  740  ...  AMZN,TSLA,BA,HOOD,NIO,AMD,TWTR,AFRM,AMC,BHC,FL...
119  743  ...  AMD,AFRM,PLUG,NVDA,HOOD,TTWO,BA,UPS,TLRY,XOM,F...
120  746  ...  UPST,XOM,AMD,Z,FCX,GO,NFLX,RBLX,DWAC,AMRN,FDX,...
121  748  ...  PYPL,AMD,FB,GOOGL,RBLX,SQ,WFC,PENN,QCOM,AMGN,T...

[10 rows x 4 columns]
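head() and tail() behave the same way on any dataframe, so they are easy to try on a small frame that needs no database. The ten-row frame and the column name n below are made up for this sketch. It also shows that a negative number means “everything except that many rows”.

```python
import pandas as pd

data = pd.DataFrame({'n': range(10)})   # rows 0 through 9

first_three = data.head(3)       # first 3 rows
last_two = data.tail(2)          # last 2 rows
all_but_last_two = data.head(-2) # every row except the last 2

print(first_three['n'].tolist())
print(last_two['n'].tolist())
print(len(all_but_last_two))
```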
Filter in a dataframe
The dataframe we are pulling from the SQLite database is over 100 rows long. To limit how much data is viewed, we can filter it. The expression inside the square brackets below produces a True or False value for each row, and only the rows marked True are kept.
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
filtered_row = data[data['created_at'].str.contains('2022-01-24')]
print(filtered_row)
      id  ...                                              calls
114  731  ...  TSLA,NVDA,AMD,AMC,AAPL,FB,MSFT,AAL,RBLX,AMZN,B...

[1 rows x 4 columns]
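To make the filtering step easier to follow, the sketch below splits it in two: first the True/False mask is built, then the mask is used to pick rows. The three sample rows are invented for illustration and only mimic the shape of the real table. Two conditions can be combined with the & operator.

```python
import pandas as pd

# Invented sample rows shaped like the stockapp_call table.
data = pd.DataFrame({
    'id': [1, 2, 3],
    'created_at': ['2022-01-24 21:05:11', '2022-01-25 21:10:42',
                   '2022-02-01 21:02:30'],
    'calls': ['TSLA,NVDA', 'NVDA,TSLA', 'TSLA,T'],
})

# Step 1: one True/False value per row.
mask = data['created_at'].str.contains('2022-01-24')
print(mask.tolist())

# Step 2: keep only the rows where the mask is True.
filtered = data[mask]

# Conditions can be combined with & (and) or | (or).
january_tsla = data[data['created_at'].str.startswith('2022-01')
                    & data['calls'].str.contains('TSLA')]

print(filtered)
print(january_tsla)
```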
Replacing values in a dataframe
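To replace one or more values in a dataframe we can use the .replace() method. With regex=True, the to_replace string is treated as a regular expression, so it also matches text inside longer strings. Here is an example of that technique.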
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
replaced_ticker = data.replace(to_replace='AMC', value='replaced!', regex=True)
print(replaced_ticker)
    id  ...                                              calls
0  416  ...  replaced!,MRNA,TSLA,BYND,SNAP,CHPT,NCTY,GOOGL,...
1  418  ...  replaced!,SNAP,FSR,PFE,AMD,MRNA,ZEV,AMZN,BAC,S...
2  419  ...  FUBO,replaced!,COIN,AMD,BA,AMZN,CAT,SPCE,CHPT,...
3  424  ...  MRNA,IP,replaced!,AMZN,MU,SONO,HYRE,ROKU,AMD,H...
4  425  ...  WISH,AMZN,AMD,SPCE,BABA,LAZR,EBAY,replaced!,ZN...
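One thing to watch for with regex=True is that the pattern matches anywhere in the text, including inside a longer ticker. The sketch below uses two invented rows to show the difference between the plain pattern and one wrapped in \b word boundaries; the ticker AMCX is included only to demonstrate a partial match.

```python
import pandas as pd

# Invented rows; 'AMCX' is here only to demonstrate a partial match.
data = pd.DataFrame({'id': [1, 2],
                     'calls': ['AMC,AMCX,TSLA', 'AMD,AMC']})

# Plain pattern: also rewrites the 'AMC' inside 'AMCX'.
loose = data.replace(to_replace='AMC', value='replaced!', regex=True)

# \b marks a word boundary, so only the standalone ticker matches.
strict = data.replace(to_replace=r'\bAMC\b', value='replaced!', regex=True)

print(loose['calls'].tolist())
print(strict['calls'].tolist())
```

In both cases .replace() returns a new dataframe, and the original data is left unchanged.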
Removing columns
To remove a column from the dataframe simply use the .drop() function like so.
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
removed_column = data.drop('calls', axis=1)
print(removed_column)
      id                  created_at                  updated_at
0    416  2021-08-09 20:29:27.252553  2021-08-09 20:29:27.252553
1    418  2021-08-10 18:36:36.024030  2021-08-10 18:36:36.024030
2    419  2021-08-11 14:41:28.597140  2021-08-11 14:41:28.597140
3    424  2021-08-12 20:18:08.020679  2021-08-12 20:18:08.020679
4    425  2021-08-13 18:27:07.071109  2021-08-13 18:27:07.071109
..   ...                         ...                         ...
117  738  2022-01-27 21:18:50.158205  2022-01-27 21:18:50.159205
118  740  2022-01-28 22:12:43.995624  2022-01-28 22:12:43.995624
119  743  2022-01-31 20:52:06.498233  2022-01-31 20:52:06.498233
120  746  2022-02-01 21:01:50.009382  2022-02-01 21:01:50.009382
121  748  2022-02-02 21:17:53.769019  2022-02-02 21:17:53.769019

[122 rows x 3 columns]
Removing rows from dataframe
In this example, we keep only the first three rows by slicing with iloc, and then remove several columns at once by passing a list of labels to .drop().
import pandas as pd
import sqlite3
connection = sqlite3.connect('db.sqlite3')
data = pd.read_sql('select * from stockapp_call', connection)
removed_row = data.iloc[0:3].drop(['id', 'created_at', 'updated_at'], axis=1)
print(removed_row)
                                               calls
0  AMC,MRNA,TSLA,BYND,SNAP,CHPT,NCTY,GOOGL,VXRT,N...
1  AMC,SNAP,FSR,PFE,AMD,MRNA,ZEV,AMZN,BAC,SBUX,NV...
2  FUBO,AMC,COIN,AMD,BA,AMZN,CAT,SPCE,CHPT,RBLX,N...
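.drop() can also remove rows directly when it is given index labels instead of column names. The following sketch uses a small invented frame; the index= keyword makes it explicit that the labels refer to rows, and reset_index(drop=True) renumbers the remaining rows from 0.

```python
import pandas as pd

# Invented sample rows for illustration.
data = pd.DataFrame({'id': [416, 418, 419, 424],
                     'calls': ['AMC,MRNA', 'AMC,SNAP', 'FUBO,AMC', 'MRNA,IP']})

# Remove the rows labeled 0 and 1; a list allows several labels at once.
without_rows = data.drop(index=[0, 1])
print(without_rows.index.tolist())

# The surviving rows keep their old labels unless the index is reset.
renumbered = without_rows.reset_index(drop=True)
print(renumbered)
```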
What Is A Pandas Dataframe Summary
The pandas.DataFrame data structure makes working with two-dimensional data convenient and efficient. We saw several ways to create a Pandas DataFrame, as well as how to perform common tasks such as accessing, adding, filtering, replacing, and deleting data when working with DataFrames.