
Working with dates and time series data is a common task in data analysis, with applications ranging from financial analysis and weather forecasting to social media analytics. Pandas is a popular Python library that provides powerful tools for handling and analyzing time series data. This tutorial covers the following topics:
- How to create a datetime column in a pandas DataFrame
- How to extract date components from a datetime column
- How to filter rows based on a date range
- How to group data by date or time intervals
- How to resample time series data at different frequencies
- How to fill missing values in time series data
- How to shift and lag time series data
- How to calculate rolling windows and moving averages
- How to plot time series data with pandas and matplotlib
- How to handle time zones in pandas
By the end of this tutorial, you should have a good understanding of how to use pandas to manipulate and analyze time series data in Python.
How to create a datetime column in a pandas DataFrame
To work with time series data in pandas, we need to create a datetime column in our DataFrame. We can do this by using the pd.to_datetime() function to convert a column of strings or integers into datetime objects. Here’s an example:
import pandas as pd
# create a DataFrame with a date column
df = pd.DataFrame({'date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04']})
# convert the date column to datetime format
df['date'] = pd.to_datetime(df['date'])
# display the DataFrame
print(df)
Output:
date
0 2022-01-01
1 2022-01-02
2 2022-01-03
3 2022-01-04
In this example, we first create a DataFrame with a column of strings representing dates. We then use the pd.to_datetime() function to convert the date column to datetime format. Finally, we display the resulting DataFrame, which now has a date column of datetime objects.
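Note that the pd.to_datetime() function can handle a variety of input formats, including ISO 8601 strings (for example, 2022-01-01 12:34:56) and UNIX timestamps (pass the unit argument, such as unit='s', to say how the integers should be interpreted). Missing values are converted to NaT, and passing utc=True produces timezone-aware results.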
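The input formats mentioned above can be sketched as follows. This is a minimal illustration rather than an exhaustive list; the sample values and the variable names epoch, custom, and messy are made up for the example.

```python
import pandas as pd

# UNIX timestamps in seconds: `unit` tells pandas how to read the integers
epoch = pd.to_datetime(pd.Series([1640995200, 1641081600]), unit="s")

# A non-ISO layout (day/month/year): pass an explicit strftime-style `format`
custom = pd.to_datetime(pd.Series(["01/02/2022", "15/02/2022"]), format="%d/%m/%Y")

# With errors="coerce", unparseable or missing entries become NaT instead of raising
messy = pd.to_datetime(pd.Series(["2022-01-01", "not a date", None]), errors="coerce")

print(epoch)
print(custom)
print(messy)
```

Passing an explicit format is usually worth the extra typing when the day and month could be confused, as in 01/02/2022.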
How to extract date components from a datetime column
After creating a datetime column in a pandas DataFrame, we can extract various date components such as year, month, day, hour, minute, second, and day of the week. This can be useful for filtering, grouping, and analyzing time series data. Here’s an example:
import pandas as pd
# create a DataFrame with a datetime column
df = pd.DataFrame({'date': ['2022-01-01 12:34:56', '2022-01-02 13:45:57', '2022-01-03 14:56:58']})
df['date'] = pd.to_datetime(df['date'])
# extract year, month, and day columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
# extract hour, minute, and second columns
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df['second'] = df['date'].dt.second
# extract day of the week column
df['day_of_week'] = df['date'].dt.dayofweek
# display the DataFrame
print(df)
Output:
date year month day hour minute second day_of_week
0 2022-01-01 12:34:56 2022 1 1 12 34 56 5
1 2022-01-02 13:45:57 2022 1 2 13 45 57 6
2 2022-01-03 14:56:58 2022 1 3 14 56 58 0
In this example, we first create a DataFrame with a datetime column. We then use the dt accessor to extract various date components and create new columns in the DataFrame. Finally, we display the resulting DataFrame, which now has columns for year, month, day, hour, minute, second, and day of the week (where Monday is 0 and Sunday is 6).
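Note that the dt accessor can also be used to extract other date properties, such as the quarter (dt.quarter), the ISO week of the year (dt.isocalendar().week), and the name of the weekday (dt.day_name()). The accessor has no built-in notion of holidays; checking whether a date is a holiday requires a separate holiday calendar.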
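A few of these additional properties can be sketched like this. The sample dates and the name parts are chosen for illustration only.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-04-15", "2022-12-31"]))

parts = pd.DataFrame({
    "date": dates,
    "quarter": dates.dt.quarter,                # 1 to 4
    "iso_week": dates.dt.isocalendar().week,    # ISO 8601 week number
    "day_name": dates.dt.day_name(),            # e.g. "Saturday"
    "is_month_end": dates.dt.is_month_end,      # True on the last day of a month
})
print(parts)
```

Note that 2022-01-01 falls in ISO week 52, because ISO weeks start on Monday and that Saturday still belongs to the last week of 2021.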
How to filter rows based on a date range
Filtering rows based on a date range is a common operation in time series analysis. Pandas provides various ways to filter rows based on a date range. Here’s an example:
import pandas as pd
# create a DataFrame with a datetime column
df = pd.DataFrame({
'date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05'],
'value': [1, 2, 3, 4, 5]
})
df['date'] = pd.to_datetime(df['date'])
# filter rows between two dates
start_date = '2022-01-02'
end_date = '2022-01-04'
mask = (df['date'] >= start_date) & (df['date'] <= end_date)
filtered_df = df.loc[mask]
# display the filtered DataFrame
print(filtered_df)
Output:
date value
1 2022-01-02 2
2 2022-01-03 3
3 2022-01-04 4
In this example, we first create a DataFrame with a datetime column and a value column. We then use the pd.to_datetime() function to convert the date column to datetime format. We next create a Boolean mask using the >= and <= operators to filter rows between two dates. We then use the .loc[] accessor to select the filtered rows and create a new DataFrame called filtered_df. Finally, we display the resulting filtered DataFrame.
Note that we can also use other comparison operators such as <, >, ==, and != to filter rows by date, and we can combine multiple conditions using the | and & operators (wrap each condition in parentheses, as in the mask above). We can also use the pd.date_range() function to create a range of dates and keep the rows whose date column falls within that range by using the isin() method.
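As a sketch of some alternatives to the Boolean mask, the snippet below filters the same five-row data in three ways. The variable names by_between, by_isin, and by_slice are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03",
                            "2022-01-04", "2022-01-05"]),
    "value": [1, 2, 3, 4, 5],
})

# Series.between() is inclusive on both ends by default
by_between = df[df["date"].between("2022-01-02", "2022-01-04")]

# Membership in an explicit range of dates
wanted = pd.date_range("2022-01-02", "2022-01-04")
by_isin = df[df["date"].isin(wanted)]

# With a sorted datetime index, label slicing with date strings also works
by_slice = df.set_index("date").loc["2022-01-02":"2022-01-04"]

print(by_between)
print(by_isin)
print(by_slice)
```

All three select the rows for January 2 to January 4. Slicing on the index tends to be the most convenient option once the dates have been moved into the index.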
How to group data by date or time intervals
Grouping data by date or time intervals is a useful technique for aggregating and summarizing time series data. Pandas provides the groupby() method for grouping DataFrame rows based on a column of datetime objects. Here’s an example:
import pandas as pd
# create a DataFrame with a datetime column and a value column
df = pd.DataFrame({
'date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'],
'value': [1, 2, 3, 4, 5]
})
df['date'] = pd.to_datetime(df['date'])
# group data by date and calculate the sum
grouped = df.groupby('date').sum()
# display the grouped DataFrame
print(grouped)
Output:
value
date
2022-01-01 3
2022-01-02 7
2022-01-03 5
In this example, we first create a DataFrame with a datetime column and a value column. We then use the pd.to_datetime() function to convert the date column to datetime format. We next group the DataFrame by the date column using the groupby() method, and then apply the sum() method to calculate the sum of the value column for each date. Finally, we display the resulting grouped DataFrame.
Note that we can also group data by other time intervals, such as week, month, quarter, or year, by using the resample() method with an appropriate frequency string (e.g., 'W' for weekly, 'ME' for month end, 'QE' for quarter end, or 'YE' for year end; pandas versions before 2.2 spell the last three 'M', 'Q', and 'Y'). The resample() method needs a datetime index, or the name of a datetime column passed through its on argument. Additionally, we can apply aggregation functions such as mean(), min(), max(), std(), count(), or agg() to calculate summary statistics for each group.
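Grouping by week or month can also be done directly with groupby(). The sketch below uses pd.Grouper for weekly bins and a monthly period for calendar months; the sample data and the names weekly and monthly are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-12",
                            "2022-02-01", "2022-02-15"]),
    "value": [1, 2, 3, 4, 5],
})

# Weekly bins; with the plain "W" alias each week ends on Sunday
weekly = df.groupby(pd.Grouper(key="date", freq="W"))["value"].sum()

# Calendar months, using a monthly Period as the grouping key
monthly = df.groupby(df["date"].dt.to_period("M"))["value"].agg(["sum", "mean"])

print(weekly)
print(monthly)
```

The weekly result includes weeks that contain no rows (their sum is 0), whereas the period-based grouping only lists months that actually occur in the data.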
How to resample time series data at different frequencies
Resampling time series data involves changing the frequency of the data, such as converting hourly data to daily data or monthly data to quarterly data. Pandas provides the resample() method for resampling time series data at different frequencies. Here’s an example:
import pandas as pd
# create a DataFrame with a datetime index and a value column
dates = pd.date_range('2022-01-01', '2022-01-31')
df = pd.DataFrame({
'value': range(len(dates))
}, index=dates)
# resample data at a daily frequency and calculate the sum
daily = df.resample('D').sum()
# resample data at a weekly frequency and calculate the mean
weekly = df.resample('W').mean()
# display the resampled DataFrames
print(daily)
print(weekly)
Output:
value
2022-01-01 0
2022-01-02 1
2022-01-03 2
2022-01-04 3
2022-01-05 4
2022-01-06 5
2022-01-07 6
2022-01-08 7
2022-01-09 8
2022-01-10 9
2022-01-11 10
2022-01-12 11
2022-01-13 12
2022-01-14 13
2022-01-15 14
2022-01-16 15
2022-01-17 16
2022-01-18 17
2022-01-19 18
2022-01-20 19
2022-01-21 20
2022-01-22 21
2022-01-23 22
2022-01-24 23
2022-01-25 24
2022-01-26 25
2022-01-27 26
2022-01-28 27
2022-01-29 28
2022-01-30 29
2022-01-31 30
value
2022-01-02 0.5
2022-01-09 5.0
2022-01-16 12.0
2022-01-23 19.0
2022-01-30 26.0
2022-02-06 30.0
In this example, we first create a DataFrame with a datetime index and a value column using the pd.date_range() function. We then resample the data at a daily frequency and calculate the sum, and resample it at a weekly frequency and calculate the mean, using the resample() method followed by sum() or mean(). Because the data is already daily, the daily resample leaves the values unchanged. Weekly bins end on Sunday by default, so the first bin (ending 2022-01-02) holds only two days and the last bin (ending 2022-02-06) holds only January 31. Finally, we display the resulting resampled DataFrames.
Note that the frequency string used in the resample() method determines the size of the time bins for resampling. Common frequency strings include 'D' for daily, 'W' for weekly, 'ME' for month end, 'QE' for quarter end, 'YE' for year end, and 'h' for hourly, among others (pandas versions before 2.2 use 'M', 'Q', 'Y', and 'H' for the last four). Additionally, we can apply aggregation functions such as mean(), min(), max(), std(), count(), or agg() to calculate summary statistics for each time bin.
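To compute several statistics per bin in one pass, agg() accepts a list of function names. The following is a small sketch on the same January data; the name weekly_stats is made up for the example.

```python
import pandas as pd

dates = pd.date_range("2022-01-01", "2022-01-31")
df = pd.DataFrame({"value": range(len(dates))}, index=dates)

# One row per weekly bin, one column per statistic
weekly_stats = df["value"].resample("W").agg(["min", "max", "mean", "count"])
print(weekly_stats)
```

The count column is a quick way to spot partial bins at the edges of the data, which can otherwise make the first and last averages look misleading.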
How to fill missing values in time series data
Time series data often contains missing values, which can arise due to various reasons such as data collection errors, sensor failures, or network disruptions. Pandas provides several methods for filling missing values in time series data. Here’s an example:
import pandas as pd
import numpy as np
# create a DataFrame with a datetime index and missing values
dates = pd.date_range('2022-01-01', '2022-01-07')
df = pd.DataFrame({
'value': [1, 2, np.nan, 4, 5, np.nan, 7]
}, index=dates)
# fill missing values with a forward fill
# (fillna(method='ffill') is deprecated since pandas 2.1; use ffill() instead)
ffill = df.ffill()
# fill missing values with a backward fill
bfill = df.bfill()
# fill missing values with a linear interpolation
interp = df.interpolate()
# display the filled DataFrames
print(ffill)
print(bfill)
print(interp)
Output:
value
2022-01-01 1.0
2022-01-02 2.0
2022-01-03 2.0
2022-01-04 4.0
2022-01-05 5.0
2022-01-06 5.0
2022-01-07 7.0
value
2022-01-01 1.0
2022-01-02 2.0
2022-01-03 4.0
2022-01-04 4.0
2022-01-05 5.0
2022-01-06 7.0
2022-01-07 7.0
value
2022-01-01 1.000000
2022-01-02 2.000000
2022-01-03 3.000000
2022-01-04 4.000000
2022-01-05 5.000000
2022-01-06 6.000000
2022-01-07 7.000000
In this example, we first create a DataFrame with a datetime index and missing values using the pd.date_range() function and the np.nan value. We then fill the missing values in three ways: a forward fill (ffill), which carries the last valid value forward; a backward fill (bfill), which pulls the next valid value back; and linear interpolation with the interpolate() method (interp). Finally, we display the resulting filled DataFrames.
Note that the fillna() method can also take a scalar value or a dictionary of column-wise values to fill missing values. Additionally, we can use dropna() to remove rows or columns with missing values, or pass a method argument to interpolate() to choose linear, time-weighted, polynomial, or spline interpolation (the last two also need an order argument and the SciPy library).
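The options above can be sketched on a small series with an uneven gap, which makes the difference between plain linear and time-weighted interpolation visible. The sample values and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Irregular spacing: one day between the first two rows, three days before the last
idx = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05"])
s = pd.Series([1.0, np.nan, 5.0], index=idx)

constant = s.fillna(0)                    # replace gaps with a fixed value
dropped = s.dropna()                      # discard rows that have gaps
linear = s.interpolate()                  # treats rows as equally spaced
by_time = s.interpolate(method="time")    # weights by the actual time gaps

print(linear)
print(by_time)
```

Linear interpolation fills the gap with 3.0 (halfway between the neighbours), while the time-based method gives 2.0 because January 2 lies a quarter of the way from January 1 to January 5.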
How to shift and lag time series data
Shifting and lagging time series data involves moving the data values forward or backward in time by a certain number of time steps. Pandas provides the shift() method for shifting and lagging time series data. Here’s an example:
import pandas as pd
# create a DataFrame with a datetime index and a value column
dates = pd.date_range('2022-01-01', '2022-01-07')
df = pd.DataFrame({
'value': range(len(dates))
}, index=dates)
# shift data forward by two time steps
forward = df.shift(2)
# shift data backward by two time steps
backward = df.shift(-2)
# display the shifted DataFrames
print(df)
print(forward)
print(backward)
Output:
value
2022-01-01 0
2022-01-02 1
2022-01-03 2
2022-01-04 3
2022-01-05 4
2022-01-06 5
2022-01-07 6
value
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 0.0
2022-01-04 1.0
2022-01-05 2.0
2022-01-06 3.0
2022-01-07 4.0
value
2022-01-01 2.0
2022-01-02 3.0
2022-01-03 4.0
2022-01-04 5.0
2022-01-05 6.0
2022-01-06 NaN
2022-01-07 NaN
In this example, we first create a DataFrame with a datetime index and a value column using the pd.date_range() function. We then shift the data forward by two time steps using the shift() method with a positive integer argument (forward), and shift the data backward by two time steps using a negative integer argument (backward). The vacated positions are filled with NaN, which is why the shifted columns are displayed as floats. Finally, we display the resulting shifted DataFrames.
Note that the shift() method can also take a freq argument to shift the index by a time interval instead of moving the values, and can be combined with other methods such as diff() to calculate the difference between consecutive values, or pct_change() to calculate the percentage change between consecutive values.
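A short sketch of these three variations follows. The sample values and the names moved_index, change, and pct are made up for the example.

```python
import pandas as pd

dates = pd.date_range("2022-01-01", periods=4)
s = pd.Series([10.0, 12.0, 15.0, 15.0], index=dates)

moved_index = s.shift(1, freq="D")   # values stay put, the index moves one day later
change = s.diff()                    # same as s - s.shift(1)
pct = s.pct_change()                 # same as (s - s.shift(1)) / s.shift(1)

print(moved_index)
print(change)
print(pct)
```

Unlike a plain shift(1), shifting with freq introduces no NaN values, because the data is relabeled rather than moved along a fixed index.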
How to calculate rolling windows and moving averages
Rolling windows and moving averages are useful techniques for smoothing time series data and calculating trends over time. Pandas provides the rolling() method for calculating rolling windows and moving averages. Here’s an example:
import pandas as pd
# create a DataFrame with a datetime index and a value column
dates = pd.date_range('2022-01-01', '2022-01-31')
df = pd.DataFrame({
'value': range(len(dates))
}, index=dates)
# 7-day moving average; the first six rows are NaN because the window is not yet full
rolling = df.rolling(window=7).mean()
# same 7-day window, but min_periods=1 averages whatever values are available so far
moving = df.rolling(window=7, min_periods=1).mean()
# display the rolling and moving averages
print(rolling)
print(moving)
Output:
value
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 NaN
2022-01-05 NaN
2022-01-06 NaN
2022-01-07 3.0
2022-01-08 4.0
2022-01-09 5.0
2022-01-10 6.0
2022-01-11 7.0
2022-01-12 8.0
2022-01-13 9.0
2022-01-14 10.0
2022-01-15 11.0
2022-01-16 12.0
2022-01-17 13.0
2022-01-18 14.0
2022-01-19 15.0
2022-01-20 16.0
2022-01-21 17.0
2022-01-22 18.0
2022-01-23 19.0
2022-01-24 20.0
2022-01-25 21.0
2022-01-26 22.0
2022-01-27 23.0
2022-01-28 24.0
2022-01-29 25.0
2022-01-30 26.0
2022-01-31 27.0
value
2022-01-01 0.0
2022-01-02 0.5
2022-01-03 1.0
2022-01-04 1.5
2022-01-05 2.0
2022-01-06 2.5
2022-01-07 3.0
2022-01-08 4.0
2022-01-09 5.0
2022-01-10 6.0
2022-01-11 7.0
2022-01-12 8.0
2022-01-13 9.0
2022-01-14 10.0
2022-01-15 11.0
2022-01-16 12.0
2022-01-17 13.0
2022-01-18 14.0
2022-01-19 15.0
2022-01-20 16.0
2022-01-21 17.0
2022-01-22 18.0
2022-01-23 19.0
2022-01-24 20.0
2022-01-25 21.0
2022-01-26 22.0
2022-01-27 23.0
2022-01-28 24.0
2022-01-29 25.0
2022-01-30 26.0
2022-01-31 27.0
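The two results above differ only in their first six rows, which is the effect of min_periods. As a further sketch, rolling() also accepts center=True and, on a datetime index, a time-based window such as '3D'. The short series and the variable names below are made up for the example.

```python
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

fixed = s.rolling(window=3).mean()                    # NaN until three values are available
partial = s.rolling(window=3, min_periods=1).mean()   # averages whatever is available
centered = s.rolling(window=3, center=True).mean()    # window centered on each row
by_time = s.rolling("3D").mean()                      # all rows within the trailing three days

print(pd.DataFrame({"fixed": fixed, "partial": partial,
                    "centered": centered, "by_time": by_time}))
```

On evenly spaced daily data the time-based window gives the same numbers as the three-row window with min_periods=1; the two diverge when observations are missing or irregularly spaced.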
How to plot time series data with pandas and matplotlib
Pandas and matplotlib provide several functions and methods for plotting time series data. Here’s an example:
import pandas as pd
import matplotlib.pyplot as plt
# create a DataFrame with a datetime index and a value column
dates = pd.date_range('2022-01-01', '2022-01-31')
df = pd.DataFrame({
'value': range(len(dates))
}, index=dates)
# plot the time series data
df.plot(figsize=(10, 6))
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
In this example, we first create a DataFrame with a datetime index and a value column using the pd.date_range() function. We then plot the time series data using the plot() method with the figsize argument to adjust the size of the plot. We also set the title, x-axis label, and y-axis label using the title(), xlabel(), and ylabel() functions, respectively. Finally, we display the plot using the show() function.
Note that the plot() method can also take arguments such as kind to specify the type of plot (e.g., 'line', 'area', or 'bar'), color to specify the color of the plot, and style to specify the line style. Additionally, we can use df.plot.hist() to plot a histogram of the values, or df.plot.scatter() to plot one column against another.
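A sketch of these options is shown below. It assumes a non-interactive setting, so it selects the Agg backend and saves the figure to a file instead of calling show(); the file name time_series_plots.png is made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2022-01-01", "2022-01-31")
df = pd.DataFrame({"value": range(len(dates))}, index=dates)

# Two stacked panels: a dashed line plot and a histogram of the same column
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
df["value"].plot(ax=axes[0], kind="line", color="tab:blue", style="--",
                 title="Daily values")
df["value"].plot(ax=axes[1], kind="hist", bins=10,
                 title="Distribution of values")
fig.tight_layout()
fig.savefig("time_series_plots.png")  # hypothetical output file
```

Passing ax lets pandas draw into axes created by matplotlib, which is a convenient way to combine several pandas plots in a single figure.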
How to handle time zones in pandas
Pandas provides the tz_localize() and tz_convert() methods for handling time zones in time series data. Here’s an example:
import pandas as pd
# create a DataFrame with a timezone-naive datetime index
# ('h' is the hourly alias; the uppercase 'H' is deprecated in pandas 2.2+)
dates = pd.date_range('2022-01-01 00:00:00', periods=3, freq='h')
df = pd.DataFrame({
    'value': [1, 2, 3]
}, index=dates)
# localize the naive index: attach a timezone without changing the clock times
df_localized = df.tz_localize('Asia/Tokyo')
# convert the same instants to another timezone
df_converted = df_localized.tz_convert('US/Pacific')
# display the original, localized, and converted DataFrames
print(df)
print(df_localized)
print(df_converted)
Output:
value
2022-01-01 00:00:00 1
2022-01-01 01:00:00 2
2022-01-01 02:00:00 3
value
2022-01-01 00:00:00+09:00 1
2022-01-01 01:00:00+09:00 2
2022-01-01 02:00:00+09:00 3
value
2021-12-31 07:00:00-08:00 1
2021-12-31 08:00:00-08:00 2
2021-12-31 09:00:00-08:00 3
In this example, we first create a DataFrame with a timezone-naive datetime index using the pd.date_range() function (passing the tz argument to pd.date_range() would create a timezone-aware index directly). We then attach a timezone to the index using the tz_localize() method, which keeps the clock times and labels them as Tokyo time, and convert the result to another timezone using the tz_convert() method, which keeps the instants and changes the clock times. Calling tz_localize() with a timezone on an index that is already timezone-aware raises a TypeError, which is why the starting index must be naive. Finally, we display the resulting original, localized, and converted DataFrames.
Note that the tz_localize() method sets the timezone for a datetime index that doesn’t have a timezone, and the tz_convert() method changes the timezone for a datetime index that already has a timezone. Additionally, pandas accepts time zone names from the IANA Time Zone Database, such as 'US/Pacific', 'Europe/London', and 'Asia/Tokyo'.
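The same two methods are available for datetime columns through the dt accessor. The sketch below uses two dates on either side of the 2022 daylight saving change in the United States to show that the UTC offset is handled per timestamp; the zone name America/Los_Angeles and the variable names are chosen for the example.

```python
import pandas as pd

naive = pd.Series(pd.to_datetime(["2022-03-12 12:00", "2022-03-13 12:00"]))

# Attach a zone to naive timestamps (the clock times are unchanged)
pacific = naive.dt.tz_localize("America/Los_Angeles")

# Re-express the same instants in another zone
utc = pacific.dt.tz_convert("UTC")

# Remove the zone again, keeping the UTC clock times
back_to_naive = utc.dt.tz_localize(None)

print(pacific)
print(utc)
print(back_to_naive)
```

Noon on March 12 maps to 20:00 UTC, while noon on March 13 maps to 19:00 UTC, because daylight saving time began in between. Storing timestamps in UTC and converting only for display is a common way to avoid surprises of this kind.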