Pandas: Working with Dates and Times
April 18, 2019
Introduction
In this tutorial, we will learn how to handle datetime data with pandas.
We will also build on our previous knowledge of indexing DataFrames to work with time-series data specifically.
Data structures in Pandas for Date and Time
We will first look at the basic data structures available for dealing with date and time in Pandas.
- Timestamp: These refer to a particular instant in time, e.g. '2018-05-31 12:30:45'. This is similar to Python's datetime and is based on NumPy's datetime64 data type.
- Period: These refer to standard time intervals such as a month, a week or a day. These are fixed-frequency intervals, based on NumPy's datetime64 data type.
- Timedelta: These refer to exact time intervals, such as an interval of 35.27 seconds. It is similar to Python's datetime.timedelta, but is based on NumPy's timedelta64 data type.
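To make these concrete, here is a minimal sketch (assuming pandas is installed and imported as pd) that constructs one of each:
import pandas as pd

# a specific instant in time
ts = pd.Timestamp('2018-05-31 12:30:45')
# a fixed-frequency interval (the month of May 2018)
p = pd.Period('2018-05', freq='M')
# an exact duration
td = pd.Timedelta(seconds=35.27)
print(ts, p, td)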
The electricity dataset
For this tutorial, we will be making use of some time series data. Specifically, we will be using the electricity
dataset.
The data is taken from "Open Power System Data" and modified for this tutorial.
The file contains data about electric power consumption and generation in the United Kingdom. The data is recorded on an hourly basis from 11th May 2018 04:00 to 31st May 2018 23:00.
Attribution: Open Power System Data. 2018. Data Package Time series. Version 2018-06-30. https://doi.org/10.25832/time_series/2018-06-30
Variable descriptions and data types:
- utc_timestamp: Start of the time period, in Coordinated Universal Time
- Britain_hourly_load: Total load in the United Kingdom (in MW)
- Britain_solar: Solar power generation in the United Kingdom (in MW)
- Britain_wind: Wind power generation in the United Kingdom (in MW)
Loading Datetime data
Let’s load the electricity
dataset and get a glimpse of it using the head()
method.
load_commonlounge_dataset("electricity")
edata = pd.read_csv("/tmp/electricity.csv")
print(edata.head())
We can see that the utc_timestamp
column contains date-time data.
Let us see the dtypes of the columns with the info()
method:
edata.info()
We can see that the utc_timestamp column has the dtype object. As CSV files can only have numbers and strings, we have to explicitly convert the data to Timestamps ourselves.
Convert to datetime using to_datetime()
We can convert the utc_timestamp
column to Timestamp
using the to_datetime()
function. The syntax is as follows:
var = pandas.to_datetime(arg)
arg here is the argument we pass, which we want to convert to a Timestamp.
The type of var returned depends on the argument passed to to_datetime:
- For a scalar, it returns a Timestamp.
- For a Series, it returns a Series of dtype datetime64 (this allows Pandas to take advantage of vectorized operations over the underlying NumPy arrays).
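A small illustrative sketch of the two cases (the example values here are made up):
# scalar input: returns a single Timestamp
print(pd.to_datetime('2018-05-31 12:30:45'))
# Series input: returns a Series of dtype datetime64
print(pd.to_datetime(pd.Series(['2018-05-11 04:00:00', '2018-05-11 05:00:00'])))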
Let us convert the column utc_timestamp from dtype object to datetime64:
edata.utc_timestamp = pd.to_datetime(edata.utc_timestamp)
Let us check if the dtype has changed.
print(edata.info())
As we can see, it is now of dtype datetime64!
To convert datetime data from strings to Timestamps, Pandas has to figure out the format in which the data is written in the input file.
By default, Pandas infers the format for each element separately. If all the elements in our column have the same format, then we can considerably speed up the conversion by explicitly specifying the format.
For our data, the format is '%Y-%m-%d %H:%M:%S'. The specification for format is shared with the Python datetime module's strptime() and strftime() functions, which are used for date-time conversion in Python. The most commonly used codes include %Y (four-digit year), %m (month as a zero-padded number), %d (day of the month), %H (hour on a 24-hour clock), %M (minute) and %S (second).
We can specify the format
parameter with the following syntax:
series = pandas.to_datetime(arg, format='%Y-%m-%d %H:%M:%S')
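For instance, here is a small sketch with a hypothetical Series of strings written in the same format as our data:
# hypothetical example strings, in the same format as utc_timestamp
raw = pd.Series(['2018-05-11 04:00:00', '2018-05-11 05:00:00'])
print(pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S'))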
parse_dates parameter in read_csv()
We can also parse the datetime data using the parse_dates parameter of the read_csv() function. To do this, we pass the list of column names (or column positions) to parse as datetime to the parse_dates parameter.
dataframe = pandas.read_csv("path_to_file", parse_dates=["col1", "col2", "col3"])
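For example, a sketch of loading our dataset this way (edata2 is just an illustrative name):
# parse utc_timestamp as datetime while reading the CSV
edata2 = pd.read_csv("/tmp/electricity.csv", parse_dates=["utc_timestamp"])
print(edata2.dtypes)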
DatetimeIndex
Pandas supports an index structure called DatetimeIndex, which is based on the datetime64 data type. This index allows us to use Timestamp data to identify rows, and to slice with loc.
To set a column as the index, we will use the set_index method of the Pandas DataFrame.
DataFrame.set_index("col_label")
set_index
sets an existing column as the new index for the DataFrame. The argument is the column label of the column we want as the new index.
Let us do it for our DataFrame:
edata.set_index("utc_timestamp", inplace=True)
If the inplace parameter is True, then the original DataFrame will itself be modified, instead of creating and returning a new DataFrame.
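If we had not used inplace=True, an equivalent sketch would be to assign the returned DataFrame back (shown here on a hypothetical freshly loaded copy, since our edata has already been re-indexed):
# non-inplace alternative: set_index returns a new DataFrame
edata_fresh = pd.read_csv("/tmp/electricity.csv", parse_dates=["utc_timestamp"])
edata_fresh = edata_fresh.set_index("utc_timestamp")
print(edata_fresh.index[:3])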
Let us have a look at our data with the DatetimeIndex:
print(edata.head())
Attributes of DatetimeIndex
Finding trends and patterns over time is an essential part of exploring and modelling time-series data. Some of the attributes of DatetimeIndex help us do that.
Suppose we are interested in knowing the electricity consumption level at 12pm every day. To do this, we can use the hour attribute of DatetimeIndex. This returns only the hours of the datetime values as an integer index.
The DatetimeIndex can be accessed with DataFrame.index, and the hours can be accessed as DataFrame.index.hour.
Let us look at what the attribute returns:
print(edata.index.hour)
Let's use this to select all the rows at 12pm every day:
# select data for 12pm everyday
subset = edata[edata.index.hour==12]
# display first few results
print(subset.head())
We could use other attributes like:
- day: day of the month as an integer
- weekday: day of the week as an integer (Monday is 0, Tuesday is 1, …, Sunday is 6)
- month: month of the year as an integer
- year: year as an integer
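For example, a small sketch using the weekday attribute to pick out the rows that fall on weekends:
# select all rows that fall on a weekend (Saturday is 5, Sunday is 6)
weekend = edata[edata.index.weekday >= 5]
print(weekend.head())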
We can also use loc to slice data by date and time. For example, if we want all the data from 1pm to 11pm on 11th May 2018, we could get it as follows:
print(edata.loc['2018-05-11 13:00:00':'2018-05-11 23:00:00'])
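loc with a DatetimeIndex also accepts partial date strings; as a quick sketch, passing just a date selects every row from that day:
# all rows from 11th May 2018
print(edata.loc['2018-05-11'].head())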
Resampling
Our current data is stored on an hourly basis. From time to time, we might need to get data according to different frequencies (daily, weekly, etc).
Changing the frequency is called resampling. It can go in one of two directions:
- downsample: resample to a lower frequency. For example, calculating the daily total or average consumption from the hourly consumption.
- upsample: resample to a higher frequency. This is much rarer, as the increased frequency creates extra points in time for which we don't have values. These values are then filled in by methods like forward fill or backward fill, similar to imputation techniques.
Resampling is done with the resample()
function. The syntax is:
DataFrame.resample(rule)
The following are some of the arguments that can be passed to the rule
parameter, to specify the new frequency.
- 'H': hour
- 'D': day
- 'M': calendar month end
- 'A': year end
Note: These strings map to DateOffset objects, which denote durations of time, similar to timedelta, but with standard calendar rules. For example, a one-month increment by DateOffset from 5th January will be 5th February and then 5th March, irrespective of whether the month contains 28, 29, 30 or 31 days.
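A brief sketch of that calendar-aware behaviour:
# a one-month DateOffset moves to the same day of the next month
print(pd.Timestamp('2018-01-05') + pd.DateOffset(months=1))  # 2018-02-05
print(pd.Timestamp('2018-02-05') + pd.DateOffset(months=1))  # 2018-03-05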
For downsampling, we can apply an aggregate function like mean(), median() or sum() to fill the columns with the aggregate values we want. For upsampling, we can specify a filling method for the NaN values.
Let us take an example of downsampling to daily frequency from our original hourly frequency, with the aggregator function mean(). This will give us the daily mean electricity consumption levels, and wind and solar generation levels.
# downsampling to daily frequency
resampled_df = edata.resample('D').mean()
# display first few results
print(resampled_df.head())
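For completeness, here is a small sketch of upsampling our hourly data to a half-hourly frequency, forward-filling the newly created rows:
# upsampling to half-hourly frequency with forward fill
upsampled_df = edata.resample('30min').ffill()
print(upsampled_df.head())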
Rolling Windows
Similar to resampling, a rolling window aggregates the data in each window with a function such as mean(), median(), sum(), etc.
However, unlike resampling where the time intervals don’t overlap, rolling windows overlap and “roll” along at the same frequency as the data. Thus the new DataFrame has the same frequency as the original one.
The syntax is:
DataFrame.rolling(window)
The window
parameter decides how many rows to include in each bin.
- It could be an integer to specify the exact number of rows, or
- It could be a frequency string - 'H', 'D', 'M' - denoting hourly, daily, monthly, etc.
Let us see an example where we take a rolling window of 48 hours (since our data is hourly, that means a window of 48 rows). We will use the aggregate function mean() for each window and display rows 50 to 60 (as the first 47 rows will have NaN values).
rolling_mean = edata.rolling(48).mean()
print(rolling_mean.iloc[50:60,:])
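Since our DataFrame has a DatetimeIndex, the window can also be given as a time offset. A sketch of the equivalent 48-hour window (note that with a time-based window, min_periods defaults to 1, so the early rows are not NaN):
# rolling window specified as a time offset instead of a row count
rolling_mean_time = edata.rolling('48H').mean()
print(rolling_mean_time.iloc[50:60, :])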
Summary
- The data type of variables can be converted to Timestamp using the to_datetime() function. It converts scalars to Timestamp and Series to datetime64.
- The parse_dates parameter in read_csv() allows us to convert a column to datetime64 while reading the file.
- DatetimeIndex is a special index structure based on the datetime64 data type. It has helpful attributes like year, month, weekday, day, hour, etc.
- We can resample time-series data using resample(), with specific frequencies such as hourly, daily, monthly, etc.
- The rolling() function allows aggregate calculations on a rolling window of rows.