Pandas: Working with Dates and Times
April 18, 2019
Introduction
In this tutorial, we will learn how to handle datetime data with pandas.
We will also build on our previous knowledge of indexing DataFrames to work with time-series data specifically.
Data structures in Pandas for Date and Time
We will first look at the basic data structures available for dealing with date and time in Pandas.
- Timestamp: These refer to a particular instant in time, e.g. '2018-05-31 12:30:45'. This is similar to Python's datetime and is based on NumPy's datetime64 data type.
- Period: These refer to standard time intervals such as a month, a week or a day. These are fixed-frequency intervals, based on NumPy's datetime64 data type.
- Timedelta: These refer to exact time intervals, such as an interval of 35.27 seconds. It is similar to Python's datetime.timedelta, but is based on NumPy's timedelta64 data type.
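To make these concrete, here is a minimal sketch (assuming pandas is installed and imported as pd) that constructs one of each:
import pandas as pd

# a specific instant in time
ts = pd.Timestamp('2018-05-31 12:30:45')
# a fixed-frequency interval (the month of May 2018)
p = pd.Period('2018-05', freq='M')
# an exact duration
td = pd.Timedelta(seconds=35.27)
print(ts, p, td)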
The electricity dataset
For this tutorial, we will be making use of some time series data. Specifically, we will be using the electricity
dataset.
The data is taken from "Open Power System Data" and modified for this tutorial.
The file contains data about electric power consumption and generation in the United Kingdom. The data is recorded on an hourly basis from 11th May 2018 04:00 to 31st May 2018 23:00.
Attribution: Open Power System Data. 2018. Data Package Time series. Version 2018-06-30. https://doi.org/10.25832/time_series/2018-06-30
Variable descriptions and data types:
- utc_timestamp: Start of the time period, in Coordinated Universal Time
- Britain_hourly_load: Total load in the United Kingdom (in MW)
- Britain_solar: Solar power generation in the United Kingdom (in MW)
- Britain_wind: Wind power generation in the United Kingdom (in MW)
Loading Datetime data
Let’s load the electricity
dataset and get a glimpse of it using the head()
method.
load_commonlounge_dataset("electricity")
edata = pd.read_csv("/tmp/electricity.csv")
print(edata.head())
We can see that the utc_timestamp
column contains date-time data.
Let us see the dtypes of the columns with the info()
method:
edata.info()
We can see that the utc_timestamp column has the dtype object. As CSV files can only have numbers and strings, we have to explicitly convert the data to Timestamps ourselves.
Convert to datetime using to_datetime()
We can convert the utc_timestamp
column to Timestamp
using the to_datetime()
function. The syntax is as follows:
var = pandas.to_datetime(arg)
arg here is the argument we pass, which we want to convert to a Timestamp.
The type of var returned depends on the argument passed to to_datetime:
- For a scalar, it returns a Timestamp.
- For a Series, it returns a Series of dtype datetime64 (this allows Pandas to take advantage of vectorized operations over the underlying NumPy arrays).
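A small illustrative sketch of the two cases (the example values here are made up):
# scalar input: returns a single Timestamp
print(pd.to_datetime('2018-05-31 12:30:45'))
# Series input: returns a Series of dtype datetime64
print(pd.to_datetime(pd.Series(['2018-05-11 04:00:00', '2018-05-11 05:00:00'])))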
Let us convert the column utc_timestamp from dtype object to datetime64:
edata.utc_timestamp = pd.to_datetime(edata.utc_timestamp)
Let us check if the dtype has changed.
print(edata.info())
As we can see, it is now of dtype datetime64!
To convert datetime data from strings to Timestamps, Pandas has to figure out the format in which the data is written in the input file.
By default, Pandas infers the format for each element separately. If all the elements in our column have the same format, then we can considerably speed up the conversion by explicitly specifying the format.
For our data, the format is '%Y-%m-%d %H:%M:%S'. The specification for format is shared with the Python datetime module's strptime() and strftime() functions, which are used for date-time conversion in Python. The most commonly used codes include %Y (four-digit year), %m (month as a zero-padded number), %d (day of the month), %H (hour on a 24-hour clock), %M (minute) and %S (second).
We can specify the format
parameter with the following syntax:
series = pandas.to_datetime(arg, format='%Y-%m-%d %H:%M:%S')
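For instance, here is a small sketch with a hypothetical Series of strings written in the same format as our data:
# hypothetical example strings, in the same format as utc_timestamp
raw = pd.Series(['2018-05-11 04:00:00', '2018-05-11 05:00:00'])
print(pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S'))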
parse_dates parameter in read_csv()
We can also parse the datetime data using the parse_dates parameter of the read_csv() function. To do this, we pass the list of column names (or column positions) to parse as datetime to the parse_dates parameter.
dataframe = pandas.read_csv("path_to_file", parse_dates=["col1", "col2", "col3"])
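For example, a sketch of loading our dataset this way (edata2 is just an illustrative name):
# parse utc_timestamp as datetime while reading the CSV
edata2 = pd.read_csv("/tmp/electricity.csv", parse_dates=["utc_timestamp"])
print(edata2.dtypes)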
DatetimeIndex
Pandas supports an index structure called DatetimeIndex, which is based on the datetime64 data type. This index allows us to use Timestamp data to identify rows, and to slice with loc.
To set a column as the index, we will use the set_index method of the Pandas DataFrame.
DataFrame.set_index("col_label")
set_index
sets an existing column as the new index for the DataFrame. The argument is the column label of the column we want as the new index.
Let us do it for our DataFrame:
edata.set_index("utc_timestamp", inplace=True)
If the inplace parameter is True, then the original DataFrame will itself be modified, instead of creating and returning a new DataFrame.
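If we had not used inplace=True, an equivalent sketch would be to assign the returned DataFrame back (shown here on a hypothetical freshly loaded copy, since our edata has already been re-indexed):
# non-inplace alternative: set_index returns a new DataFrame
edata_fresh = pd.read_csv("/tmp/electricity.csv", parse_dates=["utc_timestamp"])
edata_fresh = edata_fresh.set_index("utc_timestamp")
print(edata_fresh.index[:3])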
Let us have a look at our data with the DatetimeIndex:
print(edata.head())
Attributes of DatetimeIndex
Finding trends and patterns over time is an essential part of exploring and modelling time-series data. Some of the attributes of DatetimeIndex help us do that.
Suppose we are interested in knowing the electricity consumption level at 12pm every day. To do this, we can use the hour attribute of DatetimeIndex. This returns only the hours of the datetime values as an integer index.
The DatetimeIndex can be accessed with DataFrame.index, and the hours can be accessed as DataFrame.index.hour.
Let us look at what the attribute returns:
print(edata.index.hour)
Let's use this to select all the rows at 12pm every day:
# select data for 12pm everyday
subset = edata[edata.index.hour==12]
# display first few results
print(subset.head())
We could use other attributes like:
- day: day of the month as an integer
- weekday: day of the week as an integer (Monday is 0, Tuesday is 1, …, Sunday is 6)
- month: month of the year as an integer
- year: year as an integer
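For example, a small sketch using the weekday attribute to pick out the rows that fall on weekends:
# select all rows that fall on a weekend (Saturday is 5, Sunday is 6)
weekend = edata[edata.index.weekday >= 5]
print(weekend.head())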
We can also use loc to slice data by date and time. For example, if we want all the data from 1pm to 11pm on 11th May 2018, we could get it as follows:
print(edata.loc['2018-05-11 13:00:00':'2018-05-11 23:00:00'])
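loc with a DatetimeIndex also accepts partial date strings; as a quick sketch, passing just a date selects every row from that day:
# all rows from 11th May 2018
print(edata.loc['2018-05-11'].head())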
Resampling
Our current data is stored on an hourly basis. From time to time, we might need to get data according to different frequencies (daily, weekly, etc).
Changing the frequency is called resampling. It can go in one of two directions:
- downsample: resample to a lower frequency. For example, calculating the daily total or average consumption from the hourly consumption.
- upsample: resample to a higher frequency. This is much rarer, as the increased frequency creates extra points in time for which we don't have values. These values are then filled in by methods like forward fill or backward fill, similar to imputation techniques.
Resampling is done with the resample()
function. The syntax is:
DataFrame.resample(rule)
The following are some of the arguments that can be passed to the rule
parameter, to specify the new frequency.
- 'H': hour
- 'D': day
- 'M': calendar month end
- 'A': year end
Note: These strings map to DateOffset objects, which denote durations of time, similar to timedelta, but with standard calendar rules. For example, a one-month increment by DateOffset from 5th January will be 5th February and then 5th March, irrespective of whether the month contains 28, 29, 30 or 31 days.
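A brief sketch of that calendar-aware behaviour:
# a one-month DateOffset moves to the same day of the next month
print(pd.Timestamp('2018-01-05') + pd.DateOffset(months=1))  # 2018-02-05
print(pd.Timestamp('2018-02-05') + pd.DateOffset(months=1))  # 2018-03-05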
For downsampling, we can apply an aggregate function like mean(), median() or sum() to fill the columns with the aggregate values we want. For upsampling, we can specify a filling method for the NaN values.
Let us take an example of downsampling to daily frequency from our original hourly frequency, with the aggregator function mean(). This will give us the daily mean electricity consumption levels, and wind and solar generation levels.
# downsampling to daily frequency
resampled_df = edata.resample('D').mean()
# display first few results
print(resampled_df.head())
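For completeness, here is a small sketch of upsampling our hourly data to a half-hourly frequency, forward-filling the newly created rows:
# upsampling to half-hourly frequency with forward fill
upsampled_df = edata.resample('30min').ffill()
print(upsampled_df.head())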
Rolling Windows
Similar to resampling, a rolling window aggregates the data in each window with a function such as mean(), median(), sum(), etc.
However, unlike resampling where the time intervals don’t overlap, rolling windows overlap and “roll” along at the same frequency as the data. Thus the new DataFrame has the same frequency as the original one.
The syntax is:
DataFrame.rolling(window)
The window
parameter decides how many rows to include in each bin.
- It could be an integer to specify the exact number of rows, or
- It could be a frequency string - 'H', 'D', 'M' - denoting hourly, daily, monthly, etc.
Let us see an example where we take a rolling window of 48 hours (since our data is hourly, that means a window of 48 rows). We will use the aggregate function mean() for each window and display rows 50 to 60 (as the first 47 rows will have NaN values).
rolling_mean = edata.rolling(48).mean()
print(rolling_mean.iloc[50:60,:])
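Since our DataFrame has a DatetimeIndex, the window can also be given as a time offset. A sketch of the equivalent 48-hour window (note that with a time-based window, min_periods defaults to 1, so the early rows are not NaN):
# rolling window specified as a time offset instead of a row count
rolling_mean_time = edata.rolling('48H').mean()
print(rolling_mean_time.iloc[50:60, :])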
Summary
- The data type of variables can be converted to Timestamp using the to_datetime() function. It converts scalars to Timestamp and Series to datetime64.
- The parse_dates parameter in read_csv() allows us to convert a column to datetime64 while reading the file.
- DatetimeIndex is a special index structure based on the datetime64 data type. It has helpful attributes like year, month, weekday, day, hour, etc.
- We can resample time-series data using resample(), with specific frequencies such as hourly, daily, monthly, etc.
- The rolling() function allows aggregate calculations on a rolling window of rows.