5.4. Time series#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

5.4.1. Intro#

pandas has extensive support for handling dates and times.

We will loosely refer to data with date or time information as time series data.

In this lecture, we will cover the most useful parts of pandas’ time series functionality.

Among these topics are:

  • Parsing strings as dates

  • Writing datetime objects as strings (the inverse operation of the previous point)

  • Extracting data from a DataFrame or Series with date information in the index

  • Shifting data through time (taking leads or lags)

  • Re-sampling data to a different frequency and rolling operations

However, even more than with previous topics, we will skip a lot of the functionality pandas offers, and we urge you to refer to the official documentation for more information.

5.4.2. Parsing Strings as Dates#

When working with time series data, we almost always receive the data with dates encoded as strings.

Hopefully, the date strings follow a structured format or pattern.

One common pattern is YYYY-MM-DD: 4 numbers for the year, 2 for the month, and 2 for the day with each section separated by a -.

For example, we write Christmas day 2017 in this format as

christmas_str = "2017-12-25"

To convert a string into a time-aware object, we use the pd.to_datetime function.

christmas = pd.to_datetime(christmas_str)
print("The type of christmas is", type(christmas))
christmas
The type of christmas is <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Timestamp('2017-12-25 00:00:00')

The pd.to_datetime function is pretty smart at guessing the format of the date…

for date in ["December 25, 2017", "Dec. 25, 2017",
             "Monday, Dec. 25, 2017", "25 Dec. 2017", "25th Dec. 2017"]:
    print("pandas interprets {} as {}".format(date, pd.to_datetime(date)))
pandas interprets December 25, 2017 as 2017-12-25 00:00:00
pandas interprets Dec. 25, 2017 as 2017-12-25 00:00:00
pandas interprets Monday, Dec. 25, 2017 as 2017-12-25 00:00:00
pandas interprets 25 Dec. 2017 as 2017-12-25 00:00:00
pandas interprets 25th Dec. 2017 as 2017-12-25 00:00:00

However, sometimes we will need to give pandas a hint.

For example, that same time (midnight on Christmas) would be reported on an Amazon transaction report as

christmas_amzn = "2017-12-25T00:00:00+ 00 :00"

If we try to pass this to pd.to_datetime, it will fail.

pd.to_datetime(christmas_amzn)

To parse a date with this format, we need to specify the format argument for pd.to_datetime.

amzn_strftime = "%Y-%m-%dT%H:%M:%S+ 00 :00"
pd.to_datetime(christmas_amzn, format=amzn_strftime)
Timestamp('2017-12-25 00:00:00')

Can you guess what amzn_strftime represents?

Let’s take a closer look at amzn_strftime and christmas_amzn.

print(amzn_strftime)
print(christmas_amzn)
%Y-%m-%dT%H:%M:%S+ 00 :00
2017-12-25T00:00:00+ 00 :00

Notice that both of the strings have a similar form, but that instead of actual numerical values, amzn_strftime has placeholders.

Specifically, anywhere a % shows up signals to the pd.to_datetime function where the relevant piece of information is stored.

For example, the %Y is a stand-in for a four digit year, %m is for a two digit month, and so on…

The official Python documentation contains a complete list of possible %something patterns that are accepted in the format argument.
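
The inverse operation, writing a datetime object back out as a string, uses these same placeholders through the strftime method. A minimal sketch, reusing the christmas Timestamp from above:

# format a Timestamp as a string; each % placeholder is filled in with
# the corresponding piece of the date
print(christmas.strftime("%Y-%m-%d"))        # 2017-12-25
print(christmas.strftime("%A, %b. %d, %Y"))  # Monday, Dec. 25, 2017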

5.4.2.1. Multiple Dates#

If we have dates in a Series (e.g., a column of a DataFrame) or a list, we can pass the entire collection to pd.to_datetime and get a collection of dates back.

We’ll just show an example of that here as the mechanics are the same as a single date.

pd.to_datetime(["2017-12-25", "2017-12-31"])
DatetimeIndex(['2017-12-25', '2017-12-31'], dtype='datetime64[ns]', freq=None)
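
By default, pd.to_datetime raises an error if any entry in the collection cannot be parsed. A small sketch of the errors argument, which can instead mark bad entries as NaT (pandas' "not a time" value):

# unparseable strings become NaT rather than raising an error
pd.to_datetime(["2017-12-25", "not a date", "2017-12-31"], errors="coerce")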

5.4.3. Extracting Data#

When the index of a DataFrame has date information and pandas recognizes the values as datetime values, we can leverage some convenient indexing features for extracting data.

The flexibility of these features is best understood through example, so let’s load up some data and take a look.


Below, we use the pandas-datareader API, which allows us to download Alpha Vantage finance data for free (see the pandas-datareader documentation for details).

You can apply for a free API key here: https://www.alphavantage.co/support/#api-key

There is a wide range of different data APIs for Python. Later, we will discuss a few more that we will use.

import pandas_datareader.data as pdr
import datetime as dt

# yahoo, not currently working
# df = pdr.DataReader('GME', 'yahoo', start='2019-09-10', end='2019-10-09')

# alpha vantage, requires an api key
gme = pdr.DataReader("GME", "av-daily", start=dt.datetime(2013, 1, 2),
                     end=dt.datetime(2021, 11, 1),
                     api_key='N78MZQUK4ZCDUABU')


gme.tail()
open high low close volume
2021-10-26 173.36 185.00 172.50 177.84 2176749
2021-10-27 180.00 183.09 172.33 173.51 1106998
2021-10-28 175.16 183.14 175.00 182.85 1696206
2021-10-29 182.81 185.75 178.00 183.51 2293970
2021-11-01 182.53 208.57 182.05 200.09 4910802

Let's start by getting prices back in 2015.

# gme.loc["2015"]  # uncommented, this line raises an error. Why?

What went wrong?

gme.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2225 entries, 2013-01-02 to 2021-11-01
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    2225 non-null   float64
 1   high    2225 non-null   float64
 2   low     2225 non-null   float64
 3   close   2225 non-null   float64
 4   volume  2225 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 104.3+ KB

Let's make sure that the index is interpreted as dates.

gme.index = pd.to_datetime(gme.index)
gme.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2225 entries, 2013-01-02 to 2021-11-01
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    2225 non-null   float64
 1   high    2225 non-null   float64
 2   low     2225 non-null   float64
 3   close   2225 non-null   float64
 4   volume  2225 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 104.3 KB

Here, we have GameStop stock prices from 2013 until today.

Notice that the type of the index is now DatetimeIndex.

This is the key that enables things like…

Extracting all data for the year 2015 by passing "2015" to .loc.

gme.loc["2015"]
open high low close volume
2015-01-02 34.06 34.1600 33.2500 33.80 1612946
2015-01-05 33.52 34.8800 33.3300 34.72 4940754
2015-01-06 35.17 36.0300 33.5200 33.69 4885107
2015-01-07 34.29 34.6800 32.9800 33.30 2558324
2015-01-08 33.60 34.1700 33.3200 33.69 4547436
... ... ... ... ... ...
2015-12-24 28.62 28.8000 28.2799 28.37 868965
2015-12-28 28.37 28.9400 28.1400 28.47 2325435
2015-12-29 28.62 28.8000 28.3268 28.43 2041056
2015-12-30 28.37 28.9399 28.2400 28.48 1438398
2015-12-31 28.43 28.8600 28.0300 28.04 1696030

252 rows × 5 columns

We can also narrow down to specific months.

# By month's name
gme.loc["August 2017"]
open high low close volume
2017-08-01 21.69 21.690 21.1400 21.36 3059060
2017-08-02 21.32 21.620 21.2718 21.38 1114494
2017-08-03 21.32 21.820 21.3200 21.66 1047946
2017-08-04 21.78 22.110 21.6874 21.93 1136118
2017-08-07 21.94 22.255 21.8250 22.17 1276275
2017-08-08 22.20 22.370 21.9000 21.94 1422533
2017-08-09 21.75 22.120 21.6200 22.08 1502404
2017-08-10 21.91 21.910 21.4600 21.47 1517745
2017-08-11 21.24 21.830 21.0700 21.76 1616078
2017-08-14 21.95 21.990 21.7150 21.85 1532902
2017-08-15 21.70 21.770 20.7800 20.93 3886017
2017-08-16 21.12 21.490 21.0800 21.39 2412361
2017-08-17 21.37 21.660 21.0000 21.10 1928060
2017-08-18 21.05 21.330 20.9700 21.22 2117015
2017-08-21 21.16 21.360 20.9700 20.98 1633080
2017-08-22 21.12 21.690 21.0900 21.64 2468755
2017-08-23 21.59 21.820 21.2100 21.48 2428181
2017-08-24 21.75 22.120 21.6500 21.78 3669710
2017-08-25 19.90 19.940 18.7200 19.40 20500700
2017-08-28 19.41 19.700 18.9600 19.15 5213366
2017-08-29 18.93 19.070 18.6400 18.75 4663522
2017-08-30 18.75 18.970 18.6800 18.79 2209062
2017-08-31 18.83 18.870 18.4700 18.50 2625788
# By month's number
gme.loc["08/2017"]
open high low close volume
2017-08-01 21.69 21.690 21.1400 21.36 3059060
2017-08-02 21.32 21.620 21.2718 21.38 1114494
2017-08-03 21.32 21.820 21.3200 21.66 1047946
2017-08-04 21.78 22.110 21.6874 21.93 1136118
2017-08-07 21.94 22.255 21.8250 22.17 1276275
2017-08-08 22.20 22.370 21.9000 21.94 1422533
2017-08-09 21.75 22.120 21.6200 22.08 1502404
2017-08-10 21.91 21.910 21.4600 21.47 1517745
2017-08-11 21.24 21.830 21.0700 21.76 1616078
2017-08-14 21.95 21.990 21.7150 21.85 1532902
2017-08-15 21.70 21.770 20.7800 20.93 3886017
2017-08-16 21.12 21.490 21.0800 21.39 2412361
2017-08-17 21.37 21.660 21.0000 21.10 1928060
2017-08-18 21.05 21.330 20.9700 21.22 2117015
2017-08-21 21.16 21.360 20.9700 20.98 1633080
2017-08-22 21.12 21.690 21.0900 21.64 2468755
2017-08-23 21.59 21.820 21.2100 21.48 2428181
2017-08-24 21.75 22.120 21.6500 21.78 3669710
2017-08-25 19.90 19.940 18.7200 19.40 20500700
2017-08-28 19.41 19.700 18.9600 19.15 5213366
2017-08-29 18.93 19.070 18.6400 18.75 4663522
2017-08-30 18.75 18.970 18.6800 18.79 2209062
2017-08-31 18.83 18.870 18.4700 18.50 2625788

Or even a day…

# By date name
gme.loc["August 1, 2017"]
open           21.69
high           21.69
low            21.14
close          21.36
volume    3059060.00
Name: 2017-08-01 00:00:00, dtype: float64
# By date number
gme.loc["08-01-2017"]
open           21.69
high           21.69
low            21.14
close          21.36
volume    3059060.00
Name: 2017-08-01 00:00:00, dtype: float64

What can we pass as the .loc argument when we have a DateTimeIndex?

Anything that can be converted to a datetime using pd.to_datetime, without having to specify the format argument.

When that condition holds, pandas will return all rows whose date in the index “belongs” to that date or period.

We can also use the range shorthand notation to give a start and end date for selection.

gme.loc["April 1, 2015":"April 10, 2015"]
open high low close volume
2015-04-01 37.86 38.50 37.330 37.78 2464874
2015-04-02 38.06 38.81 37.880 38.20 1172602
2015-04-06 38.00 38.39 37.680 38.37 1483853
2015-04-07 38.37 38.50 37.915 37.98 1332103
2015-04-08 37.80 38.78 37.740 38.71 1649946
2015-04-09 38.76 40.00 38.350 39.99 1912082
2015-04-10 40.18 40.41 39.560 40.40 1268980
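
The endpoints of the range do not need to be exact dates that appear in the index; partial strings work here as well. A short sketch:

# everything from the start of December 2015 through the end of January 2016
gme.loc["2015-12":"2016-01"]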

5.4.4. Accessing Date Properties#

Sometimes, we would like to directly access a part of the date/time.

If our date/time information is in the index, we can use df.index.XX, where XX is replaced by year, month, or whatever else we would like to access.

gme.index.year
Int64Index([2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,
            ...
            2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021],
           dtype='int64', length=2225)
gme.index.day
Int64Index([ 2,  3,  4,  7,  8,  9, 10, 11, 14, 15,
            ...
            19, 20, 21, 22, 25, 26, 27, 28, 29,  1],
           dtype='int64', length=2225)

We can also do the same if the date/time information is stored in a column, but we have to use a slightly different syntax.

df["column_name"].dt.XX
gme_date_column = gme.reset_index()
gme_date_column.head()
index open high low close volume
0 2013-01-02 25.58 25.870 24.99 25.66 4898100
1 2013-01-03 25.58 25.685 23.92 24.36 10642800
2 2013-01-04 24.42 25.170 24.36 24.80 4040000
3 2013-01-07 24.66 24.920 23.87 24.75 3983900
4 2013-01-08 22.93 23.960 22.66 23.19 11131200
gme_date_column["index"].dt.year.head()
0    2013
1    2013
2    2013
3    2013
4    2013
Name: index, dtype: int64
gme_date_column["index"].dt.month.head()
0    1
1    1
2    1
3    1
4    1
Name: index, dtype: int64
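
Beyond numeric components like year and month, there are convenience methods as well. For instance, a short sketch using day_name, which works both on a DatetimeIndex and through the .dt accessor:

# the weekday name of each observation
print(gme.index.day_name()[:3])
gme_date_column["index"].dt.day_name().head(3)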

5.4.5. Leads and Lags: df.shift#

When doing time series analysis, we often want to compare data at one date against data at another date.

pandas can help us with this if we leverage the shift method.

Without any additional arguments, shift() will move all data forward one period, filling the first row with missing data.

# so we can see the result of shift clearly
gme.head()
open high low close volume
2013-01-02 25.58 25.870 24.99 25.66 4898100
2013-01-03 25.58 25.685 23.92 24.36 10642800
2013-01-04 24.42 25.170 24.36 24.80 4040000
2013-01-07 24.66 24.920 23.87 24.75 3983900
2013-01-08 22.93 23.960 22.66 23.19 11131200
gme.shift().head()
open high low close volume
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 25.58 25.870 24.99 25.66 4898100.0
2013-01-04 25.58 25.685 23.92 24.36 10642800.0
2013-01-07 24.42 25.170 24.36 24.80 4040000.0
2013-01-08 24.66 24.920 23.87 24.75 3983900.0

We can use this to compute the percent change from one day to the next. (Quiz: Why does that work? Remember how pandas uses the index to align data.)

((gme.close - gme.close.shift()) / gme.close.shift()).head()
2013-01-02         NaN
2013-01-03   -0.050663
2013-01-04    0.018062
2013-01-07   -0.002016
2013-01-08   -0.063030
Name: close, dtype: float64
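
This shift-and-divide pattern is common enough that pandas provides a shortcut, pct_change, which should give the same answer as the manual computation above:

# one period percent change, equivalent to (x - x.shift()) / x.shift()
gme.close.pct_change().head()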

Setting the first argument to n tells pandas to shift the data down n rows (apply an n period lag).

gme.shift(3).head()
open high low close volume
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 NaN NaN NaN NaN NaN
2013-01-04 NaN NaN NaN NaN NaN
2013-01-07 25.58 25.870 24.99 25.66 4898100.0
2013-01-08 25.58 25.685 23.92 24.36 10642800.0

A negative value will shift the data up or apply a lead.

gme.shift(-2).tail()
open high low close volume
2021-10-26 175.16 183.14 175.00 182.85 1696206.0
2021-10-27 182.81 185.75 178.00 183.51 2293970.0
2021-10-28 182.53 208.57 182.05 200.09 4910802.0
2021-10-29 NaN NaN NaN NaN NaN
2021-11-01 NaN NaN NaN NaN NaN
gme.shift(-2).head()
open high low close volume
2013-01-02 24.42 25.17 24.36 24.80 4040000.0
2013-01-03 24.66 24.92 23.87 24.75 3983900.0
2013-01-04 22.93 23.96 22.66 23.19 11131200.0
2013-01-07 23.12 23.32 22.52 22.61 4331900.0
2013-01-08 22.81 22.88 22.40 22.79 3900300.0

5.4.6. Rolling Computations: .rolling#

pandas has facilities that enable easy computation of rolling statistics.

These are best understood by example, so we will dive right in.

# first take only the first 6 rows so we can easily see what is going on
gme_small = gme.head(6)
gme_small
open high low close volume
2013-01-02 25.58 25.870 24.99 25.66 4898100
2013-01-03 25.58 25.685 23.92 24.36 10642800
2013-01-04 24.42 25.170 24.36 24.80 4040000
2013-01-07 24.66 24.920 23.87 24.75 3983900
2013-01-08 22.93 23.960 22.66 23.19 11131200
2013-01-09 23.12 23.320 22.52 22.61 4331900

Below, we compute the 2 day moving average (for all columns).

gme_small.rolling("2d").mean()
open high low close volume
2013-01-02 25.580 25.8700 24.990 25.66 4898100.0
2013-01-03 25.580 25.7775 24.455 25.01 7770450.0
2013-01-04 25.000 25.4275 24.140 24.58 7341400.0
2013-01-07 24.660 24.9200 23.870 24.75 3983900.0
2013-01-08 23.795 24.4400 23.265 23.97 7557550.0
2013-01-09 23.025 23.6400 22.590 22.90 7731550.0

To do this operation, pandas starts at each row (date), looks backwards over the specified window (here, 2 days), and then applies the aggregation function (here, the mean) to all the data in that window.

If pandas cannot look back the full length of the window (e.g. when working on the first row), it fills as much of the window as possible and then does the operation. Notice that the value at 2013-01-02 is the same in both DataFrames.

Below, we see a visual depiction of the rolling maximum on a 21 day window for the whole dataset.

fig, ax = plt.subplots(figsize=(10, 4))
gme["open"].plot(ax=ax, linestyle="--", alpha=0.8)
gme.rolling("21d").max()["open"].plot(ax=ax, alpha=0.8, linewidth=3)
ax.legend(["Original", "21 day max"])
[Figure: the "Original" open price series (dashed) with its "21 day max" rolling maximum]

We can also ask pandas to apply custom functions, similar to what we saw when studying GroupBy.

def is_volatile(x):
    "Returns a 1 if the variance is greater than 1, otherwise returns 0"
    if x.var() > 1.0:
        return 1.0
    else:
        return 0.0
gme_small.open.rolling("21d").apply(is_volatile)
2013-01-02    0.0
2013-01-03    0.0
2013-01-04    0.0
2013-01-07    0.0
2013-01-08    1.0
2013-01-09    1.0
Name: open, dtype: float64
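
One detail worth flagging: the windows above ("2d", "21d") are time offsets, which is why the first row could be computed from a partially filled window. If we pass an integer instead, the window counts rows, and by default pandas reports missing values until the window is full. A small sketch of the contrast:

# a 2 row window: the first entry is NaN because the window is incomplete
gme_small["close"].rolling(2).mean()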

5.4.7. Changing Frequencies: .resample (Opt)#

In addition to computing rolling statistics, we can also change the frequency of the data.

For example, instead of a monthly moving average, suppose that we wanted to compute the average within each calendar month.

We will use the resample method to do this.

Below are some examples.

# business quarter
gme.resample("BQ").mean()
open high low close volume
2013-03-29 24.670000 25.070072 24.251587 24.744000 3.315910e+06
2013-06-28 35.254844 36.018991 34.700516 35.398438 4.000192e+06
2013-09-30 47.824062 48.547236 47.145555 47.809062 2.185230e+06
2013-12-31 51.333750 51.968745 50.496134 51.277344 2.739705e+06
2014-03-31 38.386230 39.046482 37.721043 38.420984 4.613946e+06
2014-06-30 39.208889 39.785459 38.583568 39.152540 2.880822e+06
2014-09-30 42.605937 43.126973 41.989127 42.523750 2.490632e+06
2014-12-31 38.776094 39.396766 38.087609 38.687188 3.475835e+06
2015-03-31 37.692623 38.290061 36.998803 37.606721 2.332886e+06
2015-06-30 41.116825 41.542130 40.632362 41.114603 1.663396e+06
2015-09-30 44.426305 44.998609 43.783681 44.450625 1.832850e+06
2015-12-31 38.615391 39.205953 37.986083 38.613125 2.435190e+06
2016-03-31 28.679016 29.254346 28.027015 28.744754 2.692138e+06
2016-06-30 29.325000 29.673306 28.911500 29.273594 2.996190e+06
2016-09-30 29.172656 29.548209 28.799633 29.147344 2.647624e+06
2016-12-30 24.722857 25.053348 24.372424 24.693333 2.621204e+06
2017-03-31 24.392258 24.710763 24.092465 24.429194 3.041910e+06
2017-06-30 22.591746 22.869237 22.268905 22.570476 2.832941e+06
2017-09-29 20.702540 20.939829 20.493956 20.702857 2.322212e+06
2017-12-29 18.730635 18.973173 18.430000 18.643333 3.169164e+06
2018-03-30 16.367213 16.623770 16.079672 16.316393 4.207486e+06
2018-06-29 13.661250 13.944219 13.407031 13.701719 5.471269e+06
2018-09-28 15.309524 15.593652 15.021032 15.326667 3.359851e+06
2018-12-31 14.017778 14.324886 13.716952 13.990794 2.968817e+06
2019-03-29 12.418033 12.644918 12.207377 12.424098 3.756533e+06
2019-06-28 7.830635 7.967460 7.684640 7.817937 5.591152e+06
2019-09-30 4.319062 4.471875 4.191562 4.316875 8.059061e+06
2019-12-31 5.834687 6.021562 5.693594 5.868125 4.436861e+06
2020-03-31 4.289677 4.469355 4.119355 4.280484 3.827072e+06
2020-06-30 4.613016 4.846032 4.385873 4.612540 4.015092e+06
2020-09-30 5.728438 6.034587 5.519947 5.780156 6.683569e+06
2020-12-31 13.727656 14.500636 13.175869 13.763750 1.206631e+07
2021-03-31 122.863877 142.475803 101.349221 118.085574 4.424763e+07
2021-06-30 193.948413 204.396471 184.579965 194.032698 7.838682e+06
2021-09-30 182.931766 189.845989 176.869534 182.793438 3.118228e+06
2021-12-31 178.409545 184.079082 174.906223 179.182727 1.851492e+06

Note that unlike with rolling, a single number is returned for each column for each quarter.

The resample method will alter the frequency of the data and the number of rows in the result will be different from the number of rows in the input.

On the other hand, with rolling, the size and frequency of the result are the same as the input.
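
We can verify this claim by comparing shapes; a quick sketch:

# rolling preserves the number of rows, while resample changes it
print(gme.shape)
print(gme.rolling("21d").mean().shape)
print(gme.resample("BQ").mean().shape)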

We can resample at other frequencies and aggregate with multiple aggregation functions at once.

# multiple aggregation functions at a two-business-quarter (start) frequency
gme.resample("2BQS").agg(["min", "max"])
open high low close volume
min max min max min max min max min max
2013-01-01 22.68 41.41 22.8800 42.8400 22.30 41.1200 22.61 42.03 1203200 13867200
2013-07-01 42.00 57.68 42.3400 57.7400 41.66 56.9200 42.17 57.59 626000 14606000
2014-01-01 33.67 49.71 34.3800 50.0000 33.10 49.0700 33.83 49.65 1317800 23506700
2014-07-01 32.11 46.06 32.7300 46.5900 31.81 45.8600 31.92 46.10 1102800 18759780
2015-01-01 32.55 44.58 33.7100 45.5000 31.69 44.1624 32.27 44.55 747201 8682832
2015-07-01 28.37 47.62 28.8000 47.8250 27.90 46.9000 28.04 47.44 746129 16927011
2016-01-01 25.06 33.20 25.4500 33.7200 24.33 33.1000 25.06 33.38 1264704 22220471
2016-07-01 20.59 32.13 20.9200 32.6700 20.10 31.8300 20.73 32.16 1031652 14493353
2017-01-02 20.33 26.33 20.6600 26.6799 20.24 26.0200 20.46 26.52 1135065 15943366
2017-07-03 16.01 22.20 16.3800 22.3700 15.85 21.9000 16.00 22.17 972854 20500700
2018-01-01 12.51 19.92 12.7500 20.3100 12.20 19.7700 12.46 19.96 1840539 25823988
2018-07-02 11.72 16.98 12.0200 17.2700 11.56 16.6200 11.67 17.04 1289301 17471600
2019-01-01 4.98 16.02 5.1300 16.9000 4.71 15.8800 5.02 15.98 1282019 39354238
2019-07-01 3.25 6.65 3.3600 6.9200 3.15 6.3900 3.21 6.68 1368983 33980425
2020-01-01 2.85 6.21 2.9400 6.4700 2.57 6.0700 2.80 6.31 1467612 13491454
2020-07-01 3.90 21.31 4.0490 22.3499 3.77 20.3500 3.85 20.99 1330148 77152780
2021-01-01 17.34 379.71 18.0765 483.0000 17.08 291.5100 17.25 347.51 1789186 194084194
2021-07-01 148.00 224.00 156.5000 231.4390 145.22 211.4600 146.80 218.24 818207 14652209

As with groupby and rolling, you can also provide custom functions to .resample(...).agg and .resample(...).apply.

For example, suppose that, for each month, we need to compute the maximum value of the close price on any day minus the open price on the first day of that month.
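
One way to approach this is sketched below, combining resample at a month-start ("MS") frequency with a dictionary of aggregations:

# for each month: the highest close on any day minus the open on the first day
monthly = gme.resample("MS").agg({"close": "max", "open": "first"})
(monthly["close"] - monthly["open"]).head()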