3.4.3. The Index#

import pandas as pd
import numpy as np

3.4.3.1. So What is this Index?#

Every Series or DataFrame has an index.

We told you that the index was the “row labels” for the data.

This is true, but an index in pandas does much more than label the rows.

The purpose of this lecture is to understand the importance of the index.

The pandas documentation says

Data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.

In practice, the index and column names are used to make sure the data is properly aligned when operating on multiple DataFrames.

This is a somewhat abstract concept that is best understood by example…

Let’s begin by loading our stock return data and break it in smaller data sets

from datetime import datetime

# Create a date parser function that will allow pandas to read the dates in the format we have them
date_parser = lambda x: datetime.strptime(x, "%Y%m")
# our 50 stock returns data

url = "https://raw.githubusercontent.com/amoreira2/Fin418/main/assets/data/Retuns50stocks.csv"

# Use pd.read_csv with the date_parser
df  = pd.read_csv(url, parse_dates=['date'], date_parser=date_parser)
# Set the date column as the index

We’ll also extract a couple smaller DataFrames we can use in examples.

df0 = df.iloc[:5,1:5]
df0
CTL T CSCO FCX
0 -18.4697 -11.5513 2.2170 -17.4556
1 -12.9450 -11.2245 20.7192 -21.1470
2 10.5502 10.6732 16.9740 -12.2727
3 -34.0067 4.6083 -10.3274 -20.2073
4 10.3980 -0.2853 -17.8724 -4.5455
df1 = df.iloc[3:8,1:5]
df1
CTL T CSCO FCX
3 -34.0067 4.6083 -10.3274 -20.2073
4 10.3980 -0.2853 -17.8724 -4.5455
5 6.4815 -1.0014 11.6356 0.6803
6 1.9565 -1.0029 2.9499 -2.7027
7 -1.5437 -1.9090 4.8711 9.0278

Note that the dates now overlap but the overlap is not perfect

Note also that both have five rows

What happens when we try to add these two dataframes?

#df0+df1

What is happening? Why is that useful to us?

Note now with non-overlapping columns

df0 = df.iloc[:5,1:5]

df1 = df.iloc[3:8,2:6]

#df0+df1

3.4.3.1.1. Automatic Alignment#

For all (row, column) combinations that appear in both DataFrames, the value of df0+df1 is equal to df0.loc[row, col] + df1.loc[row, col].

This happened even though the rows and columns were not in the same order.

We refer to this as pandas aligning the data for us.

To see how awesome this is, think about how to do something similar in Excel:

  • df0 and df1 would be in different sheets.

  • The index and column names would be the first column and row in each sheet.

  • For each label in the first row and column of either the df0 sheet or the df1 sheet we would have to do a IFELSE to check if the label exists in the other sheet and then a VLOOKUP to extract the value.

In pandas, this happens automatically, behind the scenes, and very quickly.

3.4.3.1.2. Handling Missing Data#

For all elements in the non-overlapping row/column combination, the value in df0+df1 is NaN.

This is how pandas represents missing data.

So, when pandas was trying to look up the values in df0 and df1, it could only find a value in one DataFrame: the other value was missing.

When pandas tries to add a number to something that is missing, it says that the result is missing (spelled NaN).

3.4.3.2. Setting the Index#

For a DataFrame df, the df.set_index method allows us to use one (or more) of the DataFrame’s columns as the index.

Here’s an example.

# first, create the DataFrame
df_date = df.set_index(["date"])
df_date.head()
CTL T CSCO FCX XL IVZ AMT WHR IR WFT ... SWK DVN TMO PEP LNC EMR MLM CCI NU Market
date
2000-01-01 -18.4697 -11.5513 2.2170 -17.4556 -12.5301 -4.0929 17.3824 -10.4707 -14.5289 -5.9468 ... -16.5975 6.8441 15.4167 -3.1915 -6.9312 -4.0305 2.4390 -1.5564 -0.3040 -3.9612
2000-02-01 -12.9450 -11.2245 20.7192 -21.1470 -9.8898 5.3057 37.2822 -6.1760 -18.2311 19.8003 ... -8.4577 6.0498 -9.7473 -5.8608 -25.2115 -16.6039 -15.1667 1.9763 -7.7439 3.1777
2000-03-01 10.5502 10.6732 16.9740 -12.2727 36.9397 24.4250 0.2538 7.9402 15.4976 30.0000 ... 15.6304 30.5034 30.4000 8.9805 21.2670 16.5981 33.8028 17.4419 14.2857 5.3500
2000-04-01 -34.0067 4.6083 -10.3274 -20.2073 -13.9955 0.9670 -5.6962 11.0874 6.0734 2.7153 ... 11.8483 -0.9009 -4.9080 5.1971 4.7836 3.2941 11.5789 1.3201 0.0000 -5.9530
2000-05-01 10.3980 -0.2853 -17.8724 -4.5455 25.8793 -8.7719 -20.2685 -12.4338 -2.5672 6.0000 ... -8.8983 24.2857 -4.1935 11.4140 11.3106 8.1686 -7.4198 -31.7590 3.3721 -3.8871

5 rows × 51 columns

Now the dates are caried in any manipulation we do

the return observation is bind together with the index

df0 = df_date.iloc[:5,1:5]

df0
T CSCO FCX XL
date
2000-01-01 -11.5513 2.2170 -17.4556 -12.5301
2000-02-01 -11.2245 20.7192 -21.1470 -9.8898
2000-03-01 10.6732 16.9740 -12.2727 36.9397
2000-04-01 4.6083 -10.3274 -20.2073 -13.9955
2000-05-01 -0.2853 -17.8724 -4.5455 25.8793

Now that the date is on the index, we can use .loc to extract all the data for a specific year.

df_date.loc['2001']
CTL T CSCO FCX XL IVZ AMT WHR IR WFT ... SWK DVN TMO PEP LNC EMR MLM CCI NU Market
date
2001-01-01 -12.2238 1.7880 -2.1242 36.6423 -15.1016 5.9770 -4.4224 10.0708 5.8149 3.4497 ... 10.044100 -10.1197 -0.336100 -11.0820 -4.5601 -3.5686 0.165500 3.2333 -6.886600 3.959900
2001-02-01 -8.1581 -1.3444 -36.7279 20.0855 3.0871 -13.6876 -20.0552 1.3717 -1.7829 6.4443 ... 1.398600 4.0146 -5.902200 4.5609 -2.1851 -11.4704 8.779800 -10.0671 -9.211699 -9.909901
2001-03-01 -0.0694 -6.4361 -33.2454 -7.1174 0.0789 -26.6816 -36.0746 -5.4473 -8.3968 -5.1509 ... -4.655200 2.1930 -19.426500 -4.3186 -3.1912 -7.3244 -7.072900 -41.0448 -14.803900 -7.039000
2001-04-01 -5.4609 -6.9992 7.3834 8.5057 -6.9278 25.8621 44.8649 11.5623 18.3581 17.9939 ... 10.015200 1.3917 17.259800 -0.3185 9.4066 7.5000 7.658100 65.3333 2.704300 8.383100
2001-05-01 4.8197 4.3636 13.4276 10.6638 12.9378 3.1507 -7.6866 13.3764 5.3617 -3.1942 ... 5.462100 -1.3557 5.880100 2.1684 6.6724 2.1643 7.700700 -32.2172 8.347300 1.045800
2001-06-01 6.5401 -6.9454 -5.5036 -29.4831 3.2704 -8.0212 -16.4511 -0.6201 -16.5147 -14.8483 ... 10.210500 -9.7234 -21.103500 -0.9272 5.0975 -10.6484 0.243100 -1.2048 7.848200 -1.755400
2001-07-01 2.1782 13.0460 5.6044 -2.6244 -6.5164 -7.7678 -17.9971 12.8640 6.0194 -12.1875 ... 4.083100 3.2571 3.542200 5.4977 -0.8019 -5.1901 -10.931500 -40.5488 -3.614500 -1.829400
2001-08-01 13.3721 -9.1495 -15.0364 15.0558 8.7427 -8.8917 -14.6313 -5.9257 -6.7308 -21.0676 ... -3.349400 -14.6467 -1.195700 0.7935 -2.2928 -5.8883 -10.072600 4.5128 -1.875000 -5.898000
2001-09-01 -4.4223 15.1797 -25.4133 -11.2278 -4.8193 -25.7581 -4.0083 -16.1618 -16.6872 -23.3243 ... -12.747700 -25.5457 -16.705100 3.5000 -6.4781 -12.2015 -0.987300 -11.6781 -3.948700 -9.148800
2001-10-01 -5.6716 -18.5776 38.9163 1.0009 9.9494 10.9040 -20.6623 6.6305 10.3550 34.1827 ... 4.842700 11.3372 17.119100 0.4330 -8.5246 4.1649 2.071100 30.0000 -5.766200 2.783300
2001-11-01 7.1202 -1.9155 20.8038 20.2703 7.5754 18.4874 -20.1452 11.9959 12.7614 -2.2203 ... 9.681601 -10.2089 14.830700 -0.1642 12.6328 11.0720 6.813600 -6.9231 -0.708200 7.878000
2001-12-01 -2.9586 4.7886 -11.3992 0.2996 -1.7423 2.9078 7.6136 11.5116 -0.1910 11.3236 ... 11.438100 12.5327 9.953899 0.4215 1.8239 5.6234 9.647099 -1.9284 1.321800 1.784800

12 rows × 51 columns

This would be helpful, for example, if we wanted to compute the difference in the average returns of all stocks from one year to the next.

df_date.loc['2009'].mean() - df_date.loc['2008'].mean()
CTL        5.487558
T          2.937392
CSCO       7.361342
FCX       19.844483
XL        33.775933
IVZ       10.040733
AMT        6.153350
WHR       12.306317
IR        14.456158
WFT       12.827383
YUM        2.444442
CVS        3.850175
GD         5.122642
TYC        8.857808
EL         6.362417
MUR        6.686392
CTAS       4.201033
CBSA      15.443775
SNV       -7.350500
CAM       12.429442
DLTR      -2.890100
CAH        6.473425
DTE        3.731900
SSP       24.993375
PSA       -0.423242
EXC        1.991258
TKR        4.816925
CMA        8.773100
ORCL       4.711567
MS        13.742083
RSG        3.409825
ACAS      18.108168
AGN        7.490375
MMM        6.472683
ETFC      12.750600
CAR       55.472400
MDR       20.387558
NOV       12.799792
PCH        4.852950
BAX        1.485500
JCI        9.846317
SWK        6.712133
DVN        3.786125
TMO        6.738350
PEP        3.447342
LNC       12.628092
EMR        4.992283
MLM        1.451650
CCI       12.952525
NU         2.722825
Market     6.205725
dtype: float64

Notice that pandas did a few things for us.

  • After computing .mean(), the row labels (index) were the former column names.

  • These column names were used to align data when we wanted asked pandas to compute the difference.

Suppose that someone asked you, “What was the Return volatily of the CSCO in 2010?”

With the original df dataset you would so something like

df.loc[df['date'].dt.year==2010, "CSCO"].std()
9.851823038802452

How would you do that using the df_date dataframe?

3.4.3.2.1. Setting a Hierarchical Index#

Include multiple columns in the index is advantageous in some situations.

These situations might include:

  • When we need more than one piece of information (column) to identify an observation

  • When we need data-alignment by more than one column

To achieve multiple columns in the index, we pass a list of multiple column names to set_index.

To show this we will first stack our data set

df_stacked=df_date.stack().reset_index()
df_stacked
date level_1 0
0 2000-01-01 CTL -18.4697
1 2000-01-01 T -11.5513
2 2000-01-01 CSCO 2.2170
3 2000-01-01 FCX -17.4556
4 2000-01-01 XL -12.5301
... ... ... ...
9175 2014-12-01 EMR -3.1686
9176 2014-12-01 MLM -8.0973
9177 2014-12-01 CCI -4.2965
9178 2014-12-01 NU 6.4623
9179 2014-12-01 Market -0.3616

9180 rows × 3 columns

# let us rename the columns to date ticker and return

df_stacked.columns = ["date", "ticker", "return"]
df_stacked.head()
date ticker return
0 2000-01-01 CTL -18.4697
1 2000-01-01 T -11.5513
2 2000-01-01 CSCO 2.2170
3 2000-01-01 FCX -17.4556
4 2000-01-01 XL -12.5301
df_stacked = df_stacked.sort_values(by=['date','ticker']).set_index(["date", "ticker"])
df_stacked.head(5)
return
date ticker
2000-01-01 ACAS 13.7363
AGN 14.5729
AMT 17.3824
BAX 1.6915
CAH -0.3916

Notice that in the display above, the row labels seem to have two levels now.

The outer (or left-most) level is named date and the inner (or right-most) level is named ticker.

When a DataFrame’s index has multiple levels, we (and the pandas documentation) refer to the DataFrame as having a hierarchical index.

3.4.3.2.2. Slicing a Hierarchical Index#

df_stacked.loc[('2000',"CSCO" )]
return
date
2000-01-01 2.2170
2000-02-01 20.7192
2000-03-01 16.9740
2000-04-01 -10.3274
2000-05-01 -17.8724
2000-06-01 11.6356
2000-07-01 2.9499
2000-08-01 4.8711
2000-09-01 -19.4900
2000-10-01 -2.4887
2000-11-01 -11.1369
2000-12-01 -20.1044
df_stacked.loc[('2000',["CSCO","ACAS"]),:]
return
date ticker
2000-01-01 CSCO 2.217000
ACAS 13.736300
2000-02-01 CSCO 20.719200
2000-03-01 CSCO 16.974000
2000-04-01 CSCO -10.327400
2000-05-01 CSCO -17.872400
2000-06-01 CSCO 11.635600
2000-07-01 CSCO 2.949900
2000-08-01 CSCO 4.871100
2000-09-01 CSCO -19.490000
2000-10-01 CSCO -2.488700
2000-11-01 CSCO -11.136900
2000-12-01 CSCO -20.104400
2000-02-01 ACAS -7.246400
2000-03-01 ACAS 7.604199
2000-04-01 ACAS -4.433500
2000-05-01 ACAS -12.154600
2000-06-01 ACAS 14.714700
2000-07-01 ACAS 7.329800
2000-08-01 ACAS -6.829300
2000-09-01 ACAS 1.267000
2000-10-01 ACAS -6.860200
2000-11-01 ACAS -3.399400
2000-12-01 ACAS 21.654000

As shown above, we can use .loc to extract different slices of our data

The rules for using .loc with a hierarchically-indexed DataFrame are similar to the ones we’ve learned for standard DataFrames, but they are a bit more elaborate as we now have more structure to our data.

We will summarize the main rules, and then work through an exercise that demonstrates each of them.

Slicing rules

pandas slicing reacts differently to lists and tuples.

It does this to provide more flexibility to select the data you want.

list in row slicing will be an “or” operation, where it chooses rows based on whether the index value corresponds to any element of the list.

tuple in row slicing will be used to denote a single hierarchical index and must include a value for each level.

Row slicing examples

  1. df_stacked.loc[“CSCO”]: all rows where the outer most index value is equal to CSCO

  2. df_stacked.loc[(‘2005’,”CSCO”)]: all rows where the outer-most index value picks all dates in the year 2005 and the second level is equal to CSCO

  3. df_stacked.loc[[“2010”, “2006”]]: all rows where the outer-most index is belongs to the years 2010 or 2006

  4. df_stacked.loc[([‘2010’, ‘2011’],[“CSCO”, “T”]), :]: all rows where the outer-most index belong to the years of 2010 or 2011 AND where the second level index is either "CSCO or "T"

  5. df_stacked.loc[[(“CSCO”, ‘2010-01-01’), (“T”, ‘2011-02-01’)], :]: all rows where the the two hierarchical indices are either ("CSCO", '2010-01-01') or("T", '2011-02-01') . Here we cannot slice a a whole year an d need instead to pass a precise date

If we had multiple columns re can also restrict .loc to extract certain columns by doing:

  1. df_stacked.loc[rows, 'dividends']: return the rows specified by rows

Why don’t you try a few of these possibilities?

3.4.3.2.3. Alignment with MultiIndex#

The data alignment features we talked about above also apply to a MultiIndex DataFrame.

3.4.3.2.4. pd.IndexSlice#

When we want to extract rows for a few values of the outer index and all values for an inner index level, we can use the convenient df.loc[[id11, id22]] shorthand.

We can use this notation to extract all the data for two years.

df_stacked.loc[["2010", "2004"]]
return
date ticker
2010-01-01 ACAS 51.229500
AGN -8.744599
AMT -1.758800
BAX -1.857500
CAH 2.574400
... ... ...
2004-01-01 TYC 0.943400
WFT 12.000000
WHR 4.542300
XL 2.514500
YUM -1.424400

102 rows × 1 columns

However, suppose we wanted to extract the data for all years, but only for thhe stocks ACAS YUM and TYC.

We cannot do this using .loc because the ticker is on the second level, not outer-most level of our index.

To get around this limitation, we can use the pd.IndexSlice helper.

Here’s an example.

df_stacked.loc[pd.IndexSlice[:, ['ACAS', 'YUM', 'TYC']], :]
return
date ticker
2000-01-01 ACAS 13.736300
2000-02-01 ACAS -7.246400
2000-03-01 ACAS 7.604199
2000-04-01 ACAS -4.433500
2000-05-01 ACAS -12.154600
... ... ...
2014-08-01 TYC 3.406700
2014-09-01 TYC -0.112100
2014-10-01 TYC -3.275700
2014-11-01 TYC -0.069900
2014-12-01 TYC 2.237800

540 rows × 1 columns

Notice that the : in the first part of [:, ["A", "D"]] instructed pandas to give us rows for all values of the outer most index level and that the : just before ] said grab all the columns.

You can also select a slice but in this case it is very important that your dataframe is sorted with respect to the index

3.4.3.2.5. Multi-index Columns#

The functionality of MultiIndex also applies to the column names.

3.4.3.3. Re-setting the Index#

The df.reset_index method will move one or more level of the index back into the DataFrame as a normal column.

With no additional arguments, it moves all levels out of the index and sets the index of the returned DataFrame to the default of range(df.shape[0]).

df_stacked.reset_index()
date ticker return
0 2000-01-01 ACAS 13.7363
1 2000-01-01 AGN 14.5729
2 2000-01-01 AMT 17.3824
3 2000-01-01 BAX 1.6915
4 2000-01-01 CAH -0.3916
... ... ... ...
9175 2014-12-01 TYC 2.2378
9176 2014-12-01 WFT -12.5954
9177 2014-12-01 WHR 4.0662
9178 2014-12-01 XL -2.7872
9179 2014-12-01 YUM -5.6958

9180 rows × 3 columns

3.4.3.4. Choose the Index Carefully#

So, now that we know that we use index and column names for aligning data, “how should we pick the index?” is a natural question to ask.

To guide us to the right answer, we will list the first two components to Hadley Wickham’s description of tidy data:

  1. Each column should each have one variable.

  2. Each row should each have one observation.

If we strive to have our data in a tidy form (we should), then when choosing the index, we should set:

  • the row labels (index) to be a unique identifier for an observation of data

  • the column names to identify one variable

For example, suppose we are looking data on interest rates.

Each column might represent one bond or asset and each row might represent the date.

Using hierarchical row and column indices allows us to store higher dimensional data in our (inherently) two dimensional DataFrame.

3.4.3.4.1. Know Your Goal#

The correct column(s) to choose for the index often depends on the context of your analysis.

For example, if I were studying how GDP and consumption evolved over time for various countries, I would want time (year) and country name on the index

On the other hand, if I were trying to look at the differences across countries and variables within a particular year, I may opt to put the country and variable on the index and have years be columns.

Following the tidy data rules above and thinking about how you intend to use the data – and a little practice – will enable you to consistently select the correct index.