3.4.3. The Index#
import pandas as pd
import numpy as np
3.4.3.1. So What is this Index?#
Every Series or DataFrame has an index.
We told you that the index was the “row labels” for the data.
This is true, but an index in pandas does much more than label the rows.
The purpose of this lecture is to understand the importance of the index.
The pandas documentation says
Data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
In practice, the index and column names are used to make sure the data is properly aligned when operating on multiple DataFrames.
This is a somewhat abstract concept that is best understood by example…
Let’s begin by loading our stock return data and break it in smaller data sets
from datetime import datetime
# Create a date parser function that will allow pandas to read the dates in the format we have them
date_parser = lambda x: datetime.strptime(x, "%Y%m")
# our 50 stock returns data
url = "https://raw.githubusercontent.com/amoreira2/Fin418/main/assets/data/Retuns50stocks.csv"
# Use pd.read_csv with the date_parser
df = pd.read_csv(url, parse_dates=['date'], date_parser=date_parser)
# Set the date column as the index
We’ll also extract a couple smaller DataFrames we can use in examples.
df0 = df.iloc[:5,1:5]
df0
CTL | T | CSCO | FCX | |
---|---|---|---|---|
0 | -18.4697 | -11.5513 | 2.2170 | -17.4556 |
1 | -12.9450 | -11.2245 | 20.7192 | -21.1470 |
2 | 10.5502 | 10.6732 | 16.9740 | -12.2727 |
3 | -34.0067 | 4.6083 | -10.3274 | -20.2073 |
4 | 10.3980 | -0.2853 | -17.8724 | -4.5455 |
df1 = df.iloc[3:8,1:5]
df1
CTL | T | CSCO | FCX | |
---|---|---|---|---|
3 | -34.0067 | 4.6083 | -10.3274 | -20.2073 |
4 | 10.3980 | -0.2853 | -17.8724 | -4.5455 |
5 | 6.4815 | -1.0014 | 11.6356 | 0.6803 |
6 | 1.9565 | -1.0029 | 2.9499 | -2.7027 |
7 | -1.5437 | -1.9090 | 4.8711 | 9.0278 |
Note that the dates now overlap but the overlap is not perfect
Note also that both have five rows
What happens when we try to add these two dataframes?
#df0+df1
What is happening? Why is that useful to us?
Note now with non-overlapping columns
df0 = df.iloc[:5,1:5]
df1 = df.iloc[3:8,2:6]
#df0+df1
3.4.3.1.1. Automatic Alignment#
For all (row, column) combinations that appear in both DataFrames, the value of df0+df1
is equal to df0.loc[row, col] + df1.loc[row, col]
.
This happened even though the rows and columns were not in the same order.
We refer to this as pandas aligning the data for us.
To see how awesome this is, think about how to do something similar in Excel:
df0
anddf1
would be in different sheets.The index and column names would be the first column and row in each sheet.
For each label in the first row and column of either the
df0
sheet or thedf1
sheet we would have to do aIFELSE
to check if the label exists in the other sheet and then aVLOOKUP
to extract the value.
In pandas, this happens automatically, behind the scenes, and very quickly.
3.4.3.1.2. Handling Missing Data#
For all elements in the non-overlapping row/column combination,
the value in df0+df1
is NaN
.
This is how pandas represents missing data.
So, when pandas was trying to look up the values in df0
and df1
, it could
only find a value in one DataFrame: the other value was missing.
When pandas tries to add a number to something that is missing, it says
that the result is missing (spelled NaN
).
3.4.3.2. Setting the Index#
For a DataFrame df
, the df.set_index
method allows us to use one
(or more) of the DataFrame’s columns as the index.
Here’s an example.
# first, create the DataFrame
df_date = df.set_index(["date"])
df_date.head()
CTL | T | CSCO | FCX | XL | IVZ | AMT | WHR | IR | WFT | ... | SWK | DVN | TMO | PEP | LNC | EMR | MLM | CCI | NU | Market | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||||
2000-01-01 | -18.4697 | -11.5513 | 2.2170 | -17.4556 | -12.5301 | -4.0929 | 17.3824 | -10.4707 | -14.5289 | -5.9468 | ... | -16.5975 | 6.8441 | 15.4167 | -3.1915 | -6.9312 | -4.0305 | 2.4390 | -1.5564 | -0.3040 | -3.9612 |
2000-02-01 | -12.9450 | -11.2245 | 20.7192 | -21.1470 | -9.8898 | 5.3057 | 37.2822 | -6.1760 | -18.2311 | 19.8003 | ... | -8.4577 | 6.0498 | -9.7473 | -5.8608 | -25.2115 | -16.6039 | -15.1667 | 1.9763 | -7.7439 | 3.1777 |
2000-03-01 | 10.5502 | 10.6732 | 16.9740 | -12.2727 | 36.9397 | 24.4250 | 0.2538 | 7.9402 | 15.4976 | 30.0000 | ... | 15.6304 | 30.5034 | 30.4000 | 8.9805 | 21.2670 | 16.5981 | 33.8028 | 17.4419 | 14.2857 | 5.3500 |
2000-04-01 | -34.0067 | 4.6083 | -10.3274 | -20.2073 | -13.9955 | 0.9670 | -5.6962 | 11.0874 | 6.0734 | 2.7153 | ... | 11.8483 | -0.9009 | -4.9080 | 5.1971 | 4.7836 | 3.2941 | 11.5789 | 1.3201 | 0.0000 | -5.9530 |
2000-05-01 | 10.3980 | -0.2853 | -17.8724 | -4.5455 | 25.8793 | -8.7719 | -20.2685 | -12.4338 | -2.5672 | 6.0000 | ... | -8.8983 | 24.2857 | -4.1935 | 11.4140 | 11.3106 | 8.1686 | -7.4198 | -31.7590 | 3.3721 | -3.8871 |
5 rows × 51 columns
Now the dates are caried in any manipulation we do
the return observation is bind together with the index
df0 = df_date.iloc[:5,1:5]
df0
T | CSCO | FCX | XL | |
---|---|---|---|---|
date | ||||
2000-01-01 | -11.5513 | 2.2170 | -17.4556 | -12.5301 |
2000-02-01 | -11.2245 | 20.7192 | -21.1470 | -9.8898 |
2000-03-01 | 10.6732 | 16.9740 | -12.2727 | 36.9397 |
2000-04-01 | 4.6083 | -10.3274 | -20.2073 | -13.9955 |
2000-05-01 | -0.2853 | -17.8724 | -4.5455 | 25.8793 |
Now that the date is on the index, we can use .loc
to extract all the
data for a specific year.
df_date.loc['2001']
CTL | T | CSCO | FCX | XL | IVZ | AMT | WHR | IR | WFT | ... | SWK | DVN | TMO | PEP | LNC | EMR | MLM | CCI | NU | Market | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||||
2001-01-01 | -12.2238 | 1.7880 | -2.1242 | 36.6423 | -15.1016 | 5.9770 | -4.4224 | 10.0708 | 5.8149 | 3.4497 | ... | 10.044100 | -10.1197 | -0.336100 | -11.0820 | -4.5601 | -3.5686 | 0.165500 | 3.2333 | -6.886600 | 3.959900 |
2001-02-01 | -8.1581 | -1.3444 | -36.7279 | 20.0855 | 3.0871 | -13.6876 | -20.0552 | 1.3717 | -1.7829 | 6.4443 | ... | 1.398600 | 4.0146 | -5.902200 | 4.5609 | -2.1851 | -11.4704 | 8.779800 | -10.0671 | -9.211699 | -9.909901 |
2001-03-01 | -0.0694 | -6.4361 | -33.2454 | -7.1174 | 0.0789 | -26.6816 | -36.0746 | -5.4473 | -8.3968 | -5.1509 | ... | -4.655200 | 2.1930 | -19.426500 | -4.3186 | -3.1912 | -7.3244 | -7.072900 | -41.0448 | -14.803900 | -7.039000 |
2001-04-01 | -5.4609 | -6.9992 | 7.3834 | 8.5057 | -6.9278 | 25.8621 | 44.8649 | 11.5623 | 18.3581 | 17.9939 | ... | 10.015200 | 1.3917 | 17.259800 | -0.3185 | 9.4066 | 7.5000 | 7.658100 | 65.3333 | 2.704300 | 8.383100 |
2001-05-01 | 4.8197 | 4.3636 | 13.4276 | 10.6638 | 12.9378 | 3.1507 | -7.6866 | 13.3764 | 5.3617 | -3.1942 | ... | 5.462100 | -1.3557 | 5.880100 | 2.1684 | 6.6724 | 2.1643 | 7.700700 | -32.2172 | 8.347300 | 1.045800 |
2001-06-01 | 6.5401 | -6.9454 | -5.5036 | -29.4831 | 3.2704 | -8.0212 | -16.4511 | -0.6201 | -16.5147 | -14.8483 | ... | 10.210500 | -9.7234 | -21.103500 | -0.9272 | 5.0975 | -10.6484 | 0.243100 | -1.2048 | 7.848200 | -1.755400 |
2001-07-01 | 2.1782 | 13.0460 | 5.6044 | -2.6244 | -6.5164 | -7.7678 | -17.9971 | 12.8640 | 6.0194 | -12.1875 | ... | 4.083100 | 3.2571 | 3.542200 | 5.4977 | -0.8019 | -5.1901 | -10.931500 | -40.5488 | -3.614500 | -1.829400 |
2001-08-01 | 13.3721 | -9.1495 | -15.0364 | 15.0558 | 8.7427 | -8.8917 | -14.6313 | -5.9257 | -6.7308 | -21.0676 | ... | -3.349400 | -14.6467 | -1.195700 | 0.7935 | -2.2928 | -5.8883 | -10.072600 | 4.5128 | -1.875000 | -5.898000 |
2001-09-01 | -4.4223 | 15.1797 | -25.4133 | -11.2278 | -4.8193 | -25.7581 | -4.0083 | -16.1618 | -16.6872 | -23.3243 | ... | -12.747700 | -25.5457 | -16.705100 | 3.5000 | -6.4781 | -12.2015 | -0.987300 | -11.6781 | -3.948700 | -9.148800 |
2001-10-01 | -5.6716 | -18.5776 | 38.9163 | 1.0009 | 9.9494 | 10.9040 | -20.6623 | 6.6305 | 10.3550 | 34.1827 | ... | 4.842700 | 11.3372 | 17.119100 | 0.4330 | -8.5246 | 4.1649 | 2.071100 | 30.0000 | -5.766200 | 2.783300 |
2001-11-01 | 7.1202 | -1.9155 | 20.8038 | 20.2703 | 7.5754 | 18.4874 | -20.1452 | 11.9959 | 12.7614 | -2.2203 | ... | 9.681601 | -10.2089 | 14.830700 | -0.1642 | 12.6328 | 11.0720 | 6.813600 | -6.9231 | -0.708200 | 7.878000 |
2001-12-01 | -2.9586 | 4.7886 | -11.3992 | 0.2996 | -1.7423 | 2.9078 | 7.6136 | 11.5116 | -0.1910 | 11.3236 | ... | 11.438100 | 12.5327 | 9.953899 | 0.4215 | 1.8239 | 5.6234 | 9.647099 | -1.9284 | 1.321800 | 1.784800 |
12 rows × 51 columns
This would be helpful, for example, if we wanted to compute the difference in the average returns of all stocks from one year to the next.
df_date.loc['2009'].mean() - df_date.loc['2008'].mean()
CTL 5.487558
T 2.937392
CSCO 7.361342
FCX 19.844483
XL 33.775933
IVZ 10.040733
AMT 6.153350
WHR 12.306317
IR 14.456158
WFT 12.827383
YUM 2.444442
CVS 3.850175
GD 5.122642
TYC 8.857808
EL 6.362417
MUR 6.686392
CTAS 4.201033
CBSA 15.443775
SNV -7.350500
CAM 12.429442
DLTR -2.890100
CAH 6.473425
DTE 3.731900
SSP 24.993375
PSA -0.423242
EXC 1.991258
TKR 4.816925
CMA 8.773100
ORCL 4.711567
MS 13.742083
RSG 3.409825
ACAS 18.108168
AGN 7.490375
MMM 6.472683
ETFC 12.750600
CAR 55.472400
MDR 20.387558
NOV 12.799792
PCH 4.852950
BAX 1.485500
JCI 9.846317
SWK 6.712133
DVN 3.786125
TMO 6.738350
PEP 3.447342
LNC 12.628092
EMR 4.992283
MLM 1.451650
CCI 12.952525
NU 2.722825
Market 6.205725
dtype: float64
Notice that pandas did a few things for us.
After computing
.mean()
, the row labels (index) were the former column names.These column names were used to align data when we wanted asked pandas to compute the difference.
Suppose that someone asked you, “What was the Return volatily of the CSCO in 2010?”
With the original df dataset you would so something like
df.loc[df['date'].dt.year==2010, "CSCO"].std()
9.851823038802452
How would you do that using the df_date
dataframe?
3.4.3.2.1. Setting a Hierarchical Index#
Include multiple columns in the index is advantageous in some situations.
These situations might include:
When we need more than one piece of information (column) to identify an observation
When we need data-alignment by more than one column
To achieve multiple columns in the index, we pass a list of multiple column
names to set_index
.
To show this we will first stack our data set
df_stacked=df_date.stack().reset_index()
df_stacked
date | level_1 | 0 | |
---|---|---|---|
0 | 2000-01-01 | CTL | -18.4697 |
1 | 2000-01-01 | T | -11.5513 |
2 | 2000-01-01 | CSCO | 2.2170 |
3 | 2000-01-01 | FCX | -17.4556 |
4 | 2000-01-01 | XL | -12.5301 |
... | ... | ... | ... |
9175 | 2014-12-01 | EMR | -3.1686 |
9176 | 2014-12-01 | MLM | -8.0973 |
9177 | 2014-12-01 | CCI | -4.2965 |
9178 | 2014-12-01 | NU | 6.4623 |
9179 | 2014-12-01 | Market | -0.3616 |
9180 rows × 3 columns
# let us rename the columns to date ticker and return
df_stacked.columns = ["date", "ticker", "return"]
df_stacked.head()
date | ticker | return | |
---|---|---|---|
0 | 2000-01-01 | CTL | -18.4697 |
1 | 2000-01-01 | T | -11.5513 |
2 | 2000-01-01 | CSCO | 2.2170 |
3 | 2000-01-01 | FCX | -17.4556 |
4 | 2000-01-01 | XL | -12.5301 |
df_stacked = df_stacked.sort_values(by=['date','ticker']).set_index(["date", "ticker"])
df_stacked.head(5)
return | ||
---|---|---|
date | ticker | |
2000-01-01 | ACAS | 13.7363 |
AGN | 14.5729 | |
AMT | 17.3824 | |
BAX | 1.6915 | |
CAH | -0.3916 |
Notice that in the display above, the row labels seem to have two levels now.
The outer (or left-most) level is named date
and the inner (or
right-most) level is named ticker
.
When a DataFrame’s index has multiple levels, we (and the pandas documentation) refer to the DataFrame as having a hierarchical index.
3.4.3.2.2. Slicing a Hierarchical Index#
df_stacked.loc[('2000',"CSCO" )]
return | |
---|---|
date | |
2000-01-01 | 2.2170 |
2000-02-01 | 20.7192 |
2000-03-01 | 16.9740 |
2000-04-01 | -10.3274 |
2000-05-01 | -17.8724 |
2000-06-01 | 11.6356 |
2000-07-01 | 2.9499 |
2000-08-01 | 4.8711 |
2000-09-01 | -19.4900 |
2000-10-01 | -2.4887 |
2000-11-01 | -11.1369 |
2000-12-01 | -20.1044 |
df_stacked.loc[('2000',["CSCO","ACAS"]),:]
return | ||
---|---|---|
date | ticker | |
2000-01-01 | CSCO | 2.217000 |
ACAS | 13.736300 | |
2000-02-01 | CSCO | 20.719200 |
2000-03-01 | CSCO | 16.974000 |
2000-04-01 | CSCO | -10.327400 |
2000-05-01 | CSCO | -17.872400 |
2000-06-01 | CSCO | 11.635600 |
2000-07-01 | CSCO | 2.949900 |
2000-08-01 | CSCO | 4.871100 |
2000-09-01 | CSCO | -19.490000 |
2000-10-01 | CSCO | -2.488700 |
2000-11-01 | CSCO | -11.136900 |
2000-12-01 | CSCO | -20.104400 |
2000-02-01 | ACAS | -7.246400 |
2000-03-01 | ACAS | 7.604199 |
2000-04-01 | ACAS | -4.433500 |
2000-05-01 | ACAS | -12.154600 |
2000-06-01 | ACAS | 14.714700 |
2000-07-01 | ACAS | 7.329800 |
2000-08-01 | ACAS | -6.829300 |
2000-09-01 | ACAS | 1.267000 |
2000-10-01 | ACAS | -6.860200 |
2000-11-01 | ACAS | -3.399400 |
2000-12-01 | ACAS | 21.654000 |
As shown above, we can use .loc
to extract different slices of our
data
The rules for using .loc
with a hierarchically-indexed DataFrame are
similar to the ones we’ve learned for standard DataFrames, but they are a bit
more elaborate as we now have more structure to our data.
We will summarize the main rules, and then work through an exercise that demonstrates each of them.
Slicing rules
pandas slicing reacts differently to list
s and tuple
s.
It does this to provide more flexibility to select the data you want.
list
in row slicing will be an “or” operation, where it chooses rows
based on whether the index value corresponds to any element of the list.
tuple
in row slicing will be used to denote a single hierarchical
index and must include a value for each level.
Row slicing examples
df_stacked
.loc[“CSCO”]: all rows where the outer most index value is equal toCSCO
df_stacked
.loc[(‘2005’,”CSCO”)]: all rows where the outer-most index value picks all dates in the year2005
and the second level is equal toCSCO
df_stacked
.loc[[“2010”, “2006”]]: all rows where the outer-most index is belongs to the years 2010 or 2006df_stacked
.loc[([‘2010’, ‘2011’],[“CSCO”, “T”]), :]: all rows where the outer-most index belong to the years of2010
or2011
AND where the second level index is either"CSCO
or"T"
df_stacked
.loc[[(“CSCO”, ‘2010-01-01’), (“T”, ‘2011-02-01’)], :]: all rows where the the two hierarchical indices are either("CSCO", '2010-01-01')
or("T", '2011-02-01')
. Here we cannot slice a a whole year an d need instead to pass a precise date
If we had multiple columns re can also restrict .loc
to extract certain columns by doing:
df_stacked.loc[rows, 'dividends']
: return the rows specified by rows
Why don’t you try a few of these possibilities?
3.4.3.2.3. Alignment with MultiIndex
#
The data alignment features we talked about above also apply to a
MultiIndex
DataFrame.
3.4.3.2.4. pd.IndexSlice
#
When we want to extract rows for a few values of the outer index and all
values for an inner index level, we can use the convenient
df.loc[[id11, id22]]
shorthand.
We can use this notation to extract all the data for two years.
df_stacked.loc[["2010", "2004"]]
return | ||
---|---|---|
date | ticker | |
2010-01-01 | ACAS | 51.229500 |
AGN | -8.744599 | |
AMT | -1.758800 | |
BAX | -1.857500 | |
CAH | 2.574400 | |
... | ... | ... |
2004-01-01 | TYC | 0.943400 |
WFT | 12.000000 | |
WHR | 4.542300 | |
XL | 2.514500 | |
YUM | -1.424400 |
102 rows × 1 columns
However, suppose we wanted to extract the data for all years, but only for thhe stocks ACAS YUM and TYC.
We cannot do this using .loc
because the ticker is on the second level,
not outer-most level of our index.
To get around this limitation, we can use the pd.IndexSlice
helper.
Here’s an example.
df_stacked.loc[pd.IndexSlice[:, ['ACAS', 'YUM', 'TYC']], :]
return | ||
---|---|---|
date | ticker | |
2000-01-01 | ACAS | 13.736300 |
2000-02-01 | ACAS | -7.246400 |
2000-03-01 | ACAS | 7.604199 |
2000-04-01 | ACAS | -4.433500 |
2000-05-01 | ACAS | -12.154600 |
... | ... | ... |
2014-08-01 | TYC | 3.406700 |
2014-09-01 | TYC | -0.112100 |
2014-10-01 | TYC | -3.275700 |
2014-11-01 | TYC | -0.069900 |
2014-12-01 | TYC | 2.237800 |
540 rows × 1 columns
Notice that the :
in the first part of [:, ["A", "D"]]
instructed pandas to give us rows for all values of the outer most index
level and that the :
just before ]
said grab all the columns.
You can also select a slice but in this case it is very important that your dataframe is sorted with respect to the index
3.4.3.2.5. Multi-index Columns#
The functionality of MultiIndex
also applies to the column names.
3.4.3.3. Re-setting the Index#
The df.reset_index
method will move one or more level of the index
back into the DataFrame as a normal column.
With no additional arguments, it moves all levels out of the index and
sets the index of the returned DataFrame to the default of
range(df.shape[0])
.
df_stacked.reset_index()
date | ticker | return | |
---|---|---|---|
0 | 2000-01-01 | ACAS | 13.7363 |
1 | 2000-01-01 | AGN | 14.5729 |
2 | 2000-01-01 | AMT | 17.3824 |
3 | 2000-01-01 | BAX | 1.6915 |
4 | 2000-01-01 | CAH | -0.3916 |
... | ... | ... | ... |
9175 | 2014-12-01 | TYC | 2.2378 |
9176 | 2014-12-01 | WFT | -12.5954 |
9177 | 2014-12-01 | WHR | 4.0662 |
9178 | 2014-12-01 | XL | -2.7872 |
9179 | 2014-12-01 | YUM | -5.6958 |
9180 rows × 3 columns
3.4.3.4. Choose the Index Carefully#
So, now that we know that we use index and column names for aligning data, “how should we pick the index?” is a natural question to ask.
To guide us to the right answer, we will list the first two components to Hadley Wickham’s description of tidy data:
Each column should each have one variable.
Each row should each have one observation.
If we strive to have our data in a tidy form (we should), then when choosing the index, we should set:
the row labels (index) to be a unique identifier for an observation of data
the column names to identify one variable
For example, suppose we are looking data on interest rates.
Each column might represent one bond or asset and each row might represent the date.
Using hierarchical row and column indices allows us to store higher dimensional data in our (inherently) two dimensional DataFrame.
3.4.3.4.1. Know Your Goal#
The correct column(s) to choose for the index often depends on the context of your analysis.
For example, if I were studying how GDP and consumption evolved over time for various countries, I would want time (year) and country name on the index
On the other hand, if I were trying to look at the differences across countries and variables within a particular year, I may opt to put the country and variable on the index and have years be columns.
Following the tidy data rules above and thinking about how you intend to use the data – and a little practice – will enable you to consistently select the correct index.