The Index

3.4.3. The Index#

import pandas as pd
import numpy as np

3.4.3.1. So What is this Index?#

Every Series or DataFrame has an index.

We told you that the index was the “row labels” for the data.

This is true, but an index in pandas does much more than label the rows.

The purpose of this lecture is to understand the importance of the index.

The pandas documentation says

Data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.

In practice, the index and column names are used to make sure the data is properly aligned when operating on multiple DataFrames.

This is a somewhat abstract concept that is best understood by example…

Let’s begin by loading our stock return data and break it in smaller data sets

from datetime import datetime

# Create a date parser function that will allow pandas to read the dates in the format we have them
date_parser = lambda x: datetime.strptime(x, "%Y%m")
# our 50 stock returns data

url = "https://raw.githubusercontent.com/amoreira2/Fin418/main/assets/data/Retuns50stocks.csv"

# Use pd.read_csv with the date_parser
df  = pd.read_csv(url, parse_dates=['date'], date_parser=date_parser)
# Set the date column as the index

We’ll also extract a couple smaller DataFrames we can use in examples.

df0 = df.iloc[:5,1:5]
df0

	CTL	T	CSCO	FCX
0	-18.4697	-11.5513	2.2170	-17.4556
1	-12.9450	-11.2245	20.7192	-21.1470
2	10.5502	10.6732	16.9740	-12.2727
3	-34.0067	4.6083	-10.3274	-20.2073
4	10.3980	-0.2853	-17.8724	-4.5455

df1 = df.iloc[3:8,1:5]
df1

	CTL	T	CSCO	FCX
3	-34.0067	4.6083	-10.3274	-20.2073
4	10.3980	-0.2853	-17.8724	-4.5455
5	6.4815	-1.0014	11.6356	0.6803
6	1.9565	-1.0029	2.9499	-2.7027
7	-1.5437	-1.9090	4.8711	9.0278

Note that the dates now overlap but the overlap is not perfect

Note also that both have five rows

What happens when we try to add these two dataframes?

#df0+df1

What is happening? Why is that useful to us?

Note now with non-overlapping columns

df0 = df.iloc[:5,1:5]

df1 = df.iloc[3:8,2:6]

#df0+df1

3.4.3.1.1. Automatic Alignment#

For all (row, column) combinations that appear in both DataFrames, the value of df0+df1 is equal to df0.loc[row, col] + df1.loc[row, col].

This happened even though the rows and columns were not in the same order.

We refer to this as pandas aligning the data for us.

To see how awesome this is, think about how to do something similar in Excel:

df0 and df1 would be in different sheets.
The index and column names would be the first column and row in each sheet.
For each label in the first row and column of either the df0 sheet or the df1 sheet we would have to do a IFELSE to check if the label exists in the other sheet and then a VLOOKUP to extract the value.

In pandas, this happens automatically, behind the scenes, and very quickly.

3.4.3.1.2. Handling Missing Data#

For all elements in the non-overlapping row/column combination, the value in df0+df1 is NaN.

This is how pandas represents missing data.

So, when pandas was trying to look up the values in df0 and df1, it could only find a value in one DataFrame: the other value was missing.

When pandas tries to add a number to something that is missing, it says that the result is missing (spelled NaN).

3.4.3.2. Setting the Index#

For a DataFrame df, the df.set_index method allows us to use one (or more) of the DataFrame’s columns as the index.

Here’s an example.

# first, create the DataFrame
df_date = df.set_index(["date"])
df_date.head()

	CTL	T	CSCO	FCX	XL	IVZ	AMT	WHR	IR	WFT	...	SWK	DVN	TMO	PEP	LNC	EMR	MLM	CCI	NU	Market
date
2000-01-01	-18.4697	-11.5513	2.2170	-17.4556	-12.5301	-4.0929	17.3824	-10.4707	-14.5289	-5.9468	...	-16.5975	6.8441	15.4167	-3.1915	-6.9312	-4.0305	2.4390	-1.5564	-0.3040	-3.9612
2000-02-01	-12.9450	-11.2245	20.7192	-21.1470	-9.8898	5.3057	37.2822	-6.1760	-18.2311	19.8003	...	-8.4577	6.0498	-9.7473	-5.8608	-25.2115	-16.6039	-15.1667	1.9763	-7.7439	3.1777
2000-03-01	10.5502	10.6732	16.9740	-12.2727	36.9397	24.4250	0.2538	7.9402	15.4976	30.0000	...	15.6304	30.5034	30.4000	8.9805	21.2670	16.5981	33.8028	17.4419	14.2857	5.3500
2000-04-01	-34.0067	4.6083	-10.3274	-20.2073	-13.9955	0.9670	-5.6962	11.0874	6.0734	2.7153	...	11.8483	-0.9009	-4.9080	5.1971	4.7836	3.2941	11.5789	1.3201	0.0000	-5.9530
2000-05-01	10.3980	-0.2853	-17.8724	-4.5455	25.8793	-8.7719	-20.2685	-12.4338	-2.5672	6.0000	...	-8.8983	24.2857	-4.1935	11.4140	11.3106	8.1686	-7.4198	-31.7590	3.3721	-3.8871

5 rows × 51 columns

Now the dates are caried in any manipulation we do

the return observation is bind together with the index

df0 = df_date.iloc[:5,1:5]

df0

	T	CSCO	FCX	XL
date
2000-01-01	-11.5513	2.2170	-17.4556	-12.5301
2000-02-01	-11.2245	20.7192	-21.1470	-9.8898
2000-03-01	10.6732	16.9740	-12.2727	36.9397
2000-04-01	4.6083	-10.3274	-20.2073	-13.9955
2000-05-01	-0.2853	-17.8724	-4.5455	25.8793

Now that the date is on the index, we can use .loc to extract all the data for a specific year.

df_date.loc['2001']

	CTL	T	CSCO	FCX	XL	IVZ	AMT	WHR	IR	WFT	...	SWK	DVN	TMO	PEP	LNC	EMR	MLM	CCI	NU	Market
date
2001-01-01	-12.2238	1.7880	-2.1242	36.6423	-15.1016	5.9770	-4.4224	10.0708	5.8149	3.4497	...	10.044100	-10.1197	-0.336100	-11.0820	-4.5601	-3.5686	0.165500	3.2333	-6.886600	3.959900
2001-02-01	-8.1581	-1.3444	-36.7279	20.0855	3.0871	-13.6876	-20.0552	1.3717	-1.7829	6.4443	...	1.398600	4.0146	-5.902200	4.5609	-2.1851	-11.4704	8.779800	-10.0671	-9.211699	-9.909901
2001-03-01	-0.0694	-6.4361	-33.2454	-7.1174	0.0789	-26.6816	-36.0746	-5.4473	-8.3968	-5.1509	...	-4.655200	2.1930	-19.426500	-4.3186	-3.1912	-7.3244	-7.072900	-41.0448	-14.803900	-7.039000
2001-04-01	-5.4609	-6.9992	7.3834	8.5057	-6.9278	25.8621	44.8649	11.5623	18.3581	17.9939	...	10.015200	1.3917	17.259800	-0.3185	9.4066	7.5000	7.658100	65.3333	2.704300	8.383100
2001-05-01	4.8197	4.3636	13.4276	10.6638	12.9378	3.1507	-7.6866	13.3764	5.3617	-3.1942	...	5.462100	-1.3557	5.880100	2.1684	6.6724	2.1643	7.700700	-32.2172	8.347300	1.045800
2001-06-01	6.5401	-6.9454	-5.5036	-29.4831	3.2704	-8.0212	-16.4511	-0.6201	-16.5147	-14.8483	...	10.210500	-9.7234	-21.103500	-0.9272	5.0975	-10.6484	0.243100	-1.2048	7.848200	-1.755400
2001-07-01	2.1782	13.0460	5.6044	-2.6244	-6.5164	-7.7678	-17.9971	12.8640	6.0194	-12.1875	...	4.083100	3.2571	3.542200	5.4977	-0.8019	-5.1901	-10.931500	-40.5488	-3.614500	-1.829400
2001-08-01	13.3721	-9.1495	-15.0364	15.0558	8.7427	-8.8917	-14.6313	-5.9257	-6.7308	-21.0676	...	-3.349400	-14.6467	-1.195700	0.7935	-2.2928	-5.8883	-10.072600	4.5128	-1.875000	-5.898000
2001-09-01	-4.4223	15.1797	-25.4133	-11.2278	-4.8193	-25.7581	-4.0083	-16.1618	-16.6872	-23.3243	...	-12.747700	-25.5457	-16.705100	3.5000	-6.4781	-12.2015	-0.987300	-11.6781	-3.948700	-9.148800
2001-10-01	-5.6716	-18.5776	38.9163	1.0009	9.9494	10.9040	-20.6623	6.6305	10.3550	34.1827	...	4.842700	11.3372	17.119100	0.4330	-8.5246	4.1649	2.071100	30.0000	-5.766200	2.783300
2001-11-01	7.1202	-1.9155	20.8038	20.2703	7.5754	18.4874	-20.1452	11.9959	12.7614	-2.2203	...	9.681601	-10.2089	14.830700	-0.1642	12.6328	11.0720	6.813600	-6.9231	-0.708200	7.878000
2001-12-01	-2.9586	4.7886	-11.3992	0.2996	-1.7423	2.9078	7.6136	11.5116	-0.1910	11.3236	...	11.438100	12.5327	9.953899	0.4215	1.8239	5.6234	9.647099	-1.9284	1.321800	1.784800

12 rows × 51 columns

This would be helpful, for example, if we wanted to compute the difference in the average returns of all stocks from one year to the next.

df_date.loc['2009'].mean() - df_date.loc['2008'].mean()

CTL        5.487558
T          2.937392
CSCO       7.361342
FCX       19.844483
XL        33.775933
IVZ       10.040733
AMT        6.153350
WHR       12.306317
IR        14.456158
WFT       12.827383
YUM        2.444442
CVS        3.850175
GD         5.122642
TYC        8.857808
EL         6.362417
MUR        6.686392
CTAS       4.201033
CBSA      15.443775
SNV       -7.350500
CAM       12.429442
DLTR      -2.890100
CAH        6.473425
DTE        3.731900
SSP       24.993375
PSA       -0.423242
EXC        1.991258
TKR        4.816925
CMA        8.773100
ORCL       4.711567
MS        13.742083
RSG        3.409825
ACAS      18.108168
AGN        7.490375
MMM        6.472683
ETFC      12.750600
CAR       55.472400
MDR       20.387558
NOV       12.799792
PCH        4.852950
BAX        1.485500
JCI        9.846317
SWK        6.712133
DVN        3.786125
TMO        6.738350
PEP        3.447342
LNC       12.628092
EMR        4.992283
MLM        1.451650
CCI       12.952525
NU         2.722825
Market     6.205725
dtype: float64

Notice that pandas did a few things for us.

After computing .mean(), the row labels (index) were the former column names.
These column names were used to align data when we wanted asked pandas to compute the difference.

Suppose that someone asked you, “What was the Return volatily of the CSCO in 2010?”

With the original df dataset you would so something like

df.loc[df['date'].dt.year==2010, "CSCO"].std()

9.851823038802452

How would you do that using the df_date dataframe?

3.4.3.2.1. Setting a Hierarchical Index#

Include multiple columns in the index is advantageous in some situations.

These situations might include:

When we need more than one piece of information (column) to identify an observation
When we need data-alignment by more than one column

To achieve multiple columns in the index, we pass a list of multiple column names to set_index.

To show this we will first stack our data set

df_stacked=df_date.stack().reset_index()
df_stacked

	date	level_1	0
0	2000-01-01	CTL	-18.4697
1	2000-01-01	T	-11.5513
2	2000-01-01	CSCO	2.2170
3	2000-01-01	FCX	-17.4556
4	2000-01-01	XL	-12.5301
...	...	...	...
9175	2014-12-01	EMR	-3.1686
9176	2014-12-01	MLM	-8.0973
9177	2014-12-01	CCI	-4.2965
9178	2014-12-01	NU	6.4623
9179	2014-12-01	Market	-0.3616

9180 rows × 3 columns

# let us rename the columns to date ticker and return

df_stacked.columns = ["date", "ticker", "return"]
df_stacked.head()

	date	ticker	return
0	2000-01-01	CTL	-18.4697
1	2000-01-01	T	-11.5513
2	2000-01-01	CSCO	2.2170
3	2000-01-01	FCX	-17.4556
4	2000-01-01	XL	-12.5301

df_stacked = df_stacked.sort_values(by=['date','ticker']).set_index(["date", "ticker"])
df_stacked.head(5)

		return
date	ticker
2000-01-01	ACAS	13.7363
	AGN	14.5729
	AMT	17.3824
	BAX	1.6915
	CAH	-0.3916

Notice that in the display above, the row labels seem to have two levels now.

The outer (or left-most) level is named date and the inner (or right-most) level is named ticker.

When a DataFrame’s index has multiple levels, we (and the pandas documentation) refer to the DataFrame as having a hierarchical index.

3.4.3.2.2. Slicing a Hierarchical Index#

df_stacked.loc[('2000',"CSCO" )]

	return
date
2000-01-01	2.2170
2000-02-01	20.7192
2000-03-01	16.9740
2000-04-01	-10.3274
2000-05-01	-17.8724
2000-06-01	11.6356
2000-07-01	2.9499
2000-08-01	4.8711
2000-09-01	-19.4900
2000-10-01	-2.4887
2000-11-01	-11.1369
2000-12-01	-20.1044

df_stacked.loc[('2000',["CSCO","ACAS"]),:]

		return
date	ticker
2000-01-01	CSCO	2.217000
2000-01-01	ACAS	13.736300
2000-02-01	CSCO	20.719200
2000-03-01	CSCO	16.974000
2000-04-01	CSCO	-10.327400
2000-05-01	CSCO	-17.872400
2000-06-01	CSCO	11.635600
2000-07-01	CSCO	2.949900
2000-08-01	CSCO	4.871100
2000-09-01	CSCO	-19.490000
2000-10-01	CSCO	-2.488700
2000-11-01	CSCO	-11.136900
2000-12-01	CSCO	-20.104400
2000-02-01	ACAS	-7.246400
2000-03-01	ACAS	7.604199
2000-04-01	ACAS	-4.433500
2000-05-01	ACAS	-12.154600
2000-06-01	ACAS	14.714700
2000-07-01	ACAS	7.329800
2000-08-01	ACAS	-6.829300
2000-09-01	ACAS	1.267000
2000-10-01	ACAS	-6.860200
2000-11-01	ACAS	-3.399400
2000-12-01	ACAS	21.654000

As shown above, we can use .loc to extract different slices of our data

The rules for using .loc with a hierarchically-indexed DataFrame are similar to the ones we’ve learned for standard DataFrames, but they are a bit more elaborate as we now have more structure to our data.

We will summarize the main rules, and then work through an exercise that demonstrates each of them.

Slicing rules

pandas slicing reacts differently to lists and tuples.

It does this to provide more flexibility to select the data you want.

list in row slicing will be an “or” operation, where it chooses rows based on whether the index value corresponds to any element of the list.

tuple in row slicing will be used to denote a single hierarchical index and must include a value for each level.

Row slicing examples

df_stacked.loc[“CSCO”]: all rows where the outer most index value is equal to CSCO
df_stacked.loc[(‘2005’,”CSCO”)]: all rows where the outer-most index value picks all dates in the year 2005 and the second level is equal to CSCO
df_stacked.loc[[“2010”, “2006”]]: all rows where the outer-most index is belongs to the years 2010 or 2006
df_stacked.loc[([‘2010’, ‘2011’],[“CSCO”, “T”]), :]: all rows where the outer-most index belong to the years of 2010 or 2011 AND where the second level index is either "CSCO or "T"
df_stacked.loc[[(“CSCO”, ‘2010-01-01’), (“T”, ‘2011-02-01’)], :]: all rows where the the two hierarchical indices are either ("CSCO", '2010-01-01') or("T", '2011-02-01') . Here we cannot slice a a whole year an d need instead to pass a precise date

If we had multiple columns re can also restrict .loc to extract certain columns by doing:

df_stacked.loc[rows, 'dividends']: return the rows specified by rows

Why don’t you try a few of these possibilities?

3.4.3.2.3. Alignment with `MultiIndex`#

The data alignment features we talked about above also apply to a MultiIndex DataFrame.

3.4.3.2.4. `pd.IndexSlice`#

When we want to extract rows for a few values of the outer index and all values for an inner index level, we can use the convenient df.loc[[id11, id22]] shorthand.

We can use this notation to extract all the data for two years.

df_stacked.loc[["2010", "2004"]]

		return
date	ticker
2010-01-01	ACAS	51.229500
	AGN	-8.744599
	AMT	-1.758800
	BAX	-1.857500
	CAH	2.574400
...	...	...
2004-01-01	TYC	0.943400
	WFT	12.000000
	WHR	4.542300
	XL	2.514500
	YUM	-1.424400

102 rows × 1 columns

However, suppose we wanted to extract the data for all years, but only for thhe stocks ACAS YUM and TYC.

We cannot do this using .loc because the ticker is on the second level, not outer-most level of our index.

To get around this limitation, we can use the pd.IndexSlice helper.

Here’s an example.

df_stacked.loc[pd.IndexSlice[:, ['ACAS', 'YUM', 'TYC']], :]

		return
date	ticker
2000-01-01	ACAS	13.736300
2000-02-01	ACAS	-7.246400
2000-03-01	ACAS	7.604199
2000-04-01	ACAS	-4.433500
2000-05-01	ACAS	-12.154600
...	...	...
2014-08-01	TYC	3.406700
2014-09-01	TYC	-0.112100
2014-10-01	TYC	-3.275700
2014-11-01	TYC	-0.069900
2014-12-01	TYC	2.237800

540 rows × 1 columns

Notice that the : in the first part of [:, ["A", "D"]] instructed pandas to give us rows for all values of the outer most index level and that the : just before ] said grab all the columns.

You can also select a slice but in this case it is very important that your dataframe is sorted with respect to the index

3.4.3.2.5. Multi-index Columns#

The functionality of MultiIndex also applies to the column names.

3.4.3.3. Re-setting the Index#

The df.reset_index method will move one or more level of the index back into the DataFrame as a normal column.

With no additional arguments, it moves all levels out of the index and sets the index of the returned DataFrame to the default of range(df.shape[0]).

df_stacked.reset_index()

	date	ticker	return
0	2000-01-01	ACAS	13.7363
1	2000-01-01	AGN	14.5729
2	2000-01-01	AMT	17.3824
3	2000-01-01	BAX	1.6915
4	2000-01-01	CAH	-0.3916
...	...	...	...
9175	2014-12-01	TYC	2.2378
9176	2014-12-01	WFT	-12.5954
9177	2014-12-01	WHR	4.0662
9178	2014-12-01	XL	-2.7872
9179	2014-12-01	YUM	-5.6958

9180 rows × 3 columns

3.4.3.4. Choose the Index Carefully#

So, now that we know that we use index and column names for aligning data, “how should we pick the index?” is a natural question to ask.

To guide us to the right answer, we will list the first two components to Hadley Wickham’s description of tidy data:

Each column should each have one variable.
Each row should each have one observation.

If we strive to have our data in a tidy form (we should), then when choosing the index, we should set:

the row labels (index) to be a unique identifier for an observation of data
the column names to identify one variable

For example, suppose we are looking data on interest rates.

Each column might represent one bond or asset and each row might represent the date.

Using hierarchical row and column indices allows us to store higher dimensional data in our (inherently) two dimensional DataFrame.

3.4.3.4.1. Know Your Goal#

The correct column(s) to choose for the index often depends on the context of your analysis.

For example, if I were studying how GDP and consumption evolved over time for various countries, I would want time (year) and country name on the index

On the other hand, if I were trying to look at the differences across countries and variables within a particular year, I may opt to put the country and variable on the index and have years be columns.

Following the tidy data rules above and thinking about how you intend to use the data – and a little practice – will enable you to consistently select the correct index.