The Index
Contents
5.3. The Index#
import pandas as pd
import numpy as np
5.3.1. So What is this Index?#
Every Series or DataFrame has an index.
We told you that the index was the “row labels” for the data.
This is true, but an index in pandas does much more than label the rows.
The purpose of this lecture is to understand the importance of the index.
The pandas documentation says
Data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
In practice, the index and column names are used to make sure the data is properly aligned when operating on multiple DataFrames.
This is a somewhat abstract concept that is best understood by example…
Let’s begin by loading some data on GDP components that we collected from the World Bank’s World Development Indicators Dataset.
url = "https://raw.githubusercontent.com/amoreira2/Lectures/main/assets/data/wdi_data.csv"
df = pd.read_csv(url)
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 72 non-null object
1 year 72 non-null int64
2 GovExpend 72 non-null float64
3 Consumption 72 non-null float64
4 Exports 72 non-null float64
5 Imports 72 non-null float64
6 GDP 72 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 4.1+ KB
country | year | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|---|
0 | Canada | 2017 | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
1 | Canada | 2016 | 0.364899 | 1.058426 | 0.576394 | 0.575775 | 1.814016 |
2 | Canada | 2015 | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 |
3 | Canada | 2014 | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 |
4 | Canada | 2013 | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 |
We’ll also extract a couple smaller DataFrames we can use in examples.
df_small = df.head(5)
df_small
country | year | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|---|
0 | Canada | 2017 | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
1 | Canada | 2016 | 0.364899 | 1.058426 | 0.576394 | 0.575775 | 1.814016 |
2 | Canada | 2015 | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 |
3 | Canada | 2014 | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 |
4 | Canada | 2013 | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 |
df_tiny = df.iloc[[0, 3, 2, 4], :]
df_tiny
country | year | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|---|
0 | Canada | 2017 | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
3 | Canada | 2014 | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 |
2 | Canada | 2015 | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 |
4 | Canada | 2013 | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 |
im_ex = df_small[["Imports", "Exports"]]
im_ex_copy = im_ex.copy()
im_ex_copy
Imports | Exports | |
---|---|---|
0 | 0.600031 | 0.582831 |
1 | 0.575775 | 0.576394 |
2 | 0.575793 | 0.568859 |
3 | 0.572344 | 0.550323 |
4 | 0.558636 | 0.518040 |
Observe what happens when we evaluate im_ex + im_ex_copy
.
im_ex + im_ex_copy
Imports | Exports | |
---|---|---|
0 | 1.200063 | 1.165661 |
1 | 1.151550 | 1.152787 |
2 | 1.151585 | 1.137718 |
3 | 1.144688 | 1.100646 |
4 | 1.117272 | 1.036081 |
Notice that this operated elementwise, meaning that the +
operation was applied to each element of im_ex
and the corresponding
element of im_ex_copy
.
Let’s take a closer look at df_tiny
:
df_tiny
country | year | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|---|
0 | Canada | 2017 | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
3 | Canada | 2014 | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 |
2 | Canada | 2015 | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 |
4 | Canada | 2013 | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 |
Relative to im_ex
notice a few things:
The row labeled
1
appears inim_ex
but notdf_tiny
.For row labels that appear in both, they are not in the same position within each DataFrame.
Certain columns appear only in
df_tiny
.The
Imports
andExports
columns are the 6th and 5th columns ofdf_tiny
and the 1st and 2nd ofim_ex
, respectively.
Now, let’s see what happens when we try df_tiny + im_ex
.
im_ex_tiny = df_tiny + im_ex
im_ex_tiny
Consumption | Exports | GDP | GovExpend | Imports | country | year | |
---|---|---|---|---|---|---|---|
0 | NaN | 1.165661 | NaN | NaN | 1.200063 | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | 1.137718 | NaN | NaN | 1.151585 | NaN | NaN |
3 | NaN | 1.100646 | NaN | NaN | 1.144688 | NaN | NaN |
4 | NaN | 1.036081 | NaN | NaN | 1.117272 | NaN | NaN |
Whoa, a lot happened! Let’s break it down.
5.3.1.1. Automatic Alignment#
For all (row, column) combinations that appear in both DataFrames (e.g.
rows [1, 3]
and columns [Imports, Exports]
), the value of im_ex_tiny
is equal to df_tiny.loc[row, col] + im_ex.loc[row, col]
.
This happened even though the rows and columns were not in the same order.
We refer to this as pandas aligning the data for us.
To see how awesome this is, think about how to do something similar in Excel:
df_tiny
andim_ex
would be in different sheets.The index and column names would be the first column and row in each sheet.
We would have a third sheet to hold the sum.
For each label in the first row and column of either the
df_tiny
sheet or theim_ex
sheet we would have to do aIFELSE
to check if the label exists in the other sheet and then aVLOOKUP
to extract the value.
In pandas, this happens automatically, behind the scenes, and very quickly.
5.3.1.2. Handling Missing Data#
For all elements in row 1
or columns
["country", "year", "GovExpend", "Consumption", "GDP"]
,
the value in im_ex_tiny
is NaN
.
This is how pandas represents missing data.
So, when pandas was trying to look up the values in df_tiny
and im_ex
, it could
only find a value in one DataFrame: the other value was missing.
When pandas tries to add a number to something that is missing, it says
that the result is missing (spelled NaN
).
5.3.2. Setting the Index#
For a DataFrame df
, the df.set_index
method allows us to use one
(or more) of the DataFrame’s columns as the index.
Here’s an example.
# first, create the DataFrame
df_year = df.set_index(["year"])
df_year.head()
country | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|
year | ||||||
2017 | Canada | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
2016 | Canada | 0.364899 | 1.058426 | 0.576394 | 0.575775 | 1.814016 |
2015 | Canada | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 |
2014 | Canada | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 |
2013 | Canada | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 |
Now that the year is on the index, we can use .loc
to extract all the
data for a specific year.
df_year.loc[2010]
country | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|
year | ||||||
2010 | Canada | 0.347332 | 0.921952 | 0.469949 | 0.500341 | 1.613543 |
2010 | Germany | 0.653386 | 1.915481 | 1.443735 | 1.266126 | 3.417095 |
2010 | United Kingdom | 0.521146 | 1.598563 | 0.690824 | 0.745065 | 2.452900 |
2010 | United States | 2.510143 | 10.185836 | 1.846280 | 2.360183 | 14.992053 |
This would be helpful, for example, if we wanted to compute the difference in the average of all our variables from one year to the next.
df_year.loc[2009].mean() - df_year.loc[2008].mean()
GovExpend 0.033317
Consumption -0.042998
Exports -0.121425
Imports -0.140042
GDP -0.182610
dtype: float64
Notice that pandas did a few things for us.
After computing
.mean()
, the row labels (index) were the former column names.These column names were used to align data when we wanted asked pandas to compute the difference.
Suppose that someone asked you, “What was the GDP in the US in 2010?”
To compute that using df_year
you might do something like this:
df_year.loc[df_year["country"] == "United States", "GDP"].loc[2010]
14.992052727
That was a lot of work!
Now, suppose that after seeing you extract that data, your friend asks you “What about GDP in Germany and the UK in 2010?”
To answer that question, you might write.
df_year.loc[df_year["country"].isin(["United Kingdom", "Germany"]), "GDP"].loc[2010]
year
2010 3.417095
2010 2.452900
Name: GDP, dtype: float64
Notice that this code is similar to the code above, but now provides a result that is ambiguous.
The two elements in the series both have with label 2010.
How do we know which is which?
We might think that the first value corresponds to the United Kingdom because
that is what we listed first in the call to isin
, but we would be wrong!
Let’s check.
df_year.loc[2010]
country | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|
year | ||||||
2010 | Canada | 0.347332 | 0.921952 | 0.469949 | 0.500341 | 1.613543 |
2010 | Germany | 0.653386 | 1.915481 | 1.443735 | 1.266126 | 3.417095 |
2010 | United Kingdom | 0.521146 | 1.598563 | 0.690824 | 0.745065 | 2.452900 |
2010 | United States | 2.510143 | 10.185836 | 1.846280 | 2.360183 | 14.992053 |
Setting just the year as index has one more potential issue: we will get data alignment only on the year, which may not be sufficient.
for example here we might need to include both the year and country in the index…
5.3.2.1. Setting a Hierarchical Index#
Include multiple columns in the index is advantageous in some situations.
These situations might include:
When we need more than one piece of information (column) to identify an observation (as in the Germany and UK GDP example above)
When we need data-alignment by more than one column
To achieve multiple columns in the index, we pass a list of multiple column
names to set_index
.
wdi = df.set_index(["country", "year"])
wdi.head(20)
GovExpend | Consumption | Exports | Imports | GDP | ||
---|---|---|---|---|---|---|
country | year | |||||
Canada | 2017 | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
2016 | 0.364899 | 1.058426 | 0.576394 | 0.575775 | 1.814016 | |
2015 | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 | |
2014 | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 | |
2013 | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 | |
2012 | 0.354342 | 0.961226 | 0.505969 | 0.547756 | 1.693428 | |
2011 | 0.351887 | 0.943145 | 0.492349 | 0.528227 | 1.664240 | |
2010 | 0.347332 | 0.921952 | 0.469949 | 0.500341 | 1.613543 | |
2009 | 0.339686 | 0.890078 | 0.440692 | 0.439796 | 1.565291 | |
2008 | 0.330766 | 0.889602 | 0.506350 | 0.502281 | 1.612862 | |
2007 | 0.318777 | 0.864012 | 0.530453 | 0.498002 | 1.596876 | |
2006 | 0.311382 | 0.827643 | 0.524461 | 0.470931 | 1.564608 | |
2005 | 0.303043 | 0.794390 | 0.519950 | 0.447222 | 1.524608 | |
2004 | 0.299854 | 0.764357 | 0.508657 | 0.416754 | 1.477317 | |
2003 | 0.294335 | 0.741796 | 0.481993 | 0.384199 | 1.433089 | |
2002 | 0.286094 | 0.721974 | 0.490465 | 0.368615 | 1.407725 | |
2001 | 0.279767 | 0.694230 | 0.484696 | 0.362023 | 1.366590 | |
2000 | 0.270553 | 0.677713 | 0.499526 | 0.380823 | 1.342805 | |
Germany | 2017 | 0.745579 | 2.112009 | 1.930563 | 1.666348 | 3.883870 |
2016 | 0.734014 | 2.075615 | 1.844949 | 1.589495 | 3.801859 |
Notice that in the display above, the row labels seem to have two levels now.
The outer (or left-most) level is named country
and the inner (or
right-most) level is named year
.
When a DataFrame’s index has multiple levels, we (and the pandas documentation) refer to the DataFrame as having a hierarchical index.
5.3.2.2. Slicing a Hierarchical Index#
Now, we can answer our friend’s questions in a much more straightforward way.
wdi.loc[("United States", 2010), "GDP"]
14.992052727
wdi.loc[(["United Kingdom", "Germany"], 2010), "GDP"]
country year
United Kingdom 2010 2.452900
Germany 2010 3.417095
Name: GDP, dtype: float64
As shown above, we can use wdi.loc
to extract different slices of our
national accounts data.
The rules for using .loc
with a hierarchically-indexed DataFrame are
similar to the ones we’ve learned for standard DataFrames, but they are a bit
more elaborate as we now have more structure to our data.
We will summarize the main rules, and then work through an exercise that demonstrates each of them.
Slicing rules
pandas slicing reacts differently to list
s and tuple
s.
It does this to provide more flexibility to select the data you want.
list
in row slicing will be an “or” operation, where it chooses rows
based on whether the index value corresponds to any element of the list.
tuple
in row slicing will be used to denote a single hierarchical
index and must include a value for each level.
Row slicing examples
wdi.loc["United States"]
: all rows where the outer most index value is equal toUnited States
wdi.loc[("United States", 2010)]
: all rows where the outer-most index value is equal to"United States
and the second level is equal to2010
wdi.loc[["United States", "Canada"]]
: all rows where the outer-most index is either"United States"
or"Canada"
wdi.loc[(["United States", "Canada"], [2010, 2011]), :]
: all rows where the outer-most index is either"United States
or"Canada"
AND where the second level index is either2010
or2011
wdi.loc[[("United States", 2010), ("Canada", 2011)], :]
: all rows where the the two hierarchical indices are either("United States", 2010)
or("Canada", 2011)
We can also restrict .loc
to extract certain columns by doing:
wdi.loc[rows, GDP]
: return the rows specified by rows (see rules above) and only column namedGDP
(returned object will be a Series)df.loc[rows, ["GDP", "Consumption"]]
: return the rows specified by rows (see rules above) and only columnsGDP
andConsumption
5.3.2.3. Alignment with MultiIndex
#
The data alignment features we talked about above also apply to a
MultiIndex
DataFrame.
5.3.2.4. pd.IndexSlice
#
When we want to extract rows for a few values of the outer index and all
values for an inner index level, we can use the convenient
df.loc[[id11, id22]]
shorthand.
We can use this notation to extract all the data for the United States and Canada.
wdi.loc[["United States", "Canada"]]
GovExpend | Consumption | Exports | Imports | GDP | ||
---|---|---|---|---|---|---|
country | year | |||||
United States | 2017 | 2.405743 | 12.019266 | 2.287071 | 3.069954 | 17.348627 |
2016 | 2.407981 | 11.722133 | 2.219937 | 2.936004 | 16.972348 | |
2015 | 2.373130 | 11.409800 | 2.222228 | 2.881337 | 16.710459 | |
2014 | 2.334071 | 11.000619 | 2.209555 | 2.732228 | 16.242526 | |
2013 | 2.353381 | 10.687214 | 2.118639 | 2.600198 | 15.853796 | |
2012 | 2.398873 | 10.534042 | 2.045509 | 2.560677 | 15.567038 | |
2011 | 2.434378 | 10.378060 | 1.978083 | 2.493194 | 15.224555 | |
2010 | 2.510143 | 10.185836 | 1.846280 | 2.360183 | 14.992053 | |
2009 | 2.507390 | 10.010687 | 1.646432 | 2.086299 | 14.617299 | |
2008 | 2.407771 | 10.137847 | 1.797347 | 2.400349 | 14.997756 | |
2007 | 2.351987 | 10.159387 | 1.701096 | 2.455016 | 15.018268 | |
2006 | 2.314957 | 9.938503 | 1.564920 | 2.395189 | 14.741688 | |
2005 | 2.287022 | 9.643098 | 1.431205 | 2.246246 | 14.332500 | |
2004 | 2.267999 | 9.311431 | 1.335978 | 2.108585 | 13.846058 | |
2003 | 2.233519 | 8.974708 | 1.218199 | 1.892825 | 13.339312 | |
2002 | 2.193188 | 8.698306 | 1.192180 | 1.804105 | 12.968263 | |
2001 | 2.112038 | 8.480461 | 1.213253 | 1.740797 | 12.746262 | |
2000 | 2.040500 | 8.272097 | 1.287739 | 1.790995 | 12.620268 | |
Canada | 2017 | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
2016 | 0.364899 | 1.058426 | 0.576394 | 0.575775 | 1.814016 | |
2015 | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 | |
2014 | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 | |
2013 | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 | |
2012 | 0.354342 | 0.961226 | 0.505969 | 0.547756 | 1.693428 | |
2011 | 0.351887 | 0.943145 | 0.492349 | 0.528227 | 1.664240 | |
2010 | 0.347332 | 0.921952 | 0.469949 | 0.500341 | 1.613543 | |
2009 | 0.339686 | 0.890078 | 0.440692 | 0.439796 | 1.565291 | |
2008 | 0.330766 | 0.889602 | 0.506350 | 0.502281 | 1.612862 | |
2007 | 0.318777 | 0.864012 | 0.530453 | 0.498002 | 1.596876 | |
2006 | 0.311382 | 0.827643 | 0.524461 | 0.470931 | 1.564608 | |
2005 | 0.303043 | 0.794390 | 0.519950 | 0.447222 | 1.524608 | |
2004 | 0.299854 | 0.764357 | 0.508657 | 0.416754 | 1.477317 | |
2003 | 0.294335 | 0.741796 | 0.481993 | 0.384199 | 1.433089 | |
2002 | 0.286094 | 0.721974 | 0.490465 | 0.368615 | 1.407725 | |
2001 | 0.279767 | 0.694230 | 0.484696 | 0.362023 | 1.366590 | |
2000 | 0.270553 | 0.677713 | 0.499526 | 0.380823 | 1.342805 |
However, suppose we wanted to extract the data for all countries, but only the years 2005, 2007, and 2009.
We cannot do this using wdi.loc
because the year is on the second level,
not outer-most level of our index.
To get around this limitation, we can use the pd.IndexSlice
helper.
Here’s an example.
wdi.loc[pd.IndexSlice[:, [2005, 2007, 2009]], :]
GovExpend | Consumption | Exports | Imports | GDP | ||
---|---|---|---|---|---|---|
country | year | |||||
Canada | 2005 | 0.303043 | 0.794390 | 0.519950 | 0.447222 | 1.524608 |
Germany | 2005 | 0.591184 | 1.866253 | 1.175200 | 1.028094 | 3.213777 |
United Kingdom | 2005 | 0.490806 | 1.578914 | 0.640088 | 0.715951 | 2.403352 |
United States | 2005 | 2.287022 | 9.643098 | 1.431205 | 2.246246 | 14.332500 |
Canada | 2007 | 0.318777 | 0.864012 | 0.530453 | 0.498002 | 1.596876 |
Germany | 2007 | 0.605624 | 1.894219 | 1.442436 | 1.213835 | 3.441356 |
United Kingdom | 2007 | 0.504549 | 1.644789 | 0.710200 | 0.767699 | 2.527327 |
United States | 2007 | 2.351987 | 10.159387 | 1.701096 | 2.455016 | 15.018268 |
Canada | 2009 | 0.339686 | 0.890078 | 0.440692 | 0.439796 | 1.565291 |
Germany | 2009 | 0.645023 | 1.908393 | 1.260525 | 1.121914 | 3.283144 |
United Kingdom | 2009 | 0.519716 | 1.587152 | 0.653830 | 0.689011 | 2.411632 |
United States | 2009 | 2.507390 | 10.010687 | 1.646432 | 2.086299 | 14.617299 |
Notice that the :
in the first part of [:, ["A", "D"]]
instructed pandas to give us rows for all values of the outer most index
level and that the :
just before ]
said grab all the columns.
You can also select a slice but in this case it is very important that your dataframe is sorted with respect to the index
uncomment the code below to see the error that you get
#wdi.loc[pd.IndexSlice[:, 2005:2009], :]
Now lets properly sort out data set
wdi_sorted=wdi.sort_index()
wdi_sorted.loc[pd.IndexSlice[:, 2005:2009], :]
GovExpend | Consumption | Exports | Imports | GDP | ||
---|---|---|---|---|---|---|
country | year | |||||
Canada | 2005 | 0.303043 | 0.794390 | 0.519950 | 0.447222 | 1.524608 |
2006 | 0.311382 | 0.827643 | 0.524461 | 0.470931 | 1.564608 | |
2007 | 0.318777 | 0.864012 | 0.530453 | 0.498002 | 1.596876 | |
2008 | 0.330766 | 0.889602 | 0.506350 | 0.502281 | 1.612862 | |
2009 | 0.339686 | 0.890078 | 0.440692 | 0.439796 | 1.565291 | |
Germany | 2005 | 0.591184 | 1.866253 | 1.175200 | 1.028094 | 3.213777 |
2006 | 0.596868 | 1.894219 | 1.319574 | 1.142552 | 3.332692 | |
2007 | 0.605624 | 1.894219 | 1.442436 | 1.213835 | 3.441356 | |
2008 | 0.626140 | 1.905520 | 1.470300 | 1.241057 | 3.478602 | |
2009 | 0.645023 | 1.908393 | 1.260525 | 1.121914 | 3.283144 | |
United Kingdom | 2005 | 0.490806 | 1.578914 | 0.640088 | 0.715951 | 2.403352 |
2006 | 0.499312 | 1.604404 | 0.717506 | 0.786134 | 2.464591 | |
2007 | 0.504549 | 1.644789 | 0.710200 | 0.767699 | 2.527327 | |
2008 | 0.513870 | 1.635333 | 0.713184 | 0.753502 | 2.518585 | |
2009 | 0.519716 | 1.587152 | 0.653830 | 0.689011 | 2.411632 | |
United States | 2005 | 2.287022 | 9.643098 | 1.431205 | 2.246246 | 14.332500 |
2006 | 2.314957 | 9.938503 | 1.564920 | 2.395189 | 14.741688 | |
2007 | 2.351987 | 10.159387 | 1.701096 | 2.455016 | 15.018268 | |
2008 | 2.407771 | 10.137847 | 1.797347 | 2.400349 | 14.997756 | |
2009 | 2.507390 | 10.010687 | 1.646432 | 2.086299 | 14.617299 |
5.3.2.5. Multi-index Columns#
The functionality of MultiIndex
also applies to the column names.
Let’s see how it works.
wdiT = wdi.T # .T means "transpose" or "swap rows and columns"
wdiT
country | Canada | ... | United States | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | ... | 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 |
GovExpend | 0.372665 | 0.364899 | 0.358303 | 0.353485 | 0.351541 | 0.354342 | 0.351887 | 0.347332 | 0.339686 | 0.330766 | ... | 2.507390 | 2.407771 | 2.351987 | 2.314957 | 2.287022 | 2.267999 | 2.233519 | 2.193188 | 2.112038 | 2.040500 |
Consumption | 1.095475 | 1.058426 | 1.035208 | 1.011988 | 0.986400 | 0.961226 | 0.943145 | 0.921952 | 0.890078 | 0.889602 | ... | 10.010687 | 10.137847 | 10.159387 | 9.938503 | 9.643098 | 9.311431 | 8.974708 | 8.698306 | 8.480461 | 8.272097 |
Exports | 0.582831 | 0.576394 | 0.568859 | 0.550323 | 0.518040 | 0.505969 | 0.492349 | 0.469949 | 0.440692 | 0.506350 | ... | 1.646432 | 1.797347 | 1.701096 | 1.564920 | 1.431205 | 1.335978 | 1.218199 | 1.192180 | 1.213253 | 1.287739 |
Imports | 0.600031 | 0.575775 | 0.575793 | 0.572344 | 0.558636 | 0.547756 | 0.528227 | 0.500341 | 0.439796 | 0.502281 | ... | 2.086299 | 2.400349 | 2.455016 | 2.395189 | 2.246246 | 2.108585 | 1.892825 | 1.804105 | 1.740797 | 1.790995 |
GDP | 1.868164 | 1.814016 | 1.794270 | 1.782252 | 1.732714 | 1.693428 | 1.664240 | 1.613543 | 1.565291 | 1.612862 | ... | 14.617299 | 14.997756 | 15.018268 | 14.741688 | 14.332500 | 13.846058 | 13.339312 | 12.968263 | 12.746262 | 12.620268 |
5 rows × 72 columns
Notice that wdiT
seems to have two levels of names for the columns.
The same logic laid out in the above row slicing rules applies when we have a hierarchical index for column names.
wdiT.loc[:, "United States"]
year | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GovExpend | 2.405743 | 2.407981 | 2.373130 | 2.334071 | 2.353381 | 2.398873 | 2.434378 | 2.510143 | 2.507390 | 2.407771 | 2.351987 | 2.314957 | 2.287022 | 2.267999 | 2.233519 | 2.193188 | 2.112038 | 2.040500 |
Consumption | 12.019266 | 11.722133 | 11.409800 | 11.000619 | 10.687214 | 10.534042 | 10.378060 | 10.185836 | 10.010687 | 10.137847 | 10.159387 | 9.938503 | 9.643098 | 9.311431 | 8.974708 | 8.698306 | 8.480461 | 8.272097 |
Exports | 2.287071 | 2.219937 | 2.222228 | 2.209555 | 2.118639 | 2.045509 | 1.978083 | 1.846280 | 1.646432 | 1.797347 | 1.701096 | 1.564920 | 1.431205 | 1.335978 | 1.218199 | 1.192180 | 1.213253 | 1.287739 |
Imports | 3.069954 | 2.936004 | 2.881337 | 2.732228 | 2.600198 | 2.560677 | 2.493194 | 2.360183 | 2.086299 | 2.400349 | 2.455016 | 2.395189 | 2.246246 | 2.108585 | 1.892825 | 1.804105 | 1.740797 | 1.790995 |
GDP | 17.348627 | 16.972348 | 16.710459 | 16.242526 | 15.853796 | 15.567038 | 15.224555 | 14.992053 | 14.617299 | 14.997756 | 15.018268 | 14.741688 | 14.332500 | 13.846058 | 13.339312 | 12.968263 | 12.746262 | 12.620268 |
wdiT.loc[:, ["United States", "Canada"]]
country | United States | ... | Canada | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | ... | 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 |
GovExpend | 2.405743 | 2.407981 | 2.373130 | 2.334071 | 2.353381 | 2.398873 | 2.434378 | 2.510143 | 2.507390 | 2.407771 | ... | 0.339686 | 0.330766 | 0.318777 | 0.311382 | 0.303043 | 0.299854 | 0.294335 | 0.286094 | 0.279767 | 0.270553 |
Consumption | 12.019266 | 11.722133 | 11.409800 | 11.000619 | 10.687214 | 10.534042 | 10.378060 | 10.185836 | 10.010687 | 10.137847 | ... | 0.890078 | 0.889602 | 0.864012 | 0.827643 | 0.794390 | 0.764357 | 0.741796 | 0.721974 | 0.694230 | 0.677713 |
Exports | 2.287071 | 2.219937 | 2.222228 | 2.209555 | 2.118639 | 2.045509 | 1.978083 | 1.846280 | 1.646432 | 1.797347 | ... | 0.440692 | 0.506350 | 0.530453 | 0.524461 | 0.519950 | 0.508657 | 0.481993 | 0.490465 | 0.484696 | 0.499526 |
Imports | 3.069954 | 2.936004 | 2.881337 | 2.732228 | 2.600198 | 2.560677 | 2.493194 | 2.360183 | 2.086299 | 2.400349 | ... | 0.439796 | 0.502281 | 0.498002 | 0.470931 | 0.447222 | 0.416754 | 0.384199 | 0.368615 | 0.362023 | 0.380823 |
GDP | 17.348627 | 16.972348 | 16.710459 | 16.242526 | 15.853796 | 15.567038 | 15.224555 | 14.992053 | 14.617299 | 14.997756 | ... | 1.565291 | 1.612862 | 1.596876 | 1.564608 | 1.524608 | 1.477317 | 1.433089 | 1.407725 | 1.366590 | 1.342805 |
5 rows × 36 columns
wdiT.loc[:, (["United States", "Canada"], [2008,2010])]
country | United States | Canada | ||
---|---|---|---|---|
year | 2008 | 2010 | 2008 | 2010 |
GovExpend | 2.407771 | 2.510143 | 0.330766 | 0.347332 |
Consumption | 10.137847 | 10.185836 | 0.889602 | 0.921952 |
Exports | 1.797347 | 1.846280 | 0.506350 | 0.469949 |
Imports | 2.400349 | 2.360183 | 0.502281 | 0.500341 |
GDP | 14.997756 | 14.992053 | 1.612862 | 1.613543 |
5.3.3. Re-setting the Index#
The df.reset_index
method will move one or more level of the index
back into the DataFrame as a normal column.
With no additional arguments, it moves all levels out of the index and
sets the index of the returned DataFrame to the default of
range(df.shape[0])
.
wdi.reset_index()
country | year | GovExpend | Consumption | Exports | Imports | GDP | |
---|---|---|---|---|---|---|---|
0 | Canada | 2017 | 0.372665 | 1.095475 | 0.582831 | 0.600031 | 1.868164 |
1 | Canada | 2016 | 0.364899 | 1.058426 | 0.576394 | 0.575775 | 1.814016 |
2 | Canada | 2015 | 0.358303 | 1.035208 | 0.568859 | 0.575793 | 1.794270 |
3 | Canada | 2014 | 0.353485 | 1.011988 | 0.550323 | 0.572344 | 1.782252 |
4 | Canada | 2013 | 0.351541 | 0.986400 | 0.518040 | 0.558636 | 1.732714 |
... | ... | ... | ... | ... | ... | ... | ... |
67 | United States | 2004 | 2.267999 | 9.311431 | 1.335978 | 2.108585 | 13.846058 |
68 | United States | 2003 | 2.233519 | 8.974708 | 1.218199 | 1.892825 | 13.339312 |
69 | United States | 2002 | 2.193188 | 8.698306 | 1.192180 | 1.804105 | 12.968263 |
70 | United States | 2001 | 2.112038 | 8.480461 | 1.213253 | 1.740797 | 12.746262 |
71 | United States | 2000 | 2.040500 | 8.272097 | 1.287739 | 1.790995 | 12.620268 |
72 rows × 7 columns
5.3.4. Choose the Index Carefully#
So, now that we know that we use index and column names for aligning data, “how should we pick the index?” is a natural question to ask.
To guide us to the right answer, we will list the first two components to Hadley Wickham’s description of tidy data:
Each column should each have one variable.
Each row should each have one observation.
If we strive to have our data in a tidy form (we should), then when choosing the index, we should set:
the row labels (index) to be a unique identifier for an observation of data
the column names to identify one variable
For example, suppose we are looking data on interest rates.
Each column might represent one bond or asset and each row might represent the date.
Using hierarchical row and column indices allows us to store higher dimensional data in our (inherently) two dimensional DataFrame.
5.3.4.1. Know Your Goal#
The correct column(s) to choose for the index often depends on the context of your analysis.
For example, if I were studying how GDP and consumption evolved over time for various countries, I would want time (year) and country name on the index
On the other hand, if I were trying to look at the differences across countries and variables within a particular year, I may opt to put the country and variable on the index and have years be columns.
Following the tidy data rules above and thinking about how you intend to use the data – and a little practice – will enable you to consistently select the correct index.