4.5. Introduction to Numpy#

4.5.1. Numpy Arrays#

Now that we have learned the fundamentals of programming in Python, we will learn how we can use Python to perform the computations required in data science and economics. We call these the “scientific Python tools”.

The foundational library that helps us perform these computations is known as numpy (numerical Python).

Numpy’s core contribution is a new data-type called an array.

An array is similar to a list, but numpy imposes some additional restrictions on how the data inside is organized.

These restrictions allow numpy to

  1. Be more efficient in performing mathematical and scientific computations.

  2. Expose functions that allow numpy to do the necessary linear algebra for machine learning and statistics.

Before we get started, please note that the convention for importing the numpy package is to use the nickname np:

import numpy as np

4.5.1.1. What is an Array?#

An array is a multi-dimensional grid of values.

What does this mean? It is easier to demonstrate than to explain.

In this block of code, we build a 1-dimensional array.

# create an array from a list
x_1d = np.array([1, 2, 3])
print(x_1d)
[1 2 3]

You can think of a 1-dimensional array as a list of numbers.

# We can index like we did with lists
print(x_1d[0])
print(x_1d[0:2])
1
[1 2]

Note that the range of indices does not include the end-point, that is

print(x_1d[0:3] == x_1d[:])
print(x_1d[0:2])
[ True  True  True]
[1 2]

The differences emerge as we move into higher dimensions.

Next, we define a 2-dimensional array (a matrix)

x_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x_2d)
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Notice that the data is no longer represented as something flat, but rather, as three rows and three columns of numbers.

The first question that you might ask yourself is: “how do I access the values in this array?”

You access each element by specifying a row first and then a column. For example, if we wanted to access the 6, we would ask for the (1, 2) element.

print(x_2d[1, 2])  # Indexing into two dimensions!
6

Or to get the top left corner…

print(x_2d[0, 0])  # Indexing into two dimensions!
1

To get the first, and then second rows…

print(x_2d[0, :])
print(x_2d[1, :])
[1 2 3]
[4 5 6]

Or the columns…

print(x_2d[:, 0])
print(x_2d[:, 1])
[1 4 7]
[2 5 8]

This continues to generalize, since numpy gives us as many dimensions as we want in an array.

4.5.1.1.1. Array Indexing#

Now that there are multiple dimensions, indexing might feel somewhat non-obvious.

Do the rows or columns come first? In higher dimensions, what is the order of the index?

Notice that the array is built using a list of lists (you could also use tuples!).

Indexing into the array will correspond to choosing elements from each list.

First, notice that the dimensions give two stacked matrices, which we can access with

print(x_2d[0])
print(x_2d[1])
[1 2 3]
[4 5 6]

In the case of the first, it is synonymous with

print(x_2d[0, :])
print(x_2d[1, :])
[1 2 3]
[4 5 6]

Notice that we put a : on the dimension where we want to select all of the elements. We can also slice out subsets of the elements by doing start:stop+1.

Notice how the following arrays differ.

print(x_2d[:, 0])
print(x_2d[:, 0:2])
print(x_2d[:, :2])  # the 0  in 0:2 is optional
[1 4 7]
[[1 2]
 [4 5]
 [7 8]]
[[1 2]
 [4 5]
 [7 8]]

4.5.2. Array Functionality#

4.5.2.1. Array Properties#

All numpy arrays have various useful properties.

Properties are similar to methods in that they’re accessed through the “dot notation.” However, they aren’t a function so we don’t need parentheses.

The two most frequently used properties are shape and dtype.

shape tells us how many elements are in each array dimension.

dtype tells us the types of an array’s elements.

Let’s do some examples to see these properties in action.

x = np.array([[1, 2, 3], [4, 5, 6]])
print(x.shape)
print(x.dtype)
(2, 3)
int32

We’ll use this to practice unpacking a tuple, like x.shape, directly into variables.

rows, columns = x.shape
print(f"rows = {rows}, columns = {columns}")
rows = 2, columns = 3
x = np.array([True, False, True])
print(x.shape)
print(x.dtype)
(3,)
bool

Note that in the above, the (3,) represents a tuple of length 1, distinct from a scalar integer 3.

x = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
])
print(x.shape)
print(x.dtype)
(2, 3, 2)
float64

4.5.2.2. Creating Arrays#

It’s usually impractical to define arrays by hand as we have done so far.

We’ll often need to create an array with default values and then fill it with other values.

We can create arrays with the functions np.zeros and np.ones.

Both functions take a tuple that denotes the shape of an array and creates an array filled with 0s or 1s respectively.

sizes = (2, 3, 4)
x = np.zeros(sizes) # note, a tuple!
x
array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]])
y = np.ones((4))
y
array([1., 1., 1., 1.])

4.5.2.3. Broadcasting Operations#

Two types of operations that will be useful for arrays of any dimension are:

  1. Operations between an array and a single number.

  2. Operations between two arrays of the same shape.

When we perform operations on an array by using a single number, we simply apply that operation to every element of the array.

# Using np.ones to create an array
x = np.ones((2, 2))
print("x = ", x)
print("2 + x = ", 2 + x)
print("2 - x = ", 2 - x)
print("2 * x = ", 2 * x)
print("x / 2 = ", x / 2)
x =  [[1. 1.]
 [1. 1.]]
2 + x =  [[3. 3.]
 [3. 3.]]
2 - x =  [[1. 1.]
 [1. 1.]]
2 * x =  [[2. 2.]
 [2. 2.]]
x / 2 =  [[0.5 0.5]
 [0.5 0.5]]

Operations between two arrays of the same size, in this case (2, 2), simply apply the operation element-wise between the arrays.

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.ones((2, 2))
print("x = ", x)
print("y = ", y)
print("x + y = ", x + y)
print("x - y", x - y)
print("(elementwise) x * y = ", x * y)
print("(elementwise) x / y = ", x / y)
x =  [[1. 2.]
 [3. 4.]]
y =  [[1. 1.]
 [1. 1.]]
x + y =  [[2. 3.]
 [4. 5.]]
x - y [[0. 1.]
 [2. 3.]]
(elementwise) x * y =  [[1. 2.]
 [3. 4.]]
(elementwise) x / y =  [[1. 2.]
 [3. 4.]]

4.5.2.4. Dot Product#

Another operation very frequently used in data science is the dot product.

The dot between \( x \) and \( y \) is written \( x \cdot y \) and is equal to \( \sum_{i=1}^N x_i y_i \).

(We will use this one a lot!)

We can use @ to denote dot products (and matrix multiplication which we’ll see soon!).

print("Dot product with @", x @ y)
Dot product with @ 32

4.5.3. Visualizing Vectors#

A (N-element) vector is \( N \) numbers stored together.

We typically write a vector as \( x = \begin{bmatrix} x_1 \\ x_2 \\ \dots \\ x_N \end{bmatrix} \).

In numpy terms, a vector is a 1-dimensional array.

4.5.4. Matrices#

An \( N \times M \) matrix can be thought of as a collection of M N-element vectors stacked side-by-side as columns.

We write a matrix as

\[\begin{split} \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1M} \\ x_{21} & \dots & \dots & x_{2M} \\ \vdots & \vdots & \vdots & \vdots \\ x_{N1} & x_{N2} & \dots & x_{NM} \end{bmatrix} \end{split}\]

In numpy terms, a matrix is a 2-dimensional array.

We can create a matrix by passing a list of lists to the np.array function.

x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.ones((2, 3))
z = np.array([[1, 2], [3, 4], [5, 6]])

We can perform element-wise and scalar operations as we did with vectors. In fact, we can do these two operations on arrays of any dimension.

print("Element-wise Addition\n", x + y)
print("Element-wise Subtraction\n", x - y)
print("Element-wise Multiplication\n", x * y)
print("Element-wise Division\n", x / y)

print("Scalar Addition\n", 3 + x)
print("Scalar Subtraction\n", 3 - x)
print("Scalar Multiplication\n", 3 * x)
print("Scalar Division\n", 3 / x)
Element-wise Addition
 [[2. 3. 4.]
 [5. 6. 7.]]
Element-wise Subtraction
 [[0. 1. 2.]
 [3. 4. 5.]]
Element-wise Multiplication
 [[1. 2. 3.]
 [4. 5. 6.]]
Element-wise Division
 [[1. 2. 3.]
 [4. 5. 6.]]
Scalar Addition
 [[4 5 6]
 [7 8 9]]
Scalar Subtraction
 [[ 2  1  0]
 [-1 -2 -3]]
Scalar Multiplication
 [[ 3  6  9]
 [12 15 18]]
Scalar Division
 [[3.   1.5  1.  ]
 [0.75 0.6  0.5 ]]

4.5.4.1. Transpose#

A matrix transpose is an operation that flips all elements of a matrix along the diagonal.

More formally, the \( (i, j) \) element of \( x \) becomes the \( (j, i) \) element of \( x^T \).

In particular, let \( x \) be given by

\[\begin{split} x = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ \end{bmatrix} \end{split}\]

then \( x \) transpose, written as \( x' \), is given by

\[\begin{split} x = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \\ \end{bmatrix} \end{split}\]

In Python, we do this by

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print("x transpose is")
print(x.transpose())
x transpose is
[[1 4 7]
 [2 5 8]
 [3 6 9]]

4.5.4.2. Identity Matrix#

In linear algebra, one particular matrix acts very similarly to how 1 behaves for scalar numbers.

This matrix is known as the identity matrix and is given by

\[\begin{split} I = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix} \end{split}\]

As seen above, it has 1s on the diagonal and 0s everywhere else.

When we multiply any matrix or vector by the identity matrix, we get the original matrix or vector back!

Let’s see some examples.

I = np.eye(3)
x = np.reshape(np.arange(9), (3, 3))
y = np.array([1, 2, 3])

print("I @ x", "\n", I @ x)
print("x @ I", "\n", x @ I)
print("I @ y", "\n", I @ y)
print("y @ I", "\n", y @ I)
I @ x 
 [[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]
x @ I 
 [[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]
I @ y 
 [1. 2. 3.]
y @ I 
 [1. 2. 3.]

4.5.4.3. Matrix Multiplication#

Similar to how we combine vectors with a dot product, matrices can do what we’ll call matrix multiplication.

Matrix multiplication is effectively a generalization of dot products.

Matrix multiplication: Let \( v = x \cdot y \) then we can write \( v_{ij} = \sum_{k=1}^N x_{ik} y_{kj} \) where \( x_{ij} \) is notation that denotes the element found in the ith row and jth column of the matrix \( x \).

The image below from Wikipedia, by Bilou, shows how matrix multiplication simplifies to a series of dot products:

image

After looking at the math and image above, you might have realized that matrix multiplication requires very specific matrix shapes!

For two matrices \( x, y \) to be multiplied, \( x \) must have the same number of columns as \( y \) has rows.

Formally, we require that for some integer numbers, \( M, N, \) and \( K \) that if \( x \) is \( N \times M \) then \( y \) must be \( M \times K \).

If we think of a vector as a \( 1 \times M \) or \( M \times 1 \) matrix, we can even do matrix multiplication between a matrix and a vector!

Let’s see some examples of this.

x1 = np.reshape(np.arange(6), (3, 2))
x2 = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
x3 = np.array([[2, 5, 2], [1, 2, 1]])
x4 = np.ones((2, 3))

y1 = np.array([1, 2, 3])
y2 = np.array([0.5, 0.5])

Numpy allows us to do matrix multiplication in three ways.

print("Using @ for two matrices")
print(x1 @ x4)
Using @ for two matrices
[[1. 1. 1.]
 [5. 5. 5.]
 [9. 9. 9.]]
print("Using @ for vec and mat")
print(y1 @ x1)
Using @ for vec and mat
[16 22]

4.5.5. Universal Functions#

We will often need to transform data by applying a function to every element of an array.

Numpy has good support for these operations, called universal functions or ufuncs for short.

The numpy documentation has a list of all available ufuncs.

Note

You should think of operations between a single number and an array, as we just saw, as a ufunc.

Below, we will create an array that contains 10 points between 0 and 25.

# This is similar to range -- but spits out 50 evenly spaced points from 0.5
# to 25.
x = np.linspace(0.5, 25, 10)

We will experiment with some ufuncs below:

# Applies the sin function to each element of x
np.sin(x)
array([ 0.47942554, -0.08054223, -0.33229977,  0.68755122, -0.92364381,
        0.99966057, -0.9024271 ,  0.64879484, -0.28272056, -0.13235175])

You can use the inspector or the docstrings with np.<TAB> to see other available functions, such as

# Takes log of each element of x
np.log(x)
array([-0.69314718,  1.17007125,  1.78245708,  2.15948425,  2.43263822,
        2.64696251,  2.82336105,  2.97325942,  3.10358967,  3.21887582])

A benefit of using the numpy arrays is that numpy has succinct code for combining vectorized operations.

# Calculate log(z) * z elementwise
z = np.array([1,2,3])
np.log(z) * z
array([0.        , 1.38629436, 3.29583687])

4.5.6. Other Useful Array Operations#

We have barely scratched the surface of what is possible using numpy arrays.

We hope you will experiment with other functions from numpy and see how they work.

Below, we demonstrate a few more array operations that we find most useful – just to give you an idea of what else you might find.

When you’re attempting to do an operation that you feel should be common, the numpy library probably has it.

Use Google and tab completion to check this.

x = np.linspace(0, 25, 10)
np.mean(x)
12.5
np.std(x)
7.9785592313028175
# np.min, np.median, etc... are also defined
np.max(x)
25.0
np.diff(x)
array([2.77777778, 2.77777778, 2.77777778, 2.77777778, 2.77777778,
       2.77777778, 2.77777778, 2.77777778, 2.77777778])
np.reshape(x, (5, 2))
array([[ 0.        ,  2.77777778],
       [ 5.55555556,  8.33333333],
       [11.11111111, 13.88888889],
       [16.66666667, 19.44444444],
       [22.22222222, 25.        ]])

Note that many of these operations can be called as methods on x:

print(x.mean())
print(x.std())
print(x.max())
print(x.reshape((5, 2)))
12.5
7.9785592313028175
25.0
[[ 0.          2.77777778]
 [ 5.55555556  8.33333333]
 [11.11111111 13.88888889]
 [16.66666667 19.44444444]
 [22.22222222 25.        ]]

4.5.7. A list of some Array Methods#

Arrays have useful methods, all of which are carefully optimized

a = np.array((4, 3, 2, 1))
a
array([4, 3, 2, 1])
a.sort()              # Sorts a in place
a
array([1, 2, 3, 4])
a.sum()               # Sum
10
a.mean()              # Mean
2.5
a.max()               # Max
4
a.argmax()            # Returns the index of the maximal element
3
a.cumsum()            # Cumulative sum of the elements of a
array([ 1,  3,  6, 10], dtype=int32)
a.cumprod()           # Cumulative product of the elements of a
array([ 1,  2,  6, 24], dtype=int32)
a.var()               # Variance
1.25
a.std()               # Standard deviation
1.118033988749895
a.T                   # Equivalent to a.transpose()
array([1, 2, 3, 4])

Many of the methods discussed above have equivalent functions in the NumPy namespace

a = np.array((4, 3, 2, 1))
np.sum(a)
10
np.mean(a)
2.5

4.5.8. Mutability and Copying Arrays#

NumPy arrays are mutable data types, like Python lists.

In other words, their contents can be altered (mutated) in memory after initialization.

We already saw examples above.

Here’s another example:

a = np.array([42, 44])
a
array([42, 44])
a[-1] = 0  # Change last element to 0
a
array([42,  0])

Mutability leads to the following behavior (which can be shocking to MATLAB programmers…)

a = np.random.randn(3)
a
array([-2.74255688, -1.31106808, -1.56576428])
b = a
b[0] = 0.0
a
array([ 0.        , -1.31106808, -1.56576428])

What’s happened is that we have changed a by changing b.

The name b is bound to a and becomes just another reference to the array (the Python assignment model is described in more detail later in the course).

Hence, it has equal rights to make changes to that array.

This is in fact the most sensible default behavior!

It means that we pass around only pointers to data, rather than making copies.

Making copies is expensive in terms of both speed and memory.

4.5.8.1. Making Copies#

It is of course possible to make b an independent copy of a when required.

This can be done using np.copy

a = np.random.randn(3)
a
array([ 0.72461914, -0.79681709,  0.09462078])
b = np.copy(a)
b
array([ 0.72461914, -0.79681709,  0.09462078])

Now b is an independent copy (called a deep copy)

b[:] = 1
b
array([1., 1., 1.])
a
array([ 0.72461914, -0.79681709,  0.09462078])

Note that the change to b has not affected a.

4.5.9. Comparisons#

As a rule, comparisons on arrays are done element-wise

z = np.array([2, 3])
y = np.array([2, 3])
z == y
array([ True,  True])
y[0] = 5
z == y
array([False,  True])
z != y
array([ True, False])

The situation is similar for >, <, >= and <=.

We can also do comparisons against scalars

z = np.linspace(0, 10, 5)
z
array([ 0. ,  2.5,  5. ,  7.5, 10. ])
z > 3
array([False, False,  True,  True,  True])

This is particularly useful for conditional extraction

b = z > 3
b
array([False, False,  True,  True,  True])
z[b]
array([ 5. ,  7.5, 10. ])

Of course we can—and frequently do—perform this in one step

z[z > 3]
array([ 5. ,  7.5, 10. ])