# Introduction to `numpy`

:::{note}
This material is mostly adapted from the following resources:
- [Earth and Environmental Data Science: Numpy and Matplotlib](https://earth-env-data-science.github.io/lectures/basic_scipy/numpy_and_matplotlib.html)
- [Python Programming for Data Science: Chapter 5: Introduction to NumPy](https://www.tomasbeuzen.com/python-programming-for-data-science/chapters/chapter5-numpy.html)
- [Data Science for Energy System Modelling: Introduction to numpy and matplotlib](https://fneum.github.io/data-science-for-esm/02-workshop-numpy.html)
:::


<img src="https://numpy.org/images/logo.svg" width="100px" />

**NumPy** is the fundamental package for scientific computing with Python. It is the standard library for working with arrays (i.e., vectors and matrices), linear algebra, and other numerical computations.

- Website: <https://numpy.org/>
- GitHub: <https://github.com/numpy/numpy>

:::{note}
Documentation for this package is available at <https://numpy.org/doc/stable/index.html>.
:::


:::{note}
If you have not yet set up Python on your computer, you can execute this tutorial in your browser via [Google Colab](https://colab.research.google.com/). Click on the rocket in the top right corner and launch "Colab". If that doesn't work, download the `.ipynb` file and import it in [Google Colab](https://colab.research.google.com/).

Then install `numpy` by executing the following command in a Jupyter cell at the top of the notebook.

```sh
!pip install numpy
```
:::

## Importing a Package

This will be our first experience with _importing_ a package.

Usually we import `numpy` with the _alias_ `np`.

In [3]:
import numpy as np

## NDArrays

NDarrays (short for n-dimensional arrays) are a key data structure in `numpy`. NDarrays are similar to Python lists, but they allow fast, efficient computations on large arrays and matrices of numerical data. NDarrays can have any number of dimensions, and are used for a wide range of numerical and scientific computing tasks, including linear algebra, statistical analysis, and image processing.

Thus, the main differences between a numpy array and a `list` are the following:
- `numpy` arrays can have N dimensions (while lists only have 1)
- `numpy` arrays hold values of the same datatype (e.g. `int`, `float`), while lists can contain anything.
- `numpy` optimizes numerical operations on arrays. Numpy is _fast!_

<img src="https://predictivehacks.com/wp-content/uploads/2020/08/numpy_arrays.png" width="720px" />
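
The differences listed above can be seen in a small sketch. The variable names (`values`, `arr`, `mixed`) are made up for illustration and do not appear elsewhere in this tutorial.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Multiplying a list repeats it; multiplying an array scales each element.
doubled_list = values * 2   # [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2       # array([2, 4, 6])

# Mixed inputs are converted to one common dtype (here: float64).
mixed = np.array([1, 2.5, 3])

print(doubled_list)
print(doubled_arr)
print(mixed.dtype)
```

Because every element shares one datatype, `numpy` can apply the operation to the whole array at once instead of looping over Python objects.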

In [4]:
# create an array from a list
a = np.array([9, 0, 2, 1, 0])

:::{note}
If you're in Jupyter, you can use `<shift> + <tab>` to inspect a function.
:::

In [5]:
# find out the datatype
a.dtype

dtype('int32')

In [6]:
# find out the shape
a.shape

(5,)

In [7]:
# another array with a different datatype and shape
b = np.array([[5, 3, 1, 9], [9, 2, 3, 0]], dtype=np.float64)

In [8]:
# check dtype
b.dtype

dtype('float64')

In [9]:
# check shape
b.shape

(2, 4)
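
Besides `dtype` and `shape`, arrays carry a few other attributes that are often handy. The sketch below uses `ndim`, `size`, and `astype`, which are not introduced in the text above. Note also that the default integer dtype (`int32` in the output above) depends on the platform and `numpy` version, so you may see `int64` instead.

```python
import numpy as np

b = np.array([[5, 3, 1, 9], [9, 2, 3, 0]], dtype=np.float64)

# ndim is the number of dimensions, size the total number of elements.
print(b.ndim, b.size)

# astype returns a converted copy; the original array keeps its dtype.
b_int = b.astype(np.int64)
print(b_int.dtype, b.dtype)
```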

## Array Creation

There are lots of ways to create arrays.

In [10]:
np.zeros((4, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [11]:
np.ones((2, 2, 3))

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [12]:
np.full((3, 2), np.pi)

array([[3.14159265, 3.14159265],
       [3.14159265, 3.14159265],
       [3.14159265, 3.14159265]])

In [13]:
np.random.rand(5, 2)

array([[0.50363433, 0.79931259],
       [0.40116053, 0.57723486],
       [0.86872588, 0.00241641],
       [0.4258008 , 0.58526495],
       [0.02911971, 0.5366019 ]])

In [14]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
np.arange(2, 4, 0.25)

array([2.  , 2.25, 2.5 , 2.75, 3.  , 3.25, 3.5 , 3.75])

A frequent need is to generate an array of N numbers, evenly spaced between two values. That is what `linspace` is for.

In [16]:
np.linspace(2, 4, 20)

array([2.        , 2.10526316, 2.21052632, 2.31578947, 2.42105263,
       2.52631579, 2.63157895, 2.73684211, 2.84210526, 2.94736842,
       3.05263158, 3.15789474, 3.26315789, 3.36842105, 3.47368421,
       3.57894737, 3.68421053, 3.78947368, 3.89473684, 4.        ])
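
To make the contrast with `arange` concrete, here is a minimal sketch on a smaller range. The names `by_step` and `by_count` are illustrative only.

```python
import numpy as np

# arange takes a step size and excludes the stop value.
by_step = np.arange(2, 4, 0.5)

# linspace takes the number of points and includes the stop value by default.
by_count = np.linspace(2, 4, 5)

print(len(by_step), by_step[-1])
print(len(by_count), by_count[-1])
```

Use `arange` when the spacing matters and `linspace` when the number of points and both end values matter.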

Numpy also has some utilities for helping us generate multi-dimensional arrays.
For instance, `meshgrid` creates 2D arrays out of a combination of 1D arrays.

In [17]:
x = np.linspace(-2 * np.pi, 2 * np.pi, 5)
y = np.linspace(-np.pi, np.pi, 4)
xx, yy = np.meshgrid(x, y)
xx.shape, yy.shape

((4, 5), (4, 5))

In [18]:
yy

array([[-3.14159265, -3.14159265, -3.14159265, -3.14159265, -3.14159265],
       [-1.04719755, -1.04719755, -1.04719755, -1.04719755, -1.04719755],
       [ 1.04719755,  1.04719755,  1.04719755,  1.04719755,  1.04719755],
       [ 3.14159265,  3.14159265,  3.14159265,  3.14159265,  3.14159265]])

In [19]:
xx

array([[-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531],
       [-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531],
       [-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531],
       [-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531]])
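
The pattern in `xx` and `yy` may be easier to see with small integer inputs. The following sketch uses separate example arrays (`xs`, `ys`, `gx`, `gy`) so that it does not overwrite the variables defined above.

```python
import numpy as np

xs = np.array([0, 1, 2])
ys = np.array([10, 20])
gx, gy = np.meshgrid(xs, ys)

# Every row of gx is a copy of xs; every column of gy is a copy of ys.
print(gx)
print(gy)

# Element [i, j] pairs ys[i] with xs[j].
i, j = 1, 2
print(gx[i, j] == xs[j], gy[i, j] == ys[i])
```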

## Indexing

Basic indexing in `numpy` is similar to indexing lists, extended to multiple dimensions.

In [20]:
# get some individual elements of xx
xx[3, 4]

6.283185307179586

In [21]:
# get some whole rows
xx[0]

array([-6.28318531, -3.14159265,  0.        ,  3.14159265,  6.28318531])

In [22]:
# get some whole columns
xx[:, -1]

array([6.28318531, 6.28318531, 6.28318531, 6.28318531])

In [23]:
# get some ranges (also called slicing)
xx[0:2, 3:5]

array([[3.14159265, 6.28318531],
       [3.14159265, 6.28318531]])
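
As a rough comparison with plain Python lists, the sketch below indexes the same data stored as a nested list and as an array. The names `nested` and `arr` are illustrative.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# Lists need one pair of brackets per level; arrays accept comma-separated indices.
from_list = nested[1][2]
from_array = arr[1, 2]
print(from_list, from_array)

# A column cannot be sliced out of a nested list directly, but it can from an array.
first_column = arr[:, 0]
print(first_column)

# Negative indices and step sizes work as they do for lists.
last_row_every_other = arr[-1, ::2]
print(last_row_every_other)
```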

## Broadcasting


Not all the arrays we want to work with will have the same size!

__Broadcasting__ is a powerful feature in `numpy` that allows you to perform operations on arrays of different shapes and sizes. It automatically expands the smaller array to match the dimensions of the larger array, without actually making copies of the data, so that element-wise operations can be performed. This is done by following a set of rules that determine how the shapes of the arrays align.

Broadcasting allows you to vectorize operations and avoid explicit loops, leading to more concise and efficient code. It's particularly useful when working with large data sets, as it helps optimize memory usage and computational speed.

<img src="http://scipy-lectures.github.io/_images/numpy_broadcasting.png" width="720px" />

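Dimensions are automatically aligned _starting with the last dimension_.
Two dimensions are compatible if they have the same length or if one of them has length 1. If every aligned pair of dimensions is compatible, the two arrays can be broadcast; missing leading dimensions are treated as having length 1.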
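
A quick way to get a feel for the alignment is to try a few shape combinations. The sketch below assumes the usual `numpy` behaviour that aligned dimensions must either be equal or have length 1; the array names are made up for illustration.

```python
import numpy as np

base = np.ones((4, 5))

# (4, 5) with (5,): the trailing dimensions match, so this works.
shape_row = (base + np.ones(5)).shape

# (4, 5) with (4, 1): the length-1 dimension is stretched to length 5.
shape_col = (base + np.ones((4, 1))).shape

# (4, 5) with (4,): the trailing dimensions 5 and 4 differ, so this fails.
failed = False
try:
    base + np.ones(4)
except ValueError as err:
    failed = True
    print("cannot broadcast:", err)

print(shape_row, shape_col, failed)
```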

In [25]:
f = np.sin(xx) * np.cos(0.5 * yy)
print(f.shape, x.shape)

(4, 5) (5,)


In [26]:
g = f * x

In [27]:
print(g.shape)

(4, 5)
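
To see what the broadcast in `f * x` amounts to, one can compare it with an explicitly repeated array. This is only a sketch with stand-in data (`f_demo`, `x_demo`); `np.tile` is used here to build the repeated rows by hand, which broadcasting avoids.

```python
import numpy as np

f_demo = np.arange(20.0).reshape(4, 5)
x_demo = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# Broadcasting treats x_demo as if it had shape (1, 5) and reuses it for each row.
implicit = f_demo * x_demo

# The same result with the rows of x_demo physically repeated four times.
explicit = f_demo * np.tile(x_demo, (4, 1))

print(implicit.shape, np.array_equal(implicit, explicit))
```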


## Reduction Operations

In data science, we usually start with a lot of data and want to reduce it in order to make plots or summary tables.

There are many different reduction operations. The table below lists the most common functions:

| Reduction Operation | Description                                                  |
|---------------------|--------------------------------------------------------------|
| `numpy.sum()`       | Computes the sum of array elements over a given axis.         |
| `numpy.mean()`      | Computes the arithmetic mean along a specified axis.          |
| `numpy.min()`       | Computes the minimum value along a specified axis.            |
| `numpy.max()`       | Computes the maximum value along a specified axis.            |
| `numpy.prod()`      | Computes the product of array elements over a given axis.      |
| `numpy.std()`       | Computes the standard deviation along a specified axis.       |
| `numpy.var()`       | Computes the variance along a specified axis.                 |
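
The functions in the table can be tried on a small array. The sketch below uses an illustrative array called `data`; most reductions are available both as functions (`np.sum(data)`) and as array methods (`data.sum()`).

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# The function form and the method form give the same result.
print(np.sum(data), data.sum())
print(np.min(data), np.max(data))
print(np.mean(data), np.prod(data))

# The variance is the square of the standard deviation.
print(np.isclose(np.var(data), np.std(data) ** 2))
```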



In [28]:
# sum
g.sum()

-3.9982744542688894e-15

In [29]:
# mean
g.mean()

-1.9991372271344446e-16

In [30]:
# standard deviation
g.std()

5.809358098232775e-16

A key property of numpy reductions is the ability to operate on just one axis.

In [31]:
# apply on just one axis
g_ymean = g.mean(axis=0)
g_xmean = g.mean(axis=1)
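
The `axis` argument names the dimension that is collapsed by the reduction. A small sketch with an illustrative 2 x 3 array (`data`) shows the resulting shapes:

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])     # shape (2, 3)

col_means = data.mean(axis=0)    # collapses the rows    -> shape (3,)
row_means = data.mean(axis=1)    # collapses the columns -> shape (2,)

print(col_means.shape, col_means)
print(row_means.shape, row_means)
```

By the same logic, `g_ymean` above has one value per column of `g` and `g_xmean` one value per row.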

## Exercises

Import `numpy` under the alias `np`.

In [33]:
import numpy as np

Create the following arrays:

1. Create an array of 5 zeros.
2. Create an array of 10 ones.
3. Create an array of 5 $\pi$ values.
4. Create an array of the integers 1 to 20.
5. Create a 5 x 5 matrix of ones with a dtype `int`.

In [34]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [35]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [36]:
np.full(5, np.pi)

array([3.14159265, 3.14159265, 3.14159265, 3.14159265, 3.14159265])

In [37]:
np.arange(1, 21)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])

In [38]:
np.ones((5, 5), dtype=np.int8)

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]], dtype=int8)

Create a 3D matrix of 3 x 3 x 3 full of random numbers drawn from a standard normal distribution (hint: [`np.random.randn()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html))

In [39]:
np.random.randn(3, 3, 3)

array([[[-5.81969439e-01, -9.81112032e-01, -3.07642722e+00],
        [-3.65459504e-01,  1.69452852e+00, -1.35313103e+00],
        [ 8.43999092e-01,  8.99145753e-01, -5.13311942e-01]],

       [[ 8.11475844e-01, -5.96687619e-01, -1.50108961e+00],
        [ 6.49054741e-04, -6.75415803e-01,  9.42688351e-01],
        [-1.82053558e-01, -1.91309248e+00,  7.00414251e-01]],

       [[-1.17685536e+00,  1.58800658e+00, -4.56261583e-01],
        [-1.41860069e+00, -1.11481372e+00,  9.56395967e-01],
        [-1.11355790e+00,  1.91757095e+00,  2.83903428e-01]]])

Create an array of 20 linearly spaced numbers between 1 and 10.

In [40]:
np.linspace(1, 10, 20)

array([ 1.        ,  1.47368421,  1.94736842,  2.42105263,  2.89473684,
        3.36842105,  3.84210526,  4.31578947,  4.78947368,  5.26315789,
        5.73684211,  6.21052632,  6.68421053,  7.15789474,  7.63157895,
        8.10526316,  8.57894737,  9.05263158,  9.52631579, 10.        ])

Below I've defined an array of shape 5 x 5. Use indexing to produce the given outputs.

In [28]:
a = np.arange(1, 26).reshape(5, -1)
a

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [29]:
a[1:, 3:]

array([[ 9, 10],
       [14, 15],
       [19, 20],
       [24, 25]])

```python
array([[ 9, 10],
       [14, 15],
       [19, 20],
       [24, 25]])
```

In [30]:
a[1]

array([ 6,  7,  8,  9, 10])

```python
array([ 6,  7,  8,  9, 10])
```

In [31]:
a[2:4]

array([[11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])

```python
array([[11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])
```

In [32]:
a[1:3, 2:4]

array([[ 8,  9],
       [13, 14]])

```python
array([[ 8,  9],
       [13, 14]])
```

Calculate the sum of all the numbers in `a`.

In [33]:
a.sum()

325

Calculate the sum of each row in `a`.

In [34]:
a.sum(axis=1)

array([ 15,  40,  65,  90, 115])

Calculate the sum of each column in `a`.

In [35]:
a.sum(axis=0)

array([55, 60, 65, 70, 75])

Extract all values of `a` greater than the mean of `a` (hint: use a boolean mask).

In [36]:
a[a > a.mean()]

array([14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25])
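
It can help to look at the boolean mask on its own before using it as an index. The sketch below rebuilds `a` so that it is self-contained; `mask` and `selected` are illustrative names.

```python
import numpy as np

a = np.arange(1, 26).reshape(5, -1)

# The comparison yields a boolean array with the same shape as a.
mask = a > a.mean()
print(mask.shape, mask.sum())   # mask.sum() counts the True entries

# Indexing with the mask returns the selected values as a 1D array.
selected = a[mask]
print(selected)
```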