
Overview

To gain some familiarity with the method, this notebook works through the EOF method step by step on synthetic data. We will also see clearly some of the limitations of the method, including mode mixing (Chen et al., 2017) and the generation of unphysical modes (Dommenget & Latif, 2002).

Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| Foundations NumPy section | Necessary | |
| Intro to EOFs | Helpful | |
  • Time to learn: 20 minutes

Imports

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import CenteredNorm

Set the domain

Assume a domain with two spatial dimensions, $x$ and $y$, and time $t$. For simplicity, we will start all dimensions at zero and increment by one up to some arbitrary maximum, here $x=9$, $y=19$, and $t=49$. Different dimension lengths will help highlight the shape of the data as we work through the EOF analysis. We are using numpy.mgrid to create our 3-dimensional domain.

x, y, t = np.mgrid[0:10, 0:20, 0:50]
nx, ny, nt = x.shape
x.shape
(10, 20, 50)
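As a quick aside, numpy.mgrid returns one array per dimension, each with the full 3-dimensional shape, varying along exactly one axis. A minimal check (illustrative):

```python
import numpy as np

x, y, t = np.mgrid[0:10, 0:20, 0:50]

# Each output has the full (10, 20, 50) shape and varies along one axis only.
assert x.shape == y.shape == t.shape == (10, 20, 50)
assert x[3, 0, 0] == 3 and y[0, 7, 0] == 7 and t[0, 0, 42] == 42
```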

Generate some data

Let’s generate some “modes of variability” in the domain. We will combine them and then use an EOF analysis to pull them back apart. First, here is a 2-dimensional Gaussian that oscillates between positive and negative with a period of 25 time steps:

mode_1 = 4*np.exp(-((x - np.mean(x))**2)/3 - ((y - np.mean(y))**2)/5) * np.cos(2*np.pi*t/25)
plt.pcolormesh(x[:, :, 0], y[:, :, 0], mode_1[:, :, 0], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('Mode 1 at $t=0$')
<Figure size 640x480 with 2 Axes>

To see how this mode changes over time, we can take the mean over the x-axis (a “zonal” mean) and plot it as a function of $y$ and $t$:

plt.pcolormesh(t[0, :, :], y[0, :, :], mode_1.mean(axis=0), cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$t$')
plt.title('Mode 1 zonal mean over time')
<Figure size 640x480 with 2 Axes>

Here is a second mode, which is the sum of cosines in $x$ and $y$ and oscillates between positive and negative with a period of 15 time steps. Since $\sin(0)=0$, Mode 2 vanishes at $t=0$, so we plot it at $t=2$ instead:

mode_2 = (np.cos(2*np.pi*x/(0.5*nx) + 0.5*nx) + np.cos(2*np.pi*y/(0.5*ny) + 0.5*ny)) * np.sin(2*np.pi*t/15)
plt.pcolormesh(x[:, :, 2], y[:, :, 2], mode_2[:, :, 2], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('Mode 2 at $t=2$')
<Figure size 640x480 with 2 Axes>
plt.pcolormesh(t[0, :, :], y[0, :, :], mode_2.mean(axis=0), cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$t$')
plt.title('Mode 2 zonal mean over time')
<Figure size 640x480 with 2 Axes>

Now, let’s add the modes together to get our full dataset:

data = mode_1 + mode_2

Visualize the combined data at $t=2$:

plt.pcolormesh(x[:, :, 2], y[:, :, 2], data[:, :, 2], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('Data at $t=2$')
<Figure size 640x480 with 2 Axes>

Prepare data for analysis

You can see that the modes of variability have mixed, just like in real geophysical data. The next step is to remove the time mean at every location. Since we just constructed the data and know its dimensions, we know time is axis=2. For the array broadcasting in the subtraction to work, we keep the time axis as a length-one dimension by passing keepdims=True to mean() (with labeled Xarray data this bookkeeping would be simpler):

data = data - data.mean(axis=2, keepdims=True)

The data then needs to be in the shape (space, time), so we stack the spatial dimensions $x, y$ into one dimension $s$. Knowing the sizes of $x$ and $y$ gives us the size of $s$: $10\times20=200$:

data_2d = data.reshape(200, 50)
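As a sanity check on the stacking (illustrative, with random stand-in data): a C-order reshape maps location $(i, j)$ to row $s = i \cdot n_y + j$, so each row of the 2-dimensional array holds the full time series of one grid point.

```python
import numpy as np

# Illustrative check with random stand-in data: a C-order reshape stacks the
# spatial dimensions so that row s = i*ny + j of the 2-D array holds the full
# time series of grid point (i, j).
nx, ny, nt = 10, 20, 50
rng = np.random.default_rng(42)
demo = rng.normal(size=(nx, ny, nt))
demo_2d = demo.reshape(nx*ny, nt)

i, j = 3, 7
assert np.array_equal(demo_2d[i*ny + j, :], demo[i, j, :])
```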

Do the analysis

First, compute the covariance matrix RR of the 2-dimensional data:

R = np.cov(data_2d)
R.shape
(200, 200)
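Under the hood, np.cov treats each row of data_2d as one variable (one grid point) and each column as one observation (one time step). A minimal sketch of the equivalent explicit computation, using random stand-in anomalies:

```python
import numpy as np

rng = np.random.default_rng(0)
anoms = rng.normal(size=(200, 50))   # stand-in for the (space, time) anomalies

# np.cov removes each row's mean and normalizes by (nt - 1).
A = anoms - anoms.mean(axis=1, keepdims=True)
R_manual = A @ A.T / (anoms.shape[1] - 1)

assert np.allclose(R_manual, np.cov(anoms))
```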

Now, we use numpy.linalg.eig() to find the eigenvalues eigval and eigenvectors eigvec of $R$. The eigenvectors will give us the EOFs (and, by projection, the PCs), while the eigenvalues tell us how much variance each eigenvector explains.

(eigval, eigvec) = np.linalg.eig(R)

The eigenpairs are not returned in any particular order. We want to sort them by eigenvalue, in descending order, since the eigenvalues tell us how important the eigenvectors are. Here, we first create an index i that sorts eigval in descending order, then apply it to both eigval and eigvec:

i = np.argsort(eigval)[::-1]
eigval = eigval[i]
eigvec = eigvec[:, i]

The eigendecomposition resulted in 200 eigenpairs, which is the size of $s$:

eigval.shape, eigvec.shape
((200,), (200, 200))

Not all of these modes will be important. There are more objective ways to decide how many to keep, but for now we will simply plot the leading eigenvalues and look for a clear separation or “drop-off”:

plt.bar(range(10), np.real(eigval[0:10]))
<BarContainer object of 10 artists>
<Figure size 640x480 with 1 Axes>

There is a pretty clear drop-off after the second eigenvalue, but let’s keep the first four to see the difference. Before truncating, we save the sum of all eigenvalues for the next step.

eigval_sum = np.real(np.sum(eigval))

r = 4
eigval = eigval[:r]
eigvec = eigvec[:, :r]

To transform the eigenvalues into explained variance, we need to divide by the sum of all eigenvalues and optionally multiply by 100 to get percentages:

pvar = np.real(eigval/eigval_sum*100)
pvar
array([6.75819377e+01, 3.21033246e+01, 2.54528647e-01, 6.02090872e-02])

Note that we could have also divided by the trace of RR:

np.real(eigval/np.trace(R)*100)
array([6.75819377e+01, 3.21033246e+01, 2.54528647e-01, 6.02090872e-02])
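For a more systematic truncation than eyeballing the bar chart, one common alternative is a cumulative explained-variance threshold. A minimal sketch with stand-in eigenvalues (sorted in descending order, as ours are):

```python
import numpy as np

# Stand-in eigenvalues, sorted in descending order.
vals = np.array([67.6, 32.1, 0.25, 0.06, 0.01])

# Keep the smallest number of modes whose cumulative share exceeds 95%.
cumvar = np.cumsum(vals) / vals.sum() * 100
r_keep = int(np.searchsorted(cumvar, 95.0) + 1)
# Here the first two modes already exceed the threshold, so r_keep == 2.
```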

To get the PCs, we multiply the 2-dimensional data by the eigenvectors. Looking at the shape of the arrays can help us figure out if we need to transpose (.T) anything. What shape do we expect the PC data to have? There are four of them with 50 time steps each, so (4, 50) or (50, 4).

data_2d.shape, eigvec.shape
((200, 50), (200, 4))

We need to transpose the data:

pcs = np.real(np.dot(data_2d.T, eigvec))

The PCs have dimensions (time, mode):

pcs.shape
(50, 4)

Let’s look at the PCs:

plt.plot(pcs[:, 0], label='PC1')
plt.plot(pcs[:, 1], label='PC2')
plt.plot(pcs[:, 2], label='PC3', linestyle=':')
plt.plot(pcs[:, 3], label='PC4', linestyle=':')
plt.legend(loc='upper left')
plt.ylabel('PC')
plt.xlabel('$t$')
plt.xlim(0, nt-1)
plt.title('First four PCs')
<Figure size 640x480 with 1 Axes>

The EOFs are just the eigenvectors, but we need to reshape them back into the two spatial dimensions $x, y$:

eofs = np.real(eigvec.reshape(10, 20, r))

Let’s look at the leading EOF:

plt.pcolormesh(x[:, :, 0], y[:, :, 0], eofs[:, :, 0], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('EOF1 (%.1f%%)' %pvar[0])
<Figure size 640x480 with 2 Axes>

This looks like Mode 2, but with opposite sign and smaller magnitude. We can attempt to reconstruct Mode 2 at $t=2$ by multiplying this EOF by the corresponding PC value:

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

mode_2_reco_plot = ax[0].pcolormesh(x[:, :, 0], y[:, :, 0], eofs[:, :, 0]*pcs[2, 0], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(mode_2_reco_plot, ax=ax[0])
ax[0].set_ylabel('$y$')
ax[0].set_xlabel('$x$')
ax[0].set_title('Reconstructed Mode 2 at $t=2$')

mode_2_orig_plot = ax[1].pcolormesh(x[:, :, 2], y[:, :, 2], mode_2[:, :, 2], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(mode_2_orig_plot, ax=ax[1])
ax[1].set_ylabel('$y$')
ax[1].set_xlabel('$x$')
ax[1].set_title('Original Mode 2 at $t=2$')
<Figure size 1000x500 with 4 Axes>

Here is the next EOF:

plt.pcolormesh(x[:, :, 0], y[:, :, 0], eofs[:, :, 1], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('EOF2 (%.1f%%)' %pvar[1])
<Figure size 640x480 with 2 Axes>

This is Mode 1, plus some noise and some apparent contamination (mode mixing) from Mode 2. We could reconstruct this mode as well, but instead let’s take a look at the next two modes:

plt.pcolormesh(x[:, :, 0], y[:, :, 0], eofs[:, :, 2], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('EOF3 (%.1f%%)' %pvar[2])
<Figure size 640x480 with 2 Axes>
plt.pcolormesh(x[:, :, 0], y[:, :, 0], eofs[:, :, 3], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('EOF4 (%.1f%%)' %pvar[3])
<Figure size 640x480 with 2 Axes>

EOF3 looks like some sort of residual of Mode 1, while EOF4 shows zonal stripes that were not in our original data. Even though there is structure here, we should not interpret these as meaningful or physical modes of variability in the data; they are artifacts of the method, particularly of the orthogonality constraints. The very low explained variance is another indication that these modes are probably not meaningful.

PC3 oscillates with a period of about 10 time steps, and EOF3 shows a pattern we might plausibly expect from data like ours, so without a priori knowledge of the data we might have been tempted to interpret this mode as meaningful. We therefore have to be careful when interpreting the modes found with EOF analyses.

Exercise

Change the period of Mode 2 from 15 time steps to 20 time steps, bringing it closer in frequency to Mode 1. What happens?

Bonus 1: adding noise

Real geophysical data is rarely so clean. Let’s add some noise with a similar amplitude to the modes and see what happens. If the modes we constructed were modes of climate variability, this noise would be weather.

noisy_data = mode_1 + mode_2 + np.random.default_rng().normal(loc=0.0, scale=1.0, size=(nx, ny, nt))

Let’s look at the noisy data at $t=2$ for comparison:

plt.pcolormesh(x[:, :, 2], y[:, :, 2], noisy_data[:, :, 2], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar()
plt.ylabel('$y$')
plt.xlabel('$x$')
plt.title('Noisy data at $t=2$')
<Figure size 640x480 with 2 Axes>

The modes aren’t nearly as apparent with the noise added, just like real climate data.

noisy_data = noisy_data - noisy_data.mean(axis=2, keepdims=True)
noisy_data_2d = noisy_data.reshape(200, 50)

n_R = np.cov(noisy_data_2d)
(n_eigval, n_eigvec) = np.linalg.eig(n_R)

n_i = np.argsort(n_eigval)[::-1]
n_eigval = n_eigval[n_i]
n_eigvec = n_eigvec[:, n_i]
r = 4
n_eigval = n_eigval[:r]
n_eigvec = n_eigvec[:, :r]

n_pvar = np.real(n_eigval/np.trace(n_R)*100)
n_pcs = np.real(np.dot(noisy_data_2d.T, n_eigvec))
n_eofs = np.real(n_eigvec.reshape(10, 20, r))
plt.plot(n_pcs[:, 0], label='PC1')
plt.plot(n_pcs[:, 1], label='PC2')
plt.plot(n_pcs[:, 2], label='PC3', linestyle=':')
plt.plot(n_pcs[:, 3], label='PC4', linestyle=':')
plt.legend(loc='upper left')
plt.ylabel('PC')
plt.xlabel('$t$')
plt.xlim(0, nt-1)
plt.title('First four PCs (noisy data)')
<Figure size 640x480 with 1 Axes>
n_fig, n_ax = plt.subplots(2, 2, figsize=(10, 5), layout='constrained')

noisy_eof1_plot = n_ax[0, 0].pcolormesh(x[:, :, 0], y[:, :, 0], n_eofs[:, :, 0], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(noisy_eof1_plot, ax=n_ax[0, 0])
n_ax[0, 0].set_ylabel('$y$')
n_ax[0, 0].set_title('Noisy EOF1 (%.1f%%)' %n_pvar[0])

noisy_eof2_plot = n_ax[0, 1].pcolormesh(x[:, :, 0], y[:, :, 0], n_eofs[:, :, 1], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(noisy_eof2_plot, ax=n_ax[0, 1])
n_ax[0, 1].set_title('Noisy EOF2 (%.1f%%)' %n_pvar[1])

noisy_eof3_plot = n_ax[1, 0].pcolormesh(x[:, :, 0], y[:, :, 0], n_eofs[:, :, 2], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(noisy_eof3_plot, ax=n_ax[1, 0])
n_ax[1, 0].set_ylabel('$y$')
n_ax[1, 0].set_xlabel('$x$')
n_ax[1, 0].set_title('Noisy EOF3 (%.1f%%)' %n_pvar[2])

noisy_eof4_plot = n_ax[1, 1].pcolormesh(x[:, :, 0], y[:, :, 0], n_eofs[:, :, 3], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(noisy_eof4_plot, ax=n_ax[1, 1])
n_ax[1, 1].set_xlabel('$x$')
n_ax[1, 1].set_title('Noisy EOF4 (%.1f%%)' %n_pvar[3])
<Figure size 1000x500 with 8 Axes>

The EOF analysis has again successfully identified our two modes. The third and fourth modes explain more variance than in the clean example, but the spatial patterns strongly suggest that they are just noise.

Bonus 2: method of snapshots

Sometimes you have a lot more locations than time steps in your data, making computing the covariance matrix impractical, if not impossible. For example, consider a few years of 0.1° gridded climate data formatted as 32-bit floats. That’s $(3600\times1800)^2\times4\approx1.68\times10^{14}$ bytes, or about 168 TB, required to hold the covariance matrix in memory.

To get around this issue, we turn to the method of snapshots (Sirovich, 1987). The only change in our simple example is that we compute the covariance of the transposed data, resulting in a matrix of dimension (time, time), which would be much more manageable in our 0.1° data example. Here is the result for our clean two-mode data:

s_R = np.cov(data_2d.T)
(s_eigval, s_eigvec) = np.linalg.eig(s_R)

s_i = np.argsort(s_eigval)[::-1]
s_eigval = s_eigval[s_i]
s_eigvec = s_eigvec[:, s_i]
s_eigval = s_eigval[:r]
s_eigvec = s_eigvec[:, :r]

s_pvar = np.real(s_eigval/np.trace(s_R)*100)
s_pcs = np.real(s_eigvec)
s_eofs = np.real(np.dot(data_2d, s_eigvec)).reshape(10, 20, r)
plt.plot(s_pcs[:, 0], label='PC1')
plt.plot(s_pcs[:, 1], label='PC2')
plt.plot(s_pcs[:, 2], label='PC3', linestyle=':')
plt.plot(s_pcs[:, 3], label='PC4', linestyle=':')
plt.legend(loc='upper left')
plt.ylabel('PC')
plt.xlabel('$t$')
plt.xlim(0, nt-1)
plt.title('First four PCs (method of snapshots)')
<Figure size 640x480 with 1 Axes>
s_fig, s_ax = plt.subplots(2, 2, figsize=(10, 5), layout='constrained')

mos_eof1_plot = s_ax[0, 0].pcolormesh(x[:, :, 0], y[:, :, 0], s_eofs[:, :, 0], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(mos_eof1_plot, ax=s_ax[0, 0])
s_ax[0, 0].set_ylabel('$y$')
s_ax[0, 0].set_title('EOF1 (%.1f%%)' %s_pvar[0])

mos_eof2_plot = s_ax[0, 1].pcolormesh(x[:, :, 0], y[:, :, 0], s_eofs[:, :, 1], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(mos_eof2_plot, ax=s_ax[0, 1])
s_ax[0, 1].set_title('EOF2 (%.1f%%)' %s_pvar[1])

mos_eof3_plot = s_ax[1, 0].pcolormesh(x[:, :, 0], y[:, :, 0], s_eofs[:, :, 2], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(mos_eof3_plot, ax=s_ax[1, 0])
s_ax[1, 0].set_ylabel('$y$')
s_ax[1, 0].set_xlabel('$x$')
s_ax[1, 0].set_title('EOF3 (%.1f%%)' %s_pvar[2])

mos_eof4_plot = s_ax[1, 1].pcolormesh(x[:, :, 0], y[:, :, 0], s_eofs[:, :, 3], cmap='RdBu_r', norm=CenteredNorm())
plt.colorbar(mos_eof4_plot, ax=s_ax[1, 1])
s_ax[1, 1].set_xlabel('$x$')
s_ax[1, 1].set_title('EOF4 (%.1f%%)' %s_pvar[3])
<Figure size 1000x500 with 8 Axes>
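Why does this work? For an anomaly matrix $A$ of shape (space, time), $AA^T$ and $A^TA$ share their nonzero eigenvalues, so the much smaller (time, time) problem recovers the same leading spectrum. A minimal check with random stand-in anomalies:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))
A = A - A.mean(axis=1, keepdims=True)   # (space, time) anomalies

# The nonzero eigenvalues of A @ A.T (space, space) and A.T @ A (time, time)
# are identical; only the number of zero eigenvalues differs.
ev_space = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
ev_time = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Demeaning in time leaves at most nt - 1 = 49 nonzero eigenvalues.
assert np.allclose(ev_space[:49], ev_time[:49])
```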

Summary

In this notebook, we implemented the EOF method on some synthetic data using NumPy. First, we generated two simple modes of variability and combined them. Then, our EOF analysis was able to largely isolate the modes as the first two EOF/PC pairs.

What’s next?

In the next notebook, we will use a package called xeofs to find climate modes in real sea surface temperature data.

References
  1. Chen, X., Wallace, J. M., & Tung, K.-K. (2017). Pairwise-Rotated EOFs of Global SST. Journal of Climate, 30(14), 5473–5489. https://doi.org/10.1175/JCLI-D-16-0786.1
  2. Dommenget, D., & Latif, M. (2002). A Cautionary Note on the Interpretation of EOFs. Journal of Climate, 15(2), 216–225. https://doi.org/10.1175/1520-0442(2002)015<0216:ACNOTI>2.0.CO;2
  3. Sirovich, L. (1987). Turbulence and the dynamics of coherent structures. I. Coherent structures. Quarterly of Applied Mathematics, 45(3), 561–571. https://doi.org/10.1090/qam/910462