MITgcm ECCOv4 Example
Overview
This Jupyter notebook demonstrates how to use xarray and xgcm to analyze data from the ECCO v4r3 ocean state estimate.
Loading ECCO zarr data and converting to an xarray dataset
Visualize ocean depth using cartopy
Indexing and selecting data using xarray
Use dask cluster to speed up reading the data
Calculate and plot the horizontally integrated heat content anomaly
Use xgcm to compute the time-mean convergence of vertically-integrated heat fluxes
Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
|  | Helpful |  |
|  | Helpful | Slicing, indexing, basic statistics |
|  | Helpful |  |
Time to learn: 1 hour
Imports
import xarray as xr
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import intake
import cartopy as cart
import pyresample
from dask_gateway import GatewayCluster
from dask.distributed import Client
import xgcm
Load the data
The ECCOv4r3 data was converted from its raw MDS (.data / .meta file) format to zarr format using the xmitgcm package. Zarr is a powerful data storage format that can be thought of as an alternative to HDF. In contrast to HDF, zarr works very well with cloud object storage. Zarr is currently usable in Python, Java, C++, and Julia. It is likely that zarr will form the basis of the next major version of the netCDF library.
If you’re curious, here are some resources to learn more about zarr:
https://zarr.readthedocs.io/en/stable/tutorial.html
https://speakerdeck.com/rabernat/pangeo-zarr-cloud-data-storage
https://mrocklin.github.com/blog/work/2018/02/06/hdf-in-the-cloud
The ECCO zarr data currently lives in Google Cloud Storage as part of the Pangeo Data Catalog. This means we can open the whole dataset using one line of code.
This takes a bit of time to run because the metadata must be downloaded and parsed. The type of object returned is an Xarray dataset.
cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ds = cat.ECCOv4r3.to_dask()
ds
Note that no data has actually been downloaded yet. Xarray uses lazy evaluation: loading of data and execution of computations are delayed as long as possible (i.e., until data is actually needed for a plot). The data are represented symbolically as dask arrays. For example:
SALT (time, k, face, j, i) float32 dask.array<shape=(288, 50, 13, 90, 90), chunksize=(1, 50, 13, 90, 90)>
The full shape of the array is (288, 50, 13, 90, 90), quite large. But the chunksize is (1, 50, 13, 90, 90). Here the chunks correspond to the individual granules of data (objects) in cloud storage. The chunk is the minimum amount of data we can read at one time.
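The same layout can be reproduced on a small, self-contained example (the sizes here are made up and much smaller than the real SALT array):

```python
import numpy as np
import xarray as xr

# A dask-backed DataArray with one chunk per time step, mimicking the
# chunking of ds.SALT (dimension sizes are illustrative, not the real ones).
salt_like = xr.DataArray(
    np.zeros((12, 5, 13, 9, 9), dtype="float32"),
    dims=["time", "k", "face", "j", "i"],
).chunk({"time": 1})

print(salt_like.data.chunksize)   # (1, 5, 13, 9, 9)
print(salt_like.data.numblocks)   # (12, 1, 1, 1, 1): one chunk per time step
```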
# a trick to make things work a bit faster
coords = ds.coords.to_dataset().reset_coords()
ds = ds.reset_coords(drop=True)
Visualizing Data
A Direct Plot
Let’s try to visualize something simple: the Depth variable. Here is how the data are stored:
Depth (face, j, i) float32 dask.array<shape=(13, 90, 90), chunksize=(13, 90, 90)>
Although depth is a 2D field, there is an extra dimension (face) corresponding to the LLC face number. Let’s use xarray’s built-in plotting functions to plot each face individually.
coords.Depth.plot(col='face', col_wrap=5)
This view is not the most useful. It reflects how the data is arranged logically, rather than geographically.
A Pretty Map
To make plotting easier, we can define a quick function to plot the data in a more geographically friendly way. Eventually these plotting functions may be provided by the gcmplots package: xecco/gcmplots. For now, it is easy enough to roll our own.
class LLCMapper:

    def __init__(self, ds, dx=0.25, dy=0.25):
        # Extract LLC 2D coordinates
        lons_1d = ds.XC.values.ravel()
        lats_1d = ds.YC.values.ravel()

        # Define original grid
        self.orig_grid = pyresample.geometry.SwathDefinition(lons=lons_1d, lats=lats_1d)

        # Longitudes and latitudes to which we will interpolate
        lon_tmp = np.arange(-180, 180, dx) + dx/2
        lat_tmp = np.arange(-90, 90, dy) + dy/2

        # Define the lat/lon points of the target grid
        self.new_grid_lon, self.new_grid_lat = np.meshgrid(lon_tmp, lat_tmp)
        self.new_grid = pyresample.geometry.GridDefinition(lons=self.new_grid_lon,
                                                           lats=self.new_grid_lat)

    def __call__(self, da, ax=None, projection=cart.crs.Robinson(), lon_0=-60, **plt_kwargs):
        assert set(da.dims) == set(['face', 'j', 'i']), "da must have dimensions ['face', 'j', 'i']"

        if ax is None:
            fig, ax = plt.subplots(figsize=(12, 6), subplot_kw={'projection': projection})

        field = pyresample.kd_tree.resample_nearest(self.orig_grid, da.values,
                                                    self.new_grid,
                                                    radius_of_influence=100000,
                                                    fill_value=None)

        vmax = plt_kwargs.pop('vmax', field.max())
        vmin = plt_kwargs.pop('vmin', field.min())

        x, y = self.new_grid_lon, self.new_grid_lat

        # Find the index where the data must be split for this projection
        split_lon_idx = round(x.shape[1] / (360 / (lon_0 if lon_0 > 0 else lon_0 + 360)))

        # Plot the two halves separately to avoid artifacts at the seam
        ax.pcolormesh(x[:, :split_lon_idx], y[:, :split_lon_idx], field[:, :split_lon_idx],
                      vmax=vmax, vmin=vmin, transform=cart.crs.PlateCarree(), zorder=1, **plt_kwargs)
        p = ax.pcolormesh(x[:, split_lon_idx:], y[:, split_lon_idx:], field[:, split_lon_idx:],
                          vmax=vmax, vmin=vmin, transform=cart.crs.PlateCarree(), zorder=2, **plt_kwargs)

        ax.add_feature(cart.feature.LAND, facecolor='0.5', zorder=3)

        label = ''
        if da.name is not None:
            label = da.name
        if 'units' in da.attrs:
            label += ' [%s]' % da.attrs['units']
        plt.colorbar(p, shrink=0.4, label=label)
        return ax
mapper = LLCMapper(coords)
mapper(coords.Depth);
We can use this with any 2D cell-centered LLC variable.
Selecting data
The entire ECCOv4r3 dataset is contained in a single xarray.Dataset object. How do we access specific pieces of data? This is handled by Xarray’s indexing and selecting functions. To get the SST from January 2000, we do this:
sst = ds.THETA.sel(time='2000-01-15', k=0)
sst
Still no data has actually been downloaded. That doesn’t happen until we call .load() explicitly or try to make a plot.
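The same lazy behavior can be demonstrated with a toy dask array standing in for the cloud-backed variable:

```python
import dask.array as da

# Building the expression is instant; no values are computed yet.
lazy_mean = da.ones((1000, 1000), chunks=(100, 100)).mean()

# Only .compute() (or xarray's .load()) triggers the actual work.
result = lazy_mean.compute()
print(result)  # 1.0
```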
mapper(sst, cmap='RdBu_r');
Do some Calculations
Now let’s start doing something besides just plotting the existing data. For example, let’s calculate the time-mean SST.
mean_sst = ds.THETA.sel(k=0).mean(dim='time')
mean_sst
As usual, no data was loaded. Instead, mean_sst is a symbolic representation of the data that needs to be pulled and the computations that need to be executed to produce the desired result. In this case, the 288 original chunks all need to be read from cloud storage. Dask coordinates this automatically for us. But it does take some time.
%time mean_sst.load()
mapper(mean_sst, cmap='RdBu_r');
Speeding things up with a Dask Cluster
How can we speed things up? In general, the main bottleneck for this type of data analysis is the speed at which we can read the data. With cloud storage, access is highly parallelizable.
From a Pangeo environment, we can create a Dask cluster to spread the work out amongst many compute nodes. This works on both HPC and cloud. In the cloud, the compute nodes are provisioned on the fly and can be shut down as soon as we are done with our analysis.
The code below will create a cluster with five compute nodes. It can take a few minutes to provision our nodes.
cluster = GatewayCluster()
cluster.scale(5)
client = Client(cluster)
cluster
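GatewayCluster only works on deployments where dask-gateway is configured (e.g. a Pangeo cloud hub). On a laptop or a generic HPC node, a LocalCluster is a reasonable stand-in; the worker counts below are just examples:

```python
from dask.distributed import Client, LocalCluster

# Spin up workers on the local machine instead of a dask-gateway deployment.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
print(client.dashboard_link)  # link to the dask dashboard
```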
Now we re-run the mean calculation. Note how the dashboard helps us visualize what the cluster is doing.
%time ds.THETA.isel(k=0).mean(dim='time').load()
Spatially-Integrated Heat Content Anomaly
Now let’s do something harder. We will calculate the horizontally integrated heat content anomaly for the full 3D model domain.
# the monthly climatology
theta_clim = ds.THETA.groupby('time.month').mean(dim='time')
# the anomaly
theta_anom = ds.THETA.groupby('time.month') - theta_clim
rho0 = 1029
cp = 3994
ohc = rho0 * cp * (theta_anom *
coords.rA *
coords.hFacC).sum(dim=['face', 'j', 'i'])
ohc
# actually load the data
ohc.load()
# put the depth coordinate back for plotting purposes
ohc.coords['Z'] = coords.Z
ohc.swap_dims({'k': 'Z'}).transpose().plot(vmax=1e20)
Spatial Derivatives: Heat Budget
As our final exercise, we will do something much more complicated. We will compute the time-mean convergence of vertically-integrated heat fluxes. This is hard for several reasons.
The first reason it is hard is because it involves variables located at different grid points.
Following MITgcm conventions, xmitgcm (which produced this dataset) labels the center point with the coordinates j, i, the u-velocity point with j, i_g, and the v-velocity point with j_g, i.
The horizontal advective heat flux variables are
ADVx_TH (time, k, face, j, i_g) float32 dask.array<shape=(288, 50, 13, 90, 90), chunksize=(1, 50, 13, 90, 90)>
ADVy_TH (time, k, face, j_g, i) float32 dask.array<shape=(288, 50, 13, 90, 90), chunksize=(1, 50, 13, 90, 90)>
Xarray won’t allow us to add or multiply variables that have different dimensions, and xarray by itself doesn’t understand how to transform from one grid position to another.
That’s why xgcm was created.
Xgcm allows us to create a Grid object, which understands how to interpolate and take differences in a way that is compatible with finite volume models such as MITgcm. Xgcm also works with many other models, including ROMS, POP, MOM5/6, and NEMO.
A second reason this is hard is because of the complex topology connecting the different MITgcm faces. Fortunately xgcm also supports this.
# define the connectivity between faces
face_connections = {'face':
{0: {'X': ((12, 'Y', False), (3, 'X', False)),
'Y': (None, (1, 'Y', False))},
1: {'X': ((11, 'Y', False), (4, 'X', False)),
'Y': ((0, 'Y', False), (2, 'Y', False))},
2: {'X': ((10, 'Y', False), (5, 'X', False)),
'Y': ((1, 'Y', False), (6, 'X', False))},
3: {'X': ((0, 'X', False), (9, 'Y', False)),
'Y': (None, (4, 'Y', False))},
4: {'X': ((1, 'X', False), (8, 'Y', False)),
'Y': ((3, 'Y', False), (5, 'Y', False))},
5: {'X': ((2, 'X', False), (7, 'Y', False)),
'Y': ((4, 'Y', False), (6, 'Y', False))},
6: {'X': ((2, 'Y', False), (7, 'X', False)),
'Y': ((5, 'Y', False), (10, 'X', False))},
7: {'X': ((6, 'X', False), (8, 'X', False)),
'Y': ((5, 'X', False), (10, 'Y', False))},
8: {'X': ((7, 'X', False), (9, 'X', False)),
'Y': ((4, 'X', False), (11, 'Y', False))},
9: {'X': ((8, 'X', False), None),
'Y': ((3, 'X', False), (12, 'Y', False))},
10: {'X': ((6, 'Y', False), (11, 'X', False)),
'Y': ((7, 'Y', False), (2, 'X', False))},
11: {'X': ((10, 'X', False), (12, 'X', False)),
'Y': ((8, 'Y', False), (1, 'X', False))},
12: {'X': ((11, 'X', False), None),
'Y': ((9, 'Y', False), (0, 'X', False))}}}
# create the grid object
grid = xgcm.Grid(ds, periodic=False, face_connections=face_connections)
grid
Now we can use the grid object we created to take the divergence of a 2D vector:
# vertical integral and time mean of horizontal advective heat flux
advx_th_vint = ds.ADVx_TH.sum(dim='k').mean(dim='time')
advy_th_vint = ds.ADVy_TH.sum(dim='k').mean(dim='time')
# difference in the x and y directions
diff_ADV_th = grid.diff_2d_vector({'X': advx_th_vint, 'Y': advy_th_vint}, boundary='fill')
# convergence
conv_ADV_th = -diff_ADV_th['X'] - diff_ADV_th['Y']
conv_ADV_th
# vertical integral and time mean of horizontal diffusive heat flux
difx_th_vint = ds.DFxE_TH.sum(dim='k').mean(dim='time')
dify_th_vint = ds.DFyE_TH.sum(dim='k').mean(dim='time')
# difference in the x and y directions
diff_DIF_th = grid.diff_2d_vector({'X': difx_th_vint, 'Y': dify_th_vint}, boundary='fill')
# convergence
conv_DIF_th = -diff_DIF_th['X'] - diff_DIF_th['Y']
conv_DIF_th
# convert to Watts / m^2 and load
mean_adv_conv = rho0 * cp * (conv_ADV_th/coords.rA).fillna(0.).load()
mean_dif_conv = rho0 * cp * (conv_DIF_th/coords.rA).fillna(0.).load()
ax = mapper(mean_adv_conv, cmap='RdBu_r', vmax=300, vmin=-300);
ax.set_title(r'Convergence of Advective Flux (W/m$^2$)');
ax = mapper(mean_dif_conv, cmap='RdBu_r', vmax=300, vmin=-300)
ax.set_title(r'Convergence of Diffusive Flux (W/m$^2$)');
ax = mapper(mean_dif_conv + mean_adv_conv, cmap='RdBu_r', vmax=300, vmin=-300)
ax.set_title(r'Convergence of Net Horizontal Flux (W/m$^2$)');
ax = mapper(ds.TFLUX.mean(dim='time').load(), cmap='RdBu_r', vmax=300, vmin=-300);
ax.set_title(r'Surface Heat Flux (W/m$^2$)');
Summary
In this example we used xarray and cartopy to visualize ocean depth and ocean heat content anomalies. Then, we used xgcm to easily work with variables that have different dimensions.
What’s next?
In our last example, we will visualize ocean currents.
Resources and references
This notebook is based on the ECCOv4 example from the Pangeo physical oceanography gallery: http://gallery.pangeo.io/repos/pangeo-gallery/physical-oceanography/04_eccov4.html