MITgcm ECCOv4 Example
Overview
This Jupyter notebook demonstrates how to use xarray and xgcm to analyze data from the ECCO v4r3 ocean state estimate.
Loading ECCO zarr data and converting to an xarray dataset
Visualize ocean depth using cartopy
Indexing and selecting data using xarray
Use dask cluster to speed up reading the data
Calculate and plot the horizontally integrated heat content anomaly
Use xgcm to compute the time-mean convergence of vertically-integrated heat fluxes
Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
|  | Helpful |  |
|  | Helpful | Slicing, indexing, basic statistics |
|  | Helpful |  |
Time to learn: 1 hour
Imports
import xarray as xr
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import intake
import cartopy as cart
import pyresample
from dask_gateway import GatewayCluster
from dask.distributed import Client
import xgcm
Load the data
The ECCOv4r3 data was converted from its raw MDS (.data / .meta file) format to zarr format using the xmitgcm package. Zarr is a powerful data storage format that can be thought of as an alternative to HDF. In contrast to HDF, zarr works very well with cloud object storage. Zarr is currently usable in Python, Java, C++, and Julia. It is likely that zarr will form the basis of the next major version of the netCDF library.
If you’re curious, here are some resources to learn more about zarr:
https://zarr.readthedocs.io/en/stable/tutorial.html
https://speakerdeck.com/rabernat/pangeo-zarr-cloud-data-storage
https://mrocklin.github.com/blog/work/2018/02/06/hdf-in-the-cloud
The ECCO zarr data currently lives in Google Cloud Storage as part of the Pangeo Data Catalog. This means we can open the whole dataset using one line of code.
This takes a bit of time to run because the metadata must be downloaded and parsed. The type of object returned is an Xarray dataset.
cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ds = cat.ECCOv4r3.to_dask()
ds
Note that no data has actually been downloaded yet. Xarray uses lazy evaluation: loading of data and execution of computations are delayed as long as possible (i.e., until data is actually needed for a plot). The data are represented symbolically as dask arrays. For example:
SALT (time, k, face, j, i) float32 dask.array<shape=(288, 50, 13, 90, 90), chunksize=(1, 50, 13, 90, 90)>
The full shape of the array is (288, 50, 13, 90, 90), quite large. But the chunksize is (1, 50, 13, 90, 90). Here the chunks correspond to the individual granules of data (objects) in cloud storage. The chunk is the minimum amount of data we can read at one time.
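The same layout can be reproduced on a small, self-contained example (the sizes here are made up and much smaller than the real SALT array):

```python
import numpy as np
import xarray as xr

# A dask-backed DataArray with one chunk per time step, mimicking the
# chunking of ds.SALT (dimension sizes are illustrative, not the real ones).
salt_like = xr.DataArray(
    np.zeros((12, 5, 13, 9, 9), dtype="float32"),
    dims=["time", "k", "face", "j", "i"],
).chunk({"time": 1})

print(salt_like.data.chunksize)   # (1, 5, 13, 9, 9)
print(salt_like.data.numblocks)   # (12, 1, 1, 1, 1): one chunk per time step
```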
# a trick to make things work a bit faster
coords = ds.coords.to_dataset().reset_coords()
ds = ds.reset_coords(drop=True)
Visualizing Data
A Direct Plot
Let’s try to visualize something simple: the Depth variable. Here is how the data are stored:
Depth (face, j, i) float32 dask.array<shape=(13, 90, 90), chunksize=(13, 90, 90)>
Although depth is a 2D field, there is an extra dimension (face) corresponding to the LLC face number. Let’s use xarray’s built-in plotting functions to plot each face individually.
coords.Depth.plot(col='face', col_wrap=5)
This view is not the most useful. It reflects how the data is arranged logically, rather than geographically.
A Pretty Map
To make plotting easier, we can define a quick function to plot the data in a more geographically friendly way. Eventually these plotting functions may be provided by the gcmplots package: xecco/gcmplots. For now, it is easy enough to roll our own.
class LLCMapper:

    def __init__(self, ds, dx=0.25, dy=0.25):
        # Extract LLC 2D coordinates
        lons_1d = ds.XC.values.ravel()
        lats_1d = ds.YC.values.ravel()

        # Define original grid
        self.orig_grid = pyresample.geometry.SwathDefinition(lons=lons_1d, lats=lats_1d)

        # Longitudes and latitudes to which we will interpolate
        lon_tmp = np.arange(-180, 180, dx) + dx/2
        lat_tmp = np.arange(-90, 90, dy) + dy/2

        # Define the lat/lon points of the target grid
        self.new_grid_lon, self.new_grid_lat = np.meshgrid(lon_tmp, lat_tmp)
        self.new_grid = pyresample.geometry.GridDefinition(lons=self.new_grid_lon,
                                                           lats=self.new_grid_lat)

    def __call__(self, da, ax=None, projection=cart.crs.Robinson(), lon_0=-60, **plt_kwargs):
        assert set(da.dims) == set(['face', 'j', 'i']), "da must have dimensions ['face', 'j', 'i']"

        if ax is None:
            fig, ax = plt.subplots(figsize=(12, 6), subplot_kw={'projection': projection})

        field = pyresample.kd_tree.resample_nearest(self.orig_grid, da.values,
                                                    self.new_grid,
                                                    radius_of_influence=100000,
                                                    fill_value=None)

        vmax = plt_kwargs.pop('vmax', field.max())
        vmin = plt_kwargs.pop('vmin', field.min())

        x, y = self.new_grid_lon, self.new_grid_lat

        # Find the index where the data must be split for this projection
        split_lon_idx = round(x.shape[1] / (360 / (lon_0 if lon_0 > 0 else lon_0 + 360)))

        # Plot the two halves separately to avoid artifacts at the seam
        ax.pcolormesh(x[:, :split_lon_idx], y[:, :split_lon_idx], field[:, :split_lon_idx],
                      vmax=vmax, vmin=vmin, transform=cart.crs.PlateCarree(), zorder=1, **plt_kwargs)
        p = ax.pcolormesh(x[:, split_lon_idx:], y[:, split_lon_idx:], field[:, split_lon_idx:],
                          vmax=vmax, vmin=vmin, transform=cart.crs.PlateCarree(), zorder=2, **plt_kwargs)

        ax.add_feature(cart.feature.LAND, facecolor='0.5', zorder=3)

        label = ''
        if da.name is not None:
            label = da.name
        if 'units' in da.attrs:
            label += ' [%s]' % da.attrs['units']
        plt.colorbar(p, shrink=0.4, label=label)
        return ax
mapper = LLCMapper(coords)
mapper(coords.Depth);
We can use this with any 2D cell-centered LLC variable.
Selecting data
The entire ECCOv4r3 dataset is contained in a single xarray.Dataset object. How do we access specific pieces of data? This is handled by Xarray’s indexing and selecting functions. To get the SST from January 2000, we do this:
sst = ds.THETA.sel(time='2000-01-15', k=0)
sst
Still no data has actually been downloaded. That doesn’t happen until we call .load() explicitly or try to make a plot.
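The same lazy behavior can be demonstrated with a toy dask array standing in for the cloud-backed variable:

```python
import dask.array as da

# Building the expression is instant; no values are computed yet.
lazy_mean = da.ones((1000, 1000), chunks=(100, 100)).mean()

# Only .compute() (or xarray's .load()) triggers the actual work.
result = lazy_mean.compute()
print(result)  # 1.0
```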
mapper(sst, cmap='RdBu_r');
Do some Calculations
Now let’s start doing something besides just plotting the existing data. For example, let’s calculate the time-mean SST.
mean_sst = ds.THETA.sel(k=0).mean(dim='time')
mean_sst
As usual, no data was loaded. Instead, mean_sst is a symbolic representation of the data that needs to be pulled and the computations that need to be executed to produce the desired result. In this case, the 288 original chunks all need to be read from cloud storage. Dask coordinates this automatically for us. But it does take some time.
%time mean_sst.load()
mapper(mean_sst, cmap='RdBu_r');
Speeding things up with a Dask Cluster
How can we speed things up? In general, the main bottleneck for this type of data analysis is the speed at which we can read the data. With cloud storage, access is highly parallelizable.
From a Pangeo environment, we can create a Dask cluster to spread the work out amongst many compute nodes. This works on both HPC and cloud. In the cloud, the compute nodes are provisioned on the fly and can be shut down as soon as we are done with our analysis.
The code below will create a cluster with five compute nodes. It can take a few minutes to provision our nodes.
cluster = GatewayCluster()
cluster.scale(5)
client = Client(cluster)
cluster
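GatewayCluster only works on deployments where dask-gateway is configured (e.g. a Pangeo cloud hub). On a laptop or a generic HPC node, a LocalCluster is a reasonable stand-in; the worker counts below are just examples:

```python
from dask.distributed import Client, LocalCluster

# Spin up workers on the local machine instead of a dask-gateway deployment.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
print(client.dashboard_link)  # link to the dask dashboard
```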
Now we re-run the mean calculation. Note how the dashboard helps us visualize what the cluster is doing.
%time ds.THETA.isel(k=0).mean(dim='time').load()
Spatially-Integrated Heat Content Anomaly
Now let’s do something harder. We will calculate the horizontally integrated heat content anomaly for the full 3D model domain.
# the monthly climatology
theta_clim = ds.THETA.groupby('time.month').mean(dim='time')
# the anomaly
theta_anom = ds.THETA.groupby('time.month') - theta_clim
rho0 = 1029
cp = 3994
ohc = rho0 * cp * (theta_anom *
coords.rA *
coords.hFacC).sum(dim=['face', 'j', 'i'])
ohc
# actually load the data
ohc.load()
# put the depth coordinate back for plotting purposes
ohc.coords['Z'] = coords.Z
ohc.swap_dims({'k': 'Z'}).transpose().plot(vmax=1e20)
Spatial Derivatives: Heat Budget
As our final exercise, we will do something much more complicated. We will compute the time-mean convergence of vertically-integrated heat fluxes. This is hard for several reasons.
The first reason it is hard is because it involves variables located at different grid points.
Following MITgcm conventions, xmitgcm (which produced this dataset) labels the center point with the coordinates j, i, the u-velocity point with j, i_g, and the v-velocity point with j_g, i.
The horizontal advective heat flux variables are
ADVx_TH (time, k, face, j, i_g) float32 dask.array<shape=(288, 50, 13, 90, 90), chunksize=(1, 50, 13, 90, 90)>
ADVy_TH (time, k, face, j_g, i) float32 dask.array<shape=(288, 50, 13, 90, 90), chunksize=(1, 50, 13, 90, 90)>
Xarray won’t allow us to add or multiply variables that have different dimensions, and xarray by itself doesn’t understand how to transform from one grid position to another.
That’s why xgcm was created.
Xgcm allows us to create a Grid object, which understands how to interpolate and take differences in a way that is compatible with finite volume models such as MITgcm. Xgcm also works with many other models, including ROMS, POP, MOM5/6, and NEMO.
A second reason this is hard is because of the complex topology connecting the different MITgcm faces. Fortunately xgcm also supports this.
# define the connectivity between faces
face_connections = {'face':
{0: {'X': ((12, 'Y', False), (3, 'X', False)),
'Y': (None, (1, 'Y', False))},
1: {'X': ((11, 'Y', False), (4, 'X', False)),
'Y': ((0, 'Y', False), (2, 'Y', False))},
2: {'X': ((10, 'Y', False), (5, 'X', False)),
'Y': ((1, 'Y', False), (6, 'X', False))},
3: {'X': ((0, 'X', False), (9, 'Y', False)),
'Y': (None, (4, 'Y', False))},
4: {'X': ((1, 'X', False), (8, 'Y', False)),
'Y': ((3, 'Y', False), (5, 'Y', False))},
5: {'X': ((2, 'X', False), (7, 'Y', False)),
'Y': ((4, 'Y', False), (6, 'Y', False))},
6: {'X': ((2, 'Y', False), (7, 'X', False)),
'Y': ((5, 'Y', False), (10, 'X', False))},
7: {'X': ((6, 'X', False), (8, 'X', False)),
'Y': ((5, 'X', False), (10, 'Y', False))},
8: {'X': ((7, 'X', False), (9, 'X', False)),
'Y': ((4, 'X', False), (11, 'Y', False))},
9: {'X': ((8, 'X', False), None),
'Y': ((3, 'X', False), (12, 'Y', False))},
10: {'X': ((6, 'Y', False), (11, 'X', False)),
'Y': ((7, 'Y', False), (2, 'X', False))},
11: {'X': ((10, 'X', False), (12, 'X', False)),
'Y': ((8, 'Y', False), (1, 'X', False))},
12: {'X': ((11, 'X', False), None),
'Y': ((9, 'Y', False), (0, 'X', False))}}}
# create the grid object
grid = xgcm.Grid(ds, periodic=False, face_connections=face_connections)
grid
Now we can use the grid object we created to take the divergence of a 2D vector:
# vertical integral and time mean of horizontal advective heat flux
advx_th_vint = ds.ADVx_TH.sum(dim='k').mean(dim='time')
advy_th_vint = ds.ADVy_TH.sum(dim='k').mean(dim='time')
# difference in the x and y directions
diff_ADV_th = grid.diff_2d_vector({'X': advx_th_vint, 'Y': advy_th_vint}, boundary='fill')
# convergence
conv_ADV_th = -diff_ADV_th['X'] - diff_ADV_th['Y']
conv_ADV_th
# vertical integral and time mean of horizontal diffusive heat flux
difx_th_vint = ds.DFxE_TH.sum(dim='k').mean(dim='time')
dify_th_vint = ds.DFyE_TH.sum(dim='k').mean(dim='time')
# difference in the x and y directions
diff_DIF_th = grid.diff_2d_vector({'X': difx_th_vint, 'Y': dify_th_vint}, boundary='fill')
# convergence
conv_DIF_th = -diff_DIF_th['X'] - diff_DIF_th['Y']
conv_DIF_th
# convert to Watts / m^2 and load
mean_adv_conv = rho0 * cp * (conv_ADV_th/coords.rA).fillna(0.).load()
mean_dif_conv = rho0 * cp * (conv_DIF_th/coords.rA).fillna(0.).load()
ax = mapper(mean_adv_conv, cmap='RdBu_r', vmax=300, vmin=-300);
ax.set_title(r'Convergence of Advective Flux (W/m$^2$)');
ax = mapper(mean_dif_conv, cmap='RdBu_r', vmax=300, vmin=-300)
ax.set_title(r'Convergence of Diffusive Flux (W/m$^2$)');
ax = mapper(mean_dif_conv + mean_adv_conv, cmap='RdBu_r', vmax=300, vmin=-300)
ax.set_title(r'Convergence of Net Horizontal Flux (W/m$^2$)');
ax = mapper(ds.TFLUX.mean(dim='time').load(), cmap='RdBu_r', vmax=300, vmin=-300);
ax.set_title(r'Surface Heat Flux (W/m$^2$)');
Summary
In this example we used xarray and cartopy to visualize ocean depth and ocean heat content anomalies. Then, we used xgcm to easily work with variables that have different dimensions.
What’s next?
In our last example, we will visualize ocean currents.
Resources and references
This notebook is based on the ECCOv4 example from the Pangeo physical oceanography gallery: http://gallery.pangeo.io/repos/pangeo-gallery/physical-oceanography/04_eccov4.html