Reading in CESM output


Overview

Output from a single run of CESM is the main dataset we'll be working with in this cookbook, so let's learn how to read it in. Note that this is just one of the forms CESM output can take. This run has been post-processed, so the data are in the form of "time-series" files, where each file stores one variable across the full timespan of the run. Before this processing, CESM actually writes its output as "history" files, where each file contains all variables over a shorter time slice. We won't dive into the specifics of CESM data processing here, but this Jupyter book from the CESM tutorial has some more info!
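To make the distinction concrete, here is a sketch of the two naming patterns in the usual CESM/POP style (these filenames are illustrative only, not the actual files from this run):

# History files: all variables for one short time slice (here, one month) per file
#   casename.pop.h.0001-01.nc
#   casename.pop.h.0001-02.nc
# Time-series files: one variable (e.g. TEMP) across the full run per file
#   casename.pop.h.TEMP.000101-006112.nc
#   casename.pop.h.SALT.000101-006112.nc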

Prerequisites

Concepts          Importance    Notes
Intro to Xarray   Necessary

  • Time to learn: 5 min


Imports

import xarray as xr
import glob
import s3fs
import netCDF4

Loading our data into xarray

Our data is stored in the cloud on Jetstream2. We list the file paths with s3fs, open each one as a file-like object, then use xarray’s open_mfdataset() function to combine them all into a single xarray Dataset, dropping a few variables whose coordinates don’t line up with the rest of the dataset.

jetstream_url = 'https://js2.jetstream-cloud.org:8001/'

s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=jetstream_url))

# Generate a list of all files in CESM folder
s3path = 's3://pythia/ocean-bgc/cesm/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch/ocn/proc/tseries/month_1/*'
remote_files = s3.glob(s3path)

# Open all files from folder
fileset = [s3.open(file) for file in remote_files]

# Open with xarray
ds = xr.open_mfdataset(fileset, data_vars="minimal", coords='minimal', compat="override", parallel=True,
                       drop_variables=["transport_components", "transport_regions", 'moc_components'], decode_times=True)
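
If open_mfdataset() can’t guess an IO backend for these file-like objects, you can name one explicitly via the engine argument. The sketch below assumes the time-series files are netCDF4/HDF5, which is what the h5netcdf backend reads; otherwise it is the same call as above.

# Same call as above, but with the backend named explicitly rather than guessed
ds = xr.open_mfdataset(
    fileset,
    engine="h5netcdf",  # assumption: the files are netCDF4/HDF5 format
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
    drop_variables=["transport_components", "transport_regions", "moc_components"],
    decode_times=True,
)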
ds

Looks good!
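
As an extra sanity check, you can poke around the Dataset with standard xarray tools. Nothing below assumes particular variable names, only that the monthly output has a time coordinate:

# Quick look at what we loaded
print(list(ds.data_vars)[:10])                       # first ten variable names
print(ds.sizes)                                      # dimension sizes
print(ds.time.values[0], "->", ds.time.values[-1])   # span of the time axis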


Summary

You’ve learned how to read in CESM output, which we’ll be using for all the following notebooks in this cookbook.

Resources and references