Reading in CESM output
Overview
Output from one run of CESM is the main dataset we'll be looking at in this cookbook. Let's learn how to read it in. Note that this is just one of the forms CESM output can take: this run has been post-processed, so the data are in the form of "time-series" files, where each file stores one variable across the full timespan of the run. Before this processing, CESM actually outputs data in the form of "history" files instead, where each file contains all variables over a shorter time slice. We won't dive into the specifics of CESM data processing here, but this Jupyter book from the CESM tutorial has some more info!
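To make the distinction concrete, here is a sketch of the two layouts. The filenames below are illustrative placeholders, not the actual names from this run:

```python
# Time-series layout (what this cookbook uses):
# one variable per file, spanning the full run.
timeseries_files = [
    "run.pop.h.TEMP.000101-006012.nc",  # all years, TEMP only
    "run.pop.h.SALT.000101-006012.nc",  # all years, SALT only
]

# History layout (raw CESM output):
# all variables per file, one short time slice each.
history_files = [
    "run.pop.h.0001-01.nc",  # every variable, one month
    "run.pop.h.0001-02.nc",  # every variable, the next month
]

print(len(timeseries_files), len(history_files))
```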
Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| | Necessary | |
Time to learn: 5 min
Imports
import xarray as xr
import glob
import s3fs
import netCDF4
Loading our data into xarray
Our data is stored in the cloud on Jetstream2. We build a list of file paths, then use xarray's open_mfdataset() function to load all the files into a single xarray Dataset, dropping a few variables whose coordinates don't align with the rest of the dataset.
jetstream_url = 'https://js2.jetstream-cloud.org:8001/'
s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=jetstream_url))
# Generate a list of all files in CESM folder
s3path = 's3://pythia/ocean-bgc/cesm/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch/ocn/proc/tseries/month_1/*'
remote_files = s3.glob(s3path)
# Open all files from folder
fileset = [s3.open(file) for file in remote_files]
# Open with xarray
ds = xr.open_mfdataset(fileset, data_vars="minimal", coords='minimal', compat="override", parallel=True,
drop_variables=["transport_components", "transport_regions", 'moc_components'], decode_times=True)
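If xarray cannot guess a backend for file-like objects, you can select one explicitly with the engine parameter. Here is a minimal, self-contained sketch of that pattern using in-memory buffers and the scipy backend (the cloud files themselves would need whichever installed netCDF engine matches their format):

```python
import io

import xarray as xr

# Two tiny single-variable datasets, standing in for time-series files.
ds1 = xr.Dataset({"TEMP": ("time", [1.0, 2.0])}, coords={"time": [0, 1]})
ds2 = xr.Dataset({"TEMP": ("time", [3.0, 4.0])}, coords={"time": [2, 3]})

# Serialize to in-memory buffers, mimicking the objects s3.open() returns.
buffers = [io.BytesIO(d.to_netcdf()) for d in (ds1, ds2)]

# Passing engine= skips xarray's backend guessing for file-like inputs.
ds_example = xr.open_mfdataset(buffers, engine="scipy", combine="by_coords")
print(ds_example.TEMP.values)
```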
ds
Looks good!
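Once loaded, the Dataset can be sliced lazily before anything is read into memory. A minimal sketch with a synthetic dataset — the variable and dimension names here are placeholders, not necessarily the names in the CESM output:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one time-series variable (names are hypothetical).
ds_demo = xr.Dataset(
    {"TEMP": (("time", "nlat", "nlon"), np.arange(24.0).reshape(2, 3, 4))},
    coords={"time": [0, 1]},
)

# Select the first time step, then average over the spatial dimensions.
first_step = ds_demo["TEMP"].isel(time=0)
print(first_step.mean().item())  # mean of 0..11 -> 5.5
```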
Summary
You’ve learned how to read in CESM output, which we’ll be using for all the following notebooks in this cookbook.