
Reproducing Key Figures from Kay et al. (2015)


Overview

This notebook demonstrates how one might use the NCAR Community Earth System Model (CESM) Large Ensemble (LENS) data hosted on AWS S3. It shows how to reproduce Figures 2 and 4 from Kay et al. (2015), the paper describing the CESM LENS dataset.

This resource is intended to be helpful for people who are not familiar with elements of the Pangeo framework, including Jupyter Notebooks, Xarray, and the Zarr data format, or with the original paper, so it includes additional explanation.

Prerequisites

Concepts          Importance    Notes
Intro to Xarray   Necessary
Dask              Helpful
  • Time to learn: 30 minutes


NOTE: In this notebook, we access very large cloud-served datasets and use Dask to parallelize our workflow. The end-to-end execution time may be on the order of an hour or more, depending on your computing resources.

Imports

import sys
import warnings
warnings.filterwarnings("ignore")

import intake
import matplotlib.pyplot as plt
from dask.distributed import Client
import numpy as np
import pandas as pd
import xarray as xr
import cmaps  # for NCL colormaps
import cartopy.crs as ccrs
import dask
import s3fs
dask.config.set({"distributed.scheduler.worker-saturation": 1.0})

Create and Connect to Dask Distributed Cluster

Here we’ll use a Dask distributed cluster to parallelize our analysis.

platform = sys.platform

if (platform == 'win32'):
    import multiprocessing.popen_spawn_win32
else:
    import multiprocessing.popen_spawn_posix
client = Client()
client
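If you want explicit control over the number of workers and memory per worker, you could construct the cluster yourself instead of calling Client() with no arguments. A minimal sketch, where the worker count and memory limit are placeholders to adjust for your machine:

from dask.distributed import Client, LocalCluster

# Placeholder resource settings; tune for your machine or JupyterHub allocation
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)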

Load and Prepare Data

catalog_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(catalog_url)
col

Show the first few lines of the catalog:

col.df.head(10)
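Because col.df is an ordinary Pandas DataFrame, you can also query it directly. For example (a sketch, not needed for the rest of the notebook):

# Distinct experiments and components present in the catalog
print(col.df["experiment"].unique())
print(col.df["component"].unique())

# Number of catalog entries per data frequency
print(col.df["frequency"].value_counts())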

Show expanded version of collection structure with details:

col.keys_info().head()

Extract data needed to construct Figure 2

Search the catalog to find the desired data, in this case the reference height temperature of the atmosphere, at daily and monthly time resolution, for the Historical (HIST), 20th Century (20C), and RCP8.5 (IPCC Representative Concentration Pathway 8.5) experiments.

col_subset = col.search(frequency=["daily", "monthly"], component="atm", variable="TREFHT",
                        experiment=["20C", "RCP85", "HIST"])

col_subset
col_subset.df

Load catalog entries for subset into a dictionary of Xarray Datasets:

dsets = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True}, storage_options={"anon": True})
print(f"\nDataset dictionary keys:\n {dsets.keys()}")
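Each value in the dictionary is an ordinary Xarray Dataset. A quick way to see what was loaded (a sketch):

# Print each key alongside the dataset's dimension sizes
for key, ds_i in dsets.items():
    print(key, dict(ds_i.sizes))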

Define Xarray Datasets corresponding to the three experiments:

ds_HIST = dsets['atm.HIST.monthly']
ds_20C = dsets['atm.20C.daily']
ds_RCP85 = dsets['atm.RCP85.daily']

Use Dask’s format_bytes utility function (from dask.utils) to display the size of each dataset:

from dask.utils import format_bytes
print(f"Historical: {format_bytes(ds_HIST.nbytes)}\n"
      f"20th Century: {format_bytes(ds_20C.nbytes)}\n"
      f"RCP8.5: {format_bytes(ds_RCP85.nbytes)}")
Historical: 177.21 MiB
20th Century: 258.65 GiB
RCP8.5: 285.71 GiB

Now, extract the Reference Height Temperature data variable:

t_hist = ds_HIST["TREFHT"]
t_20c = ds_20C["TREFHT"]
t_rcp = ds_RCP85["TREFHT"]
t_20c

The global surface temperature anomaly was computed relative to the 1961-90 base period in the Kay et al. paper, so extract that time slice:

t_ref = t_20c.sel(time=slice("1961", "1990"))
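As a quick sanity check (a sketch), confirm that the slice spans the intended base period; only the time coordinate is needed, so no bulk data is read:

# First and last timestamps of the reference period
print(t_ref.time.values[0], t_ref.time.values[-1])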

Figure 2

Read grid cell areas

Cell size varies with latitude, so this must be accounted for when computing the global mean.

cat = col.search(frequency="static", component="atm", experiment=["20C"])
_, grid = cat.to_dataset_dict(aggregate=False, storage_options={'anon':True}, zarr_kwargs={"consolidated": True}).popitem()
grid
cell_area = grid.area.load()
total_area = cell_area.sum()
cell_area

Define weighted means

Note: resample(time="AS") does an annual resampling anchored at the start of each calendar year. See the Pandas documentation for resampling options.

t_ref_ts = (
    (t_ref.resample(time="AS").mean("time") * cell_area).sum(dim=("lat", "lon"))
    / total_area
).mean(dim=("time", "member_id"))

t_hist_ts = (
    (t_hist.resample(time="AS").mean("time") * cell_area).sum(dim=("lat", "lon"))
) / total_area

t_20c_ts = (
    (t_20c.resample(time="AS").mean("time") * cell_area).sum(dim=("lat", "lon"))
) / total_area

t_rcp_ts = (
    (t_rcp.resample(time="AS").mean("time") * cell_area).sum(dim=("lat", "lon"))
) / total_area
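Xarray also provides a built-in helper for weighted reductions. The following sketch is an equivalent way to express the same area-weighted global mean for the reference period; it is not used in the rest of the notebook:

# Equivalent formulation using DataArray.weighted(); still lazy until loaded
t_ref_ts_alt = (
    t_ref.resample(time="AS").mean("time")
    .weighted(cell_area)
    .mean(dim=("lat", "lon"))
).mean(dim=("time", "member_id"))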

Read data and compute means

Dask’s “lazy execution” philosophy means that until this point we have not actually read the bulk of the data. Steps 1, 3, and 4 take a while to complete, so we include the Notebook “cell magic” directive %%time to display elapsed and CPU times after computation.
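Before calling .load(), the time series defined above are still lazy, Dask-backed arrays; only their metadata is known. A sketch for inspecting one of them:

# The underlying data are Dask arrays until .load() or .compute() is called
print(type(t_20c_ts.data))
print(t_20c_ts.dims, t_20c_ts.shape)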

Step 1 (takes a while)

%%time
# this cell takes a while, be patient
t_ref_mean = t_ref_ts.load()
t_ref_mean

Step 2 (executes quickly)

%%time 
t_hist_ts_df = t_hist_ts.to_series().T
#t_hist_ts_df.head()
CPU times: user 352 ms, sys: 50.2 ms, total: 402 ms
Wall time: 3.56 s

Step 3 (takes even longer than Step 1)

%%time
t_20c_ts_df = t_20c_ts.to_series().unstack().T
t_20c_ts_df.head()

Step 4 (similar to Step 3 in its execution time)

%%time
# This also takes a while
t_rcp_ts_df = t_rcp_ts.to_series().unstack().T
t_rcp_ts_df.head()

Get observations for Figure 2 (HadCRUT4)

The HadCRUT4 temperature dataset is described by Morice et al. (2012).

Observational time series data for comparison with ensemble average:

obsDataURL = "https://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/cru/hadcrut4/air.mon.anom.median.nc"
ds = xr.open_dataset(obsDataURL).load()
ds

Define a function that computes an annual mean of the observations, weighted by the number of days in each month:

def weighted_temporal_mean(ds):
    """Compute the annual mean of the `air` variable, weighted by days in each month."""
    # Length of each month, from the difference of the time bounds
    time_bound_diff = ds.time_bnds.diff(dim="nbnds")[:, 0]
    # Normalize month lengths within each year to obtain weights
    wgts = time_bound_diff.groupby("time.year") / time_bound_diff.groupby(
        "time.year"
    ).sum(xr.ALL_DIMS)
    obs = ds["air"]
    cond = obs.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # Weighted sums of the data and of the valid-data mask, per calendar year
    obs_sum = (obs * wgts).resample(time="AS").sum(dim="time")
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # The ratio gives a weighted mean that ignores missing values
    obs_s = (obs_sum / ones_out).mean(("lat", "lon")).to_series()
    return obs_s

Limit observations to 20th century:

obs_s = weighted_temporal_mean(ds)
obs_s = obs_s['1920':]
obs_s.head()

Compute anomalies of the simulated global-mean temperature relative to the 1961-90 reference mean:

all_ts_anom = pd.concat([t_20c_ts_df, t_rcp_ts_df]) - t_ref_mean.data
years = [val.year for val in all_ts_anom.index]
obs_years = [val.year for val in obs_s.index]

Combine ensemble member 1 data from historical and 20th century experiments:

hist_anom = t_hist_ts_df - t_ref_mean.data
member1 = pd.concat([hist_anom.iloc[:-2], all_ts_anom.iloc[:,0]], verify_integrity=True)
member1_years = [val.year for val in member1.index]

Plotting Figure 2

Global surface temperature anomaly (1961-90 base period) for individual ensemble members, and observations:

ax = plt.axes()

ax.tick_params(right=True, top=True, direction="out", length=6, width=2, grid_alpha=0.5)
ax.plot(years, all_ts_anom.iloc[:,1:], color="grey")
ax.plot(obs_years, obs_s['1920':], color="red")
ax.plot(member1_years, member1, color="black")


ax.text(
    0.35,
    0.4,
    "observations",
    verticalalignment="bottom",
    horizontalalignment="left",
    transform=ax.transAxes,
    color="red",
    fontsize=10,
)
ax.text(
    0.35,
    0.33,
    "members 2-40",
    verticalalignment="bottom",
    horizontalalignment="left",
    transform=ax.transAxes,
    color="grey",
    fontsize=10,
)
ax.text(
    0.05,
    0.2,
    "member 1",
    verticalalignment="bottom",
    horizontalalignment="left",
    transform=ax.transAxes,
    color="black",
    fontsize=10,
)

ax.set_xticks([1850, 1920, 1950, 2000, 2050, 2100])
plt.ylim(-1, 5)
plt.xlim(1850, 2100)
plt.ylabel("Global Surface\nTemperature Anomaly (K)")
plt.show()

Figure 4

Compute linear trend for winter seasons

def linear_trend(da, dim="time"):
    da_chunk = da.chunk({dim: -1})
    trend = xr.apply_ufunc(
        calc_slope,
        da_chunk,
        vectorize=True,
        input_core_dims=[[dim]],
        output_core_dims=[[]],
        output_dtypes=[np.float64],
        dask="parallelized",
    )
    return trend


def calc_slope(y):
    """ufunc to be used by linear_trend"""
    x = np.arange(len(y))

    # drop missing values (NaNs) from x and y
    finite_indexes = ~np.isnan(y)
    slope = np.nan if (np.sum(finite_indexes) < 2) else np.polyfit(x[finite_indexes], y[finite_indexes], 1)[0]
    return slope
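As a quick check (a sketch), calc_slope should recover the slope of a noise-free line:

# A perfectly linear series with slope 0.5 should return approximately 0.5
y_test = 0.5 * np.arange(10) + 2.0
print(calc_slope(y_test))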
%%time 
# Takes several minutes
t = xr.concat([t_20c, t_rcp], dim="time")
seasons = t.sel(time=slice("1979", "2012")).resample(time="QS-DEC").mean("time")
# Include only full seasons from 1979 through 2012
seasons = seasons.sel(time=slice("1979", "2012")).load()
winter_seasons = seasons.sel(
    time=seasons.time.where(seasons.time.dt.month == 12, drop=True)
)
winter_trends = linear_trend(
    winter_seasons.chunk({"lat": 20, "lon": 20, "time": -1})
).load() * len(winter_seasons.time)

# Compute ensemble mean from the first 30 members
winter_trends_mean = winter_trends.isel(member_id=range(30)).mean(dim='member_id')

Make sure that we have 34 seasons:

assert len(winter_seasons.time) == 34

Get observations for Figure 4 (NASA GISS GisTemp)

This is observational time series data for comparison with the ensemble average. Here we are using the GISS Surface Temperature Analysis (GISTEMP v4) from NASA’s Goddard Institute for Space Studies (Lenssen et al., 2019).

Define the URL to Project Pythia’s Jetstream2 Object Store and the path to the Zarr file.

URL = 'https://js2.jetstream-cloud.org:8001'
filePath = 's3://pythia/gistemp1200_GHCNv4_ERSSTv5.zarr'

Create a container for the S3 file system

fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))

Link to the Zarr file as it exists on the S3 object store:

store = s3fs.S3Map(root=filePath, s3=fs, check=False)

Create an Xarray Dataset from the Zarr object:

ds = xr.open_zarr(store, consolidated=True, chunks="auto")
ds

Remap the longitude range from [-180, 180] to [0, 360] for plotting purposes:

ds = ds.assign_coords(lon=((ds.lon + 360) % 360)).sortby('lon')
ds
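To see what the remapping does to a single value (a quick arithmetic check):

# A longitude of -25 degrees maps to 335 degrees under (lon + 360) % 360
print((-25 + 360) % 360)   # 335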

Include only full seasons from 1979 through 2012:

obs_seasons = ds.sel(time=slice("1979", "2012")).resample(time="QS-DEC").mean("time")
obs_seasons = obs_seasons.sel(time=slice("1979", "2012")).load()
obs_winter_seasons = obs_seasons.sel(
    time=obs_seasons.time.where(obs_seasons.time.dt.month == 12, drop=True)
)
obs_winter_seasons

And compute observed winter trends:

obs_winter_trends = linear_trend(
    obs_winter_seasons.chunk({"lat": 20, "lon": 20, "time": -1})
).load() * len(obs_winter_seasons.time)
obs_winter_trends

Plotting Figure 4

Global maps of historical (1979-2012) boreal winter (DJF) surface air temperature trends:

contour_levels = [-6, -5, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 5, 6]
color_map = cmaps.ncl_default
def make_map_plot(nplot_rows, nplot_cols, plot_index, data, plot_label):
    """Create a single map subplot (uses the module-level lats, lons, contour_levels, and color_map)."""
    ax = plt.subplot(nplot_rows, nplot_cols, plot_index, projection=ccrs.Robinson(central_longitude=180))
    cplot = plt.contourf(lons, lats, data,
                         levels=contour_levels,
                         cmap=color_map,
                         extend='both',
                         transform=ccrs.PlateCarree())
    ax.coastlines(color='grey')
    ax.text(0.01, 0.01, plot_label, fontsize=14, transform=ax.transAxes)
    return cplot, ax
%%time
# Generate plot (may take a while as many individual maps are generated)
numPlotRows = 8
numPlotCols = 4
figWidth = 20
figHeight = 30

fig, axs = plt.subplots(numPlotRows, numPlotCols, figsize=(figWidth,figHeight))

lats = winter_trends.lat
lons = winter_trends.lon

# Create ensemble member plots
for ensemble_index in range(30):
    plot_data = winter_trends.isel(member_id = ensemble_index)
    plot_index = ensemble_index + 1
    plot_label = str(plot_index)
    plotRow = ensemble_index // numPlotCols
    plotCol = ensemble_index % numPlotCols
    # Retain axes objects for figure colorbar
    cplot, axs[plotRow, plotCol] = make_map_plot(numPlotRows, numPlotCols, plot_index, plot_data, plot_label)

# Create plots for the ensemble mean, observations, and a figure color bar.
cplot, axs[7,2] = make_map_plot(numPlotRows, numPlotCols, 31, winter_trends_mean, 'EM')

lats = obs_winter_trends.lat
lons = obs_winter_trends.lon
cplot, axs[7,3] = make_map_plot(numPlotRows, numPlotCols, 32, obs_winter_trends.tempanomaly, 'OBS')

cbar = fig.colorbar(cplot, ax=axs, orientation='horizontal', shrink = 0.7, pad = 0.02)
cbar.ax.set_title('1979-2012 DJF surface air temperature trends (K/34 years)', fontsize = 16)
cbar.set_ticks(contour_levels)
cbar.set_ticklabels(contour_levels)
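If you want to keep a copy of the assembled figure, you could save it to disk; the filename here is just an example:

# Save the figure (example filename)
fig.savefig("kay2015_fig4_trends.png", dpi=150, bbox_inches="tight")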

Close our client:

client.close()

Summary

In this notebook, we used CESM LENS data hosted on AWS to recreate two key figures in the paper that describes the project.

What’s next?

More example workflows using these datasets may be added in the future.

References
  1. Kay, J. E., Deser, C., Phillips, A., Mai, A., Hannay, C., Strand, G., Arblaster, J. M., Bates, S. C., Danabasoglu, G., Edwards, J., Holland, M., Kushner, P., Lamarque, J.-F., Lawrence, D., Lindsay, K., Middleton, A., Munoz, E., Neale, R., Oleson, K., … Vertenstein, M. (2015). The Community Earth System Model (CESM) Large Ensemble Project. Bull. Amer. Meteor. Soc. 10.1175/BAMS-D-13-00255.1
  2. Morice, C. P., Kennedy, J. J., Rayner, N. A., & Jones, P. D. (2012). Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 data set. J. Geophys. Res. Atmos., 117(D8). 10.1029/2011JD017187
  3. Lenssen, N., Schmidt, G., Hansen, J., Menne, M., Persin, A., Ruedy, R., & Zyss, D. (2019). Improvements in the GISTEMP uncertainty model. Journal of Geophysical Research: Atmospheres, 124(12), 6307–6326. 10.1029/2018JD029522