
Global Mean Surface Temperature Anomalies (GMSTA) from CMIP6 data


Overview

In this notebook we will compute Global Mean Surface Temperature Anomalies (GMSTA) from CMIP6 data and compare them with observations. This notebook is adapted from the GMST example in the CMIP6 cookbook, and we thank the authors for their workflow.

  1. We will get the CMIP6 temperature data from the AWS Open Data program via the us-west-2 origin
  2. To do this, we will use an intake-ESM catalog (hosted on NCAR’s RDA) whose store links use PelicanFS instead of plain https or s3
  3. We will grab observational data hosted on NCAR’s RDA, which is accessible via the NCAR origin
  4. Please refer to the first chapter of this cookbook to learn more about OSDF, Pelican, and PelicanFS
  5. This notebook demonstrates that you can seamlessly stream data from multiple OSDF origins in a single workflow (a minimal sketch of such a streaming read follows this list)
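As a minimal illustration of this streaming mechanism, PelicanFS plugs into fsspec so that ‘osdf:///’ URLs can be opened like any other remote file. The sketch below reuses the observational file path that appears later in this notebook:

import fsspec
import pelicanfs  # registers the Pelican/OSDF protocols with fsspec

url = 'osdf:///ncar/rda/d850001/HadCRUT.5.0.2.0.analysis.summary_series.global.monthly.nc'
with fsspec.open(url, 'rb') as f:
    print(f.read(8))  # first few bytes of the netCDF/HDF5 file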

Prerequisites

Concepts | Importance | Notes
Intro to Intake-ESM | Necessary | Used for searching CMIP6 data
Understanding of Zarr | Helpful | Familiarity with metadata structure
Seaborn | Helpful | Used for plotting
PelicanFS | Necessary | The Python package used to stream data in this notebook
OSDF | Helpful | OSDF is used to stream data in this notebook
  • Time to learn: 20 mins

Imports

from matplotlib import pyplot as plt
import xarray as xr
import numpy as np
from dask.diagnostics import progress
from tqdm.autonotebook import tqdm
import intake
import fsspec
import seaborn as sns
import aiohttp
import dask
from dask.distributed import LocalCluster
import pelicanfs  # makes the 'osdf://' protocol available to fsspec

We will use an intake-ESM catalog hosted on NCAR’s Research Data Archive (RDA). It is the AWS CMIP6 catalog, modified so that the store links use OSDF.

# Load catalog URL
rda_url     =  'https://data.rda.ucar.edu/'
cat_url     = rda_url +  'd850001/catalogs/osdf/cmip6-aws/cmip6-osdf-zarr.json'
print(cat_url)
https://data.rda.ucar.edu/d850001/catalogs/osdf/cmip6-aws/cmip6-osdf-zarr.json
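Before opening the catalog with intake-ESM, we can peek at the catalog JSON itself. This is just an optional sanity check; the keys printed below are typical of intake-ESM catalogs and may vary:

import json
with fsspec.open(cat_url) as f:
    cat_meta = json.load(f)
print(cat_meta.get('id'), cat_meta.get('catalog_file'))  # catalog name and the CSV listing the Zarr stores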

Set up local dask cluster

Before we do any computation, let us first set up a local cluster using Dask.

cluster = LocalCluster()          
client = cluster.get_client()
# Scale the cluster
n_workers = 5
cluster.scale(n_workers)
cluster
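With the cluster running, the Dask dashboard is handy for watching the remote reads and reductions as they happen:

print(client.dashboard_link)  # open this URL in a browser to monitor the workers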

Data Loading

Load CMIP6 data from AWS

col = intake.open_esm_datastore(cat_url)
col
# there is currently a significant amount of data for these runs
expts = ['historical', 'ssp245', 'ssp370']

query = dict(
    experiment_id=expts,
    table_id='Amon',
    variable_id=['tas'],
    member_id = 'r1i1p1f1',
    #activity_id = 'CMIP',
)

col_subset = col.search(require_all_on=["source_id"], **query)
col_subset
  • Let us inspect the Zarr store paths to see whether we are using the Pelican protocol.
  • The zstore column has paths that start with ‘osdf:///’ instead of ‘https://’, which tells us that we are not fetching the data with a plain HTTPS GET request.
  • To learn more about the Pelican protocol, please refer to the first chapter of this cookbook.
col_subset.df
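As an optional check, we can ask fsspec which filesystem class handles the ‘osdf’ scheme; with pelicanfs installed this should resolve to a PelicanFS-backed filesystem:

sample_path = col_subset.df.zstore.values[0]
fs, path = fsspec.core.url_to_fs(sample_path)
print(type(fs).__name__, path)  # filesystem class for the osdf scheme and the resolved path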

Grab some observational time series data for comparison with the ensemble spread

  • The observational data we will use is the HadCRUT5 dataset from the UK Met Office
  • The data has been downloaded to NCAR’s Research Data Archive (RDA) from https://www.metoffice.gov.uk/hadobs/hadcrut5/
  • We will use OSDF to access this copy from the RDA; again, the link starts with ‘osdf:///’
%%time
obs_url = 'osdf:///ncar/rda/d850001/HadCRUT.5.0.2.0.analysis.summary_series.global.monthly.nc'
# the osdf:/// URL is resolved through fsspec and streamed by PelicanFS
obs_ds = xr.open_dataset(obs_url, engine='h5netcdf').tas_mean
obs_ds
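Note that HadCRUT5 values are already reported as temperature anomalies relative to the dataset’s own 1961-1990 climatology, which is why we do not subtract a baseline from the observations later on. A quick look at the series:

print(obs_ds.sizes)                      # length of the monthly record
print(obs_ds.attrs.get('units', 'n/a'))  # units of the anomaly series, if recorded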

Some helpful functions

def drop_all_bounds(ds):
    # drop coordinate bounds variables (e.g. lat_bnds, time_bounds)
    drop_vars = [vname for vname in ds.coords
                 if ('_bounds' in vname) or ('_bnds' in vname)]
    return ds.drop_vars(drop_vars)

def open_dset(df):
    # open a single Zarr store (one catalog row) lazily
    assert len(df) == 1
    mapper = fsspec.get_mapper(df.zstore.values[0])
    ds = xr.open_zarr(mapper, consolidated=True)
    return drop_all_bounds(ds)

def open_delayed(df):
    return dask.delayed(open_dset)(df)

from collections import defaultdict
dsets = defaultdict(dict)

for group, df in col_subset.df.groupby(by=['source_id', 'experiment_id']):
    dsets[group[0]][group[1]] = open_delayed(df)
dsets_ = dask.compute(dict(dsets))[0]
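At this point dsets_ is a nested dictionary keyed by source_id and then experiment_id, with each value a lazily loaded xarray Dataset. One way to inspect a single entry:

first_model = next(iter(dsets_))
print(first_model, list(dsets_[first_model]))  # one source_id and its available experiments
dsets_[first_model]['historical']              # lazily loaded Dataset for that run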
#calculate global means
def get_lat_name(ds):
    for lat_name in ['lat', 'latitude']:
        if lat_name in ds.coords:
            return lat_name
    raise RuntimeError("Couldn't find a latitude coordinate")

def global_mean(ds):
    lat = ds[get_lat_name(ds)]
    weight = np.cos(np.deg2rad(lat))
    weight /= weight.mean()
    other_dims = set(ds.dims) - {'time'}
    return (ds * weight).mean(other_dims)
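For reference, the same cosine-latitude weighting can be expressed with xarray’s built-in weighted reductions, which normalize the weights internally. A sketch equivalent to global_mean above:

def global_mean_weighted(ds):
    # weight each grid cell by cos(latitude) and average over all non-time dimensions
    lat = ds[get_lat_name(ds)]
    weight = np.cos(np.deg2rad(lat))
    other_dims = set(ds.dims) - {'time'}
    return ds.weighted(weight).mean(dim=other_dims)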

GMST computation

expt_da = xr.DataArray(expts, dims='experiment_id', name='experiment_id',
                       coords={'experiment_id': expts})

dsets_aligned = {}

for k, v in tqdm(dsets_.items()):
    expt_dsets = v.values()
    if any([d is None for d in expt_dsets]):
        print(f"Missing experiment for {k}")
        continue

    for ds in expt_dsets:
        ds.coords['year'] = ds.time.dt.year

    # workaround for
    # https://github.com/pydata/xarray/issues/2237#issuecomment-620961663
    # take the global mean, then average consecutive 12-month blocks into annual means
    dsets_ann_mean = [v[expt].pipe(global_mean).swap_dims({'time': 'year'})
                             .drop_vars('time').coarsen(year=12).mean()
                      for expt in expts]

    # concatenate the experiments along a new 'experiment_id' dimension
    dsets_aligned[k] = xr.concat(dsets_ann_mean, join='outer', dim=expt_da)
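A more direct way to compute annual means would be a groupby over the calendar year (the two agree when each year has a full 12 months of data); the coarsen form above is used as the workaround described in the linked xarray issue. An illustrative sketch, where ds is any one of the monthly datasets opened earlier:

annual_mean = global_mean(ds).groupby('time.year').mean('time')  # illustrative only, not used below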

%%time
with progress.ProgressBar():
    dsets_aligned_ = dask.compute(dsets_aligned)[0]
CPU times: user 15.6 s, sys: 3.54 s, total: 19.1 s
Wall time: 2min 14s
source_ids = list(dsets_aligned_.keys())
source_da = xr.DataArray(source_ids, dims='source_id', name='source_id',
                         coords={'source_id': source_ids})

big_ds = xr.concat([ds.reset_coords(drop=True)
                    for ds in dsets_aligned_.values()],
                    dim=source_da)

big_ds
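big_ds now stacks the annual-mean series along source_id, experiment_id, and year. A quick look at its dimensions before flattening it into a dataframe:

print(big_ds.sizes)  # expected dimensions: source_id, experiment_id, year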
# Compute annual mean temperature anomalies of the observational data
obs_gmsta = obs_ds.resample(time='YS').mean(dim='time')
# obs_gmsta

Compute anomalies and plot

  • We will compute the temperature anomalies with respect to the 1960-1990 baseline period
  • Convert the xarray datasets to pandas dataframes
  • Use Seaborn to plot the GMSTA
df_all = big_ds.to_dataframe().reset_index()
df_all.head()
# Define the baseline period
baseline_df = df_all[(df_all["year"] >= 1960) & (df_all["year"] <= 1990)]

# Compute the baseline mean
baseline_mean = baseline_df["tas"].mean()

# Compute anomalies
df_all["tas_anomaly"] = df_all["tas"] - baseline_mean
df_all
obs_df = obs_gmsta.to_dataframe(name='tas_anomaly').reset_index()
# Convert 'time' to 'year' (keeping only the year)
obs_df['year'] = obs_df['time'].dt.year

# Drop the original 'time' column since we extracted 'year'
obs_df = obs_df[['year', 'tas_anomaly']]
obs_df

Almost there! Let us now use Seaborn to plot all the anomalies.

g = sns.relplot(data=df_all, x="year", y="tas_anomaly",
                hue='experiment_id', kind="line", errorbar="sd",
                aspect=2, palette="Set2")  # adjust the color palette

# Get the current axis from the FacetGrid
ax = g.ax

# Overlay the observational data in red
sns.lineplot(data=obs_df, x="year", y="tas_anomaly",color="red", 
             linestyle="dashed", linewidth=2,label="Observations", ax=ax)

# Adjust the legend to include observations
ax.legend(title="Experiment ID + Observations")

# Show the plot
plt.show()
[Figure: CMIP6 GMSTA by experiment (historical, ssp245, ssp370) with HadCRUT5 observations overlaid as a dashed red line]
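To keep a copy of the figure for later use (the filename here is arbitrary):

g.savefig('gmsta_cmip6_vs_hadcrut5.png', dpi=150)  # FacetGrid.savefig writes the whole figure to disk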

Summary

In this notebook, we used surface air temperature data from several CMIP6 models for the ‘historical’, ‘ssp245’, and ‘ssp370’ runs to compute the Global Mean Surface Temperature Anomaly (GMSTA) relative to the 1960-1990 baseline period and compared it with anomalies from the HadCRUT5 monthly surface temperature dataset. We used a modified intake-ESM catalog and PelicanFS to stream the temperature data from two different OSDF origins: the CMIP6 model data came from the AWS Open Data origin in the us-west-2 region, and the observational data came from NCAR’s OSDF origin.

Resources and references

  1. Original notebook in the Pangeo Gallery by Henri Drake and Ryan Abernathey
  2. CMIP6 cookbook by Ryan Abernathey, Henri Drake, Robert Ford and Max Grover
  3. Coupled Model Intercomparison Project 6 was accessed from https://registry.opendata.aws/cmip6 using a modified intake-ESM catalog hosted on NCAR’s RDA
  4. We thank the UK Met Office Hadley Center for providing the observational data