Search and Load CMIP6 Data via ESGF/OPeNDAP

Overview

This notebook shows how to search for and load CMIP6 data through the Earth System Grid Federation (ESGF) infrastructure, which is the foundation of the CMIP6 distribution system.

The main technologies used here are the ESGF search API, which we use to find the data we want, and OPeNDAP, a protocol for remote data access over HTTP.
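
To make the search API concrete, the sketch below builds the kind of HTTP query a client library sends to an ESGF search endpoint. It is illustrative only: the `build_search_url` helper is hypothetical, the base URL is an assumed endpoint, and the default parameters are one plausible choice rather than what any particular client sends.

```python
from urllib.parse import urlencode


def build_search_url(base_url, **facets):
    """Build an ESGF search URL from facet constraints (hypothetical helper)."""
    params = {
        "format": "application/solr+json",  # ask for a JSON response
        "type": "Dataset",                  # search datasets rather than files
        "distrib": "false",                 # query this node only
        "limit": 10,                        # cap the number of results returned
    }
    params.update(facets)
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


url = build_search_url(
    "https://esgf-node.llnl.gov/esg-search",  # assumed endpoint
    project="CMIP6",
    source_id="CESM2",
    experiment_id="historical",
    variable="tas",
)
print(url)
```

In practice pyesgf assembles and sends such queries for you, so this is only meant to show what travels over the wire.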

Prerequisites

Concepts	Importance	Notes
Intro to Xarray	Necessary
Understanding of NetCDF	Helpful	Familiarity with metadata structure

Time to learn: 10 minutes

Imports

import warnings

from distributed import Client
import holoviews as hv
import hvplot.xarray
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pyesgf.search import SearchConnection
import xarray as xr

xr.set_options(display_style='html')
warnings.filterwarnings("ignore")
hv.extension('bokeh')

client = Client()
client

Search using ESGF API

Fortunately, the ESGF search API has a Python client, pyesgf. Using it involves three main steps:

Establish a search connection
Query your data
Extract the urls to your data

Once you have this information, you can load the data into an xarray.Dataset.

Configure the connection to a data server

First, we configure our connection to a search node. Setting distrib=False restricts the search to that node alone, rather than distributing the query across the federation. In this case, we search the Lawrence Livermore National Laboratory (LLNL) node.

conn = SearchConnection('https://esgf-node.llnl.gov/esg-search',
                        distrib=False)

Query our dataset

We are interested in a single experiment from CMIP6 - one of the Community Earth System Model version 2 (CESM2) runs, specifically the historical part of the simulation.

We are also interested in a single variable, near-surface air temperature (tas), from a single ensemble member (r10i1p1f1).

ctx = conn.new_context(
    facets='project,experiment_id',
    project='CMIP6',
    table_id='Amon',
    institution_id="NCAR",
    experiment_id='historical',
    source_id='CESM2',
    variable='tas',
    variant_label='r10i1p1f1',
)
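
Before extracting anything, it can help to check how many datasets a context matches, since taking the first result of an empty search fails. The sketch below uses a hypothetical helper, `describe_context`; it relies only on the `hit_count` and `facet_counts` attributes of a pyesgf search context, and the `FakeContext` class with its made-up counts exists purely so the example runs without a network connection.

```python
def describe_context(ctx, facet="experiment_id"):
    """Summarize a search context: total hits and counts for one facet.

    `ctx` is expected to behave like a pyesgf search context, i.e. to
    provide `hit_count` and `facet_counts` (hypothetical helper).
    """
    counts = ctx.facet_counts.get(facet, {})
    return {"hits": ctx.hit_count, facet: dict(counts)}


class FakeContext:
    """Offline stand-in for a search context, with made-up counts."""
    hit_count = 1
    facet_counts = {"experiment_id": {"historical": 1}, "project": {"CMIP6": 1}}


summary = describe_context(FakeContext())
print(summary)
```

With a live connection, the same call would be describe_context(ctx) on the context created above; a hit count of zero is a signal to loosen the constraints before going further.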

Extract the OPeNDAP URLs

To access the datasets, we need the URLs of the data files. Once we have these, we can read the data remotely.

result = ctx.search()[0]
files = result.file_context().search()
files

The files object is not immediately helpful on its own - we need to read the opendap_url attribute of each file in it.

files[0].opendap_url

We can do this for the whole list with a list comprehension, as shown below.

opendap_urls = [file.opendap_url for file in files]
opendap_urls
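
The search-and-extract steps above can be wrapped in one function. This is a sketch under assumptions: `find_opendap_urls` is a hypothetical helper, it takes only the first matching dataset (as the cells above do), and it assumes opendap_url may be None for files that do not advertise an OPeNDAP service. The stub classes and example.org URL are made up so the example runs offline.

```python
def find_opendap_urls(conn, **constraints):
    """Return OPeNDAP URLs for the first dataset matching `constraints`."""
    results = conn.new_context(**constraints).search()
    if len(results) == 0:
        raise LookupError(f"No datasets match {constraints}")
    files = results[0].file_context().search()
    # Skip files without an OPeNDAP endpoint (assumed to be reported as None).
    return [f.opendap_url for f in files if f.opendap_url]


# --- Offline stand-ins for pyesgf objects (made up for this example) ---
class StubFile:
    def __init__(self, url):
        self.opendap_url = url


class StubDataset:
    def __init__(self, urls):
        self._urls = urls

    def file_context(self):
        return self

    def search(self):
        return [StubFile(u) for u in self._urls]


class StubConnection:
    def __init__(self, datasets):
        self._datasets = datasets

    def new_context(self, **constraints):
        return self

    def search(self):
        return self._datasets


conn_stub = StubConnection([StubDataset(["https://example.org/a.nc", None])])
urls = find_opendap_urls(conn_stub, variable="tas")
print(urls)
```

With a real connection, the call would take the same keyword constraints passed to conn.new_context earlier in this notebook.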

Read the data into an `xarray.Dataset`

Now that we have the URLs to the data, we can use xarray's multi-file reader (open_mfdataset) to open them as a single dataset, combining the files by their coordinates and chunking along time.

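These URLs are OPeNDAP endpoints. Xarray, together with the netCDF4 Python library, allows lazy loading, so the bulk of the data is not transferred until a computation or plot needs it.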

ds = xr.open_mfdataset(opendap_urls,
                       combine='by_coords',
                       chunks={'time':480})
ds
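
To get a feel for what chunks={'time': 480} means, the back-of-the-envelope sketch below estimates the chunk count and size. The numbers are assumptions, not values read from the dataset: 1980 monthly time steps (1850 through 2014), a 192 x 288 latitude-longitude grid, and 4-byte floats. Check them against the dataset summary above.

```python
import math

n_time = 1980            # assumed: 165 years of monthly means
n_lat, n_lon = 192, 288  # assumed grid size
bytes_per_value = 4      # assumed float32
chunk_time = 480         # from chunks={'time': 480}

# Number of chunks if the time axis were one contiguous array.
n_chunks = math.ceil(n_time / chunk_time)

# Size of one full chunk and of the whole variable, in MiB.
chunk_mib = chunk_time * n_lat * n_lon * bytes_per_value / 2**20
total_mib = n_time * n_lat * n_lon * bytes_per_value / 2**20

print(f"{n_chunks} chunks along time, up to {chunk_mib:.2f} MiB each, "
      f"{total_mib:.0f} MiB in total")
```

Because open_mfdataset chunks each file separately, the real layout can contain more, smaller chunks when the time axis is split across several files.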

Plot a quick look of the data

Now that we have the dataset, let’s plot a quick look at a single month.

ds.tas.sel(time='1950-01').squeeze().plot(cmap='Spectral_r');

Compute an area-weighted global average

Let’s apply some computation to this dataset. We would like to calculate the global average temperature. This requires weighting each grid cell by its area, because cells on a latitude-longitude grid shrink toward the poles.

Find the area of the cells

We can query the data server again, this time for the grid-cell area (areacella).

ctx = conn.new_context(
    facets='project,experiment_id',
    project='CMIP6',
    institution_id="NCAR",
    experiment_id='historical',
    source_id='CESM2',
    variable='areacella',
)

As before, we extract the OPeNDAP URLs.

result = ctx.search()[0]
files = result.file_context().search()
opendap_urls = [file.opendap_url for file in files]
opendap_urls

And finally, we load our cell area file into an xarray.Dataset.

ds_area = xr.open_dataset(opendap_urls[0])
ds_area

Compute the global average

Now that we have the area of each cell, and the temperature at each point, we can compute the global average temperature.
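
Written out, the quantity computed here is the area-weighted mean. The symbols are introduced for illustration only: $T_{ij}(t)$ is tas at latitude index $i$ and longitude index $j$ at time $t$, and $a_{ij}$ is the corresponding areacella value.

```latex
\bar{T}(t) = \frac{\sum_{i,j} a_{ij}\, T_{ij}(t)}{\sum_{i,j} a_{ij}}
```

The numerator is the area-weighted sum over lat and lon, and the denominator is the total area.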

total_area = ds_area.areacella.sum(dim=['lon', 'lat'])
ta_timeseries = (ds.tas * ds_area.areacella).sum(dim=['lon', 'lat']) / total_area
ta_timeseries
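
To see why the weighting matters, here is a small synthetic sketch in NumPy. It does not use the CESM2 data: the grid, the temperature field, and the use of cos(latitude) as a stand-in for cell area are all assumptions made for illustration.

```python
import numpy as np

# Synthetic regular grid: cell-center latitudes and longitudes in degrees.
lat = np.linspace(-89.5, 89.5, 180)
lon = np.linspace(0.5, 359.5, 360)

# On a regular lat-lon grid, cell area is approximately proportional
# to cos(latitude); this replaces areacella in the sketch.
cos_lat = np.cos(np.deg2rad(lat))[:, np.newaxis]
weights = cos_lat * np.ones(lon.size)

# Made-up temperature field (K): warm at the equator, cold at the poles.
temperature = 250.0 + 50.0 * cos_lat * np.ones(lon.size)

unweighted_mean = temperature.mean()
weighted_mean = (temperature * weights).sum() / weights.sum()

print(f"unweighted: {unweighted_mean:.1f} K, area-weighted: {weighted_mean:.1f} K")
```

The unweighted mean comes out colder because it counts the small polar cells as heavily as the large tropical ones.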

By default the data are loaded lazily, as Dask arrays. Here we trigger computation explicitly.

%time ta_timeseries.load()

Visualize our results

Now that we have our results, we can visualize them using static and dynamic plots. Let’s start with a static plot using matplotlib, then make an interactive plot using hvPlot.

# Convert the cftime-based time index to a pandas DatetimeIndex for plotting
ta_timeseries['time'] = ta_timeseries.indexes['time'].to_datetimeindex()

fig = plt.figure(figsize=(12,8))
ta_timeseries.plot(label='monthly')
ta_timeseries.rolling(time=12).mean().plot(label='12 month rolling mean')
plt.legend()
plt.title('Global Mean Surface Air Temperature')
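
The 12-month rolling mean averages each month with the 11 months before it, which removes most of the seasonal cycle. A minimal NumPy sketch on a made-up series (a linear trend plus a sinusoidal annual cycle, not the model output) shows the effect. np.convolve with mode='valid' is used here as a stand-in for xarray's rolling; it drops the first 11 points, where xarray would return NaN.

```python
import numpy as np

months = np.arange(120)                           # ten years of monthly steps
trend = 0.01 * months                             # made-up warming trend
seasonal = 2.0 * np.sin(2 * np.pi * months / 12)  # made-up annual cycle
series = 287.0 + trend + seasonal

window = 12
# Each output value is the mean of 12 consecutive months.
rolling_mean = np.convolve(series, np.ones(window) / window, mode="valid")

# The spread shrinks once the annual cycle has been averaged out.
print(series.std(), rolling_mean.std())
```

What remains after smoothing is essentially the trend, which is why the rolling curve in the plot is easier to read than the monthly one.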

ta_timeseries.name = 'Temperature (K)'
monthly_average = ta_timeseries.hvplot(title='Global Mean Surface Air Temperature',
                                       label='monthly')
rolling_monthly_average = ta_timeseries.rolling(time=12).mean().hvplot(label='12 month rolling mean')

(monthly_average * rolling_monthly_average).opts(legend_position='top_left')