
Search and Load CMIP6 Data via ESGF/OPeNDAP


Overview

This notebook shows how to search for and load data via Earth System Grid Federation (ESGF) infrastructure. This infrastructure is robust and forms the foundation of the CMIP6 distribution system.

The main technologies used here are the ESGF search API, used to figure out what data we want, and OPeNDAP, a remote data access protocol over HTTP.

Prerequisites

Concepts                    Importance    Notes
Intro to Xarray             Necessary
Understanding of NetCDF     Helpful       Familiarity with metadata structure
  • Time to learn: 10 minutes

Imports

import warnings

from distributed import Client
import holoviews as hv
import hvplot.xarray
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pyesgf.search import SearchConnection
import xarray as xr

xr.set_options(display_style='html')
warnings.filterwarnings("ignore")
hv.extension('bokeh')
client = Client()
client

Search using ESGF API

Fortunately, the ESGF search API has a Python implementation, pyesgf. Using it involves three major steps:

  • Establish a search connection
  • Query your data
  • Extract the urls to your data

Once you have this information, you can load the data into an xarray.Dataset.

Configure the connection to a data server

First, we configure our connection to a server, disabling distributed search (distrib=False) so that only this node's index is queried. In this case, we are searching the Lawrence Livermore National Laboratory (LLNL) data node.

conn = SearchConnection('https://esgf-node.llnl.gov/esg-search',
                        distrib=False)

Query our dataset

We are interested in a single experiment from CMIP6 - one of the Community Earth System Model version 2 (CESM2) runs, specifically the historical part of the simulation.

We are also interested in a single variable - near-surface air temperature (tas) - from a single ensemble member (r10i1p1f1).

ctx = conn.new_context(
    facets='project,experiment_id',
    project='CMIP6',
    table_id='Amon',
    institution_id="NCAR",
    experiment_id='historical',
    source_id='CESM2',
    variable='tas',
    variant_label='r10i1p1f1',
)

Extract the OPeNDAP URLs

In order to access the datasets, we need the URLs to the data. Once we have these, we can read the data remotely!

result = ctx.search()[0]
files = result.file_context().search()
files
(If the ESGF index node is unreachable, this search raises a ConnectTimeout after exhausting its retries - for example, "Connection to esgf-node.llnl.gov timed out". Waiting and retrying, or pointing SearchConnection at a different index node, usually resolves it.)

The files object is not immediately helpful - we need to extract the opendap_url attribute from each file in it.

files[0].opendap_url

We can collect these for the whole list using a list comprehension, as shown below.

opendap_urls = [file.opendap_url for file in files]
opendap_urls
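File order matters when concatenating along time. CMIP6 file names conventionally end with the time range they cover, so a plain lexicographic sort usually puts them in chronological order. A minimal sketch with made-up URLs (the hostname and paths here are purely illustrative):

```python
# Made-up OPeNDAP URLs following the CMIP6 file-naming convention,
# where the final underscore-separated field is the time range.
urls = [
    "https://example.org/dodsC/tas_Amon_CESM2_historical_r10i1p1f1_gn_195001-199912.nc",
    "https://example.org/dodsC/tas_Amon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc",
    "https://example.org/dodsC/tas_Amon_CESM2_historical_r10i1p1f1_gn_190001-194912.nc",
]

# Because only the time-range field differs, lexicographic order
# matches chronological order.
opendap_urls = sorted(urls)
for url in opendap_urls:
    print(url)
```

This is a cheap safeguard before handing the list to open_mfdataset, which does not itself guarantee any particular input order.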

Read the data into an xarray.Dataset

Now that we have the URLs to the data, we can use xarray's multifile reader (open_mfdataset) to open them as a single dataset, combining along coordinates and chunking in time.

Xarray, together with the netCDF4 Python library, allows lazy loading: data are only transferred over the network when a computation actually needs them.

ds = xr.open_mfdataset(opendap_urls,
                       combine='by_coords',
                       chunks={'time':480})
ds

Plot a quick look of the data

Now that we have the dataset, let’s plot a few quick looks of the data.

ds.tas.sel(time='1950-01').squeeze().plot(cmap='Spectral_r');


Compute an area-weighted global average

Let’s apply some computation to this dataset. We would like to calculate the global average temperature. This requires weighting each of the grid cells properly, using the area.
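The weighting formula - sum(tas * area) / sum(area) - can be sketched with plain NumPy on a toy grid. The numbers below are synthetic, chosen only to illustrate how larger cells pull the mean toward their values:

```python
import numpy as np

# Toy 2x2 grid: temperatures (K) and made-up cell areas (m^2).
tas = np.array([[300.0, 290.0],
                [280.0, 270.0]])
area = np.array([[2.0, 2.0],
                 [1.0, 1.0]])  # larger cells in the first row

# Area-weighted mean: sum(tas * area) / sum(area)
weighted_mean = (tas * area).sum() / area.sum()

# The unweighted mean of these four values is 285.0; the weighted
# mean is larger because the warmer cells cover more area.
print(weighted_mean)
```

The same arithmetic is what the xarray expression below performs, just broadcast over every time step.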

Find the area of the cells

We can query the data server again, this time requesting the grid-cell area variable (areacella).

ctx = conn.new_context(
    facets='project,experiment_id',
    project='CMIP6',
    institution_id="NCAR",
    experiment_id='historical',
    source_id='CESM2',
    variable='areacella',
)

As before, we extract the OPeNDAP URLs.

result = ctx.search()[0]
files = result.file_context().search()
opendap_urls = [file.opendap_url for file in files]
opendap_urls

And finally, we load our cell-area file into an xarray.Dataset.

ds_area = xr.open_dataset(opendap_urls[0])
ds_area

Compute the global average

Now that we have the area of each cell, and the temperature at each point, we can compute the global average temperature.

total_area = ds_area.areacella.sum(dim=['lon', 'lat'])
ta_timeseries = (ds.tas * ds_area.areacella).sum(dim=['lon', 'lat']) / total_area
ta_timeseries

By default the data are loaded lazily, as Dask arrays. Here we trigger computation explicitly.

%time ta_timeseries.load()

Visualize our results

Now that we have our results, we can visualize using static and dynamic plots. Let’s start with static plots using matplotlib, then dynamic plots using hvPlot.

ta_timeseries['time'] = ta_timeseries.indexes['time'].to_datetimeindex()
fig = plt.figure(figsize=(12,8))
ta_timeseries.plot(label='monthly')
ta_timeseries.rolling(time=12).mean().plot(label='12 month rolling mean')
plt.legend()
plt.title('Global Mean Surface Air Temperature')
ta_timeseries.name = 'Temperature (K)'
monthly_average = ta_timeseries.hvplot(title='Global Mean Surface Air Temperature',
                                       label='monthly')
rolling_monthly_average = ta_timeseries.rolling(time=12).mean().hvplot(label='12 month rolling mean')

(monthly_average * rolling_monthly_average).opts(legend_position='top_left')
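The 12-month rolling mean used above can be sketched with pandas on synthetic data; xarray's rolling(time=12).mean() behaves like a right-aligned pandas rolling window, yielding NaN for the first 11 points and averaging out the annual cycle over each full-year window:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a slow linear trend plus an annual cycle.
months = np.arange(48)
series = pd.Series(0.01 * months + np.sin(2 * np.pi * months / 12))

# Right-aligned 12-month rolling mean: the first 11 entries are NaN,
# and each subsequent value averages exactly one year of data, so
# the sinusoidal annual cycle cancels and only the trend remains.
smooth = series.rolling(12).mean()
print(smooth.iloc[11])
```

This is why the rolling-mean curve in the plot is so much smoother than the monthly one: the dominant month-to-month variability is the seasonal cycle, which a 12-month window removes almost entirely.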

Summary

In this notebook, we searched for and opened a CESM2 dataset using the ESGF API and OPeNDAP. We then plotted global average surface air temperature.

What’s next?

We will see some more advanced examples of using the CMIP6 data.

Resources and references