Search and Load CMIP6 Data via ESGF/OPeNDAP

Overview

This notebook shows how to search for and load data via the Earth System Grid Federation (ESGF) infrastructure, which is the foundation of the CMIP6 distribution system.

The main technologies used here are the ESGF search API, used to figure out what data we want, and OPeNDAP, a remote data access protocol over HTTP.

Prerequisites

Concepts	Importance	Notes
Intro to Xarray	Necessary
Understanding of NetCDF	Helpful	Familiarity with metadata structure

Time to learn: 10 minutes

Imports

import warnings

from distributed import Client
import holoviews as hv
import hvplot.xarray
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pyesgf.search import SearchConnection
import xarray as xr

xr.set_options(display_style='html')
warnings.filterwarnings("ignore")
hv.extension('bokeh')

client = Client()
client

Client
Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 4	Total threads: 4	Total memory: 15.61 GiB
Memory per worker: 3.90 GiB

Search using ESGF API

Fortunately, there is a Python client for the ESGF search API, pyesgf (distributed as the esgf-pyclient package). Using it involves three major steps:

Establish a search connection
Query your data
Extract the urls to your data

Once you have this information, you can load the data into an xarray.Dataset.
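
The three steps can be wrapped in one helper. The following is a minimal sketch rather than part of pyesgf: the function name find_opendap_urls is hypothetical, and it only assumes that the connection object behaves like the SearchConnection used in this notebook (new_context, search, file_context, and the opendap_url attribute). It also assumes that opendap_url can be None for a file without an OPeNDAP endpoint, and skips such files.

```python
def find_opendap_urls(conn, **facets):
    """Return OPeNDAP URLs for the first dataset matching the facets.

    `conn` is assumed to behave like pyesgf.search.SearchConnection.
    The function name and signature are illustrative, not part of pyesgf.
    """
    # Step 1 happened outside: `conn` is an established search connection.
    # Step 2: query the datasets that match the facet constraints.
    ctx = conn.new_context(**facets)
    datasets = ctx.search()
    if len(datasets) == 0:
        raise LookupError(f"no dataset matches {facets}")

    # Step 3: list the files of the first dataset and collect their URLs.
    files = datasets[0].file_context().search()
    # Assumption: opendap_url may be None when a file has no OPeNDAP service.
    return [f.opendap_url for f in files if f.opendap_url is not None]
```

Once a connection exists, a call such as find_opendap_urls(conn, project='CMIP6', source_id='CESM2', variable='tas') would return a list of URLs that can be passed to Xarray.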

Configure the connection to a data server

First, we configure our connection to a search node, setting the distributed option to False (distrib=False) so that only this node's index is searched rather than the whole federation. In this case, we are searching the Lawrence Livermore National Laboratory (LLNL) data node.

conn = SearchConnection('https://esgf-node.llnl.gov/esg-search',
                        distrib=False)

Query our dataset

We are interested in a single experiment from CMIP6 - one of the Community Earth System Model version 2 (CESM2) runs, specifically the historical part of the simulation.

We are also interested in a single variable, near-surface air temperature (tas), from a single ensemble member (r10i1p1f1).

ctx = conn.new_context(
    facets='project,experiment_id',
    project='CMIP6',
    table_id='Amon',
    institution_id="NCAR",
    experiment_id='historical',
    source_id='CESM2',
    variable='tas',
    variant_label='r10i1p1f1',
)
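
The keyword arguments above are facet constraints. As a rough illustration of what such a query looks like on the wire, the sketch below builds a comparable ESGF search REST URL by hand, without sending it. It assumes the /search path under the esg-search endpoint and the type, distrib, and format parameters of the ESGF search API; the exact request that pyesgf issues may differ, and the helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

def build_search_url(base_url, distrib=False, **facets):
    """Build (but do not send) an ESGF search URL for the given facets."""
    params = {
        "type": "Dataset",                  # search dataset records
        "distrib": str(distrib).lower(),    # 'false' = this node only
        "format": "application/solr+json",  # ask for a JSON response
    }
    params.update(facets)                   # one query parameter per facet
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"

url = build_search_url(
    "https://esgf-node.llnl.gov/esg-search",
    project="CMIP6", table_id="Amon", institution_id="NCAR",
    experiment_id="historical", source_id="CESM2",
    variable="tas", variant_label="r10i1p1f1",
)
print(url)
```

Pasting a URL like this into a browser is a quick way to inspect the raw search response when a pyesgf query returns nothing.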

Extract the OPeNDAP URLs

In order to access the datasets, we need the URLs to the data. Once we have these, we can read the data remotely!

result = ctx.search()[0]
files = result.file_context().search()
files

<pyesgf.search.results.ResultSet at 0x7fb33032d7b0>

The files object is not immediately helpful. Each entry in it has an opendap_url attribute, which is what we need.

files[0].opendap_url

'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc'

We can apply this to the whole list using a list comprehension, as shown below.

opendap_urls = [file.opendap_url for file in files]
opendap_urls

['http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc',
 'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_190001-194912.nc',
 'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_195001-199912.nc',
 'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc']
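
Each URL ends in a file name that follows the CMIP6 naming pattern (variable, table, model, experiment, member, grid label, time range). As a small sketch, the hypothetical helper below splits that name into labelled parts using only the standard library, which makes it easy to check which period each file covers. The helper name and the field names are chosen for illustration.

```python
from posixpath import basename
from urllib.parse import urlsplit

FIELDS = ("variable_id", "table_id", "source_id", "experiment_id",
          "member_id", "grid_label", "time_range")

def parse_cmip6_filename(url):
    """Split a CMIP6 file name taken from a URL into its components."""
    name = basename(urlsplit(url).path)
    stem = name[:-3] if name.endswith(".nc") else name
    parts = stem.split("_")
    if len(parts) != len(FIELDS):
        # For example, time-invariant files carry no time range.
        raise ValueError(f"unexpected file name: {name}")
    return dict(zip(FIELDS, parts))

url = ("http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/"
       "historical/r10i1p1f1/Amon/tas/gn/v20190313/"
       "tas_Amon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc")
info = parse_cmip6_filename(url)
print(info["variable_id"], info["time_range"])
```

Applied to the four URLs above, the time_range values run from 185001-189912 to 200001-201412, so together the files span January 1850 to December 2014.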

Read the data into an xarray.Dataset

Now that we have our URLs to the data, we can use open multifile dataset (open_mfdataset) to read the data, combining the coordinates and chunking by time.

Xarray, together with the netCDF4 Python library, allows lazy loading: only metadata is read at this point, and data values are fetched when they are actually needed.

ds = xr.open_mfdataset(opendap_urls,
                       combine='by_coords',
                       chunks={'time':480})
ds

<xarray.Dataset> Size: 453MB
Dimensions:    (time: 1980, nbnd: 2, lat: 192, lon: 288)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * time       (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    time_bnds  (time, nbnd) object 32kB dask.array<chunksize=(480, 2), meta=np.ndarray>
    lat_bnds   (time, lat, nbnd) float64 6MB dask.array<chunksize=(600, 192, 2), meta=np.ndarray>
    lon_bnds   (time, lon, nbnd) float64 9MB dask.array<chunksize=(600, 288, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 438MB dask.array<chunksize=(480, 192, 288), meta=np.ndarray>
Attributes: (12/46)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            674885.0
    branch_time_in_parent:           306600.0
    case_id:                         24
    ...                              ...
    table_id:                        Amon
    tracking_id:                     hdl:21.14100/e47b79db-3925-45a7-9c0a-679...
    variable_id:                     tas
    variant_info:                    CMIP6 20th century experiments (1850-201...
    variant_label:                   r10i1p1f1
    DODS_EXTRA.Unlimited_Dimension:  time
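
The sizes in this summary can be checked by hand. The sketch below is plain arithmetic, not an Xarray call: it assumes the four files hold 600, 600, 600, and 180 monthly steps (from the year ranges in their names) and that the requested chunk length of 480 is applied within each file rather than across file boundaries.

```python
def split_into_chunks(n_steps, chunk_len):
    """Chunk lengths produced when n_steps is cut into pieces of chunk_len."""
    full, rest = divmod(n_steps, chunk_len)
    return [chunk_len] * full + ([rest] if rest else [])

steps_per_file = [600, 600, 600, 180]        # 50, 50, 50 and 15 years of months
n_lat, n_lon, bytes_per_value = 192, 288, 4  # float32 grid from the summary

# Chunk each file separately, then concatenate along time.
time_chunks = [c for n in steps_per_file for c in split_into_chunks(n, 480)]
total_bytes = sum(steps_per_file) * n_lat * n_lon * bytes_per_value
largest_chunk_mib = max(time_chunks) * n_lat * n_lon * bytes_per_value / 2**20

print(time_chunks)
print(total_bytes)
print(largest_chunk_mib)
```

Under these assumptions tas splits into seven chunks along time, totals 437,944,320 bytes (the 438MB shown above), and its largest chunk is 101.25 MiB.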

Plot a quick look of the data

Now that we have the dataset, let’s plot a quick look at the data for a single month.

ds.tas.sel(time='1950-01').squeeze().plot(cmap='Spectral_r');

[Figure: map of tas for January 1950, plotted with the Spectral_r colormap]

Compute an area-weighted global average

Let’s apply some computation to this dataset. We would like to calculate the global average temperature. This requires weighting each of the grid cells properly, using the area.
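
To see why the weighting matters, here is a minimal, self-contained sketch of an area-weighted mean on a toy grid. It is not the notebook's calculation: the grid, the temperatures, and the use of cos(latitude) as a stand-in for cell area are assumptions made for illustration. On a regular latitude-longitude grid, cells shrink toward the poles roughly in proportion to the cosine of latitude.

```python
import math

def area_weighted_mean(values, areas):
    """Mean of values[lat][lon] weighted by areas[lat][lon]."""
    weighted_sum = sum(v * a
                       for v_row, a_row in zip(values, areas)
                       for v, a in zip(v_row, a_row))
    total_area = sum(a for a_row in areas for a in a_row)
    return weighted_sum / total_area

lats = [-60.0, 0.0, 60.0]   # toy grid: 3 latitudes x 2 longitudes
temps = [[260.0, 260.0], [300.0, 300.0], [260.0, 260.0]]  # kelvin, made up
# Assumption: relative cell area is proportional to cos(latitude).
areas = [[math.cos(math.radians(lat))] * 2 for lat in lats]

weighted = area_weighted_mean(temps, areas)
unweighted = sum(map(sum, temps)) / 6
print(round(weighted, 2), round(unweighted, 2))
```

The weighted mean comes out near 280 K, while the plain mean is near 273.3 K: without weights, the small high-latitude cells count as much as the large equatorial ones and pull the average down.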

Compute the global average

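Given the area of each cell (the areacella variable, assumed here to be already loaded into ds_area) and the temperature at each point, we can compute the global average temperature: the sum of temperature times area, divided by the total area.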

total_area = ds_area.areacella.sum(dim=['lon', 'lat'])
ta_timeseries = (ds.tas * ds_area.areacella).sum(dim=['lon', 'lat']) / total_area
ta_timeseries

<xarray.DataArray (time: 1980)> Size: 8kB
dask.array<truediv, shape=(1980,), dtype=float32, chunksize=(480,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00

Array: 7.73 kiB in chunks of 1.88 kiB; shape (1980,), chunk shape (480,); 7 chunks in 18 graph layers; float32 numpy.ndarray
Time coordinate: cftime.DatetimeNoLeap values from 1850-01-15 12:00:00 to 2014-12-15 12:00:00 (standard_name: time, bounds: time_bnds)