Search and Load CMIP6 Data via ESGF/OPeNDAP
Overview
This notebook shows how to search for and load data via the Earth System Grid Federation (ESGF) infrastructure, which is the foundation of the CMIP6 distribution system.
The main technologies used here are the ESGF search API, used to figure out what data we want, and OPeNDAP, a remote data access protocol over HTTP.
Prerequisites
| Concepts | Importance | Notes |
|---|---|---|
| | Necessary | |
| | Helpful | Familiarity with metadata structure |
Time to learn: 10 minutes
Imports
import warnings
from distributed import Client
import holoviews as hv
import hvplot.xarray
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pyesgf.search import SearchConnection
import xarray as xr
xr.set_options(display_style='html')
warnings.filterwarnings("ignore")
hv.extension('bokeh')
client = Client()
client
Client: connected to a distributed.LocalCluster (4 workers, 4 threads, 15.61 GiB total memory)
Dashboard: http://127.0.0.1:8787/status
Search using ESGF API
Fortunately, there is an ESGF API implemented in Python, pyesgf, which requires three major steps:
Establish a search connection
Query your data
Extract the urls to your data
Once you have this information, you can load the data into an xarray.Dataset.
Configure the connection to a data server
First, we configure our connection to a server, turning off the distributed option (distrib=False) so that only this one node is searched. In this case, we are searching the Lawrence Livermore National Laboratory (LLNL) data node.
conn = SearchConnection('https://esgf-node.llnl.gov/esg-search',
                        distrib=False)
Query our dataset
We are interested in a single experiment from CMIP6 - one of the Community Earth System Model version 2 (CESM2) runs, specifically the historical part of the simulation.
We are also interested in a single variable, near-surface air temperature (tas), from a single ensemble member (r10i1p1f1).
ctx = conn.new_context(
    facets='project,experiment_id',
    project='CMIP6',
    table_id='Amon',
    institution_id="NCAR",
    experiment_id='historical',
    source_id='CESM2',
    variable='tas',
    variant_label='r10i1p1f1',
)
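The keyword constraints above end up as query parameters of an HTTP request to the search service. As a rough sketch of what such a request might look like, the snippet below builds (but does not send) a query URL with the standard library. The `/search` path and the `type`, `distrib`, and `format` parameters are assumptions about the ESGF search REST API, not something taken from this notebook.

```python
from urllib.parse import urlencode

# Constraints copied from the new_context() call above.
constraints = {
    "project": "CMIP6",
    "table_id": "Amon",
    "institution_id": "NCAR",
    "experiment_id": "historical",
    "source_id": "CESM2",
    "variable": "tas",
    "variant_label": "r10i1p1f1",
}

# Assumed REST parameters (not shown anywhere in this notebook).
params = {
    **constraints,
    "type": "Dataset",
    "distrib": "false",
    "format": "application/solr+json",
}

# Assumed endpoint: the search connection URL plus "/search".
base_url = "https://esgf-node.llnl.gov/esg-search/search"
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)
```

Seeing the constraints as plain key-value pairs can help when debugging a search that returns nothing: each pair narrows the result set, so dropping one at a time shows which constraint is too strict.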
Extract the OPeNDAP URLs
To access the datasets, we need the URLs to the data. Once we have these, we can read the data remotely.
result = ctx.search()[0]
files = result.file_context().search()
files
<pyesgf.search.results.ResultSet at 0x7f801b8d8ee0>
The files object is not immediately helpful. What we need from each file result is its opendap_url attribute.
files[0].opendap_url
'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc'
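The file name at the end of this URL carries much of the metadata we searched on. A minimal sketch of pulling it apart is shown below, assuming the usual CMIP6 file naming pattern of underscore-separated fields. The helper `parse_cmip6_filename` and its key names are hypothetical and chosen for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = ("http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/"
       "historical/r10i1p1f1/Amon/tas/gn/v20190313/"
       "tas_Amon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc")


def parse_cmip6_filename(opendap_url):
    """Split a CMIP6 file name into its underscore-separated fields."""
    # Take the last path component and drop the ".nc" suffix.
    stem = PurePosixPath(urlparse(opendap_url).path).stem
    keys = ["variable_id", "table_id", "source_id", "experiment_id",
            "variant_label", "grid_label", "time_range"]
    # zip() stops at the shorter sequence, so files without a time range
    # (for example fixed fields) simply lack the last key.
    return dict(zip(keys, stem.split("_")))


info = parse_cmip6_filename(url)
print(info)
```

This is handy for a quick sanity check that a returned file really matches the requested variable, member, and period before opening it.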
We can collect this attribute for every file with a list comprehension, as shown below.
opendap_urls = [file.opendap_url for file in files]
opendap_urls
['http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_185001-189912.nc',
'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_190001-194912.nc',
'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_195001-199912.nc',
'http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc']
Read the data into an xarray.Dataset
Now that we have the URLs to the data, we can use the multi-file open function (open_mfdataset) to read the data, combining the files by their coordinates and chunking along time.
Because these are OPeNDAP endpoints, xarray, together with the netCDF4 Python library, can load the data lazily.
ds = xr.open_mfdataset(opendap_urls,
                       combine='by_coords',
                       chunks={'time':480})
ds
<xarray.Dataset> Size: 453MB
Dimensions:    (time: 1980, nbnd: 2, lat: 192, lon: 288)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
  * time       (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    time_bnds  (time, nbnd) object 32kB dask.array<chunksize=(480, 2), meta=np.ndarray>
    lat_bnds   (time, lat, nbnd) float64 6MB dask.array<chunksize=(600, 192, 2), meta=np.ndarray>
    lon_bnds   (time, lon, nbnd) float64 9MB dask.array<chunksize=(600, 288, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 438MB dask.array<chunksize=(480, 192, 288), meta=np.ndarray>
Attributes: (12/46)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            674885.0
    branch_time_in_parent:           306600.0
    case_id:                         24
    ...                              ...
    table_id:                        Amon
    tracking_id:                     hdl:21.14100/e47b79db-3925-45a7-9c0a-679...
    variable_id:                     tas
    variant_info:                    CMIP6 20th century experiments (1850-201...
    variant_label:                   r10i1p1f1
    DODS_EXTRA.Unlimited_Dimension:  time
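The sizes in this output can be checked with a little arithmetic, which also gives a feel for how much data one chunk holds. The sketch below uses only numbers visible above. The per-file chunk layout at the end rests on an assumption: that chunking is applied to each file separately before the files are concatenated, so chunks never span a file boundary.

```python
# Dimension sizes and dtype taken from the Dataset output above.
n_time, n_lat, n_lon = 1980, 192, 288
bytes_per_value = 4   # tas is float32
chunk_time = 480      # from chunks={'time': 480}

total_bytes = n_time * n_lat * n_lon * bytes_per_value
chunk_bytes = chunk_time * n_lat * n_lon * bytes_per_value

print(f"tas total: {total_bytes / 1e6:.0f} MB")
print(f"one full chunk: {chunk_bytes / 1e6:.0f} MB")

# Months per file, read off the date ranges in the four file names
# (three 50-year files and one 15-year file).
months_per_file = [600, 600, 600, 180]


def split_into_chunks(length, size):
    """Return chunk lengths for one file: full chunks plus any remainder."""
    full, rest = divmod(length, size)
    return [size] * full + ([rest] if rest else [])


# Assumption: each file is chunked on its own, then files are concatenated.
time_chunks = [c for n in months_per_file
               for c in split_into_chunks(n, chunk_time)]
print(time_chunks)
```

The total agrees with the 438MB reported for tas. Under the stated assumption, a chunk size that does not divide the 600-month files evenly leaves smaller remainder chunks, which is worth keeping in mind when picking a chunk size.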
Plot a quick look of the data
Now that we have the dataset, let’s take a quick look at a single month of the data.
ds.tas.sel(time='1950-01').squeeze().plot(cmap='Spectral_r');
Compute an area-weighted global average
Let’s apply some computation to this dataset. We would like to calculate the global average temperature. This requires weighting each of the grid cells properly, using the area.
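To see why the weighting matters, here is a small self-contained sketch on a made-up field. It is not CESM2 output: the temperature profile is hypothetical, and cell area is approximated as proportional to the cosine of latitude, which is an assumption that holds roughly for a regular latitude-longitude grid. The notebook itself uses the model's own cell areas instead.

```python
import math

# Hypothetical toy field: temperature depends only on latitude,
# 300 K at the equator and 250 K at the poles.
n_lat = 192
lats = [-90 + 180 * i / (n_lat - 1) for i in range(n_lat)]
temps = [250 + 50 * math.cos(math.radians(lat)) for lat in lats]

# Assumed weights: cell area roughly proportional to cos(latitude).
# max(..., 0.0) guards against tiny negative values from rounding.
weights = [max(math.cos(math.radians(lat)), 0.0) for lat in lats]

unweighted_mean = sum(temps) / len(temps)
weighted_mean = sum(t * w for t, w in zip(temps, weights)) / sum(weights)

print(f"unweighted:    {unweighted_mean:.1f} K")
print(f"area-weighted: {weighted_mean:.1f} K")
```

The unweighted mean comes out several kelvin colder, because the many small polar cells count as much as the large tropical ones.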
Find the area of the cells
We can query the data server again, this time extracting the area of each cell (areacella).
ctx = conn.new_context(
    facets='project,experiment_id',
    project='CMIP6',
    institution_id="NCAR",
    experiment_id='historical',
    source_id='CESM2',
    variable='areacella',
)
As before, we extract the OPeNDAP URLs.
result = ctx.search()[0]
files = result.file_context().search()
opendap_urls = [file.opendap_url for file in files]
opendap_urls
['http://aims3.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/fx/areacella/gn/v20190514/areacella_fx_CESM2_historical_r11i1p1f1_gn.nc']
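The search-then-extract pattern has now been used twice, so it could be wrapped in a small function. The helper below is hypothetical: `find_opendap_urls` is not part of pyesgf, it only chains the calls already used in this notebook. It also assumes that a file without an OPeNDAP endpoint reports its `opendap_url` as `None`.

```python
def find_opendap_urls(conn, **constraints):
    """Return the OPeNDAP URLs of the first dataset matching the constraints.

    `conn` is expected to behave like the SearchConnection created earlier.
    """
    ctx = conn.new_context(**constraints)
    datasets = ctx.search()
    if len(datasets) == 0:
        raise LookupError(f"no dataset matches {constraints}")
    files = datasets[0].file_context().search()
    # Skip files that expose no OPeNDAP endpoint (assumed to appear as None).
    return [f.opendap_url for f in files if f.opendap_url]


# Example call mirroring the areacella query above (needs network access):
# urls = find_opendap_urls(conn, project='CMIP6', institution_id='NCAR',
#                          experiment_id='historical', source_id='CESM2',
#                          variable='areacella')
```

Raising an error on an empty result gives a clearer message than the IndexError that indexing an empty result set with `[0]` would produce.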
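And finally, we load our cell area file into an xarray.Dataset. The cell areas are a fixed field (table fx), so the file can be used even though it belongs to a different ensemble member (r11i1p1f1) than the temperature data.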
ds_area = xr.open_dataset(opendap_urls[0])
ds_area
<xarray.Dataset> Size: 233kB
Dimensions:    (lat: 192, lon: 288, nbnd: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 2kB 0.0 1.25 2.5 3.75 ... 355.0 356.2 357.5 358.8
Dimensions without coordinates: nbnd
Data variables:
    lat_bnds   (lat, nbnd) float64 3kB ...
    lon_bnds   (lon, nbnd) float64 5kB ...
    areacella  (lat, lon) float32 221kB ...
Attributes: (12/44)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   674885.0
    branch_time_in_parent:  219000.0
    case_id:                972
    ...                     ...
    sub_experiment_id:      none
    table_id:               fx
    tracking_id:            hdl:21.14100/96455df2-979e-4cd4-8521-ddf307c6bc4a
    variable_id:            areacella
    variant_info:           CMIP6 20th century experiments (1850-2014) with C...
    variant_label:          r11i1p1f1
Compute the global average
Now that we have the area of each cell, and the temperature at each point, we can compute the global average temperature.
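Written out, the quantity being computed is the area-weighted mean below. The symbols are introduced here for illustration: $T_{ij}(t)$ stands for tas at latitude index $i$ and longitude index $j$, and $A_{ij}$ for the corresponding areacella value.

```latex
\overline{T}(t) = \frac{\sum_{i,j} A_{ij}\, T_{ij}(t)}{\sum_{i,j} A_{ij}}
```

The denominator is the total area of all cells and the numerator is the area-weighted sum of temperature, both summed over latitude and longitude.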
total_area = ds_area.areacella.sum(dim=['lon', 'lat'])
ta_timeseries = (ds.tas * ds_area.areacella).sum(dim=['lon', 'lat']) / total_area
ta_timeseries
<xarray.DataArray (time: 1980)> Size: 8kB
dask.array<truediv, shape=(1980,), dtype=float32, chunksize=(480,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
By default the data are loaded lazily, as Dask arrays. Here we trigger computation explicitly.
%time ta_timeseries.load()
CPU times: user 409 ms, sys: 112 ms, total: 520 ms
Wall time: 10.9 s
<xarray.DataArray (time: 1980)> Size: 8kB
array([284.99948, 285.23215, 285.85364, ..., 288.54376, 287.61884, 287.06284], dtype=float32)
Coordinates:
  * time     (time) object 16kB 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
Visualize our results
Now that we have our results, we can visualize them using static and dynamic plots. Let’s start with static plots using matplotlib, then dynamic plots using hvPlot.
# Convert the cftime-based time index to a pandas DatetimeIndex for plotting
ta_timeseries['time'] = ta_timeseries.indexes['time'].to_datetimeindex()
fig = plt.figure(figsize=(12,8))
ta_timeseries.plot(label='monthly')
ta_timeseries.rolling(time=12).mean().plot(label='12 month rolling mean')
plt.legend()
plt.title('Global Mean Surface Air Temperature')
Text(0.5, 1.0, 'Global Mean Surface Air Temperature')
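The smoothed curve comes from a 12-month rolling mean. As a plain-Python sketch of what that operation does, the function below averages each value with the eleven values before it and leaves positions without a full window as NaN, which matches the default behaviour of xarray's rolling mean as far as I know. The function name and the input series are hypothetical.

```python
import math


def trailing_rolling_mean(values, window):
    """Mean over the current value and the window-1 values before it.

    Positions that do not yet have a full window are returned as NaN.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)
        else:
            segment = values[i + 1 - window:i + 1]
            out.append(sum(segment) / window)
    return out


# Hypothetical monthly values: a pure 12-month cycle around 287 K.
monthly = [287 + 2 * math.sin(2 * math.pi * m / 12) for m in range(36)]
smoothed = trailing_rolling_mean(monthly, window=12)

print(smoothed[:12])  # eleven NaNs, then the first full-window mean
```

Because the window spans exactly one year, the seasonal cycle averages out and only the slower variations remain, which is why the rolling-mean line in the plot is so much smoother than the monthly one.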
ta_timeseries.name = 'Temperature (K)'
monthly_average = ta_timeseries.hvplot(title = 'Global Mean Surface Air Temperature',
label='monthly')
rolling_monthly_average = ta_timeseries.rolling(time=12).mean().hvplot(label='12 month rolling mean',)
(monthly_average * rolling_monthly_average).opts(legend_position='top_left')
Summary
In this notebook, we searched for and opened a CESM2 dataset using the ESGF API and OPeNDAP. We then plotted global average surface air temperature.
What’s next?
We will see some more advanced examples of using the CMIP6 data.
Resources and references
Original notebook in the Pangeo Gallery by Henri Drake and Ryan Abernathey