
Basic Demonstration of Data Reduction Using Globus, Intake-ESGF, and Clisops

Overview

This notebook shows how to use a collection of open-source tools from the Earth System Grid Federation (ESGF) user-computing community to select and reduce datasets available through the federation's servers. Specifically, we will:

  • Select a given time frame

  • Subset for a region (a latitude/longitude bounding box)

  • Average into yearly frequency

Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| Intro to Xarray | Necessary | |
| hvPlot Basics | Necessary | Interactive Visualization with hvPlot |

  • Time to learn: 30 minutes

Imports

import hvplot.xarray
import holoviews as hv
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from intake_esgf import ESGFCatalog
import xarray as xr
import warnings
from clisops.ops.subset import subset, subset_bbox
from clisops.ops.average import average_over_dims, average_time
import os
from globus_compute_sdk import Executor, Client
warnings.filterwarnings("ignore")

hv.extension("matplotlib")

Search and Find Data Using Intake-ESGF

Let's start with a sample dataset, which we can find using intake-esgf.

cat = ESGFCatalog()
cat
Perform a search() to populate the catalog.
cat.search(
    experiment_id="historical",
    source_id="CanESM5",
    frequency="mon",
    variable_id=["gpp", "tas", "pr"],
    variant_label="r1i1p1f1",  # restrict the search to a single ensemble member
)
   Searching indices: 100%|███████████████████████████████|1/1 [    4.22s/index]
Summary information for 3 results:
mip_era                  [CMIP6]
activity_id               [CMIP]
institution_id           [CCCma]
source_id              [CanESM5]
experiment_id       [historical]
member_id             [r1i1p1f1]
table_id            [Amon, Lmon]
variable_id       [tas, pr, gpp]
grid_label                  [gn]
dtype: object
dsd = cat.to_dataset_dict()
dsd.keys()
 Obtaining file info: 100%|███████████████████████████████|3/3 [  1.24dataset/s]
Adding cell measures: 100%|███████████████████████████████|3/3 [  3.04s/dataset]
dict_keys(['Amon.tas', 'Lmon.gpp', 'Amon.pr'])
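The keys printed above look like `table_id.variable_id` pairs. As a minimal sketch, assuming that pattern holds for this search (the key format may differ for other searches), the keys can be indexed by variable name. The helper `split_key` is hypothetical and not part of intake-esgf.

```python
# Keys copied from the dict_keys output above.
keys = ["Amon.tas", "Lmon.gpp", "Amon.pr"]


def split_key(key):
    """Split an assumed 'table_id.variable_id' key into its two parts."""
    table_id, variable_id = key.rsplit(".", 1)
    return table_id, variable_id


# Map each variable name to the full dictionary key it came from.
key_by_variable = {split_key(k)[1]: k for k in keys}

print(key_by_variable["tas"])  # the key to use when indexing dsd
```

With this mapping, `dsd[key_by_variable["tas"]]` would select the same dataset as `dsd["Amon.tas"]`.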
ds = dsd["Amon.tas"]
ds
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lat: 64, lon: 128)
Coordinates:
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tas        (time, lat, lon) float32 ...
    areacella  (lat, lon) float32 ...
Attributes: (12/53)
    CCCma_model_hash:            3dedf95315d603326fde4f5340dc0519d80d10c0
    CCCma_parent_runid:          rc3-pictrl
    CCCma_pycmor_hash:           33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                 rc3.1-his01
    Conventions:                 CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:   1850:01:01:00
    ...                          ...
    tracking_id:                 hdl:21.14100/872062df-acae-499b-aa0f-9eaca76...
    variable_id:                 tas
    variant_label:               r1i1p1f1
    version:                     v20190429
    license:                     CMIP6 model data produced by The Government ...
    cmor_version:                3.4.0

Use clisops to subset for time and location

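def subset_time(ds, start_time="1850-01-01T12:00:00Z", end_time="2014-12-30T12:00:00Z"):
    # Import inside the function so the function is self-contained
    from clisops.ops.subset import subset

    # The keyword is `output_type` (as in yearly_average below); the result
    # is a list of outputs, so take the first element
    return subset(ds, time=f"{start_time}/{end_time}", output_type="xarray")[0]

def subset_location(ds, lat_bounds=[30, 50], lon_bounds=[-100, -80]):
    from clisops.ops.subset import subset_bbox

    # Bounds are [min, max] in degrees north and degrees east
    return subset_bbox(ds, lat_bnds=lat_bounds, lon_bnds=lon_bounds)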
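`subset_time` passes the time range to clisops as a single `start/end` string. The sketch below only illustrates what such an interval means, using the standard library on a hypothetical list of mid-month timestamps. It is not how clisops implements the selection, and it ignores the dataset's actual calendar. The names `parse_interval` and `select_times` are made up for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used in subset_time's defaults


def parse_interval(interval):
    """Split a 'start/end' string into two datetime objects."""
    start, end = interval.split("/")
    return datetime.strptime(start, FMT), datetime.strptime(end, FMT)


def select_times(stamps, interval):
    """Keep the timestamps that fall inside the closed interval."""
    start, end = parse_interval(interval)
    return [t for t in stamps if start <= t <= end]


# Hypothetical monthly timestamps (mid-month, 12:00) for 2013 and 2014.
stamps = [datetime(year, month, 16, 12) for year in (2013, 2014) for month in range(1, 13)]

start_time = "2014-01-01T12:00:00Z"
end_time = "2014-12-30T12:00:00Z"
selected = select_times(stamps, f"{start_time}/{end_time}")

print(len(selected), selected[0], selected[-1])
```

Only the twelve 2014 timestamps fall inside the interval; the 2013 ones are dropped.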
ds.tas.isel(time=0).hvplot.quadmesh(geo=True, cmap="Reds")
subset_location(ds).tas.isel(time=-1).hvplot(x='lon',
                                             y='lat',
                                             features=["land", "lakes", "ocean", "borders"],
                                             cmap='Reds',
                                             geo=True)
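The default `lon_bounds` in `subset_location` are negative, while the dataset's `lon` coordinate shown earlier runs from 0.0 to about 357.2. The sketch below shows the arithmetic for mapping such bounds onto a 0 to 360 grid. It is a plain-Python illustration, not clisops' implementation; the grid spacing of 2.8125 degrees is inferred from the printed coordinate values, and `to_0_360` and `lons_in_box` are hypothetical helpers.

```python
def to_0_360(lon):
    """Map a longitude in degrees east onto the [0, 360) convention."""
    return lon % 360


def lons_in_box(grid_lons, lon_bounds):
    """Return the grid longitudes that fall inside the (west, east) bounds."""
    west, east = (to_0_360(b) for b in lon_bounds)
    if west <= east:
        return [x for x in grid_lons if west <= x <= east]
    # The box crosses the prime meridian, so it wraps around 0/360.
    return [x for x in grid_lons if x >= west or x <= east]


# Assumed grid: 128 longitudes spaced 2.8125 degrees apart, starting at 0.
grid_lons = [i * 2.8125 for i in range(128)]

selected = lons_in_box(grid_lons, (-100, -80))
print(to_0_360(-100), to_0_360(-80))  # bounds expressed on the 0-360 grid
print(selected)
```

Under these assumptions, the bounds -100 and -80 correspond to 260 and 280 degrees east, and seven grid columns fall inside the box.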

Calculate a yearly average

def yearly_average(ds):
    from clisops.ops.average import average_time
    return average_time(ds, "year", output_type="xarray")[0]
yearly_average(subset_location(ds)).isel(time=0).tas.hvplot(
    x='lon',
    y='lat',
    features=["land", "lakes", "ocean", "borders"],
    cmap='Reds',
    geo=True,
)
yearly_average(subset_location(ds)).isel(time=-1).tas.hvplot(
    x='lon',
    y='lat',
    features=["land", "lakes", "ocean", "borders"],
    cmap='Reds',
    geo=True,
)
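To make the reduction step concrete, here is a minimal plain-Python sketch of averaging monthly values into one value per year. It uses made-up numbers and a simple unweighted mean. Whether clisops weights months by their length is not shown in this notebook, so treat this as an illustration of the idea rather than a description of `average_time`. The function `yearly_mean` is hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def yearly_mean(samples):
    """Average ((year, month), value) samples into one value per year."""
    by_year = defaultdict(list)
    for (year, _month), value in samples:
        by_year[year].append(value)
    # Unweighted mean: every month counts equally, regardless of its length.
    return {year: fmean(values) for year, values in sorted(by_year.items())}


# Made-up monthly temperatures in kelvin for two years.
samples = [((2013, m), 280.0 + m) for m in range(1, 13)]
samples += [((2014, m), 290.0) for m in range(1, 13)]

result = yearly_mean(samples)
print(result)
```

Twenty-four monthly samples reduce to two yearly values, which mirrors how the 1980 monthly time steps in the dataset collapse to one step per year.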

Summary

In this notebook, we used clisops to subset and average a CMIP6 dataset that we located and loaded with intake-esgf.

What’s next?

Next, we will look at more advanced examples of these functions, including full task orchestration with Globus Flows.