<img src="images/globus-logo.png" width=250 alt="Globus logo"></img>
<img src="images/esgf.png" width=250 alt="ESGF logo"></img>

# Basic Demonstration of Data Reduction Using Globus, Intake-ESGF, and Clisops

## Overview
This notebook shows how to use a collection of open-source tools from the Earth System Grid Federation (ESGF) user-computing community to select and reduce datasets available through the federation's servers. Specifically, we will:
- Select a given time frame
- Subset a region (a latitude/longitude bounding box)
- Average to yearly frequency

## Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| [Intro to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Necessary | |
| [hvPlot Basics](https://hvplot.holoviz.org/getting_started/hvplot.html) | Necessary | Interactive Visualization with hvPlot |

- **Time to learn**: 30 minutes

## Imports

In [10]:
import hvplot.xarray
import holoviews as hv
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from intake_esgf import ESGFCatalog
import xarray as xr
import warnings
from clisops.ops.subset import subset, subset_bbox
from clisops.ops.average import average_over_dims, average_time
import os
from globus_compute_sdk import Executor, Client
warnings.filterwarnings("ignore")

hv.extension("matplotlib")

## Search and Find Data Using Intake-ESGF
Let's start with a sample dataset, which we can search for using intake-esgf.

In [2]:
cat = ESGFCatalog()
cat

Call `search()` to populate the catalog.

In [3]:
cat.search(
    experiment_id="historical",
    source_id="CanESM5",
    frequency="mon",
    variable_id=["gpp", "tas", "pr"],
    variant_label="r1i1p1f1",  # restrict the search to a single ensemble member
)

   Searching indices: 100%|███████████████████████████████|1/1 [    4.22s/index]


Summary information for 3 results:
mip_era                  [CMIP6]
activity_id               [CMIP]
institution_id           [CCCma]
source_id              [CanESM5]
experiment_id       [historical]
member_id             [r1i1p1f1]
table_id            [Amon, Lmon]
variable_id       [tas, pr, gpp]
grid_label                  [gn]
dtype: object
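
The keyword arguments in the search above act as facet filters. As the summary suggests, a list appears to be treated as "any of these values", and the separate facets are combined. The sketch below mimics that logic on hypothetical in-memory records. The `records` list and the `matches` helper are illustrative assumptions and say nothing about how intake-esgf queries the index servers.

```python
# Hypothetical in-memory records standing in for catalog entries.
records = [
    {"source_id": "CanESM5", "variable_id": "tas", "variant_label": "r1i1p1f1"},
    {"source_id": "CanESM5", "variable_id": "pr", "variant_label": "r1i1p1f1"},
    {"source_id": "CanESM5", "variable_id": "gpp", "variant_label": "r2i1p1f1"},
    {"source_id": "CanESM5", "variable_id": "ua", "variant_label": "r1i1p1f1"},
]


def matches(record, **facets):
    """Return True if the record satisfies every facet constraint."""
    for name, wanted in facets.items():
        # A bare string means "exactly this value"; a list means "any of these".
        allowed = [wanted] if isinstance(wanted, str) else list(wanted)
        if record.get(name) not in allowed:
            return False
    return True


hits = [
    r for r in records
    if matches(r, variable_id=["gpp", "tas", "pr"], variant_label="r1i1p1f1")
]
print([r["variable_id"] for r in hits])
```

Only records that satisfy every facet are kept, which is why adding `variant_label` narrows the result to a single ensemble member.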

In [4]:
dsd = cat.to_dataset_dict()
dsd.keys()

 Obtaining file info: 100%|███████████████████████████████|3/3 [  1.24dataset/s]
Adding cell measures: 100%|███████████████████████████████|3/3 [  3.04s/dataset]


dict_keys(['Amon.tas', 'Lmon.gpp', 'Amon.pr'])
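
The keys shown above look like `<table_id>.<variable_id>`. Assuming that layout holds (other searches may produce keys built from different facets), a small lookup by variable name can be sketched as follows. The helper `index_by_variable` is hypothetical and not part of intake-esgf.

```python
# Keys copied from the output above; the "<table_id>.<variable_id>" layout is
# an observation from this search, not a guaranteed format.
keys = ["Amon.tas", "Lmon.gpp", "Amon.pr"]


def index_by_variable(keys):
    """Map each variable name (the text after the last dot) to its full key."""
    return {key.rsplit(".", 1)[-1]: key for key in keys}


variable_to_key = index_by_variable(keys)
print(variable_to_key["tas"])
```

With such a mapping, `dsd[variable_to_key["tas"]]` would select the same dataset as `dsd["Amon.tas"]`.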

In [5]:
ds = dsd["Amon.tas"]
ds

## Use clisops to subset for time and location

In [21]:
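def subset_time(ds, start_time="1850-01-01T12:00:00Z", end_time="2014-12-30T12:00:00Z"):
    """Return the dataset restricted to the interval start_time/end_time."""
    # Imported inside the function so that it is self-contained.
    from clisops.ops.subset import subset

    # subset() returns a list of outputs; take the single xarray dataset.
    return subset(ds, time=f"{start_time}/{end_time}", output_type="xarray")[0]

def subset_location(ds, lat_bounds=[30, 50], lon_bounds=[-100, -80]):
    """Return the dataset restricted to a latitude/longitude bounding box."""
    from clisops.ops.subset import subset_bbox

    return subset_bbox(ds, lat_bnds=lat_bounds, lon_bnds=lon_bounds)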
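
The default longitude bounds above are negative (degrees west), while many climate model grids store longitude from 0 to 360. clisops is expected to reconcile the two conventions itself. The helpers below are hypothetical and only illustrate the arithmetic behind the conversion.

```python
def to_0_360(lon):
    """Convert a longitude in degrees from [-180, 180] to [0, 360)."""
    return lon % 360


def to_pm180(lon):
    """Convert a longitude in degrees from [0, 360) to [-180, 180)."""
    return ((lon + 180) % 360) - 180


lon_bounds = [-100, -80]
converted = [to_0_360(lon) for lon in lon_bounds]
print(converted)  # [260, 280]
```

Converting back with `to_pm180` recovers the original bounds, so the two conventions describe the same region.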

In [26]:
ds.tas.isel(time=0).hvplot.quadmesh(geo=True, cmap="Reds")

In [27]:
subset_location(ds).tas.isel(time=-1).hvplot(x='lon',
                                             y='lat',
                                             features=["land", "lakes", "ocean", "borders"],
                                             cmap='Reds',
                                             geo=True)

### Calculate a yearly average

In [31]:
def yearly_average(ds):
    from clisops.ops.average import average_time
    return average_time(ds, "year", output_type="xarray")[0]
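
To make the operation concrete, here is a minimal sketch of a yearly average in plain xarray on made-up monthly data. It uses an unweighted mean per calendar year; whether clisops weights months by their length is not shown in this notebook, so treat the equivalence as an assumption.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of made-up monthly temperatures (kelvin) at a single point.
time = pd.date_range("2013-01-01", periods=24, freq="MS")
tas = xr.DataArray(
    np.concatenate([np.full(12, 280.0), np.full(12, 282.0)]),
    coords={"time": time},
    dims="time",
    name="tas",
)

# Group the monthly samples by calendar year and take an unweighted mean.
yearly = tas.groupby("time.year").mean("time")
print(yearly["year"].values, yearly.values)
```

The result has one value per year along a new `year` dimension, in place of the 24 monthly time steps.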

In [40]:
yearly_average(subset_location(ds)).isel(time=0).tas.hvplot(
    x='lon',
    y='lat',
    features=["land", "lakes", "ocean", "borders"],
    cmap='Reds',
    geo=True,
)

In [42]:
yearly_average(subset_location(ds)).isel(time=-1).tas.hvplot(
    x='lon',
    y='lat',
    features=["land", "lakes", "ocean", "borders"],
    cmap='Reds',
    geo=True,
)

## Summary
In this notebook, we used clisops to subset a region and compute yearly averages for data found and loaded with intake-esgf.

### What's next?
Next, we will look at more advanced examples of using these functions, including full task orchestration with Globus Flows.

## Resources and references
- [Intake-ESGF Documentation](https://github.com/nocollier/intake-esgf)
- [Globus Compute Documentation](https://www.globus.org/compute)