
ENSO Calculations using Globus Compute

Overview

In this workflow, we combine topics covered in previous Pythia Foundations and CMIP6 Cookbook content to compute the Niño 3.4 Index for multiple datasets, with the primary computations occurring on a remote machine. As a refresher on what the Niño 3.4 index is, please see the following text, which is also included in the ENSO Xarray content in Pythia Foundations.

Niño 3.4 (5N-5S, 170W-120W): The Niño 3.4 anomalies may be thought of as representing the average equatorial SSTs across the Pacific from about the dateline to the South American coast. The Niño 3.4 index typically uses a 5-month running mean, and El Niño or La Niña events are defined when the Niño 3.4 SSTs exceed +/- 0.4C for a period of six months or more.

Niño X Index computation: a) Compute area averaged total SST from Niño X region; b) Compute monthly climatology (e.g., 1950-1979) for area averaged total SST from Niño X region, and subtract climatology from area averaged total SST time series to obtain anomalies; c) Smooth the anomalies with a 5-month running mean; d) Normalize the smoothed values by its standard deviation over the climatological period.
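To make steps a) through d) concrete, here is a brief, illustrative xarray sketch. It assumes a hypothetical sea surface temperature DataArray tos that has already been subset to the Niño 3.4 region, along with a matching areacello cell-area DataArray; the full version used in this workflow appears later in the notebook.

import xarray as xr


def nino34_index(tos: xr.DataArray, areacello: xr.DataArray) -> xr.DataArray:
    # a) Area-weighted average SST over the Niño 3.4 region
    spatial_dims = [d for d in tos.dims if d != "time"]
    sst = tos.weighted(areacello.fillna(0)).mean(dim=spatial_dims)

    # b) Monthly climatology, subtracted to obtain anomalies
    anomalies = sst.groupby("time.month") - sst.groupby("time.month").mean("time")

    # c) Smooth the anomalies with a 5-month running mean
    smoothed = anomalies.rolling(time=5, center=True).mean()

    # d) Normalize the smoothed values by their standard deviation
    return smoothed / smoothed.std()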

In the previous cookbook, we ran this workflow in a single notebook locally. In this example, we aim to execute the workflow on a remote machine, with only the visualization of the results occurring locally.

The overall goal of this tutorial is to introduce the idea of functions as a service with Globus, and how this can be used to calculate ENSO indices.

Prerequisites

Concepts | Importance | Notes
Intro to Xarray | Necessary |
hvPlot Basics | Necessary | Interactive Visualization with hvPlot
Understanding of NetCDF | Helpful | Familiarity with metadata structure
Calculating ENSO with Xarray | Necessary | Understanding of Masking and Xarray Functions
Dask | Helpful |

  • Time to learn: 30 minutes

Imports

import hvplot.xarray
import holoviews as hv
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from intake_esgf import ESGFCatalog
import xarray as xr
import cf_xarray
import warnings
import os
from globus_compute_sdk import Executor, Client
warnings.filterwarnings("ignore")

hv.extension("bokeh")

Accessing our Data and Computing the ENSO 3.4 Index

As mentioned in the introduction, we are utilizing functions from the previous ENSO notebooks. In order to run these with Globus Compute, we need to comply with the following requirements:

  • All libraries/packages used in the function need to be installed on the Globus Compute endpoint

  • All functions/libraries/packages need to be imported and defined within the function to be executed

  • The output from the function needs to be serializable (e.g., xarray.Dataset, numpy.array)
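As a minimal, illustrative sketch of what these constraints mean in practice (this toy function is not part of the workflow), all imports live inside the function body and the return value is a plain, serializable numpy array:

def toy_compute(values):
    # Imports are resolved on the Globus Compute endpoint, so the package
    # (numpy here) must be installed in the endpoint's environment
    import numpy as np

    # Return a serializable object (a numpy array) rather than, e.g., an open file handle
    return np.asarray(values) ** 2

A function shaped like this can later be handed to the executor's submit method, just as we do with run_plot_enso below.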

Using these constraints, we set up the following function, with the key parameter being which modeling center (model) to compare. Two examples here include the National Center for Atmospheric Research (NCAR) and the Model for Interdisciplinary Research on Climate (MIROC).

def run_plot_enso(model, return_path=False):
    import numpy as np
    import matplotlib.pyplot as plt
    from intake_esgf import ESGFCatalog
    import xarray as xr
    import cf_xarray
    import warnings
    warnings.filterwarnings("ignore")

    def search_esgf(institution_id, grid='gn'):

        # Search and load the ocean surface temperature (tos)
        cat = ESGFCatalog()
        cat.search(
            activity_id="CMIP",
            experiment_id="historical",
            institution_id=institution_id,
            variable_id=["tos"],
            member_id='r11i1p1f1',
            table_id="Omon",
        )
        try:
            tos_ds = cat.to_datatree()[grid].to_dataset()
        except ValueError:
            tos_ds = cat.to_dataset_dict()[""]

        # Search and load the ocean grid cell area
        cat = ESGFCatalog()
        cat.search(
            activity_id="CMIP",
            experiment_id="historical",
            institution_id=institution_id,
            variable_id=["areacello"],
            member_id='r11i1p1f1',
        )
        try:
            area_ds = cat.to_datatree()[grid].to_dataset()
        except ValueError:
            area_ds = cat.to_dataset_dict()[""]
        return xr.merge([tos_ds, area_ds])

    def calculate_enso(ds):

        # Subset the El Nino 3.4 index region
        dso = ds.where(
            (ds.cf["latitude"] < 5)
            & (ds.cf["latitude"] > -5)
            & (ds.cf["longitude"] > 190)
            & (ds.cf["longitude"] < 240),
            drop=True,
        )

        # Calculate the monthly means
        gb = dso.tos.groupby('time.month')

        # Subtract the monthly averages, returning the anomalies
        tos_nino34_anom = gb - gb.mean(dim='time')

        # Determine the non-time dimensions and average using these
        non_time_dims = set(tos_nino34_anom.dims)
        non_time_dims.remove(ds.tos.cf["T"].name)
        weighted_average = tos_nino34_anom.weighted(ds["areacello"]).mean(dim=list(non_time_dims))

        # Calculate the rolling average
        rolling_average = weighted_average.rolling(time=5, center=True).mean()
        std_dev = weighted_average.std()
        return rolling_average / std_dev

    def add_enso_thresholds(da, threshold=0.4):

        # Convert the xr.DataArray into an xr.Dataset
        ds = da.to_dataset()

        # Cleanup the time and use the thresholds
        try:
            ds["time"]= ds.indexes["time"].to_datetimeindex()
        except:
            pass
        ds["tos_gt_04"] = ("time", ds.tos.where(ds.tos >= threshold, threshold).data)
        ds["tos_lt_04"] = ("time", ds.tos.where(ds.tos <= -threshold, -threshold).data)

        # Add fields for the thresholds
        ds["el_nino_threshold"] = ("time", np.zeros_like(ds.tos) + threshold)
        ds["la_nina_threshold"] = ("time", np.zeros_like(ds.tos) - threshold)

        return ds
    
    # Search ESGF for the requested modeling center and compute the index
    ds = search_esgf(model)
    enso_index = add_enso_thresholds(calculate_enso(ds).compute())
    enso_index.attrs = ds.attrs
    enso_index.attrs["model"] = model

    return enso_index

Configure Globus Compute

Now that we have our functions, we can move toward using Globus Flows and Globus Compute.

Globus Flows is a reliable and secure platform for orchestrating and performing research data management and analysis tasks. A flow is often needed to manage data coming from instruments, e.g., image files can be moved from local storage attached to a microscope to a high-performance storage system where they may be accessed by all members of the research project.

More examples of creating and running flows can be found on our demo instance.

Setup a Globus Compute Endpoint

Globus Compute (GC) is a service that allows Python functions to be sent to remote endpoints, executed there, and their output returned to the user. While there is already a collection of endpoints installed, in this section we highlight the steps required to configure one yourself. This idea is also known as “serverless” computing, where users do not need to think about the underlying infrastructure executing the code, but rather submit functions to be run and have the results returned.

To start a GC endpoint on your system, you need to log in, configure a conda environment, and pip install globus-compute-endpoint.

You can then run:

globus-compute-endpoint configure esgf-test

globus-compute-endpoint start esgf-test

Note that by default your endpoint will execute tasks on the login node (if you are using a high-performance computing system). Additional configuration is needed for the endpoint to provision compute nodes. For example, here is the documentation on configuring Globus Compute endpoints on the Argonne Leadership Computing Facility’s Polaris system:

  • https://globus-compute.readthedocs.io/en/latest/endpoints.html#polaris-alcf

endpoint_id = "b3d1d669-d49b-412e-af81-95f3368e525c"

Setup an Executor to Run our Functions

Once we have our compute endpoint ID, we need to pass it to our executor, which will send our functions from our local machine to the machine on which we would like to compute.

gce = Executor(endpoint_id=endpoint_id)
gce.amqp_port = 443
gce
Executor<ep_id:b3d1d669-d49b-412e-af81-95f3368e525c; tg_id:None; bs:128>

Test our Functions

Now that we have our function prepared and an executor to run it on, we can test it out using our endpoint!

We pass in our function and the additional arguments it needs. For example, let’s compare the NCAR and MIROC modeling centers’ CMIP6 simulations.

ncar_task = gce.submit(run_plot_enso, model='NCAR')
miroc_task = gce.submit(run_plot_enso, model='MIROC')

The tasks return Python future objects, with the resulting datasets available by calling .result(), which blocks until the remote computation completes.
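If you prefer not to block right away, these futures behave like standard Python concurrent.futures objects, so (as a small, optional sketch) you can poll them first:

import time

# Poll the futures and wait until both remote tasks report completion
while not (ncar_task.done() and miroc_task.done()):
    time.sleep(5)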

ncar_ds = ncar_task.result()
miroc_ds = miroc_task.result()

ncar_ds
<xarray.Dataset>
Dimensions:            (time: 1980)
Coordinates:
  * time               (time) datetime64[ns] 1850-01-15T13:00:00.000008 ... 2...
    month              (time) int64 1 2 3 4 5 6 7 8 9 ... 4 5 6 7 8 9 10 11 12
Data variables:
    tos                (time) float32 nan nan 0.06341 ... 0.7921 nan nan
    tos_gt_04          (time) float32 0.4 0.4 0.4 0.4 ... 0.6829 0.7921 0.4 0.4
    tos_lt_04          (time) float32 -0.4 -0.4 -0.4 -0.4 ... -0.4 -0.4 -0.4
    el_nino_threshold  (time) float32 0.4 0.4 0.4 0.4 0.4 ... 0.4 0.4 0.4 0.4
    la_nina_threshold  (time) float32 -0.4 -0.4 -0.4 -0.4 ... -0.4 -0.4 -0.4
Attributes: (12/46)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   674885.0
    branch_time_in_parent:  219000.0
    case_id:                972
    ...                     ...
    table_id:               Omon
    tracking_id:            hdl:21.14100/b0ffb89d-095d-4533-a159-a2e1241ff138
    variable_id:            tos
    variant_info:           CMIP6 20th century experiments (1850-2014) with C...
    variant_label:          r11i1p1f1
    model:                  NCAR

Plot our Data

Now that we have our pre-computed datasets, the last step is to visualize the output. In the previous example, we stepped through how to utilize the .hvplot tool to create interactive displays of ENSO values. We will utilize that functionality here, wrapping it into a function.

def plot_enso(ds):
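    # Create the El Niño area graphs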
    el_nino = ds.hvplot.area(x="time", y2='tos_gt_04', y='el_nino_threshold', color='red', hover=False)
    el_nino_label = hv.Text(ds.isel(time=40).time.values, 2, 'El Niño').opts(text_color='red',)

    # Create the La Niña area graphs
    la_nina = ds.hvplot.area(x="time", y2='tos_lt_04', y='la_nina_threshold', color='blue', hover=False)
    la_nina_label = hv.Text(ds.isel(time=-40).time.values, -2, 'La Niña').opts(text_color='blue')

    # Plot a timeseries of the ENSO 3.4 index
    enso = ds.tos.hvplot(x='time', line_width=0.5, color='k', xlabel='Year', ylabel='ENSO 3.4 Index')

    # Combine all the plots into a single plot
    return (el_nino_label * la_nina_label * el_nino * la_nina * enso).opts(title=f'{ds.attrs["model"]} {ds.attrs["source_id"]} \n Ensemble Member: {ds.attrs["variant_label"]}')

Once we have the function, we apply it to our two datasets and combine the plots into a single column.

(plot_enso(ncar_ds) + plot_enso(miroc_ds)).cols(1)

Summary

In this notebook, we applied the ENSO 3.4 index calculations to CMIP6 datasets remotely using Globus Compute and created interactive plots comparing where we see El Niño and La Niña.

What’s next?

We will see some more advanced examples of using CMIP6 and other data access methods, as well as additional computations.