ESGF logo

Complex Searching with `intake-esgf`¶

Overview¶

In this tutorial we will present an interface under design to facilitate complex searching using intake-esgf. intake-esgf is a small intake and intake-esm inspired package under development in ESGF2. Please note that there is a name collison with an existing package in PyPI and conda. You will need to install the package from source.

Prerequisites¶

Concepts	Importance	Notes
Install Package	Necessary
Understanding of NetCDF	Helpful	Familiarity with metadata structure
Familiar with intake-esm	Helpful	Similar interface
Transient climate response	Background

Time to learn: 30 minutes

Imports¶

import intake_esgf
from intake_esgf import ESGFCatalog

Initializing the Catalog¶

As with intake-esm we first instantiate the catalog. However, since we will populate the catalog with search results, the catalog starts empty. Internally, we query different ESGF index nodes for information about what datasets you wish to include in your analysis. As ESGF2 is actively working on an index redesign, our catlogs by default point to a Globus (ElasticSearch) based index at ALCF (Argonne Leadership Computing Facility).

cat = ESGFCatalog()
print(cat)
for ind in cat.indices: # Which indices are included?
    print(ind)

Perform a search() to populate the catalog.
GlobusESGFIndex('ESGF2-US-1.5-Catalog')

We also provide support for connecting to the ESGF1 Solr-based indices. You may specify a server in the dictionary or multiple servers - just make sure to include True.

Uncommend the line setting all_indices=True to include all available indices.

intake_esgf.conf.set(indices={"esgf-node.llnl.gov": True,
                              "esgf-node.ornl.gov": True,
                              "esgf.ceda.ac.uk": True})

intake_esgf.conf.set(all_indices=True)  # all federated indices

cat = ESGFCatalog()
for ind in cat.indices:
    print(ind)

GlobusESGFIndex('ESGF2-US-1.5-Catalog')
GlobusESGFIndex('anl-dev')
GlobusESGFIndex('ornl-dev')
SolrESGFIndex('esgf.ceda.ac.uk')
SolrESGFIndex('esgf-data.dkrz.de')
SolrESGFIndex('esgf-node.ipsl.upmc.fr')
SolrESGFIndex('esg-dn1.nsc.liu.se')
SolrESGFIndex('esgf.nci.org.au')
SolrESGFIndex('esgf-node.ornl.gov')
SolrESGFIndex('esgf-node.llnl.gov')
STACESGFIndex('api.stac.ceda.ac.uk')

/home/runner/micromamba/envs/esgf-cookbook-dev/lib/python3.11/site-packages/intake_esgf/catalog.py:97: UserWarning: You have enabled at least one index which uses the old Solr technology. ESGF is moving away from this technology and you may find that some indicesfail to return a response.
  warnings.warn(

Populate the catalog¶

Many times, an analysis will require several variables across multiple experiments. For example, if one were to compute the transient climate response (TCRE), you would need tempererature (tas) and carbon emissions from land (nbp) and ocean (fgco2) for a 1% CO2 increase experiment (1pctCO2) as well as the control experiment (piControl). If TCRE is not in your particular science, that is ok for this notebook. It is a motivating example and the specifics are less important than the search concepts. First, we perform a search in a familiar syntax.

cat.search(
    experiment_id=["piControl", "1pctCO2"],
    variable_id=["tas", "fgco2", "nbp"],
    table_id=["Amon", "Omon", "Lmon"],
)
print(cat)

Internally, this launches simultaneous searches that are combined locally to provide a global view of what datasets are available. While the Solr indices themselves can be searched in distributed fashion, they will not report if an index has failed to return a response. As index nodes go down from time to time, this can leave you with a false impression that you have found all the datasets of interest. By managing the searches locally, intake-esgf can report back to you that an index has failed and that your results may be incomplete.

If you would like details about what intake-esgf is doing, look in the local cache directory (${HOME}/.esgf/) for a esgf.log file. This is a full history of everything that intake-esgf has searched, downloaded, or accessed. You can also look at just this session by calling session_log(). In this case you will see how long each index took to return a response and if any failed

print(cat.session_log())

2025-07-12 01:18:35 └─SolrESGFIndex('esgf-node.ornl.gov') no response
2025-07-12 01:18:35 └─GlobusESGFIndex('anl-dev') results=9 response_time=0.46
2025-07-12 01:18:35 └─GlobusESGFIndex('ornl-dev') results=650 response_time=0.74
2025-07-12 01:18:36 └─SolrESGFIndex('esgf.ceda.ac.uk') no response
2025-07-12 01:18:36 search begin experiment_id=['historical'], source_id=['CanESM5'], frequency=['mon'], variable_id=['gpp', 'tas', 'pr'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2025-07-12 01:18:36 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=1396 response_time=1.14
2025-07-12 01:18:36 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=800 response_time=0.44
2025-07-12 01:18:36 combine_time=0.09
2025-07-12 01:18:36 search end total_time=0.58
2025-07-12 01:18:36 search begin experiment_id=['historical'], source_id=['CanESM5'], frequency=['mon'], variable_id=['gpp', 'tas', 'pr'], variant_label=['r1i1p1f1'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2025-07-12 01:18:37 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=14 response_time=0.15
2025-07-12 01:18:37 combine_time=0.00
2025-07-12 01:18:37 search end total_time=0.16
2025-07-12 01:18:37 file info begin
2025-07-12 01:18:37 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=14 response_time=0.14
2025-07-12 01:18:37 combine_time=0.00
2025-07-12 01:18:37 file info end total_time=0.15
2025-07-12 01:18:37 search begin activity_id=['CMIP'], experiment_id=['historical'], variable_id=['pr'], member_id=['r1i1p1f1'], grid_label=['gn'], table_id=['Amon'], source_id=['CESM2', 'CanESM5'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2025-07-12 01:18:37 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=10 response_time=0.12
2025-07-12 01:18:38 transfer_time=0.80 [s] at 75.04 [Mb s-1] http://crd-esgf-drc.ec.gc.ca/thredds/fileServer/esgC_dataroot/AR6/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Amon/pr/gn/v20190429/pr_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
2025-07-12 01:18:38 └─SolrESGFIndex('esgf-data.dkrz.de') results=266 response_time=3.74
2025-07-12 01:18:39 search begin experiment_id=['historical'], variable_id=['tos'], table_id=['Omon'], project=['CMIP6'], grid_label=['gn'], source_id=['CAMS-CSM1-0', 'FGOALS-g3', 'CMCC-CM2-SR5', 'CNRM-CM6-1', 'CNRM-ESM2-1', 'CESM2'], type=['Dataset'], latest=[True], retracted=[False]
2025-07-12 01:18:39 └─SolrESGFIndex('esgf-node.ipsl.upmc.fr') results=58 response_time=4.57
2025-07-12 01:18:39 └─SolrESGFIndex('esg-dn1.nsc.liu.se') results=62 response_time=4.64
2025-07-12 01:18:40 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=266 response_time=0.25
2025-07-12 01:18:42 └─SolrESGFIndex('esgf.nci.org.au') no response
2025-07-12 01:18:46 transfer_time=8.86 [s] at 2.16 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Lmon/gpp/gn/v20190429/gpp_Lmon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
2025-07-12 01:19:07 transfer_time=29.92 [s] at 1.75 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Amon/tas/gn/v20190429/tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
2025-07-12 01:19:07 accessed /home/runner/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Amon/pr/gn/v20190429/pr_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
2025-07-12 01:19:07 accessed /home/runner/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Amon/tas/gn/v20190429/tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
2025-07-12 01:19:07 accessed /home/runner/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Lmon/gpp/gn/v20190429/gpp_Lmon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
2025-07-12 01:19:07 search begin mip_era=['CMIP6'], activity_drs=['CMIP'], grid_label=['gn'], member_id=['r1i1p1f1'], institution_id=['CCCma'], source_id=['CanESM5'], experiment_id=['historical'], variable_id=['areacella'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2025-07-12 01:19:07 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.12
2025-07-12 01:19:07 combine_time=0.00
2025-07-12 01:19:07 search end total_time=0.12
2025-07-12 01:19:07 file info begin
2025-07-12 01:19:07 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.06
2025-07-12 01:19:07 combine_time=0.00
2025-07-12 01:19:07 file info end total_time=0.06
2025-07-12 01:19:08 download failed http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:09 download failed http://esgf-data04.diasjp.net/thredds/fileServer/esg_dataroot/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:10 transfer_time=1.35 [s] at 0.04 [Mb s-1] https://g-52ba3.fd635.8443.data.globus.org/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:10 accessed /home/runner/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:10 search begin mip_era=['CMIP6'], activity_drs=['CMIP'], grid_label=['gn'], member_id=['r1i1p1f1'], institution_id=['CCCma'], source_id=['CanESM5'], experiment_id=['historical'], variable_id=['areacella'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2025-07-12 01:19:10 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.00
2025-07-12 01:19:10 combine_time=0.00
2025-07-12 01:19:10 search end total_time=0.01
2025-07-12 01:19:10 file info begin
2025-07-12 01:19:10 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.00
2025-07-12 01:19:10 combine_time=0.00
2025-07-12 01:19:10 file info end total_time=0.00
2025-07-12 01:19:10 accessed /home/runner/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:10 search begin mip_era=['CMIP6'], activity_drs=['CMIP'], grid_label=['gn'], member_id=['r1i1p1f1'], institution_id=['CCCma'], source_id=['CanESM5'], experiment_id=['historical'], variable_id=['areacella'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2025-07-12 01:19:10 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.00
2025-07-12 01:19:10 combine_time=0.00
2025-07-12 01:19:10 search end total_time=0.01
2025-07-12 01:19:10 file info begin
2025-07-12 01:19:10 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.00
2025-07-12 01:19:10 combine_time=0.00
2025-07-12 01:19:10 file info end total_time=0.00
2025-07-12 01:19:10 accessed /home/runner/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/areacella/gn/v20190429/areacella_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:10 search begin mip_era=['CMIP6'], activity_drs=['CMIP'], grid_label=['gn'], member_id=['r1i1p1f1'], institution_id=['CCCma'], source_id=['CanESM5'], experiment_id=['historical'], variable_id=['sftlf'], type=['Dataset'], project=['CMIP6'], latest=[True], retracted=[False]
2025-07-12 01:19:10 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.10
2025-07-12 01:19:10 combine_time=0.00
2025-07-12 01:19:10 search end total_time=0.11
2025-07-12 01:19:10 file info begin
2025-07-12 01:19:10 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=5 response_time=0.07
2025-07-12 01:19:10 combine_time=0.00
2025-07-12 01:19:10 file info end total_time=0.08
2025-07-12 01:19:11 download failed http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/sftlf/gn/v20190429/sftlf_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:11 download failed http://esgf-data04.diasjp.net/thredds/fileServer/esg_dataroot/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/sftlf/gn/v20190429/sftlf_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:12 transfer_time=1.08 [s] at 0.06 [Mb s-1] https://g-52ba3.fd635.8443.data.globus.org/css03_data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/sftlf/gn/v20190429/sftlf_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:19:12 accessed /home/runner/.esgf/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/sftlf/gn/v20190429/sftlf_fx_CanESM5_historical_r1i1p1f1_gn.nc
2025-07-12 01:20:51 └─SolrESGFIndex('esgf-node.llnl.gov') no response
2025-07-12 01:20:51 └─SolrESGFIndex('esgf-node.llnl.gov') no response
2025-07-12 01:20:51 combine_time=0.00
2025-07-12 01:20:51 search end total_time=133.88
2025-07-12 01:20:51 combine_time=0.16
2025-07-12 01:20:51 search end total_time=136.60

At this stage of the search you have a catalog full of possibly relevant datasets for your analysis, stored in a pandas dataframe. You are free to view and manipulate this dataframe to help hone these results down. It is available to you as the df member of the ESGFCatalog. You should be careful to only remove rows as internally we could use any column in the downloading of the data. Also note that we have removed the user-facing notion of where the data is hosted. The id column of this dataframe is a list of full dataset_ids which includes the location information. At the point when you are ready to download data, we will choose locations automatically that are fastest for you.

cat.df

Model Groups¶

However, intake-esgf also provides you with some tools to help locate relevant data for your analysis. When conducting these kinds of analyses, we are seeking for unique combinations of a source_id, member_id, and grid_label that have all the variables that we need. We call these model groups. In an ESGF search, it is common to find a model that has, for example, a tas for r1i1p1f1 but not a fgco2. Sorting this out is time consuming and labor intensive. So first, we provide you a function to print out all model groups with the following function.

cat.model_groups().to_frame()

The function model_groups() returns a pandas Series (converted to a dataframe here for printing) with all unique combinations of (source_id,member_id,grid_label) along with the dataset count for each. This helps illustrate why it can be so difficult to locate all the data relevant to a given analysis. At the time of this writing, there are 148 model groups but relatively few of them with all 6 (2 experiments and 3 variables) datasets that we need. Furthermore, you cannot rely on a model group using r1i1p1f1 for its primary result. The results above show that UKESM does not even use f1 at all, further complicating the process of finding results.

In addition to this notion of model groups, intake-esgf provides you a method remove_incomplete() for determing which model groups you wish to keep in the current search. Internally, we will group the search results dataframe by model groups and apply a function of your design to the grouped portion of the dataframe. For example, for the current work, I could just check that there are 6 datasets in the sub-dataframe.

def shall_i_keep_it(sub_df):
    if len(sub_df) == 6:
        return True
    return False


cat.remove_incomplete(shall_i_keep_it)
cat.model_groups().to_frame()

You could write a much more complex check--it depends on what is relevant to your analysis. The effect is that the list of possible models with consistent results is now much more manageable. This method has the added benefit of forcing the user to be concrete about which models were included in an analysis.

Removing Additional Variants¶

It may also be that you wish to only include a single member_id in your analysis. The above search shows we have a few models with multiple variants that have all 6 required datasets. To be fair to those that only have 1, you may wish to only keep the smallest variant. We also provide this function as part of the ESGFCatalog object.

cat.remove_ensembles()
cat.model_groups().to_frame()

Summary¶

At this point, you would be ready to use to_dataset_dict() to download and load all datasets into a dictionary for analysis. The point of this notebook however is to expose the search capabilities. It is our goal to make annoying and time-consuming tasks easier by providing you smart interfaces for common operations. Let us know what else is painful for you in locating relevant data for your science.

References¶

Gregory, J. M., & Forster, P. M. (2008). Transient climate response estimated from radiative forcing and observed temperature change. Journal of Geophysical Research: Atmospheres, 113(D23). 10.1029/2008jd010405

Searching

Introduction to intake-esgf