Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Overview

In this tutorial we are going to use a large collection of pre-generated Kerchunk reference files and open them with Xarray’s new DataTree functionality. This chapter is heavily inspired by this blog post.

About the Dataset

This collection of reference files were generated from the NASA NEX-GDDP-CMIP6 (Global Daily Downscaled Projections) dataset. A version of this dataset is hosted on s3 as a collection of NetCDF files.

Prerequisites

ConceptsImportanceNotes
Kerchunk BasicsRequiredCore
Multiple Files and KerchunkRequiredCore
Kerchunk and DaskRequiredCore
Multi-File Datasets with KerchunkRequiredIO/Visualization
Xarray-Datatree OverviewRequiredIO
  • Time to learn: 30 minutes

Motivation

In total the dataset is roughly 12TB in compressed blob storage, with a single NetCDF file per yearly timestep, per variable. Downloading this entire dataset for analysis on a local machine would difficult to say the least. The collection of Kerchunk reference files for this entire dataset is only 272 Mb, which is about 42,000 times smaller!

Imports

import dask
import hvplot.xarray  # noqa
import pandas as pd
import xarray as xr
from xarray import DataTree
from distributed import Client
from fsspec.implementations.reference import ReferenceFileSystem
Loading...
Loading...
Loading...
Loading...

Read the reference catalog

The NASA NEX-GDDP-CMIP6 dataset is organized by GCM, Scenario and Ensemble Member. Each of these Scenario/GCM combinations is represented as a combined reference file, which was created by merging across variables and concatenating along time-steps. All of these references are organized into a simple .csv catalog in the schema:

GCM/Scenariourl

Organzing with Xarray-Datatree

Not all of the GCM/Scenario reference datasets have shared spatial coordinates and many of the have slight differences in their calendar and thus time dimension. Because of this, these cannot be combined into a single Xarray-Dataset. Fortunately Xarray-Datatree provides a higher level abstraction where related Xarray-Datasets are organized into a tree structure where each dataset corresponds to a leaf.

# Read the reference catalog into a Pandas DataFrame
cat_df = pd.read_csv(
    "s3://carbonplan-share/nasa-nex-reference/reference_catalog_nested.csv"
)
# Convert the DataFrame into a dictionary
catalog = cat_df.set_index("ID").T.to_dict("records")[0]

Load Reference Datasets into Xarray-DataTree

In the following cell we create a function load_ref_ds, which can be parallelized via Dask to load Kerchunk references into a dictionary of Xarray-Datasets.

def load_ref_ds(url: str):
    fs = ReferenceFileSystem(
        url,
        remote_protocol="s3",
        target_protocol="s3",
        remote_options={"anon": True},
        target_options={"anon": True},
        lazy=True,
    )
    return xr.open_dataset(
        fs.get_mapper(),
        engine="zarr",
        backend_kwargs={"consolidated": False},
        chunks={"time": 300},
    )


tasks = {id: dask.delayed(load_ref_ds)(url) for id, url in catalog.items()}

Use Dask Distributed to load the Xarray-Datasets from Kerchunk reference files

Using Dask, we are loading 164 reference datasets into memory. Since they are are Xarray datasets the coordinates are loaded eagerly, but the underlying data is still lazy.

client = Client(n_workers=8)
client
/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/distributed/node.py:188: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37435 instead
  warnings.warn(
Loading...
catalog_computed = dask.compute(tasks)
Fetching long content....
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 catalog_computed = dask.compute(tasks)

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/dask/base.py:685, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    682     expr = expr.optimize()
    683     keys = list(flatten(expr.__dask_keys__()))
--> 685     results = schedule(expr, keys, **kwargs)
    687 return repack(results)

Cell In[3], line 10
---> 10     return xr.open_dataset(
     11         fs.get_mapper(),
     12         engine="zarr",
     13         backend_kwargs={"consolidated": False},

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/api.py:607, in open_dataset()
    595 decoders = _resolve_decoders_kwargs(
    596     decode_cf,
    597     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)    603     decode_coords=decode_coords,
    604 )
    606 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 607 backend_ds = backend.open_dataset(
    608     filename_or_obj,
    609     drop_variables=drop_variables,
    610     **decoders,
    611     **kwargs,
    612 )
    613 ds = _dataset_from_backend_dataset(
    614     backend_ds,
    615     filename_or_obj,
   (...)    626     **kwargs,
    627 )
    628 return ds

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/zarr.py:1732, in open_dataset()
   1730 filename_or_obj = _normalize_path(filename_or_obj)
   1731 if not store:
-> 1732     store = ZarrStore.open_group(
   1733         filename_or_obj,
   1734         group=group,
   1735         mode=mode,
   1736         synchronizer=synchronizer,
   1737         consolidated=consolidated,
   1738         consolidate_on_close=False,
   1739         chunk_store=chunk_store,
   1740         storage_options=storage_options,
   1741         zarr_version=zarr_version,
   1742         use_zarr_fill_value_as_mask=None,
   1743         zarr_format=zarr_format,
   1744         cache_members=cache_members,
   1745     )
   1747 store_entrypoint = StoreBackendEntrypoint()
   1748 with close_on_error(store):

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/zarr.py:771, in open_group()
    745 @classmethod
    746 def open_group(
    747     cls,
   (...)    764     cache_members: bool = True,
    765 ):
    766     (
    767         zarr_group,
    768         consolidate_on_close,
    769         close_store_on_close,
    770         use_zarr_fill_value_as_mask,
--> 771     ) = _get_open_params(
    772         store=store,
    773         mode=mode,
    774         synchronizer=synchronizer,
    775         group=group,
    776         consolidated=consolidated,
    777         consolidate_on_close=consolidate_on_close,
    778         chunk_store=chunk_store,
    779         storage_options=storage_options,
    780         zarr_version=zarr_version,
    781         use_zarr_fill_value_as_mask=use_zarr_fill_value_as_mask,
    782         zarr_format=zarr_format,
    783     )
    785     return cls(
    786         zarr_group,
    787         mode,
   (...)    796         cache_members=cache_members,
    797     )

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/zarr.py:1974, in _get_open_params()
   1970     if _zarr_v3():
   1971         # we have determined that we don't want to use consolidated metadata
   1972         # so we set that to False to avoid trying to read it
   1973         open_kwargs["use_consolidated"] = False
-> 1974     zarr_group = zarr.open_group(store, **open_kwargs)
   1976 close_store_on_close = zarr_group.store is not store
   1978 # we use this to determine how to handle fill_value

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/api/synchronous.py:533, in open_group()
    463 def open_group(
    464     store: StoreLike | None = None,
    465     *,
   (...)    475     use_consolidated: bool | str | None = None,
    476 ) -> Group:
    477     """Open a group using file-mode-like semantics.
    478 
    479     Parameters
   (...)    530         The new group.
    531     """
    532     return Group(
--> 533         sync(
    534             async_api.open_group(
    535                 store=store,
    536                 mode=mode,
    537                 cache_attrs=cache_attrs,
    538                 synchronizer=synchronizer,
    539                 path=path,
    540                 chunk_store=chunk_store,
    541                 storage_options=storage_options,
    542                 zarr_format=zarr_format,
    543                 meta_array=meta_array,
    544                 attributes=attributes,
    545                 use_consolidated=use_consolidated,
    546             )
    547         )
    548     )

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/core/sync.py:158, in sync()
    155 return_result = next(iter(finished)).result()
    157 if isinstance(return_result, BaseException):
--> 158     raise return_result
    159 else:
    160     return return_result

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/core/sync.py:118, in _runner()
    113 """
    114 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    115 exception, the exception will be returned.
    116 """
    117 try:
--> 118     return await coro
    119 except Exception as ex:
    120     return ex

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/api/asynchronous.py:825, in open_group()
    822 if chunk_store is not None:
    823     warnings.warn("chunk_store is not yet implemented", ZarrRuntimeWarning, stacklevel=2)
--> 825 store_path = await make_store_path(store, mode=mode, storage_options=storage_options, path=path)
    826 if attributes is None:
    827     attributes = {}

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_common.py:453, in make_store_path()
    448     raise ValueError(
    449         "'path' was provided but is not used for FSMap store_like objects. Specify the path when creating the FSMap instance instead."
    450     )
    452 else:
--> 453     store = await make_store(store_like, mode=mode, storage_options=storage_options)
    454     return await StorePath.open(store, path=path_normalized, mode=mode)

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_common.py:385, in make_store()
    380         return FsspecStore.from_url(
    381             store_like, storage_options=storage_options, read_only=_read_only
    382         )
    384 elif _has_fsspec and isinstance(store_like, FSMap):
--> 385     return FsspecStore.from_mapper(store_like, read_only=_read_only)
    387 else:
    388     raise TypeError(f"Unsupported type for store_like: '{type(store_like).__name__}'")

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_fsspec.py:197, in from_mapper()
    173 @classmethod
    174 def from_mapper(
    175     cls,
   (...)    178     allowed_exceptions: tuple[type[Exception], ...] = ALLOWED_EXCEPTIONS,
    179 ) -> FsspecStore:
    180     """
    181     Create an FsspecStore from an FSMap object.
    182 
   (...)    195     FsspecStore
    196     """
--> 197     fs = _make_async(fs_map.fs)
    198     return cls(
    199         fs=fs,
    200         path=fs_map.root,
    201         read_only=read_only,
    202         allowed_exceptions=allowed_exceptions,
    203     )

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_fsspec.py:57, in _make_async()
     55     fs_dict = json.loads(fs.to_json())
     56     fs_dict["asynchronous"] = True
---> 57     return fsspec.AbstractFileSystem.from_json(json.dumps(fs_dict))
     59 if fsspec_version < parse_version("2024.12.0"):
     60     raise ImportError(
     61         f"The filesystem '{fs}' is synchronous, and the required "
     62         "AsyncFileSystemWrapper is not available. Upgrade fsspec to version "
     63         "2024.12.0 or later to enable this functionality."
     64     )

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/spec.py:1494, in from_json()
   1473 """
   1474 Recreate a filesystem instance from JSON representation.
   1475 
   (...)   1490 at import time.
   1491 """
   1492 from .json import FilesystemJSONDecoder
-> 1494 return json.loads(blob, cls=FilesystemJSONDecoder)

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/json/__init__.py:365, in loads()
    363 if parse_constant is not None:
    364     kw['parse_constant'] = parse_constant
--> 365 return cls(**kw).decode(s)

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/json/decoder.py:345, in decode()
    340 def decode(self, s, _w=WHITESPACE.match):
    341     """Return the Python representation of ``s`` (a ``str`` instance
    342     containing a JSON document).
    343 
    344     """
--> 345     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    346     end = _w(s, end).end()
    347     if end != len(s):

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/json/decoder.py:361, in raw_decode()
    352 """Decode a JSON document from ``s`` (a ``str`` beginning with
    353 a JSON document) and return a 2-tuple of the Python
    354 representation and the index in ``s`` where the document ended.
   (...)    358 
    359 """
    360 try:
--> 361     obj, end = self.scan_once(s, idx)
    362 except StopIteration as err:
    363     raise JSONDecodeError("Expecting value", s, err.value) from None

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/json.py:92, in custom_object_hook()
     90 if "cls" in dct:
     91     if (obj_cls := self.try_resolve_fs_cls(dct)) is not None:
---> 92         return AbstractFileSystem.from_dict(dct)
     93     if (obj_cls := self.try_resolve_path_cls(dct)) is not None:
     94         return obj_cls(dct["str"])

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/spec.py:1570, in from_dict()
   1567 dct.pop("cls", None)
   1568 dct.pop("protocol", None)
-> 1570 return cls(
   1571     *json_decoder.unmake_serializable(dct.pop("args", ())),
   1572     **json_decoder.unmake_serializable(dct),
   1573 )

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/spec.py:84, in __call__()
     82     return cls._cache[token]
     83 else:
---> 84     obj = super().__call__(*args, **kwargs, **strip_tokenize_options)
     85     # Setting _fs_token here causes some static linters to complain.
     86     obj._fs_token_ = token

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/implementations/reference.py:781, in __init__()
    779     self.fss[k] = AsyncFileSystemWrapper(f, asynchronous=self.asynchronous)
    780 elif self.asynchronous ^ f.asynchronous:
--> 781     raise ValueError(
    782         "Reference-FS's target filesystem must have same value "
    783         "of asynchronous"
    784     )

ValueError: Reference-FS's target filesystem must have same value of asynchronous

Build an Xarray-Datatree from the dictionary of datasets

dt = DataTree.from_dict(catalog_computed[0])

Accessing the Datatree

A Datatree is a collection of related Xarray datasets. We can access individual datasets using UNIX syntax. In the cell below, we will access a single dataset from the datatree.

dt["ACCESS-CM2/ssp585"]

# or

dt["ACCESS-CM2"]["ssp585"]
Convert a Datatree node to a Dataset
dt["ACCESS-CM2"]["ssp585"].to_dataset()

Operations across a Datatree

A Datatree contains a collection of datasets with related coordinates and variables. Using some in-built methods, we can analyze it as if it were a single dataset. Instead of looping through hundreds of Xarray datasets, we can apply operations across the Datatree. In the example below, we will lazily create a time-series.

ts = dt.mean(dim=["lat", "lon"])

Visualize a single dataset with HvPlot

display(  # noqa
    dt["ACCESS-CM2/ssp585"].to_dataset().pr.hvplot("lon", "lat", rasterize=True)
)

Shut down the Dask cluster

client.shutdown()