Overview¶
In this tutorial we are going to use a large collection of pre-generated Kerchunk reference files and open them with Xarray’s new DataTree functionality. This chapter is heavily inspired by this blog post.
About the Dataset¶
This collection of reference files were generated from the NASA NEX-GDDP-CMIP6 (Global Daily Downscaled Projections) dataset. A version of this dataset is hosted on s3 as a collection of NetCDF files.
Prerequisites¶
| Concepts | Importance | Notes |
|---|---|---|
| Kerchunk Basics | Required | Core |
| Multiple Files and Kerchunk | Required | Core |
| Kerchunk and Dask | Required | Core |
| Multi-File Datasets with Kerchunk | Required | IO/Visualization |
| Xarray-Datatree Overview | Required | IO |
Time to learn: 30 minutes
Motivation¶
In total the dataset is roughly 12TB in compressed blob storage, with a single NetCDF file per yearly timestep, per variable. Downloading this entire dataset for analysis on a local machine would difficult to say the least. The collection of Kerchunk reference files for this entire dataset is only 272 Mb, which is about 42,000 times smaller!
Imports¶
import dask
import hvplot.xarray # noqa
import pandas as pd
import xarray as xr
from xarray import DataTree
from distributed import Client
from fsspec.implementations.reference import ReferenceFileSystemRead the reference catalog¶
The NASA NEX-GDDP-CMIP6 dataset is organized by GCM, Scenario and Ensemble Member. Each of these Scenario/GCM combinations is represented as a combined reference file, which was created by merging across variables and concatenating along time-steps. All of these references are organized into a simple .csv catalog in the schema:
| GCM/Scenario | url |
|---|
Organzing with Xarray-Datatree¶
Not all of the GCM/Scenario reference datasets have shared spatial coordinates and many of the have slight differences in their calendar and thus time dimension.
Because of this, these cannot be combined into a single Xarray-Dataset. Fortunately Xarray-Datatree provides a higher level abstraction where related Xarray-Datasets are organized into a tree structure where each dataset corresponds to a leaf.
# Read the reference catalog into a Pandas DataFrame
cat_df = pd.read_csv(
"s3://carbonplan-share/nasa-nex-reference/reference_catalog_nested.csv"
)
# Convert the DataFrame into a dictionary
catalog = cat_df.set_index("ID").T.to_dict("records")[0]Load Reference Datasets into Xarray-DataTree¶
In the following cell we create a function load_ref_ds, which can be parallelized via Dask to load Kerchunk references into a dictionary of Xarray-Datasets.
def load_ref_ds(url: str):
fs = ReferenceFileSystem(
url,
remote_protocol="s3",
target_protocol="s3",
remote_options={"anon": True},
target_options={"anon": True},
lazy=True,
)
return xr.open_dataset(
fs.get_mapper(),
engine="zarr",
backend_kwargs={"consolidated": False},
chunks={"time": 300},
)
tasks = {id: dask.delayed(load_ref_ds)(url) for id, url in catalog.items()}Use Dask Distributed to load the Xarray-Datasets from Kerchunk reference files¶
Using Dask, we are loading 164 reference datasets into memory. Since they are are Xarray datasets the coordinates are loaded eagerly, but the underlying data is still lazy.
client = Client(n_workers=8)
client/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/distributed/node.py:188: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37435 instead
warnings.warn(
catalog_computed = dask.compute(tasks)---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 1
----> 1 catalog_computed = dask.compute(tasks)
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/dask/base.py:685, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
682 expr = expr.optimize()
683 keys = list(flatten(expr.__dask_keys__()))
--> 685 results = schedule(expr, keys, **kwargs)
687 return repack(results)
Cell In[3], line 10
---> 10 return xr.open_dataset(
11 fs.get_mapper(),
12 engine="zarr",
13 backend_kwargs={"consolidated": False},
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/api.py:607, in open_dataset()
595 decoders = _resolve_decoders_kwargs(
596 decode_cf,
597 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...) 603 decode_coords=decode_coords,
604 )
606 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 607 backend_ds = backend.open_dataset(
608 filename_or_obj,
609 drop_variables=drop_variables,
610 **decoders,
611 **kwargs,
612 )
613 ds = _dataset_from_backend_dataset(
614 backend_ds,
615 filename_or_obj,
(...) 626 **kwargs,
627 )
628 return ds
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/zarr.py:1732, in open_dataset()
1730 filename_or_obj = _normalize_path(filename_or_obj)
1731 if not store:
-> 1732 store = ZarrStore.open_group(
1733 filename_or_obj,
1734 group=group,
1735 mode=mode,
1736 synchronizer=synchronizer,
1737 consolidated=consolidated,
1738 consolidate_on_close=False,
1739 chunk_store=chunk_store,
1740 storage_options=storage_options,
1741 zarr_version=zarr_version,
1742 use_zarr_fill_value_as_mask=None,
1743 zarr_format=zarr_format,
1744 cache_members=cache_members,
1745 )
1747 store_entrypoint = StoreBackendEntrypoint()
1748 with close_on_error(store):
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/zarr.py:771, in open_group()
745 @classmethod
746 def open_group(
747 cls,
(...) 764 cache_members: bool = True,
765 ):
766 (
767 zarr_group,
768 consolidate_on_close,
769 close_store_on_close,
770 use_zarr_fill_value_as_mask,
--> 771 ) = _get_open_params(
772 store=store,
773 mode=mode,
774 synchronizer=synchronizer,
775 group=group,
776 consolidated=consolidated,
777 consolidate_on_close=consolidate_on_close,
778 chunk_store=chunk_store,
779 storage_options=storage_options,
780 zarr_version=zarr_version,
781 use_zarr_fill_value_as_mask=use_zarr_fill_value_as_mask,
782 zarr_format=zarr_format,
783 )
785 return cls(
786 zarr_group,
787 mode,
(...) 796 cache_members=cache_members,
797 )
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/xarray/backends/zarr.py:1974, in _get_open_params()
1970 if _zarr_v3():
1971 # we have determined that we don't want to use consolidated metadata
1972 # so we set that to False to avoid trying to read it
1973 open_kwargs["use_consolidated"] = False
-> 1974 zarr_group = zarr.open_group(store, **open_kwargs)
1976 close_store_on_close = zarr_group.store is not store
1978 # we use this to determine how to handle fill_value
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/api/synchronous.py:533, in open_group()
463 def open_group(
464 store: StoreLike | None = None,
465 *,
(...) 475 use_consolidated: bool | str | None = None,
476 ) -> Group:
477 """Open a group using file-mode-like semantics.
478
479 Parameters
(...) 530 The new group.
531 """
532 return Group(
--> 533 sync(
534 async_api.open_group(
535 store=store,
536 mode=mode,
537 cache_attrs=cache_attrs,
538 synchronizer=synchronizer,
539 path=path,
540 chunk_store=chunk_store,
541 storage_options=storage_options,
542 zarr_format=zarr_format,
543 meta_array=meta_array,
544 attributes=attributes,
545 use_consolidated=use_consolidated,
546 )
547 )
548 )
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/core/sync.py:158, in sync()
155 return_result = next(iter(finished)).result()
157 if isinstance(return_result, BaseException):
--> 158 raise return_result
159 else:
160 return return_result
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/core/sync.py:118, in _runner()
113 """
114 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
115 exception, the exception will be returned.
116 """
117 try:
--> 118 return await coro
119 except Exception as ex:
120 return ex
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/api/asynchronous.py:825, in open_group()
822 if chunk_store is not None:
823 warnings.warn("chunk_store is not yet implemented", ZarrRuntimeWarning, stacklevel=2)
--> 825 store_path = await make_store_path(store, mode=mode, storage_options=storage_options, path=path)
826 if attributes is None:
827 attributes = {}
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_common.py:453, in make_store_path()
448 raise ValueError(
449 "'path' was provided but is not used for FSMap store_like objects. Specify the path when creating the FSMap instance instead."
450 )
452 else:
--> 453 store = await make_store(store_like, mode=mode, storage_options=storage_options)
454 return await StorePath.open(store, path=path_normalized, mode=mode)
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_common.py:385, in make_store()
380 return FsspecStore.from_url(
381 store_like, storage_options=storage_options, read_only=_read_only
382 )
384 elif _has_fsspec and isinstance(store_like, FSMap):
--> 385 return FsspecStore.from_mapper(store_like, read_only=_read_only)
387 else:
388 raise TypeError(f"Unsupported type for store_like: '{type(store_like).__name__}'")
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_fsspec.py:197, in from_mapper()
173 @classmethod
174 def from_mapper(
175 cls,
(...) 178 allowed_exceptions: tuple[type[Exception], ...] = ALLOWED_EXCEPTIONS,
179 ) -> FsspecStore:
180 """
181 Create an FsspecStore from an FSMap object.
182
(...) 195 FsspecStore
196 """
--> 197 fs = _make_async(fs_map.fs)
198 return cls(
199 fs=fs,
200 path=fs_map.root,
201 read_only=read_only,
202 allowed_exceptions=allowed_exceptions,
203 )
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/storage/_fsspec.py:57, in _make_async()
55 fs_dict = json.loads(fs.to_json())
56 fs_dict["asynchronous"] = True
---> 57 return fsspec.AbstractFileSystem.from_json(json.dumps(fs_dict))
59 if fsspec_version < parse_version("2024.12.0"):
60 raise ImportError(
61 f"The filesystem '{fs}' is synchronous, and the required "
62 "AsyncFileSystemWrapper is not available. Upgrade fsspec to version "
63 "2024.12.0 or later to enable this functionality."
64 )
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/spec.py:1494, in from_json()
1473 """
1474 Recreate a filesystem instance from JSON representation.
1475
(...) 1490 at import time.
1491 """
1492 from .json import FilesystemJSONDecoder
-> 1494 return json.loads(blob, cls=FilesystemJSONDecoder)
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/json/__init__.py:365, in loads()
363 if parse_constant is not None:
364 kw['parse_constant'] = parse_constant
--> 365 return cls(**kw).decode(s)
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/json/decoder.py:345, in decode()
340 def decode(self, s, _w=WHITESPACE.match):
341 """Return the Python representation of ``s`` (a ``str`` instance
342 containing a JSON document).
343
344 """
--> 345 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
346 end = _w(s, end).end()
347 if end != len(s):
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/json/decoder.py:361, in raw_decode()
352 """Decode a JSON document from ``s`` (a ``str`` beginning with
353 a JSON document) and return a 2-tuple of the Python
354 representation and the index in ``s`` where the document ended.
(...) 358
359 """
360 try:
--> 361 obj, end = self.scan_once(s, idx)
362 except StopIteration as err:
363 raise JSONDecodeError("Expecting value", s, err.value) from None
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/json.py:92, in custom_object_hook()
90 if "cls" in dct:
91 if (obj_cls := self.try_resolve_fs_cls(dct)) is not None:
---> 92 return AbstractFileSystem.from_dict(dct)
93 if (obj_cls := self.try_resolve_path_cls(dct)) is not None:
94 return obj_cls(dct["str"])
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/spec.py:1570, in from_dict()
1567 dct.pop("cls", None)
1568 dct.pop("protocol", None)
-> 1570 return cls(
1571 *json_decoder.unmake_serializable(dct.pop("args", ())),
1572 **json_decoder.unmake_serializable(dct),
1573 )
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/spec.py:84, in __call__()
82 return cls._cache[token]
83 else:
---> 84 obj = super().__call__(*args, **kwargs, **strip_tokenize_options)
85 # Setting _fs_token here causes some static linters to complain.
86 obj._fs_token_ = token
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/fsspec/implementations/reference.py:781, in __init__()
779 self.fss[k] = AsyncFileSystemWrapper(f, asynchronous=self.asynchronous)
780 elif self.asynchronous ^ f.asynchronous:
--> 781 raise ValueError(
782 "Reference-FS's target filesystem must have same value "
783 "of asynchronous"
784 )
ValueError: Reference-FS's target filesystem must have same value of asynchronousBuild an Xarray-Datatree from the dictionary of datasets¶
dt = DataTree.from_dict(catalog_computed[0])Accessing the Datatree¶
A Datatree is a collection of related Xarray datasets. We can access individual datasets using UNIX syntax. In the cell below, we will access a single dataset from the datatree.
dt["ACCESS-CM2/ssp585"]
# or
dt["ACCESS-CM2"]["ssp585"]Convert a Datatree node to a Dataset¶
dt["ACCESS-CM2"]["ssp585"].to_dataset()Operations across a Datatree¶
A Datatree contains a collection of datasets with related coordinates and variables. Using some in-built methods, we can analyze it as if it were a single dataset. Instead of looping through hundreds of Xarray datasets, we can apply operations across the Datatree. In the example below, we will lazily create a time-series.
ts = dt.mean(dim=["lat", "lon"])Visualize a single dataset with HvPlot¶
display( # noqa
dt["ACCESS-CM2/ssp585"].to_dataset().pr.hvplot("lon", "lat", rasterize=True)
)Shut down the Dask cluster¶
client.shutdown()