Overview¶
In this tutorial we will use a large collection of pre-generated Kerchunk reference files and open them with Xarray's DataTree functionality. This chapter is heavily inspired by this blog post.
About the Dataset¶
This collection of reference files was generated from the NASA NEX-GDDP-CMIP6 (Global Daily Downscaled Projections) dataset. A version of this dataset is hosted on S3 as a collection of NetCDF files.
Prerequisites¶
Concepts | Importance | Notes |
---|---|---|
Kerchunk Basics | Required | Core |
Multiple Files and Kerchunk | Required | Core |
Kerchunk and Dask | Required | Core |
Multi-File Datasets with Kerchunk | Required | IO/Visualization |
Xarray-Datatree Overview | Required | IO |
Time to learn: 30 minutes
Motivation¶
In total the dataset is roughly 12 TB in compressed blob storage, with a single NetCDF file per yearly timestep, per variable. Downloading this entire dataset for analysis on a local machine would be difficult, to say the least. The collection of Kerchunk reference files for this entire dataset is only 272 MB, which is about 42,000 times smaller!
Imports¶
import dask
import hvplot.xarray # noqa
import pandas as pd
import xarray as xr
from xarray import DataTree
from distributed import Client
from fsspec.implementations.reference import ReferenceFileSystem
Read the reference catalog¶
The NASA NEX-GDDP-CMIP6 dataset is organized by GCM, Scenario and Ensemble Member. Each Scenario/GCM combination is represented by a single combined reference file, created by merging across variables and concatenating along time steps. All of these references are listed in a simple .csv catalog with the following schema:

GCM/Scenario | url
---|---
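For example, a single row might look like this (illustrative; the real URLs point into the carbonplan-share bucket used below):

ACCESS-CM2/ssp585 | s3://carbonplan-share/nasa-nex-reference/…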
Organizing with Xarray-Datatree¶
Not all of the GCM/Scenario reference datasets share spatial coordinates, and many of them differ slightly in their calendars and therefore in their time dimension. Because of this, they cannot be combined into a single Xarray-Dataset. Fortunately, Xarray-Datatree provides a higher-level abstraction in which related Xarray-Datasets are organized into a tree structure, with each dataset corresponding to a leaf.
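As a minimal illustration with toy data (not the NASA NEX files), two datasets whose time dimensions disagree cannot be merged into one Dataset, but they can live on separate leaves of one tree:

import numpy as np
import xarray as xr
from xarray import DataTree

# Two toy datasets with incompatible time lengths (e.g. 365- vs 360-day calendars)
ds_a = xr.Dataset({"tas": ("time", np.zeros(365))})
ds_b = xr.Dataset({"tas": ("time", np.zeros(360))})

# xr.merge([ds_a, ds_b]) would raise a conflicting-sizes error for "time",
# but a tree simply holds the two datasets side by side as separate leaves
toy_tree = DataTree.from_dict({"GCM-A/ssp585": ds_a, "GCM-B/ssp585": ds_b})
print(toy_tree)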
# Read the reference catalog into a Pandas DataFrame
cat_df = pd.read_csv(
"s3://carbonplan-share/nasa-nex-reference/reference_catalog_nested.csv"
)
# Convert the DataFrame into a dictionary mapping each "GCM/Scenario" ID to its reference URL
catalog = cat_df.set_index("ID").T.to_dict("records")[0]
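The chained set_index/transpose/to_dict call is terse; on a toy two-row catalog (hypothetical IDs and URLs), it behaves like this:

# Toy illustration of the transform above: "ID"/"url" rows become one {ID: url} dict
toy_df = pd.DataFrame(
    {
        "ID": ["GCM-A/ssp585", "GCM-B/ssp245"],
        "url": ["s3://bucket/a.json", "s3://bucket/b.json"],
    }
)
toy_catalog = toy_df.set_index("ID").T.to_dict("records")[0]
# {'GCM-A/ssp585': 's3://bucket/a.json', 'GCM-B/ssp245': 's3://bucket/b.json'}

An equivalent, arguably clearer spelling is cat_df.set_index("ID")["url"].to_dict().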
Load Reference Datasets into Xarray-DataTree¶
In the following cell we create a function load_ref_ds, which can be parallelized via Dask to load the Kerchunk references into a dictionary of Xarray-Datasets.
def load_ref_ds(url: str):
    # Build a virtual filesystem from the Kerchunk reference file; both the
    # reference JSON (target) and the chunks it points to (remote) are on public S3
    fs = ReferenceFileSystem(
        url,
        remote_protocol="s3",
        target_protocol="s3",
        remote_options={"anon": True},
        target_options={"anon": True},
        lazy=True,
    )
    # Open the references as a Zarr store; metadata is read now, data stays lazy
    return xr.open_dataset(
        fs.get_mapper(),
        engine="zarr",
        backend_kwargs={"consolidated": False},
        chunks={"time": 300},
    )

# Build one delayed task per reference dataset; nothing is loaded yet
tasks = {id: dask.delayed(load_ref_ds)(url) for id, url in catalog.items()}
Use Dask Distributed to load the Xarray-Datasets from Kerchunk reference files¶
Using Dask, we load the 164 reference datasets into memory. Since they are Xarray datasets, the coordinates are loaded eagerly, but the underlying data remains lazy.
client = Client(n_workers=8)
client
catalog_computed = dask.compute(tasks)
2025-10-09 00:41:01,281 - distributed.worker - ERROR - Compute Failed
Task: load_ref_ds(...)
ValueError: Reference-FS's target filesystem must have same value of asynchronous

(Traceback abridged.) The error is raised inside xr.open_dataset: zarr-python v3 converts the FSMap returned by fs.get_mapper() into an FsspecStore and, in zarr.storage._fsspec._make_async, rebuilds the filesystem from JSON with asynchronous=True. That round-trip sets the flag on the ReferenceFileSystem itself but not on the s3fs filesystems it wraps, so ReferenceFileSystem.__init__ detects the mismatch and raises.
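One possible workaround, sketched from the error path above but not verified against these reference files, is to create the filesystem with consistent asynchronous flags up front, so that zarr v3 can use it directly instead of rebuilding it:

def load_ref_ds_async(url: str):
    # Hypothetical variant of load_ref_ds: mark both the reference filesystem and
    # its remote (chunk-serving) s3fs instance as asynchronous, so the flags agree
    # and zarr v3 does not need to recreate the filesystem via a JSON round-trip
    fs = ReferenceFileSystem(
        url,
        remote_protocol="s3",
        target_protocol="s3",
        remote_options={"anon": True, "asynchronous": True},
        target_options={"anon": True},
        asynchronous=True,
        lazy=True,
    )
    return xr.open_dataset(
        fs.get_mapper(),
        engine="zarr",
        backend_kwargs={"consolidated": False},
        chunks={"time": 300},
    )

Pinning zarr below v3 in the environment should also avoid this code path entirely, since zarr v2 consumed the mapper synchronously.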
Build an Xarray-Datatree from the dictionary of datasets¶
# dask.compute returns a tuple of results; its first element is our dict of datasets
dt = DataTree.from_dict(catalog_computed[0])
Accessing the Datatree¶
A Datatree is a collection of related Xarray datasets. We can access individual datasets using UNIX-like path syntax. In the cell below, we will access a single dataset from the datatree.
dt["ACCESS-CM2/ssp585"]
# or
dt["ACCESS-CM2"]["ssp585"]
Convert a Datatree node to a Dataset¶
dt["ACCESS-CM2"]["ssp585"].to_dataset()
Operations across a Datatree¶
A Datatree contains a collection of datasets with related coordinates and variables. Using built-in methods, we can analyze it as if it were a single dataset: instead of looping through hundreds of Xarray datasets, we can apply operations across the whole Datatree. In the example below, we lazily create a time series of the spatial mean.
ts = dt.mean(dim=["lat", "lon"])
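The reduction returns another Datatree of lazy, area-averaged datasets. As a quick sketch (reusing the node and pr variable from the next cell), a single node's series can be pulled out and plotted with hvPlot; the data is only computed when the plot renders:

# Area-averaged precipitation time series for one GCM/scenario
ts["ACCESS-CM2/ssp585"].to_dataset().pr.hvplot(x="time")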
Visualize a single dataset with HvPlot¶
display( # noqa
dt["ACCESS-CM2/ssp585"].to_dataset().pr.hvplot("lon", "lat", rasterize=True)
)
Shut down the Dask cluster¶
client.shutdown()