Multi-file virtual datasets with VirtualiZarr

Overview

This notebook builds on the Basics of virtual Zarr stores notebook.

In this tutorial we will:

  • Create a list of input paths for a collection of NetCDF files stored on the cloud.
  • Create a virtual dataset for each input file
  • Combine the virtual datasets using combine_nested
  • Read the combined dataset using Xarray and fsspec.

Prerequisites

Concepts | Importance | Notes
Basics of virtual Zarr stores | Required | Basic features
Introduction to Xarray | Recommended | IO
  • Time to learn: 60 minutes

Flags

In the section below, set the subset_flag variable to True (default) or False depending on whether you want this notebook to process the full file list. If set to True, only a subset of the file list will be processed (recommended).

subset_flag = True

Imports

Our imports are similar to those used in the Basics of virtual Zarr stores tutorial:

  • fsspec for reading and writing to remote file systems
  • virtualizarr will be used to generate the virtual Zarr store
  • Xarray for examining the output dataset
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

Create a File Pattern from a list of input NetCDF files

Below we will create a list of input files we want to virtualize. In the Basics of virtual Zarr stores tutorial, we looked at a single file of downscaled climate data over Southern Alaska. In this example, we will build off of that work and use Kerchunk and VirtualiZarr to combine multiple NetCDF files of this dataset into a virtual dataset that can be read as if it were a Zarr store - without copying any data.

We use the fsspec s3 filesystem’s glob method to create a list of files matching a file pattern. We supply the base URL s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/, which points to a public AWS bucket of daily RCP8.5 CCSM downscaled data for the year 2060. After this base URL, we tack on *, which acts as a wildcard matching all the files in the directory. We should expect 365 daily NetCDF files.

Finally, we prepend the string s3:// to each of the returned file paths. This ensures the paths are valid S3 URLs that can be read by VirtualiZarr and Kerchunk.

# Initiate fsspec filesystems for reading and writing
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

# Retrieve list of available days in archive for the year 2060.
files_paths = fs_read.glob("s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/*")

# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])

As a quick check, it looks like we have a list of 365 file paths, which should be a year of downscaled climate data.

print(f"{len(files_paths)} file paths were retrieved.")
365 file paths were retrieved.
# If the subset_flag == True (default), the list of input files will
# be subset to speed up the processing
if subset_flag:
    files_paths = files_paths[0:4]

Optional: If you want to examine one of the NetCDF files before creating the virtual datasets, try uncommenting the code snippet below.

## Note: Optional piece of code to view one of the NetCDF files

# fs = fsspec.filesystem("s3", anon=True)
# ds = xr.open_dataset(fs.open(files_paths[0]))

Create virtual datasets for every file in the files_paths list

Now that we have a list of NetCDF files, we can use VirtualiZarr to create virtual datasets for each one of these.

Define kwargs for fsspec

In the cell below, we create a dictionary of kwargs to pass to fsspec and the s3 filesystem. Details can be found in the Basics of virtual Zarr stores tutorial, in the Define storage_options arguments section.

storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="none")

In the cell below, we loop over our list of NetCDF file URLs with a list comprehension, passing each one to VirtualiZarr's open_virtual_dataset along with the storage_options defined above. Each call:

  • Opens the remote NetCDF file and scans its chunk layout, without reading the data itself.
  • Returns a virtual dataset: an Xarray Dataset whose variables are backed by references to byte ranges in the original file.

We collect the results in a list named virtual_datasets. Passing indexes={} tells VirtualiZarr not to build in-memory indexes for the coordinate variables, so the datasets stay fully virtual.

virtual_datasets = [
    open_virtual_dataset(
        filepath, indexes={}, reader_options={"storage_options": storage_options}
    )
    for filepath in files_paths
]
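
As an optional check (not part of the original workflow), you can display one of the virtual datasets; its variables are backed by ManifestArray objects that record chunk locations and byte ranges in the source NetCDF files rather than in-memory data.

## Note: Optional piece of code to view one of the virtual datasets

# virtual_datasets[0]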

Combine virtual datasets

After we have generated a virtual dataset for each NetCDF file, we can combine these into a single virtual dataset using Xarray’s combine_nested function.

combined_vds = xr.combine_nested(
    virtual_datasets, concat_dim=["Time"], coords="minimal", compat="override"
)
combined_vds
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[9], line 1
----> 1 combined_vds = xr.combine_nested(
      2     virtual_datasets, concat_dim=["Time"], coords="minimal", compat="override"
      3 )
      4 combined_vds

[... traceback through Xarray's combine/concat internals elided for brevity ...]

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/virtualizarr/manifests/array.py:226, in ManifestArray.__getitem__(self, key)
    224     return self
    225 else:
--> 226     raise NotImplementedError(f"Doesn't support slicing with {indexer}")

NotImplementedError: Doesn't support slicing with (None, slice(None, None, None))
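
The error above comes from Xarray trying to add the Time dimension to variables that do not already have it (so they can be concatenated along it), which requires a style of slicing that VirtualiZarr's ManifestArray does not implement. One possible workaround, sketched below as an assumption rather than a verified fix, is to pass data_vars="minimal" so that only variables already containing the Time dimension are concatenated and the remaining variables are taken from the first dataset; whether this resolves the error depends on your Xarray and VirtualiZarr versions.

# Possible workaround (unverified assumption): concatenate only variables
# that already contain the "Time" dimension; other variables are taken
# from the first virtual dataset instead of being broadcast along "Time".
combined_vds = xr.combine_nested(
    virtual_datasets,
    concat_dim=["Time"],
    data_vars="minimal",
    coords="minimal",
    compat="override",
)
combined_vds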

Write combined virtual dataset to a Kerchunk JSON for future use

If we want to keep the combined virtual dataset in memory and also write it out to a .json reference file for later use, we can run the code snippet below.

# Write kerchunk .json record
output_fname = "combined_kerchunk.json"
combined_vds.virtualize.to_kerchunk(output_fname, format="json")
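
For larger collections of files, the JSON reference can become large; VirtualiZarr can also write Kerchunk's parquet reference format, which is usually more compact. The snippet below is an optional sketch and assumes your installed VirtualiZarr version supports the parquet format (the output filename is illustrative).

# Optional: write the references in Kerchunk's parquet format instead of JSON
# combined_vds.virtualize.to_kerchunk("combined_kerchunk.parq", format="parquet")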

Using the output

Now that we have built a virtual dataset using VirtualiZarr and Kerchunk, we can read all of those original NetCDF files as if they were a single Zarr dataset.

Since we saved the combined virtual dataset, this work doesn’t have to be repeated for anyone else to use this dataset. All they need to do is pass the reference file storing the virtual dataset to Xarray, and it is as if they had a Zarr dataset!

Open combined virtual dataset with Kerchunk

# We once again need to provide information for fsspec to access the remote file
storage_options = dict(
    remote_protocol="s3", remote_options=dict(anon=True), skip_instance_cache=True
)
# We will use the "kerchunk" engine in `xr.open_dataset` and pass the `storage_options` to the `kerchunk` engine through `backend_kwargs`
ds = xr.open_dataset(
    output_fname,
    engine="kerchunk",
    backend_kwargs={"storage_options": storage_options},
)
ds

Plot a slice of the dataset

Here we use Xarray to select a single time slice of the dataset and plot a map of snow cover over Southeast Alaska.

ds.isel(Time=0).SNOW.plot()