Multi-file virtual datasets with VirtualiZarr¶
Overview¶
This notebook builds off of the Basics of virtual Zarr stores tutorial.
In this tutorial we will:
- Create a virtual dataset for each file in a year-long collection of daily NetCDF files stored on S3
- Combine those virtual datasets along the time dimension
- Write the combined virtual dataset to a Kerchunk reference file and read it back with Xarray
Prerequisites¶
| Concepts | Importance | Notes |
| --- | --- | --- |
| Basics of virtual Zarr stores | Required | Basic features |
| Introduction to Xarray | Recommended | IO |
- Time to learn: 60 minutes
Flags¶
In the section below, set the `subset_flag` to `True` (default) or `False` depending on whether you want this notebook to process the full file list. If set to `True`, only a subset of the file list will be processed (recommended).
subset_flag = True
Imports¶
In our imports block we are using similar imports to the Basics of virtual Zarr stores tutorial:
- `fsspec` for reading and writing to remote file systems
- `virtualizarr` to generate the virtual Zarr store
- `Xarray` for examining the output dataset
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset
Create a File Pattern from a list of input NetCDF files¶
Below we will create a list of input files we want to virtualize. In the Basics of virtual Zarr stores tutorial, we looked at a single file of climate downscaled data over Southern Alaska. In this example, we will build off of that work and use `Kerchunk` and `VirtualiZarr` to combine multiple NetCDF files of this dataset into a virtual dataset that can be read as if it were a `Zarr` store - without copying any data.

We use the `fsspec` `s3` filesystem's glob method to create a list of files matching a file pattern. We supply the base url of `s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/`, which points to an `AWS` public bucket of daily RCP 8.5 CCSM downscaled data for the year 2060. After this base url, we tack on `*`, which acts as a wildcard for all the files in the directory. We should expect 365 daily `NetCDF` files.

Finally, we prepend the string `s3://` to each of the returned files. This ensures the list of files we get back are `s3` urls that can be read by `VirtualiZarr` and `Kerchunk`.
# Initiate fsspec filesystems for reading and writing
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)
# Retrieve list of available days in archive for the year 2060.
files_paths = fs_read.glob("s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/*")
# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])
As a quick check, it looks like we have a list of 365 file paths, which should be a year of downscaled climate data.¶
print(f"{len(files_paths)} file paths were retrieved.")
365 file paths were retrieved.
# If the subset_flag == True (default), the list of input files will
# be subset to speed up the processing
if subset_flag:
    files_paths = files_paths[0:4]
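As a quick optional check (an addition to the original notebook), we can confirm how many files will actually be processed after subsetting:

# Optional sanity check: how many files will be processed after subsetting
print(f"Processing {len(files_paths)} file paths.")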
Optional: If you want to examine one of the NetCDF files before creating the virtual datasets, try uncommenting the code snippet below.¶
## Note: Optional piece of code to view one of the NetCDF files
# fs = fsspec.filesystem("s3", anon=True)
# ds = xr.open_dataset(fs.open(files_paths[0]))
Create virtual datasets for every file in the files_paths list¶
Now that we have a list of NetCDF files, we can use `VirtualiZarr` to create virtual datasets for each one of them.
Define kwargs for fsspec¶
In the cell below, we are creating a dictionary of `kwargs` to pass to `fsspec` and the `s3` filesystem. Details on this can be found in the Basics of virtual Zarr stores tutorial in the Define storage_options arguments section.
storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="none")
In the cell below, we loop through our list of `NetCDF` file urls and pass each one to `open_virtual_dataset`, supplying the `storage_options` dictionary through the `reader_options` keyword argument. The result is a list of virtual datasets, one for each input file.
virtual_datasets = [
open_virtual_dataset(
filepath, indexes={}, reader_options={"storage_options": storage_options}
)
for filepath in files_paths
]
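If you would like to see what one of these objects looks like, you can display the first entry of the list. This inspection step is an addition to the original notebook; the data variables should appear as `ManifestArray` objects that reference byte ranges in the remote NetCDF files rather than holding any data in memory.

# Examine the first virtual dataset: its variables reference chunks in the
# original NetCDF file instead of loading the data itself
virtual_datasets[0]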
Combine virtual datasets¶
After we have generated a virtual dataset for each `NetCDF` file, we can combine these into a single virtual dataset using Xarray's `combine_nested` function.
combined_vds = xr.combine_nested(
virtual_datasets, concat_dim=["Time"], coords="minimal", compat="override"
)
combined_vds
Write combined virtual dataset to a Kerchunk JSON for future use¶
If we want to keep the combined reference information in memory as well as write it to a `.json` reference file, we can run the code snippet below.
# Write kerchunk .json record
output_fname = "combined_kerchunk.json"
combined_vds.virtualize.to_kerchunk(output_fname, format="json")
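For larger collections of files, a JSON reference file can become quite large. As an optional sketch (not part of the original notebook, and assuming your VirtualiZarr version supports the `parquet` format), the same accessor can write the references in the more compact Parquet format instead:

## Note: Optional sketch - write the combined references as Parquet instead of JSON
# combined_vds.virtualize.to_kerchunk("combined_kerchunk.parquet", format="parquet")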
Using the output¶
Now that we have built a virtual dataset using `VirtualiZarr` and `Kerchunk`, we can read all of those original `NetCDF` files as if they were a single `Zarr` dataset.

Since we saved the combined virtual dataset, this work doesn't have to be repeated for anyone else to use this dataset. All they need to do is pass the reference file storing the virtual dataset to `Xarray` and it is as if they had a `Zarr` dataset!
Open combined virtual dataset with Kerchunk¶
# We once again need to provide information for fsspec to access the remote file
storage_options = dict(
remote_protocol="s3", remote_options=dict(anon=True), skip_instance_cache=True
)
# We will use the "kerchunk" engine in `xr.open_dataset` and pass the `storage_options` to the `kerchunk` engine through `backend_kwargs`
ds = xr.open_dataset(
output_fname,
engine="kerchunk",
backend_kwargs={"storage_options": storage_options},
)
ds
Plot a slice of the dataset¶
Here we are using `Xarray` to select a single time slice of the dataset and plot a map of snow cover over South East Alaska.
ds.isel(Time=0).SNOW.plot()
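Because the combined virtual dataset spans every file we referenced, we can also compute across the time dimension without manually opening each NetCDF file. As a small illustrative sketch (not in the original notebook), this plots the mean snow cover over the period we virtualized; the needed chunks are read lazily from the original files:

# Sketch: average the SNOW variable over the Time dimension and plot the result
ds.SNOW.mean(dim="Time").plot()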