Generating virtual datasets from NetCDF files
Overview¶
Within this notebook, we will cover:
- How to access remote NetCDF data using VirtualiZarr and Kerchunk
- How to combine multiple virtual datasets
This notebook shares many similarities with the multi-file virtual datasets with VirtualiZarr notebook. If you are unsure what a block of code is doing, refer there for a more detailed breakdown of each line.
Prerequisites¶
| Concepts | Importance | Notes |
| --- | --- | --- |
| Basics of virtual Zarr stores | Required | Core |
| Multi-file virtual datasets with VirtualiZarr | Required | Core |
| Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask | Required | Core |
| Introduction to Xarray | Required | IO/Visualization |
- Time to learn: 45 minutes
Motivation¶
NetCDF4/HDF5 is one of the most widely adopted file formats in the earth sciences, supported by much of the community as well as scientific agencies, data centers, and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using VirtualiZarr and Kerchunk, we can read these datasets as if they were in an Analysis-Ready, Cloud-Optimized (ARCO) format such as Zarr.
About the Dataset¶
For this example, we will look at a weather dataset composed of multiple NetCDF files. SMN-Arg is a deterministic WRF weather forecasting dataset created by the Servicio Meteorológico Nacional de Argentina that covers Argentina as well as many neighboring countries at a 4 km spatial resolution.
The model is initialized twice daily at 00 and 12 UTC, producing hourly forecasts of variables such as temperature, relative humidity, precipitation, and wind direction and magnitude at multiple atmospheric levels. The output is written to NetCDF files at hourly intervals, out to a maximum lead time of 72 hours.
More details on this dataset can be found here.
Flags¶
In the cell below, set the subset_flag variable to True (default) or False depending on whether you want this notebook to process the full file list. If set to True, only a subset of the file list will be processed (recommended).
subset_flag = True
Imports¶
import logging
import dask
import fsspec
import s3fs
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset
Examining a Single NetCDF File¶
Before we use VirtualiZarr to create virtual datasets for multiple files, we can load a single NetCDF file to examine it.
# URL pointing to a single NetCDF file
url = "s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/00/WRFDETAR_01H_20221231_00_072.nc"
# Initialize a s3 filesystem
fs = s3fs.S3FileSystem(anon=True)
# Use Xarray to open a remote NetCDF file
ds = xr.open_dataset(fs.open(url), engine="h5netcdf")
Here we see the repr of the Xarray Dataset for a single NetCDF file. From examining the output, we can tell that the Dataset dimensions are ['time', 'y', 'x'], with time being only a single step.
Later, when we use Xarray's combine_nested functionality, we will need to know which dimension to concatenate along.
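As a quick way to confirm these dimension sizes programmatically, here is a minimal sketch that uses only the ds object opened above:
# Confirm the dimension names and sizes of the single-file Dataset
print(ds.sizes)  # expect time to have length 1, alongside the y and x grid dimensions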
Create Input File List¶
Here we use fsspec's glob functionality, along with the * wildcard operator and some string manipulation, to grab a list of NetCDF files from an s3 fsspec filesystem.
# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)
files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")
# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])
# If the subset_flag == True (default), the list of input files will be subset
# to speed up the processing
if subset_flag:
files_paths = files_paths[0:8]
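As a quick check on the globbing and subsetting above, we can inspect how many paths were returned and what a single entry looks like:
# Inspect the resulting file list
print(len(files_paths))
print(files_paths[0])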
Start a Dask Client¶
To parallelize the creation of our virtual datasets, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.
client = Client(n_workers=8, silence_logs=logging.ERROR)
client
def generate_virtual_dataset(file, storage_options):
    # Create a virtual dataset for a single remote NetCDF file
    return open_virtual_dataset(
        file, indexes={}, reader_options={"storage_options": storage_options}
    )
storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="none")
# Generate Dask Delayed objects
tasks = [
dask.delayed(generate_virtual_dataset)(file, storage_options)
for file in files_paths
]
# Suppress warnings emitted during processing
import warnings

warnings.filterwarnings("ignore")

# Start parallel processing of the Dask Delayed objects
virtual_datasets = list(dask.compute(*tasks))
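Before combining, it can be helpful to look at one of the resulting virtual datasets. As a rough sketch of what to expect: each entry is an Xarray Dataset whose variables are backed by VirtualiZarr ManifestArray objects, i.e. references to byte ranges in the original NetCDF files rather than data loaded into memory.
# Inspect the first virtual dataset; its variables reference byte ranges
# in the original NetCDF file instead of holding data in memory
virtual_datasets[0]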
Combine virtual datasets and write a Kerchunk reference JSON to store the virtual Zarr store¶
In the following cells, we combine all the virtual datasets generated above into a single virtual dataset and then write it to disk as a Kerchunk reference file.
combined_vds = xr.combine_nested(
virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)
combined_vds
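Assuming the combine succeeded, the time dimension of the combined virtual dataset should now equal the number of input files, since each file contributes a single time step. A minimal check:
# The combined time dimension should match the number of input files
print(combined_vds.sizes["time"], len(files_paths))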
combined_vds.virtualize.to_kerchunk("ARG_combined.json", format="json")
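To sanity-check the reference file we just wrote, one option is to open the virtual Zarr store back up through fsspec's reference filesystem and Xarray's Zarr engine. This is a minimal sketch, assuming ARG_combined.json was written to the current working directory:
# Open the virtual Zarr store described by the Kerchunk reference JSON
fs_ref = fsspec.filesystem(
    "reference",
    fo="ARG_combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds_combined = xr.open_dataset(
    fs_ref.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
ds_combined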
Shut down the Dask cluster¶
client.shutdown()