
Generating virtual datasets from NetCDF files


Overview

Within this notebook, we will cover:

  1. How to access remote NetCDF data using VirtualiZarr and Kerchunk
  2. Combining multiple virtual datasets

This notebook shares many similarities with the multi-file virtual datasets with VirtualiZarr notebook. If you are unsure what a block of code is doing, refer there for a more detailed breakdown of each line.

Prerequisites

Concepts | Importance | Notes
Basics of virtual Zarr stores | Required | Core
Multi-file virtual datasets with VirtualiZarr | Required | Core
Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask | Required | Core
Introduction to Xarray | Required | IO/Visualization
  • Time to learn: 45 minutes

Motivation

NetCDF4/HDF5 is one of the most widely adopted file formats in the earth sciences, supported by much of the community as well as scientific agencies, data centers, and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using VirtualiZarr and Kerchunk, we can read these datasets as if they were in an Analysis-Ready Cloud-Optimized (ARCO) format such as Zarr.

About the Dataset

For this example, we will look at a weather dataset composed of multiple NetCDF files. SMN-Arg is a WRF deterministic weather forecasting dataset created by the Servicio Meteorológico Nacional de Argentina that covers Argentina as well as many neighboring countries at a 4 km spatial resolution.

The model is initialized twice daily at 00 and 12 UTC and produces hourly forecasts of variables such as temperature, relative humidity, precipitation, and wind direction and magnitude at multiple atmospheric levels. The output is written to NetCDF files at hourly intervals, with a maximum forecast lead time of 72 hours.
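
While the naming convention is not documented in this notebook, the sample file used below suggests that each file name encodes the product, the output interval, the initialization date and hour, and the forecast lead time in hours. Below is a minimal, hypothetical parse of one such name, inferred from that single example:

# Hypothetical parse of an SMN-Arg file name; inferred from one example
# URL, so the real convention may differ.
fname = "WRFDETAR_01H_20221231_00_072.nc"
product, interval, init_date, init_hour, lead_hours = fname.removesuffix(".nc").split("_")
print(product, interval, init_date, init_hour, lead_hours)  # WRFDETAR 01H 20221231 00 072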

More details on this dataset can be found here.

Flags

In the section below, set subset_flag to True (default) or False depending on whether you want this notebook to process the full file list. If set to True, only a subset of the file list will be processed (recommended).

subset_flag = True

Imports

import logging
import warnings

import dask
import fsspec
import s3fs
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset

Examining a Single NetCDF File

Before we use VirtualiZarr to create virtual datasets for multiple files, we can load a single NetCDF file to examine it.

# URL pointing to a single NetCDF file
url = "s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/00/WRFDETAR_01H_20221231_00_072.nc"

# Initialize an s3 filesystem
fs = s3fs.S3FileSystem(anon=True)
# Use Xarray to open a remote NetCDF file
ds = xr.open_dataset(fs.open(url), engine="h5netcdf")

Here we see the repr of the Xarray Dataset for a single NetCDF file. From examining the output, we can tell that the Dataset dimensions are ['time', 'y', 'x'], with time holding only a single step. Later, when we use Xarray's combine_nested functionality, we will need to know which dimension to concatenate across.
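
As a quick check, we can confirm the dimension sizes and the single time value programmatically (a minimal sketch using standard Xarray attributes):

# Confirm the dimension sizes and the single time step
print(ds.sizes)
print(ds["time"].values)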

Create Input File List

Here we are using fsspec's glob functionality along with the * wildcard operator and some string slicing to grab a list of NetCDF files from an s3 fsspec filesystem.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")

# Prepend the 's3://' prefix so each path is a valid s3 URL pointing to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])


# If the subset_flag == True (default), the list of input files will be subset
# to speed up the processing
if subset_flag:
    files_paths = files_paths[0:8]
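
Before moving on, it can help to sanity-check the resulting file list, for example by printing the count and the first and last entries (a minimal check; with subset_flag set, we expect eight files):

# Sanity-check the file list
print(f"{len(files_paths)} files")
print(files_paths[0])
print(files_paths[-1])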

Start a Dask Client

To parallelize the creation of our virtual datasets, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

With the Dask client running, we next define a helper function that opens a single remote NetCDF file as a virtual dataset.

# Open a single NetCDF file on s3 as a virtual dataset of chunk references
def generate_virtual_dataset(file, storage_options):
    return open_virtual_dataset(
        file, indexes={}, reader_options={"storage_options": storage_options}
    )


storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="none")
# Generate Dask Delayed objects
tasks = [
    dask.delayed(generate_virtual_dataset)(file, storage_options)
    for file in files_paths
]
# Start parallel processing, ignoring noisy warnings emitted during the reads
warnings.filterwarnings("ignore")
virtual_datasets = list(dask.compute(*tasks))
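
Before combining, it is worth inspecting one of the resulting virtual datasets. Its variables should be backed by chunk manifests (VirtualiZarr ManifestArray objects) that reference byte ranges on s3 rather than holding data in memory. A minimal check:

# Inspect one virtual dataset; variables reference chunks on s3
# instead of holding loaded values.
print(virtual_datasets[0])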

Combine virtual datasets and write a Kerchunk reference JSON for the virtual Zarr store

In the following cell, we combine all of the virtual datasets generated above into a single virtual dataset and write it to disk as a Kerchunk reference file.

combined_vds = xr.combine_nested(
    virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)
combined_vds
combined_vds.virtualize.to_kerchunk("ARG_combined.json", format="json")
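
To confirm the reference file works, we can open it back up with Xarray through fsspec's reference filesystem. This is a sketch of the usual Kerchunk read pattern; the exact options may vary across fsspec and Xarray versions:

# Open the combined Kerchunk reference as a virtual Zarr store
fs_ref = fsspec.filesystem(
    "reference",
    fo="ARG_combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds_combined = xr.open_dataset(
    fs_ref.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)
ds_combined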

Shut down the Dask cluster

client.shutdown()