Multi-file virtual datasets with VirtualiZarr

Overview

This notebook is intends to build off of the Basics of virtual Zarr stores.

In this tutorial we will:

Create a list of input paths for a collection of NetCDF files stored on the cloud.
Create virtual datasets for each input datasets
Combine the virtual datasets using combine_nested
Read the combined dataset using Xarray and fsspec.

Prerequisites

Concepts	Importance	Notes
Basics of virtual Zarr stores	Required	Basic features
Introduction to Xarray	Recommended	IO

Time to learn: 60 minutes

Flags

In the section below, set the subset flag to be True (default) or False depending if you want this notebook to process the full file list. If set to True, then a subset of the file list will be processed (Recommended)

subset_flag = True

Imports

In our imports block we are using similar imports to the Basics of virtual Zarr stores tutorial:

fsspec for reading and writing to remote file systems
virtualizarr will be used to generate the virtual Zarr store
Xarray for examining the output dataset

import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

Create a File Pattern from a list of input NetCDF files

Below we will create a list of input files we want to virtualize. In the Basics of virtual Zarr stores tutorial, we looked at a single file of climate downscaled data over Southern Alaska. In this example, we will build off of that work and use Kerchunk and VirtualiZarr to combine multiple NetCDF files of this dataset into a virtual dataset that can be read as if it were a Zarr store - without copying any data.

We use the fsspec s3 filesystem’s glob method to create a list of files matching a file pattern. We supply the base url of s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/, which is pointing to an AWS public bucket, for daily rcp85 ccsm downscaled data for the year 2060. After this base url, we tacked on *, which acts as a wildcard for all the files in the directory. We should expect 365 daily NetCDF files.

Finally, we are appending the string s3:// to the list of return files. This will ensure the list of files we get back are s3 urls and can be read by VirtualiZarr and Kerchunk.

# Initiate fsspec filesystems for reading and writing
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

# Retrieve list of available days in archive for the year 2060.
files_paths = fs_read.glob("s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/*")

# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])

As a quick check, it looks like we have a list 365 file paths, which should be a year of downscaled climte data.

print(f"{len(files_paths)} file paths were retrieved.")

365 file paths were retrieved.

# If the subset_flag == True (default), the list of input files will
# be subset to speed up the processing
if subset_flag:
    files_paths = files_paths[0:4]

Optional: If you want to examine one NetCDF files before creating the `Kerchunk` index, try uncommenting this code snippet below.

## Note: Optional piece of code to view one of the NetCDFs

# import s3fs

# fs = fsspec.filesystem("s3",anon=True)
# ds = xr.open_dataset(fs.open(file_pattern[0]))

Create virtual datasets for every file in the `File_Pattern` list

Now that we have a list of NetCDF files, we can use VirtualiZarr to create virtual datasets for each one of these.

Define kwargs for `fsspec`

In the cell below, we are creating a dictionary of kwargs to pass to fsspec and the s3 filesystem. Details on this can be found in the Basics of virtual Zarr stores tutorial in the Define storage_options arguments section

storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="none")

In the cell below, we are reusing some of the functionality from the previous tutorial. First we are defining a function named: generate_json_reference. This function:

Uses an fsspec s3 filesystem to read in a NetCDF from a given url.
Generates a Kerchunk index using the SingleHdf5ToZarr Kerchunk method.
Creates a simplified filename using some string slicing.
Uses the local filesystem created with fsspec to write the Kerchunk index to a .json reference file.

Below the generate_json_reference function we created, we have a simple for loop that iterates through our list of NetCDF file urls and passes them to our generate_json_reference function, which appends the name of each .json reference file to a list named output_files.

virtual_datasets = [
    open_virtual_dataset(
        filepath, indexes={}, reader_options={"storage_options": storage_options}
    )
    for filepath in files_paths
]

Write combined virtual dataset to a Kerchunk JSON for future use

If we want to keep the combined reference information in memory as well as write the file to .json, we can run the code snippet below.

# Write kerchunk .json record
output_fname = "combined_kerchunk.json"
combined_vds.virtualize.to_kerchunk(output_fname, format="json")

Plot a slice of the dataset

Here we are using Xarray to select a single time slice of the dataset and plot a map of snow cover over South East Alaska.

ds.isel(Time=0).SNOW.plot()

<matplotlib.collections.QuadMesh at 0x7fd0f308d240>

../../_images/94ce97bc56c5909803a4adc44fd0d3a42e902d56236b5c66e46c87d0deafc7ad.png

Multi-file virtual datasets with VirtualiZarr

Overview

Prerequisites

Flags

Imports

Create a File Pattern from a list of input NetCDF files

As a quick check, it looks like we have a list 365 file paths, which should be a year of downscaled climte data.

Optional: If you want to examine one NetCDF files before creating the `Kerchunk` index, try uncommenting this code snippet below.

Create virtual datasets for every file in the `File_Pattern` list

Define kwargs for `fsspec`

Combine virtual datasets

Write combined virtual dataset to a Kerchunk JSON for future use

Using the output

Open combined virtual dataset with Kerchunk

Plot a slice of the dataset

Multi-file virtual datasets with VirtualiZarr

Overview

Prerequisites

Flags

Imports

Create a File Pattern from a list of input NetCDF files

As a quick check, it looks like we have a list 365 file paths, which should be a year of downscaled climte data.

Optional: If you want to examine one NetCDF files before creating the Kerchunk index, try uncommenting this code snippet below.

Create virtual datasets for every file in the File_Pattern list

Define kwargs for fsspec

Combine virtual datasets

Write combined virtual dataset to a Kerchunk JSON for future use

Using the output

Open combined virtual dataset with Kerchunk

Plot a slice of the dataset

Optional: If you want to examine one NetCDF files before creating the `Kerchunk` index, try uncommenting this code snippet below.

Create virtual datasets for every file in the `File_Pattern` list

Define kwargs for `fsspec`