Store virtual datasets as Kerchunk Parquet references¶
Overview¶
In this notebook we will cover how to store virtual datasets as Kerchunk Parquet references instead of Kerchunk JSON references. For large virtual datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower.
This notebook builds upon the Kerchunk Basics, Multi-File Datasets with Kerchunk and the Kerchunk and Dask notebooks.
Prerequisites¶
| Concepts | Importance | Notes |
|---|---|---|
| Basics of virtual Zarr stores | Required | Core |
| Multi-file virtual datasets with VirtualiZarr | Required | Core |
| Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask | Required | Core |
| Introduction to Xarray | Required | IO/Visualization |
Time to learn: 30 minutes
Imports¶
import logging
import dask
import fsspec
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[1], line 7
3 import dask
4 import fsspec
5 import xarray as xr
6 from distributed import Client
----> 7 from virtualizarr import open_virtual_dataset
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/virtualizarr/__init__.py:3
1 from importlib.metadata import version as _version
----> 3 from virtualizarr.accessor import (
4 VirtualiZarrDatasetAccessor,
5 VirtualiZarrDataTreeAccessor,
6 )
7 from virtualizarr.xarray import (
8 open_virtual_dataset,
9 open_virtual_datatree,
10 open_virtual_mfdataset,
11 )
13 try:
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/virtualizarr/accessor.py:17
7 from typing import (
8 TYPE_CHECKING,
9 Callable,
(...) 12 overload,
13 )
15 import xarray as xr
---> 17 from virtualizarr.manifests import ManifestArray
18 from virtualizarr.types.kerchunk import KerchunkStoreRefs
19 from virtualizarr.writers.kerchunk import dataset_to_kerchunk_refs
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/virtualizarr/manifests/__init__.py:4
1 # Note: This directory is named "manifests" rather than "manifest".
2 # This is just to avoid conflicting with some type of file called manifest that .gitignore recommends ignoring.
----> 4 from virtualizarr.manifests.array import ManifestArray # type: ignore # noqa
5 from virtualizarr.manifests.group import ManifestGroup # type: ignore # noqa
6 from virtualizarr.manifests.manifest import ChunkEntry, ChunkManifest # type: ignore # noqa
File ~/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/virtualizarr/manifests/array.py:6
4 import numpy as np
5 import xarray as xr
----> 6 from zarr.core.metadata.v3 import ArrayV3Metadata, RegularChunkGrid
8 import virtualizarr.manifests.utils as utils
9 from virtualizarr.manifests.array_api import (
10 MANIFESTARRAY_HANDLED_ARRAY_FUNCTIONS,
11 _isnan,
12 )
ImportError: cannot import name 'RegularChunkGrid' from 'zarr.core.metadata.v3' (/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.14/site-packages/zarr/core/metadata/v3.py)Setting up the Dask Client¶
client = Client(n_workers=8, silence_logs=logging.ERROR)
clientCreate Input File List¶
Here we are using fsspec's glob functionality along with the * wildcard operator and some string slicing to grab a list of NetCDF files from a s3 fsspec filesystem.
# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)
files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")
# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])Subset the Data¶
To speed up our example, lets take a subset of the year of data.
# If the subset_flag == True (default), the list of input files will
# be subset to speed up the processing
subset_flag = True
if subset_flag:
files_paths = files_paths[0:4]Generate Lazy References¶
Here we create a function to generate a list of Dask delayed objects.
def generate_virtual_dataset(file, storage_options):
return open_virtual_dataset(
file, indexes={}, reader_options={"storage_options": storage_options}
)
storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="first")
# Generate Dask Delayed objects
tasks = [
dask.delayed(generate_virtual_dataset)(file, storage_options)
for file in files_paths
]Start the Dask Processing¶
To view the processing you can view it in real-time on the Dask Dashboard. ex: http://
virtual_datasets = list(dask.compute(*tasks))Combine virtual datasets using VirtualiZarr¶
combined_vds = xr.combine_nested(
virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)
combined_vdsWrite the virtual dataset to a Kerchunk Parquet reference¶
combined_vds.virtualize.to_kerchunk("combined.parq", format="parquet")Shutdown the Dask cluster¶
client.shutdown()Load kerchunked dataset¶
Next we initiate a fsspec ReferenceFileSystem.
We need to pass:
The name of the parquet store
The remote protocol (This is the protocol of the input file urls)
The target protocol (
filesince we saved our parquet store locally).
storage_options = {
"remote_protocol": "s3",
"skip_instance_cache": True,
"remote_options": {"anon": True},
"target_protocol": "file",
"lazy": True,
} # options passed to fsspec
open_dataset_options = {"chunks": {}} # opens passed to xarray
ds = xr.open_dataset(
"combined.parq",
engine="kerchunk",
storage_options=storage_options,
open_dataset_options=open_dataset_options,
)
ds