
Store virtual datasets as Kerchunk Parquet references

Overview

In this notebook we will cover how to store virtual datasets as Kerchunk Parquet references instead of Kerchunk JSON references. For large virtual datasets, Parquet references should be smaller on disk than the equivalent JSON, and combining them should carry less memory overhead.

This notebook builds upon the Kerchunk Basics, Multi-File Datasets with Kerchunk, and Kerchunk and Dask notebooks.

Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| Basics of virtual Zarr stores | Required | Core |
| Multi-file virtual datasets with VirtualiZarr | Required | Core |
| Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask | Required | Core |
| Introduction to Xarray | Required | IO/Visualization |
  • Time to learn: 30 minutes

Imports

import logging

import dask
import fsspec
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset

Setting up the Dask Client

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

Create Input File List

Here we use fsspec's glob functionality along with the * wildcard operator to grab a list of NetCDF files from an s3 fsspec filesystem, and then prepend the s3:// protocol prefix to each path.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")

# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])
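
As a quick, optional sanity check (not part of the original workflow), we can confirm how many files the glob matched and look at the first few paths:

# Optional: confirm how many files were found and inspect the first few paths
print(f"Found {len(files_paths)} files")
print(files_paths[:3])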

Subset the Data

To speed up our example, let's take a subset of the data.

# If the subset_flag == True (default), the list of input files will
# be subset to speed up the processing
subset_flag = True
if subset_flag:
    files_paths = files_paths[0:4]

Generate Lazy References

Here we create a function that generates a virtual dataset for a single file, then use it to build a list of Dask delayed objects.

def generate_virtual_dataset(file, storage_options):
    # Create a virtual dataset for a single NetCDF file. Recent VirtualiZarr
    # releases no longer accept the `indexes` keyword, so we only pass the
    # storage options through `reader_options`.
    return open_virtual_dataset(
        file, reader_options={"storage_options": storage_options}
    )


storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="first")
# Generate Dask Delayed objects
tasks = [
    dask.delayed(generate_virtual_dataset)(file, storage_options)
    for file in files_paths
]
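
Before computing all of the tasks with Dask, it can be useful to test the function eagerly on a single file. This is an optional check, assuming the bucket is reachable with anonymous access:

# Optional: build one virtual dataset eagerly to confirm the reader works
single_vds = generate_virtual_dataset(files_paths[0], storage_options)
single_vds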

Start the Dask Processing

You can watch the processing in real time on the Dask Dashboard, e.g. http://127.0.0.1:8787/status

virtual_datasets = list(dask.compute(*tasks))

Combine virtual datasets using VirtualiZarr

combined_vds = xr.combine_nested(
    virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)
combined_vds
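
As a quick, illustrative check, the concatenated time dimension should now span all of the input files:

# After concatenation, the time dimension spans all input files
print(combined_vds.sizes)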

Write the virtual dataset to a Kerchunk Parquet reference

combined_vds.virtualize.to_kerchunk("combined.parq", format="parquet")
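
To illustrate the size benefit mentioned in the overview, we can also write the same references as a single Kerchunk JSON file and compare the on-disk sizes. This is an optional comparison, and the combined.json filename is just for illustration:

import os

# For comparison only: write the same references as Kerchunk JSON
combined_vds.virtualize.to_kerchunk("combined.json", format="json")


def disk_size(path):
    # Parquet reference stores are directories, so sum the sizes of their files
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )


print(f"JSON references:    {disk_size('combined.json') / 1e6:.3f} MB")
print(f"Parquet references: {disk_size('combined.parq') / 1e6:.3f} MB")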

Shutdown the Dask cluster

client.shutdown()

Load kerchunked dataset

Next we open the Parquet reference store. Xarray's kerchunk engine initiates an fsspec ReferenceFileSystem under the hood, so we need to pass:

  • The name of the Parquet store
  • The remote protocol (the protocol of the input file URLs)
  • The target protocol (file, since we saved our Parquet store locally)
storage_options = {
    "remote_protocol": "s3",
    "skip_instance_cache": True,
    "remote_options": {"anon": True},
    "target_protocol": "file",
    "lazy": True,
}  # options passed to fsspec
open_dataset_options = {"chunks": {}}  # options passed to xarray.open_dataset

ds = xr.open_dataset(
    "combined.parq",
    engine="kerchunk",
    storage_options=storage_options,
    open_dataset_options=open_dataset_options,
)
ds
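
Equivalently, as an alternative sketch (not required for the workflow above), you can build the fsspec ReferenceFileSystem yourself and open the resulting mapper with Xarray's zarr engine:

# Alternative: construct the ReferenceFileSystem explicitly and open it with
# the zarr engine instead of the kerchunk engine
fs = fsspec.filesystem(
    "reference",
    fo="combined.parq",  # path to the Parquet reference store
    remote_protocol="s3",  # protocol of the referenced NetCDF files
    remote_options={"anon": True},
    target_protocol="file",  # the Parquet store lives on local disk
    lazy=True,  # load reference rows on demand
    skip_instance_cache=True,
)
ds_ref = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
    chunks={},
)
ds_ref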