
Kerchunk Basics

Overview

This notebook is intended as an introduction to using Kerchunk. In this tutorial we will:

  • Scan a single NetCDF4/HDF5 file to create a Kerchunk virtual dataset

  • Learn how to open and examine the resulting reference dataset using Xarray and fsspec.

While this notebook only examines using Kerchunk on a single NetCDF file, Kerchunk can be used to create virtual Zarr datasets from collections of many input files. In the following notebook, we will demonstrate this.

Prerequisites

Concepts                 Importance   Notes
Introduction to Xarray   Helpful      Basic features

  • Time to learn: 60 minutes


Imports

Here we will import a few Python libraries to help with our data processing.

  • fsspec will be used to read remote and local filesystems.

  • kerchunk.hdf will be used to read a NetCDF file and create a Kerchunk reference set.

  • ujson for writing the Kerchunk output to the .json file format.

  • Xarray for examining our output dataset.

import fsspec
import kerchunk.hdf
import ujson
import xarray as xr

Define kwargs for fsspec

In the dictionary definition in the next cell, we are passing options to fsspec.open. Any additional kwargs in this dictionary are passed through fsspec.open as kwargs to the underlying filesystem, in our case s3fs. The API docs for the s3fs filesystem spec can be found here.

In this example we are passing a few kwargs. In short they are:

  • anon=True: This is a s3fs kwarg that specifies you are not passing any connection credentials and are connecting to a public bucket.

  • default_fill_cache=False: s3fs kwarg that avoids caching in between chunks of files. This may lower memory usage when reading large files.

  • default_cache_type="first": fsspec kwarg that specifies the caching strategy used by fsspec. In this case, first caches the first block of a file only.

Don’t worry too much about the details here; the cache options are those that have typically proven efficient for HDF5 files.

so = dict(anon=True, default_fill_cache=False, default_cache_type="first")

Parse a single NetCDF file with Kerchunk

Below we will access a NetCDF file stored on the AWS cloud. This dataset is a single time slice of a climate downscaled product for Alaska.

The steps in the cell below are as follows:

  1. Define the url that points to the NetCDF file we want to process

  2. Use fsspec.open along with the dictionary of arguments we created (so) to open the URL pointing to the NetCDF file.

  3. Use kerchunk.hdf.SingleHdf5ToZarr method to read through the NetCDF file and extract the byte ranges, compression information and metadata.

  4. Use Kerchunk's .translate method on the output of kerchunk.hdf.SingleHdf5ToZarr to translate the contents of the NetCDF file into the Zarr format.

  5. Create a .json file named single_file_kerchunk.json and write the dataset information to disk.

# Input URL to dataset. Note this is a netcdf file stored on s3 (cloud dataset).
url = "s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc"

# Uses kerchunk to scan through the netcdf file to create kerchunk mapping and
# then save output as .json.
# Note: In this example, we write the kerchunk output to a .json file.
# You could also keep this information in memory and pass it to fsspec
with fsspec.open(url, **so) as inf:
    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, url, inline_threshold=100)
    with open("single_file_kerchunk.json", "wb") as f:
        f.write(ujson.dumps(h5chunks.translate()).encode())

Load Kerchunk Reference File

In the section below we will use fsspec.filesystem along with the Kerchunk .json reference file to open the NetCDF file as if it were a Zarr dataset.

# use fsspec to create filesystem from .json reference file
fs = fsspec.filesystem(
    "reference",
    fo="single_file_kerchunk.json",
    remote_protocol="s3",
    remote_options=dict(anon=True),
    skip_instance_cache=True,
)
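To see what the "reference" filesystem is doing, here is a self-contained toy (all keys and contents are made up): string values in the refs dict are served back as inlined bytes, exactly as if they were small files in a store.

```python
import fsspec

# A toy in-memory reference set: string values are returned as inlined bytes.
toy_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "readme": "inlined bytes live directly in the reference set",
    },
}
fs_toy = fsspec.filesystem("reference", fo=toy_refs, skip_instance_cache=True)
print(fs_toy.cat("readme"))
```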

# load kerchunked dataset with xarray
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)

Plot Dataset

ds.TMAX.plot()
<matplotlib.collections.QuadMesh at 0x7f1e8690e260>

Note that the original .nc file is 16.8 MB, while the generated JSON is only 26.5 kB, and these reference files also tend to compress very well. The JSON can be written anywhere and gives us access to the underlying data, reading only the chunks we need from the remote file without downloading the whole thing.
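The size difference makes sense once you look at the shape of the reference file. A hand-built sketch of Kerchunk's version-1 layout (the offsets and lengths here are invented for illustration): metadata is inlined as strings, while each data chunk is only a [url, offset, length] pointer into the original file.

```python
import json

# A hand-written reference set in the version-1 layout. The JSON stays tiny
# because chunk entries are pointers, not data. (Offset/length values are
# made up for this sketch.)
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "TMAX/0.0.0": [
            "s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc",
            8192,
            65536,
        ],
    },
}
print(f"{len(json.dumps(refs))} bytes of JSON stand in for the chunk data")
```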