Kerchunk Basics
Overview
This notebook is intended as an introduction to using Kerchunk.
In this tutorial we will:
- Scan a single NetCDF4/HDF5 file to create a Kerchunk virtual dataset
- Learn how to open the output with Xarray and fsspec.
While this notebook only examines using Kerchunk on a single NetCDF file, Kerchunk can be used to create virtual Zarr datasets from collections of many input files. In the following notebook, we will demonstrate this.
Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
|  | Helpful | Basic features |
Time to learn: 60 minutes
Imports
Here we will import a few Python libraries to help with our data processing.

- fsspec will be used to read remote and local filesystems.
- kerchunk.hdf will be used to read a NetCDF file and create a Kerchunk reference set.
- ujson for writing the Kerchunk output to the .json file format.
- Xarray for examining our output dataset.
import fsspec
import kerchunk.hdf
import ujson
import xarray as xr
Define kwargs for fsspec
In the dictionary definition in the next cell, we are passing options to fsspec.open. Any additional kwargs passed in this dictionary through fsspec.open will be passed on to the file system, in our case s3. The API docs for the s3fs filesystem spec can be found here.
In this example we are passing a few kwargs. In short they are:

- anon=True: an s3fs kwarg that specifies you are not passing any connection credentials and are connecting to a public bucket.
- default_fill_cache=False: an s3fs kwarg that avoids caching in between chunks of files. This may lower memory usage when reading large files.
- default_cache_type="first": an fsspec kwarg that specifies the caching strategy used by fsspec. In this case, first caches the first block of a file only.
Don’t worry too much about the details here; the cache options are those that have typically proven efficient for HDF5 files.
so = dict(anon=True, default_fill_cache=False, default_cache_type="first")
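To see where these options end up, here is a quick, optional sketch (not needed for the rest of the notebook): the same kwargs can be handed to fsspec.filesystem, which forwards them to the underlying s3fs filesystem.
import fsspec

# The kwargs in `so` are forwarded to the s3 filesystem, so this builds the same
# anonymous S3 filesystem that fsspec.open(url, **so) will use under the hood.
s3_fs = fsspec.filesystem(
    "s3", anon=True, default_fill_cache=False, default_cache_type="first"
)
# s3_fs.ls("wrf-se-ak-ar5")  # e.g., list the public bucket used later in this notebook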
Parse a single NetCDF file with Kerchunk
Below we will access a NetCDF file stored on the AWS cloud. This dataset is a single time slice of a climate downscaled product for Alaska.
The steps in the cell below are as follows:
1. Define the URL that points to the NetCDF file we want to process.
2. Use fsspec.open along with the dictionary of arguments we created (so) to open the URL pointing to the NetCDF file.
3. Use the kerchunk.hdf.SingleHdf5ToZarr method to read through the NetCDF file and extract the byte ranges, compression information and metadata.
4. Use Kerchunk's .translate method on the output from kerchunk.hdf.SingleHdf5ToZarr to translate the content of the NetCDF file into the Zarr format.
5. Create a .json file named single_file_kerchunk.json and write the dataset information to disk.
# Input URL to dataset. Note this is a netcdf file stored on s3 (cloud dataset).
url = "s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc"
# Uses kerchunk to scan through the netcdf file to create kerchunk mapping and
# then save output as .json.
# Note: In this example, we write the kerchunk output to a .json file.
# You could also keep this information in memory and pass it to fsspec
with fsspec.open(url, **so) as inf:
    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, url, inline_threshold=100)
    refs = h5chunks.translate()
    with open("single_file_kerchunk.json", "wb") as f:
        f.write(ujson.dumps(refs).encode())
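As the comment above notes, the reference set does not have to be written to disk. Here is a minimal sketch, reusing the refs dictionary produced by h5chunks.translate() in the cell above, of passing it to fsspec in memory:
# Keep the Kerchunk references in memory instead of writing a .json file.
# `refs` is a plain Python dict of Zarr metadata plus byte-range references.
fs_mem = fsspec.filesystem(
    "reference",
    fo=refs,  # the "reference" filesystem accepts the dict directly
    remote_protocol="s3",
    remote_options=dict(anon=True),
)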
Load Kerchunk Reference File
In the section below we will use fsspec.filesystem along with the Kerchunk .json reference file to open the NetCDF file as if it were a Zarr dataset.
# use fsspec to create filesystem from .json reference file
fs = fsspec.filesystem(
"reference",
fo="single_file_kerchunk.json",
remote_protocol="s3",
remote_options=dict(anon=True),
skip_instance_cache=True,
)
# load kerchunked dataset with xarray
ds = xr.open_dataset(
fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
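As an aside, recent versions of fsspec and Xarray also let you skip building the filesystem by hand. A sketch of the equivalent call, assuming the reference JSON written above:
# Open the same dataset by passing the reference spec through storage_options.
ds_alt = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "single_file_kerchunk.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)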
Plot Dataset
ds.TMAX.plot()
Note that the original .nc file size here is 16.8 MB, while the JSON we created is only 26.5 kB; these reference files also tend to compress very well. The JSON can be written anywhere and gives us access to the underlying data, reading only the chunks we need from the remote store rather than downloading the whole file.
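If you would like to verify those numbers yourself, here is a small sketch (the exact figures depend on the upstream file and may drift over time):
import os

# Size of the reference JSON we wrote locally vs. the original NetCDF on S3.
json_size = os.path.getsize("single_file_kerchunk.json")
nc_size = fsspec.filesystem("s3", anon=True).size(url)
print(f"reference JSON: {json_size / 1e3:.1f} kB, original NetCDF: {nc_size / 1e6:.1f} MB")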