Overview¶
This notebook demonstrates how to use Kerchunk with hvPlot and Datashader to lazily visualize a reference dataset in a streaming fashion.
We will build on the references generated in the Pangeo_Forge notebook, so it is encouraged that you work through that notebook first.
Prerequisites¶
| Concepts | Importance | Notes |
| --- | --- | --- |
| Kerchunk Basics | Required | Core |
| Introduction to Xarray | Required | IO |
| Introduction to hvPlot | Required | Data Visualization |
| Introduction to Datashader | Required | Big Data Visualization |
- Time to learn: 10 minutes
Motivation¶
Using Kerchunk, we don't have to create a copy of the data; instead, we create a collection of reference files so that the original data files can be read as if they were Zarr.
This enables on-the-fly visualization: simply pass the dataset URL to Xarray and call hvplot.
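To make "read as if they were Zarr" concrete, here is a minimal sketch of what a Kerchunk reference file contains: inline Zarr metadata plus a mapping from chunk keys to `[url, offset, length]` triples pointing into the original files. The filename, variable name, and byte ranges below are hypothetical, for illustration only.

```python
import json

# A minimal (hypothetical) Kerchunk reference: Zarr metadata is stored inline
# as JSON strings, while each chunk key points at a byte range in a source file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # chunk "0.0" of variable "bi" lives at a byte range of the source NetCDF
        "bi/0.0": ["http://example.com/bi_1979.nc", 8000, 400000],
    },
}

# Persist the references; fsspec's "reference" filesystem reads this mapping so
# that Zarr/xarray can treat the original files as one logical Zarr store.
with open("reference.json", "w") as f:
    json.dump(refs, f)
```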
Getting to Know The Data¶
gridMET is a high-resolution daily meteorological dataset covering CONUS from 1979 to 2023, produced by the Climatology Lab at UC Merced. In this example, we are going to create a virtual Zarr dataset of a derived variable, the Burning Index.
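The source files live on the Northwest Knowledge Network THREDDS server, one NetCDF file per year, following the URL pattern used later in this notebook. A quick sketch of enumerating them (the exact path is assumed to match the `dodsC` pattern shown in the THREDDS comparison cell below):

```python
# One gridMET Burning Index ("bi") file per year, 1979 through 2023
base = "http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
urls = [base.format(year=year) for year in range(1979, 2024)]

print(len(urls), urls[0])
# 45 http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_1979.nc
```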
Imports¶
import hvplot.xarray # noqa
import xarray as xr
Opening the Kerchunk Dataset¶
Now it's a matter of opening the Kerchunk dataset and calling hvplot with the rasterize=True keyword argument.
If you're running this notebook locally, try zooming around the map by hovering over the plot and scrolling; it should update fairly quickly. Note that it will not update if you're viewing this on the docs page online, since there is no backend server, but don't fret: there's a demo GIF below!
%%timeit -r 1 -n 1

storage_options = {
    "remote_protocol": "http",
    # Note: recent kerchunk releases reject extra keys such as
    # "skip_instance_cache" here, so we pass only what fsspec needs.
}  # options passed to fsspec
open_dataset_options = {"chunks": {}, "decode_coords": "all"}  # options passed to xarray

ds_kerchunk = xr.open_dataset(
    "references/Pangeo_Forge/reference.json",
    engine="kerchunk",
    storage_options=storage_options,
    **open_dataset_options,
)

display(ds_kerchunk.hvplot("lon", "lat", rasterize=True))  # noqa
Comparing Against THREDDS¶
Now we will repeat the previous cell, but using THREDDS.
Note how the initial load takes longer.
If you're running the notebook locally (or watching the demo GIF below), zooming in and out also takes longer to finish buffering.
%%timeit -r 1 -n 1

def url_gen(year):
    return (
        f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    )

years = list(range(1979, 1980))
urls_list = [url_gen(year) for year in years]

netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")

display(netcdf_ds.hvplot("lon", "lat", rasterize=True))  # noqa