
Kerchunk, hvPlot, and Datashader: Visualizing datasets on-the-fly

Overview

This notebook will demonstrate how to use Kerchunk with hvPlot and Datashader to lazily visualize a reference dataset in a streaming fashion.

We will build on the references generated in the Pangeo_Forge notebook, so we encourage you to work through that notebook first.

Prerequisites

Concepts                   | Importance | Notes
Kerchunk Basics            | Required   | Core
Introduction to Xarray     | Required   | IO
Introduction to hvPlot     | Required   | Data Visualization
Introduction to Datashader | Required   | Big Data Visualization
  • Time to learn: 10 minutes

Motivation

Using Kerchunk, we don’t have to create a copy of the data; instead, we create a collection of reference files so that the original data files can be read as if they were Zarr.
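
To make the idea concrete, here is a minimal, standard-library-only sketch of what a reference is: a mapping from a Zarr-style chunk key to either inline content or a [path, offset, length] triple pointing into the original file. The file contents, key names, and the read_ref helper are hypothetical and for illustration only; in practice Kerchunk and fsspec do this work, including remote reads and decompression.

```python
import base64
import os
import tempfile

# Stand-in for an original data file: a 4-byte header followed by two 8-byte "chunks".
payload = b"HDR0" + b"AAAAAAAA" + b"BBBBBBBB"
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(payload)
    data_path = f.name

# Hypothetical reference set in the style of Kerchunk's version 1 layout:
# each key maps to inline content (a string) or to [path, offset, length].
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "var/0": [data_path, 4, 8],
        "var/1": [data_path, 12, 8],
        "var/inline": "base64:" + base64.b64encode(b"tiny").decode(),
    },
}


def read_ref(refs, key):
    """Return the bytes a reference points to (illustrative helper, not Kerchunk's API)."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        # Inline data: either base64-encoded or plain text.
        if entry.startswith("base64:"):
            return base64.b64decode(entry[len("base64:"):])
        return entry.encode()
    path, offset, length = entry
    with open(path, "rb") as fh:
        fh.seek(offset)  # jump straight to the chunk; the file is not copied or converted
        return fh.read(length)


chunk0 = read_ref(refs, "var/0")
chunk1 = read_ref(refs, "var/1")
inline = read_ref(refs, "var/inline")
os.remove(data_path)  # clean up the temporary stand-in file
```

The original file is left untouched; only the small mapping is new, which is why no copy of the data is needed.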

This enables visualization on-the-fly; simply pass in the URL to the dataset and use hvplot.

Getting to Know The Data

gridMET is a high-resolution daily meteorological dataset covering the contiguous United States (CONUS) from 1979 to 2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to look at a virtual Zarr dataset of a derived variable, Burn Index.

Imports

import hvplot.xarray  # noqa
import xarray as xr

Opening the Kerchunk Dataset

Now, it’s a matter of opening the Kerchunk dataset and calling hvplot with the rasterize=True keyword argument.

If you’re running this notebook locally, try zooming around the map by hovering over the plot and scrolling; it should update fairly quickly. Note, it will not update if you’re viewing this on the docs page online as there is no backend server, but don’t fret because there’s a demo GIF below!

%%timeit -r 1 -n 1


# Options passed to fsspec. "skip_instance_cache" is omitted here: the installed
# kerchunk forwards these options to refs_as_store(), which rejects that key with
# "TypeError: refs_as_store() got an unexpected keyword argument 'skip_instance_cache'".
storage_options = {
    "remote_protocol": "http",
}
open_dataset_options = {"chunks": {}, "decode_coords": "all"}  # options passed to xarray

ds_kerchunk = xr.open_dataset(
    "references/Pangeo_Forge/reference.json",
    engine="kerchunk",
    storage_options=storage_options,
    **open_dataset_options,
)

display(ds_kerchunk.hvplot("lon", "lat", rasterize=True))  # noqa

[Demo GIF: Kerchunk zoom]
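
Part of what keeps the plot responsive is that rasterize=True has Datashader aggregate the data into a fixed-size raster matching the plot, so only a screen-sized image is sent to the browser. The pure-Python sketch below shows that aggregation idea on a toy grid. The rasterize_mean helper and the grid sizes are hypothetical; Datashader's actual implementation is different and much faster.

```python
def rasterize_mean(grid, out_rows, out_cols):
    """Average a 2D list of numbers into an out_rows x out_cols grid of bins."""
    n_rows, n_cols = len(grid), len(grid[0])
    sums = [[0.0] * out_cols for _ in range(out_rows)]
    counts = [[0] * out_cols for _ in range(out_rows)]
    for r, row in enumerate(grid):
        br = r * out_rows // n_rows  # output bin for this source row
        for c, value in enumerate(row):
            bc = c * out_cols // n_cols  # output bin for this source column
            sums[br][bc] += value
            counts[br][bc] += 1
    # Bins that received no source cells are reported as None.
    return [
        [s / n if n else None for s, n in zip(srow, nrow)]
        for srow, nrow in zip(sums, counts)
    ]


# A 4x4 "dataset" reduced to a 2x2 "screen".
grid = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
image = rasterize_mean(grid, 2, 2)
```

Zooming changes which source cells fall inside the viewport, so the aggregation has to be recomputed for the new extent; this is why the plot only updates when a live Python backend is running.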

Comparing Against THREDDS

Now, we will be repeating the previous cell, but with THREDDS.

Note how the initial load is longer.

If you’re running the notebook locally (or watching the demo GIF below), zooming in and out also takes longer to finish buffering.

%%timeit -r 1 -n 1


def url_gen(year):
    return (
        f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    )


years = list(range(1979, 1980))  # range() excludes the stop value, so this is 1979 only
urls_list = [url_gen(year) for year in years]
netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")  # read over OPeNDAP (dodsC)
display(netcdf_ds.hvplot("lon", "lat", rasterize=True))  # noqa

[Demo GIF: THREDDS zoom]
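
Both cells use %%timeit -r 1 -n 1, which runs the cell body exactly once and reports the elapsed time. Outside IPython, a roughly equivalent measurement can be sketched with the standard library's timeit module. The load_and_plot function below is a hypothetical placeholder for either cell's body, not part of the notebook.

```python
import time
import timeit


def load_and_plot():
    """Hypothetical stand-in for opening a dataset and building the plot."""
    time.sleep(0.01)


# repeat=1, number=1 mirrors the -r 1 -n 1 flags: one repetition of one loop.
elapsed = timeit.repeat(load_and_plot, repeat=1, number=1)[0]
print(f"{elapsed:.3f} s for a single run")
```

A single run includes one-off costs such as network latency and cache warm-up, so it is best read as a rough comparison of initial load time rather than a precise benchmark.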