Overview¶
This notebook demonstrates how to use Kerchunk with hvPlot and Datashader to lazily visualize a reference dataset in a streaming fashion.
We will build on the references generated in the Pangeo_Forge notebook, so it is encouraged that you work through that notebook first.
Prerequisites¶
| Concepts | Importance | Notes |
| --- | --- | --- |
| Kerchunk Basics | Required | Core |
| Introduction to Xarray | Required | IO |
| Introduction to hvPlot | Required | Data Visualization |
| Introduction to Datashader | Required | Big Data Visualization |
- Time to learn: 10 minutes
Motivation¶
Using Kerchunk, we don't have to create a copy of the data; instead, we create a collection of reference files so that the original data files can be read as if they were Zarr.
This enables on-the-fly visualization: simply pass the dataset URL to Xarray and call hvplot.
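To make "read as if they were Zarr" concrete, here is a minimal sketch of what a Kerchunk reference file contains: inline Zarr metadata plus a mapping from chunk keys to `[url, offset, length]` triples pointing into the original files. The filename, variable name, and byte ranges below are hypothetical, for illustration only.

```python
import json

# A minimal (hypothetical) Kerchunk reference: Zarr metadata is stored inline
# as JSON strings, while each chunk key points at a byte range in a source file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # chunk "0.0" of variable "bi" lives at a byte range of the source NetCDF
        "bi/0.0": ["http://example.com/bi_1979.nc", 8000, 400000],
    },
}

# Persist the references; fsspec's "reference" filesystem reads this mapping so
# that Zarr/xarray can treat the original files as one logical Zarr store.
with open("reference.json", "w") as f:
    json.dump(refs, f)
```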
Getting to Know The Data¶
gridMET is a high-resolution daily meteorological dataset covering CONUS from 1979 to 2023, produced by the Climatology Lab at UC Merced. In this example, we are going to create a virtual Zarr dataset of a derived variable, the Burning Index.
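The source files live on the Northwest Knowledge Network THREDDS server, one NetCDF file per year, following the URL pattern used later in this notebook. A quick sketch of enumerating them (the exact path is assumed to match the `dodsC` pattern shown in the THREDDS comparison cell below):

```python
# One gridMET Burning Index ("bi") file per year, 1979 through 2023
base = "http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
urls = [base.format(year=year) for year in range(1979, 2024)]

print(len(urls), urls[0])
# 45 http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_1979.nc
```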
Imports¶
import hvplot.xarray # noqa
import xarray as xr
Opening the Kerchunk Dataset¶
Now it's a matter of opening the Kerchunk dataset and calling hvplot with the rasterize=True keyword argument.
If you're running this notebook locally, try zooming around the map by hovering over the plot and scrolling; it should update fairly quickly. Note that it will not update if you're viewing this on the docs page online, since there is no backend server, but don't fret: there's a demo GIF below!
%%timeit -r 1 -n 1

storage_options = {
    "remote_protocol": "http",
    # Note: recent kerchunk releases reject extra keys such as
    # "skip_instance_cache" here, so we pass only what fsspec needs.
}  # options passed to fsspec
open_dataset_options = {"chunks": {}, "decode_coords": "all"}  # options passed to xarray

ds_kerchunk = xr.open_dataset(
    "references/Pangeo_Forge/reference.json",
    engine="kerchunk",
    storage_options=storage_options,
    **open_dataset_options,
)

display(ds_kerchunk.hvplot("lon", "lat", rasterize=True))  # noqa
Comparing Against THREDDS¶
Now we will repeat the previous cell, but using THREDDS.
Note how the initial load takes longer.
If you're running the notebook locally (or watching the demo GIF below), zooming in and out also takes longer to finish buffering.
%%timeit -r 1 -n 1

def url_gen(year):
    return (
        f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    )

years = list(range(1979, 1980))
urls_list = [url_gen(year) for year in years]

netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")

display(netcdf_ds.hvplot("lon", "lat", rasterize=True))  # noqa