Kerchunk, hvPlot, and Datashader: Visualizing datasets on-the-fly
Overview
This notebook will demonstrate how to use Kerchunk with hvPlot and Datashader to lazily visualize a reference dataset in a streaming fashion.
We will be building off the references generated in the Pangeo_Forge notebook, so it’s encouraged that you go through that notebook first.
Prerequisites
Concepts | Importance | Notes
---|---|---
Kerchunk Basics | Required | Core
Introduction to Xarray | Required | IO
Introduction to hvPlot | Required | Data Visualization
Introduction to Datashader | Required | Big Data Visualization
Time to learn: 10 minutes
Motivation
Using Kerchunk, we don’t have to create a copy of the data. Instead, we create a collection of reference files, so that the original data files can be read as if they were Zarr.
This enables visualization on-the-fly: simply pass in the URL to the dataset and use hvPlot.
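As a rough sketch of what that means under the hood (equivalent to the kerchunk engine shortcut used later in this notebook), the reference JSON can be opened through fsspec’s "reference" filesystem and read with xarray’s Zarr engine. The path and protocol below assume the reference file produced in the Pangeo_Forge notebook.
import fsspec
import xarray as xr

# A Kerchunk reference file is JSON that maps Zarr chunk keys to byte ranges
# inside the original NetCDF files; fsspec serves those ranges lazily, so no
# data is copied.
fs = fsspec.filesystem(
    "reference",
    fo="references/Pangeo_Forge/reference.json",  # reference file from the Pangeo_Forge notebook
    remote_protocol="http",  # protocol of the server hosting the original files
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}, chunks={}
)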
Getting to Know The Data
gridMET is a high-resolution daily meteorological dataset covering CONUS from 1979 to 2023, produced by the Climatology Lab at UC Merced. In this example, we are going to create a virtual Zarr dataset of a derived variable, Burn Index.
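If you’d like a quick look at the source data first, a single year can be opened directly over OPeNDAP. This is only a sketch for inspection; it uses the same THREDDS URL pattern as the comparison cell at the end of this notebook.
import xarray as xr

# Open one year of the gridMET Burn Index over OPeNDAP to inspect its
# dimensions, coordinates, and attributes.
url = "http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_1979.nc"
ds_preview = xr.open_dataset(url, engine="netcdf4", chunks={})
ds_preview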
Imports
import hvplot.xarray # noqa
import xarray as xr
Opening the Kerchunk Dataset
Now, it’s a matter of opening the Kerchunk dataset and calling hvplot with the rasterize=True keyword argument.
If you’re running this notebook locally, try zooming around the map by hovering over the plot and scrolling; it should update fairly quickly. Note that it will not update if you’re viewing this on the docs page online, as there is no backend server, but don’t fret: there’s a demo GIF below!
%%timeit -r 1 -n 1
storage_options = {
"remote_protocol": "http",
"skip_instance_cache": True,
} # options passed to fsspec
open_dataset_options = {"chunks": {}, "decode_coords": "all"}  # options passed to xarray.open_dataset
ds_kerchunk = xr.open_dataset(
"references/Pangeo_Forge/reference.json",
engine="kerchunk",
storage_options=storage_options,
**open_dataset_options,
)
display(ds_kerchunk.hvplot("lon", "lat", rasterize=True)) # noqa
27.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
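hvPlot forwards most styling options straight through to HoloViews, so the rasterized plot can be customized without changing how the data is read. The snippet below is just a sketch, reusing the ds_kerchunk object from the cell above; the colormap and frame size are arbitrary choices.
# Styling sketch, assuming ds_kerchunk from the cell above.
# rasterize=True hands rendering off to Datashader, so only the pixels
# currently on screen are aggregated from the full-resolution data.
plot = ds_kerchunk.hvplot(
    "lon",
    "lat",
    rasterize=True,
    cmap="viridis",    # any Bokeh/Matplotlib colormap name
    frame_width=500,   # plot frame size in screen pixels
)
plot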
Comparing Against THREDDS
Now, we will repeat the previous cell, but with THREDDS.
Note how the initial load takes longer.
If you’re running the notebook locally (or watching the demo GIF below), zooming in and out also takes longer to finish buffering.
%%timeit -r 1 -n 1
def url_gen(year):
    # OPeNDAP endpoint for one year of the gridMET Burn Index on the THREDDS server
    return (
        f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    )

years = list(range(1979, 1980))  # a single year, to keep the comparison quick
urls_list = [url_gen(year) for year in years]
netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")
display(netcdf_ds.hvplot("lon", "lat", rasterize=True))  # noqa
6.74 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)