Kerchunk, hvPlot, and Datashader: Visualizing datasets on-the-fly

Overview

This notebook will demonstrate how to use Kerchunk with hvPlot and Datashader to lazily visualize a reference dataset in a streaming fashion.

We will be building on content from the Kerchunk and Pangeo-Forge notebook, so it's encouraged that you go through that material first.

Prerequisites

Concepts                    | Importance | Notes
--------------------------- | ---------- | -----------------------
Kerchunk Basics             | Required   | Core
Multiple Files and Kerchunk | Required   | Core
Introduction to Xarray      | Required   | IO
Introduction to hvPlot      | Required   | Data Visualization
Introduction to Datashader  | Required   | Big Data Visualization

  • Time to learn: 10 minutes


Motivation

Using Kerchunk, we don't have to create a copy of the data. Instead, we create a collection of reference files so that the original data files can be read as if they were Zarr.

This enables visualization on-the-fly; simply pass in the URL to the dataset and use hvplot.
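For intuition, a reference file is just JSON that maps Zarr chunk keys to byte ranges inside the original files. A schematic excerpt, shown here as a Python dict; the variable name, offset, and length are made up for illustration:

refs = {
    "version": 1,
    "refs": {
        # Zarr metadata is inlined as strings...
        ".zgroup": '{"zarr_format": 2}',
        # ...while each chunk key points at (url, offset, length) in the
        # original netCDF file, so no data is ever copied
        "burning_index_g/0.0.0": [
            "http://www.northwestknowledge.net/metdata/data/bi_1979.nc",
            8192,  # hypothetical byte offset
            1048576,  # hypothetical byte length
        ],
    },
}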

Getting to Know The Data

gridMET is a high-resolution daily meteorological dataset covering CONUS from 1979 to 2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to create a virtual Zarr dataset of a derived variable, the Burning Index (the bi in the file names).
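Before building any references, you can stream a single year's file directly over HTTP to get a feel for the raw data. A minimal sketch, assuming the h5netcdf engine is installed (the URL follows the same pattern the recipe below uses):

import fsspec
import xarray as xr

# Stream one year's netCDF file over HTTP and inspect its metadata
url = "http://www.northwestknowledge.net/metdata/data/bi_1979.nc"
with fsspec.open(url) as f:
    print(xr.open_dataset(f, engine="h5netcdf"))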

Imports

import os

import apache_beam as beam
import fsspec
import hvplot.xarray
import xarray as xr
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import (
    OpenWithKerchunk,
    WriteCombinedReference,
)

Preprocess Dataset

Here we will prepare the Kerchunk reference file using the recipe described in Kerchunk and Pangeo-Forge. Note that the FilePattern's concatenation key is labeled "time" (it holds the years used to build each URL), while the dimension actually being concatenated inside the files is "day", which is why WriteCombinedReference receives concat_dims=["day"].

# Constants
target_root = "references"
store_name = "Pangeo_Forge"
full_path = os.path.join(target_root, store_name, "reference.json")
years = list(range(1979, 1980))
time_dim = ConcatDim("time", keys=years)


# Functions
def format_function(time):
    return f"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc"


# Patterns
pattern = FilePattern(format_function, time_dim, file_type="netcdf4")
# Prune the pattern to just a couple of files to keep this demo small
pattern = pattern.prune()

# Apache Beam transforms
transforms = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(file_type=pattern.file_type)
    | WriteCombinedReference(
        target_root=target_root,
        store_name=store_name,
        concat_dims=["day"],
        identical_dims=["lat", "lon", "crs"],
    )
)
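The transforms above only define the recipe; nothing runs until they are executed in a Beam pipeline, as in the Kerchunk and Pangeo-Forge notebook:

# Execute the recipe, writing references/Pangeo_Forge/reference.json
with beam.Pipeline() as p:
    p | transforms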

Opening the Kerchunk Dataset

Now it's simply a matter of opening the Kerchunk dataset and calling hvplot with the rasterize=True keyword argument, which uses Datashader to render the data into a fixed-resolution image so that only that image, not the full array, is sent to the browser.

If you're running this notebook locally, try zooming around the map by hovering over the plot and scrolling; it should update fairly quickly. Note that the plot will not update if you're viewing this on the online docs page, since there is no backend server, but don't fret: there's a demo GIF below!

%%timeit -r 1 -n 1

# Create a mapper that presents the reference JSON as a Zarr store
mapper = fsspec.get_mapper(
    "reference://",
    fo=full_path,
    remote_protocol="http",
)
# Open the virtual Zarr store with Xarray
ds_kerchunk = xr.open_dataset(
    mapper, engine="zarr", decode_coords="all", backend_kwargs={"consolidated": False}
)
# rasterize=True hands rendering off to Datashader
display(ds_kerchunk.hvplot("lon", "lat", rasterize=True))
4.39 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Kerchunk Zoom
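Under the hood, rasterize=True wraps the plot in Datashader's rasterize operation via HoloViews, re-aggregating the data at screen resolution every time you zoom or pan. You can also apply the operation explicitly; a minimal sketch, re-creating the dataset first since variables assigned inside a %%timeit cell are not kept in the notebook namespace:

from holoviews.operation.datashader import rasterize

# Re-open the virtual Zarr dataset
mapper = fsspec.get_mapper("reference://", fo=full_path, remote_protocol="http")
ds_kerchunk = xr.open_dataset(
    mapper, engine="zarr", decode_coords="all", backend_kwargs={"consolidated": False}
)

# Build the plot without rasterization, then hand it to Datashader
display(rasterize(ds_kerchunk.hvplot("lon", "lat")))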

Comparing Against THREDDS

Now we will repeat the previous cell, but with THREDDS (OPeNDAP) instead of Kerchunk.

Note how the initial load takes longer.

If you're running the notebook locally (or watching the demo GIF below), you'll notice that zooming in and out also takes longer to finish buffering.

%%timeit -r 1 -n 1


def url_gen(year):
    # Build the OPeNDAP access URL for a given year's file
    return (
        f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    )


urls_list = [url_gen(year) for year in years]
# Lazily open all years as a single dataset over OPeNDAP
netcdf_ds = xr.open_mfdataset(urls_list, engine="netcdf4")
display(netcdf_ds.hvplot("lon", "lat", rasterize=True))
4.62 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

THREDDS Zoom
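If you want to dig into why the two behave differently, one thing to compare is chunking. A quick sketch, reusing the mapper from the Datashader example above (exact chunk sizes depend on how the source files were written):

# chunks={} exposes the files' native HDF5 chunk layout, which
# Kerchunk maps one-to-one onto Zarr chunks
ds_chunked = xr.open_dataset(
    mapper, engine="zarr", backend_kwargs={"consolidated": False}, chunks={}
)
print(ds_chunked.chunks)

# open_mfdataset over OPeNDAP typically yields one chunk per file
urls = [
    f"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc"
    for year in years
]
print(xr.open_mfdataset(urls, engine="netcdf4").chunks)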