Kerchunk and NetCDF/HDF5: A Case Study using the Argentinian High Resolution Weather Forecast Dataset

Overview

Within this notebook, we will cover:

  1. How to access remote NetCDF data using Kerchunk

  2. Combining multiple Kerchunk reference files using MultiZarrToZarr

  3. Reading the output with Xarray and Intake

This notebook shares many similarities with the Multi-File Datasets with Kerchunk notebook. If you are unsure what a block of code is doing, please refer to that notebook for a more detailed breakdown.

Prerequisites

Concepts | Importance | Notes
--- | --- | ---
Kerchunk Basics | Required | Core
Multiple Files and Kerchunk | Required | Core
Kerchunk and Dask | Required | Core
Introduction to Xarray | Required | IO/Visualization
Intake Introduction | Recommended | IO

  • Time to learn: 45 minutes


Motivation

NetCDF4/HDF5 is one of the most widely adopted file formats in the earth sciences, with support across much of the community as well as scientific agencies, data centers, and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using Kerchunk, we can read these datasets as if they were an Analysis-Ready Cloud-Optimized (ARCO) format such as Zarr.

About the Dataset

The SMN-Arg is a WRF deterministic weather forecasting dataset created by the Servicio Meteorológico Nacional de Argentina that covers Argentina as well as many neighboring countries at a 4 km spatial resolution.
The model is initialized twice daily at 00 & 12 UTC, with hourly forecasts of variables such as temperature, relative humidity, precipitation, and wind direction and magnitude at multiple atmospheric levels. The data is output at hourly intervals in NetCDF files, with a maximum forecast lead time of 72 hours.

More details on this dataset can be found here.

Flags

In the cell below, set the subset_flag variable to True (default) or False depending on whether you want this notebook to process the full file list. If set to True, only a subset of the file list will be processed (recommended).

subset_flag = True

Imports

import glob
import logging
from tempfile import TemporaryDirectory

import dask
import fsspec
import s3fs
import ujson
import xarray as xr
from distributed import Client
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
from tqdm import tqdm

Examining a Single NetCDF File

Before we use Kerchunk to create indices for multiple files, we can load a single NetCDF file to examine it.

# URL pointing to a single NetCDF file
url = "s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/00/WRFDETAR_01H_20221231_00_072.nc"

# Initialize an s3 filesystem
fs = s3fs.S3FileSystem(anon=True)
# Use Xarray to open a remote NetCDF file
ds = xr.open_dataset(fs.open(url), engine="h5netcdf")
ds
<xarray.Dataset>
Dimensions:            (time: 1, y: 1249, x: 999)
Coordinates:
  * time               (time) datetime64[ns] 2023-01-03
  * x                  (x) float32 -1.996e+06 -1.992e+06 ... 1.992e+06 1.996e+06
  * y                  (y) float32 -2.496e+06 -2.492e+06 ... 2.492e+06 2.496e+06
    lat                (y, x) float32 ...
    lon                (y, x) float32 ...
Data variables:
    PP                 (time, y, x) float32 ...
    T2                 (time, y, x) float32 ...
    HR2                (time, y, x) float32 ...
    dirViento10        (time, y, x) float32 ...
    magViento10        (time, y, x) float32 ...
    Lambert_Conformal  float32 ...
Attributes: (12/18)
    title:          Python PostProcessing for SMN WRF-ARW Deterministic SFC
    institution:    Servicio Meteorologico Nacional
    source:          OUTPUT FROM WRF V4.0 MODEL
    start_lat:      -54.386837
    start_lon:      -94.33081
    end_lat:        -11.645958
    ...             ...
    TRUELAT1:       -35.0
    TRUELAT2:       -35.0
    DX:             4000.0
    DY:             4000.0
    START_DATE:     2022-12-31_00:00:00
    Conventions:    CF-1.8

Here we see the repr of the Xarray Dataset for a single NetCDF file. From examining the output, we can tell that the Dataset's dimensions are ['time', 'y', 'x'], with time being only a single step. Later, when we use Kerchunk's MultiZarrToZarr functionality, we will need to know which dimension(s) to concatenate across.
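As a quick sanity check, the x coordinate in the repr runs from -1.996e+06 to 1.996e+06 m across 999 points, which is consistent with the 4 km grid spacing stated in the DX attribute (the displayed coordinate values are rounded, but the arithmetic still recovers the nominal spacing):

```python
# Sanity-check the 4 km grid spacing implied by the x coordinate in the repr
x_min, x_max, n_x = -1.996e6, 1.996e6, 999
dx = (x_max - x_min) / (n_x - 1)
print(dx)  # 4000.0, matching the DX attribute (metres)
```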

Create Input File List

Here we are using fsspec's glob functionality along with the * wildcard operator and some string slicing to grab a list of NetCDF files from an s3 fsspec filesystem.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2023/01/26/12/*")

# Here we prepend the prefix 's3://', which points to AWS.
file_pattern = sorted(["s3://" + f for f in files_paths])


# If the subset_flag == True (default), the list of input files will be subset to speed up the processing
if subset_flag:
    file_pattern = file_pattern[0:8]
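Sorting matters here: because the forecast-hour suffix in the filenames is zero-padded (_000 through _072), lexicographic order coincides with chronological order, which is what we want when concatenating along time later. A tiny illustration with hypothetical filenames:

```python
# Zero-padded forecast-hour suffixes sort lexicographically into time order
names = [f"WRFDETAR_01H_20230126_12_{h:03d}.nc" for h in (10, 2, 0, 1)]
print(sorted(names))
# first entry ends in _000.nc, last entry ends in _010.nc
```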
# This dictionary will be passed as kwargs to `fsspec`. For more details, check out the `foundations/kerchunk_basics` notebook.
so = dict(mode="rb", anon=True, default_fill_cache=False, default_cache_type="first")

# We are creating a temporary directory to store the .json reference files
# Alternately, you could write these to cloud storage.
td = TemporaryDirectory()
temp_dir = td.name
temp_dir
'/tmp/tmpyhej4nny'

Start a Dask Client

To parallelize the creation of our reference files, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

Client

Client-f5250688-21bc-11ee-8d5a-0022481ea464

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status


# Use Kerchunk's `SingleHdf5ToZarr` class to create a `Kerchunk` index from a NetCDF file.
def generate_json_reference(fil, output_dir: str):
    with fs_read.open(fil, **so) as infile:
        h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)
        fname = fil.split("/")[-1].removesuffix(".nc")
        outf = f"{output_dir}/{fname}.json"
        with open(outf, "wb") as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())
        return outf
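A note on trimming the file extension: `str.strip(".nc")` is a common mistake for this task — it treats its argument as a *set* of characters to remove from both ends, not as a suffix, so it silently corrupts filenames that happen to end in those letters. `str.removesuffix` (Python 3.9+) is the safe spelling:

```python
# str.strip removes a *set* of characters from both ends, not a suffix
print("tasmin.nc".strip(".nc"))         # 'tasmi'  -- trailing 'n' is eaten too
print("tasmin.nc".removesuffix(".nc"))  # 'tasmin'
```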


# Generate Dask Delayed objects
tasks = [dask.delayed(generate_json_reference)(fil, temp_dir) for fil in file_pattern]

# Suppress noisy warnings emitted while traversing the HDF5 files
import warnings

warnings.filterwarnings("ignore")

# Start parallel processing
dask.compute(tasks)
/usr/share/miniconda3/envs/kerchunk-cookbook-dev/lib/python3.10/site-packages/kerchunk/hdf.py:436: UserWarning: The following excepion was caught and quashed while traversing HDF5
'str' object has no attribute 'extend'
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/kerchunk-cookbook-dev/lib/python3.10/site-packages/kerchunk/hdf.py", line 379, in _translator
    za = self._zroot.create_dataset(
  File "/usr/share/miniconda3/envs/kerchunk-cookbook-dev/lib/python3.10/site-packages/zarr/hierarchy.py", line 1043, in create_dataset
    return self._write_op(self._create_dataset_nosync, name, **kwargs)
  File "/usr/share/miniconda3/envs/kerchunk-cookbook-dev/lib/python3.10/site-packages/zarr/hierarchy.py", line 895, in _write_op
    return f(*args, **kwargs)
  File "/usr/share/miniconda3/envs/kerchunk-cookbook-dev/lib/python3.10/site-packages/zarr/hierarchy.py", line 1060, in _create_dataset_nosync
    a = array(data, store=self._store, path=path, chunk_store=self._chunk_store,
  File "/usr/share/miniconda3/envs/kerchunk-cookbook-dev/lib/python3.10/site-packages/zarr/creation.py", line 397, in array
    z[...] = data
AttributeError: 'str' object has no attribute 'extend'

  warnings.warn(msg)
(the same warning is repeated for each of the remaining input files)
(['/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_000.json',
  '/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_001.json',
  '/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_002.json',
  '/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_003.json',
  '/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_004.json',
  '/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_005.json',
  '/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_006.json',
  '/tmp/tmpyhej4nny/WRFDETAR_01H_20230126_12_007.json'],)

Combine .json Kerchunk reference files and write a combined Kerchunk index

In the following cell, we are combining all the .json reference files that were generated above into a single reference file and writing that file to disk.

# Create a list of reference json files
output_files = glob.glob(f"{temp_dir}/*.json")

# combine individual references into single consolidated reference
mzz = MultiZarrToZarr(
    output_files,
    concat_dims=["time"],
    identical_dims=["y", "x"],
)
# save the translated reference in memory for later use
multi_kerchunk = mzz.translate()

# Write kerchunk .json record
output_fname = "ARG_combined.json"
with open(f"{output_fname}", "wb") as f:
    f.write(ujson.dumps(multi_kerchunk).encode())
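For orientation, a Kerchunk reference file is just JSON: a version number plus a `refs` mapping from Zarr key names to either inline metadata or `[url, offset, length]` triples pointing at byte ranges in the original NetCDF files. The sketch below is purely illustrative — the bucket name and byte offsets are made up, not taken from this dataset:

```python
# Illustrative structure of a Kerchunk reference file
# (URL and byte offsets are made up for illustration)
reference = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',  # inline Zarr metadata
        "T2/0.0.0": ["s3://some-bucket/file_000.nc", 31825, 1573],  # [url, offset, length]
    },
}
# A reader resolves each chunk key to a byte range in the source file
url, offset, length = reference["refs"]["T2/0.0.0"]
print(url, offset, length)
```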

Load kerchunked dataset

Now the dataset is a logical view over all of the files we scanned.

# create an fsspec reference filesystem from the Kerchunk output
import fsspec

fs = fsspec.filesystem(
    "reference",
    fo="ARG_combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
    skip_instance_cache=True,
)
m = fs.get_mapper("")
ds = xr.open_dataset(m, engine="zarr")

Create a Map

Here we are using Xarray to select a single time slice and create a map of 2-m temperature across the region.

ds.isel(time=0).T2.plot()
<matplotlib.collections.QuadMesh at 0x7fd6b8b0bb80>
[figure: map of 2-m temperature over the forecast domain]

Create a Time-Series

Next we are plotting temperature as a function of time for a specific point.

ds["T2"][:, 500, 500].plot()
[<matplotlib.lines.Line2D at 0x7fd6b80c7cd0>]
[figure: time series of T2 at grid point (y=500, x=500)]