Store Kerchunk Reference Files as Parquet

Overview

In this notebook we will cover how to store Kerchunk references as Parquet files instead of json. For large reference datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower.

This notebook builds upon the Kerchunk Basics, Multi-File Datasets with Kerchunk and the Kerchunk and Dask notebooks.

Prerequisites

Concepts	Importance	Notes
Kerchunk Basics	Required	Core
Multiple Files and Kerchunk	Required	Core
Introduction to Xarray	Recommended	IO/Visualization
Intro to Dask	Required	Parallel Processing

Time to learn: 30 minutes

Imports

In addition to the previous imports we used throughout the tutorial, we are adding a few imports:

LazyReferenceMapper and ReferenceFileSystem from fsspec.implementations.reference for lazy Parquet.

import logging
import os

import dask
import fsspec
import xarray as xr
from distributed import Client
from fsspec.implementations.reference import LazyReferenceMapper, ReferenceFileSystem
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

Setting up the `Dask` Client

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

Client

Client-f5b5ad1a-f06e-11ee-8ab8-000d3ae34a06

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

LocalCluster

66239996

Dashboard: http://127.0.0.1:8787/status	Workers: 8
Total threads: 8	Total memory: 15.61 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-d45e7bae-72b1-4c2f-916e-0172c6fd8b2f

Comm: tcp://127.0.0.1:40813	Workers: 8
Dashboard: http://127.0.0.1:8787/status	Total threads: 8
Started: Just now	Total memory: 15.61 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:34307	Total threads: 1
Dashboard: http://127.0.0.1:44431/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:35607
Local directory: /tmp/dask-scratch-space/worker-7yp0zhri

Worker: 1

Comm: tcp://127.0.0.1:40053	Total threads: 1
Dashboard: http://127.0.0.1:33275/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:40981
Local directory: /tmp/dask-scratch-space/worker-1z4_0e0d

Worker: 2

Comm: tcp://127.0.0.1:45193	Total threads: 1
Dashboard: http://127.0.0.1:33669/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:40557
Local directory: /tmp/dask-scratch-space/worker-vkewl4bg

Worker: 3

Comm: tcp://127.0.0.1:45363	Total threads: 1
Dashboard: http://127.0.0.1:44451/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:45197
Local directory: /tmp/dask-scratch-space/worker-gjuov0fy

Worker: 4

Comm: tcp://127.0.0.1:37775	Total threads: 1
Dashboard: http://127.0.0.1:42269/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:38443
Local directory: /tmp/dask-scratch-space/worker-wd7_kmy6

Worker: 5

Comm: tcp://127.0.0.1:44853	Total threads: 1
Dashboard: http://127.0.0.1:46531/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:41371
Local directory: /tmp/dask-scratch-space/worker-s6jn0a6q

Worker: 6

Comm: tcp://127.0.0.1:46731	Total threads: 1
Dashboard: http://127.0.0.1:40097/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:44617
Local directory: /tmp/dask-scratch-space/worker-15pbnwim

Worker: 7

Comm: tcp://127.0.0.1:41979	Total threads: 1
Dashboard: http://127.0.0.1:39963/status	Memory: 1.95 GiB
Nanny: tcp://127.0.0.1:44677
Local directory: /tmp/dask-scratch-space/worker-42s3vx2e

Create Input File List

Here we are using fsspec's glob functionality along with the * wildcard operator and some string slicing to grab a list of NetCDF files from a s3 fsspec filesystem.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")

# Here we prepend the prefix 's3://', which points to AWS.
file_pattern = sorted(["s3://" + f for f in files_paths])

# Grab the first seven files to speed up example.
file_pattern = file_pattern[0:2]

Generate Lazy References

Below we will create a fsspec filesystem to read the references from s3 and create a function to generate dask delayed tasks.

# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk`
# index from a NetCDF file.
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)
so = dict(mode="rb", anon=True, default_fill_cache=False, default_cache_type="first")


def generate_json_reference(fil):
    with fs_read.open(fil, **so) as infile:
        h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)
        return h5chunks.translate()  # outf


# Generate Dask Delayed objects
tasks = [dask.delayed(generate_json_reference)(fil) for fil in file_pattern]

Start the Dask Processing

To view the processing you can view it in real-time on the Dask Dashboard. ex: http://127.0.0.1:8787/status

single_refs = dask.compute(tasks)[0]

len(single_refs)

Combine In-Memory References with MultiZarrToZarr

This section will look notably different than the previous examples that have written to .json.

In the following code block we are:

Creating an fsspec filesystem.
Create a empty parquet file to write to.
Creating an fsspec LazyReferenceMapper to pass into MultiZarrToZarr
Building a MultiZarrToZarr object of the combined references.
Calling .flush() on our LazyReferenceMapper, to write the combined reference to our parquet file.

fs = fsspec.filesystem("file")

if os.path.exists("combined.parq"):
    import shutil

    shutil.rmtree("combined.parq")
os.makedirs("combined.parq")

out = LazyReferenceMapper.create(root="combined.parq", fs=fs, record_size=1000)

mzz = MultiZarrToZarr(
    single_refs,
    remote_protocol="s3",
    concat_dims=["time"],
    identical_dims=["y", "x"],
    remote_options={"anon": True},
    out=out,
).translate()

out.flush()

Store Kerchunk Reference Files as Parquet

Overview

Prerequisites

Imports

Setting up the `Dask` Client

Client

Cluster Info

LocalCluster

Scheduler Info

Scheduler

Workers

Worker: 0

Worker: 1

Worker: 2

Worker: 3

Worker: 4

Worker: 5

Worker: 6

Worker: 7

Create Input File List

Generate Lazy References

Start the Dask Processing

Combine In-Memory References with MultiZarrToZarr

Shutdown the Dask cluster

Load kerchunked dataset

Store Kerchunk Reference Files as Parquet

Overview

Prerequisites

Imports

Setting up the Dask Client

Client

Cluster Info

LocalCluster

Scheduler Info

Scheduler

Workers

Worker: 0

Worker: 1

Worker: 2

Worker: 3

Worker: 4

Worker: 5

Worker: 6

Worker: 7

Create Input File List

Generate Lazy References

Start the Dask Processing

Combine In-Memory References with MultiZarrToZarr

Shutdown the Dask cluster

Load kerchunked dataset

Setting up the `Dask` Client