Kerchunk, GeoTIFF and Generating Coordinates with `xrefcoord`

Overview

In this tutorial we will cover:

Generating Kerchunk references of GeoTIFFs.
Combining Kerchunk references into a virtual dataset.
Generating coordinates with the xrefcoord accessor.

Prerequisites

Concepts	Importance	Notes
Kerchunk Basics	Required	Core
Multiple Files and Kerchunk	Required	Core
Kerchunk and Dask	Required	Core
Introduction to Xarray	Required	IO/Visualization

Time to learn: 30 minutes

Create Input File List

Here we use fsspec's glob functionality with the * wildcard operator to grab a list of GeoTIFF files from an S3 fsspec filesystem. Because glob returns the paths without the protocol, we then prepend "s3://" to each one.

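import glob
import logging
from tempfile import TemporaryDirectory

import dask
import fsspec
import numpy as np
import ujson
import xrefcoord  # noqa: F401  (registers the .xref accessor on Xarray datasets)
from distributed import Client
from kerchunk.combine import MultiZarrToZarr
from kerchunk.tiff import tiff_to_zarr

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)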

files_paths = fs_read.glob(
    "s3://fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/*24H-3067-1KM.tif"
)
# Here we prepend the prefix 's3://', which points to AWS.
file_pattern = sorted(["s3://" + f for f in files_paths])

# This dictionary will be passed as kwargs to `fsspec`. For more details, check out the `foundations/kerchunk_basics` notebook.
so = dict(mode="rb", anon=True, default_fill_cache=False, default_cache_type="first")

# Create a temporary directory to store the .json reference files.
# Alternatively, you could write these to cloud storage.
td = TemporaryDirectory()
temp_dir = td.name
temp_dir

'/tmp/tmp3v5l212j'
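
The prefixing and sorting step can be tried without network access. The sketch below uses hypothetical object keys in the protocol-free form that glob returns; the text after the timestamp is an assumed naming scheme, not copied from the bucket.

```python
# Hypothetical keys, shaped like what fs.glob() returns: no "s3://" prefix.
# Everything after the timestamp is an assumed naming scheme.
keys = [
    "fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/202301010200_EXAMPLE-24H-3067-1KM.tif",
    "fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/202301010100_EXAMPLE-24H-3067-1KM.tif",
]

# Restore the protocol so downstream readers know the files live on S3.
urls = sorted("s3://" + key for key in keys)

# The file names start with a timestamp, so a lexical sort is chronological.
for url in urls:
    print(url.split("/")[-1])
```

Sorting matters later on: the reference files inherit these timestamps, and they determine the order along the time dimension.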

Start a Dask Client

To parallelize the creation of our reference files, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

Client

Client-26ea85fc-21bd-11ee-8e7d-0022481ea464

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

# Use Kerchunk's `tiff_to_zarr` function to create reference files


def generate_json_reference(fil, output_dir: str):
    tiff_chunks = tiff_to_zarr(fil, remote_options={"protocol": "s3", "anon": True})
    fname = fil.split("/")[-1].split("_")[0]
    outf = f"{output_dir}/{fname}.json"
    with open(outf, "wb") as f:
        f.write(ujson.dumps(tiff_chunks).encode())
    return outf


# Generate Dask Delayed objects
tasks = [dask.delayed(generate_json_reference)(fil, temp_dir) for fil in file_pattern]

# Start parallel processing
import warnings

warnings.filterwarnings("ignore")
dask.compute(tasks)

(['/tmp/tmp3v5l212j/202301010100.json',
  '/tmp/tmp3v5l212j/202301010200.json',
  '/tmp/tmp3v5l212j/202301010300.json',
  '/tmp/tmp3v5l212j/202301010400.json',
  '/tmp/tmp3v5l212j/202301010500.json',
  '/tmp/tmp3v5l212j/202301010600.json',
  '/tmp/tmp3v5l212j/202301010700.json',
  '/tmp/tmp3v5l212j/202301010800.json',
  '/tmp/tmp3v5l212j/202301010900.json',
  '/tmp/tmp3v5l212j/202301011000.json',
  '/tmp/tmp3v5l212j/202301011100.json',
  '/tmp/tmp3v5l212j/202301011200.json',
  '/tmp/tmp3v5l212j/202301011300.json',
  '/tmp/tmp3v5l212j/202301011400.json',
  '/tmp/tmp3v5l212j/202301011500.json',
  '/tmp/tmp3v5l212j/202301011600.json',
  '/tmp/tmp3v5l212j/202301011700.json',
  '/tmp/tmp3v5l212j/202301011800.json',
  '/tmp/tmp3v5l212j/202301011900.json',
  '/tmp/tmp3v5l212j/202301012000.json',
  '/tmp/tmp3v5l212j/202301012100.json',
  '/tmp/tmp3v5l212j/202301012200.json',
  '/tmp/tmp3v5l212j/202301012300.json'],)
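
To make the naming step in generate_json_reference concrete, here is a standalone sketch that derives the output name the same way and writes a placeholder dictionary with the standard-library json module. The input URL, the helper name reference_path, and the placeholder_refs content are all hypothetical; a real reference dictionary comes from tiff_to_zarr.

```python
import json
import os
from tempfile import TemporaryDirectory


def reference_path(tiff_url: str, output_dir: str) -> str:
    """Build the .json path from the timestamp that leads the file name."""
    timestamp = tiff_url.split("/")[-1].split("_")[0]
    return os.path.join(output_dir, f"{timestamp}.json")


# Hypothetical URL; only the leading timestamp matters here.
url = "s3://example-bucket/2023/01/01/202301010100_EXAMPLE-24H-3067-1KM.tif"
placeholder_refs = {"note": "stand-in for the dict returned by tiff_to_zarr"}

with TemporaryDirectory() as out_dir:
    outf = reference_path(url, out_dir)
    # Write the references, then read them back to confirm the round trip.
    with open(outf, "w") as f:
        json.dump(placeholder_refs, f)
    written_name = os.path.basename(outf)
    with open(outf) as f:
        round_trip = json.load(f)

print(written_name)  # 202301010100.json
```

Keeping only the timestamp in the file name is what lets the combine step recover the time of each slice later.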

Combine Reference Files into Multi-File Reference Dataset

Now we will combine all of the generated reference files into a single reference dataset. Each TIFF file holds a single time slice, and the only temporal information is stored in the file path, so we have to pass the coo_map kwarg to MultiZarrToZarr to build a time dimension from the file names.
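
As an illustration of building a time value from a file path, the sketch below parses the timestamp out of a reference file name with datetime.strptime. The helper name time_from_path is hypothetical, and the example path reuses the temporary directory shown in the output above.

```python
import datetime


def time_from_path(path: str) -> datetime.datetime:
    """Parse a YYYYmmddHHMM timestamp from a '<timestamp>.json' file name."""
    stem = path.split("/")[-1].split(".json")[0]
    return datetime.datetime.strptime(stem, "%Y%m%d%H%M")


parsed = time_from_path("/tmp/tmp3v5l212j/202301010100.json")
print(parsed.isoformat())  # 2023-01-01T01:00:00
```

The function passed to coo_map in the next cell applies the same parsing to the file name it receives as its fn argument.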

ref_files = sorted(glob.iglob(f"{temp_dir}/*.json"))


# Custom function passed to Kerchunk's `coo_map` to build the time coordinate from each file name
def fn_to_time(index, fs, var, fn):
    import datetime

    subst = fn.split("/")[-1].split(".json")[0]
    return datetime.datetime.strptime(subst, "%Y%m%d%H%M")


mzz = MultiZarrToZarr(
    path=ref_files,
    indicts=ref_files,
    remote_protocol="s3",
    remote_options={"anon": True},
    coo_map={"time": fn_to_time},
    coo_dtypes={"time": np.dtype("M8[s]")},
    concat_dims=["time"],
    identical_dims=["X", "Y"],
)

# Save the translated reference in memory for later visualization
multi_kerchunk = mzz.translate()

# Write kerchunk .json record
output_fname = "RADAR.json"
with open(output_fname, "wb") as f:
    f.write(ujson.dumps(multi_kerchunk).encode())
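
The next section operates on an Xarray dataset named ds. One way to obtain it, sketched here under the assumption that fsspec, s3fs, xarray, and zarr are installed and that RADAR.json was written as above, is to open the combined reference through fsspec's "reference" filesystem. The helper name open_reference_dataset is hypothetical, and details may vary with the installed zarr version.

```python
def open_reference_dataset(ref_path: str):
    """Open a Kerchunk reference file lazily as an Xarray dataset."""
    # Imported locally so that defining the helper has no side effects.
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "reference",
        fo=ref_path,  # combined reference JSON written above
        remote_protocol="s3",
        remote_options={"anon": True},
        skip_instance_cache=True,
    )
    mapper = fs.get_mapper("")
    # Kerchunk references do not carry consolidated Zarr metadata.
    return xr.open_dataset(
        mapper, engine="zarr", backend_kwargs={"consolidated": False}
    )


# Usage (needs network access to the public bucket):
# ds = open_reference_dataset("RADAR.json")
```

Opening the dataset this way reads only metadata; the pixel values are fetched from the original GeoTIFFs when they are actually needed.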

Use `xrefcoord` to Generate Coordinates

When Kerchunk generates reference datasets for GeoTIFFs, only the dimensions are preserved; the coordinate values are not. xrefcoord is a small utility that generates coordinates for these reference datasets from their geospatial metadata. Like other accessor add-on libraries for Xarray, such as rioxarray and xwrf, xrefcoord provides an accessor for an Xarray dataset: importing xrefcoord makes the .xref accessor and its additional methods available.

In the following cell, we will use the generate_coords method to build coordinates for the Xarray dataset. xrefcoord is highly experimental and makes assumptions about the underlying data, for example that every variable shares the same dimensions. Use with caution!

# Generate coordinates for the combined reference dataset (`ds`, opened with Xarray)
ref_ds = ds.xref.generate_coords(time_dim_name="time", x_dim_name="X", y_dim_name="Y")
# Rename to rain accumulation in 24 hour period
ref_ds = ref_ds.rename({"0": "rr24h"})
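
For intuition about what generating coordinates from geospatial metadata involves, the sketch below computes pixel-center coordinates along one axis from an origin and a pixel size. This is a generic illustration and not xrefcoord's actual implementation; the function name pixel_centers and all numeric values are hypothetical rather than read from the FMI files.

```python
def pixel_centers(origin: float, pixel_size: float, count: int) -> list:
    """Return the coordinate of each pixel center along one axis."""
    return [origin + (i + 0.5) * pixel_size for i in range(count)]


# Hypothetical 4 x 4 grid with 1000 m pixels. The pixel size is negative
# for y because raster rows usually run from north to south.
x = pixel_centers(origin=0.0, pixel_size=1000.0, count=4)
y = pixel_centers(origin=4000.0, pixel_size=-1000.0, count=4)

print(x)  # [500.0, 1500.0, 2500.0, 3500.0]
print(y)  # [3500.0, 2500.0, 1500.0, 500.0]
```

With coordinates like these attached to the X and Y dimensions, selections and plots can use map units instead of array indices.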

Create a Map

Here we use Xarray to select a single time slice and create a map of 24-hour accumulated rainfall.

ref_ds["rr24h"].where(ref_ds.rr24h < 60000).isel(time=0).plot(robust=True)

<matplotlib.collections.QuadMesh at 0x7f7ef4123460>

[Figure: map of 24-hour accumulated rainfall for the first time step]

Create a Time-Series

Next, we plot accumulated rain as a function of time for a single grid cell (index 700 along both spatial dimensions).

ref_ds["rr24h"][:, 700, 700].plot()

[<matplotlib.lines.Line2D at 0x7f7eeccba140>]

[Figure: time series of 24-hour accumulated rainfall at a single grid cell]
