GeoTIFF

Generating Kerchunk References from GeoTIFF files


Overview

In this tutorial we will cover:

  1. How to generate Kerchunk references of GeoTIFFs.

  2. Combining Kerchunk references into a virtual dataset.

Prerequisites

Concepts                      Importance   Notes
Kerchunk Basics               Required     Core
Multiple Files and Kerchunk   Required     Core
Kerchunk and Dask             Required     Core
Introduction to Xarray        Required     IO/Visualization

  • Time to learn: 30 minutes


About the Dataset

The Finnish Meteorological Institute (FMI) Weather Radar Dataset is a collection of GeoTIFF files containing multiple radar-specific variables, such as rainfall intensity, precipitation accumulation (in 1-, 12-, and 24-hour increments), radar reflectivity, radial velocity, rain classification, and cloud top height. It is available through the AWS public data portal and is updated frequently.

More details on this dataset can be found here.
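If you would like a quick look at how the archive is laid out before diving in, you can list part of the bucket anonymously with s3fs. This is an optional check; the prefix below mirrors the one used later in this notebook, and the exact listing will change as the bucket is updated.

import s3fs

# Anonymous (public) access to the FMI open data bucket
fs = s3fs.S3FileSystem(anon=True)

# Product folders available for a single day
fs.ls("fmi-opendata-radar-geotiff/2023/07/01")[:5]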

import glob
import logging
from tempfile import TemporaryDirectory

import dask
import fsspec
import numpy as np
import rioxarray
import s3fs
import ujson
from distributed import Client
from kerchunk.combine import MultiZarrToZarr
from kerchunk.tiff import tiff_to_zarr

Examining a Single GeoTIFF File

Before we use Kerchunk to create references for multiple files, we can load a single GeoTIFF file with rioxarray to examine its structure.

# URL pointing to a single GeoTIFF file
url = "s3://fmi-opendata-radar-geotiff/2023/07/01/FIN-ACRR-3067-1KM/202307010100_FIN-ACRR1H-3067-1KM.tif"

# Initialize a s3 filesystem
fs = s3fs.S3FileSystem(anon=True)

xds = rioxarray.open_rasterio(fs.open(url))
xds
<xarray.DataArray (band: 1, y: 1345, x: 850)> Size: 2MB
[1143250 values with dtype=uint16]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 7kB -1.177e+05 -1.166e+05 ... 8.738e+05 8.75e+05
  * y            (y) float64 11kB 7.907e+06 7.906e+06 ... 6.337e+06 6.336e+06
    spatial_ref  int64 8B 0
Attributes:
    GDAL_METADATA:  <GDALMetadata>\n<Item name="Observation time" format="YYY...
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
xds.isel(band=0).where(xds < 2000).plot()
<matplotlib.collections.QuadMesh at 0x7f10b5687730>
[Plot of the single GeoTIFF timestep, with no-data values masked out]

Create Input File List

Here we are using fsspec's glob functionality along with the * wildcard operator to grab a list of GeoTIFF files from an s3 fsspec filesystem, then prepending the s3:// protocol prefix to each path.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob(
    "s3://fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/*24H-3067-1KM.tif"
)
# Here we prepend the prefix 's3://', which points to AWS.
file_pattern = sorted(["s3://" + f for f in files_paths])
# This dictionary will be passed as kwargs to `fsspec`. For more details,
# check out the `foundations/kerchunk_basics` notebook.
so = dict(mode="rb", anon=True, default_fill_cache=False, default_cache_type="first")

# We are creating a temporary directory to store the .json reference files
# Alternately, you could write these to cloud storage.
td = TemporaryDirectory()
temp_dir = td.name
temp_dir
'/tmp/tmp20gcu1xi'

Start a Dask Client

To parallelize the creation of our reference files, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

Client: Client-612de058-a04f-11ef-8c6d-000d3ad367d5
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

# Use Kerchunk's `tiff_to_zarr` method to create reference files


def generate_json_reference(fil, output_dir: str):
    tiff_chunks = tiff_to_zarr(fil, remote_options={"protocol": "s3", "anon": True})
    fname = fil.split("/")[-1].split("_")[0]
    outf = f"{output_dir}/{fname}.json"
    with open(outf, "wb") as f:
        f.write(ujson.dumps(tiff_chunks).encode())
    return outf


# Generate Dask Delayed objects
tasks = [dask.delayed(generate_json_reference)(fil, temp_dir) for fil in file_pattern]

# Suppress warnings, then start parallel processing
import warnings

warnings.filterwarnings("ignore")
dask.compute(tasks)
(['/tmp/tmp20gcu1xi/202301010100.json',
  '/tmp/tmp20gcu1xi/202301010200.json',
  '/tmp/tmp20gcu1xi/202301010300.json',
  '/tmp/tmp20gcu1xi/202301010400.json',
  '/tmp/tmp20gcu1xi/202301010500.json',
  '/tmp/tmp20gcu1xi/202301010600.json',
  '/tmp/tmp20gcu1xi/202301010700.json',
  '/tmp/tmp20gcu1xi/202301010800.json',
  '/tmp/tmp20gcu1xi/202301010900.json',
  '/tmp/tmp20gcu1xi/202301011000.json',
  '/tmp/tmp20gcu1xi/202301011100.json',
  '/tmp/tmp20gcu1xi/202301011200.json',
  '/tmp/tmp20gcu1xi/202301011300.json',
  '/tmp/tmp20gcu1xi/202301011400.json',
  '/tmp/tmp20gcu1xi/202301011500.json',
  '/tmp/tmp20gcu1xi/202301011600.json',
  '/tmp/tmp20gcu1xi/202301011700.json',
  '/tmp/tmp20gcu1xi/202301011800.json',
  '/tmp/tmp20gcu1xi/202301011900.json',
  '/tmp/tmp20gcu1xi/202301012000.json',
  '/tmp/tmp20gcu1xi/202301012100.json',
  '/tmp/tmp20gcu1xi/202301012200.json',
  '/tmp/tmp20gcu1xi/202301012300.json'],)

Combine Reference Files into Multi-File Reference Dataset

Now we will combine all of the generated reference files into a single reference dataset. Since each TIFF file is a single time slice and the only temporal information is stored in the file path, we have to specify the coo_map kwarg in MultiZarrToZarr to build a time dimension from the file names.

ref_files = sorted(glob.iglob(f"{temp_dir}/*.json"))


# Custom function passed to Kerchunk's `coo_map` to build the `time` dimension from each file name
def fn_to_time(index, fs, var, fn):
    import datetime

    subst = fn.split("/")[-1].split(".json")[0]
    return datetime.datetime.strptime(subst, "%Y%m%d%H%M")


mzz = MultiZarrToZarr(
    path=ref_files,
    indicts=ref_files,
    remote_protocol="s3",
    remote_options={"anon": True},
    coo_map={"time": fn_to_time},
    coo_dtypes={"time": np.dtype("M8[s]")},
    concat_dims=["time"],
    identical_dims=["X", "Y"],
)

# Translate the combined reference and keep it in memory
multi_kerchunk = mzz.translate()

# Write kerchunk .json record
output_fname = "RADAR.json"
with open(f"{output_fname}", "wb") as f:
    f.write(ujson.dumps(multi_kerchunk).encode())
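
As an optional check, the combined reference can be opened as a virtual Zarr store with fsspec and Xarray. This is a minimal sketch following the usual Kerchunk access pattern; it assumes xarray and zarr are installed and reuses the in-memory multi_kerchunk dictionary created above.

import xarray as xr

# Build a "reference" filesystem backed by the in-memory combined reference
fs_ref = fsspec.filesystem(
    "reference",
    fo=multi_kerchunk,
    remote_protocol="s3",
    remote_options={"anon": True},
)

# Open the virtual dataset lazily with the Zarr engine
ds = xr.open_dataset(
    fs_ref.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
ds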

Shut down the Dask cluster

client.shutdown()