GeoTIFF - Kerchunk Cookbook

Generating virtual datasets from GeoTIFF files


Overview¶

In this tutorial we will cover:

How to generate virtual datasets from GeoTIFFs.
Combining virtual datasets.

Prerequisites¶

Concepts	Importance	Notes
Basics of virtual Zarr stores	Required	Core
Multi-file virtual datasets with VirtualiZarr	Required	Core
Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask	Required	Core
Introduction to Xarray	Required	IO/Visualization

Time to learn: 30 minutes

About the Dataset¶

The Finnish Meteorological Institute (FMI) Weather Radar Dataset is a collection of GeoTIFF files containing multiple radar-specific variables, such as rainfall intensity, precipitation accumulation (over 1-, 12- and 24-hour periods), radar reflectivity, radial velocity, rain classification and cloud top height. It is available through the AWS public data portal and is updated frequently.

More details on this dataset can be found here.

import logging
from datetime import datetime

import dask
import fsspec
import rioxarray
import s3fs
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset

Examining a Single GeoTIFF File¶

Before we use Kerchunk to create indices for multiple files, we can load a single GeoTIFF file to examine it.

# URL pointing to a single GeoTIFF file
url = "s3://fmi-opendata-radar-geotiff/2023/07/01/FIN-ACRR-3067-1KM/202307010100_FIN-ACRR1H-3067-1KM.tif"

# Initialize an anonymous (unauthenticated) S3 filesystem
fs = s3fs.S3FileSystem(anon=True)

xds = rioxarray.open_rasterio(fs.open(url))

xds

xds.isel(band=0).where(xds < 2000).plot()

Create Input File List¶

Here we use fsspec's glob functionality with the * wildcard operator to list GeoTIFF files on an S3 fsspec filesystem, then prepend the s3:// prefix to each returned path.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob(
    "s3://fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/*24H-3067-1KM.tif"
)
# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])
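
To make the wildcard step concrete without network access, here is a minimal offline sketch that uses Python's standard fnmatch module in place of fsspec's glob. The object keys are hypothetical stand-ins for a bucket listing (only the bucket name and the pattern come from the cell above), and fnmatch only approximates how glob matches on S3.

```python
from fnmatch import fnmatch

# Hypothetical object keys, standing in for what a bucket listing might contain.
keys = [
    "fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/202301010100_FIN-ACRR1H-3067-1KM.tif",
    "fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/202301011200_FIN-ACRR24H-3067-1KM.tif",
    "fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/202301010000_FIN-ACRR24H-3067-1KM.tif",
]

# Same pattern as above, without the protocol (listings come back without it).
pattern = "fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/*24H-3067-1KM.tif"

# Keep only the 24-hour accumulation files, then restore the protocol prefix.
matched = sorted("s3://" + key for key in keys if fnmatch(key, pattern))

for path in matched:
    print(path)
```

Because the file names start with a fixed-width timestamp, sorting the paths as strings also puts them in chronological order.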

Start a Dask Client¶

To parallelize the creation of our reference files, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

def generate_virtual_dataset(file):
    storage_options = dict(
        anon=True, default_fill_cache=False, default_cache_type="none"
    )
    vds = open_virtual_dataset(
        file,
        indexes={},
        filetype="tiff",
        reader_options={
            "remote_options": {"anon": True},
            "storage_options": storage_options,
        },
    )
    # Pre-process virtual datasets to extract time step information from the filename
    subst = file.split("/")[-1].split(".tif")[0].split("_")[0]
    time_val = datetime.strptime(subst, "%Y%m%d%H%M")
    vds = vds.expand_dims(dim={"time": [time_val]})
    # Only include the raw data, not the overviews
    vds = vds[["0"]]
    return vds
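
The filename parsing inside the function can be tried on its own. The following is a small sketch, assuming file names begin with a YYYYmmddHHMM stamp followed by an underscore, as in the single-file URL used earlier; the helper name timestamp_from_path is hypothetical and not part of any library.

```python
from datetime import datetime


def timestamp_from_path(path: str) -> datetime:
    """Parse the leading YYYYmmddHHMM stamp from a file name (hypothetical helper)."""
    filename = path.split("/")[-1]   # drop the bucket and directory parts
    stamp = filename.split("_")[0]   # keep the text before the first underscore
    return datetime.strptime(stamp, "%Y%m%d%H%M")


example = "s3://fmi-opendata-radar-geotiff/2023/07/01/FIN-ACRR-3067-1KM/202307010100_FIN-ACRR1H-3067-1KM.tif"
parsed = timestamp_from_path(example)
print(parsed.isoformat())  # 2023-07-01T01:00:00
```

Splitting on the underscore alone is enough to isolate the timestamp, so the file extension does not need to be stripped first.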

# Generate Dask Delayed objects
tasks = [dask.delayed(generate_virtual_dataset)(file) for file in files_paths]

# Start parallel processing
import warnings

warnings.filterwarnings("ignore")
virtual_datasets = dask.compute(*tasks)

/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/virtualizarr/readers/tiff.py:47: UserWarning: storage_options have been dropped from reader_options as they are not supported by kerchunk.tiff.tiff_to_zarr
  warnings.warn(

2025-06-29 00:51:37,149 - distributed.worker - ERROR - Compute Failed
Key:       generate_virtual_dataset-ae014e7e-d8c8-4377-8a5f-f078cc805366
State:     executing
Task:  <Task 'generate_virtual_dataset-ae014e7e-d8c8-4377-8a5f-f078cc805366' generate_virtual_dataset(...)>
Exception: 'NotImplementedError("Doesn\'t support slicing with (None, slice(None, None, None))")'
Traceback: '  File "/tmp/ipykernel_4455/4131322973.py", line 17, in generate_virtual_dataset\n  File "/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/xarray/core/dataset.py", line 4568, in expand_dims\n    variables[k] = v.set_dims(dict(all_dims))\n                   ~~~~~~~~~~^^^^^^^^^^^^^^^^\n  File "/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/xarray/util/deprecation_helpers.py", line 143, in wrapper\n    return func(*args, **kwargs)\n  File "/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/xarray/core/variable.py", line 1395, in set_dims\n    expanded_data = self.data[indexer]\n                    ~~~~~~~~~^^^^^^^^^\n  File "/home/runner/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/virtualizarr/manifests/array.py", line 226, in __getitem__\n    raise NotImplementedError(f"Doesn\'t support slicing with {indexer}")\n'

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[9], line 5
      2 import warnings
      4 warnings.filterwarnings("ignore")
----> 5 virtual_datasets = dask.compute(*tasks)

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/dask/base.py:681, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    678     expr = expr.optimize()
    679     keys = list(flatten(expr.__dask_keys__()))
--> 681     results = schedule(expr, keys, **kwargs)
    683 return repack(results)

Cell In[7], line 17, in generate_virtual_dataset()
     15 subst = file.split("/")[-1].split(".json")[0].split("_")[0]
     16 time_val = datetime.strptime(subst, "%Y%m%d%H%M")
---> 17 vds = vds.expand_dims(dim={"time": [time_val]})
     18 # Only include the raw data, not the overviews
     19 vds = vds[["0"]]

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/xarray/core/dataset.py:4568, in expand_dims()
   4566         for d, c in zip_axis_dim:
   4567             all_dims.insert(d, c)
-> 4568         variables[k] = v.set_dims(dict(all_dims))
   4569 elif k not in variables:
   4570     if k in coord_names and create_index_for_new_dim:
   4571         # If dims includes a label of a non-dimension coordinate,
   4572         # it will be promoted to a 1D coordinate with a single value.

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/xarray/util/deprecation_helpers.py:143, in wrapper()
    135     emit_user_level_warning(
    136         f"The `{old_name}` argument has been renamed to `dim`, and will be removed "
    137         "in the future. This renaming is taking place throughout xarray over the "
   (...)    140         PendingDeprecationWarning,
    141     )
    142     kwargs["dim"] = kwargs.pop(old_name)
--> 143 return func(*args, **kwargs)

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/xarray/core/variable.py:1395, in set_dims()
   1388 elif shape is None or all(
   1389     s == 1 for s, e in zip(shape, dim, strict=True) if e not in self_dims
   1390 ):
   1391     # "Trivial" broadcasting, i.e. simply inserting a new dimension
   1392     # This is typically easier for duck arrays to implement
   1393     # than the full "broadcast_to" semantics
   1394     indexer = (None,) * (len(expanded_dims) - self.ndim) + (...,)
-> 1395     expanded_data = self.data[indexer]
   1396 else:  # elif shape is not None:
   1397     dims_map = dict(zip(dim, shape, strict=True))

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/virtualizarr/manifests/array.py:226, in __getitem__()
    224     return self
    225 else:
--> 226     raise NotImplementedError(f"Doesn't support slicing with {indexer}")

NotImplementedError: Doesn't support slicing with (None, slice(None, None, None))
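
The traceback shows xarray's expand_dims building an indexer that starts with None (a request for a new leading axis) and passing it to the array's __getitem__, which only returns itself for no-op indexers and raises otherwise. The toy class below is a hypothetical illustration of that mechanism in plain Python; it is not VirtualiZarr's actual code, and the class name and shape are made up for the example.

```python
class ToyLazyArray:
    """Toy stand-in for a lazy array that cannot materialise data (hypothetical)."""

    def __init__(self, shape):
        self.shape = shape

    def __getitem__(self, indexer):
        if not isinstance(indexer, tuple):
            indexer = (indexer,)
        # Only no-op indexers (full slices or Ellipsis) are accepted.
        if all(i is Ellipsis or i == slice(None) for i in indexer):
            return self
        raise NotImplementedError(f"Doesn't support slicing with {indexer}")


arr = ToyLazyArray(shape=(1, 100))
same = arr[...]  # no-op indexing returns the same object

try:
    arr[(None, slice(None))]  # None asks for a new leading axis
    failed = False
except NotImplementedError as err:
    failed = True
    print(err)
```

In this sketch any indexer containing None is rejected, which mirrors the failure mode reported for the expand_dims call in the cell above.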

Combine virtual datasets¶

combined_vds = xr.concat(virtual_datasets, dim="time")
combined_vds

Shut down the Dask cluster¶

client.shutdown()
