
Generating virtual datasets from GeoTIFF files

ARG

Overview

In this tutorial we will cover:

  1. How to generate virtual datasets from GeoTIFFs.

  2. How to combine virtual datasets.

Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| Basics of virtual Zarr stores | Required | Core |
| Multi-file virtual datasets with VirtualiZarr | Required | Core |
| Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask | Required | Core |
| Introduction to Xarray | Required | IO/Visualization |

  • Time to learn: 30 minutes


About the Dataset

The Finnish Meteorological Institute (FMI) Weather Radar Dataset is a collection of GeoTIFF files containing multiple radar-specific variables, such as rainfall intensity, precipitation accumulation (in 1, 12 and 24 hour increments), radar reflectivity, radial velocity, rain classification and cloud top height. It is available through the AWS public data portal and is updated frequently.

More details on this dataset can be found here.

import logging
from datetime import datetime

import dask
import fsspec
import rioxarray
import s3fs
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset

Examining a Single GeoTIFF File

Before we use VirtualiZarr to create virtual datasets for multiple files, we can load a single GeoTIFF file to examine it.

# URL pointing to a single GeoTIFF file
url = "s3://fmi-opendata-radar-geotiff/2023/07/01/FIN-ACRR-3067-1KM/202307010100_FIN-ACRR1H-3067-1KM.tif"

# Initialize an S3 filesystem
fs = s3fs.S3FileSystem(anon=True)

xds = rioxarray.open_rasterio(fs.open(url))
xds
xds.isel(band=0).where(xds < 2000).plot()
<Figure size 640x480 with 2 Axes>

Create Input File List

Here we use fsspec's glob functionality with the * wildcard to list GeoTIFF files in an S3 bucket. The glob results omit the protocol, so we prepend s3:// to each path and sort the list.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob(
    "s3://fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/*24H-3067-1KM.tif"
)
# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])
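To make the selection step concrete, here is a minimal offline sketch of what the wildcard picks out. It uses the standard library's fnmatch as a stand-in for fsspec's glob (the two differ in details such as how `*` treats `/`), and the object keys are hypothetical names modelled on the single-file URL used earlier, not a real bucket listing.

```python
from fnmatch import fnmatch

# Hypothetical object keys, shaped like glob results (no "s3://" prefix).
prefix = "fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/"
keys = [
    prefix + "202301010200_FIN-ACRR24H-3067-1KM.tif",
    prefix + "202301010100_FIN-ACRR24H-3067-1KM.tif",
    prefix + "202301010100_FIN-ACRR1H-3067-1KM.tif",
    prefix + "202301010100_FIN-ACRR12H-3067-1KM.tif",
]

# Same wildcard as in the cell above: keep only the 24-hour accumulation files.
pattern = prefix + "*24H-3067-1KM.tif"

# Re-attach the protocol and sort the matching keys.
selected = sorted("s3://" + key for key in keys if fnmatch(key, pattern))

print(len(selected))  # 2
print(selected[0].split("/")[-1])  # 202301010100_FIN-ACRR24H-3067-1KM.tif
```

Because each filename starts with a fixed-width YYYYMMDDHHMM stamp, sorting the paths alphabetically also puts them in chronological order.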

Start a Dask Client

To parallelize the creation of our virtual datasets, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.

client = Client(n_workers=8, silence_logs=logging.ERROR)
client
def generate_virtual_dataset(file):
    storage_options = dict(
        anon=True, default_fill_cache=False, default_cache_type="none"
    )
    vds = open_virtual_dataset(
        file,
        indexes={},
        filetype="tiff",
        reader_options={
            "remote_options": {"anon": True},
            "storage_options": storage_options,
        },
    )
    # Pre-process virtual datasets to extract time step information from the filename
    subst = file.split("/")[-1].split(".tif")[0].split("_")[0]
    time_val = datetime.strptime(subst, "%Y%m%d%H%M")
    vds = vds.expand_dims(dim={"time": [time_val]})
    # Only include the raw data, not the overviews
    vds = vds[["0"]]
    return vds
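The time coordinate comes entirely from the filename, so it can help to check the parsing step on its own. The sketch below is a standalone version of that step; the helper name `timestamp_from_path` is hypothetical, and the path is the single-file URL used earlier in this notebook.

```python
from datetime import datetime


def timestamp_from_path(path):
    """Parse the leading YYYYMMDDHHMM stamp from an FMI radar file path."""
    filename = path.split("/")[-1]  # drop the bucket and directory parts
    stamp = filename.split("_")[0]  # text before the first underscore
    return datetime.strptime(stamp, "%Y%m%d%H%M")


path = (
    "s3://fmi-opendata-radar-geotiff/2023/07/01/FIN-ACRR-3067-1KM/"
    "202307010100_FIN-ACRR1H-3067-1KM.tif"
)
print(timestamp_from_path(path))  # 2023-07-01 01:00:00
```

A filename that does not start with a 12-digit stamp raises a ValueError from `strptime`, which is usually preferable to silently assigning a wrong time step.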
# Generate Dask Delayed objects
tasks = [dask.delayed(generate_virtual_dataset)(file) for file in files_paths]
# Start parallel processing
import warnings

warnings.filterwarnings("ignore")
virtual_datasets = dask.compute(*tasks)
2025-10-09 00:40:52,356 - distributed.worker - ERROR - Compute Failed
Key:       generate_virtual_dataset-a8ceab2d-de93-49f9-833c-635277d8153a
State:     executing
Task:  <Task 'generate_virtual_dataset-a8ceab2d-de93-49f9-833c-635277d8153a' generate_virtual_dataset(...)>
Exception: 'TypeError("open_virtual_dataset() got an unexpected keyword argument \'indexes\'")'
Traceback: '  File "/tmp/ipykernel_4375/4131322973.py", line 5, in generate_virtual_dataset\n'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 5
      2 import warnings
      4 warnings.filterwarnings("ignore")
----> 5 virtual_datasets = dask.compute(*tasks)

File ~/micromamba/envs/kerchunk-cookbook/lib/python3.13/site-packages/dask/base.py:681, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    678     expr = expr.optimize()
    679     keys = list(flatten(expr.__dask_keys__()))
--> 681     results = schedule(expr, keys, **kwargs)
    683 return repack(results)

Cell In[7], line 5, in generate_virtual_dataset()
      1 def generate_virtual_dataset(file):
      2     storage_options = dict(
      3         anon=True, default_fill_cache=False, default_cache_type="none"
      4     )
----> 5     vds = open_virtual_dataset(
      6         file,
      7         indexes={},
      8         filetype="tiff",
      9         reader_options={
     10             "remote_options": {"anon": True},
     11             "storage_options": storage_options,
     12         },
     13     )
     14     # Pre-process virtual datasets to extract time step information from the filename
     15     subst = file.split("/")[-1].split(".json")[0].split("_")[0]

TypeError: open_virtual_dataset() got an unexpected keyword argument 'indexes'
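
The traceback shows that the installed `open_virtual_dataset` does not accept the `indexes` keyword, so every task fails and `virtual_datasets` is never assigned. One way to catch this kind of mismatch before dispatching work to a cluster is to inspect the function signature locally. The sketch below is illustrative only: `unsupported_keywords` is a hypothetical helper, and `open_dataset_stub` is a made-up stand-in, not the real VirtualiZarr signature.

```python
import inspect


def unsupported_keywords(func, **kwargs):
    """Return the keyword names in kwargs that func would reject."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means any keyword is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    accepted = {
        name
        for name, p in params.items()
        if p.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    return [name for name in kwargs if name not in accepted]


# Hypothetical stand-in; this is NOT the real open_virtual_dataset signature.
def open_dataset_stub(url, *, drop_variables=None):
    return url


print(unsupported_keywords(open_dataset_stub, indexes={}, drop_variables=None))
# ['indexes']
```

Passing the real function together with the keywords used in `generate_virtual_dataset` would list whichever ones the installed release rejects. Calling `generate_virtual_dataset(files_paths[0])` once locally before `dask.compute` surfaces the same error with a much shorter traceback.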

Combine virtual datasets

combined_vds = xr.concat(virtual_datasets, dim="time")
combined_vds

Shut down the Dask cluster

client.shutdown()