Generating virtual datasets from GeoTIFF files

Overview¶
In this tutorial we will cover:

- How to generate virtual datasets from GeoTIFFs.
- Combining multiple virtual datasets.
Prerequisites¶
| Concepts | Importance | Notes |
|---|---|---|
| Basics of virtual Zarr stores | Required | Core |
| Multi-file virtual datasets with VirtualiZarr | Required | Core |
| Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask | Required | Core |
| Introduction to Xarray | Required | IO/Visualization |
Time to learn: 30 minutes
About the Dataset¶
The Finnish Meteorological Institute (FMI) Weather Radar Dataset is a collection of GeoTIFF files containing multiple radar-specific variables, such as rainfall intensity, precipitation accumulation (in 1, 12, and 24 hour increments), radar reflectivity, radial velocity, rain classification, and cloud top height. It is available through the AWS public data portal and is updated frequently.
More details on this dataset can be found here.
import logging
from datetime import datetime
import dask
import fsspec
import rioxarray
import s3fs
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset
Examining a Single GeoTIFF File¶
Before we create virtual references for multiple files, we can load a single GeoTIFF file to examine it.
# URL pointing to a single GeoTIFF file
url = "s3://fmi-opendata-radar-geotiff/2023/07/01/FIN-ACRR-3067-1KM/202307010100_FIN-ACRR1H-3067-1KM.tif"
# Initialize an s3 filesystem
fs = s3fs.S3FileSystem(anon=True)
xds = rioxarray.open_rasterio(fs.open(url))
xds
xds.isel(band=0).where(xds < 2000).plot()
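Note that the filename in the URL above encodes the observation timestamp (`202307010100`, i.e. 2023-07-01 01:00 UTC), which we will later parse into a time coordinate. A minimal stdlib sketch of that parsing, using the same example path:

```python
from datetime import datetime

url = "s3://fmi-opendata-radar-geotiff/2023/07/01/FIN-ACRR-3067-1KM/202307010100_FIN-ACRR1H-3067-1KM.tif"

# The timestamp is the first underscore-separated token of the filename
stamp = url.split("/")[-1].split("_")[0]
time_val = datetime.strptime(stamp, "%Y%m%d%H%M")
print(time_val)  # 2023-07-01 01:00:00
```

We will reuse exactly this pattern inside the per-file preprocessing function below.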
Create Input File List¶
Here we are using fsspec's glob functionality along with the * wildcard character to grab a list of GeoTIFF files from an s3 fsspec filesystem.
# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)
files_paths = fs_read.glob(
"s3://fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/*24H-3067-1KM.tif"
)
# Here we prepend the prefix 's3://', which points to AWS.
files_paths = sorted(["s3://" + f for f in files_paths])
Start a Dask Client¶
To parallelize the creation of our reference files, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.
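Before spinning up the cluster, here is a minimal sketch of the delayed/compute pattern we will use, with a toy doubling function standing in for the real per-file task:

```python
import dask

# Toy stand-in for the per-file function we will parallelize below
def double(x):
    return 2 * x

# Build a list of lazy tasks, then execute them all at once
tasks = [dask.delayed(double)(i) for i in range(4)]
results = dask.compute(*tasks)
print(results)  # (0, 2, 4, 6)
```

`dask.delayed` defers execution, so no work happens until `dask.compute` runs the whole task list on the scheduler.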
client = Client(n_workers=8, silence_logs=logging.ERROR)
client
def generate_virtual_dataset(file):
    storage_options = dict(
        anon=True, default_fill_cache=False, default_cache_type="none"
    )
    vds = open_virtual_dataset(
        file,
        filetype="tiff",
        reader_options={
            "remote_options": {"anon": True},
            "storage_options": storage_options,
        },
    )
    # Pre-process virtual datasets to extract time step information from the filename
    subst = file.split("/")[-1].split("_")[0]
    time_val = datetime.strptime(subst, "%Y%m%d%H%M")
    vds = vds.expand_dims(dim={"time": [time_val]})
    # Only include the raw data, not the overviews
    vds = vds[["0"]]
    return vds
# Generate Dask Delayed objects
tasks = [dask.delayed(generate_virtual_dataset)(file) for file in files_paths]
# Start parallel processing
import warnings
warnings.filterwarnings("ignore")
virtual_datasets = dask.compute(*tasks)
Combine virtual datasets¶
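`xr.concat` stitches the per-file datasets together along the new `time` dimension. A toy in-memory illustration with two single-timestep datasets (plain arrays here, standing in for the virtual chunk references):

```python
import xarray as xr

# Two single-time datasets, mimicking the per-file datasets above;
# "0" matches the variable name kept by generate_virtual_dataset
ds1 = xr.Dataset({"0": ("time", [1.0])}, coords={"time": [0]})
ds2 = xr.Dataset({"0": ("time", [2.0])}, coords={"time": [1]})

# Concatenate along the shared "time" dimension
combined = xr.concat([ds1, ds2], dim="time")
print(combined.sizes)  # {'time': 2}
```

The real call below works the same way, except each element carries chunk references instead of in-memory values.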
combined_vds = xr.concat(virtual_datasets, dim="time")
combined_vds
Shut down the Dask cluster¶
client.shutdown()