NetCDF

Generating virtual datasets from NetCDF files

Overview

Within this notebook, we will cover:

How to access remote NetCDF data using VirtualiZarr and Kerchunk
Combining multiple virtual datasets

This notebook shares many similarities with the multi-file virtual datasets with VirtualiZarr notebook. If you are unsure about the purpose of a block of code, please refer to that notebook for a more detailed breakdown of what each line is doing.

Prerequisites

Concepts	Importance	Notes
Basics of virtual Zarr stores	Required	Core
Multi-file virtual datasets with VirtualiZarr	Required	Core
Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask	Required	Core
Introduction to Xarray	Required	IO/Visualization

Time to learn: 45 minutes

Motivation

NetCDF4/HDF5 is one of the most widely adopted file formats in the earth sciences, supported by much of the community as well as by scientific agencies, data centers, and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using VirtualiZarr and Kerchunk, we can read these datasets as if they were stored in an Analysis-Ready Cloud-Optimized (ARCO) format such as Zarr.

About the Dataset

For this example, we will look at a weather dataset composed of multiple NetCDF files. SMN-Arg is a deterministic WRF weather forecasting dataset created by the Servicio Meteorológico Nacional de Argentina. It covers Argentina and many neighboring countries at a 4 km spatial resolution.

The model is initialized twice daily, at 00 and 12 UTC, and produces hourly forecasts of variables such as temperature, relative humidity, precipitation, and wind direction and magnitude at multiple atmospheric levels. The output is stored in NetCDF files, with a maximum forecast lead time of 72 hours.

More details on this dataset can be found here.

Flags

In the cell below, set subset_flag to True (the default) to process only a subset of the file list, or to False to process the full file list. Keeping it set to True is recommended.

subset_flag = True

Imports

import logging

import dask
import fsspec
import s3fs
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset

Examining a Single NetCDF File

Before we use VirtualiZarr to create virtual datasets for multiple files, we can load a single NetCDF file to examine it.

# URL pointing to a single NetCDF file
url = "s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/00/WRFDETAR_01H_20221231_00_072.nc"

# Initialize an anonymous (no credentials) S3 filesystem
fs = s3fs.S3FileSystem(anon=True)
# Use Xarray to open a remote NetCDF file
ds = xr.open_dataset(fs.open(url), engine="h5netcdf")

Here we see the repr of the Xarray Dataset for a single NetCDF file. From the output, we can tell that the Dataset dimensions are ['time','y','x'], with time containing only a single step. Later, when we use Xarray's combine_nested functionality, we will need to know which dimension to concatenate along.
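To make the role of the time dimension concrete, here is a minimal sketch that uses small in-memory datasets instead of the remote files. The variable name T2, the 2x3 grid, and the make_step helper are made up for illustration; only the dimension names mirror the repr above.

```python
import numpy as np
import xarray as xr


def make_step(hour):
    """Build a stand-in dataset holding one time step on a tiny 2x3 grid."""
    time = np.array([f"2022-12-31T{hour:02d}:00"], dtype="datetime64[ns]")
    data = np.full((1, 2, 3), float(hour))
    return xr.Dataset(
        {"T2": (("time", "y", "x"), data)},  # "T2" is a hypothetical variable name
        coords={"time": time},
    )


# One dataset per forecast hour, each with a time dimension of length 1
steps = [make_step(hour) for hour in (0, 1)]

# Join the datasets in list order along the existing "time" dimension
combined = xr.combine_nested(steps, concat_dim="time")

print(dict(combined.sizes))
```

Because each input has a time dimension of length one, the combined dataset has one time step per input, while y and x are unchanged.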

Create Input File List

Here we use fsspec's glob functionality with the * wildcard to list the NetCDF files under a prefix on an S3 fsspec filesystem. Because glob returns paths without the protocol, we prepend s3:// to each one.

# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")

# glob returns paths without the protocol, so prepend 's3://' to each one.
files_paths = sorted(["s3://" + f for f in files_paths])


# If the subset_flag == True (default), the list of input files will be subset
# to speed up the processing
if subset_flag:
    files_paths = files_paths[0:8]
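The list handling above can be tried offline with a few hard-coded paths. This is only a sketch: the three paths are hypothetical stand-ins for glob output, and reading the last numeric field of the file name as the forecast lead hour is an assumption inferred from the example URL, not something the dataset documentation is quoted on here.

```python
# Hypothetical glob output: keys come back without the "s3://" protocol
raw_paths = [
    "smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/WRFDETAR_01H_20221231_12_002.nc",
    "smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/WRFDETAR_01H_20221231_12_000.nc",
    "smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/WRFDETAR_01H_20221231_12_001.nc",
]


def lead_hour(path):
    """Return the trailing numeric field of the file name (assumed lead hour)."""
    stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return int(stem.rsplit("_", 1)[-1])


# Prepend the protocol and sort; zero-padded fields sort in numeric order
paths = sorted("s3://" + p for p in raw_paths)

# Keep only the first two files, mirroring the subset step
subset = paths[0:2]

print([lead_hour(p) for p in paths])
```

Sorting matters because the datasets are later concatenated in list order, so the file list should already be in the intended sequence.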

Start a Dask Client

To parallelize the creation of our virtual datasets, we will use Dask. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: Kerchunk and Dask.

client = Client(n_workers=8, silence_logs=logging.ERROR)
client

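def generate_virtual_dataset(file, storage_options):
    # Open one remote NetCDF file as a virtual dataset.
    # indexes={} skips building in-memory Xarray indexes for the coordinates.
    return open_virtual_dataset(
        file, indexes={}, reader_options={"storage_options": storage_options}
    )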


storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="none")
# Generate Dask Delayed objects
tasks = [
    dask.delayed(generate_virtual_dataset)(file, storage_options)
    for file in files_paths
]

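# Silence warnings raised while the virtual datasets are created
import warnings

warnings.filterwarnings("ignore")

# Start parallel processing; results come back in the same order as the tasks
virtual_datasets = list(dask.compute(*tasks))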
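The delayed-then-compute pattern used above can be seen in isolation with a stand-in function. In this sketch, describe and the placeholder file names are hypothetical, and scheduler="synchronous" is chosen only so the example runs in the current thread without a cluster; the notebook itself relies on the distributed client instead.

```python
import dask


def describe(path):
    """Stand-in for the real per-file work, such as opening a virtual dataset."""
    return {"path": path, "name_length": len(path)}


paths = ["a.nc", "bb.nc", "ccc.nc"]  # placeholder file names

# Wrapping the call with dask.delayed builds a lazy task; nothing runs yet
tasks = [dask.delayed(describe)(p) for p in paths]

# dask.compute returns a tuple with one result per task, in input order
results = list(dask.compute(*tasks, scheduler="synchronous"))

print(len(results))
```

Because the results keep the order of the task list, a sorted input file list yields datasets that are already in sequence for a later concatenation.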

Shut down the Dask cluster

client.shutdown()