Creating a More Cloud-Native METAR Format

We will convert the dynamical Meteorological Aerodrome Report from a parquet format to instead us Xarray’s DataTree

Overview¶

Within this notebook, we will create an interactive visualization of the latest METAR data across all stations. We will use the following libraries for our visualizations:

Xarray

Prerequisites¶

Concepts	Importance	Notes
Parquet	Required	Cloud Computing
Zarr	Required	Cloud Computing

Time to learn: 10 minutes

import duckdb
import numpy as np
import pandas as pd
import geopandas as gpd
from datetime import datetime, timedelta, timezone
import hvplot
import hvplot.pandas
import holoviews as hv
from palettable.colorbrewer.diverging import BrBG_10
import xarray as xr
from pathlib import Path

import warnings
warnings.filterwarnings('ignore')

We will just be working with the past years worth of data, so the parquet file can be found here.

url = 'https://data.source.coop/dynamical/asos-parquet/year=2026/data.parquet'

Get the past three days of data¶

This query will request just the latest hours worth of data across all stations.

time_1 = datetime.now()
print(f"end time: {time_1}")
time_0 = datetime.now() - timedelta(hours=72)
print(f"start time: {time_0}")

end time: 2026-06-30 04:15:24.345678
start time: 2026-06-27 04:15:24.345806

Query the database for data between timestamps requires SQL-like queries¶

df = duckdb.execute("""
    SELECT *
    FROM read_parquet($1, hive_partitioning=true)
    WHERE 
---      country = 'US' AND
    valid BETWEEN $2 AND $3
    ORDER BY country
""", [url, time_0, time_1]).fetchdf()

Print the first couple columns of data to get an understanding of the structure

df.head(2)

Understanding the data columns¶

The documentation can be found here for the individual columns:

https://github.com/dynamical-org/asos-parquet#key-fields

Print the columns as follows:

df.columns

Index(['station', 'valid', 'longitude', 'latitude', 'tmpf', 'tmpc', 'dwpf',
       'dwpc', 'relh', 'drct', 'sknt', 'gust', 'alti', 'mslp', 'vsby', 'p01i',
       'p01m', 'state', 'geometry', 'name', 'elevation', 'country', 'county',
       'wfo', 'tzname', 'bbox', 'year'],
      dtype='str')

Now create bind descriptions to each of the columns

variables = {
    'tmpf': "Temperature in Fahrenheit",
    'tmpc': "Temperature in Celsius",
    'dwpf': "Dewpoint in Fahrenheit",
    'dwpc': "Dewpoint in Celsius", 
    'relh': "Relative humidity in percent", 
    'drct': "Wind direction in degrees", 
    'sknt': "Wind speed in knots", 
    'gust': "Wind gusts in knots", 
    'alti': "Altimeter setting in inches of mercury", 
    'mslp': "Mean sea level pressure", 
    'vsby': "Visibility in statute miles", 
    'p01i': "P01I", # 1-hour precipitation inches
    'p01m': "P01M" # 1-hour precipitation mm
}

Plot Timeseries data from a station local to Boulder, Colorado¶

Select all the data from the Boulder station, “BDU” and plot it.

df_bdu = df.loc[df['station'] == "BDU"]
df_bdu.head(2)

df_select = pd.DataFrame({
    'date': df_bdu['valid'],
    'tmpc': df_bdu['tmpc'],
    'dwpc': df_bdu['dwpc'],
    'relh': df_bdu['relh'],
    'drct': df_bdu['drct'],
    'sknt': df_bdu['sknt'],
    'alti': df_bdu['alti'],
}).set_index('date')

# print the head of the dataframe
df_select.head(2)

Iterate through some variables and compose an interactive plot of the past three days

hv.Layout(
    [df_select[i].hvplot.line(title=f"{variables[i]}", width=400) for i in [
        'tmpc',
        'relh',
    ]]
).cols(1)

METAR Parquet Antipatterns¶

There are a couple antipatterns that we are currently using in this cloud computing scheme:

partitioning --> the data is currently not accessed through a single catalog
- we need to derive the start & end time to query specific files
compression --> similar data is not stored adjacent across rows of the parquet file
expressions --> data is not being accessed in a pythonic manner, we are instead relying on SQL-like queries
geospatial information is not changing, station attributes such as “lat/lon” are stored in a redundant manner

We would like to access the data using a more pythonic approach. The underlying data is timeseries, so the goal is to store it as timeseries DataArrays.

All of these problems can be fixed by Xarray’s DataTree implementation.

df.head(2)

# assemble all of the data for one station
df_bdu = df.loc[df['station'] == "BDU"]
df_bdu_select = pd.DataFrame({
    'station': df_bdu['station'],
    'date': df_bdu['valid'],
    'latitude': df_bdu['latitude'],
    'longitude': df_bdu['longitude'],
    'tmpf': df_bdu['tmpf'],
    'dwpf': df_bdu['dwpf'],
    'relh': df_bdu['relh'],
    'drct': df_bdu['drct'],
    'sknt': df_bdu['sknt'],
    'gust': df_bdu['gust'],
    'alti': df_bdu['alti'],
    # 'mslp': df_bdu['mslp'],
    'vsby': df_bdu['vsby'],
    'p01i': df_bdu['p01i'],
}).set_index('date')

df_bdu_select.head(2)

latitude = df_bdu['latitude'].iloc[0]
longitude = df_bdu['longitude'].iloc[0]
elevation = df_bdu['elevation'].iloc[0]
print(f"Latitude: {latitude}, Longitude: {longitude}, Elevation: {elevation}")

Latitude: 40.0394, Longitude: -105.2258, Elevation: 1612.0

station = df_bdu['station'].iloc[0]
country = df_bdu['country'].iloc[0]
tzname = df_bdu['tzname'].iloc[0]
print(f"Station: {station}, Country: {country} and time zone name: {tzname}")

Station: BDU, Country: US and time zone name: America/Denver

# create an xarray dataarra for each variable
time = df_bdu_select.index.sort_values()

latitude = df_bdu_select['latitude'].values[0]
longitude = df_bdu_select['longitude'].values[0]

temperature = df_bdu_select['tmpf'].values
dew_point = df_bdu['dwpf'].values
relative_humidity = df_bdu['relh'].values
wind_direction = df_bdu['drct'].values
wind_speed = df_bdu['sknt'].values
wind_gust = df_bdu['gust'].values
pressure = df_bdu['alti'].values
visibility = df_bdu['vsby'].values
one_hour_precipitation = df_bdu['p01i'].values

# help(time.sort_values) #()

Create all of the DataArrays with associated labels and units¶

time = df_bdu_select.index.sort_values().tz_localize(None)
time_da = xr.DataArray(
    data=time,
    dims=("time"),
    name="time",
    attrs=dict(
        long_name="Timestamp of each ping",
        standard_name="time",
    )
)
temperature_da = xr.DataArray(
    data=temperature,
    dims=["time"],
    coords=dict(
        time=time_da,
    ),
    attrs=dict(
        long_name="Temperature in Fahrenheit",
        standard_name="Temperature",
        units="degF",
    ),
)
dew_point_da = xr.DataArray(
    data=dew_point,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Dewpoint in Fahrenheit",
        standard_name="Dewpoint",
        units="degF",
    ),
)
relative_humidity_da = xr.DataArray(
    data=relative_humidity,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Relative humidity in Percent",
        standard_name="Relative Humidity",
        units="percent",
    ),
)
wind_direction_da = xr.DataArray(
    data=wind_direction,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Wind Direction in Degrees",
        standard_name="Wind Direction",
        units="degrees",
    ),
)
wind_speed_da = xr.DataArray(
    data=wind_speed,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Wind Speed in Knots",
        standard_name="Wind Speed",
        units="knots",
    ),
)
wind_gust_da = xr.DataArray(
    data=wind_gust,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Wind Gust in Knots",
        standard_name="Wind Gust",
        units="knots",
    ),
)
pressure_da = xr.DataArray(
    data=pressure,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Pressure in inHg",
        standard_name="Pressure",
        units="inHg",
    ),
)
visibility_da = xr.DataArray(
    data=visibility,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Visibility in Miles",
        standard_name="Visibility",
        units="miles",
    ),
)
one_hour_precipitation_da = xr.DataArray(
    data=one_hour_precipitation,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="One Hour Precipitation in Inches",
        standard_name="Precipitation",
        units="inches",
    ),
)
# one_hour_precipitation_da

Create and Xarray Dataset¶

Combine all of the DataArray variables together and include attributes that don’t change for the DataSet such as the GPS location and the elevation.

ds = xr.Dataset(
    data_vars=dict(
        temperature=temperature_da,
        dew_point=dew_point_da,
        relative_humidity=relative_humidity_da,
        wind_direction=wind_direction_da,
        wind_speed=wind_speed_da,
        wind_gust=wind_gust_da,
        pressure=pressure_da,
        visibility=visibility_da,
        precipitation=one_hour_precipitation_da,
    ),
    coords=dict(
        time=time_da,
    ),
    attrs=dict(
        station=station,
        country=country,
        timezone=tzname,
        latitude=latitude,
        longitude=longitude,
        elevation=elevation
    ),
)
ds

Create an Xarray DataTree¶

The solution to having disperate datasets will be to join them together using Xarray’s DataTree implementation. It is a graph like structure, so it uses nodes and edges to connect dataset in a hierarchical like manner.

Here is an example of an Xarray DataTree with a family:

bart = xr.DataTree(name="Bart")
lisa = xr.DataTree(name="Lisa")
homer = xr.DataTree(name="Homer", children={"Bart": bart, "Lisa": lisa})
print(homer)

<xarray.DataTree 'Homer'>
Group: /
├── Group: /Bart
└── Group: /Lisa

To access a child I just denote them by name:

homer['Lisa']

First example: Define the METAR station for Boulder, ‘BDU’ and add it as a child to the parent

bdu = xr.DataTree(dataset=ds, name=ds.station)

metar = xr.DataTree(name="metar", children={"bdu": bdu})

So “metar” now has the child dataset, “bdu”¶

And note that there are attributes specific to that station tracked with that

metar.children

Frozen({'bdu': <xarray.DataTree 'bdu'>
Group: /bdu
    Dimensions:            (time: 210)
    Coordinates:
      * time               (time) datetime64[us] 2kB 2026-06-27T04:35:00 ... 2026...
    Data variables:
        temperature        (time) float64 2kB 71.6 73.4 75.2 73.4 ... 78.8 80.6 78.8
        dew_point          (time) float64 2kB 44.6 41.0 41.0 41.0 ... 30.2 24.8 26.6
        relative_humidity  (time) float64 2kB 37.92 31.08 29.26 ... 16.92 12.77 14.6
        wind_direction     (time) float64 2kB 240.0 260.0 260.0 ... 190.0 0.0 230.0
        wind_speed         (time) float64 2kB 6.0 20.0 10.0 9.0 ... 4.0 3.0 0.0 4.0
        wind_gust          (time) float64 2kB nan 33.0 23.0 nan ... 15.0 nan nan nan
        pressure           (time) float64 2kB 29.86 29.89 29.86 ... 29.89 29.9 29.91
        visibility         (time) float64 2kB 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0
        precipitation      (time) float64 2kB 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    Attributes:
        station:    BDU
        country:    US
        timezone:   America/Denver
        latitude:   40.0394
        longitude:  -105.2258
        elevation:  1612.0})

We can store the data in a single cloud-native dataset now, no need to partition because the data is automatically chunked¶

The dataset can be saved as follows:

metar.to_zarr(store="metar.zarr", mode='a')

<xarray.backends.zarr.ZarrStore at 0x7f97788e13a0>

Create a datatree of some Colorado stations¶

Pick a couple example stations to save:

select_stations = ['BDU', 'DEN', 'DRO']

Iterate through all selected stations and add them to the parent¶

all_children = {}
for stn in select_stations:
    # print(f"station: {stn}")
    dff = df.loc[df['station'] == stn]
    dff_select = pd.DataFrame({
        'station': dff['station'],
        'date': dff['valid'],
        'latitude': dff['latitude'],
        'longitude': dff['longitude'],
        'tmpf': dff['tmpf'],
        'dwpf': dff['dwpf'],
        'relh': dff['relh'],
        'drct': dff['drct'],
        'sknt': dff['sknt'],
        'gust': dff['gust'],
        'alti': dff['alti'],
        'vsby': dff['vsby'],
        'p01i': dff['p01i'],
    }).set_index('date')
    latitude = dff['latitude'].iloc[0]
    longitude = dff['longitude'].iloc[0]
    elevation = dff['elevation'].iloc[0]
    station = dff['station'].iloc[0]
    country = dff['country'].iloc[0]
    tzname = dff['tzname'].iloc[0]
    time = dff_select.index.sort_values()
    # print(np.diff(time))
    latitude = dff_select['latitude'].values[0]
    longitude = dff_select['longitude'].values[0]
    temperature = dff_select['tmpf'].values
    dew_point = dff['dwpf'].values
    relative_humidity = dff['relh'].values
    wind_direction = dff['drct'].values
    wind_speed = dff['sknt'].values
    wind_gust = dff['gust'].values
    pressure = dff['alti'].values
    visibility = dff['vsby'].values
    one_hour_precipitation = dff['p01i'].values
    # time.tz_localize(None)
    time = dff_select.index.sort_values().tz_localize(None)
    time_da = xr.DataArray(
        data=time,
        dims=("time"),
        name="time",
        attrs=dict(
            # calendar="proleptic_gregorian",
            # units="seconds since 1970-01-01 00:00:00",
            long_name="Timestamp of each ping",
            standard_name="time",
        )
    )
    temperature_da = xr.DataArray(
        data=temperature,
        dims=["time"],
        coords=dict(
            time=time_da,
        ),
        attrs=dict(
            long_name="Temperature in Fahrenheit",
            standard_name="Temperature",
            units="degF",
        ),
    )
    # temperature_da
    dew_point_da = xr.DataArray(
        data=dew_point,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Dewpoint in Fahrenheit",
            standard_name="Dewpoint",
            units="degF",
        ),
    )
    relative_humidity_da = xr.DataArray(
        data=relative_humidity,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Relative humidity in Percent",
            standard_name="Relative Humidity",
            units="percent",
        ),
    )
    wind_direction_da = xr.DataArray(
        data=wind_direction,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Wind Direction in Degrees",
            standard_name="Wind Direction",
            units="degrees",
        ),
    )
    wind_speed_da = xr.DataArray(
        data=wind_speed,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Wind Speed in Knots",
            standard_name="Wind Speed",
            units="knots",
        ),
    )
    wind_gust_da = xr.DataArray(
        data=wind_gust,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Wind Gust in Knots",
            standard_name="Wind Gust",
            units="knots",
        ),
    )
    pressure_da = xr.DataArray(
        data=pressure,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Pressure in inHg",
            standard_name="Pressure",
            units="inHg",
        ),
    )
    visibility_da = xr.DataArray(
        data=visibility,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Visibility in Miles",
            standard_name="Visibility",
            units="miles",
        ),
    )
    one_hour_precipitation_da = xr.DataArray(
        data=one_hour_precipitation,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="One Hour Precipitation in Inches",
            standard_name="Precipitation",
            units="inches",
        ),
    )
    ds = xr.Dataset(
        data_vars=dict(
            temperature=temperature_da,
            dew_point=dew_point_da,
            relative_humidity=relative_humidity_da,
            wind_direction=wind_direction_da,
            wind_speed=wind_speed_da,
            wind_gust=wind_gust_da,
            pressure=pressure_da,
            visibility=visibility_da,
            precipitation=one_hour_precipitation_da,
        ),
        coords=dict(
            time=time_da,
        ),
        attrs=dict(
            station=station,
            country=country,
            timezone=tzname,
            latitude=latitude,
            longitude=longitude,
            elevation=elevation
        ),
    )
    # print(ds)
    temp_station = xr.DataTree(dataset=ds, name=ds.station)
    all_children[ds.station] = temp_station

len(all_children)

3

Create Consolidated DataTree of the Station Data¶

metar = xr.DataTree(name="metar", children=all_children)
metar.children

Frozen({'BDU': <xarray.DataTree 'BDU'>
Group: /BDU
    Dimensions:            (time: 210)
    Coordinates:
      * time               (time) datetime64[us] 2kB 2026-06-27T04:35:00 ... 2026...
    Data variables:
        temperature        (time) float64 2kB 71.6 73.4 75.2 73.4 ... 78.8 80.6 78.8
        dew_point          (time) float64 2kB 44.6 41.0 41.0 41.0 ... 30.2 24.8 26.6
        relative_humidity  (time) float64 2kB 37.92 31.08 29.26 ... 16.92 12.77 14.6
        wind_direction     (time) float64 2kB 240.0 260.0 260.0 ... 190.0 0.0 230.0
        wind_speed         (time) float64 2kB 6.0 20.0 10.0 9.0 ... 4.0 3.0 0.0 4.0
        wind_gust          (time) float64 2kB nan 33.0 23.0 nan ... 15.0 nan nan nan
        pressure           (time) float64 2kB 29.86 29.89 29.86 ... 29.89 29.9 29.91
        visibility         (time) float64 2kB 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0
        precipitation      (time) float64 2kB 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    Attributes:
        station:    BDU
        country:    US
        timezone:   America/Denver
        latitude:   40.0394
        longitude:  -105.2258
        elevation:  1612.0, 'DEN': <xarray.DataTree 'DEN'>
Group: /DEN
    Dimensions:            (time: 75)
    Coordinates:
      * time               (time) datetime64[us] 600B 2026-06-27T04:53:00 ... 202...
    Data variables:
        temperature        (time) float64 600B 67.0 66.0 65.0 ... 84.0 79.0 70.0
        dew_point          (time) float64 600B 59.0 58.0 56.0 ... 28.0 30.0 32.0
        relative_humidity  (time) float64 600B 75.5 75.41 72.64 ... 16.67 24.43
        wind_direction     (time) float64 600B 160.0 170.0 170.0 ... 40.0 80.0 110.0
        wind_speed         (time) float64 600B 22.0 15.0 12.0 11.0 ... 12.0 8.0 8.0
        wind_gust          (time) float64 600B 34.0 nan nan nan ... 19.0 nan nan nan
        pressure           (time) float64 600B 29.89 29.89 29.89 ... 29.86 29.89
        visibility         (time) float64 600B 10.0 10.0 10.0 ... 10.0 10.0 10.0
        precipitation      (time) float64 600B 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    Attributes:
        station:    DEN
        country:    US
        timezone:   America/Denver
        latitude:   39.8328
        longitude:  -104.6575
        elevation:  1656.0, 'DRO': <xarray.DataTree 'DRO'>
Group: /DRO
    Dimensions:            (time: 72)
    Coordinates:
      * time               (time) datetime64[us] 576B 2026-06-27T04:53:00 ... 202...
    Data variables:
        temperature        (time) float64 576B 57.0 55.0 54.0 ... 79.0 76.0 70.0
        dew_point          (time) float64 576B 51.0 51.0 51.0 ... 15.0 18.0 18.0
        relative_humidity  (time) float64 576B 80.31 86.35 89.55 ... 11.14 13.63
        wind_direction     (time) float64 576B 0.0 0.0 0.0 ... 250.0 250.0 360.0
        wind_speed         (time) float64 576B 0.0 0.0 0.0 4.0 ... 6.0 11.0 6.0 4.0
        wind_gust          (time) float64 576B nan nan nan nan ... nan 16.0 nan nan
        pressure           (time) float64 576B 30.18 30.16 30.17 ... 30.06 30.06
        visibility         (time) float64 576B 10.0 10.0 10.0 ... 10.0 10.0 10.0
        precipitation      (time) float64 576B 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    Attributes:
        station:    DRO
        country:    US
        timezone:   America/Denver
        latitude:   37.1515
        longitude:  -107.7538
        elevation:  2038.0})

Write all of the data as a single dataset¶

metar.to_zarr(store="metar_all.zarr", mode='a')

<xarray.backends.zarr.ZarrStore at 0x7f978282b7e0>

Here is how you would open the data¶

dt = xr.open_datatree("metar_all.zarr")
dt

To get a slice of 1 hours data you could do something like:

dt_bdu_sel = dt["BDU"].sel(time=slice("2026-06-18T00:00:00", "2026-06-18T01:00:00"))
dt_bdu_sel

And then to get the temperature data:

dt_bdu_sel.temperature

And the relative humidity:

dt_bdu_sel.relative_humidity

References¶

What’s next?¶

Expanding the DataTree implementation to the full dataset.