Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Creating a More Cloud-Native METAR Format

Authors
Affiliations
Colorado State University
UCAR | UCP | NSF Unidata
CIRES University of Colorado Boulder
Boise State University
Montclair State University
NSF NCAR EOL
University at Albany (SUNY)

We will convert the dynamical Meteorological Aerodrome Report from a parquet format to instead us Xarray’s DataTree

Overview

Within this notebook, we will create an interactive visualization of the latest METAR data across all stations. We will use the following libraries for our visualizations:

  1. Xarray

Prerequisites

ConceptsImportanceNotes
ParquetRequiredCloud Computing
ZarrRequiredCloud Computing
  • Time to learn: 10 minutes


import duckdb
import numpy as np
import pandas as pd
import geopandas as gpd
from datetime import datetime, timedelta, timezone
import hvplot
import hvplot.pandas
import holoviews as hv
from palettable.colorbrewer.diverging import BrBG_10
import xarray as xr
from pathlib import Path

import warnings
warnings.filterwarnings('ignore')
Loading...
Loading...
Loading...
Loading...

We will just be working with the past years worth of data, so the parquet file can be found here.

url = 'https://data.source.coop/dynamical/asos-parquet/year=2026/data.parquet'

Get the past three days of data

This query will request just the latest hours worth of data across all stations.

time_1 = datetime.now()
print(f"end time: {time_1}")
time_0 = datetime.now() - timedelta(hours=72)
print(f"start time: {time_0}")
end time: 2026-06-30 04:15:24.345678
start time: 2026-06-27 04:15:24.345806

Query the database for data between timestamps requires SQL-like queries

df = duckdb.execute("""
    SELECT *
    FROM read_parquet($1, hive_partitioning=true)
    WHERE 
---      country = 'US' AND
    valid BETWEEN $2 AND $3
    ORDER BY country
""", [url, time_0, time_1]).fetchdf()
Loading...

Print the first couple columns of data to get an understanding of the structure

df.head(2)
Loading...

Understanding the data columns

The documentation can be found here for the individual columns:

https://github.com/dynamical-org/asos-parquet#key-fields

Print the columns as follows:

df.columns
Index(['station', 'valid', 'longitude', 'latitude', 'tmpf', 'tmpc', 'dwpf', 'dwpc', 'relh', 'drct', 'sknt', 'gust', 'alti', 'mslp', 'vsby', 'p01i', 'p01m', 'state', 'geometry', 'name', 'elevation', 'country', 'county', 'wfo', 'tzname', 'bbox', 'year'], dtype='str')

Now create bind descriptions to each of the columns

variables = {
    'tmpf': "Temperature in Fahrenheit",
    'tmpc': "Temperature in Celsius",
    'dwpf': "Dewpoint in Fahrenheit",
    'dwpc': "Dewpoint in Celsius", 
    'relh': "Relative humidity in percent", 
    'drct': "Wind direction in degrees", 
    'sknt': "Wind speed in knots", 
    'gust': "Wind gusts in knots", 
    'alti': "Altimeter setting in inches of mercury", 
    'mslp': "Mean sea level pressure", 
    'vsby': "Visibility in statute miles", 
    'p01i': "P01I", # 1-hour precipitation inches
    'p01m': "P01M" # 1-hour precipitation mm
}

Plot Timeseries data from a station local to Boulder, Colorado

Select all the data from the Boulder station, “BDU” and plot it.

df_bdu = df.loc[df['station'] == "BDU"]
df_bdu.head(2)
Loading...
df_select = pd.DataFrame({
    'date': df_bdu['valid'],
    'tmpc': df_bdu['tmpc'],
    'dwpc': df_bdu['dwpc'],
    'relh': df_bdu['relh'],
    'drct': df_bdu['drct'],
    'sknt': df_bdu['sknt'],
    'alti': df_bdu['alti'],
}).set_index('date')

# print the head of the dataframe
df_select.head(2)
Loading...

Iterate through some variables and compose an interactive plot of the past three days

hv.Layout(
    [df_select[i].hvplot.line(title=f"{variables[i]}", width=400) for i in [
        'tmpc',
        'relh',
    ]]
).cols(1)
Loading...

METAR Parquet Antipatterns

There are a couple antipatterns that we are currently using in this cloud computing scheme:

  • partitioning --> the data is currently not accessed through a single catalog

    • we need to derive the start & end time to query specific files

  • compression --> similar data is not stored adjacent across rows of the parquet file

  • expressions --> data is not being accessed in a pythonic manner, we are instead relying on SQL-like queries

  • geospatial information is not changing, station attributes such as “lat/lon” are stored in a redundant manner

We would like to access the data using a more pythonic approach. The underlying data is timeseries, so the goal is to store it as timeseries DataArrays.

All of these problems can be fixed by Xarray’s DataTree implementation.

df.head(2)
Loading...
# assemble all of the data for one station
df_bdu = df.loc[df['station'] == "BDU"]
df_bdu_select = pd.DataFrame({
    'station': df_bdu['station'],
    'date': df_bdu['valid'],
    'latitude': df_bdu['latitude'],
    'longitude': df_bdu['longitude'],
    'tmpf': df_bdu['tmpf'],
    'dwpf': df_bdu['dwpf'],
    'relh': df_bdu['relh'],
    'drct': df_bdu['drct'],
    'sknt': df_bdu['sknt'],
    'gust': df_bdu['gust'],
    'alti': df_bdu['alti'],
    # 'mslp': df_bdu['mslp'],
    'vsby': df_bdu['vsby'],
    'p01i': df_bdu['p01i'],
}).set_index('date')

df_bdu_select.head(2)
Loading...
latitude = df_bdu['latitude'].iloc[0]
longitude = df_bdu['longitude'].iloc[0]
elevation = df_bdu['elevation'].iloc[0]
print(f"Latitude: {latitude}, Longitude: {longitude}, Elevation: {elevation}")
Latitude: 40.0394, Longitude: -105.2258, Elevation: 1612.0
station = df_bdu['station'].iloc[0]
country = df_bdu['country'].iloc[0]
tzname = df_bdu['tzname'].iloc[0]
print(f"Station: {station}, Country: {country} and time zone name: {tzname}")
Station: BDU, Country: US and time zone name: America/Denver
# create an xarray dataarra for each variable
time = df_bdu_select.index.sort_values()

latitude = df_bdu_select['latitude'].values[0]
longitude = df_bdu_select['longitude'].values[0]

temperature = df_bdu_select['tmpf'].values
dew_point = df_bdu['dwpf'].values
relative_humidity = df_bdu['relh'].values
wind_direction = df_bdu['drct'].values
wind_speed = df_bdu['sknt'].values
wind_gust = df_bdu['gust'].values
pressure = df_bdu['alti'].values
visibility = df_bdu['vsby'].values
one_hour_precipitation = df_bdu['p01i'].values
# help(time.sort_values) #()

Create all of the DataArrays with associated labels and units

time = df_bdu_select.index.sort_values().tz_localize(None)
time_da = xr.DataArray(
    data=time,
    dims=("time"),
    name="time",
    attrs=dict(
        long_name="Timestamp of each ping",
        standard_name="time",
    )
)
temperature_da = xr.DataArray(
    data=temperature,
    dims=["time"],
    coords=dict(
        time=time_da,
    ),
    attrs=dict(
        long_name="Temperature in Fahrenheit",
        standard_name="Temperature",
        units="degF",
    ),
)
dew_point_da = xr.DataArray(
    data=dew_point,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Dewpoint in Fahrenheit",
        standard_name="Dewpoint",
        units="degF",
    ),
)
relative_humidity_da = xr.DataArray(
    data=relative_humidity,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Relative humidity in Percent",
        standard_name="Relative Humidity",
        units="percent",
    ),
)
wind_direction_da = xr.DataArray(
    data=wind_direction,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Wind Direction in Degrees",
        standard_name="Wind Direction",
        units="degrees",
    ),
)
wind_speed_da = xr.DataArray(
    data=wind_speed,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Wind Speed in Knots",
        standard_name="Wind Speed",
        units="knots",
    ),
)
wind_gust_da = xr.DataArray(
    data=wind_gust,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Wind Gust in Knots",
        standard_name="Wind Gust",
        units="knots",
    ),
)
pressure_da = xr.DataArray(
    data=pressure,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Pressure in inHg",
        standard_name="Pressure",
        units="inHg",
    ),
)
visibility_da = xr.DataArray(
    data=visibility,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="Visibility in Miles",
        standard_name="Visibility",
        units="miles",
    ),
)
one_hour_precipitation_da = xr.DataArray(
    data=one_hour_precipitation,
    dims=["time"],
    coords=dict(time=time_da),
    attrs=dict(
        long_name="One Hour Precipitation in Inches",
        standard_name="Precipitation",
        units="inches",
    ),
)
# one_hour_precipitation_da

Create and Xarray Dataset

Combine all of the DataArray variables together and include attributes that don’t change for the DataSet such as the GPS location and the elevation.

ds = xr.Dataset(
    data_vars=dict(
        temperature=temperature_da,
        dew_point=dew_point_da,
        relative_humidity=relative_humidity_da,
        wind_direction=wind_direction_da,
        wind_speed=wind_speed_da,
        wind_gust=wind_gust_da,
        pressure=pressure_da,
        visibility=visibility_da,
        precipitation=one_hour_precipitation_da,
    ),
    coords=dict(
        time=time_da,
    ),
    attrs=dict(
        station=station,
        country=country,
        timezone=tzname,
        latitude=latitude,
        longitude=longitude,
        elevation=elevation
    ),
)
ds
Loading...

Create an Xarray DataTree

The solution to having disperate datasets will be to join them together using Xarray’s DataTree implementation. It is a graph like structure, so it uses nodes and edges to connect dataset in a hierarchical like manner.

Here is an example of an Xarray DataTree with a family:

bart = xr.DataTree(name="Bart")
lisa = xr.DataTree(name="Lisa")
homer = xr.DataTree(name="Homer", children={"Bart": bart, "Lisa": lisa})
print(homer)
<xarray.DataTree 'Homer'>
Group: /
├── Group: /Bart
└── Group: /Lisa

To access a child I just denote them by name:

homer['Lisa']
Loading...

First example: Define the METAR station for Boulder, ‘BDU’ and add it as a child to the parent

bdu = xr.DataTree(dataset=ds, name=ds.station)

metar = xr.DataTree(name="metar", children={"bdu": bdu})

So “metar” now has the child dataset, “bdu”

And note that there are attributes specific to that station tracked with that

metar.children
Frozen({'bdu': <xarray.DataTree 'bdu'> Group: /bdu Dimensions: (time: 210) Coordinates: * time (time) datetime64[us] 2kB 2026-06-27T04:35:00 ... 2026... Data variables: temperature (time) float64 2kB 71.6 73.4 75.2 73.4 ... 78.8 80.6 78.8 dew_point (time) float64 2kB 44.6 41.0 41.0 41.0 ... 30.2 24.8 26.6 relative_humidity (time) float64 2kB 37.92 31.08 29.26 ... 16.92 12.77 14.6 wind_direction (time) float64 2kB 240.0 260.0 260.0 ... 190.0 0.0 230.0 wind_speed (time) float64 2kB 6.0 20.0 10.0 9.0 ... 4.0 3.0 0.0 4.0 wind_gust (time) float64 2kB nan 33.0 23.0 nan ... 15.0 nan nan nan pressure (time) float64 2kB 29.86 29.89 29.86 ... 29.89 29.9 29.91 visibility (time) float64 2kB 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0 precipitation (time) float64 2kB 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 Attributes: station: BDU country: US timezone: America/Denver latitude: 40.0394 longitude: -105.2258 elevation: 1612.0})

We can store the data in a single cloud-native dataset now, no need to partition because the data is automatically chunked

The dataset can be saved as follows:

metar.to_zarr(store="metar.zarr", mode='a')
<xarray.backends.zarr.ZarrStore at 0x7f97788e13a0>

Create a datatree of some Colorado stations

Pick a couple example stations to save:

select_stations = ['BDU', 'DEN', 'DRO']

Iterate through all selected stations and add them to the parent

all_children = {}
for stn in select_stations:
    # print(f"station: {stn}")
    dff = df.loc[df['station'] == stn]
    dff_select = pd.DataFrame({
        'station': dff['station'],
        'date': dff['valid'],
        'latitude': dff['latitude'],
        'longitude': dff['longitude'],
        'tmpf': dff['tmpf'],
        'dwpf': dff['dwpf'],
        'relh': dff['relh'],
        'drct': dff['drct'],
        'sknt': dff['sknt'],
        'gust': dff['gust'],
        'alti': dff['alti'],
        'vsby': dff['vsby'],
        'p01i': dff['p01i'],
    }).set_index('date')
    latitude = dff['latitude'].iloc[0]
    longitude = dff['longitude'].iloc[0]
    elevation = dff['elevation'].iloc[0]
    station = dff['station'].iloc[0]
    country = dff['country'].iloc[0]
    tzname = dff['tzname'].iloc[0]
    time = dff_select.index.sort_values()
    # print(np.diff(time))
    latitude = dff_select['latitude'].values[0]
    longitude = dff_select['longitude'].values[0]
    temperature = dff_select['tmpf'].values
    dew_point = dff['dwpf'].values
    relative_humidity = dff['relh'].values
    wind_direction = dff['drct'].values
    wind_speed = dff['sknt'].values
    wind_gust = dff['gust'].values
    pressure = dff['alti'].values
    visibility = dff['vsby'].values
    one_hour_precipitation = dff['p01i'].values
    # time.tz_localize(None)
    time = dff_select.index.sort_values().tz_localize(None)
    time_da = xr.DataArray(
        data=time,
        dims=("time"),
        name="time",
        attrs=dict(
            # calendar="proleptic_gregorian",
            # units="seconds since 1970-01-01 00:00:00",
            long_name="Timestamp of each ping",
            standard_name="time",
        )
    )
    temperature_da = xr.DataArray(
        data=temperature,
        dims=["time"],
        coords=dict(
            time=time_da,
        ),
        attrs=dict(
            long_name="Temperature in Fahrenheit",
            standard_name="Temperature",
            units="degF",
        ),
    )
    # temperature_da
    dew_point_da = xr.DataArray(
        data=dew_point,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Dewpoint in Fahrenheit",
            standard_name="Dewpoint",
            units="degF",
        ),
    )
    relative_humidity_da = xr.DataArray(
        data=relative_humidity,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Relative humidity in Percent",
            standard_name="Relative Humidity",
            units="percent",
        ),
    )
    wind_direction_da = xr.DataArray(
        data=wind_direction,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Wind Direction in Degrees",
            standard_name="Wind Direction",
            units="degrees",
        ),
    )
    wind_speed_da = xr.DataArray(
        data=wind_speed,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Wind Speed in Knots",
            standard_name="Wind Speed",
            units="knots",
        ),
    )
    wind_gust_da = xr.DataArray(
        data=wind_gust,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Wind Gust in Knots",
            standard_name="Wind Gust",
            units="knots",
        ),
    )
    pressure_da = xr.DataArray(
        data=pressure,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Pressure in inHg",
            standard_name="Pressure",
            units="inHg",
        ),
    )
    visibility_da = xr.DataArray(
        data=visibility,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="Visibility in Miles",
            standard_name="Visibility",
            units="miles",
        ),
    )
    one_hour_precipitation_da = xr.DataArray(
        data=one_hour_precipitation,
        dims=["time"],
        coords=dict(time=time_da),
        attrs=dict(
            long_name="One Hour Precipitation in Inches",
            standard_name="Precipitation",
            units="inches",
        ),
    )
    ds = xr.Dataset(
        data_vars=dict(
            temperature=temperature_da,
            dew_point=dew_point_da,
            relative_humidity=relative_humidity_da,
            wind_direction=wind_direction_da,
            wind_speed=wind_speed_da,
            wind_gust=wind_gust_da,
            pressure=pressure_da,
            visibility=visibility_da,
            precipitation=one_hour_precipitation_da,
        ),
        coords=dict(
            time=time_da,
        ),
        attrs=dict(
            station=station,
            country=country,
            timezone=tzname,
            latitude=latitude,
            longitude=longitude,
            elevation=elevation
        ),
    )
    # print(ds)
    temp_station = xr.DataTree(dataset=ds, name=ds.station)
    all_children[ds.station] = temp_station
len(all_children)
3

Create Consolidated DataTree of the Station Data

metar = xr.DataTree(name="metar", children=all_children)
metar.children
Frozen({'BDU': <xarray.DataTree 'BDU'> Group: /BDU Dimensions: (time: 210) Coordinates: * time (time) datetime64[us] 2kB 2026-06-27T04:35:00 ... 2026... Data variables: temperature (time) float64 2kB 71.6 73.4 75.2 73.4 ... 78.8 80.6 78.8 dew_point (time) float64 2kB 44.6 41.0 41.0 41.0 ... 30.2 24.8 26.6 relative_humidity (time) float64 2kB 37.92 31.08 29.26 ... 16.92 12.77 14.6 wind_direction (time) float64 2kB 240.0 260.0 260.0 ... 190.0 0.0 230.0 wind_speed (time) float64 2kB 6.0 20.0 10.0 9.0 ... 4.0 3.0 0.0 4.0 wind_gust (time) float64 2kB nan 33.0 23.0 nan ... 15.0 nan nan nan pressure (time) float64 2kB 29.86 29.89 29.86 ... 29.89 29.9 29.91 visibility (time) float64 2kB 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0 precipitation (time) float64 2kB 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 Attributes: station: BDU country: US timezone: America/Denver latitude: 40.0394 longitude: -105.2258 elevation: 1612.0, 'DEN': <xarray.DataTree 'DEN'> Group: /DEN Dimensions: (time: 75) Coordinates: * time (time) datetime64[us] 600B 2026-06-27T04:53:00 ... 202... Data variables: temperature (time) float64 600B 67.0 66.0 65.0 ... 84.0 79.0 70.0 dew_point (time) float64 600B 59.0 58.0 56.0 ... 28.0 30.0 32.0 relative_humidity (time) float64 600B 75.5 75.41 72.64 ... 16.67 24.43 wind_direction (time) float64 600B 160.0 170.0 170.0 ... 40.0 80.0 110.0 wind_speed (time) float64 600B 22.0 15.0 12.0 11.0 ... 12.0 8.0 8.0 wind_gust (time) float64 600B 34.0 nan nan nan ... 19.0 nan nan nan pressure (time) float64 600B 29.89 29.89 29.89 ... 29.86 29.89 visibility (time) float64 600B 10.0 10.0 10.0 ... 10.0 10.0 10.0 precipitation (time) float64 600B 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 Attributes: station: DEN country: US timezone: America/Denver latitude: 39.8328 longitude: -104.6575 elevation: 1656.0, 'DRO': <xarray.DataTree 'DRO'> Group: /DRO Dimensions: (time: 72) Coordinates: * time (time) datetime64[us] 576B 2026-06-27T04:53:00 ... 202... Data variables: temperature (time) float64 576B 57.0 55.0 54.0 ... 79.0 76.0 70.0 dew_point (time) float64 576B 51.0 51.0 51.0 ... 15.0 18.0 18.0 relative_humidity (time) float64 576B 80.31 86.35 89.55 ... 11.14 13.63 wind_direction (time) float64 576B 0.0 0.0 0.0 ... 250.0 250.0 360.0 wind_speed (time) float64 576B 0.0 0.0 0.0 4.0 ... 6.0 11.0 6.0 4.0 wind_gust (time) float64 576B nan nan nan nan ... nan 16.0 nan nan pressure (time) float64 576B 30.18 30.16 30.17 ... 30.06 30.06 visibility (time) float64 576B 10.0 10.0 10.0 ... 10.0 10.0 10.0 precipitation (time) float64 576B 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 Attributes: station: DRO country: US timezone: America/Denver latitude: 37.1515 longitude: -107.7538 elevation: 2038.0})

Write all of the data as a single dataset

metar.to_zarr(store="metar_all.zarr", mode='a')
<xarray.backends.zarr.ZarrStore at 0x7f978282b7e0>

Here is how you would open the data

dt = xr.open_datatree("metar_all.zarr")
dt
Loading...

To get a slice of 1 hours data you could do something like:

dt_bdu_sel = dt["BDU"].sel(time=slice("2026-06-18T00:00:00", "2026-06-18T01:00:00"))
dt_bdu_sel
Loading...

And then to get the temperature data:

dt_bdu_sel.temperature
Loading...

And the relative humidity:

dt_bdu_sel.relative_humidity
Loading...

References

What’s next?

Expanding the DataTree implementation to the full dataset.