We will convert the dynamical Meteorological Aerodrome Report from a parquet format to instead us Xarray’s DataTree
Overview¶
Within this notebook, we will create an interactive visualization of the latest METAR data across all stations. We will use the following libraries for our visualizations:
Prerequisites¶
Time to learn: 10 minutes
import duckdb
import numpy as np
import pandas as pd
import geopandas as gpd
from datetime import datetime, timedelta, timezone
import hvplot
import hvplot.pandas
import holoviews as hv
from palettable.colorbrewer.diverging import BrBG_10
import xarray as xr
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')We will just be working with the past years worth of data, so the parquet file can be found here.
url = 'https://data.source.coop/dynamical/asos-parquet/year=2026/data.parquet'Get the past three days of data¶
This query will request just the latest hours worth of data across all stations.
time_1 = datetime.now()
print(f"end time: {time_1}")
time_0 = datetime.now() - timedelta(hours=72)
print(f"start time: {time_0}")end time: 2026-06-30 04:15:24.345678
start time: 2026-06-27 04:15:24.345806
Query the database for data between timestamps requires SQL-like queries¶
df = duckdb.execute("""
SELECT *
FROM read_parquet($1, hive_partitioning=true)
WHERE
--- country = 'US' AND
valid BETWEEN $2 AND $3
ORDER BY country
""", [url, time_0, time_1]).fetchdf()Print the first couple columns of data to get an understanding of the structure
df.head(2)Understanding the data columns¶
The documentation can be found here for the individual columns:
https://
Print the columns as follows:
df.columnsIndex(['station', 'valid', 'longitude', 'latitude', 'tmpf', 'tmpc', 'dwpf',
'dwpc', 'relh', 'drct', 'sknt', 'gust', 'alti', 'mslp', 'vsby', 'p01i',
'p01m', 'state', 'geometry', 'name', 'elevation', 'country', 'county',
'wfo', 'tzname', 'bbox', 'year'],
dtype='str')Now create bind descriptions to each of the columns
variables = {
'tmpf': "Temperature in Fahrenheit",
'tmpc': "Temperature in Celsius",
'dwpf': "Dewpoint in Fahrenheit",
'dwpc': "Dewpoint in Celsius",
'relh': "Relative humidity in percent",
'drct': "Wind direction in degrees",
'sknt': "Wind speed in knots",
'gust': "Wind gusts in knots",
'alti': "Altimeter setting in inches of mercury",
'mslp': "Mean sea level pressure",
'vsby': "Visibility in statute miles",
'p01i': "P01I", # 1-hour precipitation inches
'p01m': "P01M" # 1-hour precipitation mm
}Plot Timeseries data from a station local to Boulder, Colorado¶
Select all the data from the Boulder station, “BDU” and plot it.
df_bdu = df.loc[df['station'] == "BDU"]
df_bdu.head(2)df_select = pd.DataFrame({
'date': df_bdu['valid'],
'tmpc': df_bdu['tmpc'],
'dwpc': df_bdu['dwpc'],
'relh': df_bdu['relh'],
'drct': df_bdu['drct'],
'sknt': df_bdu['sknt'],
'alti': df_bdu['alti'],
}).set_index('date')
# print the head of the dataframe
df_select.head(2)Iterate through some variables and compose an interactive plot of the past three days
hv.Layout(
[df_select[i].hvplot.line(title=f"{variables[i]}", width=400) for i in [
'tmpc',
'relh',
]]
).cols(1)METAR Parquet Antipatterns¶
There are a couple antipatterns that we are currently using in this cloud computing scheme:
partitioning --> the data is currently not accessed through a single catalog
we need to derive the start & end time to query specific files
compression --> similar data is not stored adjacent across rows of the parquet file
expressions --> data is not being accessed in a pythonic manner, we are instead relying on SQL-like queries
geospatial information is not changing, station attributes such as “lat/lon” are stored in a redundant manner
We would like to access the data using a more pythonic approach. The underlying data is timeseries, so the goal is to store it as timeseries DataArrays.
All of these problems can be fixed by Xarray’s DataTree implementation.
df.head(2)# assemble all of the data for one station
df_bdu = df.loc[df['station'] == "BDU"]
df_bdu_select = pd.DataFrame({
'station': df_bdu['station'],
'date': df_bdu['valid'],
'latitude': df_bdu['latitude'],
'longitude': df_bdu['longitude'],
'tmpf': df_bdu['tmpf'],
'dwpf': df_bdu['dwpf'],
'relh': df_bdu['relh'],
'drct': df_bdu['drct'],
'sknt': df_bdu['sknt'],
'gust': df_bdu['gust'],
'alti': df_bdu['alti'],
# 'mslp': df_bdu['mslp'],
'vsby': df_bdu['vsby'],
'p01i': df_bdu['p01i'],
}).set_index('date')
df_bdu_select.head(2)latitude = df_bdu['latitude'].iloc[0]
longitude = df_bdu['longitude'].iloc[0]
elevation = df_bdu['elevation'].iloc[0]
print(f"Latitude: {latitude}, Longitude: {longitude}, Elevation: {elevation}")Latitude: 40.0394, Longitude: -105.2258, Elevation: 1612.0
station = df_bdu['station'].iloc[0]
country = df_bdu['country'].iloc[0]
tzname = df_bdu['tzname'].iloc[0]
print(f"Station: {station}, Country: {country} and time zone name: {tzname}")Station: BDU, Country: US and time zone name: America/Denver
# create an xarray dataarra for each variable
time = df_bdu_select.index.sort_values()
latitude = df_bdu_select['latitude'].values[0]
longitude = df_bdu_select['longitude'].values[0]
temperature = df_bdu_select['tmpf'].values
dew_point = df_bdu['dwpf'].values
relative_humidity = df_bdu['relh'].values
wind_direction = df_bdu['drct'].values
wind_speed = df_bdu['sknt'].values
wind_gust = df_bdu['gust'].values
pressure = df_bdu['alti'].values
visibility = df_bdu['vsby'].values
one_hour_precipitation = df_bdu['p01i'].values# help(time.sort_values) #()Create all of the DataArrays with associated labels and units¶
time = df_bdu_select.index.sort_values().tz_localize(None)
time_da = xr.DataArray(
data=time,
dims=("time"),
name="time",
attrs=dict(
long_name="Timestamp of each ping",
standard_name="time",
)
)
temperature_da = xr.DataArray(
data=temperature,
dims=["time"],
coords=dict(
time=time_da,
),
attrs=dict(
long_name="Temperature in Fahrenheit",
standard_name="Temperature",
units="degF",
),
)
dew_point_da = xr.DataArray(
data=dew_point,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Dewpoint in Fahrenheit",
standard_name="Dewpoint",
units="degF",
),
)
relative_humidity_da = xr.DataArray(
data=relative_humidity,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Relative humidity in Percent",
standard_name="Relative Humidity",
units="percent",
),
)
wind_direction_da = xr.DataArray(
data=wind_direction,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Wind Direction in Degrees",
standard_name="Wind Direction",
units="degrees",
),
)
wind_speed_da = xr.DataArray(
data=wind_speed,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Wind Speed in Knots",
standard_name="Wind Speed",
units="knots",
),
)
wind_gust_da = xr.DataArray(
data=wind_gust,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Wind Gust in Knots",
standard_name="Wind Gust",
units="knots",
),
)
pressure_da = xr.DataArray(
data=pressure,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Pressure in inHg",
standard_name="Pressure",
units="inHg",
),
)
visibility_da = xr.DataArray(
data=visibility,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Visibility in Miles",
standard_name="Visibility",
units="miles",
),
)
one_hour_precipitation_da = xr.DataArray(
data=one_hour_precipitation,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="One Hour Precipitation in Inches",
standard_name="Precipitation",
units="inches",
),
)
# one_hour_precipitation_daCreate and Xarray Dataset¶
Combine all of the DataArray variables together and include attributes that don’t change for the DataSet such as the GPS location and the elevation.
ds = xr.Dataset(
data_vars=dict(
temperature=temperature_da,
dew_point=dew_point_da,
relative_humidity=relative_humidity_da,
wind_direction=wind_direction_da,
wind_speed=wind_speed_da,
wind_gust=wind_gust_da,
pressure=pressure_da,
visibility=visibility_da,
precipitation=one_hour_precipitation_da,
),
coords=dict(
time=time_da,
),
attrs=dict(
station=station,
country=country,
timezone=tzname,
latitude=latitude,
longitude=longitude,
elevation=elevation
),
)
dsCreate an Xarray DataTree¶
The solution to having disperate datasets will be to join them together using Xarray’s DataTree implementation. It is a graph like structure, so it uses nodes and edges to connect dataset in a hierarchical like manner.
Here is an example of an Xarray DataTree with a family:
bart = xr.DataTree(name="Bart")
lisa = xr.DataTree(name="Lisa")
homer = xr.DataTree(name="Homer", children={"Bart": bart, "Lisa": lisa})
print(homer)<xarray.DataTree 'Homer'>
Group: /
├── Group: /Bart
└── Group: /Lisa
To access a child I just denote them by name:
homer['Lisa']First example: Define the METAR station for Boulder, ‘BDU’ and add it as a child to the parent
bdu = xr.DataTree(dataset=ds, name=ds.station)
metar = xr.DataTree(name="metar", children={"bdu": bdu})So “metar” now has the child dataset, “bdu”¶
And note that there are attributes specific to that station tracked with that
metar.childrenFrozen({'bdu': <xarray.DataTree 'bdu'>
Group: /bdu
Dimensions: (time: 210)
Coordinates:
* time (time) datetime64[us] 2kB 2026-06-27T04:35:00 ... 2026...
Data variables:
temperature (time) float64 2kB 71.6 73.4 75.2 73.4 ... 78.8 80.6 78.8
dew_point (time) float64 2kB 44.6 41.0 41.0 41.0 ... 30.2 24.8 26.6
relative_humidity (time) float64 2kB 37.92 31.08 29.26 ... 16.92 12.77 14.6
wind_direction (time) float64 2kB 240.0 260.0 260.0 ... 190.0 0.0 230.0
wind_speed (time) float64 2kB 6.0 20.0 10.0 9.0 ... 4.0 3.0 0.0 4.0
wind_gust (time) float64 2kB nan 33.0 23.0 nan ... 15.0 nan nan nan
pressure (time) float64 2kB 29.86 29.89 29.86 ... 29.89 29.9 29.91
visibility (time) float64 2kB 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0
precipitation (time) float64 2kB 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes:
station: BDU
country: US
timezone: America/Denver
latitude: 40.0394
longitude: -105.2258
elevation: 1612.0})We can store the data in a single cloud-native dataset now, no need to partition because the data is automatically chunked¶
The dataset can be saved as follows:
metar.to_zarr(store="metar.zarr", mode='a')<xarray.backends.zarr.ZarrStore at 0x7f97788e13a0>Create a datatree of some Colorado stations¶
Pick a couple example stations to save:
select_stations = ['BDU', 'DEN', 'DRO']Iterate through all selected stations and add them to the parent¶
all_children = {}
for stn in select_stations:
# print(f"station: {stn}")
dff = df.loc[df['station'] == stn]
dff_select = pd.DataFrame({
'station': dff['station'],
'date': dff['valid'],
'latitude': dff['latitude'],
'longitude': dff['longitude'],
'tmpf': dff['tmpf'],
'dwpf': dff['dwpf'],
'relh': dff['relh'],
'drct': dff['drct'],
'sknt': dff['sknt'],
'gust': dff['gust'],
'alti': dff['alti'],
'vsby': dff['vsby'],
'p01i': dff['p01i'],
}).set_index('date')
latitude = dff['latitude'].iloc[0]
longitude = dff['longitude'].iloc[0]
elevation = dff['elevation'].iloc[0]
station = dff['station'].iloc[0]
country = dff['country'].iloc[0]
tzname = dff['tzname'].iloc[0]
time = dff_select.index.sort_values()
# print(np.diff(time))
latitude = dff_select['latitude'].values[0]
longitude = dff_select['longitude'].values[0]
temperature = dff_select['tmpf'].values
dew_point = dff['dwpf'].values
relative_humidity = dff['relh'].values
wind_direction = dff['drct'].values
wind_speed = dff['sknt'].values
wind_gust = dff['gust'].values
pressure = dff['alti'].values
visibility = dff['vsby'].values
one_hour_precipitation = dff['p01i'].values
# time.tz_localize(None)
time = dff_select.index.sort_values().tz_localize(None)
time_da = xr.DataArray(
data=time,
dims=("time"),
name="time",
attrs=dict(
# calendar="proleptic_gregorian",
# units="seconds since 1970-01-01 00:00:00",
long_name="Timestamp of each ping",
standard_name="time",
)
)
temperature_da = xr.DataArray(
data=temperature,
dims=["time"],
coords=dict(
time=time_da,
),
attrs=dict(
long_name="Temperature in Fahrenheit",
standard_name="Temperature",
units="degF",
),
)
# temperature_da
dew_point_da = xr.DataArray(
data=dew_point,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Dewpoint in Fahrenheit",
standard_name="Dewpoint",
units="degF",
),
)
relative_humidity_da = xr.DataArray(
data=relative_humidity,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Relative humidity in Percent",
standard_name="Relative Humidity",
units="percent",
),
)
wind_direction_da = xr.DataArray(
data=wind_direction,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Wind Direction in Degrees",
standard_name="Wind Direction",
units="degrees",
),
)
wind_speed_da = xr.DataArray(
data=wind_speed,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Wind Speed in Knots",
standard_name="Wind Speed",
units="knots",
),
)
wind_gust_da = xr.DataArray(
data=wind_gust,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Wind Gust in Knots",
standard_name="Wind Gust",
units="knots",
),
)
pressure_da = xr.DataArray(
data=pressure,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Pressure in inHg",
standard_name="Pressure",
units="inHg",
),
)
visibility_da = xr.DataArray(
data=visibility,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="Visibility in Miles",
standard_name="Visibility",
units="miles",
),
)
one_hour_precipitation_da = xr.DataArray(
data=one_hour_precipitation,
dims=["time"],
coords=dict(time=time_da),
attrs=dict(
long_name="One Hour Precipitation in Inches",
standard_name="Precipitation",
units="inches",
),
)
ds = xr.Dataset(
data_vars=dict(
temperature=temperature_da,
dew_point=dew_point_da,
relative_humidity=relative_humidity_da,
wind_direction=wind_direction_da,
wind_speed=wind_speed_da,
wind_gust=wind_gust_da,
pressure=pressure_da,
visibility=visibility_da,
precipitation=one_hour_precipitation_da,
),
coords=dict(
time=time_da,
),
attrs=dict(
station=station,
country=country,
timezone=tzname,
latitude=latitude,
longitude=longitude,
elevation=elevation
),
)
# print(ds)
temp_station = xr.DataTree(dataset=ds, name=ds.station)
all_children[ds.station] = temp_stationlen(all_children)3Create Consolidated DataTree of the Station Data¶
metar = xr.DataTree(name="metar", children=all_children)
metar.childrenFrozen({'BDU': <xarray.DataTree 'BDU'>
Group: /BDU
Dimensions: (time: 210)
Coordinates:
* time (time) datetime64[us] 2kB 2026-06-27T04:35:00 ... 2026...
Data variables:
temperature (time) float64 2kB 71.6 73.4 75.2 73.4 ... 78.8 80.6 78.8
dew_point (time) float64 2kB 44.6 41.0 41.0 41.0 ... 30.2 24.8 26.6
relative_humidity (time) float64 2kB 37.92 31.08 29.26 ... 16.92 12.77 14.6
wind_direction (time) float64 2kB 240.0 260.0 260.0 ... 190.0 0.0 230.0
wind_speed (time) float64 2kB 6.0 20.0 10.0 9.0 ... 4.0 3.0 0.0 4.0
wind_gust (time) float64 2kB nan 33.0 23.0 nan ... 15.0 nan nan nan
pressure (time) float64 2kB 29.86 29.89 29.86 ... 29.89 29.9 29.91
visibility (time) float64 2kB 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0
precipitation (time) float64 2kB 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes:
station: BDU
country: US
timezone: America/Denver
latitude: 40.0394
longitude: -105.2258
elevation: 1612.0, 'DEN': <xarray.DataTree 'DEN'>
Group: /DEN
Dimensions: (time: 75)
Coordinates:
* time (time) datetime64[us] 600B 2026-06-27T04:53:00 ... 202...
Data variables:
temperature (time) float64 600B 67.0 66.0 65.0 ... 84.0 79.0 70.0
dew_point (time) float64 600B 59.0 58.0 56.0 ... 28.0 30.0 32.0
relative_humidity (time) float64 600B 75.5 75.41 72.64 ... 16.67 24.43
wind_direction (time) float64 600B 160.0 170.0 170.0 ... 40.0 80.0 110.0
wind_speed (time) float64 600B 22.0 15.0 12.0 11.0 ... 12.0 8.0 8.0
wind_gust (time) float64 600B 34.0 nan nan nan ... 19.0 nan nan nan
pressure (time) float64 600B 29.89 29.89 29.89 ... 29.86 29.89
visibility (time) float64 600B 10.0 10.0 10.0 ... 10.0 10.0 10.0
precipitation (time) float64 600B 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes:
station: DEN
country: US
timezone: America/Denver
latitude: 39.8328
longitude: -104.6575
elevation: 1656.0, 'DRO': <xarray.DataTree 'DRO'>
Group: /DRO
Dimensions: (time: 72)
Coordinates:
* time (time) datetime64[us] 576B 2026-06-27T04:53:00 ... 202...
Data variables:
temperature (time) float64 576B 57.0 55.0 54.0 ... 79.0 76.0 70.0
dew_point (time) float64 576B 51.0 51.0 51.0 ... 15.0 18.0 18.0
relative_humidity (time) float64 576B 80.31 86.35 89.55 ... 11.14 13.63
wind_direction (time) float64 576B 0.0 0.0 0.0 ... 250.0 250.0 360.0
wind_speed (time) float64 576B 0.0 0.0 0.0 4.0 ... 6.0 11.0 6.0 4.0
wind_gust (time) float64 576B nan nan nan nan ... nan 16.0 nan nan
pressure (time) float64 576B 30.18 30.16 30.17 ... 30.06 30.06
visibility (time) float64 576B 10.0 10.0 10.0 ... 10.0 10.0 10.0
precipitation (time) float64 576B 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes:
station: DRO
country: US
timezone: America/Denver
latitude: 37.1515
longitude: -107.7538
elevation: 2038.0})Write all of the data as a single dataset¶
metar.to_zarr(store="metar_all.zarr", mode='a')<xarray.backends.zarr.ZarrStore at 0x7f978282b7e0>Here is how you would open the data¶
dt = xr.open_datatree("metar_all.zarr")
dtTo get a slice of 1 hours data you could do something like:
dt_bdu_sel = dt["BDU"].sel(time=slice("2026-06-18T00:00:00", "2026-06-18T01:00:00"))
dt_bdu_selAnd then to get the temperature data:
dt_bdu_sel.temperatureAnd the relative humidity:
dt_bdu_sel.relative_humidityReferences¶
What’s next?¶
Expanding the DataTree implementation to the full dataset.