Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Parquet and DuckDB Primer

Authors
Affiliations
Colorado State University
UCAR | UCP | NSF Unidata
CIRES University of Colorado Boulder
Boise State University
Montclair State University
NSF NCAR EOL
University at Albany (SUNY)

Overview

This cookbook makes extensive use of duckdb and a file format called Parquet (specifically GeoParquet). This notebook serves to get you familiar with the technology, why it’s being used, and some advanced usage.

Prerequisites

ConceptsImportanceNotes
DuckDBNecessaryPackage necessary; familiarity helpful
PandasHelpfulBasic experience with viewing tabular data
  • Time to learn: 20 minutes


Imports

import requests
import duckdb
from datetime import datetime

The What and Why of (Geo)Parquet

As noted in the Parquet documentation: “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.”

In plain language, this means that Parquet is a modern, performant alternative to the plain text CSV (Comma Separated Variables) format. The following code cell pulls in some example CSV-formatted data from the Iowa Environmental Mesonet (IEM) and contains the following fields: station (Denver International Airport), day (Jan 1-10, 2026), and max recorded temperature in Fahrenheit.

iem_url = "https://mesonet.agron.iastate.edu/cgi-bin/request/daily.py"
params = {
    "network": "CO_ASOS", "stations": "DEN",
    "year1": "2026", "month1": "1", "day1": "1",
    "year2": "2026", "month2": "1", "day2": "10",
    "var": "max_temp_f",
    "na": "blank",
    "format": "csv"
}
den_data = requests.get(
    iem_url,
    params=params
)
print(den_data.content.decode())
station,day,max_temp_f
DEN,2026-01-01,64.0
DEN,2026-01-02,60.0
DEN,2026-01-03,64.0
DEN,2026-01-04,67.0
DEN,2026-01-05,61.0
DEN,2026-01-06,57.0
DEN,2026-01-07,58.0
DEN,2026-01-08,41.0
DEN,2026-01-09,32.0
DEN,2026-01-10,44.0

For small data sets such as the above, this is an acceptable file format. However, consider a dataset with these properties:

  • many stations located globaly

  • a large number of data fields

  • collected hourly

  • decades-worth of data

Such a file in a CSV format would quickly become unwieldy. Furthermore, much of the data is “repeated.” That is, all the Denver station data has “DEN” in the station column. Over hundreds, or thousands, of data entries this redundant data can add up to unnecessarily large file size. In addition, this particular example does not have geospatial data.

With Parquet, this data can be saved in a more storage-efficient format. GeoParquet extends Parquet to add geospatial/geometry data and metadata to Parquet data sets. An interested reader can read the introductory blog post with more detail here.

The IEM provides a historical archive of global ASOS data in the native METAR format. To conclude our GeoParquet story, Dynamical.org has created a processing pipeline that takes raw METAR from the IEM, transforms it into the GeoParquet format, and makes it readily available via the internet.

🦆 The What, Why, and How of DuckDB 🦆

DuckDB is a relational (table-oriented) DBMS [(Database Management System)] that supports the Structured Query Language (SQL).”

Databases are routinely used in web development to read and write, for example, user data and other website content. However, for our purposes we will not be using DuckDB in such a manner. Instead, we’ll be making use of the following DuckDB features:

  • Parquet support

  • SQL support

  • Interoperability with the Pandas library

By using DuckDB’s Python library, we can read a Parquet file--or collection of files--as a DuckDB “table”, execute a “query” to efficiently subset the dataset(s) for the variables of interest, and finally load the data into a pandas dataframe for further exploration, processing, and analysis. This technique will be used throughout this cookbook and is the method used by Dynamical.org in their examples.

base = "https://data.source.coop/dynamical/asos-parquet"

# URLS for every data set/parquet file beginning in 1940
urls = [f"{base}/year={y}/data.parquet" for y in range(1940, datetime.now().year + 1)]

df = duckdb.execute("""
    SELECT valid, station, name, country, tmpf, dwpf, sknt, p01i, year
    FROM read_parquet($1, hive_partitioning=true)
    WHERE station = 'DEN'
    ORDER BY valid""", [urls]).fetchdf()
df
Loading...
Loading...

Here, we’ve imported the duckdb module to gain access to the duckdb.execute function. The first argument is a multiline string, surrounded by triple double quotes """, defining the SQL command to execute. The second argument is a list that can be used to define parameters that can be referred in the SQL command. Finally, we call the fetchdf() method to convert the queried data into a Pandas dataframe.

Anatomy of a Query

In this sub-section, we’ll deconstruct the query from the example above. Additionally, we will explore a few additional features that may be useful.

SELECT

E.g.: SELECT valid, station, name, country, tmpf, dwpf, sknt, p01i, year

The SELECT statement determines the names of the variables to fetch, separated by commas. Each variable can be renamed with an AS instruction. For example:

df = duckdb.execute("""
SELECT valid AS datetime, station, tmpf AS temperature_f, dwpf AS dewpoint_f
FROM read_parquet($1)
WHERE station = 'DEN'
""", [f"{base}/year=2026/data.parquet"]).fetchdf()
df
Loading...

FROM

E.g.: FROM read_parquet($1, hive_partitioning=true)

The FROM instruction tells DuckDB the “table” to select from. In this case, our table is parsed from a parquet file(s) which is defined in the $1 parameter. Recall that this parameter is the first element of the list we sent as the second argument to the duckdb.execute function.

WHERE

E.g.: WHERE station = 'DEN'

The WHERE instruction filters the selected table data based on a conditional expression. In our original example we used WHERE station = 'DEN' to select data from the station at the Denver International Airport.

We can create more complex queries by combining conditional expressions with AND and OR. When doing so, the expression can be grouped with parentheses () to preserve order of operations. The following example will select all entries where either the Denver International Airport or Boulder Municipal Airport stations recorded temperatures below freezing but above 20F.

df = duckdb.execute("""
SELECT valid AS datetime, station, tmpf AS temperature_f, dwpf AS dewpoint_f
FROM read_parquet($1)
WHERE (station = 'DEN' OR station = 'BDU')
  AND (temperature_f <= 32 AND temperature_f >= 20)
""", [f"{base}/year=2026/data.parquet"]).fetchdf()
df
Loading...

In the above query, we used the expression temperature_f <= 32 AND temperature_f >= 20 to filter by entries with temperatures between 20F and 32F, inclusive. This is a common enough kind of query that SQL has a dedicated query to provide this functionality. The following query is equivalent:

WHERE (station = 'DEN' OR station = 'BDU')
  AND (temperature_f BETWEEN 20 AND 32)

WHERE and BETWEEN can be used to filter entries by date. For example, to get data collected in the month of March we would run the following query. Note that we construct the query with datetime values as a full timestamp passed as a parameter via the execute function.

t_start = '2026-01-01 00:00:00+00:00'
t_end = '2026-01-31 23:59:59+00:00'
df = duckdb.execute("""
SELECT valid AS datetime, station, tmpf AS temperature_f, dwpf AS dewpoint_f
FROM read_parquet($1)
WHERE (station = 'DEN' OR station = 'BDU')
  AND (datetime BETWEEN $2 AND $3)
""", [f"{base}/year=2026/data.parquet", t_start, t_end]).fetchdf()
df
Loading...

ORDER BY

E.g.: ORDER BY valid

The ORDER BY query sorts the queried data in ascending or descending order (using DESC), and can sort by more than one field. For example to sort queried data by datetime, regardless of station, with most recent dates first:

df = duckdb.execute("""
SELECT valid AS datetime, station, tmpf AS temperature_f, dwpf AS dewpoint_f
FROM read_parquet($1)
WHERE (station = 'DEN' OR station = 'BDU' OR station = 'JFK')
ORDER BY datetime DESC
""", [f"{base}/year=2026/data.parquet"]).fetchdf()
df
Loading...

If we wanted to additionally sort by station first, we can instead run:

df = duckdb.execute("""
SELECT valid AS datetime, station, tmpf AS temperature_f, dwpf AS dewpoint_f
FROM read_parquet($1)
WHERE (station = 'DEN' OR station = 'BDU' OR station = 'JFK')
ORDER BY station, datetime DESC
""", [f"{base}/year=2026/data.parquet"]).fetchdf()
df
Loading...

Summary

GeoParquet is a modern data format for efficiently storing and accessing tabular data, such as Dynamical.org’s ASOS catalog, transformed from the native METAR format found on the Iowa Environmental Mesonet. DuckDB’s Python module is used to subset Dynamical’s GeoParquet-formatted ASOS data using SQL queries before being stored in memory as a Panda’s dataframe for further processing and analysis.

SQL commands are composed of various queries and statements. Some commonly useful ones include:

  • SELECT

  • FROM

  • WHERE

  • ORDER BY

What’s next?

Now that you’re prepared to construct your own queries, we can move on to exploring the types of analyses and visualizations that can be made with Dynamical’s ASOS data.