
Creating Intake Catalogs


Overview

In the last lesson we learned to use Intake catalogs to simplify the process of accessing research data. In this lesson we will walk through the steps of creating a catalog for your own research data by recreating the catalog used in the previous lesson. We will cover:

  1. Creating an Intake catalog

  2. Documenting your data source

  3. Sharing your catalog on Github

Prerequisites

Concepts                      Importance   Notes
Intro to Intake               Necessary
Understanding of yaml         Necessary
Getting Started with Github   Necessary
Intro to Pandas               Helpful

  • Time to learn: 45 minutes


Imports

import intake
import intake_xarray
import intake_markdown
import requests
import aiohttp
import s3fs
import yaml
import json
import datetime
import os

Setting up the Environment

By the end of this tutorial we will have created a git repository that we can host on Github to share our catalog.

Start by creating a Github repository called “intake-demo”, and then clone the repository to your local machine. In the following command, be sure to replace path/to/Github/repository with the URL of the repository you just made.

git clone path/to/Github/repository

An Intake catalog can be as simple as a single yaml file. We can create the yaml file programmatically by converting nested python dictionaries to yaml. An Intake catalog has two main parts: metadata and sources. The metadata can be arbitrary, with a few exceptions. The sources section is a mapping between a data source name and its properties. For more information about Intake catalogs, see Intake’s documentation.

description = "Catalog containing Mesowest's HRRR data. See readme source for more information."

catalog = {'metadata': {'version': 1,
                       'description': description},
           'sources': {}}

os.makedirs("intake-demo", exist_ok=True) #only needed for building this notebook
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)

You will now notice a new file in your “intake-demo” directory called “catalog.yml” with the following contents.

with open('intake-demo/catalog.yml', 'r') as f:
    print(f.read())
metadata:
  description: Catalog containing Mesowest's HRRR data. See readme source for more
    information.
  version: 1
sources: {}
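The two-part layout described above is easy to check before a catalog is written to disk. The helper below is a minimal sketch, not part of Intake: the name catalog_problems is hypothetical, and the rule that every source entry names a driver is an assumption based on the entries built later in this lesson.

```python
def catalog_problems(catalog):
    """Return a list of structural problems found in a catalog dictionary."""
    problems = []
    # A catalog needs both top-level sections, and each must be a mapping
    for key in ("metadata", "sources"):
        if not isinstance(catalog.get(key), dict):
            problems.append(f"top-level '{key}' is missing or not a mapping")
    sources = catalog.get("sources")
    if isinstance(sources, dict):
        for name, entry in sources.items():
            # Assumption: every source entry is a mapping that names a driver
            if not isinstance(entry, dict) or "driver" not in entry:
                problems.append(f"source '{name}' does not name a driver")
    return problems


empty_catalog = {"metadata": {"version": 1, "description": "demo"},
                 "sources": {}}
print(catalog_problems(empty_catalog))     # an empty 'sources' mapping is fine
print(catalog_problems({"metadata": {}}))  # 'sources' is missing
```

An empty list means the dictionary has the shape this lesson relies on; it says nothing about whether the sources can actually be loaded.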

Adding Your First Data Source

Intake only knows how to handle a few different data formats. To handle other formats it uses pluggable drivers. To use Mesowest’s HRRR Zarr data we will use the intake-xarray package, which provides a driver for reading Zarr data into Xarray datasets. Drivers are installed as python packages and integrate into the Intake library. When a driver package is installed, Intake creates an open_{driver} method for each driver in the package. Installing the intake-xarray package allows us to access Zarr data using the open_zarr method.

Mesowest’s HRRR Zarr data is stored in AWS. The file structure of the hrrrzarr S3 bucket looks like

s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level

where

  • yyyy = four digit year

  • mm = two digit month

  • dd = two digit day of month

  • hh = two digit hour of the day

  • level = level of the atmosphere the data describes

  • param = the parameter you’re interested in

To load a complete dataset we need the Zarr arrays from two urls

s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level

s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param
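The naming scheme above maps directly onto Python's strftime codes. The helper below is a minimal sketch rather than anything provided by Intake or Mesowest: the function name hrrr_urls and the constant BUCKET_PREFIX are hypothetical, and the function only assembles the two paths for one analysis hour, level, and parameter without checking that they exist.

```python
from datetime import datetime

BUCKET_PREFIX = "s3://hrrrzarr/sfc"  # bucket layout as described above


def hrrr_urls(when, level, param):
    """Return the two Zarr paths needed for one analysis hour (hypothetical helper)."""
    # yyyymmdd/yyyymmdd_hhz_anl.zarr, built from the date and hour
    store = when.strftime("%Y%m%d/%Y%m%d_%Hz_anl.zarr")
    group = f"{BUCKET_PREFIX}/{store}/{level}/{param}"
    # The nested array path first, then its parent group, as listed above
    return [f"{group}/{level}", group]


example_urls = hrrr_urls(datetime(2016, 8, 24, 0), "surface", "TMP")
for url in example_urls:
    print(url)
```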

Let’s load surface temperature data from August 24, 2016.

urls = ['s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface',
        's3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP']

source = intake.open_zarr(urls, storage_options={"anon": True})

source.name = 'hrrrzarr'
source.description = "Mesowest's HRRR data. See readme source for more information."
ds = source.read()
ds
<xarray.Dataset> Size: 4MB
Dimensions:                  (projection_y_coordinate: 1059,
                              projection_x_coordinate: 1799)
Coordinates:
  * projection_x_coordinate  (projection_x_coordinate) float64 14kB -2.698e+0...
  * projection_y_coordinate  (projection_y_coordinate) float64 8kB -1.587e+06...
Data variables:
    TMP                      (projection_y_coordinate, projection_x_coordinate) float16 4MB ...
    forecast_period          timedelta64[ns] 8B 00:00:00
    forecast_reference_time  datetime64[ns] 8B 2016-08-24
    height                   float64 8B 1e+03
    pressure                 float64 8B 2.5e+04
    time                     datetime64[ns] 8B 2016-08-24

Above we used the storage_options argument to tell Intake how to access data on AWS. In this case we accessed the data as an anonymous user. Zarr data may also contain consolidated metadata. If it does, an additional consolidated=True argument can be given to tell Xarray to load the metadata from it, which can increase performance significantly.

When you use Intake’s open_{driver} methods, it creates a catalog entry for the source. You can view the yaml using the source’s yaml method.

print(source.yaml())
sources:
  hrrrzarr:
    args:
      storage_options:
        anon: true
      urlpath:
      - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface
      - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP
    description: Mesowest's HRRR data. See readme source for more information.
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      coords: !!python/tuple
      - projection_x_coordinate
      - projection_y_coordinate
      data_vars:
        TMP:
        - projection_x_coordinate
        - projection_y_coordinate
        forecast_period: []
        forecast_reference_time: []
        height: []
        pressure: []
        time: []
      dims:
        projection_x_coordinate: 1799
        projection_y_coordinate: 1059

Modifying the Source

If we wanted, we could add this yaml to our catalog and then load this data using Intake. However, there are many datasets we can load from the Zarr store with almost the same catalog entry. Making a separate entry for each would make the catalog cluttered and harder to use. Instead we will generalize this catalog entry so it applies to many datasets. Then we will create user parameters to give the catalog user the ability to select the data they want.

To generalize the source we need Intake to dynamically generate urls pointing to the Zarr arrays based on user-set parameters. We will take the source created by the open_zarr method, convert it to a python dictionary, and then modify it to include user parameters. We can then use those parameters to generate the urls. Intake provides Jinja templating in catalogs to make this simple. Let’s start by defining user parameters.

source_dict = yaml.load(source.yaml(), Loader=yaml.CLoader)

parameters = {}
parameters['level'] = {'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
                       'type': 'str',
                       'default': 'surface'}

parameters['param'] = {'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
                       'type': 'str',
                       'default': 'TMP'}

parameters['date'] = {'description': "Date and hour of data.",
                      'type': 'datetime',
                      'default': "2016-08-24T00:00:00",
                      'min': "2016-08-24T00:00:00"}



sources = source_dict['sources']
hrrr_zarr = sources['hrrrzarr']
hrrr_zarr['parameters'] = parameters
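To make the meaning of the default and min fields concrete, here is a minimal sketch of the kind of check a parameter specification like the date entry implies. It is not Intake's implementation: the function name resolve_datetime is hypothetical, and it handles only the datetime type.

```python
from datetime import datetime

# Same fields as the 'date' user parameter defined above
date_spec = {"type": "datetime",
             "default": "2016-08-24T00:00:00",
             "min": "2016-08-24T00:00:00"}


def resolve_datetime(spec, value=None):
    """Return the user's value (or the default) as a datetime, enforcing 'min'."""
    if value is None:
        value = spec["default"]
    if isinstance(value, str):
        # Catalog files store datetimes as ISO 8601 strings
        value = datetime.fromisoformat(value)
    minimum = spec.get("min")
    if minimum is not None and value < datetime.fromisoformat(minimum):
        raise ValueError(f"{value.isoformat()} is earlier than the minimum {minimum}")
    return value


print(resolve_datetime(date_spec))                            # falls back to the default
print(resolve_datetime(date_spec, datetime(2021, 4, 25, 6)))  # a later date passes
```

A date before the minimum raises a ValueError, which is the behaviour a catalog user would want instead of a confusing missing-file error from S3.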

With the parameters defined we can now use them to create the urls using Jinja syntax.

urls = ["s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}",
        "s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}"]

hrrr_zarr['args']['urlpath'] = urls

Now that we have a more generalized source, some of the metadata is too specific. To fix this we will just remove the data_vars section from the source’s metadata.

hrrr_zarr['metadata'].pop('data_vars', None)
print(yaml.dump(source_dict))
sources:
  hrrrzarr:
    args:
      storage_options:
        anon: true
      urlpath:
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
    description: Mesowest's HRRR data. See readme source for more information.
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      coords: !!python/tuple
      - projection_x_coordinate
      - projection_y_coordinate
      dims:
        projection_x_coordinate: 1799
        projection_y_coordinate: 1059
    parameters:
      date:
        default: '2016-08-24T00:00:00'
        description: Date and hour of data.
        min: '2016-08-24T00:00:00'
        type: datetime
      level:
        default: surface
        description: Parameter specifying level in the atmosphere. Corresponds to
          'Vertical Level' column in data_dictionary
        type: str
      param:
        default: TMP
        description: Specifies what parameter your dataset will contain. Corresponds
          to 'Parameter Short Name' in data_dictionary
        type: str

Documenting the Data

We have created a data source, but it may be a little tricky to use. We need a way to let users know what options they have for the level and param user parameters we defined earlier. The inventory.csv file, created from Mesowest’s HRRR Zarr Variable List, in this directory contains a table which shows what parameters are in the Zarr store and at what level in the atmosphere those parameters are available. Let’s open it as a source and add it to our source dictionary.

source = intake.open_csv('inventory.csv')
source.name = 'data_dictionary'
source.description = 'Describes the data in the hrrrzarr source'

The Vertical Level column corresponds to the level parameter in our data source and the Parameter Short Name column corresponds to the param parameter.
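As an illustration of how a user might consult such a table, the sketch below looks up the levels at which a parameter is available. The two sample rows are placeholders built from level and parameter pairs that appear elsewhere in this lesson, not the actual contents of inventory.csv, and the helper name levels_for is hypothetical.

```python
import csv
import io

# Illustrative stand-in for inventory.csv; the real file has more rows
# and may have more columns
SAMPLE_INVENTORY = """\
Vertical Level,Parameter Short Name
surface,TMP
top_of_atmosphere,USWRF
"""


def levels_for(param, rows):
    """Return the sorted vertical levels at which `param` is listed."""
    return sorted({row["Vertical Level"]
                   for row in rows
                   if row["Parameter Short Name"] == param})


rows = list(csv.DictReader(io.StringIO(SAMPLE_INVENTORY)))
print(levels_for("TMP", rows))
```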

source
data_dictionary:
  args:
    urlpath: inventory.csv
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata: {}

This source is almost how we want it, but the urlpath will not work after we push our catalog to Github. Intake sets the CATALOG_DIR parameter to point to whatever directory the catalog file is in. Using this parameter we can generate a url that will work even after we push the repository to Github.

sources['data_dictionary'] = yaml.load(source.yaml(), 
                                       Loader=yaml.CLoader)['sources']['data_dictionary']
data_dictionary_args = sources['data_dictionary']['args']
data_dictionary_args['urlpath'] = "{{ CATALOG_DIR }}/inventory.csv"
print(yaml.dump(sources))
data_dictionary:
  args:
    urlpath: '{{ CATALOG_DIR }}/inventory.csv'
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata: {}
hrrrzarr:
  args:
    storage_options:
      anon: true
    urlpath:
    - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
    - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
  description: Mesowest's HRRR data. See readme source for more information.
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    coords: !!python/tuple
    - projection_x_coordinate
    - projection_y_coordinate
    dims:
      projection_x_coordinate: 1799
      projection_y_coordinate: 1059
  parameters:
    date:
      default: '2016-08-24T00:00:00'
      description: Date and hour of data.
      min: '2016-08-24T00:00:00'
      type: datetime
    level:
      default: surface
      description: Parameter specifying level in the atmosphere. Corresponds to 'Vertical
        Level' column in data_dictionary
      type: str
    param:
      default: TMP
      description: Specifies what parameter your dataset will contain. Corresponds
        to 'Parameter Short Name' in data_dictionary
      type: str

Your source now points to an inventory.csv file in the same directory as your catalog. Be sure to copy the file into your “intake-demo” directory.
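To see why the CATALOG_DIR template keeps working wherever the catalog lives, the sketch below substitutes the placeholder by hand for a local path and for a raw Github url. It is a stand-in for illustration only: Intake performs this step with Jinja, and the helper name render_urlpath is hypothetical.

```python
import posixpath
import re

# Matches the placeholder with or without spaces inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*CATALOG_DIR\s*\}\}")


def render_urlpath(template, catalog_location):
    """Replace the CATALOG_DIR placeholder with the catalog file's directory."""
    directory = posixpath.dirname(catalog_location)
    # A function replacement avoids any special handling of backslashes
    return PLACEHOLDER.sub(lambda _match: directory, template)


template = "{{ CATALOG_DIR }}/inventory.csv"
print(render_urlpath(template, "intake-demo/catalog.yml"))
print(render_urlpath(
    template,
    "https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml",
))
```

The same template resolves next to the catalog in both cases, so the catalog entry never needs to change when the repository moves.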

Now that we have a source and a data dictionary to describe it, let’s add a readme to our catalog to explain how to use it and give some usage examples. This readme will also be displayed as the readme for the repository on Github. In this directory there is an example readme markdown file to use. Go ahead and copy it into the “intake-demo” directory.

md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}
source = intake.open_markdown('README.md', md_kwargs=md_kwargs)
source.name = 'readme'
source.description = 'Learn more about how to use this catalog'
source.read()
README

Intake Demo Readme

This repository was made as an educational resource to learn the Python library Intake. It contains an intake catalog pointing to Mesowest's High-Resolution Rapid Refresh (HRRR) Zarr data.

Dependencies

  • requests
  • aiohttp
  • intake
  • s3fs
  • Intake-xarray
  • Intake-markdown
  • cartopy

About the Data

High-Resolution Rapid Refresh (HRRR) is an atmospheric model maintained by NOAA. As stated on NOAA's website:

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

In this cookbook we use a subset of HRRR data maintained by Mesowest on AWS S3 object storage.

HRRR Projection

From Mesowest's HRRR Zarr Data Loading Guide

The projection description for the spatial grid, which is stored in the HRRR grib2 files, is not available directly in the zarr arrays. Additionally, the zarr data has the x and y coordinates in the native Lambert Conformal Conic projection, but not the latitude and longitude data. The various use cases will have more specific examples of how to handle this information, but here we'll note that:

The proj params you need to define the grid correctly are: {'a': 6371229, 'b': 6371229, 'proj': 'lcc', 'lon_0': 262.5, 'lat_0': 38.5, 'lat_1': 38.5, 'lat_2': 38.5}

We also offer a "chunk index" that has the latitude and longitude coordinates for the grid, more on that under Use Case 3.

The following code sets up the correct CRS in cartopy:

projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))

Data Dictionary

This catalog contains a csv source called data_dictionary that describes the data in the hrrrzarr source. The values in the Vertical Level and Parameter Short Name columns are arguments that can be passed to the hrrrzarr source's level and param user parameters respectively.

Usage Example

import intake
import datetime as dt
user_parameters = {'level': 'top_of_atmosphere',
                   'param': 'USWRF',
                   'date': dt.datetime(2021, 4, 25, 6)}

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

ds = cat.hrrrzarr(**user_parameters).read()

Combining a Day's Worth of Data

import intake
import cartopy.crs as ccrs
import datetime as dt
import xarray as xr

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))
date = dt.datetime(2019, 8, 9)
hours = [date + dt.timedelta(hours=i) for i in range(4)]
datasets = [cat.hrrrzarr(date=hour).read().set_coords("time") for hour in hours]
ds = xr.concat(datasets, dim='time', combine_attrs="drop_conflicts")


End of README

The values of pre and post in the md_kwargs dictionary are used by intake-markdown to add extra markdown before and after the markdown source. In this example we use details and summary tags to enclose the readme in a dropdown.
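The effect of pre and post can be pictured as simple string concatenation around the file's contents. The sketch below is an assumption about the behaviour described above, not intake-markdown's actual code, and the helper name wrap_markdown is hypothetical.

```python
def wrap_markdown(body, pre="", post=""):
    """Return `body` with extra markdown placed before and after it."""
    return f"{pre}{body}{post}"


# Same wrapper text as the md_kwargs dictionary used above
pre = "<details markdown='1'>\n<summary>README</summary>\n"
post = "\n<br>\nEnd of README\n</details>"

wrapped = wrap_markdown("# Intake Demo Readme\n", pre=pre, post=post)
print(wrapped)
```

Because the opening details tag comes from pre and the closing tag from post, both values have to be supplied together for the dropdown to render correctly.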

We will change the urlpath of this source in the same way as the data dictionary to ensure the readme loads correctly.

sources['readme'] = yaml.load(source.yaml(), Loader=yaml.CLoader)['sources']['readme']
readme_args = sources['readme']['args']
readme_args['urlpath'] = "{{ CATALOG_DIR }}/README.md"
print(yaml.dump(source_dict))
sources:
  data_dictionary:
    args:
      urlpath: '{{ CATALOG_DIR }}/inventory.csv'
    description: Describes the data in the hrrrzarr source
    driver: intake.source.csv.CSVSource
    metadata: {}
  hrrrzarr:
    args:
      storage_options:
        anon: true
      urlpath:
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
    description: Mesowest's HRRR data. See readme source for more information.
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      coords: !!python/tuple
      - projection_x_coordinate
      - projection_y_coordinate
      dims:
        projection_x_coordinate: 1799
        projection_y_coordinate: 1059
    parameters:
      date:
        default: '2016-08-24T00:00:00'
        description: Date and hour of data.
        min: '2016-08-24T00:00:00'
        type: datetime
      level:
        default: surface
        description: Parameter specifying level in the atmosphere. Corresponds to
          'Vertical Level' column in data_dictionary
        type: str
      param:
        default: TMP
        description: Specifies what parameter your dataset will contain. Corresponds
          to 'Parameter Short Name' in data_dictionary
        type: str
  readme:
    args:
      md_kwargs:
        post: '

          <br>

          End of README

          </details>'
        pre: '<details markdown=''1''>

          <summary>README</summary>

          '
      urlpath: '{{ CATALOG_DIR }}/README.md'
    description: Learn more about how to use this catalog
    driver: intake_markdown.intake_markdown.MarkdownSource
    metadata: {}

With all our sources made, we will add them to the catalog, and save the catalog.

catalog['sources'] = sources
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)

At this point you should have three files in your “intake-demo” directory: “catalog.yml”, “inventory.csv”, and “README.md”. All we need to do now is commit our changes and push them to Github.

git add .
git commit -m "initial commit"
git push
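Before pushing, it can be worth confirming that all three files really are in the repository directory, since the catalog's CATALOG_DIR paths depend on them sitting next to catalog.yml. The check below is a minimal sketch; the helper name missing_files is hypothetical.

```python
from pathlib import Path

# The three files this lesson places in the repository
EXPECTED_FILES = ("catalog.yml", "inventory.csv", "README.md")


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means the directory is ready to commit
print(missing_files("intake-demo"))
```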

Testing the Catalog

Now that your catalog is on Github, let’s try using it. In the cell below, replace the url with the url pointing to the raw catalog file on your Github account.

cat = intake.open_catalog('https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml')
list(cat)
['data_dictionary', 'hrrrzarr', 'readme']
cat.readme.read()

cat.data_dictionary
data_dictionary:
  args:
    keep_default_na: false
    urlpath: https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/inventory.csv
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks
cat.hrrrzarr.read()
<xarray.Dataset> Size: 4MB
Dimensions:                  (projection_y_coordinate: 1059,
                              projection_x_coordinate: 1799)
Coordinates:
  * projection_x_coordinate  (projection_x_coordinate) float64 14kB -2.698e+0...
  * projection_y_coordinate  (projection_y_coordinate) float64 8kB -1.587e+06...
Data variables:
    TMP                      (projection_y_coordinate, projection_x_coordinate) float16 4MB ...
    forecast_period          timedelta64[ns] 8B 00:00:00
    forecast_reference_time  datetime64[ns] 8B 2016-08-24
    height                   float64 8B 1e+03
    pressure                 float64 8B 2.5e+04
    time                     datetime64[ns] 8B 2016-08-24
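The raw-file url passed to open_catalog above follows a regular pattern of owner, repository, branch, and path. The helper below is a small sketch for building it; the function name raw_github_url is hypothetical, and you would substitute your own account name, repository, and branch.

```python
def raw_github_url(owner, repo, branch, path):
    """Build the raw.githubusercontent.com url for a file in a repository."""
    # Strip a leading slash so the path joins cleanly
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path.lstrip('/')}"


url = raw_github_url("ProjectPythia", "intake-cookbook", "main",
                     "notebooks/catalog.yml")
print(url)
```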

Summary

In this tutorial we learned to create Intake catalogs and host them on Github. We learned to create sources with Intake and then modify them to make them more general. We explored a possible method for documenting data by adding a readme and data dictionary to our catalog. These guidelines will help you make your data more accessible to collaborators.
