Creating Intake Catalogs

Overview

In the last lesson we learned to use Intake catalogs to simplify access to research data. In this lesson we will walk through the steps of creating a catalog for your own research data by recreating the catalog from the previous lesson.

Creating an Intake catalog
Documenting your data source
Sharing your catalog on GitHub

Prerequisites

Concepts	Importance	Notes
Intro to Intake	Necessary
Understanding of YAML	Necessary
Getting Started with Github	Necessary
Intro to Pandas	Helpful

Time to learn: 45 minutes

Imports

import intake
import intake_xarray
import intake_markdown
import requests
import aiohttp
import s3fs
import yaml
import json
import datetime
import os

Setting up the Environment

By the end of this tutorial we will have created a git repository that we can host on GitHub to share our catalog.

Start by creating a GitHub repository called “intake-demo”, and then clone the repository to your local machine. Be sure to replace path/to/Github/repository with the URL of the repository you just made in the following command.

git clone path/to/Github/repository

An Intake catalog can be as simple as a single YAML file. We can create the YAML file programmatically by converting nested Python dictionaries to YAML. An Intake catalog has two main parts: metadata and sources. The metadata can be arbitrary, with a few exceptions. The sources section is a mapping between each data source name and its properties. For more information about Intake catalogs, see Intake’s documentation.

description = "Catalog containing Mesowest's HRRR data. See readme source for more information."

catalog = {'metadata': {'version': 1,
                       'description': description},
           'sources': {}}

os.makedirs("intake-demo", exist_ok=True)  # only needed for building this notebook
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)

You will now notice a new file in your “intake-demo” directory called “catalog.yml” with the following contents.

with open('intake-demo/catalog.yml', 'r') as f:
    print(f.read())

metadata:
  description: Catalog containing Mesowest's HRRR data. See readme source for more
    information.
  version: 1
sources: {}
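
As a quick sanity check, the following minimal sketch writes a catalog dictionary to YAML and reads it back to confirm that the two top-level parts survive the round trip. It assumes PyYAML is installed, and it writes to a temporary directory (an assumption made here so the example does not touch your “intake-demo” directory). The description text is a placeholder.

```python
import os
import tempfile

import yaml

# Skeleton catalog with the two main parts: metadata and sources.
catalog = {
    "metadata": {"version": 1, "description": "Placeholder description"},
    "sources": {},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "catalog.yml")

    # Write the nested dictionary as YAML.
    with open(path, "w") as f:
        yaml.dump(catalog, f)

    # Read it back with the safe loader, which handles plain mappings.
    with open(path) as f:
        loaded = yaml.safe_load(f)

# Both top-level keys should be present after the round trip.
has_both_parts = {"metadata", "sources"} <= set(loaded)
print(has_both_parts)
```

Note that yaml.dump sorts keys alphabetically by default, which is why description appears before version in the file shown above.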

Documenting the Data

We have created a data source, but it may be a little tricky to use. We need a way to let users know what options they have for the level and param user parameters we defined earlier. The inventory.csv file, created from Mesowest’s HRRR Zarr Variable List, in this directory contains a table which shows what parameters are in the Zarr store and at what level in the atmosphere those parameters are available. Let’s open it as a source and add it to our source dictionary.

source = intake.open_csv('inventory.csv')
source.name = 'data_dictionary'
source.description = 'Describes the data in the hrrrzarr source'

The Vertical Level column corresponds to the level parameter in our data source, and the Parameter Short Name column corresponds to the param parameter.
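
To make the mapping concrete, here is a small sketch that groups parameter names by level using only the standard library. The two-row table is a hypothetical excerpt built from values that appear elsewhere in this lesson (the source defaults and the readme usage example); the real inventory.csv has many more rows and may contain additional columns.

```python
import csv
import io

# Hypothetical excerpt of inventory.csv; only the two column names are taken
# from this lesson, and the real file is much longer.
inventory_text = """Vertical Level,Parameter Short Name
surface,TMP
top_of_atmosphere,USWRF
"""


def params_by_level(csv_text):
    """Group 'Parameter Short Name' values under their 'Vertical Level'."""
    levels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        levels.setdefault(row["Vertical Level"], []).append(
            row["Parameter Short Name"]
        )
    return levels


options = params_by_level(inventory_text)
print(options)
```

Each key is a value that could be passed as level, and each entry in its list is a value that could be passed as param for that level.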

source

data_dictionary:
  args:
    urlpath: inventory.csv
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata: {}

This source is almost how we want it, but the urlpath will not work after we push our catalog to GitHub, because inventory.csv is a relative local path. Intake sets the CATALOG_DIR parameter to point to whatever directory the catalog file is in. Using this parameter, we can generate a URL that will work even after we push the repository to GitHub.

sources['data_dictionary'] = yaml.load(source.yaml(), 
                                       Loader=yaml.CLoader)['sources']['data_dictionary']
data_dictionary_args = sources['data_dictionary']['args']
data_dictionary_args['urlpath'] = "{{ CATALOG_DIR }}/inventory.csv"
print(yaml.dump(sources))

data_dictionary:
  args:
    urlpath: '{{ CATALOG_DIR }}/inventory.csv'
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata: {}
hrrrzarr:
  args:
    storage_options:
      anon: true
    urlpath:
    - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
    - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
  description: Mesowest's HRRR data. See readme source for more information.
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    coords: !!python/tuple
    - projection_x_coordinate
    - projection_y_coordinate
    dims:
      projection_x_coordinate: 1799
      projection_y_coordinate: 1059
  parameters:
    date:
      default: '2016-08-24T00:00:00'
      description: Date and hour of data.
      min: '2016-08-24T00:00:00'
      type: datetime
    level:
      default: surface
      description: Parameter specifying level in the atmosphere. Corresponds to 'Vertical
        Level' column in data_dictionary
      type: str
    param:
      default: TMP
      description: Specifies what parameter your dataset will contain. Corresponds
        to 'Parameter Short Name' in data_dictionary
      type: str

The data_dictionary source now points to an inventory.csv file in the same directory as your catalog. Be sure to copy the file into your “intake-demo” directory.
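
The effect of the CATALOG_DIR placeholder can be illustrated with a simplified sketch. This is not Intake's implementation (Intake performs the substitution itself when it loads the catalog); the helper name resolve_catalog_dir is hypothetical, and the local case is simplified to a relative path.

```python
def resolve_catalog_dir(template, catalog_location):
    """Replace the {{ CATALOG_DIR }} placeholder with the catalog's parent location.

    Simplified illustration only: the parent is taken to be everything
    before the last '/' in the catalog's path or URL.
    """
    catalog_dir = catalog_location.rsplit("/", 1)[0]
    return template.replace("{{ CATALOG_DIR }}", catalog_dir)


template = "{{ CATALOG_DIR }}/inventory.csv"

# Catalog opened from a local clone of the repository.
local = resolve_catalog_dir(template, "intake-demo/catalog.yml")

# Catalog opened from a raw GitHub URL.
remote = resolve_catalog_dir(
    template,
    "https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml",
)

print(local)
print(remote)
```

The same catalog entry therefore resolves to a local file when the catalog is opened locally and to a URL when it is opened remotely.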

Now that we have a source and a data dictionary to describe it, let's add a readme to our catalog that explains how to use it and gives some usage examples. This readme will also be displayed as the readme for the repository on GitHub. This directory contains an example readme Markdown file; go ahead and copy it into the “intake-demo” directory.

md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}
source = intake.open_markdown('README.md', md_kwargs=md_kwargs)
source.name = 'readme'
source.description = 'Learn more about how to use this catalog'
source.read()

README

Intake Demo Readme

This repository was made as an educational resource for learning the Python library Intake. It contains an Intake catalog pointing to Mesowest's High-Resolution Rapid Refresh (HRRR) Zarr data.

Dependencies

requests
aiohttp
intake
s3fs
Intake-xarray
Intake-markdown
cartopy

About the Data

High-Resolution Rapid Refresh (HRRR) is an atmospheric model maintained by NOAA. As stated on NOAA's website:

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

In this cookbook we use a subset of HRRR data maintained by Mesowest on AWS S3 object storage.

HRRR Projection

From Mesowest's HRRR Zarr Data Loading Guide

The projection description for the spatial grid, which is stored in the HRRR grib2 files, is not available directly in the zarr arrays. Additionally, the zarr data has the x and y coordinates in the native Lambert Conformal Conic projection, but not the latitude and longitude data. The various use cases will have more specific examples of how to handle this information, but here we'll note that: The proj params you need to define the grid correctly are: {'a': 6371229, 'b': 6371229, 'proj': 'lcc', 'lon_0': 262.5, 'lat_0': 38.5, 'lat_1': 38.5, 'lat_2': 38.5} We also offer a "chunk index" that has the latitude and longitude coordinates for the grid, more on that under Use Case 3. The following code sets up the correct CRS in cartopy:

projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))

Data Dictionary

This catalog contains a CSV source called data_dictionary that describes the data in the hrrrzarr source. The values in the Vertical Level and Parameter Short Name columns are arguments that can be passed to the hrrrzarr source's level and param user parameters, respectively.

Usage Example

import intake
import datetime as dt
user_parameters = {'level': 'top_of_atmosphere',
                   'param': 'USWRF',
                   'date': dt.datetime(2021, 4, 25, 6)}

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

ds = cat.hrrrzarr(**user_parameters).read()

Combining a Day's Worth of Data

import intake
import cartopy.crs as ccrs
import datetime as dt
import xarray as xr

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))
date = dt.datetime(2019, 8, 9)
hours = [date + dt.timedelta(hours=i) for i in range(4)]
datasets = [cat.hrrrzarr(date=hour).read().set_coords("time") for hour in hours]
ds = xr.concat(datasets, dim='time', combine_attrs="drop_conflicts")

End of README

The values of pre and post in the md_kwargs dictionary are used by intake-markdown to add extra markdown before and after the markdown source. In this example we use details and summary tags to enclose the readme in a dropdown.
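
The idea can be sketched without intake-markdown at all. The helper below is hypothetical and assumes the extra markdown is simply concatenated around the body; the exact way intake-markdown joins the pieces may differ.

```python
md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}


def wrap_markdown(body, pre="", post=""):
    """Place extra markdown before and after a markdown body (assumed: plain concatenation)."""
    return f"{pre}{body}{post}"


# A one-line stand-in for the contents of README.md.
wrapped = wrap_markdown("# Intake Demo Readme\n", **md_kwargs)
print(wrapped)
```

The printed text starts with the opening details tag and ends with the closing one, so the body is displayed inside a collapsible block.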

We will change the urlpath of this source in the same way as the data dictionary to ensure the readme loads correctly.

sources['readme'] = yaml.load(source.yaml(), Loader=yaml.CLoader)['sources']['readme']
readme_args = sources['readme']['args']
readme_args['urlpath'] = "{{ CATALOG_DIR }}/README.md"
print(yaml.dump(source_dict))

sources:
  data_dictionary:
    args:
      urlpath: '{{ CATALOG_DIR }}/inventory.csv'
    description: Describes the data in the hrrrzarr source
    driver: intake.source.csv.CSVSource
    metadata: {}
  hrrrzarr:
    args:
      storage_options:
        anon: true
      urlpath:
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
    description: Mesowest's HRRR data. See readme source for more information.
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      coords: !!python/tuple
      - projection_x_coordinate
      - projection_y_coordinate
      dims:
        projection_x_coordinate: 1799
        projection_y_coordinate: 1059
    parameters:
      date:
        default: '2016-08-24T00:00:00'
        description: Date and hour of data.
        min: '2016-08-24T00:00:00'
        type: datetime
      level:
        default: surface
        description: Parameter specifying level in the atmosphere. Corresponds to
          'Vertical Level' column in data_dictionary
        type: str
      param:
        default: TMP
        description: Specifies what parameter your dataset will contain. Corresponds
          to 'Parameter Short Name' in data_dictionary
        type: str
  readme:
    args:
      md_kwargs:
        post: '

          <br>

          End of README

          </details>'
        pre: '<details markdown=''1''>

          <summary>README</summary>

          '
      urlpath: '{{ CATALOG_DIR }}/README.md'
    description: Learn more about how to use this catalog
    driver: intake_markdown.intake_markdown.MarkdownSource
    metadata: {}
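
To see what the templated urlpath entries of the hrrrzarr source expand to, the following sketch reproduces the substitution by hand with the standard library. It is only an illustration of the resulting strings, not how Intake renders templates internally; the function name render_hrrrzarr_urlpaths is hypothetical, and the defaults mirror the level and param defaults in the catalog above.

```python
import datetime as dt


def render_hrrrzarr_urlpaths(date, level="surface", param="TMP"):
    """Build by hand the two paths that the urlpath templates describe."""
    # Same strftime pattern as in the catalog's urlpath entries.
    store = date.strftime("%Y%m%d/%Y%m%d_%Hz_anl.zarr")
    base = f"s3://hrrrzarr/sfc/{store}/{level}/{param}"
    # The second template repeats the level as a final path component.
    return [base, f"{base}/{level}"]


paths = render_hrrrzarr_urlpaths(
    dt.datetime(2021, 4, 25, 6), level="top_of_atmosphere", param="USWRF"
)
for path in paths:
    print(path)
```

The parameter values used here are the ones from the readme's usage example.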

With all our sources made, we will add them to the catalog, and save the catalog.

catalog['sources'] = sources
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)

At this point you should have three files in your “intake-demo” directory: “catalog.yml”, “inventory.csv”, and “README.md”. All we need to do now is commit our changes and push them to GitHub.

git add .
git commit -m "initial commit"
git push
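
If you want to confirm that all three files are in place before running the git commands above, a small check like the following can help. The helper name missing_files is hypothetical, and the demonstration runs against a temporary directory rather than your real repository.

```python
import tempfile
from pathlib import Path

# The three files this lesson expects in the repository.
EXPECTED = ("catalog.yml", "inventory.csv", "README.md")


def missing_files(repo_dir, expected=EXPECTED):
    """Return the expected file names that are not present in repo_dir."""
    repo = Path(repo_dir)
    return [name for name in expected if not (repo / name).is_file()]


# Demonstration: a directory that contains only the catalog file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "catalog.yml").write_text("sources: {}\n")
    demo_missing = missing_files(tmp)

print(demo_missing)
```

Calling missing_files("intake-demo") on a complete repository should return an empty list.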

Testing the Catalog

Now that your catalog is on GitHub, let’s try using it. In the cell below, replace the URL with the one pointing to the raw catalog file on your GitHub account.

cat = intake.open_catalog('https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml')
list(cat)

['data_dictionary', 'hrrrzarr', 'readme']
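
The raw-file URLs used in this lesson all follow the same pattern, so the URL for your own repository can be assembled as in the sketch below. The helper name raw_catalog_url is hypothetical, and the defaults assume a branch named main with catalog.yml at the repository root.

```python
def raw_catalog_url(user, repo, branch="main", path="catalog.yml"):
    """Build the raw.githubusercontent.com URL for a file in a GitHub repository."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"


# Catalog at the repository root.
root_url = raw_catalog_url("jnmorley", "intake_demo")

# Catalog stored in a subdirectory of the repository.
nested_url = raw_catalog_url(
    "ProjectPythia", "intake-cookbook", path="notebooks/catalog.yml"
)

print(root_url)
print(nested_url)
```

Pass the resulting string to intake.open_catalog in place of the URL above.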

cat.readme.read()

cat.data_dictionary

data_dictionary:
  args:
    keep_default_na: false
    urlpath: https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/inventory.csv
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks

Summary

In this tutorial we learned to create Intake catalogs and host them on GitHub. We learned to create sources with Intake and then modify them to make them more general. We explored one possible method for documenting data: adding a readme and a data dictionary to our catalog. These guidelines will help you make your data more accessible to collaborators.
