Creating Intake Catalogs¶

Overview¶

In the last lesson we learned to use Intake catalogs to simplify the process of accessing research data. In this lesson we will walk through the steps of creating a catalog for your research data by recreating the catalog in the previous lesson.

Creating an Intake catalog
Documenting your data source
Sharing your catalog on GitHub

Prerequisites¶

Concepts	Importance	Notes
Intro to Intake	Necessary
Understanding of YAML	Necessary
Getting Started with GitHub	Necessary
Intro to Pandas	Helpful

Time to learn: 45 minutes

Imports¶

import intake
import intake_xarray
import intake_markdown
import requests
import aiohttp
import s3fs
import yaml
import json
import datetime
import os

Setting up the Environment¶

By the end of this tutorial we will have created a git repository that we can host on GitHub to share our catalog.

Start by creating a GitHub repository called “intake-demo”, and then clone it to your local machine. In the following command, be sure to replace path/to/Github/repository with the URL of the repository you just made.

git clone path/to/Github/repository

An Intake catalog can be as simple as a single YAML file. We can create that file programmatically by converting nested Python dictionaries to YAML. An Intake catalog has two main parts: metadata and sources. The metadata can be arbitrary, with a few exceptions. The sources section is a mapping between each data source's name and its properties. For more information about Intake catalogs, see Intake’s documentation.

description = "Catalog containing Mesowest's HRRR data. See readme source for more information."

catalog = {'metadata': {'version': 1,
                       'description': description},
           'sources': {}}

os.makedirs("intake-demo", exist_ok=True) #only needed for building this notebook
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)

You will now notice a new file in your “intake-demo” directory called “catalog.yml” with the following contents.

with open('intake-demo/catalog.yml', 'r') as f:
    print(f.read())

metadata:
  description: Catalog containing Mesowest's HRRR data. See readme source for more
    information.
  version: 1
sources: {}
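
As a quick sanity check on the two-part structure described above, the sketch below looks for the metadata and sources sections and confirms that every source names a driver. The check_catalog helper is hypothetical and is not part of Intake; it only illustrates the expected layout and does not replace Intake's own validation.

```python
def check_catalog(catalog):
    """Return a list of structural problems found in a catalog dictionary."""
    problems = []
    # A catalog is expected to have both top-level sections.
    for section in ("metadata", "sources"):
        if section not in catalog:
            problems.append(f"missing top-level section: {section}")
    sources = catalog.get("sources", {})
    if not isinstance(sources, dict):
        problems.append("sources must be a mapping of name -> properties")
        return problems
    # Each source entry should say which driver loads it.
    for name, entry in sources.items():
        if "driver" not in entry:
            problems.append(f"source {name!r} does not name a driver")
    return problems


catalog = {"metadata": {"version": 1, "description": "demo"}, "sources": {}}
print(check_catalog(catalog))  # an empty list means no problems were found
```

An empty sources mapping passes this check, which matches the catalog file written so far.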

Adding Your First Data Source¶

Intake only knows how to handle a few data formats on its own. To handle other formats it uses pluggable drivers. To use Mesowest’s HRRR Zarr data we will use the intake-xarray package, which provides a driver for reading Zarr data into Xarray datasets. Drivers are installed as Python packages and integrate into the Intake library. When a driver package is installed, Intake creates an open_{driver} method for each driver in the package. Installing the intake-xarray package therefore lets us access Zarr data using the open_zarr method.

Mesowest’s HRRR Zarr data is stored in AWS. The file structure of the hrrrzarr S3 bucket looks like

s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level

where

yyyy = four digit year
mm = two digit month
dd = two digit day of month
hh = two digit hour of the day
level = level of the atmosphere the data describes
param = the parameter you are interested in

To load a complete dataset we need the Zarr arrays from two urls

s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level

s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param
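
The URL pattern above can be filled in with ordinary date formatting. The following is a minimal sketch; the hrrr_urls helper is a name made up for this illustration and is not part of Intake or the Mesowest tooling.

```python
from datetime import datetime


def hrrr_urls(date, level, param):
    """Build the two Zarr array URLs for one analysis time, level, and parameter."""
    # %Y%m%d gives yyyymmdd and %H gives the two-digit hour.
    stem = date.strftime("s3://hrrrzarr/sfc/%Y%m%d/%Y%m%d_%Hz_anl.zarr")
    group = f"{stem}/{level}/{param}"
    return [f"{group}/{level}", group]


for url in hrrr_urls(datetime(2016, 8, 24, 0), "surface", "TMP"):
    print(url)
```

For August 24, 2016 at 00z this prints the same two surface temperature URLs that are used in the next step.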

Let's load surface temperature data from August 24, 2016.

urls = ['s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface',
           's3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP']

source = intake.open_zarr(urls, storage_options={"anon": True})

source.name = 'hrrrzarr'
source.description = "Mesowest's HRRR data. See readme source for more information."
ds = source.read()
ds

Above we used the storage_options argument to tell Intake how to access the data on AWS; in this case we accessed it as an anonymous user. Zarr stores may also contain consolidated metadata. When they do, passing consolidated=True, which tells Xarray how to load the store's metadata, can improve performance significantly. The call above does not pass it.

When you use one of Intake’s open_{driver} methods, Intake creates a catalog entry for the source. You can view that entry as YAML with the source’s yaml method. The method belongs to Intake's version 1 API; the AttributeError shown below indicates that the environment used to build this page did not provide it.

print(source.yaml())

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 print(source.yaml())

AttributeError: 'ZarrSource' object has no attribute 'yaml'

Modifying the Source¶

If we wanted, we could add this YAML to our catalog and then load this data using Intake. However, there are many datasets we can load from the Zarr store with almost the same catalog entry. Making a separate entry for each would make the catalog cluttered and harder to use. Instead we will generalize this catalog entry so it applies to many datasets. Then we will create user parameters to give the catalog user the ability to select the data they want.

To generalize the source we need Intake to dynamically generate URLs pointing to the Zarr arrays based on user-set parameters. We will take the source created by the open_zarr method, convert it to a Python dictionary, and then modify it to include user parameters. We can then use those parameters to generate the URLs. Intake provides Jinja templating in catalogs to make this simple. Let’s start by defining user parameters.

source_dict = yaml.load(source.yaml(), Loader=yaml.CLoader)

parameters = {}
parameters['level'] = {'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
                       'type': 'str',
                       'default': 'surface'}

parameters['param'] = {'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
                       'type': 'str',
                       'default': 'TMP'}

parameters['date'] = {'description': "Date and hour of data.",
                      'type': 'datetime',
                      'default': "2016-08-24T00:00:00",
                      'min': "2016-08-24T00:00:00"}



sources = source_dict['sources']
hrrr_zarr = sources['hrrrzarr']
hrrr_zarr['parameters'] = parameters
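
To make the role of the 'type' and 'min' fields in the date parameter more concrete, here is a rough sketch of the effect they are meant to have: the value is parsed as a datetime and rejected if it is earlier than the minimum. Intake does its own coercion and checking; coerce_date is a hypothetical helper written only for this illustration.

```python
from datetime import datetime


def coerce_date(value, minimum="2016-08-24T00:00:00"):
    """Parse an ISO 8601 string and reject values earlier than `minimum`."""
    date = datetime.fromisoformat(value)
    earliest = datetime.fromisoformat(minimum)
    if date < earliest:
        raise ValueError(f"date {value} is earlier than the minimum {minimum}")
    return date


print(coerce_date("2016-08-24T06:00:00"))
```

A request for a date before the minimum, such as "2016-08-23T00:00:00", raises a ValueError in this sketch.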

With the parameters defined we can now use them to create the urls using Jinja syntax.

urls = ["s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}",
        "s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}"]

hrrr_zarr['args']['urlpath'] = urls

Now that we have a more generalized source, some of the metadata is too specific. To fix this we will remove the data_vars section from the source’s metadata.

hrrr_zarr['metadata'].pop('data_vars', None)
print(yaml.dump(source_dict))

Documenting the Data¶

We have created a data source, but it may be a little tricky to use. We need a way to let users know what options they have for the level and param user parameters we defined earlier. The inventory.csv file, created from Mesowest’s HRRR Zarr Variable List, in this directory contains a table which shows what parameters are in the Zarr store and at what level in the atmosphere those parameters are available. Let’s open it as a source and add it to our source dictionary.

source = intake.open_csv('inventory.csv')
source.name = 'data_dictionary'
source.description = 'Describes the data in the hrrrzarr source'

The Vertical Level column corresponds to the level parameter in our data source, and the Parameter Short Name column corresponds to the param parameter.

source
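
As a sketch of how a catalog user might use those two columns, the example below lists the parameter short names available at a given vertical level. It reads a small inline stand-in rather than the real inventory.csv: only the surface/TMP pair comes from this lesson, and the "example_*" rows are placeholders. The params_for_level helper is hypothetical.

```python
import csv
import io

# Stand-in for inventory.csv: only the surface/TMP pair comes from this lesson;
# the "example_*" rows are placeholders, not real inventory entries.
SAMPLE = """Vertical Level,Parameter Short Name
surface,TMP
example_level,EXAMPLE_A
example_level,EXAMPLE_B
"""


def params_for_level(csv_text, level):
    """List the parameter short names available at one vertical level."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["Parameter Short Name"] for row in rows
            if row["Vertical Level"] == level]


print(params_for_level(SAMPLE, "surface"))
```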

This source is almost how we want it, but the urlpath will not work after we push our catalog to GitHub. Intake sets the CATALOG_DIR parameter to point to whatever directory the catalog file is in. Using this parameter we can generate a URL that will work even after we push the repository to GitHub.

sources['data_dictionary'] = yaml.load(source.yaml(), 
                                       Loader=yaml.CLoader)['sources']['data_dictionary']
data_dictionary_args = sources['data_dictionary']['args']
data_dictionary_args['urlpath'] = "{{ CATALOG_DIR }}/inventory.csv"
print(yaml.dump(sources))
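
The effect of the CATALOG_DIR template can be pictured as substituting the catalog's parent directory into the urlpath. The sketch below uses plain string replacement instead of Jinja, and the catalog locations are made-up examples; the exact string Intake produces may differ.

```python
import posixpath


def resolve_catalog_dir(catalog_location, template):
    """Substitute the catalog's parent directory for {{ CATALOG_DIR }}."""
    catalog_dir = posixpath.dirname(catalog_location)
    return template.replace("{{ CATALOG_DIR }}", catalog_dir)


template = "{{ CATALOG_DIR }}/inventory.csv"
# Hypothetical local and remote catalog locations.
print(resolve_catalog_dir("/home/user/intake-demo/catalog.yml", template))
print(resolve_catalog_dir("https://example.com/intake-demo/catalog.yml", template))
```

In both cases the resolved path sits next to the catalog file, which is why the same entry works locally and after the repository is published.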

Your source now points to an inventory.csv file in the same directory as your catalog. Be sure to copy the file into your “intake-demo” directory.

Now that we have a source and a data dictionary to describe it, let's add a readme to our catalog to explain how to use it and give some usage examples. This readme will also be displayed as the readme for the repository on GitHub. In this directory there is an example readme markdown file to use. Go ahead and copy it into the “intake-demo” directory.

md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}
source = intake.open_markdown('README.md', md_kwargs=md_kwargs)
source.name = 'readme'
source.description = 'Learn more about how to use this catalog'
source.read()

The values of pre and post in the md_kwargs dictionary are used by intake-markdown to add extra markdown before and after the markdown source. In this example we use details and summary tags to enclose the readme in a dropdown.
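
Conceptually, the result is the readme text with the pre string in front of it and the post string after it. The sketch below assumes simple concatenation and is not intake-markdown's actual code; wrap_markdown is a hypothetical helper.

```python
def wrap_markdown(text, pre="", post=""):
    """Return the markdown text with extra markdown before and after it."""
    return f"{pre}{text}{post}"


md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}
# A one-line stand-in for the contents of README.md.
print(wrap_markdown("# intake-demo", **md_kwargs))
```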

We will change the urlpath of this source in the same way as the data dictionary to ensure the readme loads correctly.

sources['readme'] = yaml.load(source.yaml(), Loader=yaml.CLoader)['sources']['readme']
readme_args = sources['readme']['args']
readme_args['urlpath'] = "{{ CATALOG_DIR }}/README.md"
print(yaml.dump(source_dict))

With all our sources made, we will add them to the catalog, and save the catalog.

catalog['sources'] = sources
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)
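
Because the catalog refers to inventory.csv and README.md relative to its own directory, it can help to confirm that all three files are present before committing. The following is a small sketch using only the standard library; missing_files is a hypothetical helper.

```python
from pathlib import Path

EXPECTED_FILES = ("catalog.yml", "inventory.csv", "README.md")


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


print(missing_files("intake-demo"))  # an empty list means everything is in place
```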

At this point you should have three files in your “intake-demo” directory: “catalog.yml”, “inventory.csv”, and “README.md”. All we need to do now is commit our changes and push them to GitHub.

git add .
git commit -m "initial commit"
git push

Testing the Catalog¶

Now that your catalog is on GitHub, let’s try using it. In the cell below, replace the URL with the one pointing to the raw catalog file in your GitHub repository.

cat = intake.open_catalog('https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml')
list(cat)

cat.readme.read()

cat.data_dictionary

cat.hrrrzarr.read()
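
The raw file URL used above follows GitHub's raw.githubusercontent.com pattern of user, repository, branch, and path. The sketch below builds such a URL; it assumes the default branch is named main and that catalog.yml sits at the top of the repository, and "your-username" is a placeholder. The raw_catalog_url helper is hypothetical.

```python
def raw_catalog_url(user, repo, branch="main", path="catalog.yml"):
    """Build the raw.githubusercontent.com URL for a file in a GitHub repository."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"


print(raw_catalog_url("your-username", "intake-demo"))
```

The resulting string is what you would pass to intake.open_catalog in place of the cookbook's own URL.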

Summary¶

In this tutorial we learned to create Intake catalogs and host them on GitHub. We learned to create sources with Intake and then modify them to make them more general. We explored a possible method for documenting data by adding a readme and data dictionary to our catalog. These guidelines will help you make your data more accessible to collaborators.
