Creating Intake Catalogs
Overview
In the last lesson we learned to use Intake catalogs to simplify the process of accessing research data. In this lesson we will walk through the steps of creating a catalog for your research data by recreating the catalog in the previous lesson.
- Creating an Intake catalog
- Documenting your data source
- Sharing your catalog on GitHub
Prerequisites
Concepts | Importance | Notes |
---|---|---|
Intro to Intake | Necessary | |
Understanding of YAML | Necessary | |
Getting Started with GitHub | Necessary | |
Intro to Pandas | Helpful | |
- Time to learn: 45 minutes
Imports
import intake
import intake_xarray
import intake_markdown
import requests
import aiohttp
import s3fs
import yaml
import json
import datetime
import os
Setting up the Environment
By the end of this tutorial we will have created a Git repository that we can host on GitHub to share our catalog.
Start by creating a GitHub repository called “intake-demo”, and then clone it to your local machine. In the following command, be sure to replace path/to/Github/repository with the URL (or path) of the repository you just made.
git clone path/to/Github/repository
Intake catalogs can be a simple YAML file. We can create the YAML file programmatically by converting nested Python dictionaries to YAML. An Intake catalog has two main parts: metadata and sources. The metadata can be arbitrary, with a few exceptions. The sources section is a mapping between a data source name and its properties. For more information about Intake catalogs, see Intake’s documentation.
description = "Catalog containing Mesowest's HRRR data. See readme source for more information."
catalog = {'metadata': {'version': 1,
                        'description': description},
           'sources': {}}
os.makedirs("intake-demo", exist_ok=True)  # only needed for building this notebook
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)
You will now notice a new file in your “intake-demo” directory called “catalog.yml” with the following contents.
with open('intake-demo/catalog.yml', 'r') as f:
    print(f.read())
metadata:
  description: Catalog containing Mesowest's HRRR data. See readme source for more
    information.
  version: 1
sources: {}
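To make the two-part layout concrete, the sketch below builds a catalog dictionary with one placeholder source and checks that it has the expected top-level keys. The helper name check_catalog, the source name example_source, and its driver and urlpath values are hypothetical and only illustrate the shape; Intake performs its own validation when it opens a catalog.

```python
def check_catalog(catalog):
    """Return a list of problems found in a catalog dictionary (empty if none)."""
    problems = []
    # A catalog has two main parts, and both should be mappings.
    for key in ("metadata", "sources"):
        if not isinstance(catalog.get(key), dict):
            problems.append(f"'{key}' must be a mapping")
    # Each source maps a name to a dictionary of properties.
    sources = catalog.get("sources")
    if isinstance(sources, dict):
        for name, props in sources.items():
            if not isinstance(props, dict):
                problems.append(f"source '{name}' must map to a dictionary of properties")
    return problems


# Hypothetical catalog with a single placeholder source entry.
example = {
    "metadata": {"version": 1, "description": "Example catalog"},
    "sources": {
        "example_source": {
            "description": "Placeholder entry",
            "driver": "csv",
            "args": {"urlpath": "data.csv"},
        }
    },
}

problems = check_catalog(example)
print(problems)
```

A check like this can be run on the dictionary before it is written out with yaml.dump.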
Adding Your First Data Source
Intake only knows how to handle a few data formats on its own. To handle other formats it uses pluggable drivers. To use Mesowest’s HRRR Zarr data we will use the intake-xarray package, which provides a driver for reading Zarr data into Xarray datasets. Drivers are installed as Python packages and integrate into the Intake library. When a driver package is installed, Intake creates an open_{driver} method for each driver in the package. Installing the intake-xarray package therefore allows us to access Zarr data using the open_zarr method.
Mesowest’s HRRR Zarr data is stored in AWS. The file structure of the hrrrzarr S3 bucket looks like
s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level
where
- yyyy = four digit year
- mm = two digit month
- dd = two digit day of month
- hh = two digit hour of the day
- level = level of the atmosphere the data describes
- param = the parameter you’re interested in
To load a complete dataset we need the Zarr arrays from two URLs:
s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level
s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param
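As a minimal sketch of the pattern above, the two URLs can be assembled from a datetime, a level, and a parameter with the standard library. The function name hrrr_urls is hypothetical and is not part of Intake or the hrrrzarr tooling.

```python
from datetime import datetime

BASE = "s3://hrrrzarr/sfc"


def hrrr_urls(when, level, param):
    """Build the two Zarr URLs for one analysis time, level, and parameter."""
    # yyyymmdd/yyyymmdd_hhz_anl.zarr, filled in from the datetime
    stamp = when.strftime("%Y%m%d/%Y%m%d_%Hz_anl.zarr")
    shorter = f"{BASE}/{stamp}/{level}/{param}"
    # The longer URL repeats the level after the parameter.
    return [f"{shorter}/{level}", shorter]


urls = hrrr_urls(datetime(2016, 8, 24, 0), "surface", "TMP")
for url in urls:
    print(url)
```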
Let’s load surface temperature data from August 24, 2016.
urls = ['s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface',
        's3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP']
source = intake.open_zarr(urls, storage_options={"anon": True})
source.name = 'hrrrzarr'
source.description = "Mesowest's HRRR data. See readme source for more information."
ds = source.read()
ds
Above we used the storage_options argument to tell Intake how to access data on AWS. In this case we accessed the data as an anonymous user. Although the call above does not pass it, a consolidated=True argument can also be given to tell Xarray how to load the metadata for this source. Zarr data may contain consolidated metadata; if it does, using it can increase performance significantly.
When you use one of Intake’s open_{driver} methods, it creates a catalog entry for the source. You can view the YAML using the source’s yaml method.
print(source.yaml())
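---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 print(source.yaml())

AttributeError: 'ZarrSource' object has no attribute 'yaml'
In the environment used to build this page, ZarrSource has no yaml method, so the call above fails and the later cells that depend on source.yaml() show no output. The remaining steps assume an Intake version in which source.yaml() returns the catalog entry as a YAML string.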
Modifying the Source
If we wanted, we could add this YAML to our catalog and then load this data using Intake. However, there are many datasets we can load from the Zarr store with almost the same catalog entry. Making a separate entry for each would make the catalog cluttered and harder to use. Instead we will generalize this catalog entry so it applies to many datasets. Then we will create user parameters to give the catalog user the ability to select the data they want.
To generalize the source we need Intake to dynamically generate URLs pointing to the Zarr arrays based on user-set parameters. We will take the source created by the open_zarr method, convert it to a Python dictionary, and then modify it to include user parameters. We can then use those parameters to generate the URLs. Intake provides Jinja templating in catalogs to make this simple. Let’s start by defining user parameters.
source_dict = yaml.load(source.yaml(), Loader=yaml.CLoader)
parameters = {}
parameters['level'] = {'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
                       'type': 'str',
                       'default': 'surface'}
parameters['param'] = {'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
                       'type': 'str',
                       'default': 'TMP'}
parameters['date'] = {'description': "Date and hour of data.",
                      'type': 'datetime',
                      'default': "2016-08-24T00:00:00",
                      'min': "2016-08-24T00:00:00"}
sources = source_dict['sources']
hrrr_zarr = sources['hrrrzarr']
hrrr_zarr['parameters'] = parameters
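To make the roles of type, default, and min concrete, here is a hypothetical sketch of how a datetime user parameter could be resolved. It is not Intake’s implementation; the function name resolve_datetime_parameter and its behavior are assumptions made for illustration.

```python
from datetime import datetime


def resolve_datetime_parameter(spec, value=None):
    """Return the datetime a parameter resolves to, enforcing 'min' if present."""
    # Fall back to the default when the user supplies nothing.
    raw = spec["default"] if value is None else value
    resolved = raw if isinstance(raw, datetime) else datetime.fromisoformat(raw)
    if "min" in spec and resolved < datetime.fromisoformat(spec["min"]):
        raise ValueError(
            f"{resolved.isoformat()} is earlier than the allowed minimum {spec['min']}"
        )
    return resolved


# Same fields as the 'date' parameter defined above.
date_spec = {
    "description": "Date and hour of data.",
    "type": "datetime",
    "default": "2016-08-24T00:00:00",
    "min": "2016-08-24T00:00:00",
}

print(resolve_datetime_parameter(date_spec))
print(resolve_datetime_parameter(date_spec, "2020-01-01T06:00:00"))
```

A value earlier than min raises a ValueError in this sketch, which mirrors the idea that the catalog author can restrict what users may request.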
With the parameters defined we can now use them to create the urls using Jinja syntax.
urls = ["s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}",
        "s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}"]
hrrr_zarr['args']['urlpath'] = urls
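To see what one of these templates expands to, it can be rendered directly with the jinja2 package. This is only a sketch: it assumes jinja2 is installed, and Intake performs its own rendering when the catalog entry is loaded, which may differ in detail.

```python
from datetime import datetime

from jinja2 import Template  # third-party package, assumed to be installed

template = (
    "s3://hrrrzarr/sfc/"
    "{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}"
)

# Render the template with the default values of the three user parameters.
rendered = Template(template).render(
    date=datetime(2016, 8, 24, 0), level="surface", param="TMP"
)
print(rendered)
```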
Now that we have a more generalized source, some of the metadata is too specific. To fix this we will just remove the data_vars section from the source’s metadata.
hrrr_zarr['metadata'].pop('data_vars', None)
print(yaml.dump(source_dict))
Documenting the Data
We have created a data source, but it may be a little tricky to use. We need a way to let users know what options they have for the level and param user parameters we defined earlier. The inventory.csv file in this directory, created from Mesowest’s HRRR Zarr Variable List, contains a table showing which parameters are in the Zarr store and at which levels in the atmosphere they are available. Let’s open it as a source and add it to our source dictionary.
source = intake.open_csv('inventory.csv')
source.name = 'data_dictionary'
source.description = 'Describes the data in the hrrrzarr source'
The Vertical Level column corresponds to the level parameter in our data source and the Parameter Short Name column corresponds to the param parameter.
source
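As a sketch of how a user might consult such a table, the snippet below filters rows by level using the standard csv module. The two column names come from the text above, but the rows are made up for illustration and the real inventory.csv may contain other columns and values. The helper name params_at_level is hypothetical.

```python
import csv
import io

# Made-up rows for illustration only; the real inventory.csv has its own contents.
sample = io.StringIO(
    "Vertical Level,Parameter Short Name\n"
    "surface,TMP\n"
    "surface,PRES\n"
    "2m_above_ground,TMP\n"
)


def params_at_level(rows, level):
    """List the parameter short names available at one vertical level."""
    return [r["Parameter Short Name"] for r in rows if r["Vertical Level"] == level]


rows = list(csv.DictReader(sample))
print(params_at_level(rows, "surface"))
```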
This source is almost how we want it, but the urlpath will not work after we push our catalog to GitHub. Intake sets the CATALOG_DIR parameter to point to whatever directory the catalog file is in. Using this parameter we can generate a URL that will work even after we push the repository to GitHub.
sources['data_dictionary'] = yaml.load(source.yaml(),
                                       Loader=yaml.CLoader)['sources']['data_dictionary']
data_dictionary_args = sources['data_dictionary']['args']
data_dictionary_args['urlpath'] = "{{ CATALOG_DIR }}/inventory.csv"
print(yaml.dump(sources))
Your source now points to an inventory.csv file in the same directory as your catalog. Be sure to copy the file into your “intake-demo” directory.
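To see why a path built from CATALOG_DIR keeps working after the catalog moves, the sketch below substitutes the catalog’s directory into the template for both a local path and a remote URL. The helpers catalog_dir and resolve are hypothetical, plain string substitution stands in for Jinja rendering, and the exact value Intake assigns to CATALOG_DIR (for example, whether it ends in a slash) may differ.

```python
def catalog_dir(catalog_location):
    """Return everything before the final '/' of a catalog path or URL."""
    return catalog_location.rsplit("/", 1)[0]


def resolve(urlpath, catalog_location):
    """Substitute the catalog's directory for the {{ CATALOG_DIR }} placeholder."""
    return urlpath.replace("{{ CATALOG_DIR }}", catalog_dir(catalog_location))


template = "{{ CATALOG_DIR }}/inventory.csv"

local = resolve(template, "intake-demo/catalog.yml")
remote = resolve(
    template,
    "https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml",
)
print(local)
print(remote)
```

The same template yields a local path when the catalog is opened from disk and a URL when it is opened from a remote host.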
Now that we have a source and a data dictionary to describe it, let’s add a readme to our catalog to explain how to use it and give some usage examples. This readme will also be displayed as the readme for the repository on GitHub. In this directory there is an example readme markdown file to use. Go ahead and copy it into the “intake-demo” directory.
md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}
source = intake.open_markdown('README.md', md_kwargs=md_kwargs)
source.name = 'readme'
source.description = 'Learn more about how to use this catalog'
source.read()
The values of pre and post in the md_kwargs dictionary are used by intake-markdown to add extra markdown before and after the markdown source. In this example we use details and summary tags to enclose the readme in a dropdown.
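The effect of pre and post can be pictured as simple concatenation around the markdown text. The sketch below illustrates that idea only; wrap_markdown is a hypothetical helper, the readme body is a placeholder, and intake-markdown’s actual implementation may differ.

```python
def wrap_markdown(text, pre="", post=""):
    """Return the markdown text with extra markdown placed before and after it."""
    return f"{pre}{text}{post}"


md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}

# Placeholder readme body, not the contents of the real README.md.
wrapped = wrap_markdown("# intake-demo\nExample readme body.", **md_kwargs)
print(wrapped)
```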
We will change the urlpath of this source in the same way as the data dictionary to ensure the readme loads correctly.
sources['readme'] = yaml.load(source.yaml(), Loader=yaml.CLoader)['sources']['readme']
readme_args = sources['readme']['args']
readme_args['urlpath'] = "{{ CATALOG_DIR }}/README.md"
print(yaml.dump(source_dict))
With all our sources made, we will add them to the catalog, and save the catalog.
catalog['sources'] = sources
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)
At this point you should have three files in your “intake-demo” directory: “catalog.yml”, “inventory.csv”, and “README.md”. All we need to do now is commit our changes and push them to GitHub.
git add .
git commit -m "initial commit"
git push
Testing the Catalog
Now that your catalog is on GitHub, let’s try using it. In the cell below, replace the URL with the URL pointing to the raw catalog file on your GitHub account.
cat = intake.open_catalog('https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml')
list(cat)
cat.readme.read()
cat.data_dictionary
cat.hrrrzarr.read()
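The raw file URL used above follows the pattern raw.githubusercontent.com/owner/repository/branch/path. As a small sketch, a helper can assemble it for your own repository; the function name raw_github_url is hypothetical, and the pattern assumes a public repository.

```python
def raw_github_url(owner, repo, branch, path):
    """Build the raw.githubusercontent.com URL for a file in a public repository."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


url = raw_github_url("ProjectPythia", "intake-cookbook", "main", "notebooks/catalog.yml")
print(url)
```

For the repository created in this lesson, the owner would be your own account name, the repository “intake-demo”, and the path “catalog.yml”.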
Summary
In this tutorial we learned to create Intake catalogs and host them on GitHub. We learned to create sources with Intake and then modify them to make them more general. We explored a possible method for documenting data by adding a readme and data dictionary to our catalog. These guidelines will help you make your data more accessible to collaborators.