Creating Intake Catalogs
Overview
In the last lesson we learned to use Intake catalogs to simplify the process of accessing research data. In this lesson we will walk through the steps of creating a catalog for your research data by recreating the catalog used in the previous lesson.
Creating an Intake catalog
Documenting your data source
Sharing your catalog on Github
Prerequisites
| Concepts | Importance | Notes |
|---|---|---|
| | Necessary | |
| | Necessary | |
| | Necessary | |
| | Helpful | |
Time to learn: 45 minutes
Imports
import intake
import intake_xarray
import intake_markdown
import requests
import aiohttp
import s3fs
import yaml
import json
import datetime
import os
Setting up the Environment
By the end of this tutorial we will have created a git repository that we can host on Github to share our catalog.
Start by creating a Github repository called “intake-demo”, and then clone the repository to your local machine. Be sure to replace path/to/Github/repository in the following command with the url of the repository you just made.
git clone path/to/Github/repository
An Intake catalog can be a simple yaml file. We can create the yaml file programmatically by converting nested python dictionaries to yaml. An Intake catalog has two main parts: metadata and sources. The metadata can be arbitrary, with a few exceptions. The sources section is a mapping between a data source name and its properties. For more information about Intake catalogs, see Intake’s documentation.
description = "Catalog containing Mesowest's HRRR data. See readme source for more information."
catalog = {'metadata': {'version': 1,
                        'description': description},
           'sources': {}}
os.makedirs("intake-demo", exist_ok=True)  # only needed for building this notebook
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)
You will now notice a new file in your “intake-demo” directory called “catalog.yml” with the following contents.
with open('intake-demo/catalog.yml', 'r') as f:
    print(f.read())
metadata:
  description: Catalog containing Mesowest's HRRR data. See readme source for more
    information.
  version: 1
sources: {}
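Notice that yaml.dump sorted the keys alphabetically, so description now appears before version even though the dictionary was written the other way around. If you would rather keep the order you wrote, PyYAML 5.1 and later accept a sort_keys argument. The following is a minimal sketch and not part of the catalog we are building; the ordered_text name is only for illustration.

```python
import yaml

description = "Catalog containing Mesowest's HRRR data. See readme source for more information."
catalog = {'metadata': {'version': 1,
                        'description': description},
           'sources': {}}

# sort_keys=False keeps the insertion order of the dictionaries,
# so 'version' stays ahead of 'description' in the yaml text.
ordered_text = yaml.dump(catalog, sort_keys=False)
print(ordered_text)
```

Either ordering loads back to the same dictionary, so this is purely a readability choice.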
Adding Your First Data Source
Intake only knows how to handle a few different data formats. To handle other formats it uses pluggable drivers. To use Mesowest’s HRRR Zarr data we will use the intake-xarray package, which provides a driver for reading Zarr data into Xarray datasets. Drivers are installed as python packages and integrate into the Intake library. When a driver package is installed, Intake creates an open_{driver} method for each driver in the package. Installing the intake-xarray package allows us to access Zarr data using the open_zarr method.
Mesowest’s HRRR Zarr data is stored in AWS. The file structure of the hrrrzarr S3 bucket looks like
s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level
where
yyyy = four digit year
mm = two digit month
dd = two digit day of month
hh = two digit hour of the day
level = level of the atmosphere the data describes
param = the parameter you're interested in
To load a complete dataset we need the Zarr arrays from two urls
s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param/level
s3://hrrrzarr/sfc/yyyymmdd/yyyymmdd_hhz_anl.zarr/level/param
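The naming pattern above can be sketched as a small function that fills in the date, level, and parameter. The hrrr_zarr_urls helper is hypothetical and exists only to illustrate the pattern; it is not part of Intake or of Mesowest's tooling.

```python
import datetime


def hrrr_zarr_urls(date, level, param):
    """Return the two Zarr array urls for one analysis hour (hypothetical helper)."""
    # strftime fills in yyyymmdd/yyyymmdd_hhz_anl.zarr from the datetime
    store = date.strftime('s3://hrrrzarr/sfc/%Y%m%d/%Y%m%d_%Hz_anl.zarr')
    base = f'{store}/{level}/{param}'
    return [f'{base}/{level}', base]


for url in hrrr_zarr_urls(datetime.datetime(2016, 8, 24, 0), 'surface', 'TMP'):
    print(url)
```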
Let’s load surface temperature data from August 24, 2016.
urls = ['s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface',
's3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP']
source = intake.open_zarr(urls, storage_options={"anon": True})
source.name = 'hrrrzarr'
source.description = "Mesowest's HRRR data. See readme source for more information."
ds = source.read()
ds
<xarray.Dataset> Size: 4MB
Dimensions:                  (projection_y_coordinate: 1059, projection_x_coordinate: 1799)
Coordinates:
  * projection_x_coordinate  (projection_x_coordinate) float64 14kB -2.698e+0...
  * projection_y_coordinate  (projection_y_coordinate) float64 8kB -1.587e+06...
Data variables:
    TMP                      (projection_y_coordinate, projection_x_coordinate) float16 4MB ...
    forecast_period          timedelta64[ns] 8B 00:00:00
    forecast_reference_time  datetime64[ns] 8B 2016-08-24
    height                   float64 8B 1e+03
    pressure                 float64 8B 2.5e+04
    time                     datetime64[ns] 8B 2016-08-24
Above we used the storage_options argument to tell Intake how to access data on AWS. In this case we accessed the data as an anonymous user. Zarr data may also contain consolidated metadata. If it does, passing the consolidated=True argument tells Xarray to load the metadata from it, which can increase performance significantly.
When you use Intake’s open_{driver} methods, Intake creates a catalog entry for the source. You can view the entry as yaml using the source’s yaml method.
print(source.yaml())
sources:
  hrrrzarr:
    args:
      storage_options:
        anon: true
      urlpath:
      - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface
      - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP
    description: Mesowest's HRRR data. See readme source for more information.
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      coords: !!python/tuple
      - projection_x_coordinate
      - projection_y_coordinate
      data_vars:
        TMP:
        - projection_x_coordinate
        - projection_y_coordinate
        forecast_period: []
        forecast_reference_time: []
        height: []
        pressure: []
        time: []
      dims:
        projection_x_coordinate: 1799
        projection_y_coordinate: 1059
Modifying the Source
If we wanted, we could add this yaml to our catalog and then load this data using Intake. However, there are many datasets we can load from the Zarr store with almost the same catalog entry. Making a separate entry for each would make the catalog cluttered and harder to use. Instead we will generalize this catalog entry so it applies to many datasets. Then we will create user parameters to give the catalog user the ability to select the data they want.
To generalize the source we need Intake to dynamically generate urls pointing to the Zarr arrays based on user-set parameters. We will take the source created by the open_zarr method, convert it to a python dictionary, and then modify it to include user parameters. We can then use those parameters to generate the urls. Intake provides Jinja templating in catalogs to make this simple. Let’s start by defining user parameters.
source_dict = yaml.load(source.yaml(), Loader=yaml.CLoader)
parameters = {}
parameters['level'] = {'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
                       'type': 'str',
                       'default': 'surface'}
parameters['param'] = {'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
                       'type': 'str',
                       'default': 'TMP'}
parameters['date'] = {'description': "Date and hour of data.",
                      'type': 'datetime',
                      'default': "2016-08-24T00:00:00",
                      'min': "2016-08-24T00:00:00"}
sources = source_dict['sources']
hrrr_zarr = sources['hrrrzarr']
hrrr_zarr['parameters'] = parameters
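To make the role of the type, default, and min fields more concrete, here is a stand-alone sketch of the kind of check a catalog can apply to the date parameter. It is an illustration only and not Intake's implementation; the resolve_date function is hypothetical, and it assumes that a value earlier than min should be rejected.

```python
import datetime

date_parameter = {'type': 'datetime',
                  'default': "2016-08-24T00:00:00",
                  'min': "2016-08-24T00:00:00"}


def resolve_date(parameter, value=None):
    """Coerce a user value to datetime, fall back to the default, enforce min."""
    if value is None:
        value = parameter['default']
    if isinstance(value, str):
        value = datetime.datetime.fromisoformat(value)
    minimum = datetime.datetime.fromisoformat(parameter['min'])
    if value < minimum:
        raise ValueError(f"date={value} is earlier than the minimum {minimum}")
    return value


print(resolve_date(date_parameter))                         # falls back to the default
print(resolve_date(date_parameter, "2021-04-25T06:00:00"))  # a user-selected date
```

Setting min to the first date in the store gives users an early, readable error instead of a failed S3 lookup.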
With the parameters defined we can now use them to create the urls using Jinja syntax.
urls = ["s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}",
"s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}"]
hrrr_zarr['args']['urlpath'] = urls
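To check that a template expands the way you expect before relying on it in a catalog, you can render it by hand with the jinja2 package (assumed here to be installed alongside Intake). This is a sketch for inspecting the template, not a description of how Intake renders it internally, and the parameter values are example choices.

```python
import datetime
from jinja2 import Template

template = "s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}"

# Render the template with one concrete choice of user parameters.
rendered = Template(template).render(date=datetime.datetime(2021, 4, 25, 6),
                                     level='top_of_atmosphere',
                                     param='USWRF')
print(rendered)
```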
Now that we have a more generalized source, some of the metadata is too specific. To fix this we will just remove the data_vars section from the source’s metadata.
hrrr_zarr['metadata'].pop('data_vars', None)
print(yaml.dump(source_dict))
sources:
  hrrrzarr:
    args:
      storage_options:
        anon: true
      urlpath:
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
    description: Mesowest's HRRR data. See readme source for more information.
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      coords: !!python/tuple
      - projection_x_coordinate
      - projection_y_coordinate
      dims:
        projection_x_coordinate: 1799
        projection_y_coordinate: 1059
    parameters:
      date:
        default: '2016-08-24T00:00:00'
        description: Date and hour of data.
        min: '2016-08-24T00:00:00'
        type: datetime
      level:
        default: surface
        description: Parameter specifying level in the atmosphere. Corresponds to
          'Vertical Level' column in data_dictionary
        type: str
      param:
        default: TMP
        description: Specifies what parameter your dataset will contain. Corresponds
          to 'Parameter Short Name' in data_dictionary
        type: str
Documenting the Data
We have created a data source, but it may be a little tricky to use. We need a way to let users know what options they have for the level and param user parameters we defined earlier. The inventory.csv file in this directory, created from Mesowest’s HRRR Zarr Variable List, contains a table showing which parameters are in the Zarr store and at which levels in the atmosphere they are available. Let’s open it as a source and add it to our source dictionary.
source = intake.open_csv('inventory.csv')
source.name = 'data_dictionary'
source.description = 'Describes the data in the hrrrzarr source'
The Vertical Level column corresponds to the level parameter in our data source, and the Parameter Short Name column corresponds to the param parameter.
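As a rough illustration of how a user might consult these two columns, the sketch below checks whether a level and parameter pair is listed. The two-row table is a hypothetical excerpt standing in for the real inventory.csv, which has many more rows, and is_available is a made-up helper name.

```python
import csv
import io

# Hypothetical two-row excerpt; the real inventory.csv has many more rows.
sample_inventory = io.StringIO(
    "Vertical Level,Parameter Short Name\n"
    "surface,TMP\n"
    "top_of_atmosphere,USWRF\n"
)
rows = list(csv.DictReader(sample_inventory))


def is_available(rows, level, param):
    """Return True if the (level, param) pair is listed in the inventory rows."""
    return any(row['Vertical Level'] == level and
               row['Parameter Short Name'] == param
               for row in rows)


print(is_available(rows, 'surface', 'TMP'))
print(is_available(rows, 'surface', 'USWRF'))
```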
source
data_dictionary:
  args:
    urlpath: inventory.csv
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata: {}
This source is almost how we want it, but the urlpath will not work after we push our catalog to Github. Intake sets the CATALOG_DIR parameter to point to whatever directory the catalog file is in. Using this parameter we can generate a url that will work even after we push the repository to Github.
sources['data_dictionary'] = yaml.load(source.yaml(),
Loader=yaml.CLoader)['sources']['data_dictionary']
data_dictionary_args = sources['data_dictionary']['args']
data_dictionary_args['urlpath'] = "{{ CATALOG_DIR }}/inventory.csv"
print(yaml.dump(sources))
data_dictionary:
  args:
    urlpath: '{{ CATALOG_DIR }}/inventory.csv'
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata: {}
hrrrzarr:
  args:
    storage_options:
      anon: true
    urlpath:
    - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
    - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
  description: Mesowest's HRRR data. See readme source for more information.
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    coords: !!python/tuple
    - projection_x_coordinate
    - projection_y_coordinate
    dims:
      projection_x_coordinate: 1799
      projection_y_coordinate: 1059
  parameters:
    date:
      default: '2016-08-24T00:00:00'
      description: Date and hour of data.
      min: '2016-08-24T00:00:00'
      type: datetime
    level:
      default: surface
      description: Parameter specifying level in the atmosphere. Corresponds to 'Vertical
        Level' column in data_dictionary
      type: str
    param:
      default: TMP
      description: Specifies what parameter your dataset will contain. Corresponds
        to 'Parameter Short Name' in data_dictionary
      type: str
Your source now points to an inventory.csv file in the same directory as your catalog. Be sure to copy the file into your “intake-demo” directory.
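To see why the templated path keeps working wherever the catalog lives, the sketch below mimics the substitution by swapping the placeholder for the directory part of the catalog's location. The resolve_urlpath helper is hypothetical and only approximates what Intake does; the second location is the hosted copy of this lesson's catalog.

```python
import posixpath


def resolve_urlpath(catalog_location, template):
    """Replace the CATALOG_DIR placeholder with the catalog's directory."""
    catalog_dir = posixpath.dirname(catalog_location)
    return template.replace("{{ CATALOG_DIR }}", catalog_dir)


template = "{{ CATALOG_DIR }}/inventory.csv"
print(resolve_urlpath("intake-demo/catalog.yml", template))
print(resolve_urlpath(
    "https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml",
    template))
```

The same catalog entry therefore resolves to a local file when the catalog is opened locally and to a raw Github url when it is opened remotely.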
Now that we have a source and a data dictionary to describe it, let’s add a readme to our catalog to explain how to use it and give some usage examples. This readme will also be displayed as the readme for the repository on Github. In this directory there is an example readme markdown file to use. Go ahead and copy it into the “intake-demo” directory.
md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
"post": "\n<br>\nEnd of README\n</details>"}
source = intake.open_markdown('README.md', md_kwargs=md_kwargs)
source.name = 'readme'
source.description = 'Learn more about how to use this catalog'
source.read()
README
Intake Demo Readme
This repository was made as an educational resource for learning the Python library Intake. It contains an intake catalog pointing to Mesowest's High-Resolution Rapid Refresh (HRRR) Zarr data.
Dependencies
- requests
- aiohttp
- intake
- s3fs
- Intake-xarray
- Intake-markdown
- cartopy
About the Data
High-Resolution Rapid Refresh (HRRR) is an atmospheric model maintained by NOAA. As stated on NOAA's website:
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
In this cookbook we use a subset of HRRR data maintained by Mesowest on AWS S3 object storage.
HRRR Projection
From Mesowest's HRRR Zarr Data Loading Guide
The projection description for the spatial grid, which is stored in the HRRR grib2 files, is not available directly in the zarr arrays. Additionally, the zarr data has the x and y coordinates in the native Lambert Conformal Conic projection, but not the latitude and longitude data. The various use cases will have more specific examples of how to handle this information, but here we'll note that: The proj params you need to define the grid correctly are: {'a': 6371229, 'b': 6371229, 'proj': 'lcc', 'lon_0': 262.5, 'lat_0': 38.5, 'lat_1': 38.5, 'lat_2': 38.5} We also offer a "chunk index" that has the latitude and longitude coordinates for the grid, more on that under Use Case 3. The following code sets up the correct CRS in cartopy:
projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))
Data Dictionary
This catalog contains a csv source called data_dictionary that describes the data in the hrrrzarr source. The Vertical Level and Parameter Short Name columns are arguments that can be passed to the hrrrzarr source's level and param user parameters respectively.
Usage Example
import intake
import datetime as dt
user_parameters = {'level': 'top_of_atmosphere',
                   'param': 'USWRF',
                   'date': dt.datetime(2021, 4, 25, 6)}
cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')
ds = cat.hrrrzarr(**user_parameters).read()
Combining a Day's Worth of Data
import intake
import cartopy.crs as ccrs
import datetime as dt
import xarray as xr
cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')
projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))
date = dt.datetime(2019, 8, 9)
hours = [date + dt.timedelta(hours=i) for i in range(4)]
datasets = [cat.hrrrzarr(date=hour).read().set_coords("time") for hour in hours]
ds = xr.concat(datasets, dim='time', combine_attrs="drop_conflicts")
End of README
The values of pre and post in the md_kwargs dictionary are used by intake-markdown to add extra markdown before and after the markdown source. In this example we use details and summary tags to enclose the readme in a dropdown.
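The effect of pre and post can be pictured with plain string concatenation. This sketch assumes the extra markdown is simply placed before and after the file's text, which may differ in detail from what intake-markdown does; the body string is a hypothetical stand-in for the contents of README.md.

```python
md_kwargs = {"pre": "<details markdown='1'>\n<summary>README</summary>\n",
             "post": "\n<br>\nEnd of README\n</details>"}

# Hypothetical stand-in for the contents of README.md
body = "# Intake Demo Readme\n"

# Assumption: the extra markdown is placed directly before and after the body.
wrapped = md_kwargs["pre"] + body + md_kwargs["post"]
print(wrapped)
```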
We will change the urlpath of this source in the same way as the data dictionary to ensure the readme loads correctly.
sources['readme'] = yaml.load(source.yaml(), Loader=yaml.CLoader)['sources']['readme']
readme_args = sources['readme']['args']
readme_args['urlpath'] = "{{ CATALOG_DIR }}/README.md"
print(yaml.dump(source_dict))
sources:
  data_dictionary:
    args:
      urlpath: '{{ CATALOG_DIR }}/inventory.csv'
    description: Describes the data in the hrrrzarr source
    driver: intake.source.csv.CSVSource
    metadata: {}
  hrrrzarr:
    args:
      storage_options:
        anon: true
      urlpath:
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}
      - s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}
    description: Mesowest's HRRR data. See readme source for more information.
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      coords: !!python/tuple
      - projection_x_coordinate
      - projection_y_coordinate
      dims:
        projection_x_coordinate: 1799
        projection_y_coordinate: 1059
    parameters:
      date:
        default: '2016-08-24T00:00:00'
        description: Date and hour of data.
        min: '2016-08-24T00:00:00'
        type: datetime
      level:
        default: surface
        description: Parameter specifying level in the atmosphere. Corresponds to
          'Vertical Level' column in data_dictionary
        type: str
      param:
        default: TMP
        description: Specifies what parameter your dataset will contain. Corresponds
          to 'Parameter Short Name' in data_dictionary
        type: str
  readme:
    args:
      md_kwargs:
        post: '

          <br>

          End of README

          </details>'
        pre: '<details markdown=''1''>

          <summary>README</summary>

          '
      urlpath: '{{ CATALOG_DIR }}/README.md'
    description: Learn more about how to use this catalog
    driver: intake_markdown.intake_markdown.MarkdownSource
    metadata: {}
With all our sources made, we will add them to the catalog, and save the catalog.
catalog['sources'] = sources
with open('intake-demo/catalog.yml', 'w') as f:
    yaml.dump(catalog, f)
At this point you should have three files in your “intake-demo” directory: “catalog.yml”, “inventory.csv”, and “README.md”. All we need to do now is commit our changes and push them to Github.
git add .
git commit -m "initial commit"
git push
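If the catalog later fails to load from Github, a file that never made it into the repository is a likely cause. A small check like the sketch below, run from the directory that contains “intake-demo”, lists any of the three expected files that are absent. The missing_catalog_files helper is hypothetical and not part of Intake.

```python
from pathlib import Path

EXPECTED_FILES = ("catalog.yml", "inventory.csv", "README.md")


def missing_catalog_files(repo_dir):
    """Return the expected file names that are not present in repo_dir."""
    repo = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (repo / name).is_file()]


# An empty list means all three files are in place.
print(missing_catalog_files("intake-demo"))
```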
Testing the Catalog
Now that your catalog is on Github, let’s try using it. In the cell below, replace the url with the url pointing to the raw catalog file in your own Github repository.
cat = intake.open_catalog('https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/catalog.yml')
list(cat)
['data_dictionary', 'hrrrzarr', 'readme']
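The raw url used above follows a regular pattern built from the account name, repository, branch, and file path. The helper below is a hypothetical convenience for assembling such a url and assumes a public repository; substitute your own account, repository, and path.

```python
def raw_github_url(user, repo, branch, path):
    """Build the raw.githubusercontent.com url for a file in a public repository."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"


print(raw_github_url("ProjectPythia", "intake-cookbook", "main", "notebooks/catalog.yml"))
```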
cat.readme.read()
cat.data_dictionary
data_dictionary:
  args:
    keep_default_na: false
    urlpath: https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks/inventory.csv
  description: Describes the data in the hrrrzarr source
  driver: intake.source.csv.CSVSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/ProjectPythia/intake-cookbook/main/notebooks
cat.hrrrzarr.read()
<xarray.Dataset> Size: 4MB
Dimensions:                  (projection_y_coordinate: 1059, projection_x_coordinate: 1799)
Coordinates:
  * projection_x_coordinate  (projection_x_coordinate) float64 14kB -2.698e+0...
  * projection_y_coordinate  (projection_y_coordinate) float64 8kB -1.587e+06...
Data variables:
    TMP                      (projection_y_coordinate, projection_x_coordinate) float16 4MB ...
    forecast_period          timedelta64[ns] 8B 00:00:00
    forecast_reference_time  datetime64[ns] 8B 2016-08-24
    height                   float64 8B 1e+03
    pressure                 float64 8B 2.5e+04
    time                     datetime64[ns] 8B 2016-08-24
Summary
In this tutorial we learned to create Intake catalogs and host them on Github. We learned to create sources with Intake and then modify them to make them more general. We explored a possible method for documenting data by adding a readme and data dictionary to our catalog. These guidelines will help you make your data more accessible to collaborators.