Intake Logo

Introduction to Intake


Overview

Intake is a python library that provides a consistent interface for accessing data regardless of where or how it is stored. In this notebook you will learn to:

  1. Interact with Intake catalogs

  2. Use Intake to access data stored in the cloud

  3. Use Intake to load data into Dask

Prerequisites

Concepts

Importance

Notes

Times and Dates in Python

Necessary

Intro to Xarray

Necessary

Intro to Cartopy

Helpful

Understanding of Zarr

Helpful

Understanding of Dask

Helpful

  • Time to learn: 45 minutes


Imports

import intake
import xarray as xr
import datetime as dt
import metpy
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import requests
import aiohttp
import intake_xarray

Interacting with Intake Catalogs

Intake uses an object called a catalog to inform users what datasets are available to them. These catalogs can be in the form of a yaml file, a server, or a python package you install. In this example we will use a catalog to access Mesowest’s HRRR data stored on AWS S3. To open a catalog use Intake’s open_catalog method with the location of the catalog as an argument. The catalog object created by calling open_catalog is iterable, so you can see what data sources are available to you by passing your catalog object as an argument to python’s list function. You can view the catalog file used for this cookbook on Github.

cat = intake.open_catalog('catalog.yml')
list(cat)
['data_dictionary', 'hrrrzarr', 'readme']

Each of the catalog’s sources are accessible as properties of your catalog object.

cat.hrrrzarr
hrrrzarr:
  args:
    chunks: null
    consolidated: true
    storage_options:
      anon: true
    urlpath:
    - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP
    - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface
  description: Mesowest's HRRR data. See readme source for more information.
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: /home/runner/work/intake-cookbook/intake-cookbook/notebooks/
    coords: !!python/tuple
    - projection_x_coordinate
    - projection_y_coordinate
    dims:
      projection_x_coordinate: 1799
      projection_y_coordinate: 1059

Learning About Catalog Entries

The first place you can look for information about a catalog source is it’s description. That is stored in the data sources’s description property.

cat.hrrrzarr.description
"Mesowest's HRRR data. See readme source for more information."

To get a better look at a source in a Intake catalog, call it’s describe method

desc = cat.hrrrzarr.describe()
desc
{'name': 'hrrrzarr',
 'container': 'xarray',
 'plugin': ['zarr'],
 'driver': ['zarr'],
 'description': "Mesowest's HRRR data. See readme source for more information.",
 'direct_access': 'forbid',
 'user_parameters': [{'name': 'date',
   'description': 'Date and hour of data.',
   'type': 'datetime',
   'min': Timestamp('2016-08-24 00:00:00'),
   'default': Timestamp('2016-08-24 00:00:00')},
  {'name': 'level',
   'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
   'type': 'str',
   'default': 'surface'},
  {'name': 'param',
   'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
   'type': 'str',
   'default': 'TMP'}],
 'metadata': {'coords': ('projection_x_coordinate', 'projection_y_coordinate'),
  'dims': {'projection_x_coordinate': 1799, 'projection_y_coordinate': 1059}},
 'args': {'chunks': None,
  'consolidated': True,
  'storage_options': {'anon': True},
  'urlpath': ["s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}",
   "s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}"]}}

From this Python dictionary, there are a few things that we can learn. The container key tells us what form the data will be in when we read it. In this case it will be a Xarray Dataset The user_parameters key has a list containing parameters a user can set to control what data they get. The metadata key contains an arbitrary dictionary of information specified by the catalog author. A common things to find in the metadata field are plots you can use to get a quick peak at the data.

Reading Data with Intake

Now that we know how to explore Intake catalogs, let’s use one to get some data. Luckily Intake makes this a really easy one-liner.

cat.hrrrzarr.read()
<xarray.Dataset>
Dimensions:                  (projection_y_coordinate: 1059,
                              projection_x_coordinate: 1799)
Coordinates:
  * projection_x_coordinate  (projection_x_coordinate) float64 -2.698e+06 ......
  * projection_y_coordinate  (projection_y_coordinate) float64 -1.587e+06 ......
Data variables:
    TMP                      (projection_y_coordinate, projection_x_coordinate) float16 ...
    forecast_period          timedelta64[ns] 00:00:00
    forecast_reference_time  datetime64[ns] 2016-08-24
    height                   float64 1e+03
    pressure                 float64 2.5e+04
    time                     datetime64[ns] 2016-08-24

Info

Intake catalogs access data lazily. You can explore that catalog all you want, but you won’t have any data until use call the read or simillar methods. The read method may take longer to run depending on your internet connection, the size of the data, and your proximity to the data center where the data is stored.

When we look at the description of the hrrrzarr source it referenced the readme source. We can look at it using the same method. Pay attention to the projection information. It will be useful later.

cat.readme.read()
README

Intake Demo Readme

This repository was made as a educational resource to learn the Python library Intake. It contains an intake catalog pointing to Mesowest's High-Resolution Rapid Refresh (HRRR) Zarr data.

Dependencies

  • requests
  • aiohttp
  • intake
  • s3fs
  • Intake-xarray
  • Intake-markdown
  • cartopy

About the Data

High-Resolution Rapid Refresh (HRRR) is a atmospheric model maintained by NOAA. As stated on NOAA's website

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

In this cookbook we use a subset of HRRR data maintained by Mesowest on AWS S3 object storage.

HRRR Projection

From Mesowest's HRRR Zarr Data Loading Guide

The projection description for the spatial grid, which is stored in the HRRR grib2 files, is not available directly in the zarr arrays. Additionally, the zarr data has the x and y coordinates in the native Lambert Conformal Conic projection, but not the latitude and longitude data. The various use cases will have more specific examples of how to handle this information, but here we'll note that: The proj params you need to define the grid correctly are: {'a': 6371229, 'b': 6371229, 'proj': 'lcc', 'lon_0': 262.5, 'lat_0': 38.5, 'lat_1': 38.5, 'lat_2': 38.5} We also offer a "chunk index" that has the latitude and longitude coordinates for the grid, more on that under Use Case 3. The following code sets up the correct CRS in cartopy:
projection = ccrs.LambertConformal(central_longitude=262.5, 
                                   central_latitude=38.5, 
                                   standard_parallels=(38.5, 38.5),
                                    globe=ccrs.Globe(semimajor_axis=6371229,
                                                     semiminor_axis=6371229))

Data Dictionary

This catalog contains a csv source called data_dictionary that describes the data in the the hrrrzarr source. The Vertical Level and Parameter Short Name column are arguments that can be passed to the hrrrzarr source's level and param user parameters respectively.

Usage Example

import intake
import datetime as dt
user_parameters = {'level': 'top_of_atmosphere',
                   'param': 'USWRF',
                   'date': dt.datetime(2021, 4, 25, 6)}

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

ds = cat.hrrrzarr(**user_parameters).read()

Combining a Days Worth of Data

import intake
import cartopy.crs as ccrs
import datetime as dt
import xarray as xr

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

projection = ccrs.LambertConformal(central_longitude=262.5, 
                                   central_latitude=38.5, 
                                   standard_parallels=(38.5, 38.5),
                                    globe=ccrs.Globe(semimajor_axis=6371229,
                                                     semiminor_axis=6371229))
date = dt.datetime(2019, 8, 9)
hours = [date + dt.timedelta(hours=i) for i in range(4)]
datasets = [cat.hrrrzarr(date=hour).read().set_coords("time") for hour in hours]
ds = xr.concat(datasets, dim='time', combine_attrs="drop_conflicts")


End of README

Specifying User Parameters

The hrrrzarr sources in this catalog has three parameters that can be used to control what data you will read in. To list those use the user_parameters key on the description dictionary created above.

desc['user_parameters']
[{'name': 'date',
  'description': 'Date and hour of data.',
  'type': 'datetime',
  'min': Timestamp('2016-08-24 00:00:00'),
  'default': Timestamp('2016-08-24 00:00:00')},
 {'name': 'level',
  'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
  'type': 'str',
  'default': 'surface'},
 {'name': 'param',
  'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
  'type': 'str',
  'default': 'TMP'}]

Each user parameter can have a name, description, type, defualt, allowed, min, and max key. You can learn more about parameter definitions in Intake’s documentation. This data source contains three user parameters: date, level, and param. Each parameter’s descriptions explain what they are for. The level and param parameter allow you to select data based on level in the atmosphere and variable being measured. There allowed values correspond to values in the “Vertical Level” and “Parameter Short Name” column in the data_dictionary source repectively. The date parameter allows you to select data by date.

data_dictionary = cat.data_dictionary.read()
data_dictionary.query("`Vertical Level` == 'surface'")[:10]
Parameter Long Name Vertical Level Parameter Short Name Units 1st Version Available Analysis or Forcast Notes
133 Visibility surface VIS m V3 both
134 Wind Gust surface GUST m/s V3 both
135 Hail surface HAIL_max_fcst m V4 anl
136 Hail surface HAIL_1hr_max_fcst m V4 fcst
137 Pressure surface PRES Pa V3 both
138 Geopotential Height surface HGT gpm V3 both
139 Temperature surface TMP K V3 both
140 Total Snowfall surface ASNOW_acc_fcst m V3 both
141 Total Snowfall surface ASNOW_1hr_acc_fcst m V3 fcst
142 Plant Canopy Surface Water surface CNWAT kg/m2 V3 both

Lets use parameters to select surface temperature data from June 20, 2021. We can provide these parameter by passing keyword arguments to the data source.

summer_solstice_2021 = dt.datetime(2021, 6, 20)
source = cat.hrrrzarr(date=summer_solstice_2021, level='surface', param='TMP')
source.read()
<xarray.Dataset>
Dimensions:                  (projection_y_coordinate: 1059,
                              projection_x_coordinate: 1799)
Coordinates:
  * projection_x_coordinate  (projection_x_coordinate) float64 -2.698e+06 ......
  * projection_y_coordinate  (projection_y_coordinate) float64 -1.587e+06 ......
Data variables:
    TMP                      (projection_y_coordinate, projection_x_coordinate) float16 ...
    forecast_period          timedelta64[ns] 00:00:00
    forecast_reference_time  datetime64[ns] 2021-06-20
    height                   float64 1e+03
    pressure                 float64 2.5e+04
    time                     datetime64[ns] 2021-06-20

Your data source now points to surface temperature data taken June 20, 2021

A More Complicated Example

Mesowest provides a tutorial for reading a days worth of surface temperature HRRR data from AWS. Lets see what the same task looks like using Intake.

We will start by setting up our Cartopy projection according to the information given in the readme source.

projection = ccrs.LambertConformal(central_longitude=262.5, 
                                   central_latitude=38.5, 
                                   standard_parallels=(38.5, 38.5),
                                    globe=ccrs.Globe(semimajor_axis=6371229,
                                                     semiminor_axis=6371229))

Now lets read in the data with Intake. To do this will create a datetime object with the date August 9, 2019. Then we will use list comprehension and timedelta objects to create a datetime object for each hour that day. Again, using list comprehension, we will create a list of datasets using Intake. In order to concatenate our list of datasets using Xarray, we need a dimension to concatenate accross. Each dataset in our list contains a time variable with an array of just one timestamp. We can promote that variable to a coordinate using the set_coords method. This may take a few minutes to run.

%%time
date = dt.datetime(2019, 8, 10)
hours = [date + dt.timedelta(hours=i) for i in range(24)]
datasets = [cat.hrrrzarr(date=hour).read().set_coords("time") for hour in hours]
ds = xr.concat(datasets, dim='time', combine_attrs="drop_conflicts")
ds
CPU times: user 8.06 s, sys: 551 ms, total: 8.61 s
Wall time: 2min 46s
<xarray.Dataset>
Dimensions:                  (time: 24, projection_y_coordinate: 1059,
                              projection_x_coordinate: 1799)
Coordinates:
  * projection_x_coordinate  (projection_x_coordinate) float64 -2.698e+06 ......
  * projection_y_coordinate  (projection_y_coordinate) float64 -1.587e+06 ......
  * time                     (time) datetime64[ns] 2019-08-10 ... 2019-08-10T...
Data variables:
    TMP                      (time, projection_y_coordinate, projection_x_coordinate) float16 ...
    forecast_period          (time) timedelta64[ns] 00:00:00 ... 00:00:00
    forecast_reference_time  (time) datetime64[ns] 2019-08-10 ... 2019-08-10T...
    height                   (time) float64 1e+03 1e+03 1e+03 ... 1e+03 1e+03
    pressure                 (time) float64 2.5e+04 2.5e+04 ... 2.5e+04 2.5e+04

Now our data is ready to be analyzed in the normal way using Xarray.

avg_TMP = ds.TMP.mean(dim='time')
fig = plt.figure(figsize=(10, 8.5))
ax = fig.add_subplot(1, 1, 1, projection=projection)
temp_plot = ax.contourf(avg_TMP.projection_x_coordinate, avg_TMP.projection_y_coordinate, avg_TMP)
ax.coastlines()
fig.colorbar(temp_plot, orientation="horizontal")

plt.show()
/usr/share/miniconda3/envs/intake-cookbook-dev/lib/python3.11/site-packages/cartopy/io/__init__.py:241: DownloadWarning: Downloading: https://naturalearth.s3.amazonaws.com/50m_physical/ne_50m_coastline.zip
  warnings.warn(f'Downloading: {url}', DownloadWarning)
../_images/af2ea64dea8d36f185a31abc68c6023e5357aac7ca4ba4f6b9b1cb06ef91ef9b.png

Using Intake with Dask

Often times the data we want to analyze is too big to be loaded into memory all at once on your computer. Dask solves this problem by breaking up your data into smaller chunks, operating on each chunck of data, and then aggregating the results. This is usually done in parallel on a cluster system. You can use Intake to create a Dask dataset by using the to_dask method instead of the read method.

ds1 = cat.hrrrzarr(date=dt.datetime(2021, 1, 1)).read()
print(type(ds1.TMP.data))

ds2 = cat.hrrrzarr(date=dt.datetime(2022, 1, 1)).to_dask()
print(type(ds2.TMP.data))
<class 'numpy.ndarray'>
<class 'dask.array.core.Array'>

As you can see Xarray uses Dask arrays instead of NumPy arrays to hold the data when the to_dask method is used.


Summary

  • Intake makes it easy to consistently access data regardless of where and how it is stored

  • Intake catalogs contain useful information about the data they make available

  • Intake can load data into Dask for use in parallel computing.

What’s next?

In the next notebook we will look at writing a Intake catalog and making it available on Github.