Introduction to Intake

Overview

Intake is a python library that provides a consistent interface for accessing data regardless of where or how it is stored. In this notebook you will learn to:

Interact with Intake catalogs
Use Intake to access data stored in the cloud
Use Intake to load data into Dask

Prerequisites

Concepts	Importance	Notes
Times and Dates in Python	Necessary
Intro to Xarray	Necessary
Intro to Cartopy	Helpful
Understanding of Zarr	Helpful
Understanding of Dask	Helpful

Time to learn: 45 minutes

Imports

import intake
import xarray as xr
import datetime as dt
import metpy
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import requests
import aiohttp
import intake_xarray

Interacting with Intake Catalogs

Intake uses an object called a catalog to inform users what datasets are available to them. These catalogs can be in the form of a yaml file, a server, or a python package you install. In this example we will use a catalog to access Mesowest’s HRRR data stored on AWS S3. To open a catalog use Intake’s open_catalog method with the location of the catalog as an argument. The catalog object created by calling open_catalog is iterable, so you can see what data sources are available to you by passing your catalog object as an argument to python’s list function. You can view the catalog file used for this cookbook on Github.

cat = intake.open_catalog('catalog.yml')
list(cat)

['data_dictionary', 'hrrrzarr', 'readme']

Each of the catalog’s sources are accessible as properties of your catalog object.

cat.hrrrzarr

hrrrzarr:
  args:
    chunks: null
    consolidated: true
    storage_options:
      anon: true
    urlpath:
    - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP
    - s3://hrrrzarr/sfc/20160824/20160824_00z_anl.zarr/surface/TMP/surface
  description: Mesowest's HRRR data. See readme source for more information.
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: /home/runner/work/intake-cookbook/intake-cookbook/notebooks/
    coords: !!python/tuple
    - projection_x_coordinate
    - projection_y_coordinate
    dims:
      projection_x_coordinate: 1799
      projection_y_coordinate: 1059

Learning About Catalog Entries

The first place you can look for information about a catalog source is it’s description. That is stored in the data sources’s description property.

cat.hrrrzarr.description

"Mesowest's HRRR data. See readme source for more information."

To get a better look at a source in a Intake catalog, call it’s describe method

desc = cat.hrrrzarr.describe()
desc

{'name': 'hrrrzarr',
 'container': 'xarray',
 'plugin': ['zarr'],
 'driver': ['zarr'],
 'description': "Mesowest's HRRR data. See readme source for more information.",
 'direct_access': 'forbid',
 'user_parameters': [{'name': 'date',
   'description': 'Date and hour of data.',
   'type': 'datetime',
   'min': Timestamp('2016-08-24 00:00:00'),
   'default': Timestamp('2016-08-24 00:00:00')},
  {'name': 'level',
   'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
   'type': 'str',
   'default': 'surface'},
  {'name': 'param',
   'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
   'type': 'str',
   'default': 'TMP'}],
 'metadata': {'coords': ('projection_x_coordinate', 'projection_y_coordinate'),
  'dims': {'projection_x_coordinate': 1799, 'projection_y_coordinate': 1059}},
 'args': {'chunks': None,
  'consolidated': True,
  'storage_options': {'anon': True},
  'urlpath': ["s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}",
   "s3://hrrrzarr/sfc/{{date.strftime('%Y%m%d/%Y%m%d_%Hz_anl.zarr')}}/{{level}}/{{param}}/{{level}}"]}}

From this Python dictionary, there are a few things that we can learn. The container key tells us what form the data will be in when we read it. In this case it will be a Xarray Dataset The user_parameters key has a list containing parameters a user can set to control what data they get. The metadata key contains an arbitrary dictionary of information specified by the catalog author. A common things to find in the metadata field are plots you can use to get a quick peak at the data.

Reading Data with Intake

Now that we know how to explore Intake catalogs, let’s use one to get some data. Luckily Intake makes this a really easy one-liner.

Info

Intake catalogs access data lazily. You can explore that catalog all you want, but you won’t have any data until use call the read or simillar methods. The read method may take longer to run depending on your internet connection, the size of the data, and your proximity to the data center where the data is stored.

When we look at the description of the hrrrzarr source it referenced the readme source. We can look at it using the same method. Pay attention to the projection information. It will be useful later.

cat.readme.read()

README

Intake Demo Readme

This repository was made as a educational resource to learn the Python library Intake. It contains an intake catalog pointing to Mesowest's High-Resolution Rapid Refresh (HRRR) Zarr data.

Dependencies

requests
aiohttp
intake
s3fs
Intake-xarray
Intake-markdown
cartopy

About the Data

High-Resolution Rapid Refresh (HRRR) is a atmospheric model maintained by NOAA. As stated on NOAA's website

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

In this cookbook we use a subset of HRRR data maintained by Mesowest on AWS S3 object storage.

HRRR Projection

From Mesowest's HRRR Zarr Data Loading Guide

The projection description for the spatial grid, which is stored in the HRRR grib2 files, is not available directly in the zarr arrays. Additionally, the zarr data has the x and y coordinates in the native Lambert Conformal Conic projection, but not the latitude and longitude data. The various use cases will have more specific examples of how to handle this information, but here we'll note that: The proj params you need to define the grid correctly are: {'a': 6371229, 'b': 6371229, 'proj': 'lcc', 'lon_0': 262.5, 'lat_0': 38.5, 'lat_1': 38.5, 'lat_2': 38.5} We also offer a "chunk index" that has the latitude and longitude coordinates for the grid, more on that under Use Case 3. The following code sets up the correct CRS in cartopy:

projection = ccrs.LambertConformal(central_longitude=262.5, 
                                   central_latitude=38.5, 
                                   standard_parallels=(38.5, 38.5),
                                    globe=ccrs.Globe(semimajor_axis=6371229,
                                                     semiminor_axis=6371229))

Data Dictionary

This catalog contains a csv source called data_dictionary that describes the data in the the hrrrzarr source. The Vertical Level and Parameter Short Name column are arguments that can be passed to the hrrrzarr source's level and param user parameters respectively.

Usage Example

import intake
import datetime as dt
user_parameters = {'level': 'top_of_atmosphere',
                   'param': 'USWRF',
                   'date': dt.datetime(2021, 4, 25, 6)}

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

ds = cat.hrrrzarr(**user_parameters).read()

Combining a Days Worth of Data

import intake
import cartopy.crs as ccrs
import datetime as dt
import xarray as xr

cat = intake.open_catalog('https://raw.githubusercontent.com/jnmorley/intake_demo/main/catalog.yml')

projection = ccrs.LambertConformal(central_longitude=262.5, 
                                   central_latitude=38.5, 
                                   standard_parallels=(38.5, 38.5),
                                    globe=ccrs.Globe(semimajor_axis=6371229,
                                                     semiminor_axis=6371229))
date = dt.datetime(2019, 8, 9)
hours = [date + dt.timedelta(hours=i) for i in range(4)]
datasets = [cat.hrrrzarr(date=hour).read().set_coords("time") for hour in hours]
ds = xr.concat(datasets, dim='time', combine_attrs="drop_conflicts")

End of README

Specifying User Parameters

The hrrrzarr sources in this catalog has three parameters that can be used to control what data you will read in. To list those use the user_parameters key on the description dictionary created above.

desc['user_parameters']

[{'name': 'date',
  'description': 'Date and hour of data.',
  'type': 'datetime',
  'min': Timestamp('2016-08-24 00:00:00'),
  'default': Timestamp('2016-08-24 00:00:00')},
 {'name': 'level',
  'description': "Parameter specifying level in the atmosphere. Corresponds to 'Vertical Level' column in data_dictionary",
  'type': 'str',
  'default': 'surface'},
 {'name': 'param',
  'description': "Specifies what parameter your dataset will contain. Corresponds to 'Parameter Short Name' in data_dictionary",
  'type': 'str',
  'default': 'TMP'}]

Each user parameter can have a name, description, type, defualt, allowed, min, and max key. You can learn more about parameter definitions in Intake’s documentation. This data source contains three user parameters: date, level, and param. Each parameter’s descriptions explain what they are for. The level and param parameter allow you to select data based on level in the atmosphere and variable being measured. There allowed values correspond to values in the “Vertical Level” and “Parameter Short Name” column in the data_dictionary source repectively. The date parameter allows you to select data by date.

data_dictionary = cat.data_dictionary.read()
data_dictionary.query("`Vertical Level` == 'surface'")[:10]

	Parameter Long Name	Vertical Level	Parameter Short Name	Units	1st Version Available	Analysis or Forcast
133	Visibility	surface	VIS	m	V3	both
134	Wind Gust	surface	GUST	m/s	V3	both
135	Hail	surface	HAIL_max_fcst	m	V4	anl
136	Hail	surface	HAIL_1hr_max_fcst	m	V4	fcst
137	Pressure	surface	PRES	Pa	V3	both
138	Geopotential Height	surface	HGT	gpm	V3	both
139	Temperature	surface	TMP	K	V3	both
140	Total Snowfall	surface	ASNOW_acc_fcst	m	V3	both
141	Total Snowfall	surface	ASNOW_1hr_acc_fcst	m	V3	fcst
142	Plant Canopy Surface Water	surface	CNWAT	kg/m2	V3	both

Lets use parameters to select surface temperature data from June 20, 2021. We can provide these parameter by passing keyword arguments to the data source.

Your data source now points to surface temperature data taken June 20, 2021

Using Intake with Dask

Often times the data we want to analyze is too big to be loaded into memory all at once on your computer. Dask solves this problem by breaking up your data into smaller chunks, operating on each chunck of data, and then aggregating the results. This is usually done in parallel on a cluster system. You can use Intake to create a Dask dataset by using the to_dask method instead of the read method.

ds1 = cat.hrrrzarr(date=dt.datetime(2021, 1, 1)).read()
print(type(ds1.TMP.data))

ds2 = cat.hrrrzarr(date=dt.datetime(2022, 1, 1)).to_dask()
print(type(ds2.TMP.data))

/home/runner/miniconda3/envs/intake-cookbook-dev/lib/python3.12/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  'dims': dict(self._ds.dims),

<class 'numpy.ndarray'>

<class 'dask.array.core.Array'>

/home/runner/miniconda3/envs/intake-cookbook-dev/lib/python3.12/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  'dims': dict(self._ds.dims),

As you can see Xarray uses Dask arrays instead of NumPy arrays to hold the data when the to_dask method is used.

Summary

Intake makes it easy to consistently access data regardless of where and how it is stored
Intake catalogs contain useful information about the data they make available
Intake can load data into Dask for use in parallel computing.

What’s next?

In the next notebook we will look at writing a Intake catalog and making it available on Github.