Introduction to Intake
Overview
Intake is a Python library that provides a consistent interface for accessing data regardless of where or how it is stored. In this notebook you will learn to:
- Interact with Intake catalogs
- Use Intake to access data stored in the cloud
- Use Intake to load data into Dask
Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| Times and Dates in Python | Necessary | |
| Intro to Xarray | Necessary | |
| Intro to Cartopy | Helpful | |
| Understanding of Zarr | Helpful | |
| Understanding of Dask | Helpful | |
- Time to learn: 45 minutes
Imports
import intake
import xarray as xr
import datetime as dt
import metpy
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import requests
import aiohttp
import intake_xarray
Interacting with Intake Catalogs
Intake uses an object called a catalog to tell users which datasets are available to them. A catalog can take the form of a YAML file, a server, or a Python package that you install. In this example we will use a catalog to access Mesowest’s HRRR data stored on AWS S3. To open a catalog, call Intake’s open_catalog function with the location of the catalog as an argument. The catalog object it returns is iterable, so you can see which data sources are available by passing it to Python’s list function. You can view the catalog file used for this cookbook on GitHub.
cat = intake.open_catalog('catalog.yml')
list(cat)
['data_dictionary', 'hrrrzarr', 'readme']
Each of the catalog’s sources is accessible as an attribute of the catalog object.
cat.hrrrzarr
<intake_xarray.xzarr.ZarrSource at 0x7fcd9c4fd250>
Learning About Catalog Entries
The first place to look for information about a catalog source is its description, which is stored in the data source’s description property.
cat.hrrrzarr.description
"Mesowest's HRRR data. See readme source for more information."
To get a better look at a source in an Intake catalog, call its describe method.
desc = cat.hrrrzarr.describe()
desc
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 desc = cat.hrrrzarr.describe()
2 desc
AttributeError: 'ZarrSource' object has no attribute 'describe'
From this Python dictionary, there are a few things we can learn. The container key tells us what form the data will be in when we read it. In this case it will be an Xarray Dataset. The user_parameters key holds a list of parameters a user can set to control what data they get. The metadata key contains an arbitrary dictionary of information specified by the catalog author. Plots that give a quick peek at the data are a common thing to find in the metadata field.
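As a rough illustration of the structure described above, the sketch below walks a description-style dictionary. The dictionary is hand-written and hypothetical: its values are placeholders rather than real output from this catalog, and only the three keys discussed above are assumed. The helper name summarize is also made up for this example.

```python
# Hypothetical, hand-written stand-in for a source description.
# The values are placeholders, not the real contents of the catalog.
example_desc = {
    "container": "xarray",
    "user_parameters": [
        {"name": "level", "description": "Vertical level", "type": "str"},
        {"name": "param", "description": "Parameter short name", "type": "str"},
    ],
    "metadata": {},
}


def summarize(desc):
    """Return a short, human-readable summary of a description dictionary."""
    lines = [f"container: {desc.get('container', 'unknown')}"]
    # One line per user parameter, tolerating missing optional keys.
    for p in desc.get("user_parameters", []):
        lines.append(
            f"parameter {p['name']} ({p.get('type', '?')}): {p.get('description', '')}"
        )
    lines.append(f"metadata keys: {sorted(desc.get('metadata', {}))}")
    return lines


for line in summarize(example_desc):
    print(line)
```

Looking at the keys this way before reading any data is cheap, because it only touches the catalog's metadata.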
Reading Data with Intake
Now that we know how to explore Intake catalogs, let’s use one to get some data. Luckily Intake makes this a really easy one-liner.
cat.hrrrzarr.read()
Info
Intake catalogs access data lazily. You can explore the catalog all you want, but you won't have any data until you call the read or similar methods. The read method may take longer to run depending on your internet connection, the size of the data, and your proximity to the data center where the data is stored.
When we looked at the description of the hrrrzarr source, it referenced the readme source. We can look at it using the same method. Pay attention to the projection information. It will be useful later.
cat.readme.read()
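The lazy behavior mentioned in the note above can be illustrated with a toy class. This is only a sketch of the general idea (cheap metadata up front, expensive loading deferred until read is called). It is not how Intake is implemented, and the names LazySource and expensive_load are invented for the example.

```python
class LazySource:
    """Toy data source that defers loading until read() is called.

    Illustrative stand-in only; this is not Intake's implementation.
    """

    def __init__(self, description, loader):
        self.description = description  # cheap metadata, available immediately
        self._loader = loader           # callable that does the expensive work
        self.load_count = 0             # how many times data was actually loaded

    def read(self):
        """Run the loader and return its result."""
        self.load_count += 1
        return self._loader()


def expensive_load():
    # Stand-in for a slow download from cloud storage.
    return [280.1, 281.4, 279.8]


source = LazySource("Toy temperature data", expensive_load)
print(source.description)  # metadata is available without loading anything
print(source.load_count)   # nothing has been read yet
data = source.read()       # the loader runs only now
print(source.load_count, data)
```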
Specifying User Parameters
The hrrrzarr source in this catalog has three parameters that control what data you will read in. To list them, use the user_parameters key of the description dictionary created above.
desc['user_parameters']
Each user parameter can have a name, description, type, default, allowed, min, and max key. You can learn more about parameter definitions in Intake’s documentation. This data source contains three user parameters: date, level, and param. Each parameter’s description explains what it is for. The level and param parameters let you select data by level in the atmosphere and by the variable being measured. Their allowed values correspond to values in the “Vertical Level” and “Parameter Short Name” columns of the data_dictionary source, respectively. The date parameter lets you select data by date.
data_dictionary = cat.data_dictionary.read()
data_dictionary.query("`Vertical Level` == 'surface'")[:10]
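To make the default, allowed, min, and max keys more concrete, here is a minimal sketch of how such a specification could be used to fill in defaults and reject bad values. It is a plain-Python illustration, not Intake's own validation code. The specs list, its allowed values other than 'surface' and 'TMP', the min date, and the helper names check_parameter and resolve are all assumptions made up for this example.

```python
import datetime as dt

# Hypothetical parameter specifications; values are illustrative only.
specs = [
    {"name": "level", "default": "surface", "allowed": ["surface", "other_level"]},
    {"name": "param", "default": "TMP", "allowed": ["TMP", "OTHER"]},
    {"name": "date", "default": dt.datetime(2021, 6, 20), "min": dt.datetime(2000, 1, 1)},
]


def check_parameter(spec, value):
    """Validate one value against the allowed, min, and max keys of its spec."""
    allowed = spec.get("allowed")
    if allowed is not None and value not in allowed:
        raise ValueError(f"{spec['name']}={value!r} is not one of {allowed}")
    if "min" in spec and value < spec["min"]:
        raise ValueError(f"{spec['name']}={value!r} is below the minimum")
    if "max" in spec and value > spec["max"]:
        raise ValueError(f"{spec['name']}={value!r} is above the maximum")
    return value


def resolve(specs, **overrides):
    """Fill in defaults for missing parameters, then validate every value."""
    unknown = set(overrides) - {s["name"] for s in specs}
    if unknown:
        raise TypeError(f"unknown parameters: {sorted(unknown)}")
    return {
        s["name"]: check_parameter(s, overrides.get(s["name"], s.get("default")))
        for s in specs
    }


print(resolve(specs, param="TMP"))
```

Passing a value outside the allowed list, such as level="stratosphere", raises a ValueError in this sketch instead of silently requesting data that does not exist.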
Let’s use parameters to select surface temperature data from June 20, 2021. We can provide these parameters by passing keyword arguments to the data source.
summer_solstice_2021 = dt.datetime(2021, 6, 20)
source = cat.hrrrzarr(date=summer_solstice_2021, level='surface', param='TMP')
source.read()
Your data source now points to surface temperature data taken June 20, 2021.
A More Complicated Example
Mesowest provides a tutorial for reading a day’s worth of surface temperature HRRR data from AWS. Let’s see what the same task looks like using Intake.
We will start by setting up our Cartopy projection according to the information given in the readme source.
projection = ccrs.LambertConformal(central_longitude=262.5,
                                   central_latitude=38.5,
                                   standard_parallels=(38.5, 38.5),
                                   globe=ccrs.Globe(semimajor_axis=6371229,
                                                    semiminor_axis=6371229))
Now let’s read in the data with Intake. To do this, we will create a datetime object for August 10, 2019. Then we will use a list comprehension and timedelta objects to create a datetime object for each hour of that day. Again using a list comprehension, we will create a list of datasets using Intake. To concatenate our list of datasets with Xarray, we need a dimension to concatenate across. Each dataset in our list contains a time variable with an array of just one timestamp. We can promote that variable to a coordinate using the set_coords method. This may take a few minutes to run.
%%time
date = dt.datetime(2019, 8, 10)
hours = [date + dt.timedelta(hours=i) for i in range(24)]
datasets = [cat.hrrrzarr(date=hour).read().set_coords("time") for hour in hours]
ds = xr.concat(datasets, dim='time', combine_attrs="drop_conflicts")
ds
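The list comprehension over timedelta objects can be checked on its own, without downloading anything. The sketch below uses only the standard library. The helper name hourly_timestamps is made up for this example.

```python
import datetime as dt


def hourly_timestamps(day, n_hours=24):
    """Return one datetime per hour, starting at midnight of the given day."""
    start = dt.datetime(day.year, day.month, day.day)
    return [start + dt.timedelta(hours=i) for i in range(n_hours)]


hours = hourly_timestamps(dt.datetime(2019, 8, 10))
print(len(hours))           # 24
print(hours[0], hours[-1])  # 2019-08-10 00:00:00 2019-08-10 23:00:00
```

Each of these timestamps corresponds to one value of the date parameter, so the list length is also the number of separate reads the cell above performs.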
Now our data is ready to be analyzed in the normal way using Xarray.
avg_TMP = ds.TMP.mean(dim='time')
fig = plt.figure(figsize=(10, 8.5))
ax = fig.add_subplot(1, 1, 1, projection=projection)
temp_plot = ax.contourf(avg_TMP.projection_x_coordinate, avg_TMP.projection_y_coordinate, avg_TMP)
ax.coastlines()
fig.colorbar(temp_plot, orientation="horizontal")
plt.show()
Using Intake with Dask
Often the data we want to analyze is too big to be loaded into memory all at once. Dask solves this problem by breaking the data into smaller chunks, operating on each chunk, and then aggregating the results. This is usually done in parallel on a cluster. You can use Intake to create a Dask-backed dataset by calling the to_dask method instead of the read method.
ds1 = cat.hrrrzarr(date=dt.datetime(2021, 1, 1)).read()
print(type(ds1.TMP.data))
ds2 = cat.hrrrzarr(date=dt.datetime(2022, 1, 1)).to_dask()
print(type(ds2.TMP.data))
As you can see, Xarray uses Dask arrays instead of NumPy arrays to hold the data when the to_dask method is used.
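The chunk, operate, and aggregate pattern described in this section can be sketched in plain Python. This is not Dask and runs serially. It only shows why a mean can be computed from per-chunk partial results without holding every value in one place. The function names chunks and chunked_mean are made up for this example.

```python
def chunks(values, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def chunked_mean(values, size):
    """Compute a mean by combining per-chunk partial sums and counts."""
    # Each chunk is reduced independently to a (sum, count) pair.
    partials = [(sum(c), len(c)) for c in chunks(values, size)]
    # The small partial results are then aggregated into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


temps = [float(t) for t in range(270, 290)]
print(chunked_mean(temps, size=6))
print(sum(temps) / len(temps))
```

Because each chunk is reduced independently, the per-chunk step is the part that a parallel scheduler could distribute across workers.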
Summary
- Intake makes it easy to consistently access data regardless of where and how it is stored
- Intake catalogs contain useful information about the data they make available
- Intake can load data into Dask for use in parallel computing
What’s next?
In the next notebook we will look at writing an Intake catalog and making it available on GitHub.