Data Ingestion - General Purpose Tooling
Overview
If the specialized geospatial tools discussed in the previous notebook suit your needs, feel free to proceed to explore a workflow example, such as Spectral Clustering. However, if you’re seeking a tool that is adaptable across a wider range of data types and sources, welcome to this introduction to Intake V2, a general-purpose data ingestion and management library.
Intake is a high-level library designed for data ingestion and management. While the geospatial-specific tooling approach is optimized for satellite data, Intake offers a broader and potentially more flexible approach for multimodal data workflows, characterized by:
Unified Interface: Abstracts the details of data sources, enabling users to interact with a consistent API regardless of the data’s underlying format.
Dynamic and Shareable Catalogs: Facilitates the creation and sharing of data catalogs that can be version-controlled, updated, and maintained.
Extensible: Facilitates the addition of new data sources and formats through its plugin system.
In the following sections, we will guide you through an introduction to various Intake functionalities that simplify data access and enhance both modularity and reproducibility in geospatial workflows.
Prerequisites
Concepts |
Importance |
Notes |
---|---|---|
Necessary |
Background |
|
Helpful |
||
Helpful |
||
Necessary |
||
Helpful |
||
Necessary |
Time to learn: 20 minutes
Imports
import intake
import planetary_computer
from pprint import pprint
# Viz
import hvplot.xarray
import panel as pn
pn.extension()
Connecting to Data Sources
To get started, we need to provide a STAC URL (or any other data source URL) to intake, and we can ask intake to recommend some suitable datatypes.
url = "https://planetarycomputer.microsoft.com/api/stac/v1"
data_types = intake.readers.datatypes.recommend(url)
pprint(data_types)
[<class 'intake.readers.datatypes.JSONFile'>,
<class 'intake.readers.datatypes.STACJSON'>,
<class 'intake.readers.datatypes.Handle'>,
<class 'intake.readers.datatypes.CatalogAPI'>,
<class 'intake.readers.datatypes.TiledService'>]
Selecting the Appropriate Data Type
After identifying the possible data types, we choose the one that best suits our needs. For handling STAC formatted JSON data from our URL, we will proceed with STACJSON
.
data_type = intake.datatypes.STACJSON(url)
data_type
STACJSON, {'url': 'https://planetarycomputer.microsoft.com/api/stac/v1', 'storage_options': None, 'metadata': {}}
This object now represents the specific data type we will work with, allowing us to streamline subsequent data operations.
Initializing Data Readers
With the STACJSON
data type specified, we explore available methods to read the data.
readers = data_type.possible_readers
pprint(readers)
{'importable': [<class 'intake.readers.catalogs.StackBands'>,
<class 'intake.readers.catalogs.StacSearch'>,
<class 'intake.readers.catalogs.StacCatalogReader'>,
<class 'intake.readers.readers.DaskJSON'>,
<class 'intake.readers.readers.OpenFilesReader'>,
<class 'intake.readers.readers.FileByteReader'>,
<class 'intake.readers.readers.FileTextReader'>,
<class 'intake.readers.readers.FileExistsReader'>],
'not_importable': [<class 'intake.readers.readers.PolarsJSON'>,
<class 'intake.readers.readers.RayJSON'>,
<class 'intake.readers.readers.AwkwardJSON'>,
<class 'intake.readers.readers.DuckJSON'>,
<class 'intake.readers.readers.RayBinary'>]}
This output presents us with options that can interpret the STACJSON
data format effectively. The StacCatalogReader
is probably the most suitable for our use case. We can use it to read the STAC catalog and explore the available contents.
Reading the Catalog
Next, we can access the data catalog through our reader.
reader = intake.catalogs.StacCatalogReader(
data_type, signer=planetary_computer.sign_inplace
)
reader
StacCatalogReader reader producing intake.readers.entry:Catalog
This reader is now configured to handle interactions with the data catalog.
List Catalog Contents
Once the catalog is accessible, we read()
it and then collect each dataset’s description
to identify datasets of interest. For our purposes, we will just print the entries that include the word 'landsat'
.
stac_cat = reader.read()
description = {}
for data_description in stac_cat.data.values():
data = data_description.kwargs["data"]
description[data["id"]] = data["description"]
# Print only keys that include the word 'landsat'
pprint([key for key in description.keys() if 'landsat' in key.lower()])
['landsat-c2-l2', 'landsat-c2-l1']
Detailed Dataset Examination
By examining specific datasets more closely, we understand their content and relevance to our project goals. We can now print the description of the desired landsat IDs.
print("1:", description["landsat-c2-l1"])
print('-------------------------------\n')
print("2:", description["landsat-c2-l2"])
1: Landsat Collection 2 Level-1 data, consisting of quantized and calibrated scaled Digital Numbers (DN) representing the multispectral image data. These [Level-1](https://www.usgs.gov/landsat-missions/landsat-collection-2-level-1-data) data can be [rescaled](https://www.usgs.gov/landsat-missions/using-usgs-landsat-level-1-data-product) to top of atmosphere (TOA) reflectance and/or radiance. Thermal band data can be rescaled to TOA brightness temperature.
This dataset represents the global archive of Level-1 data from [Landsat Collection 2](https://www.usgs.gov/core-science-systems/nli/landsat/landsat-collection-2) acquired by the [Multispectral Scanner System](https://landsat.gsfc.nasa.gov/multispectral-scanner-system/) onboard Landsat 1 through Landsat 5 from July 7, 1972 to January 7, 2013. Images are stored in [cloud-optimized GeoTIFF](https://www.cogeo.org/) format.
-------------------------------
2: Landsat Collection 2 Level-2 [Science Products](https://www.usgs.gov/landsat-missions/landsat-collection-2-level-2-science-products), consisting of atmospherically corrected [surface reflectance](https://www.usgs.gov/landsat-missions/landsat-collection-2-surface-reflectance) and [surface temperature](https://www.usgs.gov/landsat-missions/landsat-collection-2-surface-temperature) image data. Collection 2 Level-2 Science Products are available from August 22, 1982 to present.
This dataset represents the global archive of Level-2 data from [Landsat Collection 2](https://www.usgs.gov/core-science-systems/nli/landsat/landsat-collection-2) acquired by the [Thematic Mapper](https://landsat.gsfc.nasa.gov/thematic-mapper/) onboard Landsat 4 and 5, the [Enhanced Thematic Mapper](https://landsat.gsfc.nasa.gov/the-enhanced-thematic-mapper-plus-etm/) onboard Landsat 7, and the [Operatational Land Imager](https://landsat.gsfc.nasa.gov/satellites/landsat-8/spacecraft-instruments/operational-land-imager/) and [Thermal Infrared Sensor](https://landsat.gsfc.nasa.gov/satellites/landsat-8/spacecraft-instruments/thermal-infrared-sensor/) onboard Landsat 8 and 9. Images are stored in [cloud-optimized GeoTIFF](https://www.cogeo.org/) format.
Selecting and Accessing Data
We want "landsat-c2-l2"
, so with a chosen dataset, we can now access it directly and view the metadata
specific to this dataset - key details that are important for analysis and interpretation. Since the output is long, we’ll utilize the HoloViz Panel library to wrap the output in a scrollable element.
landsat_reader = stac_cat["landsat-c2-l2"]
landsat_metadata = landsat_reader.read().metadata
# View extensive metadata in scrollable block
json_pane = pn.pane.JSON(landsat_metadata, name='Metadata', max_height=400, sizing_mode='stretch_width', depth=-1, theme='light')
scrollable_output = pn.Column(json_pane, height=400, sizing_mode='stretch_width', scroll=True, styles={'background': 'lightgrey'})
scrollable_output
Visual Preview
To get a visual preview of the dataset, particularly to check its quality and relevance, we use the following commands:
landsat_reader["thumbnail"].read()
Accessing Geospatial Data Items
Once we have selected the appropriate dataset, the next step is to access the specific data items. These items typically represent individual data files or collections that are part of the dataset.
The following code retrieves a handle to the ‘geoparquet-items’ from the Landsat dataset, which are optimized for efficient geospatial operations and queries.
landsat_items = landsat_reader["geoparquet-items"]
landsat_items
DaskGeoParquet reader producing dask_geopandas.core:GeoDataFrame
Converting Data for Analysis
To facilitate analysis, the following code selects the last few entries (tail
) of the dataset, converts them into a GeoDataFrame, and reads it back into a STAC catalog format. This format is particularly suited for geospatial data and necessary for compatibility with geospatial analysis tools and libraries like Geopandas.
cat = landsat_items.tail(output_instance="geopandas:GeoDataFrame").GeoDataFrameToSTACCatalog.read()
/home/runner/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/dask/dataframe/_pyarrow_compat.py:17: FutureWarning: Minimal version of pyarrow will soon be increased to 14.0.1. You are using 12.0.1. Please consider upgrading.
warnings.warn(
/home/runner/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/geopandas/_compat.py:154: UserWarning: The Shapely GEOS version (3.11.1-CAPI-1.17.1) is incompatible with the GEOS version PyGEOS was compiled with (3.10.4-CAPI-1.16.2). Conversions between both will be slow.
set_use_pygeos()
Exploring Data Collections
After conversion, we explore the structure of the data collection. Each “item” in this collection corresponds to a set of assets, providing a structured way to access multiple related data files. We’ll simply print the structure of the catalog to understand the available items and their organization.
cat
Catalog
named datasets: ['LC09_L2SP_093064_20240601_02_T1', 'LC09_L2SP_093065_20240601_02_T2', 'LC09_L2SP_093066_20240601_02_T1', 'LC09_L2SP_093067_20240601_02_T1', 'LC09_L2SP_093068_20240601_02_T1']
Accessing Sub-Collections
To dive deeper into the data, we access a specific sub-collection based on its key. This allows us to focus on a particular geographic area or time period. We’ll select the first item in the catalog for now.
item_key = list(cat.entries.keys())[0]
subcat = cat[item_key].read()
subcat
Catalog
named datasets: ['ang', 'atran', 'blue', 'cdist', 'coastal', 'drad', 'emis', 'emsd', 'green', 'lwir11', 'mtl.json', 'mtl.txt', 'mtl.xml', 'nir08', 'qa', 'qa_aerosol', 'qa_pixel', 'qa_radsat', 'red', 'rendered_preview', 'swir16', 'swir22', 'tilejson', 'trad', 'urad']
Reading Specific Data Bands
For detailed analysis, especially in remote sensing, accessing specific spectral bands is crucial. Here, we read the red spectral band, which is often used in vegetation analysis and other remote sensing applications.
subcat.red.read()
[<xarray.DataArray (band: 1, y: 7701, x: 7581)>
[58381281 values with dtype=uint16]
Coordinates:
* band (band) int64 1
* x (x) float64 1.449e+05 1.449e+05 ... 3.723e+05 3.723e+05
* y (y) float64 -1.165e+06 -1.165e+06 ... -1.396e+06 -1.396e+06
spatial_ref int64 0
Attributes:
AREA_OR_POINT: Point
_FillValue: 0
scale_factor: 1.0
add_offset: 0.0]
Preparing for Multiband Analysis
To analyze true color imagery, we need to stack multiple spectral bands. Here, we prepare for this by setting up a band-stacking operation. Note, re-signing might be necessary at this point.
catbands = cat[item_key].to_reader(reader="StackBands", bands=["red", "green", "blue"], signer=planetary_computer.sign_inplace)
Loading and Visualizing True Color Imagery
After setting up the band-stacking, we read the multiband data and prepare it for visualization.
data = catbands.read(dim="band")
data
<xarray.DataArray (band: 3, y: 7701, x: 7581)> array([[[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], [[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], [[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]]], dtype=uint16) Coordinates: * band (band) int64 1 1 1 * x (x) float64 1.449e+05 1.449e+05 ... 3.723e+05 3.723e+05 * y (y) float64 -1.165e+06 -1.165e+06 ... -1.396e+06 -1.396e+06 spatial_ref int64 0 Attributes: AREA_OR_POINT: Point _FillValue: 0 scale_factor: 1.0 add_offset: 0.0
Visualizing Data
Finally, we visualize the true color imagery. This visualization helps in assessing the quality of the data and the appropriateness of the bands used.
data.plot.imshow(robust=True, figsize=(10, 10))
<matplotlib.image.AxesImage at 0x7efe1c82aec0>
Summary
As earth science data becomes integrated with other types of data, a powerful approach is to utilize a general purpose set of tools, including Intake and Xarray. Once you have accessed data, visualize it with hvPlot to ensure that it matches your expectations.
What’s next?
Now that we know how to access the data, it’s time to proceed to analysis, where we will explore a some simple machine learning approaches.
Resources and references
Authored by Demetris Roumis and Andrew Huang circa April, 2024, with guidance from Martin Durant.
The banner image is a mashup of a Landsat 8 image from NASA and the Intake logo.