Planetary Computer

Data Ingestion - Microsoft Planetary Computer


Overview

In this notebook, you will ingest Landsat data for use in machine learning. Machine learning tasks often involve a lot of data, and in Python, data is typically stored in memory as simple NumPy arrays. However, higher-level containers built on top of NumPy arrays provide more functionality for multidimensional gridded data (xarray) or out-of-core and distributed data (Dask). Our goal for data ingestion will be to load specific Landsat data of interest into one of these higher-level containers.

Microsoft Plantery Computer is one of several providers of Landsat Data. We are using it together with pystac-client and odc-stac because together they provide a nice Python API for searching and loading with specific criteria such as spatial area, datetime, Landsat mission, and cloud coverage.

Earth science datasets are often stored on remote servers that may be too large to download locally. Therefore, in this cookbook, we will focus primarily on ingestion approaches that load small portions of data from a remote source, as needed. However, the approach for your own work will depend not only on data size and location but also the intended analysis, so in a follow up notebook, you will see alternative approaches to loading data.

Prerequisites

Concepts

Importance

Notes

Intro to Landsat

Necessary

Background

About the Microsoft Planetary Computer

Helpful

Background

pystac-client Usage

Helpful

Consult as needed

odc.stac.load Reference

Helpful

Consult as needed

xarray

Necessary

Intro to Dask Array

Helpful

Panel Getting Started Guide

Helpful

  • Time to learn: 10 minutes

Imports

# Data
import odc.stac
import pandas as pd
import planetary_computer
import pystac_client
import xarray as xr
from pystac.extensions.eo import EOExtension as eo

# Viz
import hvplot.xarray
import panel as pn

pn.extension()

Open and read the root of the STAC catalog

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
catalog.title
'Microsoft Planetary Computer STAC API'

Microsoft Planetary Computer has a public STAC metadata but the actual data assets are in private Azure Blob Storage containers and require authentication. pystac-client provides a modifier keyword that we can use to manually sign the item. Otherwise, we’d get an error when trying to access the asset.

Search for Landsat Data

Let’s say that an analysis we want to run requires landsat data over a specific region and from a specific time period. We can use our catalog to search for assets that fit our search criteria.

First, let’s find the name of the landsat dataset. This page is a nice resource for browsing the available collections, but we can also just search the catalog for ‘landsat’:

all_collections = [i.id for i in catalog.get_collections()]
landsat_collections = [
    collection for collection in all_collections if "landsat" in collection
]
landsat_collections
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 all_collections = [i.id for i in catalog.get_collections()]
      2 landsat_collections = [
      3     collection for collection in all_collections if "landsat" in collection
      4 ]
      5 landsat_collections

Cell In[3], line 1, in <listcomp>(.0)
----> 1 all_collections = [i.id for i in catalog.get_collections()]
      2 landsat_collections = [
      3     collection for collection in all_collections if "landsat" in collection
      4 ]
      5 landsat_collections

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pystac_client/client.py:433, in Client.get_collections(self)
    429         for col in page["collections"]:
    430             collection = CollectionClient.from_dict(
    431                 col, root=self, modifier=self.modifier
    432             )
--> 433             call_modifier(self.modifier, collection)
    434             yield collection
    435 else:

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pystac_client/_utils.py:18, in call_modifier(modifier, obj)
     15 if modifier is None:
     16     return None
---> 18 result = modifier(obj)
     19 if result is not None and result is not obj:
     20     warnings.warn(
     21         f"modifier '{modifier}' returned a result that's being ignored. "
     22         "You should ensure that 'modifier' is operating in-place and use "
     23         "a function that returns 'None' or silence this warning.",
     24         IgnoredResultWarning,
     25     )

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/planetary_computer/sas.py:109, in sign_inplace(obj)
    103 def sign_inplace(obj: Any) -> Any:
    104     """
    105     Sign the object in place.
    106 
    107     See :func:`planetary_computer.sign` for more.
    108     """
--> 109     return sign(obj, copy=False)

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/planetary_computer/sas.py:367, in sign_collection(collection, copy)
    364         collection.assets = deepcopy(assets)
    366 for key in collection.assets:
--> 367     _sign_asset_in_place(collection.assets[key])
    368 return collection

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/planetary_computer/sas.py:261, in _sign_asset_in_place(asset)
    251 """Sign a PySTAC asset
    252 
    253 Args:
   (...)
    258     with a signed version.
    259 """
    260 asset.href = sign(asset.href)
--> 261 _sign_fsspec_asset_in_place(asset)
    262 return asset

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/planetary_computer/sas.py:298, in _sign_fsspec_asset_in_place(asset)
    296 container = parse_adlfs_url(href)
    297 if account and container:
--> 298     token = get_token(account, container)
    299     storage_options["credential"] = token.token

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/planetary_computer/sas.py:470, in get_token(account_name, container_name, retry_total, retry_backoff_factor)
    461 session.mount("https://", adapter)
    462 response = session.get(
    463     token_request_url,
    464     headers=(
   (...)
    468     ),
    469 )
--> 470 response.raise_for_status()
    472 token = SASToken(**response.json())
    473 if not token:

File ~/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: Not Found for url: https://planetarycomputer.microsoft.com/api/sas/v1/token/pcstacitems/items

We’ll use the landsat-c2-l2 dataset, which stands for Collection 2 Level-2. It contains data from several landsat missions and has better data quality than Level 1 (landsat-c2-l1). Microsoft Planetary Computer has descriptions of Level 1 and Level 2, but a direct and succinct comparison can be found in this community post, and the information can be verified with USGS.

Now, let’s set our search parameters. You may already know the bounding box (region/area of interest) coordinates, but if you don’t, there are many useful tools like bboxfinder.com that can help.

bbox = [-118.89, 38.54, -118.57, 38.84]  # Region over a lake in Nevada, USA
datetime = "2017-06-01/2017-09-30"  # Summer months of 2017
collection = "landsat-c2-l2"

We can also specify other parameters in the query, such as a specific landsat mission and the max percent of cloud cover:

platform = "landsat-8"
cloudy_less_than = 1  # percent

Now we run the search and list the results:

search = catalog.search(
    collections=["landsat-c2-l2"],
    bbox=bbox,
    datetime=datetime,
    query={"eo:cloud_cover": {"lt": cloudy_less_than}, "platform": {"in": [platform]}},
)
items = search.get_all_items()
print(f"Returned {len(items)} Items:")
item_id = {(i, item.id): i for i, item in enumerate(items)}
item_id
/home/runner/miniconda3/envs/cookbook-dev/lib/python3.10/site-packages/pystac_client/item_search.py:849: FutureWarning: get_all_items() is deprecated, use item_collection() instead.
  warnings.warn(
Returned 3 Items:
{(0, 'LC08_L2SP_042033_20170718_02_T1'): 0,
 (1, 'LC08_L2SP_042033_20170702_02_T1'): 1,
 (2, 'LC08_L2SP_042033_20170616_02_T1'): 2}

It looks like there were three image stacks taken by Landsat 8 over this spatial region during the summer months of 2017 that has less than 1 percent cloud cover.

Preview Results and Select a Dataset

Before loading one of the available image stacks, it would be useful to get a visual check of the results. Many datasets have a rendered preview or thumbnail image that can be accessed without having to load the full resolution data.

We can create a simple interactive application using the Panel library to access and display rendered PNG previews of the our search results. Note that these pre-rendered images are of large tiles that span beyond our bounding box of interest. In the next steps, we will only be loading in a small area around the lake.

item_sel = pn.widgets.Select(value=1, options=item_id, name="item")

def get_preview(i):
    return pn.panel(items[i].assets["rendered_preview"].href, height=300)


pn.Row(item_sel, pn.bind(get_preview, item_sel))