%pip install -q intake-stac
Preprocessing¶
Once you have acquired your data, the next step is preprocessing: preparing the data for computing the retrieval and generating the final data product. Preprocessing will include one or more of the following tasks:
- Cleaning
- Unit conversion
- Reprojection
- Regridding
- Normalization
Each of these steps helps standardize the data so it can be combined, compared, or analyzed reliably.
🔧 Preprocessing Steps¶
🔹 Cleaning
Data cleaning is required when the dataset contains missing values, outliers, or artifacts that could bias the analysis. This might include filtering NaNs, removing physically impossible values, or masking bad pixels.
To clean the data, you typically:
- Identify invalid or missing values.
- Mask or remove unreliable data.
- Optionally, interpolate or fill gaps as needed (see the sketch below).
📚 Tutorial: Data Cleaning with Pandas and NumPy (RealPython)
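As a minimal sketch of what cleaning looks like with xarray (the fill value of 0 and the plausible 200–350 K range here are illustrative assumptions, not values from the Landsat product):
import numpy as np
import xarray as xr

# Hypothetical band of brightness temperatures in Kelvin with a nodata fill of 0
da = xr.DataArray(np.array([[270.0, 0.0], [310.0, 150.0]]), dims=('y', 'x'))
cleaned = da.where(da != 0)                                 # mask nodata pixels
cleaned = cleaned.where((cleaned > 200) & (cleaned < 350))  # mask physically implausible values
filled = cleaned.interpolate_na(dim='x')                    # optionally fill small gaps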
🔹 Unit Conversion
Unit conversion is needed when datasets use different physical units (e.g., Kelvin vs. Celsius, W/m² vs. mW/cm²) or when preparing inputs for physical equations that require standardized units.
This may also require spatial integration (e.g., converting a flux to energy) to match units over time and space.
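For example, converting Kelvin to Celsius is a simple offset, while converting flux units needs a scale factor. A sketch, where bt_kelvin and flux_w_m2 are hypothetical inputs:
import numpy as np
import xarray as xr

# Hypothetical inputs: brightness temperature in Kelvin and a flux in W/m^2
bt_kelvin = xr.DataArray(np.array([271.4, 274.9]), dims='x')
flux_w_m2 = xr.DataArray(np.array([120.0, 95.5]), dims='x')

bt_celsius = bt_kelvin - 273.15   # Kelvin -> Celsius
flux_mw_cm2 = flux_w_m2 * 0.1     # 1 W/m^2 = 1000 mW / 10,000 cm^2 = 0.1 mW/cm^2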
🔹 Reprojection
Reprojection is required when datasets are provided in different coordinate reference systems (CRS). Working with mismatched projections can lead to spatial misalignment—features may not overlap or align correctly.
To reproject data (see the sketch below):
- Determine the CRS of each dataset.
- Use geospatial tools to transform to a common projection.
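A minimal sketch with rioxarray; the file name is hypothetical, and EPSG:3031 matches the polar stereographic grid used later in this notebook:
import rioxarray  # registers the .rio accessor on xarray objects

da = rioxarray.open_rasterio('scene.tif')    # hypothetical input file
print(da.rio.crs)                            # 1. determine the CRS of the dataset
reprojected = da.rio.reproject('EPSG:3031')  # 2. transform to a common projection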
🔹 Regridding
Regridding is used when datasets have different spatial resolutions or grid layouts and need to be brought onto a common grid. For example, satellite data may be on a swath-based grid while model output is on a regular latitude-longitude grid.
This step ensures datasets are co-located in space and is critical for any pixel-wise comparison or combination.
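As a sketch, xarray's interp_like can regrid one field onto another's regular grid (true swath-to-grid resampling generally needs a dedicated tool such as pyresample or xESMF). The grids below are synthetic:
import numpy as np
import xarray as xr

# Synthetic coarse field and a finer target grid over the same region
coarse = xr.DataArray(np.random.rand(10, 10),
                      coords={'lat': np.linspace(-75, -73, 10),
                              'lon': np.linspace(-104, -102, 10)},
                      dims=('lat', 'lon'))
fine = xr.DataArray(np.zeros((100, 100)),
                    coords={'lat': np.linspace(-75, -73, 100),
                            'lon': np.linspace(-104, -102, 100)},
                    dims=('lat', 'lon'))

# Interpolate the coarse field onto the fine grid (linear by default)
regridded = coarse.interp_like(fine)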
🔹 Normalization
Normalization rescales data so that it is on a consistent numerical scale, especially important when combining variables with different units or orders of magnitude as inputs to a model (e.g., temperature vs. elevation vs. reflectance).
For example, normalizing input features before passing them into a machine learning model helps ensure each variable contributes proportionally.
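A minimal min-max sketch for an xarray.DataArray (the normalize helper shown later in this notebook does the same thing for dask-backed arrays):
import numpy as np
import xarray as xr

temp = xr.DataArray(np.array([271.2, 273.9, 275.4]), dims='x')  # e.g. SST in Kelvin
temp_norm = (temp - temp.min(skipna=True)) / (temp.max(skipna=True) - temp.min(skipna=True))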
In the Sea Surface Temperature (SST) workflow demonstrated in this cookbook, we will use all of these preprocessing steps.
Read in Landsat thermal data¶
Let’s begin by reading in the data we acquired previously in the Data Access notebook.
We read in all paths and parameters
cd /home/jovyan/landsatproduct-cookbook/
%matplotlib widget
%load_ext autoreload
%autoreload 2
from pathlib import Path
import os
import numpy as np
import boto3
from rasterio.session import AWSSession
import earthaccess
import intake
import xarray as xr
from shapely.geometry.polygon import Polygon
import geopandas as gpd
import SSTutils as stu
import warnings
warnings.filterwarnings('ignore')
# Define the landsat STAC catalog location
url = 'https://landsatlook.usgs.gov/stac-server'
# For atm correction
basepath = Path('/home/jovyan/Landsat_SST_algorithm')
lsatpath = basepath / 'Data'
atmpath = lsatpath / 'AtmCorrection'
modout_path = lsatpath / 'MOD07_L2'
WV = 'Water_Vapor'
# For search and tile plot for Landsat
satellite = 'Landsat8'
collection = 'landsat-c2l1' # Landsat Collection 2, Level 1 - includes L8 and L9
colnm = ['landsat:wrs_path','landsat:wrs_row']
gjson_outfile = lsatpath / f'{satellite}.geojson'
# For scene search and plot
interp = 1
region = 'Cosgrove'
if region=='Cosgrove':
    SSTpath = lsatpath / 'SST/MODcalib/Cosgrove/'
    bbox = (-103.0, -73.5, -102.0, -73.42) # LatboundsC from LandsatCalibration20201031
    # Cosgrove full run
    timeRange = '2021-09-01/2023-04-07'
    # [0:1] Cosgrove bbox
    # timeRange = '2022-11-01/2023-03-27'
elif region=='DotsonPolynya':
    SSTpath = lsatpath / 'SST/MODcalib/DotsonPolynya/'
    bbox = (-113, -73.9, -111.5, -73.59) # Dotson polynya
    # Dotson full run
    timeRange = '2021-09-01/2023-05-31'
    # [0:1] Dotson bbox
    # timeRange = '2022-11-01/2023-03-27'
elif region=='PineIslandPlume':
    SSTpath = lsatpath / 'SST/UncalibratedSST/PineIslandPlume/'
    bbox = (-101.98,-75.09,-101.65,-75.05) # PIG plume for analysis - 2014
    # bbox = (-101.88,-75.23,-100.35,-74.76) # PIG ice front for analysis
    # bbox = (-101.8,-75.23,-100.50,-74.80) # PIG 2019?
    # PIG full run - NOT narrowed down yet
    timeRange = '2021-09-01/2023-04-07'
elif region=='DotsonIntercomp':
    SSTpath = lsatpath / 'SST/Validation/DotsonIntercomp/'
    bbox = (-113.5,-74.20,-113.17,-74.11) # Dotson plume for analysis
    # Dotson intercomp run
    timeRange = '2021-09-01/2023-03-31'
elif region=='Burke':
    SSTpath = lsatpath / 'SST/MODcalib/Burke/'
    bbox = (-104.2,-73.81, -103.8, -73.42) # Outside Cosgrove south of Burke
    # Burke full run
    timeRange = '2021-09-01/2023-04-06'
We set up authentication for accessing all data
# Authenticate for boto S3 access, etc.
os.environ["AWS_REQUEST_PAYER"] = "requester"
aws_session = AWSSession(boto3.Session(), requester_pays=True)
# Set up dask and verify workers inherit the requester-pays setting
from dask.distributed import Client
import logging
client = Client(processes=True, n_workers=4,
                threads_per_worker=1,
                silence_logs=logging.ERROR)
client.run(lambda: os.environ["AWS_REQUEST_PAYER"] == "requester")
client
# Authenticate for accessing NASA data (MODIS)
auth = earthaccess.login(strategy="interactive")
# Search for desired Landsat scenes
items = stu.search_stac(url, collection, gjson_outfile=gjson_outfile, bbox=bbox, timeRange=timeRange)
# Open stac catalog for some needed info
catalog = intake.open_stac_item_collection(items)
# Load the geojson file
gf = gpd.read_file(gjson_outfile)
# Plot tiles of all scenes found
stu.plot_search(gf,satellite,colnm)
Note that the bands to include are selected by passing the bandNames parameter to landsat_to_xarray, using the following codes:
'coastal', 'blue', 'green', 'red', 'nir08', 'swir16', 'swir22', 'pan', 'cirrus', 'lwir11', 'lwir12', 'qa_pixel'
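For example, a call selecting just the bands needed for an SST retrieval might look like the following sketch (the band choice here is illustrative):
# Load only the red, thermal, and QA bands for one scene
ls_scene = stu.landsat_to_xarray(sceneid, catalog, bandNames=['red', 'lwir11', 'qa_pixel'])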
Process Landsat scenes to acquire sea surface temperature¶
# Convert the bounding box to polar stereographic coordinates to check whether Landsat has any data inside it
source_crs = 'epsg:4326'
target_crs = 'epsg:3031' # Antarctic Polar Stereographic, the coordinate system of the file
sbox,checkbox = stu.lsat_reproj(source_crs,target_crs,(bbox[0],bbox[1],bbox[2],bbox[3]))
# Create polygon for later cropping
polygon = Polygon([(sbox[0][0],sbox[0][1]),(sbox[3][0],sbox[3][1]),(sbox[2][0],sbox[2][1]),(sbox[1][0],sbox[1][1])])
# Create min/max boundaries for trimming image before crop_xarray to cut down on processing times
minx, miny, maxx, maxy = polygon.bounds
polarx = [minx, maxx]
polary = [miny, maxy]
# Include only Landsat 8 scenes (IDs look like 'LC08...', so the character at index 3 is the satellite number)
catalog_list = [x for x in items if x.id[3]=='8']
sceneid = catalog_list[0]
print(sceneid.id)
scene = catalog[sceneid.id]
timestr = scene.metadata['datetime'].strftime('%H%M%S')
outFile = f'{SSTpath}/{sceneid.id}_{timestr}_Cel.tif'
# Open all desired bands for one scene
ls_scene = stu.landsat_to_xarray(sceneid,catalog)
ls_scene = ls_scene.rio.write_crs("epsg:3031", inplace=True)
Masking Unwanted Pixels in Landsat Thermal Imagery¶
In our Landsat SST algorithm, the first preprocessing step is to ensure that we only process ocean pixels.
Why?
Thermal infrared measurements are highly sensitive to atmospheric effects, particularly water vapor, and cannot provide accurate surface temperature if clouds are present.
Additionally, we don’t want SST from land or ice pixels.
This means our first preprocessing task is masking—identifying and excluding pixels that shouldn’t be processed.
Sources for Pixel Classification¶
Landsat imagery includes a qa_pixel band with bit flags that encode surface classification for each pixel.
We can:
- Use the qa_pixel band to mask unwanted pixels (our approach here).
- Replace or augment it with a machine learning classifier (e.g., a neural network) for more accurate cloud detection.
Note: The standard Landsat cloud classification is not well suited to detecting certain cloud types; ML-based classifiers often outperform it under those conditions.
Step 1 – Inspect the QA Band¶
Let’s first inspect what’s inside the QA band:
qa = ls_scene.sel(band='qa_pixel').astype('uint16')
unique_values, counts = np.unique(qa, return_counts=True)
print("Unique QA codes:", unique_values)
# Imports these helpers need (assumed at the top of SSTutils.py)
import numpy as np
import xarray as xr
import pystac_client

def create_masks(ls_scene, cloud_mask=True, ice_mask=False, ocean_mask=False):
    """
    Creates cloud, ice, and ocean masks from a Landsat scene QA band. By default,
    clouds are labeled as 1, ice as 2, ocean as 3, and all other pixels are NaN.

    Parameters
    ----------
    ls_scene : xarray.DataArray
        A Landsat scene loaded with a 'qa_pixel' band (as created by `landsat_to_xarray`).
    cloud_mask : bool, optional
        Whether to generate the cloud mask. Default is True.
    ice_mask : bool, optional
        Whether to generate the ice mask. Default is False.
    ocean_mask : bool, optional
        Whether to generate the ocean mask. Default is False.

    Returns
    -------
    xarray.DataArray
        The same input xarray object, but with an added `"mask"` coordinate.
        In that mask, cloud pixels are assigned 1, ice pixels 2, ocean pixels 3,
        and everything else is set to NaN.
    """
    cloud = []
    ocean = []
    ice = []
    qa = ls_scene.sel(band='qa_pixel').astype('uint16')
    n, c = np.unique(qa, return_counts=True)
    for j in range(len(n)):
        longform = f'{n[j]:016b}'
        if (longform[-7]=='0') | (longform[-3]=='1'):  # bit 6 (clear) unset or bit 2 (cirrus) set -> cloud
            cloud.append(n[j])
        if longform[-8:]=='11000000':  # bits 6 and 7 give clear sky and water; lower bits must be 0
            ocean.append(n[j])
        if longform[-7:]=='1100000':  # bits 5 and 6 give ice and clear sky
            ice.append(n[j])
    # QA values 0 and 1 are fill, not cloud
    if 0 in cloud:
        cloud.remove(0)
    if 1 in cloud:
        cloud.remove(1)
    # Build the mask: cloud is 1, ice is 2, ocean is 3, everything else NaN
    mask_c = xr.full_like(qa, np.nan, dtype='float32')
    if cloud_mask:
        mask_c = xr.where(qa.isin(cloud), 1, mask_c)
    if ice_mask:
        mask_c = xr.where(qa.isin(ice), 2, mask_c)
    if ocean_mask:
        mask_c = xr.where(qa.isin(ocean), 3, mask_c)
    ls_scene.coords['mask'] = (('y', 'x'), mask_c.data)
    return ls_scene
##########################
def normalize(array):
    '''
    Normalize an xarray (optionally dask-backed) array so all values are between 0 and 1.
    '''
    array_min = array.min(skipna=True)
    array_max = array.max(skipna=True)
    return (array - array_min) / (array_max - array_min)
##########################
def search_stac(url, collection, gjson_outfile=None, bbox=None, timeRange=None, filename=None):
    """
    Search a STAC API for Landsat images based on either:
    - Bounding box and time range, or
    - Specific filename (STAC 'id').

    Parameters
    ----------
    url : str
        URL to the STAC API.
    collection : str
        Collection name (e.g., "landsat-c2l1").
    gjson_outfile : str or None
        Output file to save the search result as GeoJSON (optional).
    bbox : list or None
        Bounding box [west, south, east, north] (optional).
    timeRange : str or None
        Time range in ISO format, e.g., '2021-09-01/2023-03-31' (optional).
    filename : str or None
        Exact filename (product ID) to search for (optional).

    Returns
    -------
    item_collection : pystac.ItemCollection
        Collection of matching STAC items.
    """
    api = pystac_client.Client.open(url)
    if filename:
        # Search by filename (ID)
        search = api.search(
            collections=[collection],
            ids=[filename],
        )
        # print(f"Searching for filename: {filename}")
    elif bbox and timeRange:
        # Search by bbox and timeRange
        search = api.search(
            bbox=bbox,
            datetime=timeRange,
            collections=[collection],
        )
        # print(f"Searching for items in bbox {bbox} and timeRange {timeRange}")
    else:
        raise ValueError("Must provide either a filename, or both bbox and timeRange.")
    items = search.item_collection()
    # print(f"Found {len(items)} item(s)")
    if gjson_outfile:
        items.save_object(gjson_outfile)
    return items
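With these helpers, the masking step of the workflow might look like the following sketch (assuming ls_scene was loaded as in the cells above):
# Label clouds (1), ice (2), and ocean (3) in a 'mask' coordinate,
# then keep only clear-sky ocean pixels for the SST retrieval
ls_scene = stu.create_masks(ls_scene, cloud_mask=True, ice_mask=True, ocean_mask=True)
ocean_only = ls_scene.where(ls_scene.mask == 3)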