
Preprocessing data

%pip install -q intake-stac

Preprocessing

Once you have acquired your data, the next step is preprocessing—preparing the data for computing the retrieval and generating the final data product. Preprocessing will include one or more of the following tasks:

  • Cleaning
  • Unit conversion
  • Reprojection
  • Regridding
  • Normalization

Each of these steps helps standardize the data so it can be combined, compared, or analyzed reliably.


🔧 Preprocessing Steps

🔹 Cleaning

Data cleaning is required when the dataset contains missing values, outliers, or artifacts that could bias the analysis. This might include filtering NaNs, removing physically impossible values, or masking bad pixels.

To clean the data, you typically:

  • Identify invalid or missing values.
  • Mask or remove unreliable data.
  • Optionally, interpolate or fill gaps as needed.

📚 Tutorial: Data Cleaning with Pandas and NumPy (RealPython)
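
For instance, a minimal sketch of these steps using NumPy and xarray (the array and variable names are illustrative, not part of the cookbook):

import numpy as np
import xarray as xr

# Illustrative SST series with a gap and a physically impossible value
sst = xr.DataArray([271.3, np.nan, 9999.0, 272.1], dims='x')

sst = sst.where(sst < 400)           # mask impossible values (they become NaN)
sst = sst.interpolate_na(dim='x')    # optionally fill gaps by interpolation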

🔹 Unit Conversion

Unit conversion is needed when datasets use different physical units (e.g., Kelvin vs. Celsius, W/m² vs. mW/cm²) or when preparing inputs for physical equations that require standardized units.

This may also require integration over time or space (e.g., converting a flux in W/m² to an energy by integrating over time) so that units match across datasets.

📚 Pint Documentation – Units in Python
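
As a brief sketch with Pint (the quantities are made up for illustration):

import pint

ureg = pint.UnitRegistry()

bt = 285.4 * ureg.kelvin
print(bt.to('degC'))              # Kelvin -> Celsius

flux = 350.0 * ureg.watt / ureg.meter**2
print(flux.to('mW / cm**2'))      # W/m² -> mW/cm²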

🔹 Reprojection

Reprojection is required when datasets are provided in different coordinate reference systems (CRS). Working with mismatched projections can lead to spatial misalignment—features may not overlap or align correctly.

To reproject data:

  • Determine the CRS of each dataset.
  • Use geospatial tools to transform to a common projection.

📚 Pyproj Documentation
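
For example, a minimal sketch with Pyproj, transforming one corner of a WGS84 bounding box into Antarctic Polar Stereographic (the same CRS pair used later in this notebook):

from pyproj import Transformer

# WGS84 lon/lat -> Antarctic Polar Stereographic (EPSG:3031)
transformer = Transformer.from_crs('epsg:4326', 'epsg:3031', always_xy=True)
x, y = transformer.transform(-103.0, -73.5)   # lon, lat of one bbox corner
print(x, y)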

🔹 Regridding

Regridding is used when datasets have different spatial resolutions or grid layouts and need to be brought onto a common grid. For example, satellite data may be on a swath-based grid while model output is on a regular latitude-longitude grid.

This step ensures datasets are co-located in space and is critical for any pixel-wise comparison or combination.

📚 xESMF for Regridding
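
A minimal sketch with xESMF, regridding a toy coarse field onto a finer latitude-longitude grid (both grids and the variable are made up for illustration):

import numpy as np
import xarray as xr
import xesmf as xe

# Toy source field on a coarse lat/lon grid
ds_in = xr.Dataset(
    {'wv': (['lat', 'lon'], np.random.rand(10, 20))},
    coords={'lat': np.linspace(-75, -73, 10), 'lon': np.linspace(-113, -101, 20)},
)

# Finer target grid covering the same region
ds_out = xr.Dataset(
    coords={'lat': np.linspace(-75, -73, 40), 'lon': np.linspace(-113, -101, 80)},
)

regridder = xe.Regridder(ds_in, ds_out, 'bilinear')
ds_fine = regridder(ds_in)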

🔹 Normalization

Normalization rescales data onto a consistent numerical scale. This is especially important when combining variables with different units or orders of magnitude as inputs to a model (e.g., temperature vs. elevation vs. reflectance).

For example, normalizing input features before passing them into a machine learning model helps ensure each variable contributes proportionally.

📚 scikit-learn Preprocessing: Normalization and Scaling
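
A short sketch with scikit-learn (the feature values are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns on very different scales, e.g. temperature (K) and elevation (m)
X = np.array([[271.2,  15.0],
              [274.8, 120.0],
              [272.5,  60.0]])

X_scaled = MinMaxScaler().fit_transform(X)   # each column rescaled to [0, 1]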


In the Sea Surface Temperature (SST) workflow demonstrated in this cookbook, we will use all of these preprocessing steps.

Read in Landsat thermal data

Let’s begin by reading in the data we acquired previously in the Data Access notebook.

We first define all paths and parameters

cd /home/jovyan/landsatproduct-cookbook/
%matplotlib widget

%load_ext autoreload
%autoreload 2

from pathlib import Path
import os
import numpy as np
import boto3
from rasterio.session import AWSSession
import earthaccess
import intake
import xarray as xr
from shapely.geometry.polygon import Polygon
import geopandas as gpd

import SSTutils as stu

import warnings
warnings.filterwarnings('ignore')
# Define the landsat STAC catalog location
url = 'https://landsatlook.usgs.gov/stac-server'

# For atm correction
basepath = Path('/home/jovyan/Landsat_SST_algorithm')
lsatpath = basepath / 'Data'
atmpath = lsatpath / 'AtmCorrection'
modout_path = lsatpath / 'MOD07_L2'

WV = 'Water_Vapor'

# For search and tile plot for Landsat
satellite = 'Landsat8'
collection = 'landsat-c2l1' # Landsat Collection 2, Level 1 - includes L8 and L9
colnm = ['landsat:wrs_path','landsat:wrs_row']
gjson_outfile = lsatpath / f'{satellite}.geojson'

# For scene search and plot

interp = 1

region = 'Cosgrove'

if region=='Cosgrove':
    SSTpath = lsatpath / 'SST/MODcalib/Cosgrove/'
    bbox = (-103.0, -73.5, -102.0, -73.42) # LatboundsC from LandsatCalibration20201031
    # Cosgrove full run 
    timeRange = '2021-09-01/2023-04-07'
    # [0:1] Cosgrove bbox
    # timeRange = '2022-11-01/2023-03-27'
elif region=='DotsonPolynya':
    SSTpath = lsatpath / 'SST/MODcalib/DotsonPolynya/'
    bbox = (-113, -73.9, -111.5, -73.59) # Dotson polynya
    # Dotson full run
    timeRange = '2021-09-01/2023-05-31'
    # [0:1] Dotson bbox
    # timeRange = '2022-11-01/2023-03-27'
elif region=='PineIslandPlume':
    SSTpath = lsatpath / 'SST/UncalibratedSST/PineIslandPlume/'
    bbox = (-101.98,-75.09,-101.65,-75.05) # PIG plume for analysis - 2014
    # bbox = (-101.88,-75.23,-100.35,-74.76) # PIG ice front for analysis
    # bbox = (-101.8,-75.23,-100.50,-74.80) # PIG 2019?
    # PIG full run - NOT narrowed down yet
    timeRange = '2021-09-01/2023-04-07'
elif region=='DotsonIntercomp':
    SSTpath = lsatpath / 'SST/Validation/DotsonIntercomp/'
    bbox = (-113.5,-74.20,-113.17,-74.11) # Dotson plume for analysis
    # Dotson intercomp run
    timeRange = '2021-09-01/2023-03-31'
elif region=='Burke':
    SSTpath = lsatpath / 'SST/MODcalib/Burke/'
    bbox = (-104.2,-73.81, -103.8, -73.42) # Outside Cosgrove south of Burke  
    # Burke full run
    timeRange = '2021-09-01/2023-04-06'

We set up authentication for accessing all data

# Authenticate for boto S3 access, etc.
os.environ["AWS_REQUEST_PAYER"] = "requester"
aws_session = AWSSession(boto3.Session(), requester_pays=True)

# Setup and authenticate dask
from dask.distributed import Client
import logging
client = Client(processes=True, n_workers=4, 
                threads_per_worker=1,
                silence_logs=logging.ERROR)
client.run(lambda: os.environ["AWS_REQUEST_PAYER"] == "requester")  # verify the setting reached the workers
client

# Authenticate for accessing NASA data (MODIS)
auth = earthaccess.login(strategy="interactive")
# Search for desired Landsat scenes
items = stu.search_stac(url, collection, gjson_outfile=gjson_outfile, bbox=bbox, timeRange=timeRange)

# Open stac catalog for some needed info
catalog = intake.open_stac_item_collection(items)

# Load the geojson file
gf = gpd.read_file(gjson_outfile)
# Plot tiles of all scenes found
stu.plot_search(gf,satellite,colnm)

Note that the bands to include are selected by passing the bandNames parameter to landsat_to_xarray, using the following codes:

‘coastal’, ‘blue’, ‘green’, ‘red’, ‘nir08’, ‘swir16’, ‘swir22’, ‘pan’, ‘cirrus’, ‘lwir11’, ‘lwir12’, ‘qa_pixel’
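
For example, to load only the thermal and QA bands (a sketch that assumes landsat_to_xarray accepts the bandNames keyword described above, using the sceneid and catalog objects defined below):

# Hypothetical call restricting the scene to the bands needed for SST
ls_scene = stu.landsat_to_xarray(sceneid, catalog, bandNames=['lwir11', 'qa_pixel'])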

Process Landsat scenes to acquire sea surface temperature

# Convert the bounding box to polar stereographic coordinates to check
# whether the Landsat scene has any data inside it
source_crs = 'epsg:4326' # WGS84 lat/lon
target_crs = 'epsg:3031' # Antarctic Polar Stereographic, the CRS of the file

sbox,checkbox = stu.lsat_reproj(source_crs,target_crs,(bbox[0],bbox[1],bbox[2],bbox[3]))

# Create polygon for later cropping
polygon = Polygon([(sbox[0][0],sbox[0][1]),(sbox[3][0],sbox[3][1]),(sbox[2][0],sbox[2][1]),(sbox[1][0],sbox[1][1])])

# Create min/max boundaries for trimming image before crop_xarray to cut down on processing times
minx, miny, maxx, maxy = polygon.bounds
polarx = [minx, maxx]
polary = [miny, maxy]
# Include only Landsat 8 scenes (IDs look like 'LC08_...', so index 3 holds the satellite number)
catalog_list = [x for x in items if x.id[3]=='8']

sceneid = catalog_list[0]
print(sceneid.id)
    
scene = catalog[sceneid.id]
timestr = scene.metadata['datetime'].strftime('%H%M%S')

outFile = f'{SSTpath}/{sceneid.id}_{timestr}_Cel.tif'
# Open all desired bands for one scene
ls_scene = stu.landsat_to_xarray(sceneid,catalog)
ls_scene = ls_scene.rio.write_crs("epsg:3031", inplace=True)

Masking Unwanted Pixels in Landsat Thermal Imagery

In our Landsat SST algorithm, the first preprocessing step is to ensure that we only process ocean pixels.

Why?
Thermal infrared measurements are highly sensitive to atmospheric effects, particularly water vapor, and cannot provide accurate surface temperature if clouds are present.
Additionally, we don’t want SST from land or ice pixels.

This means our first preprocessing task is masking—identifying and excluding pixels that shouldn’t be processed.


Sources for Pixel Classification

Landsat imagery includes a qa_pixel band with bit flags that encode surface classification for each pixel.

We can:

  1. Use the qa_pixel band to mask unwanted pixels (our approach here).
  2. Replace or augment with a machine learning classifier (e.g., neural network) for more accurate cloud detection.

Note: The standard Landsat cloud classification is not well-suited to detecting some cloud types; ML-based classifiers often outperform it under those conditions.
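
To make the bit flags concrete, here is a small sketch decoding one QA value with bitwise operations (bit positions follow the Landsat Collection 2 QA_PIXEL definition):

qa_value = 0b0000000011000000           # example code: bit 6 (clear) and bit 7 (water) set

is_clear  = bool(qa_value & (1 << 6))   # True
is_water  = bool(qa_value & (1 << 7))   # True
is_cirrus = bool(qa_value & (1 << 2))   # False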


Step 1 – Inspect the QA Band

Let’s first inspect what’s inside the QA band:

qa = ls_scene.sel(band='qa_pixel').astype('uint16')
unique_values, counts = np.unique(qa, return_counts=True)
print("Unique QA codes:", unique_values)
# Imports needed by the helper functions below (from SSTutils.py)
import numpy as np
import xarray as xr
import pystac_client

def create_masks(ls_scene, cloud_mask=True, ice_mask=False, ocean_mask=False):
    """
    Creates cloud, ice, and ocean masks from a Landsat scene QA band. By default, 
    clouds are labeled as 1, ice as 2, ocean as 3, and all other pixels are NaN.

    Parameters
    ----------
    ls_scene : xarray.DataArray
        A Landsat scene loaded with a 'qa_pixel' band (as created by `landsat_to_xarray`).
    cloud_mask : bool, optional
        Whether to generate the cloud mask. Default is True.
    ice_mask : bool, optional
        Whether to generate the ice mask. Default is False.
    ocean_mask : bool, optional
        Whether to generate the ocean mask. Default is False.

    Returns
    -------
    xarray.DataArray
        The same input xarray object, but with an added `"mask"` coordinate. 
        In that mask, cloud pixels are assigned 1, ice pixels 2, ocean pixels 3, 
        and everything else is set to NaN.
    """
    
    cloud = []
    ocean = []
    ice = []

    qa = ls_scene.sel(band='qa_pixel').astype('uint16')

    n,c = np.unique(qa, return_counts=True)

    for j in range(len(n)):
        longform = f'{n[j]:016b}'
        if (longform[-7]=='0')|(longform[-3]=='1'): # bit 6 (clear) is 0, or bit 2 (cirrus) is 1
            cloud.append(n[j])
        if longform[-8:]=='11000000': # bits 7 (water) and 6 (clear) set, all lower bits 0
            ocean.append(n[j])
        if longform[-7:]=='1100000': # bits 6 (clear) and 5 (snow/ice) set, all lower bits 0
            ice.append(n[j])

    # QA values 0 and 1 are fill/no-data, not cloud
    if 0 in cloud:
        cloud.remove(0)
    if 1 in cloud:
        cloud.remove(1)

    # Build the mask: cloud=1, ice=2, ocean=3, everything else NaN.
    # Start from an all-NaN mask so the ice/ocean branches also work when
    # cloud_mask is False.
    mask_c = xr.full_like(qa, np.nan, dtype='float64')

    if cloud_mask==True:
        mask_c = xr.where(qa.isin(cloud), 1, mask_c)

    if ice_mask==True:
        mask_c = xr.where(qa.isin(ice), 2, mask_c)

    if ocean_mask==True:
        mask_c = xr.where(qa.isin(ocean), 3, mask_c)

    ls_scene.coords['mask'] = (('y', 'x'), mask_c.data)
        
    return ls_scene
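# Usage sketch (illustrative, not part of SSTutils): label clouds, ice, and
# ocean, then keep only the ocean pixels. Labels follow the docstring above:
# cloud=1, ice=2, ocean=3.
#   scene = create_masks(scene, cloud_mask=True, ice_mask=True, ocean_mask=True)
#   ocean_only = scene.where(scene['mask'] == 3)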

##########################

def normalize(array):
    '''
    Normalize an xarray DataArray so all values fall between 0 and 1.
    '''
    array_min = array.min(skipna=True)
    array_max = array.max(skipna=True)
    return (array - array_min) / (array_max - array_min)

##########################

def search_stac(url, collection, gjson_outfile=None, bbox=None, timeRange=None, filename=None):
    """
    Search a STAC API for Landsat images based on either:
    - Bounding box and time range, or
    - Specific filename (STAC 'id').

    Parameters:
    -----------
    url : str
        URL to the STAC API.
    collection : str
        Collection name (e.g., "landsat-c2l1").
    gjson_outfile : str or None
        Output file to save the search result as GeoJSON (optional).
    bbox : list or None
        Bounding box [west, south, east, north] (optional).
    timeRange : str or None
        Time range in ISO format, e.g., '2021-09-01/2023-03-31' (optional).
    filename : str or None
        Exact filename (product ID) to search for (optional).

    Returns:
    --------
    item_collection : pystac.ItemCollection
        Collection of matching STAC items.
    """
    
    api = pystac_client.Client.open(url)

    if filename:
        # Search by filename (ID)
        search = api.search(
            collections=[collection],
            ids=[filename],
        )
        # print(f"Searching for filename: {filename}")
    
    elif bbox and timeRange:
        # Search by bbox and timeRange
        search = api.search(
            bbox=bbox,
            datetime=timeRange,
            collections=[collection],
        )
        # print(f"Searching for items in bbox {bbox} and timeRange {timeRange}")
    
    else:
        raise ValueError("Must provide either a filename, or both bbox and timeRange.")

    items = search.item_collection()

    # print(f"Found {len(items)} item(s)")

    if gjson_outfile:
        items.save_object(gjson_outfile)
    
    return items