Using `load_ard` to load and cloud mask Landsat and Sentinel-2

Sign up to the DEA Sandbox to run this notebook interactively from a browser
Compatibility: Notebook currently compatible with both the NCI and DEA Sandbox environments
Products used: ga_ls5t_ard_3, ga_ls7e_ard_3, ga_ls8c_ard_3, ga_ls9c_ard_3, ga_s2am_ard_3, ga_s2bm_ard_3

Description

This notebook demonstrates how to use the load_ard function to import a time series of cloud-free observations from multiple Landsat (i.e. Landsat 5, 7, 8 and 9) or Sentinel-2 (i.e. Sentinel-2A and 2B) satellite products. The function can automatically apply pixel quality masking (e.g. cloud masking) to the input data and return all available data from multiple sensors as a single combined xarray.Dataset.

Optionally, the function can return only observations that contain a minimum proportion of good quality pixels (i.e. pixels free of cloud and cloud shadow). This can be used to extract visually appealing time series of observations that are not affected by cloud.

The function supports the following Digital Earth Australia products:

Geoscience Australia Landsat Collection 3:

ga_ls5t_ard_3, ga_ls7e_ard_3, ga_ls8c_ard_3, ga_ls9c_ard_3

Geoscience Australia Sentinel-2 Collection 3:

ga_s2am_ard_3, ga_s2bm_ard_3

This notebook demonstrates how to use load_ard to:

Load and combine Landsat 5, 7, 8 and 9 data into a single xarray.Dataset
Mask out clouds using the “Fmask” cloud mask
Filter resulting data to keep only cloud-free observations
Clean and dilate a cloud mask using morphological filtering
Discard Landsat 7 SLC-off failure data
Load and combine Sentinel-2A and Sentinel-2B data into a single xarray.Dataset
Mask out clouds using the “s2cloudless” cloud mask
Advanced: Filter data before loading using metadata and custom functions
Advanced: Lazily load data using Dask

Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

[1]:

import datacube
import matplotlib.pyplot as plt

import sys
sys.path.insert(1, '../Tools/')
from dea_tools.datahandling import load_ard
from dea_tools.plotting import rgb

Connect to the datacube

[2]:

dc = datacube.Datacube(app='Using_load_ard')

Cloud masking using `mask_pixel_quality`

By plotting a time slice from the data we loaded above, you can see an area of white pixels where clouds, shadows or invalid data have been masked out and set to NaN:

[5]:

# Plot single cloud-masked observation
rgb(ds, index=0)

../../../_images/notebooks_How_to_guides_Using_load_ard_12_0.png

By default, load_ard applies a pixel quality mask to loaded data using the fmask (Function of Mask) cloud mask (this can be changed using the cloud_mask parameter; see Cloud masking with s2cloudless below). The default mask is created from the fmask categories ['valid', 'snow', 'water'], which preserves land, snow and water pixels that are free of cloud and shadow, and sets all invalid, cloudy or shadowed pixels to NaN. This can be customised using the fmask_categories parameter. To deactivate cloud masking completely, set mask_pixel_quality=False:

[6]:

# Load available data with cloud masking deactivated
ds_cloudy = load_ard(dc=dc,
                     products=[
                         'ga_ls5t_ard_3', 'ga_ls7e_ard_3', 'ga_ls8c_ard_3',
                         'ga_ls9c_ard_3'
                     ],
                     measurements=['nbart_green', 'nbart_red', 'nbart_blue'],
                     mask_pixel_quality=False,
                     **query)

# Plot single observation
rgb(ds_cloudy, index=0)

Finding datasets
    ga_ls5t_ard_3
    ga_ls7e_ard_3
    ga_ls8c_ard_3
    ga_ls9c_ard_3
Loading 12 time steps

../../../_images/notebooks_How_to_guides_Using_load_ard_14_1.png
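To make the category-based masking described above more concrete, the following is a minimal sketch in plain Python (no datacube required). It is not the load_ard implementation. The numeric class codes in FMASK_CODES and the helper mask_by_category are assumptions made for illustration, so check the product metadata before relying on them.

```python
import math

# Assumed fmask class codes, for illustration only
FMASK_CODES = {"nodata": 0, "valid": 1, "cloud": 2, "shadow": 3, "snow": 4, "water": 5}


def mask_by_category(band, fmask, keep=("valid", "snow", "water")):
    """Return a copy of `band` with pixels outside the `keep` classes set to NaN."""
    keep_codes = {FMASK_CODES[name] for name in keep}
    return [
        [value if code in keep_codes else math.nan
         for value, code in zip(band_row, fmask_row)]
        for band_row, fmask_row in zip(band, fmask)
    ]


# Toy 2 x 3 red band and a matching fmask layer (hypothetical values)
red = [[500.0, 620.0, 4100.0], [480.0, 150.0, 90.0]]
fmask = [[1, 1, 2], [3, 5, 0]]

# Cloud, shadow and nodata pixels become NaN; valid and water pixels are kept
masked = mask_by_category(red, fmask)
```

Passing a different keep tuple to this toy helper mimics the effect of changing fmask_categories: any class left out of the tuple is set to NaN.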

In addition to masking out cloud, load_ard allows you to discard any satellite observation that contains less than a minimum proportion of good quality (e.g. non-cloudy) pixels. This can be used to obtain a time series of only clear, cloud-free observations.

To discard all observations with less than X% good quality pixels, use the min_gooddata parameter. For example, min_gooddata=0.90 will return only observations where at least 90% of pixels are good quality (i.e. no more than 10% of pixels contain cloud, cloud shadow or other invalid data), resulting in a smaller number of clear, cloud-free images being returned by the function:

[7]:

# Load available data filtered to 90% clear observations
ds_noclouds = load_ard(dc=dc,
                       products=[
                           'ga_ls5t_ard_3', 'ga_ls7e_ard_3', 'ga_ls8c_ard_3',
                           'ga_ls9c_ard_3'
                       ],
                       measurements=['nbart_green', 'nbart_red', 'nbart_blue'],
                       min_gooddata=0.90,
                       mask_pixel_quality=False,
                       **query)

# Plot single observation
rgb(ds_noclouds, index=0)

Finding datasets
    ga_ls5t_ard_3
    ga_ls7e_ard_3
    ga_ls8c_ard_3
    ga_ls9c_ard_3
Counting good quality pixels for each time step using fmask
Filtering to 1 out of 12 time steps with at least 90.0% good quality pixels
Loading 1 time steps

../../../_images/notebooks_How_to_guides_Using_load_ard_16_1.png
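The filtering step reported in the output above ("Filtering to 1 out of 12 time steps") can be sketched as follows. This is an illustrative, dependency-free approximation rather than the code load_ard runs. The helper names and the class codes treated as "good" (1, 4 and 5) are assumptions.

```python
def good_data_fraction(fmask, good_codes=(1, 4, 5)):
    """Fraction of pixels in one time step whose class code counts as good quality."""
    pixels = [code for row in fmask for code in row]
    return sum(code in good_codes for code in pixels) / len(pixels)


def filter_by_gooddata(fmask_stack, min_gooddata):
    """Return the indices of time steps with at least `min_gooddata` good pixels."""
    return [
        index
        for index, fmask in enumerate(fmask_stack)
        if good_data_fraction(fmask) >= min_gooddata
    ]


# Three hypothetical 2 x 2 time steps
fmask_stack = [
    [[1, 1], [2, 2]],  # 50% good
    [[1, 1], [1, 5]],  # 100% good
    [[2, 3], [0, 1]],  # 25% good
]

kept = filter_by_gooddata(fmask_stack, min_gooddata=0.90)  # -> [1]
```

Because the fraction is computed from the pixel quality layer alone, this kind of filter can be evaluated before any spectral bands are loaded, which is consistent with the order of messages in the output above.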

There are significant known limitations to the cloud masking algorithms employed by Sentinel-2 and Landsat. For example, bright objects like buildings and coastlines are commonly mistaken for cloud. We can improve on the false positives detected by Landsat and Sentinel-2’s pixel quality mask by applying binary morphological image processing techniques (e.g. binary_closing, binary_erosion etc.). The Open Data Cube library odc-algo has a function odc.algo.mask_cleanup that can perform a few of these operations. Below, we will try to improve the cloud mask by applying a number of these filters.

In this example, we use a “morphological opening” operation that first shrinks cloudy areas inward by five pixels, then expands any remaining pixels by five pixels. This operation is useful for removing small, isolated pixels (e.g. false positives caused by bright buildings or sandy coastlines) from our cloud mask, while still preserving the shape of larger clouds. Finally, we apply a “morphological dilation” operation to expand all cloudy areas by five pixels to mask out the thin edges of clouds. Choosing larger values for these parameters (e.g. 10 pixels instead of 5 pixels) will cause load_ard to more aggressively remove noisy pixels, or expand the edges of the cloud mask. Feel free to experiment with the values in filters to find settings that work for your application.

[8]:

# Set the filters to apply
filters = [("opening", 5), ("dilation", 5)]

# Load data
ds_filtered = load_ard(dc=dc,
                       products=[
                           "ga_ls5t_ard_3", "ga_ls7e_ard_3", "ga_ls8c_ard_3",
                           'ga_ls9c_ard_3'
                       ],
                       measurements=['nbart_green', 'nbart_red', 'nbart_blue'],
                       mask_filters=filters,
                       **query)

Finding datasets
    ga_ls5t_ard_3
    ga_ls7e_ard_3
    ga_ls8c_ard_3
    ga_ls9c_ard_3
Applying morphological filters to pixel quality mask: [('opening', 5), ('dilation', 5)]

/home/jovyan/Robbi/dea-notebooks/How_to_guides/../Tools/dea_tools/datahandling.py:487: UserWarning: As of `dea_tools` v0.3.0, pixel quality masks are inverted before being passed to `mask_filters` (i.e. so that good quality/clear pixels are False and poor quality pixels/clouds are True). This means that 'dilation' will now expand cloudy pixels, rather than shrink them as in previous versions.
  warnings.warn(

Applying fmask pixel quality/cloud mask
Loading 12 time steps

Below, you will notice that in the second image the cloud mask is cleaner and less noisy. Buildings in the urban area are removed from the cloud mask while true cloud is more cleanly removed from the dataset.

[9]:

# Plot the data
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
rgb(ds, index=0, ax=ax[0])
rgb(ds_filtered, index=0, ax=ax[1])
ax[0].set_title('Fmask without dilation filtering')
ax[1].set_title('Fmask with dilation filtering applied');

../../../_images/notebooks_How_to_guides_Using_load_ard_20_0.png
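For readers who want to see what "opening" and "dilation" do to a mask, here is a small pure-Python sketch. It uses a square window and ignores pixels beyond the image edge, both of which are simplifying assumptions. odc.algo.mask_cleanup may use a different kernel shape and edge handling, so treat this as a conceptual illustration only.

```python
def dilate(mask, radius):
    """Set a pixel True if any pixel within `radius` (square window) is True."""
    rows, cols = len(mask), len(mask[0])
    out = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = any(
                mask[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            )
    return out


def erode(mask, radius):
    """Keep a pixel True only if every pixel within `radius` is True."""
    inverted = [[not value for value in row] for row in mask]
    return [[not value for value in row] for row in dilate(inverted, radius)]


def opening(mask, radius):
    """Erosion followed by dilation: removes small specks, keeps large blobs."""
    return dilate(erode(mask, radius), radius)


# 7 x 7 toy cloud mask (True = cloud): one 3 x 3 cloud plus an isolated speck
cloud = [[False] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        cloud[r][c] = True
cloud[0][6] = True  # isolated false positive, e.g. a bright building

opened = opening(cloud, 1)    # speck removed, 3 x 3 cloud preserved
buffered = dilate(opened, 1)  # cloud edges expanded by one pixel
```

The sketch follows the convention described in the warning above, where cloudy pixels are True, so dilation grows the masked area.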

Discarding Landsat 7 SLC-off failure data

On May 31, 2003, Landsat 7’s Scan Line Corrector (SLC), which compensated for the satellite’s forward motion, failed, introducing linear data gaps in all subsequent Landsat 7 observations. For example, if we plot all our loaded data we can see that some Landsat 7 images contain visible striping:

[10]:

# Plot Landsat data
rgb(ds, col="time")

../../../_images/notebooks_How_to_guides_Using_load_ard_22_0.png

Although this data still contains valuable information, for some applications (e.g. generating clean composites from multiple images) it can be useful to discard Landsat 7 imagery acquired after the SLC failure. This data is known as “SLC-off” data.

This can be achieved using the ls7_slc_off parameter of load_ard. By default this is set to ls7_slc_off=True, which will include all SLC-off data. Set ls7_slc_off=False to discard this data instead; observe that the function now reports that it is ignoring SLC-off observations:

Finding datasets
    ga_ls5t_ard_3
    ga_ls7e_ard_3 (ignoring SLC-off observations)
    ga_ls8c_ard_3

[11]:

# Load available data after discarding Landsat 7 SLC-off data
ds_ls7 = load_ard(dc=dc,
                  products=[
                      'ga_ls5t_ard_3', 'ga_ls7e_ard_3', 'ga_ls8c_ard_3',
                      'ga_ls9c_ard_3'
                  ],
                  measurements=['nbart_green', 'nbart_red', 'nbart_blue'],
                  ls7_slc_off=False,
                  **query)

Finding datasets
    ga_ls5t_ard_3
    ga_ls7e_ard_3 (ignoring SLC-off observations)
    ga_ls8c_ard_3
    ga_ls9c_ard_3
Applying fmask pixel quality/cloud mask
Loading 6 time steps

If we plot our data now, we can see that all of the striped Landsat 7 scenes have disappeared:

[12]:

# Plot Landsat data
rgb(ds_ls7, col="time")

../../../_images/notebooks_How_to_guides_Using_load_ard_26_0.png
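Conceptually, discarding SLC-off data amounts to a date filter applied to Landsat 7 only. The sketch below illustrates the idea with hypothetical observation dates. How load_ard implements this internally is not shown in this notebook, and the choice to treat the failure date itself as SLC-off is an assumption.

```python
from datetime import date

# Date of the Scan Line Corrector failure, as stated above
SLC_FAILURE = date(2003, 5, 31)


def drop_slc_off(observations):
    """Drop Landsat 7 observations acquired on or after the SLC failure date."""
    return [
        (product, acquired)
        for product, acquired in observations
        if not (product == "ga_ls7e_ard_3" and acquired >= SLC_FAILURE)
    ]


# Hypothetical (product, acquisition date) pairs
observations = [
    ("ga_ls5t_ard_3", date(2003, 3, 1)),
    ("ga_ls7e_ard_3", date(2003, 5, 1)),   # SLC-on: kept
    ("ga_ls7e_ard_3", date(2003, 6, 15)),  # SLC-off: dropped
    ("ga_ls8c_ard_3", date(2014, 1, 1)),
    ("ga_ls7e_ard_3", date(2014, 1, 9)),   # SLC-off: dropped
]

kept = drop_slc_off(observations)
```

Observations from other sensors pass through unchanged regardless of date, which matches the output above where only ga_ls7e_ard_3 is flagged as ignoring SLC-off observations.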

Advanced

Lazy loading with Dask

Rather than loading data directly, which can take a long time and use large amounts of memory, all datacube data can be lazily loaded using Dask. This is a very useful approach when you need to load large amounts of data without crashing your analysis, or if you want to subsequently scale your analysis by distributing tasks in parallel across multiple workers.

The load_ard function can be easily adapted to lazily load data rather than loading it into memory by providing a dask_chunks parameter using either the explicit or query syntax. The minimum required to lazily load data is dask_chunks={}, but chunking can also be performed spatially (e.g. dask_chunks={'x': 2048, 'y': 2048}) or by time (e.g. dask_chunks={'time': 1}) depending on the analysis being conducted.

Note: For more information about using Dask, refer to the Parallel processing with Dask notebook.

[22]:

# Lazily load available Sentinel-2 data
ds_dask = load_ard(dc=dc,
                   products=['ga_s2am_ard_3', 'ga_s2bm_ard_3'],
                   measurements=['nbart_green', 'nbart_red', 'nbart_blue'],
                   dask_chunks={'x': 2048, 'y': 2048},
                   **query)

# Print output data
ds_dask

Finding datasets
    ga_s2am_ard_3
    ga_s2bm_ard_3
Applying fmask pixel quality/cloud mask
Returning 10 time steps as a dask array

[22]:

<xarray.Dataset>
Dimensions:      (time: 10, y: 229, x: 356)
Coordinates:
  * time         (time) datetime64[ns] 2021-06-28T00:06:32.739641 ... 2021-08...
  * y            (y) float64 -3.955e+06 -3.955e+06 ... -3.962e+06 -3.962e+06
  * x            (x) float64 1.544e+06 1.544e+06 ... 1.554e+06 1.554e+06
    spatial_ref  int32 3577
Data variables:
    nbart_green  (time, y, x) float32 dask.array<chunksize=(1, 229, 356), meta=np.ndarray>
    nbart_red    (time, y, x) float32 dask.array<chunksize=(1, 229, 356), meta=np.ndarray>
    nbart_blue   (time, y, x) float32 dask.array<chunksize=(1, 229, 356), meta=np.ndarray>
Attributes:
    crs:           EPSG:3577
    grid_mapping:  spatial_ref

Note that the data loads almost instantaneously, and that each of the arrays listed under Data variables is now described as a dask.array. If we inspect one of these dask.arrays, we can view a visualisation of how the data has been broken into small “chunks” of data that can be loaded in parallel:

[23]:

ds_dask.nbart_red

[23]:

<xarray.DataArray 'nbart_red' (time: 10, y: 229, x: 356)>
dask.array<to_float-2c2ce37b, shape=(10, 229, 356), dtype=float32, chunksize=(1, 229, 356), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 2021-06-28T00:06:32.739641 ... 2021-08...
  * y            (y) float64 -3.955e+06 -3.955e+06 ... -3.962e+06 -3.962e+06
  * x            (x) float64 1.544e+06 1.544e+06 ... 1.554e+06 1.554e+06
    spatial_ref  int32 3577
Attributes:
    units:         1
    nodata:        -999
    crs:           EPSG:3577
    grid_mapping:  spatial_ref

To load the data into memory, you can run:

[24]:

ds_dask.compute()
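The chunksize=(1, 229, 356) reported in the outputs above can be reproduced with some simple arithmetic: a chunk can never be larger than the array itself, so a requested 2048-pixel chunk is clipped to the 229 x 356 pixel extent. The sketch below works through this. The chunk_layout helper is hypothetical, and the time chunk of 1 is read off the printed output rather than set in the query.

```python
import math


def chunk_layout(shape, chunks):
    """Return the chunk shape and the number of chunks along each dimension."""
    chunk_shape = {
        dim: min(size, chunks.get(dim, size)) for dim, size in shape.items()
    }
    chunk_counts = {
        dim: math.ceil(size / chunk_shape[dim]) for dim, size in shape.items()
    }
    return chunk_shape, chunk_counts


# Array dimensions taken from the printed xarray.Dataset above
shape = {"time": 10, "y": 229, "x": 356}

# Requested spatial chunks, plus the one-step time chunk seen in the output
requested = {"time": 1, "y": 2048, "x": 2048}

chunk_shape, chunk_counts = chunk_layout(shape, requested)
total_chunks = math.prod(chunk_counts.values())
```

For this small area each time step fits in a single chunk, giving ten chunks in total. Spatial chunking only starts to split the data once the loaded extent exceeds the requested chunk size.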

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on GitHub.

Last modified: December 2023

Compatible datacube version:

[25]:

print(datacube.__version__)

1.8.6
