Extracting training data from the ODC

Background

Training data is the most important part of any supervised machine learning workflow. The quality of the training data has a greater impact on the classification than the algorithm used. Large and accurate training data sets are preferable: increasing the training sample size results in increased classification accuracy (Maxwell et al. 2018). A review of training data methods in the context of Earth Observation is available here

When creating training labels, be sure to capture the spectral variability of the class, and to use imagery from the time period you want to classify (rather than relying on basemap composites). Another common problem with training data is class imbalance. This can occur when one of your classes is relatively rare and therefore comprises a smaller proportion of the training set. When imbalanced data is used, the final classification will commonly under-predict less abundant classes relative to their true proportions.
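
A quick way to check for class imbalance before training is to look at the proportion of samples in each class. A minimal sketch using pandas with hypothetical crop/non-crop labels (your column name and values may differ):

```python
import pandas as pd

# Hypothetical class labels: 1 = crop, 0 = non-crop
labels = pd.Series([1, 1, 1, 1, 0], name="class")

# Proportion of samples in each class reveals any imbalance
proportions = labels.value_counts(normalize=True)
print(proportions)
```

Here class 1 makes up 80% of the samples, so a model trained on this set may under-predict class 0.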

There are many platforms for gathering training labels; the best one to use depends on your application. GIS platforms are great for collecting training data as they are highly flexible and mature; Geo-Wiki and Collect Earth Online are two open-source websites that may also be useful depending on the reference data strategy employed. Alternatively, there are many pre-existing training datasets on the web that may be useful, e.g. Radiant Earth manages a growing number of reference datasets for use by anyone.

Description

This notebook will extract training data (feature layers, in machine learning parlance) from the open-data-cube using labelled geometries within a geojson. The default example will use the crop/non-crop labels within the 'data/crop_training_WA.geojson' file. This reference data was acquired and pre-processed from the USGS’s Global Food Security Analysis Data portal here.

To do this, we rely on the custom dea-tools function dea_tools.classification.collect_training_data. The principal goal of this notebook is to extract a set of useful feature layers for generating a cropland mask for WA.

  1. Preview the polygons in our training data by plotting them on a basemap

  2. Define a feature layer function to pass to collect_training_data

  3. Extract training data from the datacube using collect_training_data

  4. Export the training data to disk for use in subsequent notebooks.

Note: You can find a more detailed description of the collect_training_data function and its attributes in the Collect_training_and_validation_data notebook.


Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

[1]:
%matplotlib inline

import datacube
import xarray as xr
import geopandas as gpd
from odc.geo.xr import assign_crs

import sys
sys.path.insert(1, '../../Tools/')
from dea_tools.bandindices import calculate_indices
from dea_tools.classification import collect_training_data

Analysis parameters

  • path: The path to the input vector file from which we will extract training data. A default geojson is provided.

  • field: The name of the column in your vector file's attribute table that contains the class labels. The class labels must be integers.

  • ncpus: The number of CPU cores to use when collecting training data in parallel

[2]:
path = 'data/crop_training_WA.geojson'
field = 'class'
ncpus = 2

Preview input data

We can load and preview our input vector file using geopandas. The file should contain a column with class labels (e.g. ‘class’). These labels will be used to train our model.

Remember, the class labels must be represented by integers.
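
A sanity check that the labels are indeed integers can save a failed run later. A sketch using pandas with a hypothetical attribute table (the real data uses a column named 'class'):

```python
import pandas as pd

# Hypothetical attribute table with a 'class' column of integer labels
table = pd.DataFrame({"class": [1, 1, 0, 0, 1]})

# Verify the label column has an integer dtype before training
is_valid = pd.api.types.is_integer_dtype(table["class"])
print(is_valid)  # True
```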

[3]:
# Load input data vector file
input_data = gpd.read_file(path)

# Print data
input_data.head(5)
[3]:
class geometry
0 1 POINT (116.60407 -31.46883)
1 1 POINT (117.03464 -32.4083)
2 1 POINT (117.30838 -32.33747)
3 1 POINT (116.74607 -31.6375)
4 1 POINT (116.85817 -33.00131)

Our training point dataset has 430 points. For this demo, we will take a small sample of these points (150 points) to improve run times. Once you have run this notebook once, feel free to revisit this step and run the analysis on the full 430 point dataset.

[4]:
# Sample 150 points randomly
input_data_sample = input_data.sample(150).reset_index(drop=True)

# Plot training data in an interactive map
input_data_sample.explore(column=field)
[4]:
(Interactive map showing the sampled training points.)

Extracting training data

The function collect_training_data takes our geojson containing class labels and extracts training data (features) from the datacube over the locations specified by the input geometries. The function will also pre-process our training data by stacking the arrays into a useful format and removing any NaN or inf values.
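
The NaN/inf cleaning step can be illustrated with numpy (a conceptual sketch, not the function's internal code):

```python
import numpy as np

# Three training samples with two features each; two contain bad values
data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [4.0, np.inf]])

# Keep only rows in which every feature value is finite
clean = data[np.isfinite(data).all(axis=1)]
print(clean.shape)  # (1, 2)
```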

The below variables can be set within the collect_training_data function:

  • zonal_stats : An optional string giving the names of zonal statistics to calculate across each polygon (if the geometries in the vector file are polygons and not points). Default is None (all pixel values are returned). Supported values are ‘mean’, ‘median’, ‘max’, and ‘min’.
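
For polygon geometries, a zonal statistic reduces all pixel values within each polygon to a single number per band. Conceptually (a numpy sketch with made-up pixel values, not the function's internal implementation):

```python
import numpy as np

# Hypothetical pixel values extracted for one polygon (one band)
pixels = np.array([0.25, 0.5, 0.75, 1.0])

# zonal_stats='mean' would reduce these to a single feature value
print(pixels.mean())  # 0.625
```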

In addition to the zonal_stats parameter, we also need to set up a query dictionary for the Open Data Cube, specifying for example:

  • measurements: the satellite bands to load

  • resolution: the pixel size

  • output_crs: the output coordinate reference system

These options will be added to a query dictionary that will be passed into collect_training_data using the parameter collect_training_data(dc_query=query, ...). The query dictionary will be the only argument in the feature layer function which we will define and describe in a moment.

Note: collect_training_data also has a number of additional parameters that allow it to do the following:

  • In the instance where ncpus > 1, the function will automatically run in parallel.

  • Individual points/polygons can be loaded from different time ranges by passing both time_field and time_delta variables, resulting in a time range calculated as time_field +- time_delta.

  • A retry queue is implemented for samples that may fail due to I/O limitations or S3 read failures during heavy multiprocessing.

Check out the docs to learn more.
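
The per-sample time range works conceptually like this (a pandas sketch of the time_field +- time_delta arithmetic with a made-up date; the actual parsing is handled inside collect_training_data):

```python
import pandas as pd

# Hypothetical observation date attached to one training sample
sample_date = pd.Timestamp("2014-06-15")
delta = pd.Timedelta(days=30)

# Each sample would be loaded over its own window around that date
time_range = (sample_date - delta, sample_date + delta)
print(time_range[0].date(), time_range[1].date())  # 2014-05-16 2014-07-15
```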

[5]:
# Set up our inputs to collect_training_data
zonal_stats = None

# Set up the inputs for the ODC query
time = '2014'
resolution = (-30, 30)
output_crs = 'epsg:3577'
[6]:
# Generate a new datacube query object
query = {
    'time': time,
    'resolution': resolution,
    'output_crs': output_crs,
}

Defining feature layers

To create the desired feature layers, we pass instructions to collect training data through the feature_func parameter.

  • feature_func: A function for generating feature layers that is applied to the data within the bounds of the input geometry. The ‘feature_func’ must accept a ‘dc_query’ object, and return a single xarray.Dataset or xarray.DataArray containing 2D coordinates (i.e x, y - no time dimension). e.g.

    def feature_function(query):
        dc = datacube.Datacube(app='feature_layers')
        ds = dc.load(**query)
        ds = ds.mean('time')
        return ds
    

Below, we will define a more complicated feature layer function than the brief example shown above. We will load satellite bands and the ternary Median Absolute Deviation (MAD) layers from the annual Landsat geomedian product (ga_ls8cls9c_gm_cyear_3), calculate an additional band index (NDVI), and finally append fractional cover percentiles for the photosynthetic vegetation band from the same year (ga_ls_fc_pc_cyear_3).
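
NDVI, the index added via calculate_indices below, is defined as (NIR - Red) / (NIR + Red). A minimal numpy sketch with made-up reflectance values:

```python
import numpy as np

# Hypothetical surface reflectance values for one pixel
nir = np.float64(0.4)
red = np.float64(0.1)

# NDVI: high for green vegetation, near zero for bare soil
ndvi = (nir - red) / (nir + red)
print(round(float(ndvi), 2))  # 0.6
```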

[7]:
def feature_layers(query):

    # Connect to the datacube
    dc = datacube.Datacube(app='custom_feature_layers')

    # Load ls8 geomedian
    ds = dc.load(product='ga_ls8cls9c_gm_cyear_3',
                 **query)

    # Calculate some band indices
    da = calculate_indices(ds,
                           index=['NDVI'],
                           drop=False,
                           collection='ga_ls_3')

    # Add Fractional cover percentiles
    fc = dc.load(product='ga_ls_fc_pc_cyear_3',
                 measurements=['pv_pc_10', 'pv_pc_50', 'pv_pc_90'], # only the PV band
                 like=ds, # will match geomedian extent
                 time=query.get('time'),  # use time if in query
                 dask_chunks=query.get('dask_chunks')  # use dask if in query
                )

    # Merge results into single dataset
    result = xr.merge([da, fc], compat='override')

    return result

Now, we can pass this function to collect_training_data. This will take a few minutes to run on the default sandbox as it only has two CPUs.

[8]:
df = collect_training_data(
    gdf=input_data_sample,
    dc_query=query,
    ncpus=ncpus,
    return_coords=False,
    field=field,
    zonal_stats=zonal_stats,
    clean=True,
    feature_func=feature_layers
)
Collecting training data in parallel mode
Percentage of possible fails after run 1 = 0.0 %
Removed 0 rows with NaNs or Infs
Output shape: (150, 15)
[9]:
df.head(4)
[9]:
nbart_blue nbart_green nbart_red nbart_nir nbart_swir_1 nbart_swir_2 sdev edev bcdev count NDVI pv_pc_10 pv_pc_50 pv_pc_90
class
1.0 824.0 1231.0 1615.0 3164.0 3395.0 2160.0 0.005129 1130.387451 0.097714 22.0 0.324126 11.0 16.0 81.0
1.0 818.0 1193.0 1591.0 2588.0 3769.0 2608.0 0.001464 1478.450562 0.123940 14.0 0.238574 4.0 7.0 58.0
0.0 550.0 909.0 1347.0 2241.0 2888.0 2130.0 0.000216 142.548538 0.016247 17.0 0.249164 13.0 17.0 25.0
0.0 426.0 734.0 1234.0 2189.0 2817.0 1989.0 0.000279 275.391174 0.030092 13.0 0.278995 11.0 20.0 23.0

Export training data

Once we’ve collected all the training data we require, we can write the data to disk. This will allow us to import the data in the next step(s) of the workflow.
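
The exported CSV can be read back with pandas in a later notebook. A round-trip sketch using a temporary file and a small stand-in DataFrame (the real output path is results/test_training_data.csv):

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the collected training DataFrame
df = pd.DataFrame({"class": [1, 0], "NDVI": [0.61, 0.24]})

# Write to CSV and read it back, as a subsequent notebook would
path = os.path.join(tempfile.mkdtemp(), "training_data.csv")
df.to_csv(path, index=False)
reloaded = pd.read_csv(path)
print(reloaded.shape)  # (2, 2)
```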

[10]:
# Set the name and location of the output file
output_file = "results/test_training_data.csv"

# Export files to disk
df.to_csv(output_file)

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Discord chat or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on GitHub.

Last modified: Feb 2026

Compatible datacube version:

[11]:
print(datacube.__version__)
1.9.10

Tags

Landsat 8 geomedian, Landsat 8 TMAD, machine learning, collect_training_data, Fractional Cover