dea_tools.classification

Machine learning classification tools for analysing remote sensing data using the Open Data Cube.

License: The code in this notebook is licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0). Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).

Contact: If you need assistance, please post a question on the Open Data Cube Discord chat (https://discord.com/invite/4hhBQVas5U) or on the GIS Stack Exchange (https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the open-data-cube tag (you can view previously asked questions here: https://gis.stackexchange.com/questions/tagged/open-data-cube).

If you would like to report an issue with this script, you can file one on GitHub (GeoscienceAustralia/dea-notebooks#new).

Last modified: February 2026

Functions

SKCV(coordinates, n_splits, cluster_method, ...)

Generate spatial k-fold cross validation indices using coordinate data.

collect_training_data(gdf, dc_query[, ...])

This function provides methods for gathering training/validation data from the ODC over geometries stored within a geopandas geodataframe.

fit_xr(model, input_xr)

Utilise our wrappers to fit a vanilla sklearn model.

predict_xr(model, input_xr[, chunk_size, ...])

Using dask-ml ParallelPostfit(), runs the parallel predict and predict_proba methods of sklearn estimators.

sklearn_flatten(input_xr)

Reshape a DataArray or Dataset with spatial (and optionally temporal) structure into an np.array with the spatial and temporal dimensions flattened into one dimension.

sklearn_unflatten(output_np, input_xr)

Reshape a numpy array with no 'missing' elements (NaNs) and 'flattened' spatiotemporal structure into a DataArray matching the spatiotemporal structure of input_xr.

spatial_clusters(coordinates[, method, ...])

Create spatial groups on coordinate data using KMeans clustering, a Gaussian Mixture model, or hierarchical clustering.

spatial_train_test_split(X, y, coordinates, ...)

Split arrays into random train and test subsets.

Classes

HiddenPrints()

For concealing unwanted print statements called by other functions

KMeans_tree(*args, **kwargs)

A hierarchical KMeans unsupervised clustering model.

class dea_tools.classification.HiddenPrints[source]

For concealing unwanted print statements called by other functions

class dea_tools.classification.KMeans_tree(*args: Any, **kwargs: Any)[source]

A hierarchical KMeans unsupervised clustering model. This class is a clustering model, so it inherits scikit-learn’s ClusterMixin base class.

Parameters:
  • n_levels (integer, default 2) – number of levels in the tree of clustering models.

  • n_clusters (integer, default 3) – Number of clusters in each of the constituent KMeans models in the tree.

  • **kwargs (optional) – Other keyword arguments to be passed directly to the KMeans initialiser.

fit(X, y=None, sample_weight=None)[source]

Fit the tree of KMeans models. All parameters mimic those of KMeans.fit().

Parameters:
  • X (array-like or sparse matrix, shape=(n_samples, n_features)) – Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.

  • y (Ignored) – not used, present here for API consistency by convention.

  • sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None)

predict(X, sample_weight=None)[source]

Send X through the KMeans tree and predict the resultant cluster. Compatible with KMeans.predict().

Parameters:
  • X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – New data to predict.

  • sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None)

Returns:

labels – Index of the cluster each sample belongs to.

Return type:

array, shape [n_samples,]

dea_tools.classification.SKCV(coordinates, n_splits, cluster_method, kfold_method, test_size, balance, n_groups=None, max_distance=None, train_size=None, random_state=None, **kwargs)[source]

Generate spatial k-fold cross validation indices using coordinate data. This function wraps the ‘SpatialShuffleSplit’ and ‘SpatialKFold’ classes. These classes ingest coordinate data in the form of an np.array([[eastings, northings]]) and assign samples to a spatial cluster using either a KMeans, Gaussian Mixture, or Agglomerative Clustering algorithm. This cross-validator is preferred over other sklearn.model_selection methods for spatial data because the spatial autocorrelation usually present in this type of data can otherwise lead to overestimated cross-validation scores.

Last modified: Dec 2020

Parameters:
  • coordinates (np.array) – A numpy array of coordinate values e.g. np.array([[3337270., 262400.], [3441390., -273060.], …])

  • n_splits (int) – The number of test-train cross validation splits to generate.

  • cluster_method (str) – Which algorithm to use to separate data points. Either ‘KMeans’, ‘GMM’, or ‘Hierarchical’

  • kfold_method (str) – One of either ‘SpatialShuffleSplit’ or ‘SpatialKFold’. See the docs under the _SpatialShuffleSplit and _SpatialKFold classes for more information on these options.

  • test_size (float, int, None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.15.

  • balance (int or bool) –

    If setting kfold_method to ‘SpatialShuffleSplit’: int

    The number of splits generated per iteration to try to balance the amount of data in each set so that test_size and train_size are respected. If 1, then no extra splits are generated (essentially disabling the balancing). Must be >= 1.

    If setting kfold_method to ‘SpatialKFold’: bool

    Whether or not to split clusters into folds with approximately equal numbers of data points. If False, each fold will have the same number of clusters (which can have different numbers of data points in them).

  • n_groups (int) – The number of groups to create. This is passed as ‘n_clusters=n_groups’ for the KMeans algo, and ‘n_components=n_groups’ for the GMM. If using cluster_method=’Hierarchical’ then this parameter is ignored.

  • max_distance (int) – If cluster_method is set to ‘Hierarchical’, this is the maximum Euclidean distance between all observations in a cluster. ‘n_groups’ is ignored in this case.

  • train_size (float, int, or None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • **kwargs (optional) – Additional keyword arguments to pass to sklearn.cluster.KMeans or sklearn.mixture.GaussianMixture depending on the cluster_method argument.

Return type:

generator object _BaseSpatialCrossValidator.split
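
The idea behind spatial k-fold splitting can be sketched in plain Python. This is a minimal illustration, not the dea_tools implementation: the helper name `spatial_kfold_indices` and the round-robin assignment of clusters to folds are assumptions made for the example. It only shows the property that matters, namely that all samples from one spatial cluster end up on the same side of each split.

```python
def spatial_kfold_indices(cluster_labels, n_splits):
    """Yield (train_idx, test_idx) pairs that hold out whole clusters together.

    Hypothetical helper for illustration only.
    """
    clusters = sorted(set(cluster_labels))
    if n_splits > len(clusters):
        raise ValueError("n_splits cannot exceed the number of clusters")
    # Assumption: clusters are dealt to folds round-robin, so each fold
    # receives roughly the same number of clusters.
    fold_of_cluster = {c: i % n_splits for i, c in enumerate(clusters)}
    for fold in range(n_splits):
        test_idx = [i for i, c in enumerate(cluster_labels)
                    if fold_of_cluster[c] == fold]
        train_idx = [i for i, c in enumerate(cluster_labels)
                     if fold_of_cluster[c] != fold]
        yield train_idx, test_idx


# Seven samples that have already been assigned to four spatial clusters
labels = [0, 0, 1, 1, 2, 2, 3]
folds = list(spatial_kfold_indices(labels, n_splits=2))
```

Because neighbouring samples share a cluster, a model is never tested on points that sit next to its training points, which is what keeps the scores from being inflated by spatial autocorrelation.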

dea_tools.classification.collect_training_data(gdf: geopandas.GeoDataFrame, dc_query: dict[str, Any], ncpus: int = 1, return_coords: bool = False, return_time_coords: bool = False, feature_func: callable = None, field: str = None, zonal_stats: str | None = None, clean: bool = True, fail_threshold: float = 0.05, fail_ratio: float = 0.5, max_retries: int = 2, time_field: str | None = None) pandas.DataFrame[source]

This function provides methods for gathering training/validation data from the ODC over geometries stored within a geopandas geodataframe. The function will return a pandas.DataFrame where the index contains class labels and the columns contain feature values generated by a user-defined feature_func.

  • In the instance where ncpus > 1, the function will automatically run in parallel.

  • Zonal statistics are supported where the provided vector file contains polygons, otherwise all pixel values are returned.

  • Individual points/polygons can be loaded from different time ranges by passing the time_field parameter.

  • Implements a retry queue for samples that may fail due to i/o limitations or s3 read failures.

Parameters:
  • gdf (geopandas geodataframe) – Geometry data in the form of a geopandas geodataframe. Must contain a class labels column, and can optionally contain a column with time stamps, specified with the time_field param.

  • dc_query (dictionary) – Datacube query object, should not contain lat and long (x or y) variables as these are supplied by the geopolygon column in the ‘gdf’.

  • ncpus (int) – The number of cpus/processes over which to parallelize the gathering of training data (only if ncpus is > 1). Defaults to 1.

  • feature_func (function) –

    A function for generating feature layers that is applied to the data within the bounds of the input geometry. The ‘feature_func’ must accept a ‘dc_query’ object, and return a single xarray.Dataset or xarray.DataArray:

        import datacube

        def feature_function(query):
            # Load the data described by the query, then average over time
            dc = datacube.Datacube(app='feature_layers')
            ds = dc.load(**query)
            ds = ds.mean('time')
            return ds

  • field (str) – Name of the column in the gdf that contains the class labels

  • return_coords (bool) – If True, then the output data will contain two extra columns ‘x_coord’ and ‘y_coord’ corresponding to the x,y coordinate of each sample.

  • return_time_coords (bool) – If True, then the output data will contain an extra column ‘time_coord’, corresponding to the time stamp of each sample.

  • zonal_stats (string, optional) – An optional string giving the names of zonal statistics to calculate for each polygon. Default is None (all pixel values are returned). Supported values are ‘mean’, ‘median’, ‘max’, ‘min’.

  • clean (bool) – Whether or not to remove missing values in the returned dataset. If True (default), rows with any NaNs or Infs in any numeric columns will be dropped from the dataset.

  • time_field (str, optional) – Name of the column containing time(range) data in the input gdf, for the case where each row should load from a different time(range). If loading from the same time(range) for all rows, then it is preferable to pass time as a key:variable in the ‘dc_query’. Note the time values must be in a format that datacube.load() accepts, for example a tuple of strings (‘2017-01-01’, ‘2017-01-31’). Defaults to None.

  • fail_threshold (float, default 0.05) – Silent read fails on S3 can result in some rows of the returned data containing NaN values. The ‘fail_threshold’ fraction specifies the acceptable proportion of failed samples. For example, setting ‘fail_threshold’ to 0.05 means that if more than 5% of the samples in the training dataset fail, those samples will be returned to the multiprocessing queue. Below this fraction, the function will accept the failures and return the results.

  • fail_ratio (float) – A float between 0 and 1 that defines whether a given training sample has failed. Default is 0.5, which means that if 50% of the measurements in a given sample return null values, and the total number of fails is more than the fail_threshold, the sample will be passed to the retry queue.

  • max_retries (int, default 2) – Maximum number of times to retry collecting samples. This number is invoked if the ‘fail_threshold’ is not reached.
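
The interaction between fail_ratio and fail_threshold can be sketched as follows. This is an illustrative reading of the parameter descriptions above, not the library's code: the helper name `samples_to_retry`, the use of plain lists, and the choice of `>=` for the per-sample test are all assumptions.

```python
import math


def samples_to_retry(samples, fail_ratio=0.5, fail_threshold=0.05):
    """Return the indices of samples that would be sent back for a retry.

    Hypothetical helper: each sample is a list of measurement values, with
    None or NaN standing in for a failed read.
    """
    if not samples:
        return []
    failed = []
    for i, values in enumerate(samples):
        n_null = sum(1 for v in values if v is None or math.isnan(v))
        # Assumption: a sample counts as failed once the share of null
        # measurements reaches fail_ratio.
        if values and n_null / len(values) >= fail_ratio:
            failed.append(i)
    # Retry only when the share of failed samples exceeds fail_threshold;
    # otherwise the failures are accepted.
    if len(failed) / len(samples) > fail_threshold:
        return failed
    return []


nan = float("nan")
samples = [
    [1.0, 2.0, 3.0, 4.0],   # no nulls
    [nan, nan, 3.0, 4.0],   # 50% nulls, counts as failed
    [1.0, nan, 3.0, 4.0],   # 25% nulls, accepted
]
retry_default = samples_to_retry(samples)
retry_lenient = samples_to_retry(samples, fail_threshold=0.5)
```

With the defaults, one failed sample out of three exceeds the 5% threshold, so it is queued for a retry. With a threshold of 0.5 the same failure is tolerated.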

Returns:

A pandas.DataFrame in which the index contains class labels and the columns contain feature values.

Return type:

pandas.DataFrame
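
The relationship between dc_query and time_field described above can be illustrated with a small sketch. The helper `build_row_query` and the set of rejected spatial keys are hypothetical and written only for this example. The sketch assumes that a per-row time value, when time_field is given, replaces any ‘time’ entry in the shared query.

```python
# Spatial keys that the query should not carry, because the extent comes
# from each geometry in the gdf (assumed list, for illustration only).
SPATIAL_KEYS = {"x", "y", "lat", "lon", "latitude", "longitude"}


def build_row_query(dc_query, row, time_field=None):
    """Return a copy of dc_query tailored to one row of the gdf.

    Hypothetical helper: `row` is any mapping from column name to value.
    """
    clashes = SPATIAL_KEYS & set(dc_query)
    if clashes:
        raise ValueError(f"dc_query should not contain {sorted(clashes)}")
    query = dict(dc_query)  # copy, so the shared query is left untouched
    if time_field is not None:
        # Each row loads from its own time range
        query["time"] = row[time_field]
    return query


dc_query = {"measurements": ["red", "nir"], "time": ("2017", "2018")}
row = {"class": 1, "time_range": ("2017-01-01", "2017-01-31")}
row_query = build_row_query(dc_query, row, time_field="time_range")
```

When every row uses the same period, leaving time_field as None and keeping ‘time’ in dc_query gives the same query for every row.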

dea_tools.classification.fit_xr(model, input_xr)[source]

Utilise our wrappers to fit a vanilla sklearn model.

Last modified: September 2019

Parameters:
  • model (scikit-learn model or compatible object) – Must have a fit() method that takes numpy arrays.

  • input_xr (xarray.DataArray or xarray.Dataset) – Must have dimensions ‘x’ and ‘y’, may have dimension ‘time’.

Returns:

model – A scikit-learn model which has been fitted to the data in the pixels of input_xr.

dea_tools.classification.predict_xr(model, input_xr, chunk_size=None, persist=False, proba=False, max_proba=True, clean=False, return_input=False)[source]

Using dask-ml ParallelPostfit(), runs the parallel predict and predict_proba methods of sklearn estimators. Useful for running predictions on larger-than-RAM datasets.

Last modified: September 2020

Parameters:
  • model (scikit-learn model or compatible object) – Must have a .predict() method that takes numpy arrays.

  • input_xr (xarray.DataArray or xarray.Dataset) – Must have dimensions ‘x’ and ‘y’

  • chunk_size (int) – The dask chunk size to use on the flattened array. If this is left as None, then the chunk size is inferred from the .chunks attribute of input_xr.

  • persist (bool) – If True, and proba=True, then ‘input_xr’ data will be loaded into distributed memory. This will ensure data is not loaded twice for the prediction of probabilities, but this will only work if the data is not larger than distributed RAM.

  • proba (bool) – If True, predict probabilities

  • max_proba (bool) – If True, the probabilities array will be flattened to contain only the probability for the “Predictions” class. If False, the “Probabilities” object will be an array of prediction probabilities for each class.

  • clean (bool) – If True, remove Infs and NaNs from input and output arrays

  • return_input (bool) – If True, then the data variables in the ‘input_xr’ dataset will be appended to the output xarray dataset.

Returns:

output_xr – An xarray.Dataset containing the prediction output from model. If proba=True then the dataset will also contain probabilities, and if return_input=True then the dataset will also contain the input feature layers. Has the same spatiotemporal structure as input_xr.

Return type:

xarray.Dataset
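
The effect of max_proba can be sketched with plain NumPy on a small array shaped like the output of an sklearn predict_proba call. The function name `summarise_probabilities` and the class codes are made up for this illustration; this is not how predict_xr is implemented internally.

```python
import numpy as np


def summarise_probabilities(proba, classes, max_proba=True):
    """Return (predictions, probabilities) from per-class probabilities.

    Hypothetical helper: `proba` has shape (n_samples, n_classes) and
    `classes` lists the class label for each column.
    """
    proba = np.asarray(proba)
    # The predicted class is the one with the highest probability
    predictions = np.asarray(classes)[np.argmax(proba, axis=1)]
    if max_proba:
        # Keep only the probability of the predicted class
        return predictions, proba.max(axis=1)
    # Otherwise keep the full per-class probability array
    return predictions, proba


proba = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
classes = [111, 112, 113]
predictions, probabilities = summarise_probabilities(proba, classes)
_, all_probabilities = summarise_probabilities(proba, classes, max_proba=False)
```

With max_proba=True each sample carries a single probability value. With max_proba=False it carries one value per class.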

dea_tools.classification.sklearn_flatten(input_xr)[source]

Reshape a DataArray or Dataset with spatial (and optionally temporal) structure into an np.array with the spatial and temporal dimensions flattened into one dimension.

This flattening procedure enables DataArrays and Datasets to be used to train and predict with sklearn models.

Last modified: September 2019

Parameters:

input_xr (xarray.DataArray or xarray.Dataset) – Must have dimensions ‘x’ and ‘y’, may have dimension ‘time’. Dimensions other than ‘x’, ‘y’ and ‘time’ are unaffected by the flattening.

Returns:

input_np – A numpy array corresponding to input_xr.data (or input_xr.to_array().data), with dimensions ‘x’, ‘y’ and ‘time’ flattened into a single dimension, which is the first axis of the returned array. input_np contains no NaNs.

Return type:

numpy.array
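
The flattening step can be illustrated with plain NumPy on a small stack of feature layers. This is a sketch of the idea only: `flatten_sketch` is a hypothetical name, the input is assumed to be shaped (n_features, y, x), and the sketch also returns the validity mask, whereas sklearn_flatten returns only the array.

```python
import numpy as np


def flatten_sketch(features):
    """Flatten (n_features, y, x) into (n_valid_pixels, n_features).

    Hypothetical helper for illustration only.
    """
    n_features = features.shape[0]
    # One row per pixel, one column per feature
    flat = features.reshape(n_features, -1).T
    # Keep only pixels that have no NaN in any feature
    valid = ~np.isnan(flat).any(axis=1)
    return flat[valid], valid


# Two feature layers on a 2 x 2 grid; one pixel is missing in the first layer
features = np.array([
    [[1.0, 2.0], [3.0, np.nan]],
    [[10.0, 20.0], [30.0, 40.0]],
])
flat, valid = flatten_sketch(features)
```

The result has one row per valid pixel and one column per feature, which is the (n_samples, n_features) layout that sklearn estimators expect.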

dea_tools.classification.sklearn_unflatten(output_np, input_xr)[source]

Reshape a numpy array with no ‘missing’ elements (NaNs) and ‘flattened’ spatiotemporal structure into a DataArray matching the spatiotemporal structure of input_xr.

This enables an sklearn model’s prediction to be remapped to the correct pixels in the input DataArray or Dataset.

Last modified: September 2019

Parameters:
  • output_np (numpy.array) – The first dimension’s length should correspond to the number of valid (non-NaN) pixels in input_xr.

  • input_xr (xarray.DataArray or xarray.Dataset) – Must have dimensions ‘x’ and ‘y’, may have dimension ‘time’. Dimensions other than ‘x’, ‘y’ and ‘time’ are unaffected by the flattening.

Returns:

output_xr – An xarray.DataArray with the same dimensions ‘x’, ‘y’ and ‘time’ as input_xr, and the same valid (non-NaN) pixels. These pixels are set to match the data in output_np.

Return type:

xarray.DataArray
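
The reverse step can be sketched in the same spirit. The helper `unflatten_sketch` is hypothetical and works on plain NumPy arrays rather than xarray objects. It assumes a boolean mask marking which flattened pixels were valid, and fills the remaining pixels with NaN.

```python
import numpy as np


def unflatten_sketch(output_np, valid, shape):
    """Place per-pixel outputs back onto a grid of the given shape.

    Hypothetical helper: `valid` is a flat boolean mask with one entry per
    pixel, and `output_np` holds one value per valid pixel.
    """
    out = np.full(valid.shape, np.nan)
    # Write predictions into the valid positions; the rest stay NaN
    out[valid] = output_np
    return out.reshape(shape)


valid = np.array([True, True, True, False])
predictions = np.array([1, 0, 1])
grid = unflatten_sketch(predictions, valid, shape=(2, 2))
```

The length of the prediction array matches the number of True entries in the mask, mirroring the requirement on output_np described above.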

dea_tools.classification.spatial_clusters(coordinates, method='Hierarchical', max_distance=None, n_groups=None, verbose=False, **kwargs)[source]

Create spatial groups on coordinate data using KMeans clustering, a Gaussian Mixture model, or hierarchical clustering.

Last modified: September 2020

Parameters:
  • coordinates (np.array) – A numpy array of coordinate values e.g. np.array([[3337270., 262400.], [3441390., -273060.], …])

  • method (str) – Which algorithm to use to separate data points. Either ‘KMeans’, ‘GMM’, or ‘Hierarchical’. If using ‘Hierarchical’ then max_distance must be set.

  • max_distance (int) – If method is set to ‘Hierarchical’, this is the maximum Euclidean distance between all observations in a cluster. ‘n_groups’ is ignored in this case.

  • n_groups (int) – The number of groups to create. This is passed as ‘n_clusters=n_groups’ for the KMeans algo, and ‘n_components=n_groups’ for the GMM. If using method=‘Hierarchical’ then this parameter is ignored.

  • **kwargs (optional) – Additional keyword arguments to pass to sklearn.cluster.KMeans or sklearn.mixture.GaussianMixture depending on the ‘method’ argument.

Returns:

labels – Index of the cluster each sample belongs to.

Return type:

array, shape [n_samples,]
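
The quantity that max_distance is described as bounding, the largest Euclidean distance between any two observations in a cluster, can be computed with a short sketch. The helper `cluster_diameters` is hypothetical and does no clustering itself. It only measures clusters that have already been labelled, which can help when choosing a max_distance value.

```python
import math
from itertools import combinations


def cluster_diameters(coordinates, labels):
    """Return the largest pairwise Euclidean distance within each cluster.

    Hypothetical helper for illustration only.
    """
    groups = {}
    for point, label in zip(coordinates, labels):
        groups.setdefault(label, []).append(point)
    return {
        # A cluster with a single point has a diameter of 0.0
        label: max((math.dist(a, b) for a, b in combinations(points, 2)),
                   default=0.0)
        for label, points in groups.items()
    }


coordinates = [(0.0, 0.0), (3.0, 4.0), (100.0, 100.0), (100.0, 106.0)]
labels = [0, 0, 1, 1]
diameters = cluster_diameters(coordinates, labels)
```

Distances are in the units of the coordinate reference system, so for projected eastings and northings they are typically metres.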

dea_tools.classification.spatial_train_test_split(X, y, coordinates, cluster_method, kfold_method, balance, test_size=None, n_splits=None, n_groups=None, max_distance=None, train_size=None, random_state=None, **kwargs)[source]

Split arrays into random train and test subsets. Similar to sklearn.model_selection.train_test_split but instead works on spatial coordinate data. Coordinate data is grouped according to either a KMeans, Gaussian Mixture, or Agglomerative Clustering algorithm. Grouping by spatial clusters is preferred over plain random splits for spatial data to avoid overestimating validation scores due to spatial autocorrelation.

Parameters:
  • X (np.array) – Training data features

  • y (np.array) – Training data labels

  • coordinates (np.array) – A numpy array of coordinate values e.g. np.array([[3337270., 262400.], [3441390., -273060.], …])

  • cluster_method (str) – Which algorithm to use to separate data points. Either ‘KMeans’, ‘GMM’, or ‘Hierarchical’

  • kfold_method (str) – One of either ‘SpatialShuffleSplit’ or ‘SpatialKFold’. See the docs under the _SpatialShuffleSplit and _SpatialKFold classes for more information on these options.

  • balance (int or bool) –

    If setting kfold_method to ‘SpatialShuffleSplit’: int

    The number of splits generated per iteration to try to balance the amount of data in each set so that test_size and train_size are respected. If 1, then no extra splits are generated (essentially disabling the balancing). Must be >= 1.

    If setting kfold_method to ‘SpatialKFold’: bool

    Whether or not to split clusters into folds with approximately equal numbers of data points. If False, each fold will have the same number of clusters (which can have different numbers of data points in them).

  • test_size (float, int, None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.15.

  • n_splits (int) – This parameter is invoked for the ‘SpatialKFold’ folding method, use this number to satisfy the train-test size ratio desired, as the ‘test_size’ parameter for the KFold method often fails to get the ratio right.

  • n_groups (int) – The number of groups to create. This is passed as ‘n_clusters=n_groups’ for the KMeans algo, and ‘n_components=n_groups’ for the GMM. If using cluster_method=’Hierarchical’ then this parameter is ignored.

  • max_distance (int) – If cluster_method is set to ‘Hierarchical’, this is the maximum Euclidean distance between all observations in a cluster. ‘n_groups’ is ignored in this case.

  • train_size (float, int, or None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • **kwargs (optional) – Additional keyword arguments to pass to sklearn.cluster.KMeans or sklearn.mixture.GaussianMixture depending on the cluster_method argument.

Returns:

Contains four arrays in the following order:

X_train, X_test, y_train, y_test

Return type:

Tuple
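
The rules for test_size and train_size given above can be sketched as a small resolver. The function `resolve_test_count` is a hypothetical helper, and rounding fractional sizes with round() is an assumption, since the exact rounding used by the library is not stated here.

```python
def resolve_test_count(n_samples, test_size=None, train_size=None):
    """Return the number of test samples implied by the two size arguments.

    Hypothetical helper for illustration only.
    """
    def to_count(size):
        # Floats are proportions of the dataset, ints are absolute counts
        if isinstance(size, float):
            if not 0.0 < size < 1.0:
                raise ValueError("float sizes must be between 0.0 and 1.0")
            return round(size * n_samples)  # rounding rule is an assumption
        return int(size)

    if test_size is None and train_size is None:
        test_size = 0.15  # documented fallback when both are None
    if test_size is None:
        # Complement of the train size
        return n_samples - to_count(train_size)
    return to_count(test_size)


default_count = resolve_test_count(100)
complement_count = resolve_test_count(100, train_size=0.8)
absolute_count = resolve_test_count(100, test_size=30)
```

With 100 samples, the default gives 15 test samples, a train_size of 0.8 leaves 20 for testing, and an integer test_size is used as given.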