dea_tools.validation

Tools for validating outputs and producing accuracy assessment metrics.

License: The code in this notebook is licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0). Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).

Contact: If you need assistance, please post a question on the Open Data Cube Discord chat (https://discord.com/invite/4hhBQVas5U) or on the GIS Stack Exchange (https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the open-data-cube tag (you can view previously asked questions here: https://gis.stackexchange.com/questions/tagged/open-data-cube).

If you would like to report an issue with this script, you can file one on GitHub (GeoscienceAustralia/dea-notebooks#new).

Last modified: July 2025

Functions

`eval_metrics`(x, y[, round, all_regress])	Calculate a set of common statistical metrics based on two input actual and predicted vectors.
`xr_random_sampling`(da[, n, sampling, ...])	Efficient and scalable random sampling of a 2D classified xarray.DataArray.

dea_tools.validation.eval_metrics(x, y, round=3, all_regress=False)

Calculate a set of common statistical metrics based on two input actual and predicted vectors.

These include:

Pearson correlation
Root Mean Squared Error
Mean Absolute Error
R-squared
Bias
Linear regression parameters (slope, p-value, intercept, standard error)

Parameters:

x (numpy.array) – An array providing “actual” variable values
y (numpy.array) – An array providing “predicted” variable values
round (int) – Number of decimal places to round each metric to. Defaults to 3
all_regress (bool) – Whether to return the linear regression p-value, intercept and standard error in addition to the regression slope. Defaults to False

Return type:

A pandas.Series containing calculated metrics
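
As a rough illustration of what the listed metrics measure, the sketch below computes them with NumPy alone. It is not the dea_tools implementation: the function name sketch_eval_metrics, the dictionary keys, and the decimals argument are hypothetical, bias is assumed to mean the average of (predicted - actual), and R-squared is assumed to be the squared Pearson correlation. It returns a plain dict rather than a pandas.Series.

```python
import numpy as np


def sketch_eval_metrics(x, y, decimals=3):
    """Illustrative only: agreement metrics for actual (x) vs predicted (y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - x  # assumption: error is predicted minus actual

    # Least-squares line y = slope * x + intercept
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation

    metrics = {
        "correlation": r,
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "mae": np.mean(np.abs(residuals)),
        "r_squared": r ** 2,  # assumption: squared Pearson correlation
        "bias": np.mean(residuals),
        "slope": slope,
    }
    return {name: round(float(value), decimals) for name, value in metrics.items()}


# Predictions that are consistently 0.5 too high
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
result = sketch_eval_metrics(actual, predicted)
```

In this toy case the predictions track the actual values perfectly apart from a constant offset, so the correlation and slope are 1 while RMSE, MAE and bias all equal the offset of 0.5.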

dea_tools.validation.xr_random_sampling(da, n=None, sampling='stratified_random', manual_class_ratios=None, oversample_factor=5, random_seed=None, out_fname=None, verbose=True)

Efficient and scalable random sampling of a 2D classified xarray.DataArray. Returns a GeoDataFrame of point samples based on the specified sampling strategy.

Parameters:

da (xarray.DataArray) – A classified 2-dimensional xarray.DataArray
n (int) – Total number of points to sample. Ignored if a dictionary of {class: numofpoints} is passed to ‘manual_class_ratios’
sampling (str, optional) – The sampling strategy to use. Options include: ‘stratified_random’ = Create points that are randomly distributed within each class, where each class has a number of points proportional to its relative area. ‘equal_stratified_random’ = Create points that are randomly distributed within each class, where each class has the same number of points. ‘random’ = Create points that are randomly distributed throughout the image. ‘manual’ = User defined: each class is allocated a specified number of points; supply a manual_class_ratios dictionary mapping each class to its number of points
manual_class_ratios (dict, optional) – If setting sampling to ‘manual’, provide a dictionary of type {‘class’: numofpoints} mapping each class to the number of points to generate for it.
oversample_factor (float, optional (default=5)) – A multiplier used to increase the number of random candidate pixels initially drawn when sampling very large classes (>1 billion pixels). For such large classes, the function randomly samples a subset of pixel coordinates and checks which ones match the target class. To reduce the chance of undersampling, oversample_factor controls how many candidate coordinates are initially drawn. For example, if 100 samples are required and oversample_factor=5, 500 random (x, y) coordinates will be sampled first. Only those matching the class will be retained and then randomly subsampled down to the desired number of samples. If too few valid matches are found, a warning is issued. Increasing this value can improve success rates when sampling sparse or spatially fragmented classes in large datasets, at the cost of more memory and computation.
random_seed (int | None, optional) – Controls the random number generation for reproducibility.
out_fname (str, optional) – If a file path is provided, e.g. ‘sample_points.geojson’, the function will export the sampling points to that file as a GeoJSON (or shapefile).
verbose (bool, optional (default=True)) – If True, progress messages and warnings are printed
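
The two stratified options of the sampling parameter differ only in how the total n is divided between classes. The sketch below shows one way that split could be computed, assuming integer class values, no missing data, and simple rounding. sketch_points_per_class is a hypothetical helper, not part of dea_tools, and the library's own rounding may differ.

```python
import numpy as np


def sketch_points_per_class(classified, n, sampling="stratified_random"):
    """Illustrative only: split n sample points between classes."""
    classes, counts = np.unique(classified, return_counts=True)

    if sampling == "stratified_random":
        # Points proportional to each class's share of the pixels (its area)
        allocation = np.round(n * counts / counts.sum()).astype(int)
    elif sampling == "equal_stratified_random":
        # Same number of points for every class
        allocation = np.full(classes.size, n // classes.size)
    else:
        raise ValueError(f"Unsupported sampling option in this sketch: {sampling}")

    return {int(c): int(k) for c, k in zip(classes, allocation)}


# Toy 10 x 10 map: 75 pixels of class 1 and 25 pixels of class 2
toy_map = np.array([1] * 75 + [2] * 25).reshape(10, 10)
proportional = sketch_points_per_class(toy_map, n=20)
equal = sketch_points_per_class(toy_map, n=20, sampling="equal_stratified_random")
```

With 20 points, the proportional split gives class 1 three times as many points as class 2, while the equal split gives each class 10. Because of rounding, the proportional counts in this sketch are not guaranteed to sum exactly to n for arbitrary class areas.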

Return type:

geopandas.GeoDataFrame
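
The candidate-and-filter approach described for oversample_factor can be sketched on a plain NumPy array as follows. This is an assumption-laden illustration rather than the library's code: sketch_oversample_class and its arguments are hypothetical, it works on array indices rather than map coordinates, and candidates are drawn with replacement, so the same pixel can appear more than once.

```python
import warnings

import numpy as np


def sketch_oversample_class(classified, target_class, n_samples,
                            oversample_factor=5, seed=None):
    """Illustrative only: draw random candidates, keep those in target_class."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = classified.shape

    # Draw more candidates than needed, e.g. 100 samples * 5 = 500 candidates
    n_candidates = int(n_samples * oversample_factor)
    rows = rng.integers(0, n_rows, size=n_candidates)
    cols = rng.integers(0, n_cols, size=n_candidates)

    # Retain only the candidates that fall on the target class
    matches = classified[rows, cols] == target_class
    rows, cols = rows[matches], cols[matches]

    if rows.size < n_samples:
        # Too few valid matches: warn and return what was found
        warnings.warn(
            f"Only {rows.size} of {n_samples} requested samples found "
            f"for class {target_class}; consider a larger oversample_factor."
        )
        return rows, cols

    # Randomly subsample the matches down to the requested number
    keep = rng.choice(rows.size, size=n_samples, replace=False)
    return rows[keep], cols[keep]


# Toy map in which every pixel belongs to class 1
toy_map = np.ones((50, 50), dtype=int)
sample_rows, sample_cols = sketch_oversample_class(toy_map, 1, n_samples=10, seed=0)
```

The trade-off mirrors the parameter description: a larger factor makes it more likely that enough candidates land on a sparse class, at the cost of drawing and checking more coordinates.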