Accessing Gridded Climate Data

Sign up to the DEA Sandbox to run this notebook interactively from a browser
Compatibility: Notebook currently compatible with the DEA Sandbox & NCI environments
Products used: ERA5 and ANUClim (these datasets are external to the Digital Earth Australia platform)

Background & Description

This notebook demonstrates how to access and use Australian-specific climate data stored on the National Computing Infrastructure (NCI), and global climate reanalysis data (ERA5) hosted on Google’s cloud infrastructure. The content is structured into two parts.

The first section shows how to access an Australian climate dataset through the THREDDS platform. NCI’s THREDDS provides a wide range of climate, satellite, and other gridded environmental datasets, which can be accessed programmatically via xarray.
The second section focuses on the use of a pre-processed version of the ECMWF Reanalysis v5 (ERA5) Climate Fields. These datasets are also accessed using xarray.

Note: Additional details about the datasets are provided further in the notebook.

Load packages

Import Python packages that are used for the analysis.

[1]:

%matplotlib inline

import os
import gcsfs
import requests
import datacube
import xarray as xr
import pandas as pd
import numpy as np
import contextily as ctx
import geopandas as gpd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from odc.geo.xr import assign_crs
from odc.geo.geom import Geometry, BoundingBox

import sys
sys.path.insert(1, '../Tools/')
from dea_tools.dask import create_local_dask_cluster

Analysis parameters

The following code cell is used to define variable needed later.

vector_path: a file path to the polygon that defines our analysis region, the default example uses a boundary of Victoria.

[2]:

# Boundary of victoria
vector_path = '../Supplementary_data/External_data_Climate/victoria_boundary.geojson'

Section 1: ANUClim climate data

In the first section of this notebook, we demonstrates retrieval of the Australian National University’s Climate data product (‘ANUClim’). These datasets are produced using topographically conditional spatial interpolation of Australia’s extensive network of weather stations.

The available variables include: * tavg: Average surface air temperature * rain: Total precipitation * vpd : Vapour pressure deficit * srad: Incoming solar radiation * evap: Pan evaporation

The entire archive can be viewed here

Data is available at a daily and monthly cadence, and has a 1 x 1 km spatial resolution, making it one of the highest resolution climate data products available for Australia. The time range of the data depends on the climate variable. For example, rainfall goes back to 1900, while air temperature begins in 1960. Data is updated annually.

Note: ANUClim is stored in the 'EPSG:4283' projection

Analysis parameters for ANUClim

var: the climate variable to load.
cadence: specifies whether to load monthly or daily data. Must be either ‘month’ or ‘day’.
year_start & year_end: the year range of data you want to load, use integer values here.

[3]:

# Which variable should we load?
var = 'rain' # other options: vpd, srad, rain, evap,  tavg

# Define the cadence of the data to load
cadence= 'month' # other options: day

# Define the time window (inclusive of end year) (use integers only)
year_start, year_end = 2020, 2020

Load our region of interest

[4]:

# open the vector file
gdf = gpd.read_file(vector_path)

# convert to GDA94 to match ANUClim's CRS
gdf = gdf.to_crs('epsg:4283')

# define bounding box of the geometry
# we'll use this to select data before masking
bb = gdf.total_bounds

Load the climate data through THREDDS

Extract the links for loading the target year data from THREDDS using xarray.

[5]:

# the top level folder on the NCI
thredds_data_link = 'https://thredds.nci.org.au/thredds/dodsC/gh70/ANUClimate/v2-0/stable/'
thredds_catalog_link=f'https://thredds.nci.org.au/thredds/catalog/gh70/ANUClimate/v2-0/stable/{cadence}/'

# list of years to load
years = [str(i) for i in range(year_start, year_end+1)]

i=0
pp = []

# loop through years and load
for y in years:
    print(" {:02}/{:02}\r".format(i + 1, len(years)), end="")

    # get list of all the file links on the THREDDS server
    ff = f'{thredds_catalog_link}{var}/{y}/catalog.html'
    soup = BeautifulSoup(requests.get(ff).content, "html.parser")

    # extract just the href links
    file_names = []
    for link in soup.select('a[href*=".html"]'): # loop over all <a> elements where the href attribute contains ".html"
        href = link["href"]
        if "dataset" in href: # add only file names of datasets to list
            file_names.append(href)

    # create a list of links for xarray
    list_of_links = [thredds_data_link+f.replace('catalog.html?dataset=gh70/', '') for f in file_names]

    # open with xarray and tidy up
    ds = xr.open_mfdataset(list_of_links)
    ds = assign_crs(ds, crs='epsg:4283') #GDA94
    ds = ds.drop_vars('crs')[var]
    ds.attrs['nodata'] = np.nan

    # select data from within the bounding box of the geometry
    # this is faster than masking, which we can do later
    ds = ds.sel(lat=slice(bb[3], bb[1]), lon=slice(bb[0],bb[2]))

    pp.append(ds)
    i += 1

# join the years together
climate = xr.concat(pp, dim='time').sortby('time')
climate.load() #bring into memory

# mask out all pixels outside of our ROI
# convert to a Geometry object for masking
geometry = gdf.iloc[0].geometry
geom = Geometry(geom=geometry, crs=gdf.crs) # use same CRS of the vector file
climate = climate.odc.mask(poly=geom)

 01/01

Plot annual data

Below we resample the climate data into an annual series. As we have loaded rainfall in the default example, we calculate an annual sum. This statistic may not make sense if you are loading a different variable, e.g. temperature.

[6]:

climate_annual = climate.resample(time='1YE').sum()
climate_annual = climate_annual.odc.mask(poly=geom) # need to remask after a sum to remove zeros

/env/lib/python3.10/site-packages/odc/geo/_xr_interop.py:500: UserWarning: grid_mapping=crs is not pointing to valid coordinate
  warnings.warn(

[7]:

fig, ax = plt.subplots(figsize=(6, 4), layout='constrained')

im = climate_annual.isel(time=0).plot(cmap='viridis', add_labels=False, robust=True, add_colorbar=False, ax=ax)
ax.set_title(climate_annual.isel(time=0).time.dt.year.item())

cb = fig.colorbar(im, ax=ax, shrink=0.75, orientation='vertical')
cb.ax.set_title(var + ' \n mm yr\u207B\u00B9', fontsize=8);

../../../_images/notebooks_How_to_guides_External_data_Climate_15_0.png

Section 2: ECMWF Reanalysis v5 (ERA5)

We can also load ERA5 reanalysis datasets using Google’s collection of Analysis-Ready, Cloud Optimized ERA5 data. ERA5 has a coarse spatial resolution (~25 km) compared with ANUClim, but provides hourly reconstructions of past climate and has many more climate fields at both surface and pressure levels.

A list of variables to load is available in the data summary table. Some commonly used long and short variable names are the following: * 2 metre temperature : 2t * total_precipitation : tp * 2m_dewpoint_temperature : 2d * evaporation : e * 10m_u_component_of_wind : 10u * 10m_v_component_of_wind : 10v * Surface short-wave (solar) radiation downwards : ssrd

Analysis parameters for ERA5

era5_var: Provide a variable name to load
time_range: length of time to load

[8]:

era5_var = 'total_precipitation'
time_range = ('2020-01-01', '2020-12-31')

Open a dask client

This helps with ERA5 data as it has a larger volume of time steps to summarise.

[9]:

create_local_dask_cluster()

Client

Client-421a5cbe-6110-11f0-8194-a257d99bdb0e

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: /user/chad.burton@ga.gov.au/proxy/8787/status

Cluster Info

LocalCluster

ca9e844b

Dashboard: /user/chad.burton@ga.gov.au/proxy/8787/status	Workers: 1
Total threads: 2	Total memory: 15.00 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-05038069-2097-41e3-92b4-66463e14f10b

Comm: tcp://127.0.0.1:38757	Workers: 1
Dashboard: /user/chad.burton@ga.gov.au/proxy/8787/status	Total threads: 2
Started: Just now	Total memory: 15.00 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:41071	Total threads: 2
Dashboard: /user/chad.burton@ga.gov.au/proxy/44255/status	Memory: 15.00 GiB
Nanny: tcp://127.0.0.1:44099
Local directory: /tmp/dask-scratch-space/worker-ptuthvpu

Define a function for loading ERA5

[10]:

def load_era5(
    x=None,
    y=None,
    crs="EPSG:4326",
    time=None,
    bands=None,
    chunks=None,
    path="gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
):
    # Lazily load Zarr from Google Cloud Platform
    era5_ds = xr.open_zarr(path, chunks=chunks, storage_options=dict(token="anon"))

    # Select bands
    era5_ds = era5_ds[bands] if bands is not None else era5_ds

    # Clip to extent
    if x is not None:
        bbox = BoundingBox.from_xy(x, y, crs=crs).to_crs(crs)
        era5_ds = era5_ds.sel(
            longitude=slice(bbox.left, bbox.right),
            latitude=slice(bbox.top, bbox.bottom),
        )

    # Select time
    if time is not None:
        era5_ds = era5_ds.sel(time=slice(time[0], time[-1]))

    return era5_ds.odc.assign_crs(crs)

Convert the polygon back to EPSG:4326

[11]:

# convert our vector back to EPSG 4326 for ERA5
gdf = gdf.to_crs('epsg:4326')

# define bounding box of the geometry
# we'll use this to select data before masking
bb = gdf.total_bounds

Load total precipitation

[12]:

era5_climate = load_era5(
    x=(bb[0],bb[2]),
    y=(bb[1],bb[3]),
    crs="EPSG:4326",
    time=time_range,
    bands=[era5_var],
    chunks=dict(x=10, y=10, time=2000), # really small chunks
    path="gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
)

era5_climate = era5_climate[era5_var]
era5_climate

[12]:

<xarray.DataArray 'total_precipitation' (time: 8784, latitude: 21, longitude: 36)> Size: 27MB
dask.array<getitem, shape=(8784, 21, 36), dtype=float32, chunksize=(2000, 21, 36), chunktype=numpy.ndarray>
Coordinates:
  * latitude     (latitude) float32 84B -34.0 -34.25 -34.5 ... -38.75 -39.0
  * longitude    (longitude) float32 144B 141.0 141.2 141.5 ... 149.5 149.8
  * time         (time) datetime64[ns] 70kB 2020-01-01 ... 2020-12-31T23:00:00
    spatial_ref  int32 4B 4326
Attributes:
    long_name:   Total precipitation
    short_name:  tp
    units:       m

Resample to annual

As ERA5 has an hourly time step, we need to use dask to resample into annual time fields as there is a lot of data even though the spatial extent of the subset of data is small. This can take a few minutes on the default sandbox.

[13]:

era5_climate_annual = era5_climate.resample(time='1YE').sum().load()

Mask with the polygon

[14]:

geometry = gdf.iloc[0].geometry
geom = Geometry(geom=geometry, crs=gdf.crs)
era5_climate_annual = era5_climate_annual.odc.mask(poly=geom) # need to remask after a sum to remove zeros

Plot an annual summary data

[15]:

fig, ax = plt.subplots(figsize=(5, 4), layout='constrained')

im = (era5_climate_annual*1000).isel(time=0).plot(cmap='viridis', add_labels=False, robust=True, add_colorbar=False, ax=ax);  # multiply by 1000 to convert into mm
ax.set_title(era5_climate_annual.isel(time=0).time.dt.year.item())

cb = fig.colorbar(im, ax=ax, shrink=0.75, orientation='vertical')
cb.ax.set_title(var + ' \n mm yr\u207B\u00B9', fontsize=8);

../../../_images/notebooks_How_to_guides_External_data_Climate_32_0.png

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Discord chat or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on GitHub.

Last modified: April 2025

Compatible datacube version:

[16]:

print(datacube.__version__)

1.8.19