CleF - Climate Finder search ESGF data at NCI

README

Clef searches the Earth System Grid Federation datasets stored at the Australian National Computational Infrastructure, both data published on the NCI ESGF node as well as files that are locally replicated from other ESGF nodes.

Currently it searches for the following datasets:

  • CMIP5 raijin projects: rr3, where NCI is the primary publisher and al33 for replicas
  • CMIP6 raijin projects: oi10 for replicas

The search returns both the path of data that is already available at NCI as well as information on data that is on external ESGF nodes but not yet available locally.

Install

Clef is pre-installed into a Conda environment at NCI. Load it with:

module use /g/data3/hh5/public/modules
module load conda/analysis3-unstable

We are constantly adding new features; the development version is available in a separate environment:

module use /g/data3/hh5/public/modules
module load conda
source activate clef-test

You can install it to your own environment with:

conda install -c coecms -c conda-forge clef

Note, however, that the MAS database, which clef needs in order to run, can only be accessed from NCI systems.

Use

clef cmip5

Find CMIP5 files matching the constraints:

clef cmip5 --model BCC-CSM1.1 --variable tas --experiment historical --table day

You can filter CMIP5 by the following terms:

  • ensemble/member
  • experiment
  • experiment-family
  • institution
  • model
  • table/cmor_table
  • realm
  • frequency
  • variable
  • cf-standard-name

See clef cmip5 --help for all available filters and their aliases

--latest will check the latest versions of the datasets on the ESGF website, and will only return matching files.

It will return a path for all the files available locally at NCI and a dataset-id for the ones that haven’t been downloaded yet.

You can use the flags --local and --missing to return respectively only the local paths or the missing dataset-id:

clef --local cmip5 --model MPI-ESM-LR --variable tas --table day
clef --missing cmip5 --model MPI-ESM-LR --variable tas --table day

NB: these flags come immediately after the command “clef” and before the sub-command “cmip5” or “cmip6”. They are mutually exclusive.

You can repeat an argument to search for several values at once:

clef --missing cmip5 --model MPI-ESM-LR -v tas -v tasmax -t day -t Amon
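
If you build searches from a script, the ordering rules above (global flag first, then the sub-command, then one option per value) can be captured in a small helper. This is only a sketch: build_clef_command is a hypothetical name, not part of clef, and it assumes the long option names shown in this README.

```python
import shlex


def build_clef_command(project, constraints, mode=None):
    """Build the argument list for a clef search.

    project: 'cmip5' or 'cmip6'
    constraints: dict mapping a filter name to one value or a list of values
    mode: None, 'local' or 'missing' (the two flags are mutually exclusive)
    """
    if mode not in (None, "local", "missing"):
        raise ValueError("mode must be None, 'local' or 'missing'")
    cmd = ["clef"]
    if mode is not None:
        # Global flags go between "clef" and the sub-command
        cmd.append(f"--{mode}")
    cmd.append(project)
    for name, values in constraints.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            # Repeat the option once per requested value
            cmd.extend([f"--{name}", value])
    return cmd


cmd = build_clef_command(
    "cmip5",
    {"model": "MPI-ESM-LR", "variable": ["tas", "tasmax"], "table": ["day", "Amon"]},
    mode="missing",
)
print(shlex.join(cmd))
```

On an NCI system the resulting list could then be passed to subprocess.run().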

clef cmip6

You can filter CMIP6 by the following terms:

  • activity
  • experiment
  • institution
  • source_type
  • model
  • member
  • table
  • realm
  • frequency
  • variable
  • version

See clef cmip6 --help for all available filters

Develop

Development install:

conda env create -f conda/dev-environment.yml
source activate clef-dev
pip install -e '.[dev]'

The dev-environment.yml file is for speeding up installs and installing packages unavailable on PyPI; requirements.txt is the source of truth for dependencies.

To work on the database tables you may need to start up a test database.

You can start a test database either with Docker:

docker-compose up # (In a separate terminal)
psql -h localhost -U postgres -f db/nci.sql
psql -h localhost -U postgres -f db/tables.sql
# ... do testing
docker-compose rm

Or with Vagrant:

vagrant up
# ... do testing
vagrant destroy

Run tests with py.test (they will default to using the test database):

py.test

Build the documentation using Sphinx:

python setup.py build_sphinx
firefox docs/_build/index.html

New releases are packaged and uploaded to anaconda.org by CircleCI when a new Github release is made

Documentation is available on ReadTheDocs, both for stable and latest versions.

Getting Started

CleF is presently installed in an anaconda environment, which must be loaded before use (on either VDI or Raijin):

$ module use /g/data3/hh5/public/modules
$ module load conda/analysis3-unstable

NB there is a clef version available on analysis3 but the one in unstable is more recent and has fixes for some bugs.

clef is accessed through the command-line clef program. There are presently three main commands:

  • clef cmip5 to execute searches on the CMIP5 dataset
  • clef cmip6 to execute searches on the CMIP6 dataset
  • clef ds to execute searches on non-ESGF climate datasets

Examples

The search works like the ESGF search website, e.g. https://esgf.nci.org.au/search/esgf_nci. Results can be filtered by using flags matching the ESGF search facets.

CMIP5

$ clef cmip5 --model ACCESS1.0 \
             --experiment historical --frequency mon --variable ua --variable va
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r1i1p1/v20120727/ua/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r1i1p1/v20120727/va/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r2i1p1/v20130726/ua/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r2i1p1/v20130726/va/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r3i1p1/v20140402/ua/
/g/data1/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-0/historical/mon/atmos/Amon/r3i1p1/v20140402/va/

Everything available on ESGF is also available locally

CMIP6

$ clef cmip6 --activity CMIP \
             --experiment historical --source_type AOGCM --table Amon --grid gr --resolution "250 km" --variable ua --variable va
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r1i1p1f2/Amon/ua/gr/v20180917/
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r1i1p1f2/Amon/va/gr/v20180917/

Available on ESGF but not locally:
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.historical.r2i1p1f2.Amon.ua.gr.v20181126
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.historical.r2i1p1f2.Amon.va.gr.v20181126
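
When the output is consumed by a script, it can help to split it into local paths and missing dataset ids. The sketch below assumes the layout seen in the examples above (paths start with "/", missing ids follow the "Available on ESGF but not locally:" line); parse_clef_output is a hypothetical helper, not part of clef.

```python
MISSING_HEADER = "Available on ESGF but not locally:"


def parse_clef_output(text):
    """Split clef's printed output into local paths and missing dataset ids."""
    local_paths, missing_ids = [], []
    in_missing = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(MISSING_HEADER):
            # Everything after this header is treated as a dataset id
            in_missing = True
            continue
        if line.startswith("/"):
            local_paths.append(line)
        elif in_missing:
            missing_ids.append(line)
        # Any other line (e.g. an informational message) is ignored
    return local_paths, missing_ids


sample = """\
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r1i1p1f2/Amon/ua/gr/v20180917/
/g/data1b/oi10/replicas/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r1i1p1f2/Amon/va/gr/v20180917/

Available on ESGF but not locally:
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.historical.r2i1p1f2.Amon.ua.gr.v20181126
CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.historical.r2i1p1f2.Amon.va.gr.v20181126
"""
local_paths, missing_ids = parse_clef_output(sample)
print(len(local_paths), "local,", len(missing_ids), "missing")
```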

ds

$ clef ds -f netcdf --standard-name air_temperature
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_pl/1.0/ta/ta_6hr_ERAI_historical_oper_an_pl_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_sfc/1.0/tas/tas_6hr_ERAI_historical_oper_an_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
ta: /g/data/ub4/erai/netcdf/6hr/atmos/oper_an_ml/1.0/ta/ta_6hr_ERAI_historical_oper_an_ml_<YYYYMMDD>_<YYYYMMDD>.nc
mn2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mn2t/mn2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
mx2t: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/mx2t/mx2t_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc
tas: /g/data/ub4/erai/netcdf/3hr/atmos/oper_fc_sfc/1.0/tas/tas_3hr_ERAI_historical_oper_fc_sfc_<YYYYMMDD>_<YYYYMMDD>.nc

Options

clef --missing

clef --missing <dataset> searches ESGF for files that haven’t been downloaded to NCI. It returns ESGF dataset IDs for each dataset that has one or more missing files:

$ clef --missing cmip5 --model HadCM3 --experiment historical \
               --table day --ensemble r1i1p1 \
               --variable ta
Available on ESGF but not locally:
cmip5.output2.MOHC.HadCM3.historical.day.atmos.day.r1i1p1.v20110728

NOTE: ESGF keeps track of only the most recent version of each file for a given dataset version, so if the files in the NCI mirror and on ESGF don’t match, this command can return false positives.

clef --local

clef --local <dataset> searches the local file system for files that have been downloaded to NCI. It returns the path to the file on NCI’s /g/data disk:

$ clef --local cmip5 --model HadCM3 --experiment historical --table day --ensemble r1i1p1 \
                     --variable ta --all-versions
/g/data1/ua6/unofficial-ESG-replica/tmp/tree/cmip-dn1.badc.rl.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadCM3/historical/day/atmos/day/r1i1p1/v20110728/ta/
/g/data1/ua6/unofficial-ESG-replica/tmp/tree/esgf-data1.ceda.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadCM3/historical/day/atmos/day/r1i1p1/v20140110/ta/

NOTE: Presently the default behaviour for all the ESGF-node based searches is to check for the most recent (latest) version on ESGF and return only files with that version. This can be disabled with the --all-versions flag. The --local option, by contrast, currently returns all available versions by default, including versions that are no longer published on ESGF but are still available locally.

Most of the older CMIP5 collection (ua6 project) has been replaced by the new one (al33 project), which does not include older or superseded versions. If you are looking for one of these versions, you could try using the ARCCSSive module (https://github.com/coecms/arccssive) to locate it, or ask the helpdesk.
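
When several versions come back, you may want to pick the newest one yourself. The sketch below assumes the vYYYYMMDD directory convention visible in the paths above; path_version and newest are hypothetical helpers, not part of clef.

```python
import re

# A version directory looks like ".../v20140110/..." in the paths shown above
VERSION_RE = re.compile(r"/v(\d{8})(?:/|$)")


def path_version(path):
    """Return the version string (e.g. '20140110') found in a path, or None."""
    match = VERSION_RE.search(path)
    return match.group(1) if match else None


def newest(paths):
    """Return the path with the most recent version directory.

    Raises ValueError if no path contains a version directory.
    """
    versioned = [p for p in paths if path_version(p) is not None]
    # YYYYMMDD strings sort chronologically when compared as text
    return max(versioned, key=path_version)


paths = [
    "/g/data1/ua6/unofficial-ESG-replica/tmp/tree/cmip-dn1.badc.rl.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadCM3/historical/day/atmos/day/r1i1p1/v20110728/ta/",
    "/g/data1/ua6/unofficial-ESG-replica/tmp/tree/esgf-data1.ceda.ac.uk/thredds/fileServer/esg_dataroot/cmip5/output1/MOHC/HadCM3/historical/day/atmos/day/r1i1p1/v20140110/ta/",
]
print(path_version(newest(paths)))
```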

tips

If your search does not return any results, try again later. The tool searches the ESGF website first; sometimes one or more nodes are disconnected and the returned results are incomplete. Try the --local flag to at least get what is available locally. For CMIP5 you can use the older ARCCSSive tool if in doubt.

Integrating the ESGF search in your code

The code sub-module contains the functions used to run the --local option; you can use them to integrate this query into your own Python scripts:

from clef.code import *

After importing them you need to open a connection with the NCI MAS database to be able to run your queries:

db = connect()
s = Session()

The search function takes three inputs: the db session, the project (currently ‘cmip5’ or ‘cmip6’) and a dictionary containing the query constraints:

results = search(s, project='cmip5', **constraints)

The keys available to define your constraints depend on the project you are querying and the attributes stored by the database. You can use any of the facets used for ESGF but in future we will be adding other options based on extra fields which are stored as attributes.

Examples

constraints = {'variable': 'tas', 'model': 'MIROC5', 'cmor_table': 'day', 'experiment': 'rcp85'}
results = search(s, project='cmip5', **constraints)
results[0]
{'filenames': ['tas_day_MIROC5_rcp85_r1i1p1_20060101-20091231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20500101-20591231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20200101-20291231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20800101-20891231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20600101-20691231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20100101-20191231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20900101-20991231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20700101-20791231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20400101-20491231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_20300101-20391231.nc',
  'tas_day_MIROC5_rcp85_r1i1p1_21000101-21001231.nc'],
 'project': 'CMIP5',
 'institute': 'MIROC',
 'model': 'MIROC5',
 'experiment': 'rcp85',
 'frequency': 'day',
 'realm': 'atmos',
 'r': '1',
 'i': '1',
 'p': '1',
 'ensemble': 'r1i1p1',
 'cmor_table': 'day',
 'version': '20120710',
 'variable': 'tas',
 'pdir': '/g/data1b/al33/replicas/CMIP5/output1/MIROC/MIROC5/rcp85/day/atmos/day/r1i1p1/v20120710/tas',
 'periods': [('20060101', '20091231'),
  ('20500101', '20591231'),
  ('20200101', '20291231'),
  ('20800101', '20891231'),
  ('20600101', '20691231'),
  ('20100101', '20191231'),
  ('20900101', '20991231'),
  ('20700101', '20791231'),
  ('20400101', '20491231'),
  ('20300101', '20391231'),
  ('21000101', '21001231')],
 'fdate': '20060101',
 'tdate': '21001231',
 'time_complete': True}

search returns a list of dictionaries, one for each dataset. The first result shows the dictionary content. The last key, time_complete, is the result of a check run on the time axis built by joining together the periods of the files: it is True if the time axis is contiguous and False otherwise. NB: this is calculated using only the dates listed in the files; the actual timesteps have not been checked.
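
To illustrate what a contiguity check on the periods might look like, here is a minimal sketch. It is not clef's own implementation: is_time_contiguous is a hypothetical name, and the sketch assumes daily data on a standard calendar, which does not hold for models using noleap or 360-day calendars.

```python
from datetime import datetime, timedelta


def is_time_contiguous(periods):
    """Return True if the sorted (start, end) periods have no gaps or overlaps.

    Dates are 'YYYYMMDD' strings; a standard calendar is assumed.
    """
    parsed = sorted(
        (datetime.strptime(start, "%Y%m%d"), datetime.strptime(end, "%Y%m%d"))
        for start, end in periods
    )
    # Each period must start the day after the previous one ends
    for (_, prev_end), (next_start, _) in zip(parsed, parsed[1:]):
        if next_start != prev_end + timedelta(days=1):
            return False
    return True


# A few of the periods from the result above, deliberately out of order
periods = [("20060101", "20091231"), ("20200101", "20291231"), ("20100101", "20191231")]
print(is_time_contiguous(periods))
```

Dropping the middle decade from the list makes the check return False, which is the situation time_complete is meant to flag.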

Both the keys and values of the constraints are checked before being passed to the query function. This means that if you pass a key or a value that does not exist for the chosen project, the function will print a list of valid values and then exit. Let’s rewrite the constraints dictionary to show an example:

constraints = {'v': 'tas', 'm': 'MIROC5', 'table': 'day', 'e': 'rcp85', 'activity':'CMIP'}
results = search(s, project='cmip5', **constraints)
Warning activity is not a valid constraint name
Valid constraints are:
dict_values([['source_id', 'model', 'm'], ['realm'], ['time_frequency', 'frequency', 'f'], ['variable_id', 'variable', 'v'], ['experiment_id', 'experiment', 'e'], ['table_id', 'table', 'cmor_table', 't'], ['member_id', 'member', 'ensemble', 'en', 'mi'], ['institution_id', 'institution', 'institute'], ['experiment_family']])

You can see that the function told us ‘activity’ is not a valid constraint for CMIP5; in fact it can be used only with CMIP6. NB: the search accepted all the other abbreviations, since more than one term is allowed for each key. The full list is available from the GitHub repository: https://github.com/coecms/clef/blob/master/clef/data/valid_keys.json
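
The alias handling can be pictured as a lookup from every accepted term to one canonical key. The sketch below copies the alias groups from the CMIP5 message above and assumes, purely for illustration, that the first name in each group is the canonical one; normalise_constraints is a hypothetical helper, not clef's own code.

```python
# Alias groups copied from the CMIP5 message above; treating the first name
# in each group as canonical is an assumption made for this sketch.
VALID_KEYS = [
    ["source_id", "model", "m"],
    ["realm"],
    ["time_frequency", "frequency", "f"],
    ["variable_id", "variable", "v"],
    ["experiment_id", "experiment", "e"],
    ["table_id", "table", "cmor_table", "t"],
    ["member_id", "member", "ensemble", "en", "mi"],
    ["institution_id", "institution", "institute"],
    ["experiment_family"],
]


def normalise_constraints(constraints, valid_keys=VALID_KEYS):
    """Map every alias to the first name of its group; reject unknown keys."""
    lookup = {alias: group[0] for group in valid_keys for alias in group}
    normalised = {}
    for key, value in constraints.items():
        if key not in lookup:
            raise ValueError(f"{key} is not a valid constraint name")
        normalised[lookup[key]] = value
    return normalised


print(normalise_constraints({"v": "tas", "m": "MIROC5", "table": "day", "e": "rcp85"}))
```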

More complex queries

We are adding functions that facilitate more complex queries; one example is the ‘matching’ function. It is easiest to understand how matching works by starting from an example. A user might want to get all the model/ensemble combinations which have both tasmin and tasmax. To do this with the standard query, I would have to pass these constraints to a query:

constraints = {'variable': 'tasmin', 'cmor_table': 'day', 'experiment': 'rcp85'}

find all the model/ensemble combinations which have tasmin / rcp85 / day, then repeat the same for ‘tasmax’, and finally check which model/ensemble combinations have both. The ‘matching’ function simplifies all of this. First of all, I can pass multiple values to it:

constraints = {'variable': ['tasmin','tasmax'], 'cmor_table': ['day'], 'experiment': ['rcp85']}

Then I need to define the attribute for which I want all the values to be present:

allvalues=['variable']

I need to define the attributes whose combination defines a simulation, here model and ensemble, i.e. each model/ensemble combination defines a simulation. In some cases you might also want to add the version to these:

fixed=['model','ensemble']

Finally we call matching:

results, selection = matching(s, allvalues, fixed, **constraints)

The function returns two lists. The first, ‘results’, contains a dictionary for each simulation that has either tasmin or tasmax for {rcp85, day}. The second, ‘selection’, contains only the simulations that have both ‘tasmin’ and ‘tasmax’.

Other examples

Find simulations which have ‘tasmin’ and ‘tasmax’ for both the ‘rcp85’ and ‘rcp45’ experiments:

constraints = {'variable': ['tasmin','tasmax'], 'cmor_table': ['day'], 'experiment': ['rcp85', 'rcp45']}
allvalues=['variable', 'experiment']
fixed=['model','ensemble']
results, selection = matching(s, allvalues, fixed, **constraints)

Find simulations which have ‘tasmin’ and ‘tasmax’ for either ‘rcp85’ or ‘rcp45’ experiments:

constraints = {'variable': ['tasmin','tasmax'], 'cmor_table': ['day'], 'experiment': ['rcp85', 'rcp45']}
allvalues=['variable']
fixed=['model','ensemble', 'experiment']
results, selection = matching(s, allvalues, fixed, **constraints)
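
The roles of allvalues and fixed can be made concrete with a few lines of plain Python working on a list of result dictionaries. This is a sketch of the selection logic only, under the assumption that each result carries the relevant keys; select_complete is a hypothetical name and is not how clef's matching is implemented.

```python
from collections import defaultdict
from itertools import product


def select_complete(results, allvalues, fixed, constraints):
    """Return the simulations that provide every requested value combination.

    results: list of dicts, one per dataset
    allvalues: keys for which all requested values must be present
    fixed: keys whose combination identifies one simulation
    constraints: dict mapping each key in allvalues to its requested values
    """
    found = defaultdict(set)
    for row in results:
        simulation = tuple(row[key] for key in fixed)
        found[simulation].add(tuple(row[key] for key in allvalues))
    # Every combination of requested values a simulation must provide
    required = set(product(*(constraints[key] for key in allvalues)))
    return sorted(sim for sim, combos in found.items() if required <= combos)


# Illustrative rows only, not real query results
rows = [
    {"model": "MIROC5", "ensemble": "r1i1p1", "variable": "tasmin"},
    {"model": "MIROC5", "ensemble": "r1i1p1", "variable": "tasmax"},
    {"model": "ACCESS1-0", "ensemble": "r1i1p1", "variable": "tasmin"},
]
selection = select_complete(
    rows, ["variable"], ["model", "ensemble"], {"variable": ["tasmin", "tasmax"]}
)
print(selection)
```

Adding a key to fixed makes the grouping finer, while adding it to allvalues makes the requirement stricter, which is the difference between the two rcp85/rcp45 examples above.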

By default we search CMIP5; to do the same for CMIP6 we need to change the project value and use the right facet names. Find simulations which have ‘tasmin’ and ‘tasmax’ for the ‘piControl’ experiment:

constraints = {'variable_id': ['tasmin','tasmax'], 'table_id': ['day'], 'experiment_id': ['piControl']}
allvalues=['variable_id']
fixed=['source_id','member_id']
results, selection = matching(s, allvalues, fixed, project='CMIP6', **constraints)

In particular for CMIP6, for which data is still being published, you might want to execute the same search on the remote ESGF data catalogue rather than locally. In that case we change the ‘local’ argument from its default value True to False:

constraints = {'variable_id': ['tasmin','tasmax'], 'table_id': ['day'], 'experiment_id': ['piControl']}
allvalues=['variable_id']
fixed=['source_id','member_id']
results, selection = matching(s, allvalues, fixed, project='CMIP6', local=False, **constraints)

NB: abbreviated constraint names currently won’t work here; you have to use the attributes’ full names.

Architecture

_images/architecture.png

clef --missing

  1. Resolve any constraint wildcards by looking for matches in the local database, e.g.:

    SELECT DISTINCT model
    FROM esgf_dataset
    WHERE model ILIKE 'ACCESS%'
    ;
    
  2. Call find_missing_id() with the resolved constraints:

       1. Search ESGF using the constraints, returning the checksum of each matching file
       2. Match the ESGF checksums against the local metadata database
       3. Return the ESGF id for any files whose checksums cannot be found in the local database
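
The checksum matching in the last three steps amounts to a set difference. In clef this happens in the database; the sketch below shows the same idea in plain Python, assuming each ESGF record is a dict with 'id' and 'checksum' keys. The function name, record layout and sample values are all made up for illustration.

```python
def find_missing(esgf_files, local_checksums):
    """Return the ESGF ids of files whose checksum is not known locally.

    esgf_files: iterable of dicts with 'id' and 'checksum' keys (assumed layout)
    local_checksums: collection of checksums present in the local database
    """
    known = set(local_checksums)
    return [f["id"] for f in esgf_files if f["checksum"] not in known]


# Made-up records, for illustration only
esgf_files = [
    {"id": "file-a", "checksum": "aaa"},
    {"id": "file-b", "checksum": "bbb"},
    {"id": "file-c", "checksum": "ccc"},
]
missing = find_missing(esgf_files, local_checksums=["aaa", "ccc"])
print(missing)
```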

clef --local

  1. Query the local database for files:

    SELECT path
    FROM esgf_paths
    NATURAL JOIN esgf_metadata_dataset_link
    NATURAL JOIN esgf_dataset
    WHERE model ILIKE 'ACCESS%'
    -- ...
    ;
    
  2. If using the --latest flag, query ESGF using the constraints to retrieve checksums, match these checksums against the local results and return only those found. This is the default behaviour.

CleF API

clef.db

Database connection functions

class clef.db.Session

sqlalchemy.orm.session.Session connected to the MAS database

connect() must be called before creating any new sessions

clef.db.connect(url='postgresql://clef.nci.org.au:5432/postgres', user=None, debug=False)

Connects to the MAS database and sets up the session

Parameters:
  • url – Database URL
  • user – Username (password will be prompted via getpass)
  • debug – Print debugging information
Returns:

sqlalchemy.engine.Engine

clef.model

Model of NCI’s MAS database

The MAS database has two main tables: path and metadata. These base tables are available in the model as Path and Metadata; they have a SQLAlchemy relationship so that the two tables can be joined in queries.

There may be multiple Metadata entries for a single Path, these represent different metadata types, such as checksums, netCDF attributes and POSIX file attributes. The type can be identified from Metadata.type, and is used as a polymorphic identity to SQLAlchemy’s single table inheritance, creating the Checksum, Netcdf and Posix models.

The C5Dataset and C6Dataset models represent datasets like you would find on ESGF, although without a version. They are created in the database from a DISTINCT view of the NetCDF attributes, and can be used to group paths on the filesystem into datasets.

class clef.model.C5Dataset(**kwargs)

A CMIP5-era ESGF dataset

This class only has access to attributes from the file itself, so version information is not present.

See the CMIP documentation for descriptions of the attributes

cmor_table
ensemble
experiment
institute
model
project
realm
time_frequency

class clef.model.C6Dataset(**kwargs)

A CMIP6-era ESGF dataset

This class only has access to attributes from the file itself, so version information is not present.

See the CMIP documentation for descriptions of the attributes

activity_id
experiment_id
frequency
grid_label
institution_id
member_id
nominal_resolution
project
realm
source_id
source_type
sub_experiment_id
table_id
variable_id
variant_label

class clef.model.Checksum(**kwargs)

Checksum of a file on Raijin

md5

md5 checksum

path

type: Path

sha256

sha256 checksum

class clef.model.ExtendedMetadata(**kwargs)

Extra metadata not present in the file’s attributes

class clef.model.Info(**kwargs)

General information about a dataset file

This is a database view, its columns shouldn’t be used for searching as they are large and not indexed.

contact
description
further_info_url
license
parent_experiment_id
source
title
tracking_id
variant_info

class clef.model.Metadata(**kwargs)

Generic base class for Metadata of a file on Raijin

See Posix and Netcdf for specific metadata information

json

Metadata value

path

type: Path

type

Metadata type

class clef.model.Netcdf(**kwargs)

NetCDF metadata of a file on Raijin

As would be found by ncdump -h

attributes

File attributes

dimensions

File dimensions

variables

File variables

class clef.model.Path(**kwargs)

Path of a file on Raijin, with links to metadata

c5dataset

type: C5Dataset

c6dataset

type: C6Dataset

checksum

type: Checksum

netcdf

type: Netcdf

path

File path at NCI

class clef.model.Posix(**kwargs)

Posix metadata of a file on Raijin

As would be found by ls

class clef.model.pg_json_property(attr_name, index, cast_type)

clef.esgf

Functions for searching the ESGF and matching the results against the MAS database

exception clef.esgf.ESGFException

Error from the ESGF API

clef.esgf.esgf_query(query, fields, limit=5000, offset=0, distrib=True, replica=False, latest=None, **kwargs)

Search the ESGF

Searches the ESGF using its API. Keyword arguments not listed here are passed on to the API search, they can either be single values or lists.

Parameters:
  • query (str) – Full text query
  • fields (list) – Fields to return
  • limit (int) – Maximum items to return
  • offset (int) – Starting offset of returned items (use with limit for paging)
  • distrib (bool) – Distribute the search across all nodes
  • replica (bool) – Return replicated datasets
  • latest (bool or None) – Return only latest (True), only not latest (False) or all versions (None)
  • **kwargs – See the ESGF API docs
Returns:

API response from ESGF, decoded from JSON into a Python dict
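
Since limit and offset are meant for paging, a loop along these lines can collect more results than fit in one request. This is a generic sketch: iter_pages and fake_fetch are hypothetical, and fetch stands for a wrapper you would write yourself that calls esgf_query() and returns the list of items from its response, whose layout is not described here.

```python
def iter_pages(fetch, limit=5000):
    """Yield items from a paged search until a short page is returned.

    fetch: callable taking `limit` and `offset` keyword arguments and
           returning a list of items (a hypothetical wrapper for this sketch)
    """
    offset = 0
    while True:
        page = fetch(limit=limit, offset=offset)
        yield from page
        if len(page) < limit:
            # A page shorter than the limit means there is nothing left
            break
        offset += limit


# Stand-in for a remote search: 12 fake items served in pages
items = [f"item-{i}" for i in range(12)]


def fake_fetch(limit, offset):
    return items[offset:offset + limit]


collected = list(iter_pages(fake_fetch, limit=5))
print(len(collected))
```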

clef.esgf.find_checksum_id(query, **kwargs)

Get checksums and IDs of matching files from ESGF

Searches ESGF using esgf_query(), then converts the response into a SQLAlchemy selectable for further processing

Parameters:
  • **kwargs – See esgf_query()
Returns:
Values table of matching File objects, containing
  • checksum
  • id
  • dataset_id
  • title
  • version

This table can be joined against the MAS database tables

clef.esgf.find_local_path(session, subq, oformat='file')

Find the filesystem paths of ESGF matches

Converts the results of match_query() to local filesystem paths, either to the file itself or to the containing dataset.

Parameters:
  • oformat ('file' or 'dataset') – Return the path to the file or the dataset directory
  • subq – Result of esgf_query()
Returns:

Iterable of strings with the paths to either files or datasets

clef.esgf.find_missing_id(session, subq, oformat='file')

Returns the ESGF id for each file in the ESGF query that doesn’t have a local match

Parameters:
  • oformat ('file' or 'dataset') – Return the ESGF id of the file or of the dataset
  • subq – Result of esgf_query()
Returns:

Iterable of strings with the ESGF file or dataset id

Convert search terms to an ESGF search URL

Returns a link to the user-facing ESGF web search matching a particular query. This is helpful for error messages: users can follow the URL to find the matches as ESGF sees them.

Note that this link is to the ESGF user-facing search page, rather than the web API that esgf_query() uses.

Parameters:
  • **kwargs – As esgf_query()
Returns:

URL to the ESGF search website

Return type:

str

clef.esgf.match_query(session, query, latest=None, **kwargs)

Match ESGF results against clef.model.Path

Matches the results of find_checksum_id() with the Path table. If latest is True the checksums will be matched, otherwise only the file name is used in order to spot outdated versions that have been removed from ESGF.

Parameters:
  • latest (bool) – Match the checksums (True) or filenames (False)
  • **kwargs – See esgf_query()
Returns:

Joined result of clef.model.Path and find_checksum_id()
