27th June 2022
Artificial intelligence (AI) modelling is only possible with large amounts of quality data. ‘Garbage in, garbage out’ is an often-quoted truism in this field: AI models need to be trained on large volumes of data, and that data must also be quality controlled, pre-processed and mathematically manipulated to achieve the desired performance from the model, writes Jenny Hanafin, Earth Observation Activity Lead at ICHEC.
This process requires both human and computational effort, and creating training data is usually the most costly part of an AI exercise. Training datasets (TDS) are therefore a valuable resource, but sharing them openly is not straightforward.
With the arrival of New Space, the cost of putting satellites into space has come down by orders of magnitude, resulting in a flurry of new satellites in orbit and an exponential increase in the volume of data available. With such large volumes of earth observation (EO) data now available, AI is becoming necessary to process and analyse the data and glean insights from it. However, the lack of training datasets is becoming a major bottleneck to more widespread and systematic use of machine learning in EO. The aim of the AIREO project was to provide resources to standardise aspects of TDS, along with tools that allow data creators to share their data so that users can use it appropriately.
One major issue is a knowledge gap between AI practitioners (TDS users, usually IT specialists) and EO experts (TDS creators, usually scientists or data scientists). The users often do not have experience with essential concepts such as map projections, file formats, calibration, and quality assurance. Other issues include: a lack of, or inaccessibility of, high-quality TDS; an absence of standards, resulting in inconsistent and heterogeneous TDS; limited discoverability and interoperability of TDS; and a lack of best practices and guidelines for generating, structuring and describing TDS. To address these issues, some basic principles were established: TDS should be self-explanatory; TDS should be shared following FAIR principles; and TDS should be published in a form ready for use in AI/ML applications. Based on these principles, a set of specifications and best practice guidelines for data creators and users was produced, along with a Python library and some sample datasets to help users apply the recommendations.
Partners
The AIREO activity is led by the Irish Centre for High-End Computing at National University of Ireland, Galway, in collaboration with Ireland’s Centre for Applied AI (CeADAR) at University College Dublin and is funded by the European Space Agency (ESA) Phi-Lab. ICHEC is the Irish national centre for high-performance computing and hosts the Irish archive of ESA data.
Community driven
One of the keys to success in this project was engagement with expert community members to identify how to develop the resources to fulfil the needs of both data creators and users. More than 100 experts across the globe have provided input through the AIREO network. They have had access to all up-to-date material released by the project and have helped with workshops, one-to-one consultations and surveys organised by the project to give feedback and provide direction for further work.
Documentation and metadata specification
The specifications and guidelines were developed using FAIR principles, which stipulate that data should be findable, accessible, interoperable, and reusable. To make TDS more reusable, one of the aims of the specification was to standardise the metadata included with an AIREO TDS. The specification also establishes different levels of required, recommended and optional metadata elements to help data creators prioritise the most important metadata.
The metadata is based on existing Open Geospatial Consortium (OGC) and SpatioTemporal Asset Catalog (STAC) standards and specifications relevant to earth observation data and machine learning applications, with some additions to include innovative elements identified by the AIREO activity. Additional elements include quality indicator metadata to help data providers:
- publish structured data quality estimates and elements for users;
- assess the FAIRness of their datasets and where improvements could be made and convey this to users; and
- describe data provenance: sources, processing history and feature engineering recipes.
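As a rough illustration of how such elements might sit alongside STAC-style fields, the sketch below builds a metadata record as a plain Python dictionary. The `type`, `id`, `bbox` and `properties` keys follow general STAC Item conventions; the `quality`, `fairness` and `provenance` keys, the `build_tds_record` helper and all values shown are hypothetical placeholders and are not taken from the AIREO specification.

```python
import json


def build_tds_record(tds_id, bbox, start, end, sources, processing_steps):
    """Build a STAC-style metadata record for a training dataset (illustrative only)."""
    return {
        "type": "Feature",
        "id": tds_id,
        "bbox": bbox,  # [west, south, east, north] in degrees
        "properties": {
            "start_datetime": start,
            "end_datetime": end,
            # Hypothetical quality indicators; real values would come from validation.
            "quality": {"label_accuracy": None, "completeness": None},
            # Hypothetical FAIRness self-assessment, one flag per FAIR principle.
            "fairness": {
                "findable": True,
                "accessible": True,
                "interoperable": False,
                "reusable": False,
            },
            # Provenance: sources, processing history and feature engineering recipes.
            "provenance": {
                "sources": sources,
                "processing_history": processing_steps,
                "feature_engineering": [],
            },
        },
    }


# Placeholder inputs, chosen only to make the example runnable.
record = build_tds_record(
    tds_id="example-tds",
    bbox=[-10.5, 51.4, -5.9, 55.4],
    start="2021-01-01T00:00:00Z",
    end="2021-12-31T23:59:59Z",
    sources=["placeholder-satellite-product"],
    processing_steps=["cloud masking", "tiling to 256x256 patches"],
)
print(json.dumps(record, indent=2))
```

Keeping the record as plain, JSON-serialisable data means it could be stored next to the dataset and read by any catalogue tool, whatever the final field names turn out to be.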
Compliance
A key innovation is the definition of AIREO compliance levels. Compliance levels allow users to quickly assess whether a dataset is fully described in metadata and is FAIR-compliant, whether it contains only the minimal required set of metadata or whether it is somewhere between these two. These levels also assist data providers to prioritise which metadata they could focus on to provide a more accessible dataset for users.
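The idea of grading a dataset by how much of its metadata is present can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the element names in `REQUIRED`, `RECOMMENDED` and `OPTIONAL`, the level labels and the `compliance_level` function are all hypothetical, and the actual AIREO levels and element lists are defined in the specification, not here.

```python
# Hypothetical element groups; the real lists are defined by the AIREO specification.
REQUIRED = {"id", "description", "license"}
RECOMMENDED = {"provenance", "quality"}
OPTIONAL = {"fairness_assessment"}


def compliance_level(metadata):
    """Grade a metadata dictionary by which groups of elements are filled in."""
    # An element counts as present only if it has a non-empty value.
    present = {
        key for key, value in metadata.items()
        if value is not None and value != "" and value != [] and value != {}
    }
    if REQUIRED - present:
        return "non-compliant"  # at least one required element is missing
    extras = RECOMMENDED | OPTIONAL
    if extras <= present:
        return "full"  # every recommended and optional element is present
    if extras & present:
        return "intermediate"  # some, but not all, extra elements are present
    return "minimal"  # only the required set is present


minimal = {"id": "example-tds", "description": "Demo dataset", "license": "CC-BY-4.0"}
print(compliance_level(minimal))
print(compliance_level({**minimal, "provenance": ["cloud masking"]}))
```

A data provider could use the same check in reverse, listing the missing elements from the next level up to decide which metadata to add first.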
AIREO Python library
The open-source AIREO Python library provides basic functionality for data providers and users and can be accessed through the following link: aireo_lib. The aims of the library functions are:
- to help dataset creators to generate and document TDS which are FAIR, are of high quality and adhere to the AIREO specifications;
- to assist dataset users to perform high level exploratory data analysis on an AIREO TDS. The library allows the users to explore the statistical properties of the TDS through the metadata in the catalogue and to visualise key aspects of the data; and
- to help users load an AIREO TDS and access it through common data formats used by the ML community (numpy arrays, xarray, etc.) so it can be used in training ML models in widely used libraries and platforms with minimal effort.
Jenny Hanafin is the Earth Observation Activity Lead at ICHEC. Her extensive experience in many aspects of remote sensing ranges from operating satellite sensors for EUMETSAT while a postdoc at Imperial College London to developing a system to retrieve atmospheric humidity from the Ordnance Survey Ireland network of GPS receivers for use in the Met Éireann forecast model. Her qualifications include a BSc in Marine Science from the National University of Ireland, Galway, and a PhD in Physical Oceanography and Meteorology from the Rosenstiel School of Marine and Atmospheric Science, University of Miami.
Future developments
At this point, the AIREO Specification, Best Practice Guidelines and Python library are available to all interested parties on the AIREO website: www.aireo.net.
This release is a vital step towards making more training datasets available. The resources are now at the stage where they will develop through hands-on use by the community, and the feedback received will inform future versions and updates.
T: 01 524 1608
E: jenny.hanafin@ichec.ie
W: www.ichec.ie