Data quality tests and assertions

The Task Group will provide a report of the practical tests, assertions, principles, software and key references associated with assessing the data quality of biodiversity records. Together with the work of the other Data Quality Task Groups, this report should provide the basis for a standard approach to data quality that should be used by all agencies providing biodiversity-related data.

GitHub

Convener

Lee Belbin

Motivation

  • Other than data availability, ‘Data Quality’ is probably the most significant issue for users of biodiversity data, and this is especially so for the research community.
  • This Task Group is reviewing practical aspects relating to ‘data quality’ with the goal of providing Best Current Practice at the key interface between data users and data providers: tests and assertions.
  • If an internationally agreed standard suite of core tests and resulting assertions can be used by all Data Providers, and hopefully Data Collectors, then greater and more appropriate use could be made of biodiversity data.
  • Data providers, and particularly aggregators such as GBIF and its nodes, would have increased credibility with the user communities and could provide more effective information for judging fitness for use.
  • The tests and assertions will initially be based on the Darwin Core standard.
  • I (Lee Belbin) raised the need for a practical set of tools related to Data Quality at the TDWG 2010 Conference at Woods Hole. What I was asking for was at least the public display of the rules that were being used by GBIF to flag issues in their records. This didn’t happen, so we are trying again and we will include any agency that provides biodiversity records to the public.
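
To make the idea of a test and its resulting assertion concrete, the following is a minimal sketch in Python. It is illustrative only and is not the Task Group's test suite: the test labels (`LATITUDE_IN_RANGE`, `LONGITUDE_IN_RANGE`), the status values and the `Assertion` structure are assumptions made for this example. Only the Darwin Core term names `decimalLatitude` and `decimalLongitude` come from the standard.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]  # a flat record keyed by Darwin Core term names


@dataclass(frozen=True)
class Assertion:
    """Outcome of one test on one record (hypothetical structure)."""
    test: str     # hypothetical test label
    status: str   # "PASS", "FAIL" or "NOT_RUN" (assumed vocabulary)
    comment: str  # human-readable explanation


def range_test(name: str, term: str, low: float, high: float) -> Callable[[Record], Assertion]:
    """Build a test that checks a numeric Darwin Core term against bounds."""
    def test(record: Record) -> Assertion:
        raw = record.get(term)
        if raw is None or str(raw).strip() == "":
            # Nothing to test: report that, rather than a failure.
            return Assertion(name, "NOT_RUN", f"{term} is empty")
        try:
            value = float(raw)
        except (TypeError, ValueError):
            return Assertion(name, "FAIL", f"{term} {raw!r} is not a number")
        if low <= value <= high:
            return Assertion(name, "PASS", "")
        return Assertion(name, "FAIL", f"{term} {value} is outside [{low}, {high}]")
    return test


# A tiny, assumed "core suite" of two tests.
CORE_TESTS = [
    range_test("LATITUDE_IN_RANGE", "decimalLatitude", -90.0, 90.0),
    range_test("LONGITUDE_IN_RANGE", "decimalLongitude", -180.0, 180.0),
]


def run_tests(record: Record, tests=CORE_TESTS) -> List[Assertion]:
    """Apply every test to a record and collect the resulting assertions."""
    return [test(record) for test in tests]


if __name__ == "__main__":
    example = {"decimalLatitude": "95.2", "decimalLongitude": "147.3"}
    for assertion in run_tests(example):
        print(assertion)
```

Keeping each test as a small function that returns an assertion, rather than altering the record, means the same published rules could be applied by any provider and the flags shown to users alongside the original data.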

Goals, outputs and outcomes

  • A set of tests and resulting data assertions that are in use by key Data Providers to flag record issues (January 2017).
  • A set of principles that have emerged in the process of identifying and refining the tests and assertions. These would be expected to form the basis of a paper on ‘data quality’ tests and assertions (January 2017).
  • A set of in-use software tools that can be used to assist with data quality (January 2017). These will be based on the GBIF Data Quality software resource.
  • Create a set of fundamental publications related to ‘data quality’ (March 2017).
  • Submit a standard set of tests and assertions for consideration as a TDWG Standard (August 2018).

Strategy

Becoming involved

  • This Task Group would welcome anyone who has a practical interest in data quality and/or has experience with the tests, rules, assertions, tools or workflows.
  • Contact the Convener

History and context

The Task Group was established in 2014 as a Task Group of the TDWG Data Quality Interest Group, viz. Task Group 2: Tools, Services and Workflows. The new name and charter better reflect the work and goals of the Task Group: tests and assertions are more stable and longer lasting than the tools, which will link to the relevant tests, and the workflows, which will also depend on them. Services seem better placed with the TDWG Biodiversity Services and Clients IG.

Resources

  • Belbin, L., Daly, J., Hirsch, T., Hobern, D. and LaSalle, J. (2013). A specialist’s audit of aggregated occurrence records: An ‘aggregator’s’ perspective. ZooKeys 305: 67–76. https://doi.org/10.3897/zookeys.305.5438
  • Chapman, AD (2005a). Principles and Methods of Data Cleaning – Primary Species and Species Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 75p. Available online at http://www.gbif.org/document/80528
  • Chapman, AD (2005b). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 61p. https://doi.org/10.15468/doc.jrgg-a190
  • Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P, Chavan V (2012). Quality assurance and intellectual property rights in advancing biodiversity data publications version 1.0, Copenhagen: Global Biodiversity Information Facility, 40p, ISBN: 87‐92020‐49‐6.
  • Mesibov R (2013) A specialist’s audit of aggregated occurrence records. ZooKeys 293: 1–18. https://doi.org/10.3897/zookeys.293.5111
  • Otegui J, Ariño AH, Encinas MA, Pando F (2013) Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). PLoS ONE 8(1): e55144. https://doi.org/10.1371/journal.pone.0055144