Framework on data quality
Most organizations today confront data quality (DQ) problems with ad hoc approaches that fix individual errors as they appear. However, such efforts do not assure that data will be fit for use for every purpose. This Task Group is working on organizing the concepts related to both DQ needs and DQ solutions for the Assessment and Management of fitness for use of biodiversity data. We expect that the outcomes of the Task Group will allow the Biodiversity Informatics community to join efforts in tackling DQ issues by sharing and reusing DQ requirements, methods, tools, services, workflows and best practices, which can be used for DQ measurement, validation, recommendation, and error prevention and correction.
This task group has completed its work. Please see the GitHub repository (linked above) for the results.
Convenor
Allan Koch Veiga
Motivation
- A consistent approach to assessing and managing DQ is currently critical for biodiversity data users. However, achieving this goal has been particularly difficult because of the idiosyncrasies inherent in the concept of quality. DQ assessment and management cannot be performed unless the quality needs have been clearly established from a data user's standpoint.
- We understand “DQ Assessment” as the act performed by data users or curators to judge how fit for use data (a single record or a dataset) are for a specific purpose, and “DQ Management” as the act performed by any actor (software, people, institution) to improve DQ so that data become fit for a wider range of uses.
- A conceptual framework should help the Biodiversity Informatics community describe, from a data user's perspective, the meaning of “fitness for use” in a common and standardized way.
- A collaborative model can generate a searchable repository of common and reusable components such as DQ Profiles (definitions of what “quality” means for a specific purpose), DQ policies, dimensions (measurable aspects of quality), criteria, enhancements, specifications (methods) and mechanisms (tools, services, workflows) for a range of data uses, enabling institutions to compose their own DQ needs and solutions to better suit their goals concerning fitness for use.
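As a rough illustration of how such reusable components might fit together, the sketch below models a DQ Profile as a purpose plus a set of dimensions and criteria, and uses it to assess a single record. It is a minimal sketch and not the Framework's formal definition: the class names, the `assess` method, the example profile and the choice of fields checked are all assumptions made for illustration. The record keys follow Darwin Core term names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]  # one data record, e.g. a species occurrence


@dataclass
class Dimension:
    """A measurable aspect of quality (hypothetical representation)."""
    name: str
    measure: Callable[[Record], float]  # value between 0.0 and 1.0


@dataclass
class Criterion:
    """A pass/fail condition a record must meet for a given use."""
    name: str
    accepts: Callable[[Record], bool]


@dataclass
class DQProfile:
    """What "quality" means for one specific purpose of use."""
    purpose: str
    dimensions: List[Dimension] = field(default_factory=list)
    criteria: List[Criterion] = field(default_factory=list)

    def assess(self, record: Record) -> dict:
        """DQ Assessment: measure each dimension and list failed criteria."""
        return {
            "measures": {d.name: d.measure(record) for d in self.dimensions},
            "failed": [c.name for c in self.criteria if not c.accepts(record)],
        }


def coordinate_completeness(record: Record) -> float:
    """Fraction of the two coordinate fields that are filled in."""
    fields = ("decimalLatitude", "decimalLongitude")
    present = sum(1 for f in fields if record.get(f) not in (None, ""))
    return present / len(fields)


def has_scientific_name(record: Record) -> bool:
    return bool(record.get("scientificName"))


# Hypothetical profile for one purpose of use.
profile = DQProfile(
    purpose="species distribution modelling",
    dimensions=[Dimension("coordinate completeness", coordinate_completeness)],
    criteria=[Criterion("has scientific name", has_scientific_name)],
)

report = profile.assess(
    {"scientificName": "Puma concolor", "decimalLatitude": -23.5}
)
print(report)
```

Because dimensions and criteria are separate objects, an institution could reuse the same ones across several profiles, each describing fitness for a different use.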
Goals, outputs and outcomes
- A formal Conceptual Framework for the Assessment and Management of the fitness for use of biodiversity data.
- Establish a “common language” in order for the Biodiversity Informatics community to express and share their understanding of DQ needs and solutions, to increase the reusability and decrease the duplication of efforts.
- A case study that describes how to use the Conceptual Framework for performing the Assessment and Management of fitness for use in an institution.
- Methods and guidelines to use the Framework.
- Establish a common vocabulary for the whole DQ Interest Group.
Strategy
- Join, organize and formalize ideas and concepts concerning DQ in a Conceptual Framework.
- Evaluate the proposed Framework with a case study.
- Propose a method for using/applying the Framework for the Assessment and Management of fitness for use.
- Support the Biodiversity Informatics community with guidelines and training on the Framework.
- Support and follow the application of the Framework for Assessment and Management in some Biodiversity Informatics organizations.
- Evaluate and enhance the Framework and its vocabulary by promoting discussions and forums with the DQ Interest Group members.
Becoming involved
- This Task Group would welcome anyone who has a practical or theoretical interest in data quality and/or has experience with ontology, data/information/knowledge management, data policy, data governance or any stage of the life cycle of biodiversity data (capturing, handling or using data).
- Contact the Convenor.
Resources
- Veiga, AK, Saraiva, AM, Chapman, AD, Morris, PJ, Gendreau, C, Schigel, D, Robertson, TJ (2017). A conceptual framework for quality assessment and management of biodiversity data. PLOS ONE 12(6): e0178731. https://doi.org/10.1371/journal.pone.0178731
- Veiga, AK, Cartolano Jr., EA, Saraiva, AM (2014). Data Quality Control in Biodiversity Informatics: The Case of Species Occurrence Data. IEEE Latin America Transactions. ISSN: 1548-0992. Volume: 12, Issue: 4. Available: http://www.ewh.ieee.org/reg/9/etrans/ieee/issues/vol12/vol12issue4June2014/20KochVeiga.htm
- Veiga, AK, Saraiva, AM (2012). A Guideline for Dealing with Data Quality. In proceedings of Biodiversity Information Standards (TDWG) 2012 Annual Conference. Beijing, China. https://static.tdwg.org/conferences/2012/presentations/AllanKochVeigaTDWG.pdf
- Veiga, AK, Saraiva, AM, Cartolano, EA (2012). Data Quality Concepts and Methods Applied to Biological Species Occurrence Data. In book: ICT for Agriculture, rural development and environment – Where we are? Where we will go? Czech Centre for Science and Society Wirelessinfo.
- Wang R, Reddy M, Kon H (1995). Toward quality data: An attribute-based approach. Journal of Decision Support Systems. vol. 13, no. 3-4, pp. 349-372. https://doi.org/10.1016/0167-9236(93)E0050-N
- Strong DM, Lee Y, Wang RY (1997). Data Quality in Context. Communications of the ACM. pp. 103-110.
- Wang RY, Strong DM (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems. vol. 12, no. 4, pp. 5–33. http://mitiq.mit.edu/Documents/Publications/TDQMpub/14_Beyond_Accuracy.pdf
- Ge M, Helfert M (2007). A review of information quality research - develop a research agenda. ICIQ, pp. 76-91. MIT.
- Dalcin, EC (2005). Data quality concepts and techniques applied to taxonomic databases. PhD dissertation, University of Southampton, United Kingdom.
- Chapman, AD (2005a). Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 75p. http://www.gbif.org/document/80528
- Chapman, AD (2005b). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 61p. https://doi.org/10.15468/doc.jrgg-a190
- Otegui J, Ariño AH, Encinas MA, Pando F (2013) Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). PLoS ONE 8(1): e55144. https://doi.org/10.1371/journal.pone.0055144
- http://community.gbif.org/pg/groups/21292/gbiftdwg-biodiversity-data-quality-interest-group/
- https://github.com/tdwg/infrastructure/issues/48