Geochemical Society policy on geochemical databases
Adopted 27 November 2007
Geochemists produce a large amount of data. In the past decade there has been a proliferation of databases that span the gamut from real-time data collected in the field to laboratory analyses and experiments, while simultaneously spanning a range in geological materials from fluids to hard rock. Open access to data, especially those collected with public funds, is already mandated by some divisions of funding agencies, and is discussed by politicians. But, regardless of outside pressures, we, as the Geochemical Society, need to consider whether having centralized databases is in the best interest of our profession, i.e. do databases lead to good science? Stated differently, are there examples of studies where the compilation of a large amount of data has resulted in good science and has moved the field forward? There are many. Perhaps the most cited example in geosciences is the compilation of Dziewonski and Anderson1 that led to the Preliminary Reference Earth Model (PREM). This compilation, which resulted in a reference Earth model, has been cited over 2600 times. In geochemistry the most prolific example is most likely the Zindler and Hart publication2 on chemical geodynamics, which for the first time presented a comprehensive view of the mantle through a compilation of all mantle isotope data. It has been cited over 1300 times. Rudnick and Fountain3 on the composition of the continental crust, and McCollom and Shock4 on the geochemistry of micro-organisms at hydrothermal vents, are other examples where extensive compilations of data have resulted in large, paradigm-shifting steps forward. There are a number of recent papers that use a database to compile data and present important advances; we have referenced just a few examples5. These publications make a compelling argument that excellent science can result from analysis of larger datasets. The establishment of databases will make such larger-scale data analyses more feasible.
With the large amount of data published, keeping track of all data that might bear on a subject is becoming harder to do. When data from one study are placed in a larger context there often is a need to compare the new data with those of previous studies. Instead of having to search the literature, a search of a database can provide the data in digital format that can be efficiently used without cumbersome and error-prone re-entry of the data into digital format.
Databases not only allow analysis of data that would otherwise be difficult or simply not done; they are also an efficient tool for data retrieval.
Most submissions of manuscripts for publication are based on a digital form of the manuscript and the data. It should therefore be straightforward to also submit the data to the appropriate database. For these reasons the Geochemical Society views the development of databases as a very positive direction for our profession and supports their development, and considers it to be in the best interest of our profession that geoscientists submit their data to the appropriate database. The goal of this document is to outline the policies of the Geochemical Society with regard to databases in order to facilitate maximum simplicity in data uploading and mining while increasing our ability to perform more comprehensive, high-quality geochemistry.
This policy is intended for observational data (experimental and analytical) that are collected on samples. These data can be subdivided into two classes:
- Data collected in the field (such as the USGS hydrological data, or a CTD cast from a research vessel);
- Laboratory-generated data from individual researchers.
Although the first type of data is often not published, this policy covers both. Presently, we exclude model data, although these will likely need to be included at a later stage. And, just as the databases are still under development, this document should thus be viewed as a living document.
Databases and electronic data
It is recognized that a large amount of published geochemical data does not fit in any of the existing databases. It is therefore important that all data be available in an electronic format so that they can be easily incorporated in databases, even though the appropriate database might not yet exist. Examples of databases that are in development, but not yet operational, are a database on low temperature kinetics (Critical Zone Exploration Network or CZEN) and the Library of Experimental PhasE Relations (LEPER).
Data generated in the laboratory merit inclusion into a database after the manuscript in which the data are presented is accepted for publication (normally after peer review). There is no guarantee of data quality that can prevent inaccurate data from being input into a database (the same can be said of peer-reviewed journals). Hence, data documentation (see below) will be critical in order for the users to evaluate data quality. We note that the existence of a database makes data evaluation more straightforward for two reasons: the data are available in digital form rather than as points on a plot, and the data can be readily compared to related data from other studies.
Databases should contain tools to retrieve and analyze the data. The primary role of the databases is seen as making the data more readily available to the scientist in an open access format. Data mining tools are in development to allow related datasets to be discovered. Such tools are becoming critical because the amount of data is increasing and multidisciplinary studies are becoming more common.
As databases are presently developed through different initiatives, in different countries, it will be of the utmost importance that the databases achieve interoperability. In order to create a working environment in the geosciences as a whole that allows scientists more comprehensive views of Earth processes, interoperability between widely varying types of datasets will be an essential goal.
The need for data quality and data comparison requires that data be well documented and that metadata be included in the electronic data files. Data in databases should include all data that are needed for a reinterpretation of the result. Databases should include metadata that describe the sample, sample handling and the measurement techniques. The amount of metadata to be included is that which is required for scholarly publication. The Geochemical Society adopts the philosophy described by Staudigel et al.6 on what should be included in the metadata. In general there are two types of metadata:
- Related to sample or experiment
- Related to analysis.
The metadata should:
- Allow reproduction of the experiment;
- Describe quality of the analytical data;
- Allow comparison with other laboratories (standardization).
By providing templates and examples of the metadata associated with similar studies, the community can begin to create a more consistent set of parameters associated with a given dataset. Currently, the literature has many examples of similar studies that report different lists of experimental parameters, making comparisons of results and interpretations difficult. The precise make-up of the metadata will differ from database to database. Metadata entry should require the minimum amount of effort from the submitter.
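As a minimal sketch of what such a template might look like, the following Python fragment defines a hypothetical metadata record split into the two types named above (sample or experiment, and analysis) and reports which entries a submitter has left empty. All class and field names are illustrative assumptions; they are not taken from any existing database or from Staudigel et al.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SampleMetadata:
    """Hypothetical fields describing the sample or experiment."""
    sample_id: str = ""
    material: str = ""   # e.g. "basalt glass", "river water"
    location: str = ""   # collection site or experimental setup
    handling: str = ""   # sample preparation and storage


@dataclass
class AnalysisMetadata:
    """Hypothetical fields describing the analysis."""
    technique: str = ""           # e.g. "TIMS", "ICP-MS"
    laboratory: str = ""
    reference_material: str = ""  # standard measured alongside the samples
    reference_value: str = ""     # result obtained on that standard


@dataclass
class MetadataRecord:
    """Hypothetical template: one record per published dataset."""
    publication: str = ""  # primary citation for the data
    sample: SampleMetadata = field(default_factory=SampleMetadata)
    analysis: AnalysisMetadata = field(default_factory=AnalysisMetadata)


def missing_fields(record: MetadataRecord) -> list[str]:
    """Return the dotted name of every template entry left empty."""
    missing = []
    if not record.publication:
        missing.append("publication")
    for part_name in ("sample", "analysis"):
        part = getattr(record, part_name)
        for f in fields(part):
            if not getattr(part, f.name):
                missing.append(f"{part_name}.{f.name}")
    return missing
```

A database could run a check of this kind at submission time, so that incomplete metadata are flagged to the submitter immediately rather than discovered later by a user of the data.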
The publication is the primary citation for the data, and publications should be cited instead of the database. Databases that house unpublished data should apply the same standards for metadata as those for published data. Those databases can be cited by their URL, and the citation should include the date they were accessed.
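A citation of this kind can be assembled mechanically. The short Python sketch below shows one possible layout; the function name, the citation format, and the database name and URL in the usage note are all hypothetical, and the policy itself prescribes no particular format.

```python
from datetime import date


def cite_database(name: str, url: str, accessed: date) -> str:
    """Format a database citation that records the URL and the access date."""
    return f"{name}, {url} (accessed {accessed.isoformat()})."
```

For example, `cite_database("Example Geochemical Database", "https://example.org/db", date(2007, 11, 27))` yields a string ending in "(accessed 2007-11-27).", which makes explicit which state of an evolving database was consulted.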
Enforcement and communication of policy
The GS has no method of enforcement; the policy the GS adopts can be advisory only. It is therefore important that the officers of the GS set the example in abiding by the GS database and data publication policy. Enforcement of such a policy lies with funding agencies first and publishers second. The GS can encourage. The GS is prepared to take the lead in establishing an international committee that acts as a liaison with the funding agencies. Members from the different countries can approach their own funding agencies.
Guidelines accompanying the GS data policy
These guidelines are some of the practical consequences of the policy as well as some "best practices".
- Databases housing geochemical information should be available to the community at large (open access).
- The metadata, which include sample or experiment description as well as analytical results on standards, are as important as the data. It is the metadata that allow comparison with other labs and use of the data by other studies. Published papers should have a consistent location for the metadata such as an appendix.
- Published data and metadata should be available in electronic format.
- Databases should make templates available for metadata entry. Databases should take the lead in defining what data are metadata.
- Submission of the metadata and data to the database should be made convenient and with the minimum amount of repetition. This is an essential component of a database.
- After final acceptance of a manuscript for publication, any new data that it contains should be submitted for entry into an established database, if an appropriate database exists. This can be enforced by editors and funding agencies.
- In order for published data to be recognized and cited as a publication (instead of through a citation to the database), it is important that they be linked to a single identifier: the publication. Separate digital object identifiers (DOIs) for data will erode the importance of the publication.
- As databases are constantly evolving and growing, citation to a database is discouraged. In cases where there is no alternative, the time and date of access of the database should be included in the citation.
- Database managers should develop tools for analyzing and retrieving data.
1 A M Dziewonski and D L Anderson, Physics of the Earth and Planetary Interiors 25 (4), 297 (1981).
2 A Zindler and S R Hart, Ann. Rev. Earth Plan. Sci. 14, 493 (1986).
3 R L Rudnick and S Gao, in The Crust, edited by R L Rudnick (Elsevier, Amsterdam, 2004), Vol. 3, pp. 1.
4 T M McCollom and E L Shock, Geochimica et Cosmochimica Acta 61 (20), 4375 (1997).
5 C Class and S L Goldstein, Nature 436 (7054), 1107 (2005); V J M Salters and A Stracke, Geochemistry Geophysics Geosystems 5 (5), 2003GC000597 (2004); R K Workman and S R Hart, Earth and Planetary Science Letters 231 (1-2), 53 (2005).
6 H Staudigel, J Helly, A P Koppers et al., Geochemistry Geophysics Geosystems 4, 2002GC000314 (2003).