From 7d77abc6b3b30f29ca6f52e066a64f3cc8ddd889 Mon Sep 17 00:00:00 2001 From: Peter Sefton <peter.sefton@uts.edu.au> Date: Thu, 12 Jul 2018 09:44:24 +1000 Subject: [PATCH] edits --- paper.md | 128 ++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 79 insertions(+), 49 deletions(-) diff --git a/paper.md b/paper.md index 25e105f..c9a1bb8 100644 --- a/paper.md +++ b/paper.md @@ -30,20 +30,20 @@ presentation and subsequent publication. In illuminating the term *Research Object* the call for proposals for Research Object 2018 uses the phrase "multi-part research outcomes with their context". The DataCrate specification is a research data packaging and dissemination -specification designed to capture exactly this; outcomes (also inputs) and -context. It is designed to assemble files that represent arbitrary research -outcomes, inputs and contextual information that helps to make sense of them. -DataCrates can contain any kind of data, and the context can include, but is not -limited to, data about the people, software and equipment used in the research as -well as supporting documents such as publications, funding agreements or README -files. +specification designed to capture exactly that: outcomes (also inputs) and +context. DataCrate specifies how to gather-together data in such a way that it can (a) be -packaged via Zip, tar, a disc image, a multi-art package like Rar TODO or (b) be +packaged via Zip, tar, a disc image, a multi-part package or (b) be hosted on a web server or file share for inspection by potential users and/or used directly on High Performance Computing systems or otherwise accessed and analysed. +DataCrates can contain any kind of data, and the context +can include, but is not limited to, data about the people, software and +equipment used in the research as well as supporting documents such as +publications, funding agreements or README files. 
+ # Methodology @@ -78,25 +78,27 @@ The data packaging in HIEv used the Bagit [@kunzeBagItFilePackaging] packaging spec to cover requirement `6` - BagIt also doesn't get in the way of any of the other requirements. -The main innovation in HIEv's packaging was to add a machine-generated HTML file -that covered both `4` (as much context as possible) & `5` (human and machine -readable metadata). To do this, HIEv produces a summary of the context, with -information about the facilities used -- their name, nature and location -- and -technical details about the payload files, thus satisfying `5`, using RDFa -[@RDFa2014] to embed metadata in the HTML file gave both human (HTML) and -machine (RDFa) views of the data. - - -[Cr8it] [@seftonPickPackagePublish2014] was another implementation that used the -same basic idea for data packages but which was never standardized. - -The first two DataCrate proto-implementations had no guidelines for what metadata to -use beyond what was hard-wired each code-base, so there was no hope of easy -interoperability or safe extensibility, and there were no repositories into -which data could be published, but feedback from the eResearch community and the -very limited number of researchers exposed to the systems was strong, so in 2016 -when UTS began work on a new Research Data Management service -[@wheelerEndtoEndResearchData2018], the DataCrate standard was born. +The main innovation in HIEv's packaging was to add an HTML file that covered +requirements `4` (as much context as possible) & `5` (human and machine readable +metadata). To do this, HIEv produces a summary of the context, with information +about the facilities used -- their name, nature and location -- and technical +details about the payload files, thus satisfying `5`: using RDFa [@RDFa2014] to +embed metadata in the HTML file gave both human (HTML) and machine (RDFa) views +of the data. 
+ +[Cr8it] [@seftonPickPackagePublish2014] was another early implementation that +used the same basic idea for data packages in the OwnCloud file sharing +application. + +The first two DataCrate proto-implementations had no guidelines for what +metadata to use beyond what was hard-wired into each code-base, so there was no hope +of easy interoperability or safe extensibility, and there were no repositories +into which data could be published, but we had good feedback from the eResearch +community and from the very limited number of researchers exposed to the +systems, so in 2016 when UTS began work on a new Research Data Management +service [@wheelerEndtoEndResearchData2018], we decided to properly specify a +data packaging format that met the above requirements and the DataCrate standard +was born. A team based at UTS, with some external collaborators started a process to work out (a) was there an existing standard which had emerged since the HIEv work @@ -114,8 +116,8 @@ implementations so that was to be kept unless a better alternative surfaced -- the decisions were around formalising metadata standards. BagIt, which had been used in HIEv and Cr8it is an obvious standard on which to -base a packaging format - it is widely used in the research data community, -there is cross-platform (requirement `3`) [tooling +base a research data packaging format - it is widely used in the research data +community, there is cross-platform (requirement `3`) [tooling available](https://en.wikipedia.org/wiki/BagIt#Tools) and it covers the integrity aspects of packaging data. @@ -165,31 +167,58 @@ using JSON-LD and specifying how to represent temporal and geographical coverage, etc DataCrate aims to encourage common behaviours. The other main alternative was the Research Object Bundle specification -[@soiland-reyesResearchObjectBundle2014]. At the time we started the DataCrate -work the Research Object domain name had expired, and the project looked to be -inactive. 
The domain has been re-instated, but this experience highlighted the -risk around adopting niche standards. This was a contributing factor in our -decision to use schema.org as the basic metadata standard for DataCrate, more -about which below. The outage helped surfaced another requirement / principle -`8` -- DataCrates needed to be useful even after the potential disappearance of -the original team, and we should avoid defining our own metadata terms. - -Research Object Bundle uses the Universal Container Format - for which the -documentation is now unavailable from Adobe and which does not have -integrity features such as checksums. There has been [some work done on aligning -Research Object Bundle with BagIt](https://github.com/ResearchObject/bagit-ro) -this is not yet a specification that can be followed, eg there is this note about -[incompatibilities](https://github.com/ResearchObject/bagit-ro#considerations). +[@soiland-reyesResearchObjectBundle2014]. + +Rather than BagIt, Research Object Bundle uses the Universal Container Format - +for which the documentation is now unavailable from Adobe and which does not +have integrity features such as checksums, but there is [a version of Research +Object which uses BagIt](https://github.com/ResearchObject/bagit-ro). Research Object Bundle *does* use Linked-Data and for that reason was given careful consideration as a base-format for DataCrate. However, there were some -implementation details that were not optimal. +implementation details that we thought would make it hard for tool-makers. + +The use of "aggregations" and "annotations" introduces two extra layers of +abstraction for describing resources. 
+For example, using this [sample from the bagit-ro tool] + +There is a section that lists aggregated files: + +"aggregates": [ + + { "uri": "../data/numbers.csv", + "mediatype": "text/csv" + }, + +And a separate place to describe *annotations* on those files: + +"annotations": [ + + { "about": "../data/numbers.csv", + "content": "annotations/numbers.jsonld", + "createdBy": { + "name": "Stian Soiland-Reyes", + "orcid": "http://orcid.org/0000-0001-9842-9718" + } + } + +With the actual description of the numbers.csv file residing in `annotations/numbers.jsonld`. + +{ "@context": { "@vocab": "http://purl.org/dc/terms/", "dcmi": "http://purl.org/dc/dcmitype/Dataset"}, + "@id": "../../data/numbers.csv", + "@type": "dcmi:Dataset", + "title": "CSV files of beverage consumption", + "description": "A CSV file listing the number of cups/mugs consumed per person." + +} + +The +The PAV ontology used by has several -TODO: Spell this out better. +Our experience in implementing systems tells us that it is very likely that in real +life -- Use of "aggregations" introduces an extra layer in packaging -- presenting metadata as a series of annotations is an abstraction we can do - without - The ontology they use based on PROV is all about nuanced kinds of authorship that we don't think implementers will get right - Uses lots of little files @@ -249,5 +278,6 @@ More use-cases including preservation # References +[sample from the bagit-ro tool]: https://github.com/ResearchObject/bagit-ro/blob/f5fca3abad60c86b3c4f95948b5d64c3bc8e51c6/example1/metadata/manifest.json [HIEv system at Western Sydney University]: https://www.westernsydney.edu.au/eresearch/home/projects/completed_projects/hiev [Cr8it]: https://www.westernsydney.edu.au/eresearch/home/projects/cr8it -- GitLab