From 7d77abc6b3b30f29ca6f52e066a64f3cc8ddd889 Mon Sep 17 00:00:00 2001
From: Peter Sefton <peter.sefton@uts.edu.au>
Date: Thu, 12 Jul 2018 09:44:24 +1000
Subject: [PATCH] edits

---
 paper.md | 128 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 79 insertions(+), 49 deletions(-)

diff --git a/paper.md b/paper.md
index 25e105f..c9a1bb8 100644
--- a/paper.md
+++ b/paper.md
@@ -30,20 +30,20 @@ presentation and subsequent publication.
 In illuminating the term *Research Object* the call for proposals for Research
 Object 2018 uses the phrase "multi-part research outcomes with their context".
 The DataCrate specification is a research data packaging and dissemination
-specification designed to capture exactly this; outcomes (also inputs) and
-context. It is designed to assemble files that represent arbitrary research
-outcomes, inputs and contextual information that helps to make sense of them.
-DataCrates can contain any kind of data, and the context can include, but is not
-limited to, data about the people, software and equipment used in the research as
-well as supporting documents such as publications, funding agreements or README
-files.
+specification designed to capture exactly that: outcomes (also inputs) and
+context.
 
 DataCrate specifies how to gather-together data in such a way that it can (a) be
-packaged via Zip, tar, a disc image, a multi-art package like Rar TODO or (b) be
+packaged via Zip, tar, a disc image, a multi-part package or (b) be
 hosted on a web server or file share for inspection by potential users and/or
 used directly on High Performance Computing systems or otherwise accessed and
 analysed.
 
+DataCrates can contain any kind of data, and the context
+can include, but is not limited to, data about the people, software and
+equipment used in the research as well as supporting documents such as
+publications, funding agreements or README files.
+
 
 #  Methodology
 
@@ -78,25 +78,27 @@ The data packaging in HIEv used the Bagit [@kunzeBagItFilePackaging]
 packaging spec to cover requirement `6` - BagIt also doesn't get in the way of
 any of the other requirements.
 
-The main innovation in HIEv's packaging was to add a machine-generated HTML file
-that covered both `4` (as much context as possible) & `5` (human and machine
-readable metadata). To do this, HIEv produces a summary of the context, with
-information about the facilities used -- their name, nature and location -- and
-technical details about the payload files, thus satisfying `5`, using RDFa
-[@RDFa2014] to embed metadata in the HTML file gave both human (HTML) and
-machine (RDFa) views of the data.
-
-
-[Cr8it] [@seftonPickPackagePublish2014] was another implementation that used the
-same basic idea for data packages but which was never standardized.
-
-The first two DataCrate proto-implementations had no guidelines for what metadata to
-use beyond what was hard-wired each code-base, so there was no hope of easy
-interoperability or safe extensibility, and there were no repositories into
-which data could be published, but feedback from the eResearch community and the
-very limited number of researchers exposed to the systems was strong, so in 2016
-when UTS began work on a new Research Data Management service
-[@wheelerEndtoEndResearchData2018], the DataCrate standard was born.
+The main innovation in HIEv's packaging was to add an HTML file that covered
+requirements `4` (as much context as possible) & `5` (human and machine readable
+metadata). To do this, HIEv produces a summary of the context, with information
+about the facilities used -- their name, nature and location -- and technical
+details about the payload files. Using RDFa [@RDFa2014] to embed metadata in the
+HTML file gave both human (HTML) and machine (RDFa) views of the data, thus
+satisfying `5`.
+
+[Cr8it] [@seftonPickPackagePublish2014] was another early implementation that
+used the same basic idea for data packages in the OwnCloud file sharing
+application.
+
+The first two DataCrate proto-implementations had no guidelines for what
+metadata to use beyond what was hard-wired into each code-base, so there was no
+hope of easy interoperability or safe extensibility, and there were no
+repositories into which data could be published. However, we had good feedback
+from the eResearch community and from the very limited number of researchers
+exposed to the systems, so in 2016, when UTS began work on a new Research Data
+Management service [@wheelerEndtoEndResearchData2018], we decided to properly
+specify a data packaging format that met the above requirements, and the
+DataCrate standard was born.
 
 A team based at UTS, with some external collaborators started a process to work
 out (a) was there an existing standard which had emerged since the HIEv work
@@ -114,8 +116,8 @@ implementations so that was to be kept unless a better alternative surfaced --
 the decisions were around formalising metadata standards.
 
 BagIt, which had been used in HIEv and Cr8it is an obvious standard on which to
-base a packaging format - it is widely used in the research data community,
-there is cross-platform (requirement `3`) [tooling
+base a research data packaging format - it is widely used in the research data
+community, there is cross-platform (requirement `3`) [tooling
 available](https://en.wikipedia.org/wiki/BagIt#Tools) and it covers the
 integrity aspects of packaging data.
 
@@ -165,31 +167,58 @@ using JSON-LD and specifying how to represent temporal and geographical
 coverage, etc DataCrate aims to encourage common behaviours.
 
 The other main alternative was the Research Object Bundle specification
-[@soiland-reyesResearchObjectBundle2014]. At the time we started the DataCrate
-work the Research Object domain name had expired, and the project looked to be
-inactive. The domain has been re-instated, but this experience highlighted the
-risk around adopting niche standards. This was a contributing factor in our
-decision to use schema.org as the basic metadata standard for DataCrate, more
-about which below. The outage helped surfaced another requirement / principle
-`8` -- DataCrates needed to be useful even after the potential disappearance of
-the original team, and we should avoid defining our own metadata terms.
-
-Research Object Bundle uses the Universal Container Format - for which the
-documentation is now unavailable from Adobe and which does not have
-integrity features such as checksums. There has been [some work done on aligning
-Research Object Bundle with BagIt](https://github.com/ResearchObject/bagit-ro)
-this is not yet a specification that can be followed, eg there is this note about
-[incompatibilities](https://github.com/ResearchObject/bagit-ro#considerations).
+[@soiland-reyesResearchObjectBundle2014].
+
+Rather than BagIt, Research Object Bundle uses the Universal Container Format -
+for which the documentation is now unavailable from Adobe and which does not
+have integrity features such as checksums, but there is [a version of Research
+Object which uses BagIt](https://github.com/ResearchObject/bagit-ro).
 
 Research Object Bundle *does* use Linked-Data and for that reason was given
 careful consideration as a base-format for DataCrate. However, there were some
-implementation details that were not optimal.
+implementation details that we thought would make it hard for tool-makers.
+
+The use of "aggregations" and "annotations" introduces two extra layers of
+abstraction for describing resources.
+
+For example, consider this [sample from the bagit-ro tool].
+
+There is a section that lists aggregated files:
+
+    "aggregates": [
+
+      { "uri": "../data/numbers.csv",
+        "mediatype": "text/csv"
+      },
+
+And a separate place to describe *annotations* on those files:
+
+    "annotations": [
+
+      { "about": "../data/numbers.csv",
+        "content": "annotations/numbers.jsonld",
+        "createdBy": {
+          "name": "Stian Soiland-Reyes",
+          "orcid": "http://orcid.org/0000-0001-9842-9718"
+        }
+      }
+
+The actual description of the `numbers.csv` file resides in `annotations/numbers.jsonld`:
+
+    { "@context": { "@vocab": "http://purl.org/dc/terms/", "dcmi": "http://purl.org/dc/dcmitype/Dataset"},
+      "@id": "../../data/numbers.csv",
+      "@type": "dcmi:Dataset",
+      "title": "CSV files of beverage consumption",
+      "description": "A CSV file listing the number of cups/mugs consumed per person."
+
+    }
+
+The
+The PAV ontology used by has several 
 
-TODO: Spell this out better.
+Our experience in implementing systems tells us that it is very likely that in real
+life
 
-- Use of "aggregations" introduces an extra layer in packaging
-- presenting metadata as a series of annotations is an abstraction we can do
-  without
 - The ontology they use based on PROV is all about nuanced kinds of authorship
   that we don't think implementers will get right
 - Uses lots of little files
@@ -249,5 +278,6 @@ More use-cases including preservation
 # References
 
 
+[sample from the bagit-ro tool]: https://github.com/ResearchObject/bagit-ro/blob/f5fca3abad60c86b3c4f95948b5d64c3bc8e51c6/example1/metadata/manifest.json
 [HIEv system at Western Sydney University]: https://www.westernsydney.edu.au/eresearch/home/projects/completed_projects/hiev
 [Cr8it]: https://www.westernsydney.edu.au/eresearch/home/projects/cr8it
-- 
GitLab