used the same basic idea for data packages in the OwnCloud file sharing
application.
The first two DataCrate proto-implementations had no guidelines for what
metadata to use beyond what was hard-wired into each code-base, so there was no
hope of easy interoperability or safe extensibility, and there were no
repositories into which data could be published. However, we had good feedback
about the concept from the eResearch community and from the very limited number
of researchers exposed to the systems. So, in 2016, when UTS began work on a new
Research Data Management service [@wheelerEndtoEndResearchData2018], we decided
to properly specify a data packaging format that met the above requirements, and
the DataCrate standard was born.
A team based at UTS, with some external collaborators, started a process to work
out: (a) was there an existing standard, which had emerged since the HIEv work,
that met the requirements? (b) if not, which RDF vocabularies should we use? and
(c) the mechanics of organising the files in the packages. At this point the
requirements had evolved to be:
1. Per-file checksums and the ability to include linked resources (features of
   BagIt).
2. Linked-data metadata in JSON-LD format, using well-documented ontologies /
   vocabularies with coverage for:
   - Discovery and citation metadata: who created it, what is it, where did
     the work take place, and where in the world is it *about*? And the same
     metadata at the file level.
   - Technical metadata: what size is each file, what format is the file in,
     and who or what created the file.
3. A convention for including an HTML file which describes the dataset, and
   potentially all of its files, with a human-readable view of `2.` (a sketch
   of a package meeting all three requirements follows this list).
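As a rough illustration, a bag meeting all three requirements might be laid out
as follows. The names `CATALOG.json` and `index.html` are the ones later
formalised in the specification; the payload file is hypothetical:

```
example_crate/
├── bagit.txt
├── bag-info.txt
├── manifest-sha256.txt      # per-file checksums (requirement 1)
└── data/
    ├── CATALOG.json         # JSON-LD metadata (requirement 2)
    ├── index.html           # human-readable view (requirement 3)
    └── observations.csv
```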
These could all be accomplished by an update of the HIEv data package format,
but it was important to make sure we were not re-inventing something that had
been done elsewhere.
### Existing standards
We were not able to find any general-purpose packaging specification with
anything like the HTML+RDFa index that HIEv data packages have, allowing for
human and machine readable metadata. Using BagIt plus extra files worked well in
our initial implementations, so that was to be kept unless a better alternative
surfaced -- the remaining decisions were around formalising metadata standards.
BagIt, which had been used in HIEv and Cr8it, is an obvious standard on which to
base a research data packaging format - it is widely used in the research data
<...> integrity aspects of packaging data.
### Alternatives considered
Frictionless data packages [TODO ref], which use a simple JSON format as a
manifest, have roughly equivalent packaging features to BagIt, with checksum
features built in. In their favour, frictionless data packages have the ability
to describe the headers in tabular data files. However, they do not meet
requirement `2` of having linked-data metadata, so while the JSON metadata is
technically machine readable, in that it is simple to parse, it is not easy to
relate to the semantic web: it does not use linked-data standards, and the
terms are defined locally to the specification, without URIs. It is also unclear
how to extend the specification in a standardised way, contrasting with
linked-data approaches which *automatically* allow extension by the use of URIs.

As an example, [the spec](https://frictionlessdata.io/specs/data-package/) does
not give a single way to describe temporal coverage:
> Adherence to the specification does not imply that additional, non-specified
> properties cannot be used: a descriptor MAY include any number of properties
> <...>
> and stored in CSV.
>
> <https://github.com/frictionlessdata/specs/blob/0860ecd6bbb7685425e6493165c9b1a1c91eb16b/specs/data-package.md>
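To illustrate the risk, two hypothetical packages could describe the same
temporal coverage in incompatible, locally defined ways; both coverage keys
below are made up, which is exactly what the specification permits:

```json
{
  "name": "rainfall-2017-package-a",
  "temporal": "2017",
  "resources": [{ "path": "rainfall.csv" }]
}
```

```json
{
  "name": "rainfall-2017-package-b",
  "coverage": { "start": "2017-01-01", "end": "2017-12-31" },
  "resources": [{ "path": "rainfall.csv" }]
}
```

A consumer of these packages has no standard way to discover that `temporal`
and `coverage` mean the same thing.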
We think that this *laissez faire* extension mechanism in frictionless Data
Packages is likely to result in a proliferation of highly divergent,
non-standardised metadata. By using JSON-LD, and by specifying how to represent
temporal and geographical coverage, etc., DataCrate aims to encourage common
behaviours. In DataCrate, the approach is to use schema.org's temporalCoverage
property. Here is an [example from v0.2](https://github.com/UTS-eResearch/datacrate/blob/22aebdcd179cb3f9b8141ca350ffafa202f5b523/spec/0.2/data_crate_specification_v0.2.md) of the specification:
> {
>   "@id": "https://doi.org/10.5281/zenodo.1009240",
>   "@type": "Dataset",
>   <...>
>   "name": "Sample dataset for DataCrate v0.2",
>   "publisher": {
>     "@id": "http://uts.edu.au"
>   },
>   "temporalCoverage": "2017"
> },
In the [DataCrate JSON-LD context] this expands to the URI
<https://schema.org/temporalCoverage>. And in the HTML file that displays
metadata for humans, the term is self-documenting:
![Screenshot showing how the term "temporalCoverage" is linked - the link resolves to the schema.org page for temporalCoverage](temporal_coverage.png)
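On the machine side, the mechanism is a simple key-to-URI mapping; a condensed
sketch of the relevant entry in the [DataCrate JSON-LD context]:

```json
{
  "@context": {
    "temporalCoverage": "https://schema.org/temporalCoverage"
  }
}
```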
The other main alternative was the Research Object Bundle specification
[@soiland-reyesResearchObjectBundle2014].
Rather than BagIt, the original version of Research Object Bundle uses the
Zip-based Universal Container Format, a format for which the documentation now
seems to be unavailable from Adobe, and which does not have integrity features
such as checksums. There is, however, [a version of Research Object which uses
BagIt](https://github.com/ResearchObject/bagit-ro).
RO BagIt *does* use Linked-Data and for that reason was given careful
consideration as a base format for DataCrate. However, there were some
implementation details that we thought would make it hard for tool-makers
(including the core team at UTS).
The use of "aggregations" and "annotations" introduces two extra layers of
abstraction for describing resources.
For example, in this [sample from the bagit-ro tool] there is a section in the
manifest that lists aggregated files:
>"aggregates": [
>
......@@ -213,45 +254,60 @@ With the actual description of the numbers.csv file residing in `annotations/num
>}
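Condensed, and with hypothetical paths substituted for the elided ones, the
shape of the manifest is roughly as follows (a sketch based on the bagit-ro
example, not a verbatim copy):

```json
{
  "@context": ["https://w3id.org/bundle/context"],
  "id": "/",
  "aggregates": [
    { "uri": "../data/numbers.csv", "mediatype": "text/csv" }
  ],
  "annotations": [
    {
      "about": "../data/numbers.csv",
      "content": "annotations/numbers.jsonld"
    }
  ]
}
```

That is, a file is listed once as an aggregate, again as the subject of an
annotation, and its descriptive metadata lives in a third place, the annotation
file itself.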
- The PAV ontology they use, which is based on PROV, is all about nuanced kinds
  of authorship that we don't think implementers will get right
- It uses lots of little files
# Implementation of DataCrate
In our judgement, the level of indirection and the number of files involved in
the Research Object approach were not suitable for DataCrate; the
implementation cost for tool makers would be too high. In making this choice we
gave up the benefits of being able to make assertions about the provenance of
annotations as distinct resources, and the more intellectually satisfying
abstractions about aggregations offered by ORE. We settled on an approach which
uses just three extra files:
1. A single CATALOG.json file, containing JSON-LD which describes the
   folder/file hierarchy of the DataCrate and, in the same place, associated
   contextually relevant entities such as people (a minimal sketch follows
   this list).
2. An index.html file with a human-readable summary of the catalog file.
3. Optionally, a `DataCite.xml` file containing a data citation (a text version
   of which is prominent in the HTML file if it exists).
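To make the first of these concrete, here is a minimal, hypothetical
CATALOG.json; the identifiers are made up and the property choices are
illustrative rather than normative:

```json
{
  "@context": { "@vocab": "https://schema.org/" },
  "@graph": [
    {
      "@id": "data/",
      "@type": "Dataset",
      "name": "An example DataCrate",
      "creator": { "@id": "https://orcid.org/0000-0000-0000-0000" },
      "hasPart": { "@id": "data/observations.csv" }
    },
    {
      "@id": "data/observations.csv",
      "@type": "MediaObject",
      "contentSize": "34 kB",
      "encodingFormat": "text/csv"
    },
    {
      "@id": "https://orcid.org/0000-0000-0000-0000",
      "@type": "Person",
      "name": "A. Researcher"
    }
  ]
}
```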
The initial version of DataCrate (v0.1) was developed in 2017. V0.1 persisted
with HTML+RDFa for human and machine readability but this was cumbersome to
generate and was removed at [the suggestion of Eoghan Ó
Carragáin](https://github.com/UTS-eResearch/datacrate/issues/14) in favour of
an approach where the human-centred HTML page is generated from a
machine-readable JSON-LD file rather than the other way around.
We looked at a variety of standards, including Dublin Core
[@kunzeDublinCoreMetadata2007], which is very limited in coverage, and DCAT
[@maaliDataCatalogVocabulary2014], which is more complete for describing data
sets at the top level but silent on the issue of describing files or other
contextual entities and the relationships between them. We found that
schema.org has the widest range of terms needed to describe "who, what, where"
metadata for datasets. Schema.org is also the most widely used linked-data
vocabulary in the world [TODO: find a reference].
The DataCrate specification recommends other ontologies where schema.org has gaps in its coverage.
For describing datasets such as exported content from digital object repository
systems, DataCrate uses the Portland Common Data Model [@PortlandCommonData]
(PCDM), which is a simple ontology for describing nested
[Collections](http://pcdm.org/models#Collection) of [Objects](https://pcdm.org/2016/04/18/models#Object), with Objects [having](http://pcdm.org/models#hasFile) [Files](http://pcdm.org/models#File).
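A sketch of how such a repository export might appear in a CATALOG.json graph;
the identifiers are hypothetical, and the class and property URIs are the PCDM
ones linked above:

```json
[
  {
    "@id": "catalog/",
    "@type": "http://pcdm.org/models#Collection",
    "http://pcdm.org/models#hasMember": { "@id": "catalog/object-1/" }
  },
  {
    "@id": "catalog/object-1/",
    "@type": "http://pcdm.org/models#Object",
    "http://pcdm.org/models#hasFile": { "@id": "catalog/object-1/page-1.tiff" }
  },
  {
    "@id": "catalog/object-1/page-1.tiff",
    "@type": "http://pcdm.org/models#File"
  }
]
```

In practice the DataCrate context would map these URIs to short keys, as with
the schema.org terms above.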
For concepts related to the scholarly process (but biased towards publishing),
DataCrate uses terms from the SPAR ontologies
[@shottonIntroductionSemanticPublishing2010] for scholarly communications. For
example, schema.org does not have a class for Project, so the key "Project" is
mapped to the FRAPO term [Project](https://sparontologies.github.io/frapo/current/frapo.html#d4e2428).
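So a hypothetical project entity in CATALOG.json needs nothing more than the
short key, with the context supplying the FRAPO class URI:

```json
{
  "@id": "https://example.org/projects/example-project",
  "@type": "Project",
  "name": "An example project",
  "description": "A hypothetical entity; the DataCrate context maps the key Project to the FRAPO class."
}
```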
NOTE: we are in the process of adding more coverage for provenance, showing how
files are created by software etc. We will be asking the Research Object team
for assistance with this.
# Conclusion
TODO: Link to showcase examples.

DataCrate (which will be in version 1 by the time of the Research Object
workshop) has been tested on a variety of research data sets. Some examples are:
Points to note (TODO):
More use-cases, including preservation
[sample from the bagit-ro tool]: https://github.com/ResearchObject/bagit-ro/blob/f5fca3abad60c86b3c4f95948b5d64c3bc8e51c6/example1/metadata/manifest.json
[HIEv system at Western Sydney University]: https://www.westernsydney.edu.au/eresearch/home/projects/completed_projects/hiev
[Cr8it]: https://www.westernsydney.edu.au/eresearch/home/projects/cr8it
[DataCrate JSON-LD context]: https://github.com/UTS-eResearch/datacrate/blob/master/spec/0.2/context.json