used the same basic idea for data packages in the OwnCloud file sharing
application.
The first two DataCrate proto-implementations had no guidelines for what
metadata to use beyond what was hard-wired into each code-base, so there was no
hope of easy interoperability or safe extensibility, and there were no
repositories into which data could be published. However, we had good feedback
about the concept from the eResearch community and from the very limited number
of researchers exposed to the systems. So, in 2016, when UTS began work on a new
Research Data Management service [@wheelerEndtoEndResearchData2018], we decided
to properly specify a data packaging format that met the above requirements, and
the DataCrate standard was born.
A team based at UTS, with some external collaborators, started a process to work
out: (a) was there an existing standard, which had emerged since the HIEv work,
that met the requirements? (b) if not, which RDF vocabularies should we use? and
(c) the mechanics of organising the files in the packages. At this point the
requirements had evolved to be:
1. Per-file checksums and the ability to include linked resources (features of
   BagIt).
2. Linked-data metadata in JSON-LD format, using well-documented ontologies /
   vocabularies with coverage for:
   - Discovery and citation metadata: who created it, what is it, where did
     the work take place, and where in the world is it *about*? And the same
     metadata at the file level.
   - Technical metadata: what size is each file, what format is the file in,
     and who or what created the file.
3. A convention for including an HTML file which describes the dataset, and
   potentially all of its files, with a human-readable view of `2.` (a sketch
   of a package meeting all three requirements follows this list).
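As a rough illustration, a bag meeting all three requirements might be laid out
as follows. The names `CATALOG.json` and `index.html` are the ones later
formalised in the specification; the payload file is hypothetical:

```
example_crate/
├── bagit.txt
├── bag-info.txt
├── manifest-sha256.txt      # per-file checksums (requirement 1)
└── data/
    ├── CATALOG.json         # JSON-LD metadata (requirement 2)
    ├── index.html           # human-readable view (requirement 3)
    └── observations.csv
```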
These could all be accomplished by an update of the HIEv data package format,
but it was important to make sure we were not re-inventing something that had
been done elsewhere.
### Existing standards
We were not able to find any general-purpose packaging specification with
anything like the HTML+RDFa index that HIEv data packages have, allowing for
human and machine readable metadata. Using BagIt plus extra files worked well in
our initial implementations, so that was to be kept unless a better alternative
surfaced -- the remaining decisions were around formalising metadata standards.
BagIt, which had been used in HIEv and Cr8it, is an obvious standard on which to
base a research data packaging format - it is widely used in the research data
<...> integrity aspects of packaging data.
### Alternatives considered
Frictionless data packages [TODO ref], which use a simple JSON format as a
manifest, have roughly equivalent packaging features to BagIt, with checksum
features built in. In their favour, frictionless data packages have the ability
to describe the headers in tabular data files. However, they do not meet
requirement `2` of having linked-data metadata, so while the JSON metadata is
technically machine readable, in that it is simple to parse, it is not easy to
relate to the semantic web: it does not use linked-data standards, and the
terms are defined locally to the specification, without URIs. It is also unclear
how to extend the specification in a standardised way, contrasting with
linked-data approaches which *automatically* allow extension by the use of URIs.

As an example, [the spec](https://frictionlessdata.io/specs/data-package/) does
not give a single way to describe temporal coverage:
> Adherence to the specification does not imply that additional, non-specified
> properties cannot be used: a descriptor MAY include any number of properties
> <...>
> and stored in CSV.
>
> <https://github.com/frictionlessdata/specs/blob/0860ecd6bbb7685425e6493165c9b1a1c91eb16b/specs/data-package.md>
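To illustrate the risk, two hypothetical packages could describe the same
temporal coverage in incompatible, locally defined ways; both coverage keys
below are made up, which is exactly what the specification permits:

```json
{
  "name": "rainfall-2017-package-a",
  "temporal": "2017",
  "resources": [{ "path": "rainfall.csv" }]
}
```

```json
{
  "name": "rainfall-2017-package-b",
  "coverage": { "start": "2017-01-01", "end": "2017-12-31" },
  "resources": [{ "path": "rainfall.csv" }]
}
```

A consumer of these packages has no standard way to discover that `temporal`
and `coverage` mean the same thing.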
We think that this *laissez faire* extension mechanism in frictionless Data
Packages is likely to result in a proliferation of highly divergent,
non-standardised metadata. By using JSON-LD, and by specifying how to represent
temporal and geographical coverage, etc., DataCrate aims to encourage common
behaviours. In DataCrate, the approach is to use schema.org's temporalCoverage
property. Here is an [example from v0.2](https://github.com/UTS-eResearch/datacrate/blob/22aebdcd179cb3f9b8141ca350ffafa202f5b523/spec/0.2/data_crate_specification_v0.2.md) of the specification:
> {
>   "@id": "https://doi.org/10.5281/zenodo.1009240",
>   "@type": "Dataset",
>   <...>
>   "name": "Sample dataset for DataCrate v0.2",
>   "publisher": {
>     "@id": "http://uts.edu.au"
>   },
>   "temporalCoverage": "2017"
> },
In the [DataCrate JSON-LD context] this expands to the URI
<https://schema.org/temporalCoverage>. And in the HTML file that displays
metadata for humans, the term is self-documenting:
![Screenshot showing how the term "temporalCoverage" is linked - the link resolves to the schema.org page for temporalCoverage](temporal_coverage.png)
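On the machine side, the mechanism is a simple key-to-URI mapping; a condensed
sketch of the relevant entry in the [DataCrate JSON-LD context]:

```json
{
  "@context": {
    "temporalCoverage": "https://schema.org/temporalCoverage"
  }
}
```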
The other main alternative was the Research Object Bundle specification
[@soiland-reyesResearchObjectBundle2014].
Rather than BagIt, the original version of Research Object Bundle uses the
Zip-based Universal Container Format, a format for which the documentation now
seems to be unavailable from Adobe, and which does not have integrity features
such as checksums. There is, however, [a version of Research Object which uses
BagIt](https://github.com/ResearchObject/bagit-ro).
RO BagIt *does* use Linked-Data and for that reason was given careful
consideration as a base format for DataCrate. However, there were some
implementation details that we thought would make it hard for tool-makers
(including the core team at UTS).
The use of "aggregations" and "annotations" introduces two extra layers of
abstraction for describing resources.
For example, in this [sample from the bagit-ro tool] there is a section in the
manifest that lists aggregated files:
>"aggregates": [
>
......@@ -213,45 +254,60 @@ With the actual description of the numbers.csv file residing in `annotations/num
>}
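Condensed, and with hypothetical paths substituted for the elided ones, the
shape of the manifest is roughly as follows (a sketch based on the bagit-ro
example, not a verbatim copy):

```json
{
  "@context": ["https://w3id.org/bundle/context"],
  "id": "/",
  "aggregates": [
    { "uri": "../data/numbers.csv", "mediatype": "text/csv" }
  ],
  "annotations": [
    {
      "about": "../data/numbers.csv",
      "content": "annotations/numbers.jsonld"
    }
  ]
}
```

That is, a file is listed once as an aggregate, again as the subject of an
annotation, and its descriptive metadata lives in a third place, the annotation
file itself.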
- The PAV ontology they use, which is based on PROV, is all about nuanced kinds
  of authorship that we don't think implementers will get right
- It uses lots of little files
# Implementation of DataCrate
In our judgement, the level of indirection and the number of files involved in
the Research Object approach were not suitable for DataCrate; the
implementation cost for tool makers would be too high. In making this choice we
gave up the benefits of being able to make assertions about the provenance of
annotations as distinct resources, and the more intellectually satisfying
abstractions about aggregations offered by ORE. We settled on an approach which
uses just three extra files:
1. A single CATALOG.json file, containing JSON-LD which describes the
   folder/file hierarchy of the DataCrate and, in the same place, associated
   contextually relevant entities such as people (a minimal sketch follows
   this list).
2. An index.html file with a human-readable summary of the catalog file.
3. Optionally, a `DataCite.xml` file containing a data citation (a text version
   of which is prominent in the HTML file if it exists).
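To make the first of these concrete, here is a minimal, hypothetical
CATALOG.json; the identifiers are made up and the property choices are
illustrative rather than normative:

```json
{
  "@context": { "@vocab": "https://schema.org/" },
  "@graph": [
    {
      "@id": "data/",
      "@type": "Dataset",
      "name": "An example DataCrate",
      "creator": { "@id": "https://orcid.org/0000-0000-0000-0000" },
      "hasPart": { "@id": "data/observations.csv" }
    },
    {
      "@id": "data/observations.csv",
      "@type": "MediaObject",
      "contentSize": "34 kB",
      "encodingFormat": "text/csv"
    },
    {
      "@id": "https://orcid.org/0000-0000-0000-0000",
      "@type": "Person",
      "name": "A. Researcher"
    }
  ]
}
```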
The initial version of DataCrate (v0.1) was developed in 2017. V0.1 persisted
with HTML+RDFa for human and machine readability but this was cumbersome to
generate and was removed at [the suggestion of Eoghan Ó
Carragáin](https://github.com/UTS-eResearch/datacrate/issues/14) in favour of
an approach where the human-centred HTML page is generated from a
machine-readable JSON-LD file rather than the other way around.
We looked at a variety of standards, including Dublin Core
[@kunzeDublinCoreMetadata2007], which is very limited in coverage, and DCAT
[@maaliDataCatalogVocabulary2014], which is more complete for describing data
sets at the top level but silent on the issue of describing files or other
contextual entities and the relationships between them. We found that
schema.org has the widest range of terms needed to describe "who, what, where"
metadata for datasets. Schema.org is also the most widely used linked-data
vocabulary in the world [TODO: find a reference].
The DataCrate specification recommends other ontologies where schema.org has gaps in its coverage.
For describing datasets such as exported content from digital object repository
systems, DataCrate uses the Portland Common Data Model [@PortlandCommonData]
(PCDM), which is a simple ontology for describing nested
[Collections](http://pcdm.org/models#Collection) of [Objects](https://pcdm.org/2016/04/18/models#Object), with Objects [having](http://pcdm.org/models#hasFile) [Files](http://pcdm.org/models#File).
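A sketch of how such a repository export might appear in a CATALOG.json graph;
the identifiers are hypothetical, and the class and property URIs are the PCDM
ones linked above:

```json
[
  {
    "@id": "catalog/",
    "@type": "http://pcdm.org/models#Collection",
    "http://pcdm.org/models#hasMember": { "@id": "catalog/object-1/" }
  },
  {
    "@id": "catalog/object-1/",
    "@type": "http://pcdm.org/models#Object",
    "http://pcdm.org/models#hasFile": { "@id": "catalog/object-1/page-1.tiff" }
  },
  {
    "@id": "catalog/object-1/page-1.tiff",
    "@type": "http://pcdm.org/models#File"
  }
]
```

In practice the DataCrate context would map these URIs to short keys, as with
the schema.org terms above.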
For concepts related to the scholarly process (but biased towards publishing),
DataCrate uses terms from the SPAR ontologies
[@shottonIntroductionSemanticPublishing2010] for scholarly communications. For
example, schema.org does not have a class for Project, so the key "Project" is
mapped to the FRAPO term [Project](https://sparontologies.github.io/frapo/current/frapo.html#d4e2428).
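So a hypothetical project entity in CATALOG.json needs nothing more than the
short key, with the context supplying the FRAPO class URI:

```json
{
  "@id": "https://example.org/projects/example-project",
  "@type": "Project",
  "name": "An example project",
  "description": "A hypothetical entity; the DataCrate context maps the key Project to the FRAPO class."
}
```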
NOTE: we are in the process of adding more coverage for provenance, showing how
files are created by software etc. We will be asking the Research Object team
for assistance with this.
# Conclusion
TODO: Link to showcase examples.

DataCrate (which will be in version 1 by the time of the Research Object
workshop) has been tested on a variety of research data sets. Some examples are:
Points to note (TODO):
More use-cases, including preservation
[sample from the bagit-ro tool]: https://github.com/ResearchObject/bagit-ro/blob/f5fca3abad60c86b3c4f95948b5d64c3bc8e51c6/example1/metadata/manifest.json
[HIEv system at Western Sydney University]: https://www.westernsydney.edu.au/eresearch/home/projects/completed_projects/hiev
[Cr8it]: https://www.westernsydney.edu.au/eresearch/home/projects/cr8it
[DataCrate JSON-LD context]: https://github.com/UTS-eResearch/datacrate/blob/master/spec/0.2/context.json