This *laissez faire* extension mechanism in Frictionless Data Packages is likely
to result in a proliferation of highly divergent, non-standardised metadata. By
using JSON-LD and specifying how to represent temporal and geographical
coverage, etc., DataCrate aims to encourage common behaviours.
We think that this *laissez faire* extension mechanism in Frictionless Data
Packages is likely to result in a proliferation of highly divergent,
non-standardised metadata. By using JSON-LD and specifying how to represent
temporal and geographical coverage, etc., DataCrate aims to encourage common
behaviours. In DataCrate, the approach is to use schema.org's temporalCoverage
property. Here is an [example from v0.2](https://github.com/UTS-eResearch/datacrate/blob/22aebdcd179cb3f9b8141ca350ffafa202f5b523/spec/0.2/data_crate_specification_v0.2.md) of the specification.
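
As a minimal sketch of what this looks like in practice, the Python snippet below builds a small JSON-LD description of a dataset that uses schema.org's `temporalCoverage` with an ISO 8601 interval. The dataset name, the dates, and the helper name `make_dataset` are hypothetical and are not taken from the DataCrate specification.

```python
import json


def make_dataset(name, start, end):
    """Return a JSON-LD dict for a schema.org Dataset (illustrative only)."""
    return {
        "@context": "http://schema.org/",
        "@type": "Dataset",
        "name": name,
        # schema.org expects ISO 8601 here; an interval is written start/end.
        "temporalCoverage": f"{start}/{end}",
    }


# Hypothetical dataset, used only to show the shape of the metadata.
example = make_dataset("Example sensor readings", "2016-01-01", "2016-12-31")
print(json.dumps(example, indent=2))
```

Because every crate expresses the period the same way, a consumer can split the value on `/` and parse both halves as dates without knowing anything about the producer.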
Research Object Bundle *does* use Linked-Data and for that reason was given
RO BagIt *does* use Linked-Data and for that reason was given
careful consideration as a base-format for DataCrate. However, there were some
implementation details that we thought would make it hard for tool-makers.
implementation details that we thought would make it hard for tool-makers
(including the core team at UTS).
The use of "aggregations" and "annotations" introduces two extra layers of
abstraction for describing resources.
For example, using this [sample from the bagit-ro tool]
There is a section that lists aggregated files:
There is a section in the manifest that lists aggregated files:
>"aggregates": [
>
...
...
@@ -213,45 +254,60 @@ With the actual description of the numbers.csv file residing in `annotations/num
>}
- The PAV ontology they use based on PROV is all about nuanced kinds of authorship
that we don't think implementers will get right
- Uses lots of little files
The initial version of DataCrate (v0.1) was developed in 2017. V0.1 persisted with
HTML+RDFa for human and machine readability but this was cumbersome and was
removed in favour of an approach where the human-centred HTML page is generated
from a machine-readable JSON-LD file rather than the other way around.
After looking at a variety of standards, including Dublin Core [TODO REF] which
is very limited in coverage and DCAT [TODO REF] which is more complete for
describing data sets, but silent on the issue of describing files or other
contextual entities, using schema.org as the base metadata standard was judged
by the team to be the best way to meet our goals.
Schema.org is the most widely used linked-data vocabulary in the world [TODO:
find a reference].
We recommend other ontologies
where schema.org has gaps in its coverage; this contrasts with the Frictionless
Data approach of encouraging people to make up their own string-based metadata.
TODO: More on these:
PCDM for repository content.
SPAR ontologies for scholarly communications.
TBA - what to do for scientific discipline metadata.
# Implementation
The specification. A quick summary
# Implementation of DataCrate
In our judgement, the level of indirection and number of files involved in the
Research Object approach were not suitable for DataCrate; the implementation
cost for tool makers would be too high. In making this choice we forewent the
benefits of being able to make assertions about the provenance of annotations as
distinct resources, and the more intellectually satisfying abstractions about
aggregations offered by ORE. We settled on an approach which uses just three
extra files:
1. A single CATALOG.json file, containing JSON-LD which describes, in one
place, the folder/file hierarchy of the data crate and associated contextually
relevant entities, such as people.
2. An index.html file with a human-readable summary of the catalog file.
3. Optionally, a `DataCite.xml` file containing a data citation (a text version
of which is prominent in the HTML file if it exists).
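
The relationship between the first two files can be sketched as follows: the HTML summary is derived from the JSON-LD catalog, never the reverse. This is only an illustration, not the DataCrate tooling. It assumes the catalog is a JSON-LD document with a flat `@graph` list whose entries carry `@id`, `@type`, and `name`; the function name `render_index` and the sample entities are hypothetical.

```python
import html


def render_index(catalog):
    """Build a very small HTML summary from a JSON-LD catalog dict."""
    items = []
    # Assumption: entities live in a flat "@graph" list.
    for entity in catalog.get("@graph", []):
        raw_id = str(entity.get("@id", ""))
        # Fall back to the identifier when an entity has no name.
        name = html.escape(str(entity.get("name", raw_id)))
        kind = html.escape(str(entity.get("@type", "Thing")))
        ident = html.escape(raw_id)
        items.append(f"<li>{name} ({kind}): <code>{ident}</code></li>")
    body = "\n".join(items)
    return f"<html><body><h1>Data crate contents</h1><ul>\n{body}\n</ul></body></html>"


# Hypothetical catalog content, standing in for a parsed CATALOG.json.
sample_catalog = {
    "@context": "http://schema.org/",
    "@graph": [
        {"@id": "./", "@type": "Dataset", "name": "Example crate"},
        {"@id": "data/numbers.csv", "@type": "MediaObject", "name": "numbers.csv"},
        {"@id": "#alice", "@type": "Person", "name": "Alice <Example>"},
    ],
}

page = render_index(sample_catalog)
print(page)
```

In real use the dict would come from `json.load` on the catalog file and the returned string would be written to `index.html`; keeping the function free of file I/O makes the one-way derivation easy to test.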
The initial version of DataCrate (v0.1) was developed in 2017. V0.1 persisted
with HTML+RDFa for human and machine readability but this was cumbersome to
generate and was removed at [the suggestion of Eoghan Ó
Carragáin](https://github.com/UTS-eResearch/datacrate/issues/14) in favour of
an approach where the human-centred HTML page is generated from a
machine-readable JSON-LD file rather than the other way around.
We looked at a variety of standards, including Dublin Core
[@kunzeDublinCoreMetadata2007], which is very limited in coverage, and DCAT
[@maaliDataCatalogVocabulary2014], which is more complete for describing data
sets at the top level but silent on the issue of describing files or other
contextual entities and the relationships between them. We found that
schema.org has the widest range of terms needed to describe "who, what, where"
metadata for datasets. Schema.org is also the most widely used linked-data vocabulary
in the world [TODO: find a reference].
DataCrate recommends other ontologies where schema.org has gaps in its coverage.
For describing datasets such as exported content from digital
object repository systems, DataCrate uses the Portland Common Data Model
(PCDM) [@PortlandCommonData], which is a simple ontology for describing nested
[Collections](http://pcdm.org/models#Collection) of [Objects](https://pcdm.org/2016/04/18/models#Object), with Objects [having](http://pcdm.org/models#hasFile) [Files](http://pcdm.org/models#File).
For concepts related to the scholarly process (but biased towards publishing),
DataCrate uses terms from the SPAR ontologies
[@shottonIntroductionSemanticPublishing2010] for scholarly communications. For
example, schema.org does not have a class for Project, so the key "Project" is
mapped to the FRAPO term [Project](https://sparontologies.github.io/frapo/current/frapo.html#d4e2428).
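
The kind of mapping described here can be sketched as a JSON-LD-style context in which schema.org is the default vocabulary and individual keys are overridden. The snippet is a simplification of JSON-LD term expansion, not a conforming processor. The FRAPO IRI shown is an assumption about the ontology's namespace, and `expand_type` is a hypothetical helper.

```python
# Assumed IRIs: schema.org as the default vocabulary, plus one override
# pointing at the FRAPO namespace (an assumption, check the ontology itself).
context = {
    "@vocab": "http://schema.org/",
    "Project": "http://purl.org/cerif/frapo/Project",
}


def expand_type(type_name, ctx):
    """Resolve a short type name to a full IRI using a simple context dict."""
    if type_name in ctx:
        # An explicit mapping wins over the default vocabulary.
        return ctx[type_name]
    return ctx["@vocab"] + type_name


print(expand_type("Project", context))
print(expand_type("Person", context))
```

Authors of metadata keep writing the plain key "Project", while consumers that process the context resolve it to a term in an established ontology rather than to an ad hoc string.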
NOTE: We are in the process of adding more coverage for provenance, showing how
files are created by software etc. We will be asking the Research Object team
for assistance with this.
# Conclusion
Link to showcase examples.
DataCrate (which will be in version 1 by the time of the Research Object
workshop) has been tested on a variety of research data sets. Some examples are:
Points to note (TODO):
...
...
@@ -275,3 +331,5 @@ More use-cases including preservation
[sample from the bagit-ro tool]:https://github.com/ResearchObject/bagit-ro/blob/f5fca3abad60c86b3c4f95948b5d64c3bc8e51c6/example1/metadata/manifest.json
[HIEv system at Western Sydney University]:https://www.westernsydney.edu.au/eresearch/home/projects/completed_projects/hiev