diff --git a/build/paper.pdf b/build/paper.pdf index 0624ada38e55ddb40dffe797449898e9508c7610..4ae064b522b258fa9bebf24840e83401bc0750f3 100644 Binary files a/build/paper.pdf and b/build/paper.pdf differ diff --git a/paper.md b/paper.md index c772be435bc8b06cdda71cb6eb6818e4ae3142d4..64e94d425442e45fc654bd2b5c60fabd1c486f42 100644 --- a/paper.md +++ b/paper.md @@ -271,6 +271,20 @@ extra files: 3. Optionally, a `DataCite.xml` file containing a data citation (a text version of which is prominent in the HTML file if it exists). +Uses for DataCrates planned at UTS include: +- Making DataCrates available for download from a public website, both packaged + as Zip files and expanded so that users can peruse the `index.html` file and + access individual files. +- Using the DataCrate with additional metadata for archiving and preserving + data (in a project to begin in 2019) +- Using the DataCrate format to allow exchange of data between systems, for + example sending data from a repository in a university facility such as the + Omero microscopy repository [@OMEROFlexibleModeldriven] to a git project + system like GitLab. +- Automatically detecting metadata in DataCrates which are uploaded to our + research-integrity driven Research Data Management system + [@wheelerEndtoEndResearchData2018]. + The initial version of DataCrate (v0.1) was developed in 2017. V0.1 persisted with HTML+RDFa for human and machine readability but this was cumbersome to generate and was removed at [the suggestion of Eoghan Ó @@ -282,19 +296,19 @@ We looked at a variety of standards, including Dublin Core [@kunzeDublinCoreMetadata2007] which is very limited in coverage and DCAT [@maaliDataCatalogVocabulary2014] which is more complete for describing data sets at the top level, but silent on the issue of describing files or other -contextual entities and relationships bewteen them. We discovered that +contextual entities and relationships between them. 
We discovered that Schema.org has the widest range of terms needed to describe who, what where -metadata for datasets. Schema.org is also the most widely used linked-data vocabulary -in the world [TODO: find a reference]. +metadata for datasets. -The DataCrate recommend other ontologies where schema.org has gaps in its coverage. +The DataCrate spec recommends other ontologies where schema.org has gaps in its +coverage: -For describing datasets such as exported content from digital +- For describing datasets such as exported content from digital object repository systems DataCrate uses the Portland Common Data Model [@PortlandCommonData], (PCDM), which is a simple ontology for describing nested [Collections](http://pcdm.org/models#Collection) of [Objects](https://pcdm.org/2016/04/18/models#Object), with Objects [having](http://pcdm.org/models#hasFile) [Files](http://pcdm.org/models#File). -For concepts related to the scholarly process (but biased towards publishing) +- For concepts related to the scholarly process (but biased towards publishing) DataCrate uses terms from the SPAR ontologies [@shottonIntroductionSemanticPublishing2010] for scholarly communications. For example, schema.org does not have a class for Project so the key "Project" is @@ -304,11 +318,45 @@ NOTE: we are in the process of adding more coverage for provenance - showing how files are created by software etc. Will be asking the Research Object team for assistance with this. +# Tools + +There are a number of tools for DataCrate in development. + +At the University of Technology Sydney, the Provisioner is an open framework for integrating good research data management practices into everyday research workflows. It uses DataCrates as a flexible interchange format to move datasets between diverse research apps such as lab notebooks, code repositories (where data is included by-reference), survey tools, collection management tools, and into archival and publication workflows. 
Examples of DataCrates moving through the research lifecycle will be provided. + +HIEv DataCrate - At the Hawkesbury Institute for the Environment at Western Sydney University, HIEv harvests a wide range of environmental data (and associated file level metadata) from both automated sensor networks and analysed datasets generated by researchers. Leveraging built-in APIs within HIEv, a new packaging function has been developed, allowing for selected datasets to be identified and packaged in the DataCrate standard, complete with metadata automatically exported from the HIEv metadata holdings into the JSON-LD format. Going forward this will allow datasets within HIEv to be published regularly and in an automated fashion, in a format that will increase their potential for reuse. + +Calcytejs is a command line tool for packaging data into DataCrate, developed at the University of Technology Sydney, which allows researchers to describe any data set via the use of spreadsheets which the tool auto-creates in a directory tree. + +[Omeka DataCrate Tools](https://github.com/UTS-eResearch/omeka-datacrate-tools) is a Python tool in early development to export data from Omeka Classic repositories into the DataCrate format. + +A tool currently in development for exporting DataCrates from the Omero microscopy repository will also be presented. + + # Conclusion DataCrate (which will be in version 1 by the time of the Research Object workshop has been tested on a variety of research data sets). Some examples are: +- Data relating to the IDRC funded project (described in https://doi.org/10.3897/rio.2.e8880) [to examine data management policies and implementation for development funders](https://data.research.uts.edu.au/examples/v0.2/Data_Package-IDRC_Opportunities_and_Challenges_Open_Research_Strategies/). The project involved two parts: a review based on desk work and expert interviews and seven case studies of existing IDRC-funded projects. 
The case studies were supported by an Introductory Workshop in which the idea of data was examined and the issues involved in sharing discussed in detail. This was followed by an implementation phase in which the projects were supported in developing Data Management Plans. The performance against those plans was then assessed both by the participants and as part of the overall project to generate case studies that are to be published as part of the related RIO Journal Collection. The final project report will also be part of the same collection. + +- Some [Matlab code](https://data.research.uts.edu.au/examples/v0.2/GTM/) that + supports a research article. This is a good illustration of how common names + (like Lu, J) can be disambiguated using ORCID identifiers. + +- A small [archival collection of speleological (cave) mapping + data](https://data.research.uts.edu.au/examples/v0.2/Glop_Pot/). + +- Another cave dataset. This one is a [3D survey conducted using a lidar + scanner mounted on a + drone](https://data.research.uts.edu.au/examples/v0.2/Victoria_Arch_pub/). + +- A [sample data set with one picture in it](https://data.research.uts.edu.au/examples/v0.2/sample/). + +- Some [clinical trial + data](https://data.research.uts.edu.au/examples/v0.2/timluckett/) - this + dataset shows how researcher affiliations can be modelled using linked data. + Points to note (TODO): Looking good so far - tool developers are coming on board (Western Sydney, MIF,