Versioning of data sets and metadata

Over the last few weeks, I have been thinking (on and off) about data versioning and how to represent versioning in metadata. One of the implementations that guided my thinking is the new data versioning introduced in Zenodo and its interaction with DataCite.

zenodo versioning

Zenodo

Zenodo has recently introduced versioning for its DOIs and thereby defined a way to version data sets in Zenodo. It introduced the notion of a “concept DOI” – an additional DOI that identifies the set of versions of a data set. The “concept DOI” thus points to the abstract entity – or concept, if you will – of a data set that persists while the actual data set is replaced by newer versions.

This means that every data set is associated with at least two DOIs: the “concept DOI” and the DOI of the original version of the data set, called simply “Version 1” in Zenodo. Every new version of a data set is then associated with the same “concept DOI”. The screenshot below shows that the repository behind Zenodo knows which versions are instances of the same “concept”. Furthermore, the repository records which version is the current one and how the versions are ordered by recency.

zenodo versioning

The “concept DOI” is special among the DOIs associated with a particular set of data sets. In the Zenodo repository, it points to the most current version of the data set. This decision is addressed in the Zenodo FAQs:

Where does the Concept DOI resolve to?

Currently the Concept DOI resolves to the landing page of the latest version of your record. This is not fully correct, and in the future we will change this to create a landing page specifically representing the concept behind the record and all of its versions. (from Zenodo FAQs: Versioning)

While this solution might not be “fully correct” or conceptually pure, it is a very pragmatic solution that satisfies the requirements of most if not all users. The most current version of the data set, plus a reference to an ordered set of all previous versions, is a good representation of the abstract entity “concept” of a data set.
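The resolution behaviour described in the FAQ can be sketched in a few lines of Python. This is only an illustration, not Zenodo’s implementation: the identifiers under the 10.5555 prefix and the names VERSIONS and landing_page_doi are made up for this example.

```python
# Hypothetical concept DOI mapped to its versions, oldest first.
# The identifiers are placeholders, not real Zenodo DOIs.
VERSIONS = {
    "10.5555/example.100": ["10.5555/example.101", "10.5555/example.102"],
}


def landing_page_doi(doi: str) -> str:
    """Return the DOI whose landing page is shown for the given DOI."""
    if doi in VERSIONS:
        # A concept DOI currently resolves to the latest version.
        return VERSIONS[doi][-1]
    # A version DOI resolves to its own landing page.
    return doi


print(landing_page_doi("10.5555/example.100"))  # 10.5555/example.102
print(landing_page_doi("10.5555/example.101"))  # 10.5555/example.101
```

Note that the ordered list of versions is what makes “latest” well defined here; without the order, the concept DOI would have nothing definite to resolve to.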

The implementation of versioning in Zenodo is simple and pragmatic. Inside the Zenodo repository and its user interface it works very well and provides a good representation of the non-trivial structures that arise when repositories provide versioning of data sets.

Representing versions in DataCite

The situation changes when Zenodo’s versioning is represented in the metadata format of DataCite. DataCite Metadata provides the element relatedIdentifier, which makes it possible to reference other entities. The element features an attribute relationType with an associated controlled vocabulary:

Element            Attribute     Controlled Vocabulary
relatedIdentifier  relationType  IsCitedBy
                                 Cites
                                 IsSupplementTo
                                 IsSupplementedBy
                                 IsContinuedBy
                                 Continues
                                 HasMetadata
                                 IsMetadataFor
                                 IsNewVersionOf
                                 IsPreviousVersionOf
                                 IsPartOf
                                 HasPart
                                 IsReferencedBy
                                 References
                                 IsDocumentedBy
                                 Documents
                                 IsCompiledBy
                                 Compiles
                                 IsVariantFormOf
                                 IsOriginalFormOf
                                 IsIdenticalTo
                                 IsReviewedBy
                                 Reviews
                                 IsDerivedFrom
                                 IsSourceOf

This controlled vocabulary provides two terms to express versioning: IsNewVersionOf and IsPreviousVersionOf. These two predicates make it possible to state the relation between versions in terms of temporal order. Unfortunately, the vocabulary does not provide a predicate to unambiguously express the relation between the set of versions and the individual versions.

While Zenodo seems to have an internal representation of the relations IsNewVersionOf and IsPreviousVersionOf, it was apparently decided that what should be represented in DataCite metadata is the relation between the set of versions and the versions as members of this set. Since the metadata schema lacks the vocabulary to express this relation, Zenodo uses the predicates IsPartOf and HasPart.

IsPartOf and HasPart are intended to express meronymic relationships between data sets and parts of these sets. The relationship between the set of versions – represented by the “concept DOI” 10.5281/zenodo.837256 – and one particular version (doi:10.5281/zenodo.837257) is expressed as follows in the metadata of the version and of the concept, respectively:

<!-- Record of the version (doi:10.5281/zenodo.837257) -->
<alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">
        https://zenodo.org/record/10.5281/zenodo.837257
    </alternateIdentifier>
</alternateIdentifiers>
<relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">
        10.5281/zenodo.837256
    </relatedIdentifier>
</relatedIdentifiers>
<!-- Record of the set of versions ("concept DOI" 10.5281/zenodo.837256) -->
<alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">
        https://zenodo.org/record/10.5281/zenodo.837256
    </alternateIdentifier>
</alternateIdentifiers>
<relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">
        10.5281/zenodo.837257
    </relatedIdentifier>
</relatedIdentifiers>

While this solution does express the set-membership relationship between the set of versions and a specific version, it does so with predicates that are otherwise used to express the relation between a composite data set and its individually referenced parts. Furthermore, Zenodo’s DataCite metadata does not provide any information about the order of the different versions. In fact, the metadata of a version does not reference any other version, but only the set of versions.
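To make this concrete, here is a small sketch that reads the relatedIdentifier elements of the version record shown above with Python’s standard library. It assumes a simplified fragment wrapped in a resource root element, without the XML namespace that full DataCite records declare; the function name related_identifiers is my own.

```python
import xml.etree.ElementTree as ET

# Simplified fragment of the version record (no namespace, assumed root element).
VERSION_RECORD = """
<resource>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">
      10.5281/zenodo.837256
    </relatedIdentifier>
  </relatedIdentifiers>
</resource>
"""


def related_identifiers(xml_text):
    """Return (relationType, identifier) pairs found in a metadata fragment."""
    root = ET.fromstring(xml_text.strip())
    return [
        (element.get("relationType"), (element.text or "").strip())
        for element in root.iter("relatedIdentifier")
    ]


relations = related_identifiers(VERSION_RECORD)
# Keep only the statements that say something about the order of versions.
version_links = [
    pair for pair in relations
    if pair[0] in ("IsNewVersionOf", "IsPreviousVersionOf")
]

print(relations)      # [('IsPartOf', '10.5281/zenodo.837256')]
print(version_links)  # []
```

The only statement a consumer can extract from the version record is its membership in the set of versions; there is nothing to filter for when looking for a predecessor or successor.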

Let me show you what that means for the representation of Zenodo’s versioning of data sets in DataCite and in particular in DataCite’s user interface.

Example case

As an example, we will take a look at the following data set:

Karimzadeh, Mehran, Ernst, Carl, Kundaje, Anshul, Hoffman, Michael M., 2017. Umap and Bismap: quantifying genome and methylome mappability. 10.5281/zenodo.705645

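This data set and its two versions (v1.0 and v1.1) are represented by these three DOIs:

  • 10.5281/zenodo.705645 (“concept DOI”)
  • 10.5281/zenodo.60943 (v1.0)
  • 10.5281/zenodo.800648 (v1.1)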

The relevant parts of the DataCite metadata XML and the representation of this information in the DataCite search user interface are discussed in the following sections.

Concept dataset (10.5281/zenodo.705645)

The set of versions is represented by the “concept DOI” and its associated metadata. The metadata contains two relatedIdentifier elements of the HasPart type:

<relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">
        10.5281/zenodo.60943
    </relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">
        10.5281/zenodo.800648
    </relatedIdentifier>
</relatedIdentifiers>

The interface represents this information by displaying the metadata associated with the “concept DOI” and under the heading “Related Works” the metadata associated with the two versions:

concept dataset

DataCite does not seem to have any knowledge of the order relation between the two versions, and in particular it has no concept of which version is the current one. In the case of two versions this is inconvenient, but for data sets with twenty or more versions this representation is impractical.

Old version (10.5281/zenodo.60943)

The metadata of version 1.0 contains a single element relatedIdentifier of the IsPartOf type referencing the set of versions by means of the “concept DOI”:

<relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">
        10.5281/zenodo.705645
    </relatedIdentifier>
</relatedIdentifiers>

There is no reference to any other version of the data set. This is particularly striking as the other version (v1.1) is the most current version. Accordingly, the DataCite interface displays only one related work: the metadata associated with the DOI of the set of versions.

concept dataset

The most striking aspect of this representation is that there is no indication that a newer version of this data set exists, nor that the related work is the abstract set of versions rather than another version.

New version (10.5281/zenodo.800648)

Just like the metadata of version 1.0, the metadata of version 1.1 contains a single element relatedIdentifier of the IsPartOf type referencing the set of versions by means of the “concept DOI”:

<relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">
        10.5281/zenodo.705645
    </relatedIdentifier>
</relatedIdentifiers>

There is again no reference to the other – the original – version, and again only the metadata associated with the DOI of the set of versions is displayed, without any indication that this related work is the abstract set of versions and not another version. There is also no indication that this is the current version of the data set.

concept dataset
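The three records can be put side by side in a short Python sketch to see what a consumer of this metadata can and cannot derive. The relation pairs are the ones shown in the XML above; the data structure and the function names other_versions and order_statements are assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple

# (relationType, related DOI) pairs as they appear in the three records above.
RELATED: Dict[str, List[Tuple[str, str]]] = {
    "10.5281/zenodo.705645": [
        ("HasPart", "10.5281/zenodo.60943"),
        ("HasPart", "10.5281/zenodo.800648"),
    ],
    "10.5281/zenodo.60943": [("IsPartOf", "10.5281/zenodo.705645")],
    "10.5281/zenodo.800648": [("IsPartOf", "10.5281/zenodo.705645")],
}

ORDER_PREDICATES = {"IsNewVersionOf", "IsPreviousVersionOf"}


def other_versions(version_doi: str) -> Set[str]:
    """Find sibling versions by taking the detour over the concept DOI."""
    siblings: Set[str] = set()
    for relation, concept in RELATED[version_doi]:
        if relation != "IsPartOf":
            continue
        for rel, member in RELATED.get(concept, []):
            if rel == "HasPart" and member != version_doi:
                siblings.add(member)
    return siblings


def order_statements() -> List[Tuple[str, str, str]]:
    """Collect every statement that says which version is newer or older."""
    return [
        (doi, relation, target)
        for doi, relations in RELATED.items()
        for relation, target in relations
        if relation in ORDER_PREDICATES
    ]


print(other_versions("10.5281/zenodo.60943"))  # {'10.5281/zenodo.800648'}
print(order_statements())                      # []
```

The other version can be found, but only in two hops and only as a member of an unordered set; no statement in any of the three records says which version is newer.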

Versioning in DataCite and Zenodo’s solution

The DataCite metadata format allows the representation of data versioning. However, it can only express the order of precedence between different versions, using the relations IsNewVersionOf and IsPreviousVersionOf.

zenodo versioning

While this representation identifies the original and the current version implicitly, DataCite’s vocabulary does not allow any unambiguous reference to the set of versions.
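If version DOIs were linked with IsNewVersionOf, the order could be reconstructed by walking the chain. The following is a minimal sketch, assuming a linear history in which each version supersedes exactly one predecessor. The identifiers and the function name order_versions are invented for this example, and this is not metadata that Zenodo publishes.

```python
from typing import Dict, List


def order_versions(is_new_version_of: Dict[str, str]) -> List[str]:
    """Order versions from original to current.

    `is_new_version_of` maps each DOI to the DOI it supersedes,
    i.e. one entry per IsNewVersionOf statement.
    """
    newer = set(is_new_version_of)
    older = set(is_new_version_of.values())
    originals = older - newer  # versions that supersede nothing
    successor = {old: new for new, old in is_new_version_of.items()}
    if len(originals) != 1 or len(successor) != len(is_new_version_of):
        raise ValueError("expected a single, unbranched chain of versions")
    chain = [next(iter(originals))]
    while chain[-1] in successor:
        chain.append(successor[chain[-1]])
    if len(chain) != len(newer | older):
        raise ValueError("some versions are not connected to the original")
    return chain


# Hypothetical statements, deliberately listed out of order.
links = {
    "10.5555/example.3": "10.5555/example.2",
    "10.5555/example.2": "10.5555/example.1",
}
chain = order_versions(links)
print(chain[0])   # original version: 10.5555/example.1
print(chain[-1])  # current version: 10.5555/example.3
```

The original is the one version that supersedes nothing, and the current version is the one that nothing supersedes. Both fall out of the pairwise statements, but the set of versions itself never gets an identifier of its own.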

Zenodo’s choice to use the predicates IsPartOf and HasPart results in a representation that does not directly express any relation between individual versions. More importantly, it represents the versions not as an ordered set but just as a set, and thus does not provide information about which version is the most current instance of a data set or, maybe less importantly, which version is the original.

zenodo versioning

Additionally, the predicates imply a part-whole relationship that is different from the relation between particular versions of a data set and an abstract representation of the set of these versions.

Versioning

Versioning is a non-trivial feature of data representation. Versions form an ordered set and a system should provide means to refer to each individual version and to the set of versions. A good representation can answer the following questions:

  • Is this a “concept” of a data set or a version of a data set?
  • Of which data set is this instance a version?
  • Which instances are versions of this data set?
  • Is this the current version of the data set?
  • What is the current version of the data set?
  • What is the original version of this data set?

In fact, the internal versioning in Zenodo’s repository as well as its representation in the interface does answer all these questions, while the representation of Zenodo’s data versioning in DataCite metadata cannot answer any of them unambiguously.
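To illustrate how little structure is needed to answer the six questions, here is a sketch of a data model consisting of a concept identifier plus an ordered sequence of versions. The class VersionedDataSet and its method names are hypothetical and say nothing about how Zenodo stores versions internally; the DOIs are those of the example above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VersionedDataSet:
    concept_doi: str
    version_dois: Tuple[str, ...]  # ordered, oldest first

    def is_concept(self, doi: str) -> bool:
        # Is this a "concept" of a data set or a version of a data set?
        return doi == self.concept_doi

    def concept_of(self, doi: str) -> Optional[str]:
        # Of which data set is this instance a version?
        return self.concept_doi if doi in self.version_dois else None

    def versions(self) -> Tuple[str, ...]:
        # Which instances are versions of this data set?
        return self.version_dois

    def is_current(self, doi: str) -> bool:
        # Is this the current version of the data set?
        return doi == self.current()

    def current(self) -> str:
        # What is the current version of the data set?
        return self.version_dois[-1]

    def original(self) -> str:
        # What is the original version of this data set?
        return self.version_dois[0]


umap = VersionedDataSet(
    concept_doi="10.5281/zenodo.705645",
    version_dois=("10.5281/zenodo.60943", "10.5281/zenodo.800648"),
)
print(umap.is_current("10.5281/zenodo.60943"))  # False
print(umap.current())                           # 10.5281/zenodo.800648
print(umap.original())                          # 10.5281/zenodo.60943
```

Everything beyond the first three questions depends on the order of the sequence, which is exactly the piece of information that IsPartOf and HasPart do not carry.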

While it is not necessary to make all relations explicit when publishing metadata, it is crucial to understand that different systems might expect information about different aspects of the structure and may derive other aspects by applying additional heuristics to this information. Zenodo’s choice of DataCite’s predicates IsPartOf and HasPart is a good example of such a mismatch of representations, resulting in the loss of most of the information available in Zenodo’s user interface.