2013-11-07

Quick review - DataCite Metadata Schema v3.0

Data publication is a recent movement in the data-intensive sciences. It tries to establish the provision of data as a part of scholarly work that deserves acknowledgement, and therefore tries to make data citable from the literature.  ICSU WDS has established a working group on data publication, and DataCite is a non-profit organization (seemingly) involved in it.

They published the DataCite Metadata Schema v3.0, which is interesting.



Very frankly speaking, Dublin Core is compact and universal, but it was not necessarily designed for scientific data.  ISO 19115/19139, on the contrary, is full of all sorts of conceivable features, which means it is up to each application profile to find a really practical subset.  I work on the WMO profile of ISO 19115, so I'm always curious about any compact and practical schema that comes from outside 19115, in order to check the soundness of my own ideas.

The DataCite schema is written as a bunch of XSD files, which I'm not good at reading.  I've converted it into RELAX NG Compact Syntax for reference.

Overall Structure

The schema is mostly flat.  Most elements are placed directly under the root element resource, and the order doesn't matter.  That's a very good thing.  It's not easy for anybody to remember, discuss, and code many lines of XPath.

A notable difference from the "oai_dc" schema for Dublin Core is that some elements (like creator) are placed in a container with the same name in plural form (like creators).  This is a trick to allow multiple instances of such elements while limiting other elements (like identifier) to a single instance.  The xsd:all construct lets children appear in any order, but unfortunately it puts the same cardinality limit on all of them: each child may occur at most once.  If only they had chosen RELAX NG as the schema language!

Anyway, that difference indicates that the authors thought more carefully about the cardinality required of each metadata element.  For example, it is no good to have multiple identifiers, while it is acceptable to have multiple alternate identifiers.  The schema reflects such thoughts.
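To make the flat layout and the plural containers concrete, here is a minimal sketch in Python (my choice of language, nothing DataCite provides).  The record is invented for illustration: the DOI, names, and title are placeholders, and the namespace URI is the one I understand the v3 kernel to use, so check it against the official XSD.

```python
import xml.etree.ElementTree as ET

# Namespace of the v3 kernel (assumed; verify against the official XSD).
NS = {"d": "http://datacite.org/schema/kernel-3"}

# Hypothetical record: the DOI, names, and title are placeholders.
SAMPLE = """\
<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <creators>
    <creator><creatorName>Doe, Jane</creatorName></creator>
    <creator><creatorName>Roe, Richard</creatorName></creator>
  </creators>
  <titles>
    <title>Example surface observations</title>
  </titles>
  <publisher>Example Data Centre</publisher>
  <publicationYear>2013</publicationYear>
</resource>
"""


def summarize(xml_text):
    """Collect the single-valued and the repeatable fields of one record."""
    root = ET.fromstring(xml_text)
    return {
        # Single-valued elements sit directly under the root element.
        "identifier": root.findtext("d:identifier", namespaces=NS),
        "publisher": root.findtext("d:publisher", namespaces=NS),
        "year": root.findtext("d:publicationYear", namespaces=NS),
        # Repeatable elements live inside a container with the plural name.
        "creators": [e.text for e in root.findall("d:creators/d:creator/d:creatorName", NS)],
        "titles": [e.text for e in root.findall("d:titles/d:title", NS)],
    }


record = summarize(SAMPLE)
```

Every path is at most three steps deep, which is the practical benefit of the flat design.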

Resource Identifiers

Three types of resource identifiers are defined: identifier, alternateIdentifier, and relatedIdentifier.  The identifier must appear once and only once, and currently it must be a DOI.  Other kinds of identifiers for the same resource go to alternateIdentifier, whose attribute @alternateIdentifierType describes the type, for example ISBN (or probably URL).  The last one, relatedIdentifier, holds the identifier of another resource that stands in some relation (indicated by @relationType) such as Cites, IsPartOf, IsCompiledBy, etc.

It's interesting that the attribute relatedIdentifier/@relatedIdentifierType is a controlled vocabulary but is not restricted to DOI.  It looks like a practical decision to allow links into the network of well-known catalogue identifiers such as ISBN.

Another notable thing is the relationType code IsIdenticalTo.  It is anticipated that the very same resource may have more than one DOI because of different locations, or more precisely, custodians working in parallel.

Anyway it's worth remarking again that a resource must have a DOI in order to join the catalogue using the DataCite schema.
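The identifier rules above can be sketched as a small check in Python.  This is only an illustration under my own assumptions: the record, both DOIs, the URL, and the ISBN are placeholders, the function names are mine, and the namespace URI should be verified against the official XSD.

```python
import xml.etree.ElementTree as ET

# Namespace of the v3 kernel (assumed; verify against the official XSD).
NS = {"d": "http://datacite.org/schema/kernel-3"}

# Hypothetical record: every identifier value below is a placeholder.
SAMPLE = """\
<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="URL">http://example.org/data/1</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="ISBN" relationType="IsPartOf">0000000000</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsIdenticalTo">10.5678/mirror</relatedIdentifier>
  </relatedIdentifiers>
</resource>
"""


def primary_doi(root):
    """Return the DOI, insisting on exactly one identifier of type DOI."""
    found = root.findall("d:identifier", NS)
    if len(found) != 1:
        raise ValueError("expected exactly one identifier, got %d" % len(found))
    if found[0].get("identifierType") != "DOI":
        raise ValueError("the primary identifier must be a DOI")
    return found[0].text


def related_by_relation(root):
    """Group related identifiers as {relationType: [(type, value), ...]}."""
    grouped = {}
    for elem in root.findall("d:relatedIdentifiers/d:relatedIdentifier", NS):
        pair = (elem.get("relatedIdentifierType"), elem.text)
        grouped.setdefault(elem.get("relationType"), []).append(pair)
    return grouped


root = ET.fromstring(SAMPLE)
doi = primary_doi(root)
related = related_by_relation(root)
```

Looking up related["IsIdenticalTo"] gives the DOIs that other custodians assigned to the same resource.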


Dates

Two dates are defined: publicationYear and date.  As in ISO 19115, they have different structures.  The publicationYear is mandatory, unrepeatable, and year-only, and no annotation attribute is defined for it.  It is used in the citation text.

The date is the opposite: optional, repeatable, polymorphic (year, date, datetime, or range), and it requires @dateType, such as Submitted, Accepted, Available, or Issued.  Special attention seems to be paid to embargo - the Available date (and publicationYear) is the end of the embargo, and the Submitted and Accepted dates come before that, indicating the beginning of the embargo period.
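Here is a rough Python sketch of how the embargo period could be read out of the typed dates.  The dates in the sample are made up, the function names are mine, and for simplicity I assume the values are plain YYYY-MM-DD dates, although the schema also allows years, datetimes, and ranges.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Namespace of the v3 kernel (assumed; verify against the official XSD).
NS = {"d": "http://datacite.org/schema/kernel-3"}

# Hypothetical record with made-up dates.
SAMPLE = """\
<resource xmlns="http://datacite.org/schema/kernel-3">
  <publicationYear>2014</publicationYear>
  <dates>
    <date dateType="Submitted">2013-06-01</date>
    <date dateType="Accepted">2013-09-15</date>
    <date dateType="Available">2014-01-01</date>
  </dates>
</resource>
"""


def dates_by_type(root):
    """Map dateType to its text (keeps the last value if a type repeats)."""
    return {d.get("dateType"): d.text for d in root.findall("d:dates/d:date", NS)}


def embargo_days(root):
    """Days from Submitted to Available, assuming plain YYYY-MM-DD values."""
    found = dates_by_type(root)
    start = date.fromisoformat(found["Submitted"])
    end = date.fromisoformat(found["Available"])
    return (end - start).days


root = ET.fromstring(SAMPLE)
typed = dates_by_type(root)
days = embargo_days(root)
```

A real implementation would have to handle the other value forms (year only, datetime, range) before comparing.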

Responsible Parties

There are three elements for responsible parties: creator, publisher, and contributor.  The first two are required.

I'm surprised that they have different internal structures - the publisher is only a single free-text field, while the others may have a "nameIdentifier" substructure in addition to the free-text name.  That illustrates that the primary motivation of DataCite is to encourage fair attribution to researchers, using identifiers such as ORCID or ISNI.  The publisher is only used in the citation text, so only the name is needed.
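The asymmetry between creator and publisher shows up clearly when reading a record.  The following Python sketch is an illustration only: the names and the all-zero ORCID are placeholders, the function name is mine, and the attribute names nameIdentifierScheme and schemeURI are as I read them in the schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the v3 kernel (assumed; verify against the official XSD).
NS = {"d": "http://datacite.org/schema/kernel-3"}

# Hypothetical record: names and the all-zero ORCID are placeholders.
SAMPLE = """\
<resource xmlns="http://datacite.org/schema/kernel-3">
  <creators>
    <creator>
      <creatorName>Doe, Jane</creatorName>
      <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0000-0000-0000</nameIdentifier>
    </creator>
    <creator>
      <creatorName>Roe, Richard</creatorName>
    </creator>
  </creators>
  <publisher>Example Data Centre</publisher>
</resource>
"""


def creators_with_ids(root):
    """Return (name, scheme, identifier) per creator; None where absent."""
    rows = []
    for creator in root.findall("d:creators/d:creator", NS):
        name = creator.findtext("d:creatorName", namespaces=NS)
        ident = creator.find("d:nameIdentifier", NS)
        if ident is None:
            rows.append((name, None, None))
        else:
            rows.append((name, ident.get("nameIdentifierScheme"), ident.text))
    return rows


root = ET.fromstring(SAMPLE)
creators = creators_with_ids(root)
# The publisher has no substructure: it is plain text only.
publisher = root.findtext("d:publisher", namespaces=NS)
```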


Multilingualization

The attribute xml:lang is accepted on three elements - title, subject, and description.  As far as I remember, the discussion on multilingualization in the WMO/WIS community has focused on title and abstract.

Subject Thesaurus and Rights URI

The subject corresponds to keywords in ISO 19115.  It is basically the same in that the thesaurus (not the individual keyword) may have a machine-readable code or identifier.

However, the rights element may have a @rightsURI attribute, which identifies the individual term (such as WMOEssential) instead of the classification scheme (such as WMO Resolution 40).  This is different from the subject but should be practical, because the URI here is used to display a hyperlink, while the thesaurus identifier is just for identification.
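The difference between the two kinds of URI can be sketched in Python as below.  Everything in the sample is a placeholder of mine (the thesaurus name, both example.org URIs, the keyword), the function names are mine, and I am assuming the attribute names subjectScheme, schemeURI, and rightsURI and the rightsList container as I read them in the schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the v3 kernel (assumed; verify against the official XSD).
NS = {"d": "http://datacite.org/schema/kernel-3"}

# Hypothetical record: scheme name, keyword, and URIs are placeholders.
SAMPLE = """\
<resource xmlns="http://datacite.org/schema/kernel-3">
  <subjects>
    <subject subjectScheme="Example Thesaurus" schemeURI="http://example.org/thesaurus">precipitation</subject>
  </subjects>
  <rightsList>
    <rights rightsURI="http://example.org/rights/WMOEssential">WMOEssential</rights>
  </rightsList>
</resource>
"""


def subject_links(root):
    """Pair each keyword with the URI of its scheme, not of the keyword."""
    return [(s.text, s.get("schemeURI")) for s in root.findall("d:subjects/d:subject", NS)]


def rights_links(root):
    """Pair each rights statement with the URI of that statement itself."""
    return [(r.text, r.get("rightsURI")) for r in root.findall("d:rightsList/d:rights", NS)]


root = ET.fromstring(SAMPLE)
subjects = subject_links(root)
rights = rights_links(root)
```

Only the second list is directly usable as link text plus link target in a display.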

Version

A resource may have a version.  A major version change needs a new identifier.  But the resource behind an identifier is allowed to change; that is called a minor version change.

Georeference


A point, a box, or a place name may be used.  There is no special element for vertical location - probably a subject keyword is used for that purpose.  There is no special element for time as location either, but there is date[@dateType='Valid'] instead, which may be a date or a datetime range.
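As a sketch of how these could be read, here is some Python under my own assumptions: that a point is a whitespace-separated latitude-longitude pair, that a box is two such pairs (lower corner first), and that a range is written as two dates joined by a slash.  The place name, coordinates, and dates are invented, and the function names are mine.

```python
import xml.etree.ElementTree as ET

# Namespace of the v3 kernel (assumed; verify against the official XSD).
NS = {"d": "http://datacite.org/schema/kernel-3"}

# Hypothetical record: place name, coordinates, and dates are invented.
SAMPLE = """\
<resource xmlns="http://datacite.org/schema/kernel-3">
  <dates>
    <date dateType="Valid">2013-01-01/2013-12-31</date>
  </dates>
  <geoLocations>
    <geoLocation>
      <geoLocationPlace>Example Bay</geoLocationPlace>
      <geoLocationPoint>35.68 139.77</geoLocationPoint>
      <geoLocationBox>30.0 130.0 40.0 145.0</geoLocationBox>
    </geoLocation>
  </geoLocations>
</resource>
"""


def numbers(text):
    """Split a whitespace-separated coordinate string into floats."""
    return tuple(float(part) for part in text.split())


def location(root):
    """Return place, point (lat, lon), and box (two lat-lon pairs)."""
    loc = root.find("d:geoLocations/d:geoLocation", NS)
    return {
        "place": loc.findtext("d:geoLocationPlace", namespaces=NS),
        "point": numbers(loc.findtext("d:geoLocationPoint", namespaces=NS)),
        "box": numbers(loc.findtext("d:geoLocationBox", namespaces=NS)),
    }


def valid_range(root):
    """Return the (start, end) strings of the Valid date range."""
    elem = root.find("d:dates/d:date[@dateType='Valid']", NS)
    start, end = elem.text.split("/")
    return start, end


root = ET.fromstring(SAMPLE)
where = location(root)
when = valid_range(root)
```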





