2013-08-27

libxml2 version 2.8.0 or later incorporates the patch for GML XSD

Daniel Veillard's libxml2 is one of the most widely available XML parser libraries, and it includes an implementation of an XSD validator.  Versions up to and including 2.7.8 had a bug that broke the xmllint(1) command when it was used with the XSD schemas of GML, and hence those of ISO 19139.

I wrote a patch in 2011, and I have now found that it was incorporated in version 2.8.0.  Great!

2013-08-19

Update and maintenance of WIS OAI Monitor: id case insensitivity and reliability of incremental harvesting

Over the past few weeks I did extensive (but unfortunately incomplete) maintenance work on the site that monitors OAI-PMH synchronization of WIS Discovery metadata.  I believe I ought to share what is happening.

= 1. Reliability of Incremental Harvesting

My monitor had been showing an unnaturally large number of diffs (differences between copies of the same metadata set held at different centres).  After I performed a full harvest, the number dropped significantly.

That means that the changes (mostly additions) were not being picked up by incremental harvesting, i.e. the OAI-PMH "ListIdentifiers" request with the "from=" parameter set to a time shortly (24 hours in my case) before the present.  I think this is due to some problem in implementation or operation, but I still need more information before I can give effective guidance.
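For readers unfamiliar with the request, here is a minimal Python sketch of how such an incremental "ListIdentifiers" URL can be built.  It is not the monitor's actual code.  The base URL, the function name and the "oai_dc" metadata prefix are assumptions made for illustration, and the timestamp uses seconds granularity, which a given repository may or may not support.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def incremental_request_url(base_url, now, window_hours=24, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers URL asking for changes in the last window_hours.

    base_url and metadata_prefix are illustrative assumptions, not values
    taken from any real WIS centre.
    """
    # OAI-PMH datestamps are expressed in UTC.
    since = now.astimezone(timezone.utc) - timedelta(hours=window_hours)
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "from": since.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; the date is fixed so the result is reproducible.
now = datetime(2013, 8, 19, 12, 0, 0, tzinfo=timezone.utc)
url = incremental_request_url("https://oai.example.org/oai", now)
print(url)
```

Any record whose datestamp on the server is earlier than the "from=" value is invisible to this request, so a server that reports datestamps unreliably would silently drop changes in exactly the way described above.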

In any case, the design strategy of the monitoring software must change.

At present it retrieves full OAI sets every 3 hours from relatively fast servers (>300 records/sec), and retrieves only "increments" (changes within the past 24 hours) from relatively slow servers (around 3 records/sec).  This split was deemed necessary to achieve high temporal resolution, but unfortunately the slow servers are exactly where incremental harvesting loses changes.  As a result, under the current situation, reliable monitoring has to use full harvesting for all kinds of servers.

The entire WIS data catalogue currently consists of 1.4e5 records.  Simple arithmetic shows that a full harvest from a 3 records/sec server takes about 13 hours.  In practice it is not reasonable to sustain even daily monitoring at that cost, since monitoring should not take up the majority of a server's resources.
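The arithmetic can be checked with a few lines of Python.  The record count and the two rates are the figures quoted above; the function name is just for illustration, and the result ignores any per-request overhead.

```python
TOTAL_RECORDS = 1.4e5  # size of the WIS catalogue quoted above


def full_harvest_hours(records, records_per_sec):
    """Time in hours to harvest `records` at a steady `records_per_sec`."""
    return records / records_per_sec / 3600.0


slow_hours = full_harvest_hours(TOTAL_RECORDS, 3)            # slow server
fast_minutes = full_harvest_hours(TOTAL_RECORDS, 300) * 60   # fast server

print("slow server: %.1f hours" % slow_hours)
print("fast server: %.1f minutes" % fast_minutes)
```

At 3 records/sec a single full harvest occupies more than half of a day, while at 300 records/sec it is finished in well under ten minutes, which is why the 3-hour cycle is only practical for the fast servers.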

Something has to be done.

= 2. Case-insensitivity of Metadata id's

Another source of the increase in the number of diffs is the case-insensitivity of the metadata identifier.  The first meeting of IPET-MDI agreed that the identifier should be treated as case-insensitive, and this was written into the WMO Core Metadata Profile, which is now an appendix to the Manual on WIS.

Before 2013-08-16T12Z, the monitor compared the OAI sets in a case-sensitive manner.  I did not notice any problem when the software was written.  Recently, however, some records appeared whose ids were partly lowercased.  Some centres followed the change and others did not.  I strongly suspect that synchronization does not work for the latter, but that is outside the scope of id-only monitoring.

Anyway, a rule is a rule, so I changed the software to compute case-insensitive diffs.
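The idea can be sketched in Python as follows.  This is only an illustration of a case-insensitive comparison, not the monitor's actual code; the function name and the sample identifiers are made up.

```python
def case_insensitive_diff(ids_a, ids_b):
    """Return (only_in_a, only_in_b), treating ids that differ only in case as equal.

    The original spelling of each id is kept for reporting.
    """
    a = {i.casefold(): i for i in ids_a}
    b = {i.casefold(): i for i in ids_b}
    only_a = sorted(a[k] for k in a.keys() - b.keys())
    only_b = sorted(b[k] for k in b.keys() - a.keys())
    return only_a, only_b


# Made-up identifiers: the first record differs between centres only in case.
centre_a = ["urn:x-wmo:md:int.wmo.wis::ABCD01XXXX",
            "urn:x-wmo:md:int.wmo.wis::EFGH02XXXX"]
centre_b = ["urn:x-wmo:md:int.wmo.wis::abcd01XXXX",
            "urn:x-wmo:md:int.wmo.wis::IJKL03XXXX"]

only_a, only_b = case_insensitive_diff(centre_a, centre_b)
case_sensitive_count = len(set(centre_a) ^ set(centre_b))
print(only_a, only_b, case_sensitive_count)
```

With these samples a case-sensitive comparison reports four differing ids, while the case-insensitive one reports only the two records that are really missing from one side.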