Yesterday I got stuck on a tricky feature of XML.
I had to remove redundant default namespace declarations (the things that look like an "xmlns" attribute) from a large number of very large XML documents. The job is not finished yet, and this is an interim memo.
Before actually doing the job, I wanted to see how many instances occur, and where. Surprisingly, XSLT cannot do this. The "namespace::" axis of XPath does not find the literal declarations in the XML serialization; it matches conceptual namespace nodes, which are propagated to all descendant elements [XPath]. So it cannot tell a redundant namespace declaration from an inherited one.
I think it is necessary to write a program with an XML parser. I wrote a Ruby script that uses the libxml2 Reader interface.
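The idea can be sketched as follows. This is not the Ruby script itself: it is a minimal Python illustration that uses the standard library's xml.etree.ElementTree instead of the libxml2 Reader, and the function name find_redundant_default_ns is made up for this example. It relies on the parser reporting each literal namespace declaration as a "start-ns" event just before the element that carries it, and it flags a default namespace declaration as redundant when the same URI is already in scope from an ancestor.

```python
import io
import xml.etree.ElementTree as ET


def find_redundant_default_ns(source):
    """Return (tag, uri) for each xmlns="..." that repeats the default
    namespace already in scope on the parent element."""
    scope = [""]      # default namespace URI at each open depth ("" = none)
    declared = []     # default namespace URIs declared on the next element
    hits = []
    events = ("start-ns", "start", "end")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            prefix, uri = payload
            if prefix == "":          # the default namespace has an empty prefix
                declared.append(uri)
        elif event == "start":
            inherited = scope[-1]
            current = inherited
            for uri in declared:
                if uri == inherited:  # same URI as the parent: redundant
                    hits.append((payload.tag, uri))
                current = uri
            declared = []
            scope.append(current)
        else:                         # "end"
            scope.pop()
            payload.clear()           # keep memory use low on huge documents
    return hits


sample = b'<a xmlns="urn:x"><b xmlns="urn:x"><c xmlns="urn:y"/></b><d/></a>'
print(find_redundant_default_ns(io.BytesIO(sample)))
```

The sketch only reports where the redundant declarations are; removing them is a separate step.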
No warranty. This is not to be considered the official position of my employer or of other international bodies.
2013-12-18
XSLT to extract metadata from OAI-PMH GetRecord response
Between the GISCs of the WMO Information System, metadata records are exchanged using OAI-PMH. OAI-PMH is an HTTP-based protocol in which the server's response is an XML document that encapsulates metadata record(s).
It sounds so easy: just extract /OAI-PMH/GetRecord/record/metadata/gmd:MD_Metadata. The following command should suffice:
$ xmllint --xpath '//*[local-name()="MD_Metadata"][1]' input.xml > output.xml
Even with an older version of libxml2, a few lines of equivalent XSLT would do the same job. That is what I thought until today, but it did not work.
WMO Core Metadata Profile version 1.3 prohibits the use of a default namespace declaration. But the above command produces an unwanted default namespace declaration if the OAI-PMH response uses one. Oh no....!
Removing a namespace declaration is a bit tricky. The exclude-result-prefixes attribute works only on literal result elements. That means you have to write <gmd:MD_Metadata> literally instead of using xsl:copy-of, xsl:copy, or xsl:element.
<xsl:stylesheet version="1.0"
  xmlns:gmd="http://www.isotc211.org/2005/gmd"
  xmlns:oai="http://www.openarchives.org/OAI/2.0/"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  exclude-result-prefixes="oai" >

  <xsl:output method="xml" omit-xml-declaration="no" />

  <xsl:template match="/">
    <xsl:apply-templates select=".//oai:metadata/*[1]"/>
  </xsl:template>

  <xsl:template match="gmd:MD_Metadata">
    <!-- this is the literal result element -->
    <gmd:MD_Metadata>
      <!-- you might wish to override xsi:schemaLocation by ISO standard -->
      <xsl:apply-templates select="*|@*|text()"/>
    </gmd:MD_Metadata>
  </xsl:template>

  <xsl:template match="*">
    <!-- xsl:copy-of brings undesirable xmlns= even under MD_Metadata -->
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:copy-of select="." />
  </xsl:template>

</xsl:stylesheet>
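For comparison, the same extraction can be sketched outside XSLT. The following is only a minimal illustration in Python with the standard library's xml.etree.ElementTree, not a replacement for the stylesheet. The function name extract_md_metadata and the sample document are made up for this example. It relies on the fact that ElementTree re-serializes a subtree with prefixed names only and does not write a default namespace declaration unless asked to. One caveat: ElementTree does not track prefixes that appear inside attribute values or text, so documents relying on those would need extra care.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
# Prefixes to use on output; anything not listed gets a generated ns0, ns1, ...
PREFIXES = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def extract_md_metadata(oai_response):
    """Return gmd:MD_Metadata from a GetRecord response as a string
    without any default namespace declaration."""
    for prefix, uri in PREFIXES.items():
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(oai_response)
    path = ".//{%s}metadata/{%s}MD_Metadata" % (OAI, PREFIXES["gmd"])
    record = root.find(path)
    if record is None:
        raise ValueError("no gmd:MD_Metadata found under oai:metadata")
    record.tail = None  # drop whitespace that followed the element
    return ET.tostring(record, encoding="unicode")


# Hypothetical, heavily shortened GetRecord response for illustration.
sample = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<GetRecord><record><metadata>"
    '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd">'
    "<gmd:language/>"
    "</gmd:MD_Metadata>"
    "</metadata></record></GetRecord></OAI-PMH>"
)
print(extract_md_metadata(sample))
```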
2013-12-17
Note on NASA DIF (Directory Interchange Format) and GCMD Keywords
For a long time I knew only the name of DIF (Directory Interchange Format), which is used in GCMD (Global Change Master Directory), a catalogue operated by NASA. Recently I have been interacting with more people who are interested in using GCMD keywords in the WMO/WIS Discovery Metadata, which is an extension of ISO 19139.
Resources I found in a quick search:
- DIF Writer's Guide
  - An entry point to most of the official resources.
  - Contains an XML Schema or template that illustrates the structure.
  - Some elements refer to keywords, i.e. controlled vocabulary.
- NASA gives a mapping table to ISO 19115
  - Some elements seem to be given in old names and structure.
- AADC (Australian Antarctic Data Centre) provides a converter into various profiles of ISO 19115/19139
  - Probably well done, but apparently using some extension to XSLT.
In the original mapping by GCMD, there is no such issue. The ISO element "keyword" is mapped only from "Keyword" in DIF, which is free text. But the most complex DIF element, "Parameters", is mapped to the old ISO element "category", which has probably been superseded by topicCategory; that is unfortunately an enumeration and hence no longer extensible. So, really unfortunately, the mapping has little meaning today.
So I moved on to a more realistic mapping implemented by AADC. It creates MD_Keywords from the following DIF elements:
- Parameters - Controlled
- Keyword - Free text
- Sensor_Name (Instruments) - Controlled
- Source_Name (Platform) - Controlled
- Paleo_Temporal_Coverage - Free numeric date range
- Paleo_Temporal_Coverage/Chronostratigraphic_Unit - Controlled, but the list is not online; probably something like the one shown in Wikipedia
- Project - Controlled
- IDN_Node - some identifier I don't know
- Location - Controlled
TT-ApMD-2 (see para 28) was aware of that situation, and recommended slightly changing thesaurusName/*/title, as follows:
"NASA/Global Change Master Directory (GCMD) Earth Science Keywords. Version 8.0.0.0.0. (for theme)"
I know this is ugly and opinions still differ; I really hope we reach some agreement....
WMO Common Code Table C-15 (Physical quantities) and QUDT unit of measurement
A new common code table, C-15 (Physical quantities), is under development in the WMO Manual on Codes. It is to be served as an online registry at http://codes.wmo.int/common/c-15. In my understanding, the primary motivation at the moment is to provide semantic descriptions of the quantities used in the Aviation XML.
The table is of course a list of entries, each describing a quantity, for example "airTemperature" http://codes.wmo.int/common/c-15/me/airTemperature. Looking at the table, there is a field "generalization" with value "ThermodynamicTemperature" that links to http://qudt.org/vocab/quantity#ThermodynamicTemperature.
This is a link to QUDT. The top page describes only the SI and CGS systems, but other conventional units seem to be taken care of as well.
2013-12-06
Ambiguity in pressure level heights of TAC TEMP, which really caused trouble
It has recently come to attention that unnatural geopotential height values are sometimes reported in BUFR TEMP messages for 89532 SYOWA in Antarctica. The message is converted by RTH Tokyo from the traditional alphanumeric code (TAC) FM 35. The issue is partly a problem in the conversion software (handling of negative values), but it also stems partly from an inherent ambiguity in the TAC TEMP format: a location-independent algorithm fails to estimate the "upper digits", especially at 700 hPa.
TAC-to-BUFR conversion is common worldwide, and other converters might have the same problem, though the situation has not been surveyed yet.
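The "upper digits" ambiguity can be illustrated with a small sketch. As far as I understand FM 35, the 700 hPa height is reported as three digits hhh in whole geopotential metres with the thousands digit omitted, so the decoder has to guess that digit. The sketch below is in Python and is not the converter's actual code: both function names, the 500 threshold of the fixed rule, and the first-guess value are assumptions made up for illustration.

```python
def decode_700hpa_fixed(hhh):
    """Location-independent rule (assumed for illustration): the height
    is taken to lie between 2500 and 3499 gpm."""
    return 3000 + hhh if hhh < 500 else 2000 + hhh


def decode_700hpa_nearest(hhh, first_guess):
    """Choose the candidate hhh + n * 1000 that is closest to a first-guess
    height in gpm, for example one estimated from the lower levels."""
    candidates = [hhh + 1000 * n for n in range(5)]
    return min(candidates, key=lambda h: abs(h - first_guess))


# A deep low near Antarctica: suppose the true height is 2450 gpm,
# which is reported as hhh = 450.
print(decode_700hpa_fixed(450))          # fixed rule picks the wrong thousand
print(decode_700hpa_nearest(450, 2500))  # a local first guess resolves it
```

The fixed rule works for typical heights near 3000 gpm but breaks when the true value falls outside the window it assumes, which is the kind of failure a location-independent algorithm can show.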
2013-12-04
"iso" URN namespace - machine-readable reference to ISO standards
I found that IETF RFC 5141 http://tools.ietf.org/html/rfc5141 defines the "iso" URN namespace. That makes it possible to cite ISO standards in a machine-readable manner. The metadata standards used in the WMO Core Metadata Profile can be referred to as follows:
- urn:iso:std:iso:19115:2003:en
- urn:iso:std:iso:19115:cor-1:2006:en
- urn:iso:std:iso:ts:19139:2007:en
I'm not trying to change WCMP (for example gmd:metadataStandardName), since for now I don't know of any requirement that the field be machine-readable. But I think it is worth sharing.
[article also posted to WMO/IPET-MDRD]