Monday, March 26, 2007

Metadata: document- versus content-level

We have now created and agreed a working model (for lack of a better term) of a metadata schema. This schema has been integrated into our workflow software, which is how the metadata will be gathered. Clearly, as we use it, tweaks will need to be made, and the workflow software is flexible enough for us to make them.

We are also trying to introduce XML content-level metadata and, to date, we have managed to get a series of templates agreed. These templates are tightly controlled in terms of their structure, so introducing XML from a structural perspective should be straightforward, and we’re evaluating a couple of tools for ensuring that the XML tags are correctly applied.
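To give a feel for what "correctly applied" might mean, here is a minimal sketch in Python using the standard library's XML parser. The element names (report, introduction, methodology, findings) and the REQUIRED_SECTIONS list are assumptions made up for illustration; they are not our actual templates, and this is not one of the tools we are evaluating.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: the section elements a report must contain, in order.
REQUIRED_SECTIONS = ["introduction", "methodology", "findings"]


def check_structure(xml_text):
    """Return a list of problems; an empty list means the structure matches."""
    root = ET.fromstring(xml_text)
    found = [child.tag for child in root]
    problems = []
    if found != REQUIRED_SECTIONS:
        problems.append(f"expected sections {REQUIRED_SECTIONS}, found {found}")
    return problems


sample = """<report>
  <introduction>Why we did the work.</introduction>
  <methodology>How we did it.</methodology>
  <findings>What we found.</findings>
</report>"""

print(check_structure(sample))
```

Because the templates are tightly controlled, a check this simple (same sections, same order) may cover most of the structural cases; a real tool would more likely validate against a schema.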

Software Challenges
The biggest problem that we’re having with the software that will enable the move from templates in MS Word to XML-encoded content is that much of the writing is done by external suppliers, which raises licensing problems. These aren’t insurmountable, but it will require a flexible software vendor, disciplined suppliers and a fair bit of negotiation to agree something.

Document- and Content-level metadata: relationship
So if the XML metadata is focussed mainly on document structure (e.g. this content is the Introduction, this content is the Methodology, etc.), how does this relate to the document-level metadata, which looks at subject, relational and bibliographic aspects of the document? The two are mutually exclusive to some extent, but how relevant is the document-level metadata to the content? Should it be captured and accompany the content? I don’t really know the answers to these questions, and they’re the easier ones!

At the moment, I think that the best solution would be to include within the content-level metadata a reference to the document(s) of which it forms part. Someone could then move from content-level to document-level metadata if they wanted to see subject, relational, or bibliographic data. Of course, the minute you reuse some content, say a “Findings” paragraph in an “Introduction” paragraph, that structural element changes. So the structural element needs to exist within the context of a document of origin / reuse. But wait, because here is where it starts getting really tricky…
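The reference idea above can be sketched roughly as follows. This is only an illustration under my own assumptions: the class names, field names and identifiers (DocumentMetadata, ContentItem, "report-1" and so on) are hypothetical and do not come from our schema or workflow software.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentMetadata:
    """Document-level metadata: subject and bibliographic aspects."""
    doc_id: str
    title: str
    subjects: list


@dataclass
class ContentItem:
    """A reusable piece of content with content-level (structural) metadata."""
    content_id: str
    text: str
    # doc_id -> structural role this content plays in that document
    roles: dict = field(default_factory=dict)

    def use_in(self, doc_id, role):
        self.roles[doc_id] = role


def documents_for(item, catalogue):
    """Follow the references from a content item to its document-level metadata."""
    return [catalogue[doc_id] for doc_id in item.roles]


catalogue = {
    "report-1": DocumentMetadata("report-1", "Original report", ["survey results"]),
    "summary-2": DocumentMetadata("summary-2", "Later summary", ["overview"]),
}

para = ContentItem("p-17", "Most respondents preferred option A.")
para.use_in("report-1", "Findings")       # document of origin
para.use_in("summary-2", "Introduction")  # reuse in a different structural role

for doc in documents_for(para, catalogue):
    print(doc.title, "->", para.roles[doc.doc_id])
```

The point of keeping the role inside a per-document mapping is that the same paragraph can be "Findings" in one document and "Introduction" in another without either record overwriting the other.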

If we want to assign subject metadata at the content level, will we have to double our work? I can’t think of any other way: the subject of a particular paragraph will not necessarily be the same as that of the document as a whole.

Also, how do we manage bibliographic metadata at the content level? For example, the first time a paragraph is written (probably as part of a larger document), the author of the paragraph and the author of the document are one and the same, and probably pretty easy to establish. What do we do when the paragraph is combined with paragraphs that are taken from other documents and with ones that are new? I think that one can argue that the author of the individual paragraphs is clear, but who is the author of the document?
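I do not have an answer, but one possible convention (and it is only that) would be to keep the author on each paragraph, derive a list of contributors for the compiled document, and record whoever assembled it separately. A minimal sketch, in which the Paragraph class, the contributors function and all of the names and identifiers are placeholders invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    author: str      # whoever first wrote the paragraph
    origin_doc: str  # hypothetical id of the document it was first written for


def contributors(paragraphs):
    """Distinct paragraph authors, in order of first appearance."""
    seen = []
    for p in paragraphs:
        if p.author not in seen:
            seen.append(p.author)
    return seen


compiled = [
    Paragraph("Reused background text.", "A. Writer", "report-1"),
    Paragraph("Reused findings text.", "B. Supplier", "report-2"),
    Paragraph("New linking paragraph.", "C. Editor", "summary-3"),
    Paragraph("More reused background.", "A. Writer", "report-1"),
]

# Assumption: the compiler is recorded separately, because it cannot be
# derived from the paragraph-level authors alone.
compiler = "C. Editor"

print(contributors(compiled))
```

This does not settle who "the author" of the compiled document is; it just shows that the paragraph-level data can be kept intact while the document-level answer is decided by policy.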

Conclusion (resignedly)

The more I think about it, the more I think that we are just going to have to manage two levels of metadata. At the point of creation, the content- and document-level metadata are the same, but as content is reused, two distinct levels of metadata emerge. To be honest, it would be a big step forward if we were to introduce structural metadata at the content level, and as this presents the fewest or simplest (not simple, mind you, simplest) challenges, I think we will pursue this objective and reassess where we go from there.
