Wednesday, February 28, 2007

Rail Industry Metadata Standard (RIMS)

Having done some research, I have found that there isn’t yet a metadata standard that is either designed for the rail industry or in common use within it. Although it wasn’t a surprise, it was a bit of a setback, as it meant having to assemble a proposed metadata schema for the description of materials within the rail industry, consult on it (starting with the R&D senior managers), and test it to validate it.

Building the metadata standard: the birth of RIMS
I have chosen to draw from the Dublin Core Metadata Standard (DCMS), with the Dublin Core Terms (DCTerms) to flesh it out a little. I have also drawn from the e-Government Metadata Standard (eGMS) to ensure that any public sector aspects are covered. Now, the eGMS is itself based on the DCMS, so what I have described so far would pretty much describe the eGMS. The rail industry, and our R&D work in particular, requires a little more granularity than the eGMS currently provides, something the Cabinet Office acknowledges by encouraging enhancement and refinement to suit different contexts. So I have also added some elements that are specific to the rail industry (e.g. asset type) and to our organisation (e.g. research topic). The complete picture is what we will consider to be the Rail Industry Metadata Standard (RIMS).
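A minimal sketch of how the layering could be represented, assuming a record is a plain dictionary of element names to values. It covers only the fifteen core Dublin Core elements plus the two extensions named above; the DCTerms and eGMS refinements are left out, and the function name, the snake_case spellings and the example values are all hypothetical.

```python
# The fifteen elements of the core Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# The rail- and organisation-specific additions mentioned in this post.
# The snake_case spellings are an assumption made for this sketch.
RIMS_EXTENSIONS = {"asset_type", "research_topic"}


def group_by_origin(record):
    """Split a record's fields according to where each element comes from."""
    groups = {"dublin_core": {}, "rims_extension": {}, "unrecognised": {}}
    for name, value in record.items():
        if name in DC_ELEMENTS:
            groups["dublin_core"][name] = value
        elif name in RIMS_EXTENSIONS:
            groups["rims_extension"][name] = value
        else:
            groups["unrecognised"][name] = value
    return groups


# Hypothetical record, for illustration only.
example = {
    "title": "Example research report",
    "date": "2007-02-28",
    "asset_type": "locomotive",
    "research_topic": "example topic",
    "colour": "blue",  # not part of the schema, so it ends up unrecognised
}

groups = group_by_origin(example)
```

Keeping the extensions in their own set makes it easy to see at a glance which parts of a record any Dublin Core aware tool should understand and which parts are specific to RIMS.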

Identifying, assembling and building controlled vocabularies
Once I had a proposed set of elements, I set about sorting out the required controlled vocabularies. In my (admittedly limited) experience, this is the most difficult part. For many of the fields drawn from established standards, it was pretty straightforward (e.g. date fields use the W3C-recommended date-time format). For some of the elements that I had to create (e.g. research topic), it was also pretty straightforward because the lists were specific to the company and, in many cases, already in use. Others, from both the established standards and the new set, were much more difficult. One such example is the asset type element: how granular do you go? For most of us, the term ‘locomotive’ is sufficiently descriptive, but for our engineers it’s just too broad. My approach to these controlled vocabularies has been to put together a starting point and seek input and comments. So far, I have only engaged the R&D team, who have heavily refined the lists and accepted them.
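To illustrate the two kinds of check involved, here is a rough sketch that tests a date against the shape of the W3C date-time format and an asset type against a two-level controlled vocabulary. The vocabulary terms are invented for the example and are not the lists agreed with the R&D team; the function names are hypothetical too. The pattern only checks the shape of the value (it would accept a month of 13), and the sketch treats both fields as required.

```python
import re

# Shape of the W3C date-time format: YYYY, YYYY-MM, YYYY-MM-DD, or a full
# date followed by a time that must carry a time zone (Z or +hh:mm / -hh:mm).
W3CDTF_PATTERN = re.compile(
    r"\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?"
)

# Hypothetical vocabulary: each broad term maps to its narrower terms.
ASSET_TYPES = {
    "locomotive": {"diesel locomotive", "electric locomotive"},
    "wagon": set(),
}


def is_w3cdtf(value):
    """Return True if the whole string has the shape of a W3C date-time."""
    return W3CDTF_PATTERN.fullmatch(value) is not None


def allowed_terms(vocabulary):
    """Flatten broad and narrower terms into one set of permitted values."""
    terms = set(vocabulary)
    for narrower in vocabulary.values():
        terms |= narrower
    return terms


def check_record(record):
    """Return a list of problems found in the date and asset_type fields."""
    problems = []
    if not is_w3cdtf(record.get("date", "")):
        problems.append("date is missing or not in W3C date-time format")
    if record.get("asset_type") not in allowed_terms(ASSET_TYPES):
        problems.append("asset_type is missing or not in the controlled vocabulary")
    return problems
```

Allowing both the broad and the narrower terms is one way to answer the granularity question: a general user can tag something as a locomotive, while an engineer can pick the more specific term.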

Subject
My other challenge has been sorting out a subject matter controlled vocabulary, and it is proving to be a somewhat daunting task. The Integrated Public Service Vocabulary (IPSV), recommended as the controlled vocabulary for DCMS Subject, treats everything to do with the rail industry as ‘Rail Transport’. Clearly, this isn’t going to be sufficient for our requirements. I started on this task in the same way as I approached the other controlled vocabularies, but it has proven to be too big. At the moment, it’s on hold: I am moving the rest of the project forward with a space reserved for subject tags, and I have started to look for other initiatives, both here and around Europe, that are working towards a controlled vocabulary of some sort for the rail industry.

Handling the metadata
There are basically two different ways of managing document metadata. You can hold the metadata in a table that includes the location of each document described, and then use this table to search for and retrieve documents. Alternatively, you can embed the metadata in the documents themselves and search that. (In reality, the search software or engine will most likely build its own table of metadata, as in the first method, but this is a temporary table that is understood to need regular updating, so it is not the source of the metadata.) Each method has its strengths and weaknesses: the table is quicker and simpler to deliver, while embedded metadata travels with the document when someone downloads it to a local space, so it isn’t lost.
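The table route can be sketched in a few lines, assuming an in-memory SQLite database stands in for whatever store is eventually used. The table name, columns, file locations and function names are all hypothetical; the point is only that each row pairs a document's location with its metadata, and that a search returns locations.

```python
import sqlite3


def build_catalogue(rows):
    """Create an in-memory metadata table and load (location, title, asset_type) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        " location TEXT PRIMARY KEY,"  # where the described document lives
        " title TEXT NOT NULL,"
        " asset_type TEXT)"
    )
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def find_by_asset_type(conn, asset_type):
    """Return the locations of all documents tagged with the given asset type."""
    cursor = conn.execute(
        "SELECT location FROM metadata WHERE asset_type = ? ORDER BY location",
        (asset_type,),
    )
    return [row[0] for row in cursor]


# Hypothetical documents, for illustration only.
catalogue = build_catalogue([
    ("/docs/report-a.pdf", "Example report A", "locomotive"),
    ("/docs/report-b.pdf", "Example report B", "wagon"),
    ("/docs/report-c.pdf", "Example report C", "locomotive"),
])

locomotive_docs = find_by_asset_type(catalogue, "locomotive")
```

The weakness mentioned above shows up directly in this sketch: the metadata lives only in the table, so a copy of report-a.pdf downloaded elsewhere carries none of it.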

At the moment, we are also in the process of introducing a business process management system (we are calling it the Research Management System, or RMS). The RMS will allow us to store documents as well as manage their production and approval. As a result, it makes sense to piggyback the metadata assignment on the RMS work, which means we will be going down the table route. This isn’t my preferred option, but it is the one that will get metadata gathered and stored soonest. Once that process is embedded, we can look at technologies that will enable us to embed the gathered metadata into the files, so that users downloading them from our website take the metadata with them.

One challenge that remains, and for which we have a few options but haven’t yet settled on one, is what to do with the legacy collection. It has been decided that past projects and their associated documents will not be uploaded into the RMS. So the RMS gives us a solution for future publications, but it doesn’t deal with the existing collection. Most likely, we will upload the previous publications to a separate segment of the RMS, which will store the metadata in the same way, but there are a couple of alternative solutions…more on this as it develops.

Where from here?
The next thing to do with this standard is to confirm that it works in practice, which will be part of the embedding process for the RMS. We will then consult the rest of the organisation on the suitability of the metadata schema and its associated controlled vocabularies for wider use in the company. I guess you could think of our work as a bit of a pilot for the rest of the company.

I’d like to publish our schema and controlled vocabularies under a Creative Commons licence and invite other organisations to comment on them or use them in their own work.

