3 Design principles

Standards are intended to make it easier to combine sets of data produced by different scholars in different places using different tools in different environments. Combination of such data sets is desirable when (as is the case with manuscript descriptions) they describe resources which are widely scattered and yet of global interest. Standards make such unification feasible at many levels: at one level, a standard such as that defining the electrical systems of Europe makes it feasible for a computer to be transported from one place to another; at another level, a standard such as that defining the operating systems used by different computers makes it possible to share software. The level of standardization which concerns us, however, is higher still and concerns the definition of an abstract model underlying different data representations held within a computer. Provided that we have a clear agreement about the components of this abstract data model, we can expect different software systems to behave identically on different computer systems in so far as they process data in terms of that model. This is surely a better approach to standardization than the alternative, which is to insist that everyone should use identical tools and identical systems.

But how generic a model should we attempt to define? The TEI has always tended to adopt a `broad church' approach, facilitating the creation of data sets in which almost any combination of components is legal, from the most simple to the most complex. This seems particularly appropriate for manuscript descriptions, where each country, each intellectual domain, even each cataloguer for each different cataloguing project, seems to have developed a distinct cataloguing style. On the other hand, without a more narrowly defined set of rules, the task of integrating records produced according to differing aspects of a very permissive standard may prove difficult or impossible.

The current draft permits, at one extreme, the making of simple manuscript inventories which contain no more than a list of manuscript identifiers, to which may be added just a few words of description of each manuscript, or an image of the manuscript. At the other, the draft also supports highly formalized descriptions, with elaborate structural mark-up distinguishing their various components, possibly also containing complete manuscript transcriptions or digital facsimiles. The distinction is not just one of length: a lengthy description might be just a manuscript identifier accompanied by a quantity of plain text; or a short description might be highly structured, with many distinctions made concrete in the markup.
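The two extremes can be sketched as two XML fragments. This is an illustration under stated assumptions, not an extract from the DTD: of the elements used, only <p>, <msContents>, <physDesc>, <binding>, and <history> are named in this section. The <msDescription> wrapper, <msIdentifier> and its children, the order of the elements, and every content value (the library and shelfmark are invented placeholders) are assumptions to be checked against the draft.

```xml
<!-- Fragment 1: a minimal inventory entry, an identifier plus a few words of prose. -->
<!-- msDescription, msIdentifier, repository and idno are assumed element names. -->
<msDescription>
  <msIdentifier>
    <repository>Example Library</repository>
    <idno>MS Example 1</idno>
  </msIdentifier>
  <p>Psalter, fifteenth century, on parchment.</p>
</msDescription>

<!-- Fragment 2: an equally short record, but with its statements distinguished by markup. -->
<msDescription>
  <msIdentifier>
    <repository>Example Library</repository>
    <idno>MS Example 1</idno>
  </msIdentifier>
  <msContents>
    <p>Psalter.</p>
  </msContents>
  <physDesc>
    <binding>
      <p>Blind-tooled calf over wooden boards.</p>
    </binding>
  </physDesc>
  <history>
    <p>Written in the fifteenth century.</p>
  </history>
</msDescription>
```

Both fragments are about the same length; what differs is how much of the description a program could locate without reading the prose.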

This flexibility is a practical response to the fact that, in the real world, requirements constantly change. Perhaps, in an ideal world, one might begin by making a simple inventory, with minimal details, and elaborate this by progressive addition of information. But more realistically, financial constraints may make it impossible ever to progress beyond that initial minimal inventory. Or one might be in the contrary position of having already made a substantial investment of expertise in the creation of `legacy data' which cannot immediately be mapped to the structures defined by the model. Here the clear answer will be to begin by importing information from the existing set of descriptions simply as unstructured prose, whether it is held in printed or digital form, or in some relational database format, so that subsequent work can take place to distinguish statements of date, place, provenance, or description.

Accordingly, this draft does not offer rigid definitions of what might constitute a `short' or `first-level' record, as against a `long' or `full' record. What is short for one purpose may be too long for another; what is long for one purpose may be the bare minimum for another. Moreover, although we have defined a minimal-level summary element, we have done so recognizing fully that its use may be entirely inappropriate in some situations. Some discussion of the thinking that led to this apparently contradictory design decision may perhaps be helpful.

At an early stage of the design process, it was felt that a key advantage of using SGML was the ability it gave to integrate full text and database systems, by tagging arbitrary stretches of text as if they were database records. As with other seductive notions, it soon became apparent that there is a danger in this oversimplification. Should we, following this principle, define a structure for our manuscript description in which almost any component could appear almost anywhere? Or should we aim for a more formal prescription, which declares that if you want to say something about the binding of the manuscript (for example) you can only say it in a <binding> element within a <physDesc> element? The first approach offers the cataloguer a heady sense of near-complete freedom; its danger is that the resulting descriptions are likely to be so heterogeneous as to be practically useless for purposes of integration and efficient retrieval, which was the prime justification for making the computer-readable records in the first place. The disadvantage of the second approach is that an over-rigid formalism would lead to frustration among cataloguers, and (rather quickly) refusal to adopt the standard.

Our solution was to attempt to have the best of both worlds by adopting what might be regarded as a truly British compromise. Our DTD permits the cataloguer either to use simple paragraphs of prose, which may contain anything humanly comprehensible, or to group such descriptions under more formal and precisely-defined elements for distinct and identifiable manuscript phenomena. Thus: a cataloguer is free to speak of any aspect of the manuscript binding, as it bears upon the history or intellectual content of the manuscript, within the <p> elements provided within the <history> and <msContents> elements. But if a formally-structured statement about the binding itself is required, it should be located within the <binding> element provided within the <physDesc> element.
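The compromise might look as follows in a single record. This is a hedged sketch rather than a normative example: <history>, <physDesc>, <binding>, and <p> are the elements named above, while the <msDescription> and <msIdentifier> wrappers, the element order, the use of <p> inside <binding>, and all of the wording are assumptions made for illustration.

```xml
<msDescription>
  <!-- msIdentifier and its children are assumed names; the values are placeholders. -->
  <msIdentifier>
    <repository>Example Library</repository>
    <idno>MS Example 2</idno>
  </msIdentifier>
  <physDesc>
    <!-- The formally structured statement about the binding itself goes here. -->
    <binding>
      <p>Brown calf over pasteboard, with an armorial stamp on both covers.</p>
    </binding>
  </physDesc>
  <history>
    <!-- The binding is mentioned only as evidence for ownership, -->
    <!-- so it stays in ordinary prose and is not tagged as binding. -->
    <p>The armorial stamp on the binding shows that the volume was in
       private hands before it entered the library.</p>
  </history>
</msDescription>
```

A retrieval system that wants binding descriptions then has exactly one place to look, while the cataloguer remains free to refer to the binding wherever the argument requires it.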

The alternative of permitting the encoder to use the <binding> element at any point might well lead to cataloguers' feeling obliged to mark every reference to binding with the <binding> tag, regardless of context and content. This would encourage a superfluity of effort which would lead to so many different kinds of information being contained in <binding> as to render the element semantically ill-defined and therefore useless.

The danger of an open standard, such as this is designed to be, is that it may be misused. We do not encourage, for example, manuscript descriptions which consist of no more than an identifier and a lengthy prose description with no formal distinctions through markup of statements of date, origin, provenance, and the like. However, the standard itself cannot be used to prescribe that descriptions must conform to this or that model: it can only be used to enable the various models. These proposals seek to create a framework which can accommodate what we know of existing standards, and to enable (in time) greater precision and more efficient retrieval of detail in cataloguing. Only domain experts can determine how the standard should be applied in their areas of expertise. Our claim is that we have given them a tool which is at least worthy of this task.

The implementation phase of MASTER, due to start in the autumn of 1999, will be the first attempt to put these proposals into practice, using both XML and relational technologies. As this process gets under way, we expect to find many shortcomings in the model described here, although we believe it to be fundamentally a sound one. We also expect to identify the need to define a wide range of content rules to ensure compatibility of the materials fitted into the structures we have defined. To use a metaphor from the library world: we anticipate that MASTER will need to be more like AACR2 and less like MARC during the next phase.

Finally, because the background and interests of all the participants in the primary groups involved in preparation of these proposals (EAMMS, Digital Scriptorium, MASTER, the TEI workgroup) are in western European medieval manuscripts, these proposals have so far concerned themselves only with the description of manuscripts from that tradition. However, we hope that our work can be reapplied in other traditions and would warmly welcome comments indicating its usefulness (or lack of it) from those with expertise in manuscripts from different cultures.

