9. Extending the Guidelines

9.1 Building the DTDs

The tags that have been described in earlier Chapters of this Report are only part of what makes up an SGML description of a document encoding scheme. The manner in which document components delimited by these tags are to fit together, along with other features of the encoding scheme, are represented in the document type definitions (DTDs). The draft versions of the TEI DTDs are contained in Appendix C. Readers are invited to comment specifically on the design principles used in constructing these DTDs, as described in this Chapter of the Report.

This Chapter does not attempt to be a complete description of how to write and modify DTDs. Readers desiring such a description should consult the SGML standard document and other tutorial materials.

The use of SGML allows the specification of rigid structures for texts. However, we anticipate that texts encoded using the scheme described in this report will be diverse in structure and content. Some of the aspects of these texts will not have been anticipated; it must be possible to extend the scheme presented here to incorporate new information. Some texts will have components and features that are close to those in this Report, but differ from what is described here either in their content or in their relationship to other components of the text. Accordingly, the DTDs have been written to allow a great deal of flexibility both in using the tags and attributes defined by the project, and in modifying or extending the encoding scheme. This Section describes the approach to designing the DTDs; the remaining Sections in this Chapter describe how to modify the scheme.

The motivating principle for the design of the DTDs has been to allow but not require structural constraints on documents. An encoded document is seen as comprising a header and a body. The header can contain SGML declarations and additional declarations required to conform to the TEI, as described in previous Chapters. The body contains the encoded text itself.

The body is expected to comprise some parts that can be well described using tags and structural constraints, and some parts that cannot be so described. The body can thus be viewed as a text with some well-structured components. These structured components correspond to aggregates of information marked with some tags from the DTDs; an SGML parser would be able to represent them as well-structured subtrees. The remainder of the text consists of those parts that are difficult to encode using the tags, or that do not fit together in some predetermined fashion. Some documents can be entirely coded as structured components that fit together predictably; other documents may require much less structured encodings.

The DTDs in the Appendix attempt to accommodate this diversity by viewing the body of a document as a mixture of text and aggregates. At a high level in the description, the body is simply data with embedded structures. This is represented in SGML by an element declaration for the body whose content is parsable character data and whose inclusion list refers to the parameter entity agg. Such a declaration states that the body element comprises parsable character data, together with the possible inclusion of structures as defined by the parameter entity agg. This parameter entity resolves to a list of structures that might appear in the document; for example, if we consider only structural features, this might include chapters, paragraphs, and lists. There is no specification here of how the various aggregates referenced by agg will fit together.
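A declaration of the kind just described might look like the following sketch. The element names chapter, p, and list in the value of agg are assumptions drawn from the examples mentioned in the text; the actual content of agg in the TEI DTDs is not reproduced here.

```sgml
<!-- Illustrative only: the aggregates that may appear within the body. -->
<!ENTITY % agg "chapter | p | list">

<!-- The body is character data; any aggregate named in agg may be
     included at any depth within it. -->
<!ELEMENT body - - (#PCDATA) +(%agg;)>
```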

It is reasonable to expect that each aggregate will itself have some defined structure represented by an SGML model. The inclusion mechanism, as used in the element declaration for the body, allows any of the aggregates referenced in the inclusion list to appear anywhere within the body. In particular, this means that any aggregate can occur within itself. This rarely makes sense, because an aggregate is assumed to be something with a well-defined structure. To prohibit this arbitrary self-embedding, the element declaration for the aggregate should specify its own name in an exclusion list. For example, a declaration for a chapter whose model is a sequence of paragraphs, and whose exclusion list names the chapter itself, indicates that a chapter is a sequence of paragraphs which does not contain a chapter. This is true even though a specification at a higher level in the hierarchy had indicated that a chapter could appear anywhere.
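The exclusion mechanism can be sketched as follows. The element names chapter and p are used for illustration and are assumed to be declared elsewhere in the DTD.

```sgml
<!-- A chapter is a sequence of paragraphs. The exclusion list
     forbids a chapter anywhere inside a chapter, overriding the
     inclusion granted on an enclosing element. -->
<!ELEMENT chapter - - (p*) -(chapter)>
```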

The essence of this approach is to treat the document body as a stream of text which contains more or less structured portions. It will be necessary for users of the scheme to change the amount and kind of structuring. The next sections indicate how to do so.

9.2 Modifying the Guidelines

A document to be presented to an SGML parser, and hence a document encoded according to one of the TEI project DTDs, must include the DTD to be used (perhaps implicitly) and the document instance. The DTD might be given explicitly in the file or, more commonly, it might be in another file and only be referenced.

If the document type declaration is given separately, the document file can contain a declaration that references it, together with some declarations that change it. These changes are in the declaration subset. The document type declaration in this case consists of the keyword DOCTYPE, the name of the document type, a reference to a system entity, and a list of declarations enclosed in square brackets. Taking a document type book defined in a file simpleDTD as the example, it is to be interpreted as follows: this is a declaration of a doctype; the name of the object being declared is book; the declaration is to be found in the system entity (file) named simpleDTD (if the entire declaration is to be presented explicitly in the current file, this part of the declaration is omitted); the square brackets enclose a list of other declarations (entities, elements, attributes, and so on) that are used to modify the definition in simpleDTD, and that are called the declaration subset.
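The parts of the declaration enumerated above fit together as in this minimal sketch. The names book and simpleDTD come from the text; the comment stands in for whatever local declarations a particular document needs.

```sgml
<!DOCTYPE book SYSTEM "simpleDTD" [
  <!-- Declaration subset: local declarations that modify
       the definitions read from simpleDTD go here. -->
]>
```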

The order in which these parts of the declaration are presented is important. An SGML parser will interpret the declarations from the external entity after the declarations that are explicitly given. Since it is legal in SGML to define some things - notably, parameter entities - more than once, and since the first definition encountered by the parser is the one used, this gives the local declarations precedence over those that are read in from the file.

It can be useful to have some global declarations interpreted before some of the local declarations. In this case, the global declarations are split into separate entities (files); those that should be interpreted among the local declarations can be explicitly included where necessary. In the document type declaration, such an entity (called globals here) is declared in the declaration subset and then referenced as %globals; at the point where its contents are to be interpreted. In the extreme, to gain full control over the order in which declarations are elaborated, the reference to an external declaration can be omitted, and explicit references to external entities included in the local declarations. This technique is used to modularize the definitions in the TEI DTDs; the details are explained in what follows.
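A sketch of a declaration subset that pulls in a separate file of global declarations part-way through the local declarations is given below. The entity name globals appears in the text; the system identifier "globals" is an assumption made for illustration.

```sgml
<!DOCTYPE book SYSTEM "simpleDTD" [
  <!-- Local declarations that must take precedence come first. -->

  <!ENTITY % globals SYSTEM "globals">
  %globals;

  <!-- Local declarations that rely on the globals follow. -->
]>
```

Because the first declaration of a parameter entity is the one used, anything declared above the reference overrides the corresponding declaration in the included file.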

The remainder of this Section shows how to make specific changes to DTDs. They are all considered to be taking place in the context of a document type declaration such as the one described above.

Extensive modification can essentially result in a complete redefinition of the DTD or at least of its structural aspects. This might be viewed as a bad thing, in that the standardization achieved by the TEI has been done away with. However, if the modifications are accomplished using SGML mechanisms, as is the case with all of the changes described here, there remains a well-defined object with a clearly specified structure and a clear relationship to the TEI DTDs and tags. These techniques should be used with caution, and with an awareness of the complexities that can be introduced.

9.2.1 Renaming Tags and Attributes

The tag names to be used in an SGML document instance are derived from the specification of the structure of the document in the DTD. Users of the TEI encoding scheme will sometimes want to specify their own tag or attribute names, perhaps using names already in use in a particular organization or project, or perhaps using names in some other language. To facilitate this renaming, parameter entities can be used to assign new string values to be used in the document type declaration.

For example, consider renaming the paragraph tag from p to par. The parameter entity nP contains the name of the tag. An external entity (file) contains all of the entity declarations for the names; this entity can be explicitly included in the local declarations as described previously. In the external declaration there is an entity declaration that gives nP the value p. Rather than using the name p in other declarations in the DTD, the parameter entity is used throughout. Thus, to redefine the tag name, it is sufficient to include a declaration giving nP the value par at the beginning of the local declarations, before the global declaration of the name entities is included. As the parser processes the declarations, it will first encounter the declaration supplied locally. Since SGML's rule is that the first declaration encountered for an entity is the one that applies, when the parser subsequently encounters the entity declaration for nP in the external declaration, it is simply ignored.
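The renaming of p to par can be sketched as three declarations, shown here in the order in which the parser would meet them. The entity name nP is taken from the text; the content model (#PCDATA) for the paragraph is an assumption made to keep the example short.

```sgml
<!-- In the local declarations, before the name entities are included: -->
<!ENTITY % nP "par">

<!-- In the external file of names; ignored, since nP is already declared: -->
<!ENTITY % nP "p">

<!-- Elsewhere in the DTD the name is always reached through the entity,
     so this now declares an element called par: -->
<!ELEMENT %nP; - - (#PCDATA)>
```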

A similar technique can be used to rename attributes. To rename the attribute src as source, it is sufficient to provide, in the local declarations, an entity declaration that gives the value source to the parameter entity holding the name of the attribute.
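A sketch of the attribute renaming follows. The entity name aSrc is hypothetical, chosen by analogy with the entity aLanguage mentioned in this Chapter; the text does not give the actual name.

```sgml
<!-- Local declaration (aSrc is an assumed entity name): -->
<!ENTITY % aSrc "source">

<!-- Corresponding global declaration, ignored once aSrc is declared locally: -->
<!ENTITY % aSrc "src">
```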

9.2.2 Changing Attribute Values

The DTDs must include declarations for the allowed values for the attributes that are used with tags. These are defined by making reference to the entity containing the name of the attribute, and then specifying the allowed values and the default to be assumed if the attribute is missing (this was discussed in Chapter 3). For example, there might be an attribute language with a set of possible values, one of which must be chosen. An entity declaration giving the attribute name, the permitted values, and the default could occur in the external declaration.
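Such a global declaration might look like the following sketch. The entity name aLanguage is taken from the text; the entity name dLanguage, the list of values, the default, and the choice of p as the element carrying the attribute are all assumptions made for illustration.

```sgml
<!-- Name of the attribute, held in an entity so that it can be renamed. -->
<!ENTITY % aLanguage "language">

<!-- Definition of the attribute: name, permitted values, default.
     dLanguage and the values listed are hypothetical. -->
<!ENTITY % dLanguage "%aLanguage; (english | french | latin) english">

<!-- The definition is used in an attribute list declaration. -->
<!ATTLIST p %dLanguage;>
```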

Suppose it is required to redefine the language attribute to allow any string to be used as the specification of a language, and to require that the attribute be present. This can be done using an entity declaration in the local declarations. When the local declarations are being processed, the external declarations have not yet been seen by the parser. Accordingly, it is not possible to refer to a parameter entity that is declared in the external declaration but not in the local declarations. This means that the reference to the name of the attribute in the string defining the content of a parameter entity for an attribute cannot refer to the parameter entity defining the name of the attribute, aLanguage in this case, since it will not yet have been defined (unless, of course, it has been redefined locally). But there is no need to refer to the parameter entity, since it exists simply to allow redefinition of the name of the attribute, language in this case. If the name is not being redefined it can be written directly in the string; if it is being redefined, it is certainly known to the writer of the local declarations and again it can be placed directly into the string.
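The local redefinition described here can be sketched as a single declaration. The entity name dLanguage is hypothetical and stands for whatever entity the global declarations use to hold the definition of the language attribute.

```sgml
<!-- Local declaration, read before the external declarations.
     The attribute name is written out directly, because aLanguage
     has not yet been declared at this point. -->
<!ENTITY % dLanguage "language CDATA #REQUIRED">
```

CDATA allows any string as the value, and #REQUIRED obliges the encoder to supply the attribute.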

9.2.3 Changing Models

The structural aspects of a document are reflected in the SGML models that specify the content of elements. It is this part of the DTDs that corresponds to a grammar for the class of documents. By redefining the model for a tag, it is possible to restrict where tags can occur, to allow tags to occur in new places, or even, in the extreme case, to redefine the structure of the entire body of the document and thus do away with syntactic restrictions altogether.

The external declarations of models refer to elements indirectly, using the parameter entities that contain the names of the elements. For example, the model for a simplified chapter (chp) could be defined, through these entities, as an optional number (no), followed by a chapter title (ct), followed by zero or more paragraphs (p).
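The simplified chapter model might be declared as in the following sketch. The entity name nP appears earlier in this Chapter; nChp, nNo, and nCt are assumed names formed on the same pattern.

```sgml
<!ENTITY % nChp "chp">
<!ENTITY % nNo  "no">
<!ENTITY % nCt  "ct">
<!ENTITY % nP   "p">

<!-- Optional number, then a chapter title, then zero or more paragraphs. -->
<!ELEMENT %nChp; - - ((%nNo;)?, %nCt;, (%nP;)*)>
```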

As was true for references to attribute names via parameter entities in local attribute declarations, so for element names in local model declarations: any names referenced in the local declarations must be declared before they are used. This means that the actual names of the elements must be used in local model declarations, rather than the symbolic names (i.e., parameter entity references).

A local declaration written in this way, with the actual element names, could be used to redefine the structure that is defined in the global declarations.
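As a sketch, suppose the global declarations hold the model for chp in a parameter entity; the name mChp is hypothetical. A local declaration could then tighten the model, here by making the chapter number obligatory.

```sgml
<!-- Local declaration (mChp is an assumed entity name).
     Actual element names are written out, since the name
     entities have not yet been declared at this point. -->
<!ENTITY % mChp "(no, ct, p*)">
```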

If a model is defined using the SGML keyword any, it can have arbitrary content; any defined aggregate can be used anywhere (except where prohibited by exclusions relative to some other structure currently being parsed). This means that it is possible to do away with all constraints on the structure of some particular piece of a document.

Specifically, it is possible to redefine the entire body of the document to have any arbitrary content, by defining its model with this keyword.
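A sketch of such a redefinition is given below, assuming that the global declarations take the model for the body from a parameter entity. The entity name mBody is hypothetical.

```sgml
<!-- Local declaration, read first (mBody is an assumed entity name): -->
<!ENTITY % mBody "ANY">

<!-- Global declaration, which then resolves to a body with arbitrary content: -->
<!ELEMENT body - - %mBody;>
```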

9.3 Defining Additional Features

9.3.1 Adding New Tags

Adding a tag to the encoding scheme requires two things. First, there must be a declaration of the element (name, together with model); earlier examples show some ways of doing this. The model may reference other elements - either contained in the existing DTD or supplied in the local declarations - and thus be the definition of a new aggregate of arbitrary complexity.

Second, this new aggregate must be tied in to the existing grammatical structure. This is done by modifying one or more of the existing models to reference the newly declared element.
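The two steps can be sketched as follows. The element name note and the values given for agg are assumptions made for illustration; agg is the parameter entity described in Section 9.1.

```sgml
<!-- Step 1: declare the new element (note is a hypothetical name). -->
<!ELEMENT note - - (#PCDATA)>

<!-- Step 2: tie it in. A local declaration of agg, read before the
     global one, adds note to the aggregates allowed in the body. -->
<!ENTITY % agg "chapter | p | list | note">
```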

9.3.2 Defining New Attributes for Tags

Attributes are associated with tags (elements) through the attlist declaration. Adding an attribute to a tag that exists in the DTD requires the description of the attribute in the attribute list for the tag. There are two cases.

The first case is the addition of an attribute to a tag that has no attributes in the DTD. The attribute list declaration for the tag must be given in a local declaration. The next Section will show where the declaration should be placed. The declaration names the tag and then gives the name, the declared value, and the default of the new attribute.
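A declaration of this kind might look like the following sketch. The choice of the element p and of an attribute called id is an assumption made for illustration.

```sgml
<!-- Give the element p an optional identifier attribute. -->
<!ATTLIST p
          id  ID  #IMPLIED>
```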

The second case is the addition of an attribute to a tag that already has attributes. In this case the global declarations contain an attlist declaration for the tag. Unlike parameter entities, attribute lists for tags can be declared only once. This means that the existing declaration must be parameterized in such a way that new attributes can be added. This is accomplished by having a parameter entity defined for this purpose for each tag that has attributes. The parameter entity is defined to be the empty string, but it can be redefined by a local declaration. For example, the global declarations might contain a declaration of an entity bTag, followed by an attribute list declaration for the tag that ends with a reference to bTag. The entity bTag is defined as the empty string, and thus contributes nothing to the attribute list declaration for the tag. However, a local declaration, occurring before this pair of global declarations, can assign some valid attribute specification to the entity bTag. When the attribute list declaration is interpreted, the tag is then given two attributes, one from the string given in the declaration, and the other from the string given earlier in the parameter entity declaration. The example in the next section shows how these are to be placed.
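The mechanism can be sketched with three declarations, listed in the order in which the parser would read them. The entity name bTag and the element name tag follow the text; the attributes type and status, and their values, are assumptions made for illustration.

```sgml
<!-- Local declaration, read first: supplies one extra attribute. -->
<!ENTITY % bTag "status (draft | final) draft">

<!-- Global declarations. The empty default for bTag is ignored
     because bTag has already been declared locally. -->
<!ENTITY % bTag "">
<!ATTLIST tag
          type  CDATA  #IMPLIED
          %bTag;>
```

With the local declaration present, tag has the attributes type and status; without it, tag has type alone.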

Adding an attribute to a new tag that is defined in the local declarations is done by giving an attribute list declaration with the declaration of the element.

9.4 Worked Example

This Section shows a simple DTD and a sequence of transformations required to support the changes described earlier in the Chapter. The DTD shows only some simple structural tagging, using a small subset of the AAP tags and a simplified grammar. The first version is straightforward; the later versions are obtained by simple transformations that could be automated. The final version is more complex. The presentation of the sequence of transformations is intended to make the final version more accessible. The example document will not be modified in intermediate steps, save as is required to access the changing global declarations. The final version will demonstrate how to use these declarations to support changes to the document.

9.4.1 A Simple Structural DTD

The starting point is a DTD for a simple class of documents. Only a few structural elements are defined. The overall structure of the document is given as was described earlier in the Chapter: there is front matter (simply character data here) and a body; the body is character data with included aggregates (only one in this simple example).
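A DTD matching this description might look like the following sketch. The element names book, chp, no, ct, and p appear elsewhere in this Chapter; the names front and body, and the details of each model, are assumptions.

```sgml
<!-- A book is front matter followed by a body. -->
<!ELEMENT book  - - (front, body)>
<!ELEMENT front - - (#PCDATA)>

<!-- The body is character data with chapters included anywhere. -->
<!ELEMENT body  - - (#PCDATA) +(chp)>

<!-- A chapter: optional number, title, paragraphs; no nested chapters. -->
<!ELEMENT chp   - - (no?, ct, p*) -(chp)>
<!ELEMENT no    - - (#PCDATA)>
<!ELEMENT ct    - - (#PCDATA)>
<!ELEMENT p     - - (#PCDATA)>
```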

A simple document that conforms to this DTD, and that makes reference to an external entity to find it, has the following text content.

Front matter. Some text. Chapter one.

First paragraph.

Second paragraph.

Some additional text. Chapter two.

Third paragraph.

Fourth paragraph.

The last text.
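With markup, a document instance carrying this text might look like the following sketch. It assumes the DTD is stored in a file named simpleDTD, and that front matter and body are tagged front and body; those two element names are assumptions.

```sgml
<!DOCTYPE book SYSTEM "simpleDTD">
<book>
<front>Front matter.</front>
<body>Some text.
<chp><ct>Chapter one.</ct>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</chp>
Some additional text.
<chp><ct>Chapter two.</ct>
<p>Third paragraph.</p>
<p>Fourth paragraph.</p>
</chp>
The last text.
</body>
</book>
```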

9.4.2 Supporting Renaming of Tags

This DTD will now be modified in several stages. The first transformation is required to allow renaming of tags. All of the element names are defined in parameter entities.

The entities for the tag names are gathered into a separate file. It contains one parameter entity declaration for each tag name.
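The file of names might contain declarations such as these. Only nP is named in the text; the other entity names are assumed to follow the same pattern, and front and body are assumed element names.

```sgml
<!-- One entity per element name, so that any name can be overridden. -->
<!ENTITY % nBook  "book">
<!ENTITY % nFront "front">
<!ENTITY % nBody  "body">
<!ENTITY % nChp   "chp">
<!ENTITY % nNo    "no">
<!ENTITY % nCt    "ct">
<!ENTITY % nP     "p">
```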

The remaining part of the DTD now requires some changes. First, the models that refer to these names are changed to refer to the entities. Second, models may require some minor syntactic modifications. The only modification required here is the insertion of brackets in the model for the chapter declaration, so that the occurrence indicator (the question mark) comes after a bracket and not immediately after the name of the entity.
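After this transformation the element declarations might read as in the sketch below. It assumes name entities such as nBook, nFront, nBody, nChp, nNo, and nCt, formed on the pattern of nP; only nP is named in the text.

```sgml
<!ELEMENT %nBook;  - - (%nFront;, %nBody;)>
<!ELEMENT %nFront; - - (#PCDATA)>
<!ELEMENT %nBody;  - - (#PCDATA) +(%nChp;)>

<!-- Brackets are placed around each entity reference that carries
     an occurrence indicator. -->
<!ELEMENT %nChp;   - - ((%nNo;)?, %nCt;, (%nP;)*) -(%nChp;)>
<!ELEMENT %nNo;    - - (#PCDATA)>
<!ELEMENT %nCt;    - - (#PCDATA)>
<!ELEMENT %nP;     - - (#PCDATA)>
```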

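The document is changed by making reference to these two external files. The file of names is declared as a parameter entity in the declaration subset and referenced there as %gNames; the remainder of the DTD is referenced from the document type declaration itself. The body of the document is unchanged.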

The global declaration of names is interpreted first, because it is explicitly called for in the local declarations. Tags could be renamed by parameter entity declarations placed before the explicit reference to the global name declarations. For example, to redefine the name of a document component from p to par, a declaration giving nP the value par is placed before the reference %gNames; the body of the document then uses par in place of p.
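The renaming can be sketched as the following document type declaration. The entity names nP and gNames and the file name simpleDTD come from the text; the system identifier "gNames" is an assumption.

```sgml
<!DOCTYPE book SYSTEM "simpleDTD" [
  <!-- Read first, so this value of nP is the one that applies. -->
  <!ENTITY % nP "par">

  <!-- The global names; the declaration of nP in this file is ignored. -->
  <!ENTITY % gNames SYSTEM "gNames">
  %gNames;
]>
```

Removing the declaration of nP from the subset restores the original name p without any other change.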

9.4.3 Supporting Redefinition of Models

The next transformation allows models to be redefined. All that is required is the provision of a parameter entity for the model for each element, and the use of the entity name in the element declarations. If no models are redefined, there are no required changes to the document itself.

It is important to note that there is an entity for each element rather than for each model. If there were an entity for each model, all the elements defined in terms of that model would have to be changed in the same manner if the parameter entity were redefined.
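A sketch of the transformed declarations for two elements is given below. The model entity names mChp and mP are hypothetical, as are the name entities other than nP.

```sgml
<!-- One model entity per element, even where two models are identical. -->
<!ENTITY % mChp "((%nNo;)?, %nCt;, (%nP;)*)">
<!ENTITY % mP   "(#PCDATA)">

<!ELEMENT %nChp; - - %mChp; -(%nChp;)>
<!ELEMENT %nP;   - - %mP;>
```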

9.4.4 Supporting Renaming of Attributes

Attribute names must be declared in parameter entities, and these entities referenced in the attribute definition lists. The attribute name declarations can be placed in the same file as the tag name declarations.

The DTD is changed to reference the entities.
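As a sketch, an attribute name entity and an attribute list that uses it might be written as follows. The attribute id, the entity names aId and nChp, and the choice of chp as the element carrying the attribute are assumptions.

```sgml
<!-- In the file of names: -->
<!ENTITY % aId "id">

<!-- In the DTD, the attribute name is reached through the entity: -->
<!ATTLIST %nChp;
          %aId;  ID  #IMPLIED>
```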

9.4.5 Supporting Redefining Attributes

There are two changes that are required to support redefinition of attributes. The first is the provision of entity names for the definition of each attribute, so that the attribute description can be redeclared if necessary. The second is the addition of an entity to each attribute definition declaration to allow new attributes to be added to tags that have attributes.

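The DTD is changed so that each attribute definition is taken from its own entity, and each attribute list ends with a reference to an entity through which further attributes can be added.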
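The two changes can be sketched for a single element as follows. All of the entity names (dId for the attribute definition, bChp for the extension point, aId and nChp for the names) are hypothetical, formed on the pattern of aLanguage, bTag, and nP in the text.

```sgml
<!-- Change 1: the definition of the attribute is held in an entity. -->
<!ENTITY % dId  "%aId; ID #IMPLIED">

<!-- Change 2: an empty entity reserves a place for added attributes. -->
<!ENTITY % bChp "">

<!ATTLIST %nChp;
          %dId;
          %bChp;>
```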

The document itself can now use all of the capabilities that have been provided by these parameterizations of the DTD. Its text content is as follows.

Front matter. Some text. 1 Chapter one.

First paragraph.

Second paragraph.

Notice this. Some additional text. 2 Chapter two.

Third paragraph.

Fourth paragraph.

The last text.
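One way such a document might be marked up is sketched below. Nearly everything here is an assumption made for illustration: the renaming of p to par, the new element note, the model entity mChp, the element names front and body, and the system identifier "gNames". It shows the kinds of change the parameterized DTD permits and is not a reconstruction of the original example.

```sgml
<!DOCTYPE book SYSTEM "simpleDTD" [
  <!ENTITY % nP "par">                       <!-- rename p to par -->
  <!ENTITY % gNames SYSTEM "gNames">
  %gNames;
  <!ELEMENT note - - (#PCDATA)>              <!-- add a new tag -->
  <!ENTITY % mChp "(no?, ct, (par | note)*)"> <!-- allow it in chapters -->
]>
<book>
<front>Front matter.</front>
<body>Some text.
<chp><no>1</no><ct>Chapter one.</ct>
<par>First paragraph.</par>
<par>Second paragraph.</par>
<note>Notice this.</note>
</chp>
Some additional text.
<chp><no>2</no><ct>Chapter two.</ct>
<par>Third paragraph.</par>
<par>Fourth paragraph.</par>
</chp>
The last text.
</body>
</book>
```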

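As a final example, we now restructure this to use external entities to group together the sets of definitions being applied by the document writer. One entity holds the name redefinitions and another holds the other redefinitions. In the revised declaration subset the references appear in the order %myNames; %gNames; %myDefs; so that the writer's names are read before the global names, and the writer's other redefinitions are read before the remaining global declarations, which come from the external entity named in the document type declaration. The document is as in the last instance.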
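The revised declaration subset might look like the following sketch. The entity names myNames, gNames, and myDefs come from the text; the system identifiers are assumptions.

```sgml
<!DOCTYPE book SYSTEM "simpleDTD" [
  <!ENTITY % myNames SYSTEM "myNames">
  <!ENTITY % gNames  SYSTEM "gNames">
  <!ENTITY % myDefs  SYSTEM "myDefs">

  %myNames;   <!-- the writer's names, read first -->
  %gNames;    <!-- global names, for everything not renamed -->
  %myDefs;    <!-- the writer's models and attribute definitions -->
]>
```

Keeping the writer's changes in their own files lets the same set of modifications be shared by several documents.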

All of the versions of this example have been parsed by an SGML parser.