TCW10: Draft Recommendations on the formatting of TEI documents

Brett Zamir

Draft circulated to TEI Council 1 April 08

Email from Brett Zamir of , tagged in TEI Lite by LB

First tagged draft posted on Council website

While TEI makes no claims about how a document is to be rendered (except to the extent that it allows description of the original formatting and to the extent that publishers wish to reflect that formatting), it is a general expectation that the TEI documents a project creates can at some point be rendered in a more human-readable manner. The desired formatting should not prompt one to violate the TEI Abstract Model by altering the semantics of a document in order to adjust output formatting. Nevertheless, the process of creating formatted output (or work done well before it) may lead a project to reconsider some aspects of its semantic markup, in addition to the more conventional step of adjusting a stylesheet, so the issues raised here go beyond the work of a designer. This chapter discusses various issues related to formatting a TEI document.

Respecting the original formatting

To respect the original formatting of a document fully, it is necessary to consider (and use thoroughly) one or more of the following: the global @rendition attribute or the @render attribute on tagUsage elements, both of which point to rendition elements, and the global @rend attribute, which supplies its own definition without referring elsewhere.

Semantic hooks, together with reliance on the typical formatting of specific tags wherever @rend and @rendition are not needed to override typical behaviors, might be sufficient to create an output document reflecting the formatting of the original. If the original rendering is important information to preserve, however, it can be recorded more explicitly by ensuring that every element is given a tagUsage element whose @render attribute points to a rendition element containing the styling details that apply to all instances of the tag. In this way, there is no ambiguity about how a tag is to be rendered. This is also recommended practice when tagUsage elements are provided (see ).
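As a rough illustration of the tagUsage mechanism just described, the following Python sketch reads a simplified header fragment and derives one default CSS rule per element. The fragment, the xml:id values, and the function name default_rules are all hypothetical, and the TEI namespace is omitted for brevity; this is a sketch under those assumptions, not a description of how the TEI stylesheets work.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified header fragment (TEI namespace omitted).
HEADER = """
<encodingDesc>
  <tagsDecl>
    <rendition xml:id="it">font-style: italic;</rendition>
    <rendition xml:id="ctr">text-align: center;</rendition>
    <tagUsage gi="title" render="#it"/>
    <tagUsage gi="head" render="#ctr"/>
    <tagUsage gi="p"/>
  </tagsDecl>
</encodingDesc>
"""

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def default_rules(header_xml):
    """Map each element name to the CSS of the rendition its tagUsage points to."""
    root = ET.fromstring(header_xml)
    renditions = {
        r.get(XML_ID): (r.text or "").strip() for r in root.iter("rendition")
    }
    rules = {}
    for usage in root.iter("tagUsage"):
        pointer = usage.get("render")
        if pointer is None:
            continue  # no default rendering declared for this element
        rules[usage.get("gi")] = renditions[pointer.lstrip("#")]
    return rules


RULES = default_rules(HEADER)
for gi, css in RULES.items():
    print(f"{gi} {{ {css} }}")
```

Because each rule is keyed on the element name alone, the sketch inherits the same limitation as tagUsage: it cannot state a default for an element-attribute combination or for an element in a particular position.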

However, note that tagUsage elements do not allow a default rendering behavior to be specified for element-attribute combinations, only for specific elements regardless of any attributes or attribute values. Nor do they allow a default behavior to be specified for an element based on its position (e.g., quote within cit). At this point, then, heavy reliance on @rend and @rendition may be unavoidable if one must thoroughly indicate the original formatting independently of a stylesheet.

For maximum specificity in encoding formatting details, the content supplied within rendition and @rend ought to use a formal formatting language. However, TEI does not specify which formatting language ought to be used, and at present it offers no fail-safe means of translating @rend and rendition content into a stylesheet (even though this information could potentially be used to create @style attribute content or, by using XSLT 2.0, to create a CSS stylesheet at the same time as the X/HTML output).
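The idea of deriving @style content from this information can be sketched in Python as follows. The sketch assumes, purely for illustration, that rendition elements and @rend both contain CSS declarations; TEI itself does not mandate this, and the sample document and the helper name inline_style are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified document (TEI namespace omitted).
# Assumption: rendition content and @rend values are CSS declarations.
DOC = """
<TEI>
  <teiHeader>
    <encodingDesc>
      <tagsDecl>
        <rendition xml:id="it">font-style: italic;</rendition>
        <rendition xml:id="sc">font-variant: small-caps;</rendition>
      </tagsDecl>
    </encodingDesc>
  </teiHeader>
  <text>
    <p>Read <title rendition="#it #sc">The Times</title> and
       <hi rend="font-weight: bold;">this</hi>.</p>
  </text>
</TEI>
"""


def inline_style(elem, renditions):
    """Build @style content from @rendition pointers, then any @rend value."""
    parts = [renditions[p.lstrip("#")] for p in elem.get("rendition", "").split()]
    if elem.get("rend"):
        parts.append(elem.get("rend"))
    return " ".join(parts)


root = ET.fromstring(DOC)
renditions = {r.get(XML_ID): r.text.strip() for r in root.iter("rendition")}
STYLES = {
    e.tag: inline_style(e, renditions)
    for e in root.iter()
    if e.get("rendition") or e.get("rend")
}
print(STYLES)
```

The translation is only as reliable as the assumption behind it: a @rend value written as a free keyword rather than as CSS would pass straight through and produce an invalid style.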

Although the available mechanisms might be sufficient in most cases, there is at present no formal means of expressing that a particular style rule should target CSS pseudo-elements (like :before, :after, :first-line), as one might wish to do when specifying the addition of distinct content at the beginning or end of a tag (e.g., adding left and right curly quotes to a q element, to say nothing of needing to target a specific attribute, say xml:lang, to determine which type of quotation marks to add). And even where the output does faithfully reflect the original, it will commonly be necessary to optimize the resulting CSS stylesheet, which is likely to be fairly large: tagUsage is the only means of stating a default, and it cannot indicate a behavior for element-attribute combinations or for elements in particular positions.

Original vs. output rendering

One might be tempted to force the rendition mechanisms in TEI (@rend, @rendition, or tagUsage’s @render with rendition) beyond their intended use of describing the original rendering of a document. However, if one wishes to control how the formatting will be output independently of the original formatting (whether by adding details of formatting not expressed about the original with the TEI rendition mechanisms or by overriding those details), one must not subvert the semantics of a TEI document for the sake of controlling formatting, at the risk of introducing TEI non-conformance and interoperability issues. That control should instead be handled by a stylesheet.

This might necessitate a different stylesheet or, as is probable for most cases, modifications to the default stylesheets provided by TEI, if the parameter options (assuming XSL stylesheets are used) are insufficient to express a project’s output requirements. It is, for example, possible to allow the stylesheets to recognize multiple attributes, even at the same time.

Returning changes to the default stylesheets

The stylesheets from TEI are evolving along with the TEI project itself, however, so the TEI project might be open to certain changes (whether optional or required) to its default stylesheets if the changes offered are of interest to a wider audience and TEI has the resources to implement them. Given that the code of these stylesheets is open source, returning stylesheet improvements (or improvements to other TEI resources, for that matter) to the community may benefit TEI and its users as well as the individual project, since the project is spared from reapplying its modifications each time an update occurs.

The more standard (but not standardized) styling expectations and stylesheets become, the more likely it is that TEI processing applications can be used to render TEI in a familiar format (such as when a document is obtained directly off the web), while still allowing publishers full freedom to deviate from such common conventions if they wish.

Influence of formatting (or accessibility) concerns on markup

While one should not subvert the semantics of a TEI document in order to control formatting, viewing a formatted document might prompt a project to consider changes to the original TEI documents beyond simply customizing a stylesheet, such as:

giving a more detailed encoding of the original rendering (as that information can be used to produce output rendering, assuming again that the original document being represented indeed possesses this rendering), using rendition, tagUsage render, @rendition, and @rend
adding more semantic “hooks”, whether through hitherto unused elements or through attributes such as @n or @type (potentially with the addition of generic elements like seg), which provide more semantic detail about certain text and can in turn be targeted by a stylesheet for more granular control of output formatting. This may also have the benefit of adding semantic richness to the document (ideally using the more specific elements already recommended for this purpose). Such semantic “hooks” can also ensure that the output formatting includes sufficient accessibility features, such as alternative text accompanying any graphics or images that could not otherwise be interpreted by a speech browser.

Semantic information and output formatting

Besides formatting concerns leading one to add semantic distinctions to a TEI document, two further questions arise. One may wish to encode a certain degree of semantic information (to the extent allowed by the output formatting language) in one’s formatted output. One may also need to consider the extent to which output formatting markup is separated from more generic output structural markup (e.g., creating CSS to hold styles with XHTML used to present the structure, or encoding structural and formatting markup together). Both are discussed below.

Encoding semantic information within formatted output

While it may often be the case that TEI will be converted to a formatted output in which semantic information is lost, certain output formats allow some if not all semantic information to be retained in some manner. For example, XHTML can use the approach of microformats (http://microformats.org) to use the global and generic XHTML @class attribute to contain information such as the original TEI tag name. While it would likely be too cumbersome to originate documents in such a format (assuming all TEI semantics could be encoded with such an approach), it offers the advantage that one might, for example, use a web browser to obtain a document already pre-rendered, yet use a microformat processor within the browser (possibly available as a browser extension) to search for semantic information.
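A minimal sketch of this approach, in Python, is given here. It copies each TEI element into an XHTML element whose @class records the original TEI tag name. The "tei-" class prefix, the element mapping, and the function name to_xhtml are assumptions made for illustration and do not correspond to any published microformat.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: TEI elements not listed here become generic spans.
XHTML_NAMES = {"p": "p"}


def to_xhtml(elem):
    """Copy an element tree, keeping each TEI tag name in the @class attribute."""
    out = ET.Element(XHTML_NAMES.get(elem.tag, "span"), {"class": f"tei-{elem.tag}"})
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_xhtml(child))
    return out


tei = ET.fromstring(
    "<p>See <persName>Ada Lovelace</persName> in <placeName>London</placeName>.</p>"
)
HTML = ET.tostring(to_xhtml(tei), encoding="unicode")
print(HTML)
```

A processor that knows the prefix convention could then recover the TEI element names from the rendered page, although attributes and namespaces are not carried over in this sketch.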

Encoding formatting within structural and semantic output

It has become a generally recommended practice for even XHTML documents on the web to separate their formatting content (as with CSS) into a separate file from the structural content (of paragraphs, generic divisions, etc.). This offers various advantages such as speed in downloading (by browser caching for repeatedly used stylesheet files or by those using speech browsers being able to avoid downloading visually-oriented stylesheets), or flexibility in subsequent style changes. While one might define an XSL stylesheet to create specific XHTML @class attribute values which are associated with those classes targeted in a predefined CSS stylesheet, XSLT 2.0 might be employed to utilize information such as contained within @rend or rendition elements to specify the creation of a CSS document while also simultaneously creating the XHTML output document. See the sections on preserving original formatting.

Despite the generally recommended practice of separating styles from structural and semantic information, there is at present no means of making queries which utilize style information contained in separate files. Some may therefore wish to have their formatting output mixed in with structural output, so that queries can take advantage of styling information, semantic information, or both. (The @style attribute might be harder to parse in a query than specific XHTML formatting structures would be, even though the latter may be deprecated in later versions.) In the same way, one might prefer to encode italic emphasis using emph n="italic" class="font-style:italic" rather than the more formal and correct syntax of emph class="font-style: italic;", since the former is easier to parse. For example, if one views a document and sees that italic text is used for emphasis, one might wish to search for a certain phrase contained within italic text, either because one recalls the text occurring there, or because one has identified a pattern represented by italic text without knowing the exact name of the pattern, and thus without knowing what specific tag to search for.
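The kind of query described above can be sketched in Python as follows. The sketch assumes that italic text is marked with a plain keyword in @rend, which is the easily parsed form; the sample text and the function name find_in_italics are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; assumes italics are recorded as the keyword "italic" in @rend.
DOC = """
<text>
  <p>He said it was <emph rend="italic">absolutely vital</emph> to leave.</p>
  <p>The ship <name rend="italic">Vital Spark</name> sailed; it was vital to go.</p>
  <p>Another <hi rend="bold">vital</hi> point.</p>
</text>
"""


def find_in_italics(root, phrase):
    """Return (tag, text) for every italic element whose text contains the phrase."""
    hits = []
    for elem in root.iter():
        if "italic" in elem.get("rend", "").split():
            content = "".join(elem.itertext())
            if phrase.lower() in content.lower():
                hits.append((elem.tag, content))
    return hits


HITS = find_in_italics(ET.fromstring(DOC), "vital")
print(HITS)
```

The search finds the phrase under both emph and name without the user needing to know which tag was used, which is the scenario the paragraph describes. Had the styling been held only in a separate stylesheet, the query would first have to resolve which elements that stylesheet renders in italics.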

General categories of elements to consider in formatting (or not formatting at all)

Before considering the usual rendering and default rendering options for specific elements (along with any specific attributes), it is worth considering some general issues pertaining to certain types of elements.

Likelihood of printing out specific elements

Elements differ in the likelihood a project will wish to render them in a formatted output document. They range from editorial information which might never be printed out for common viewing, to elements which will sometimes be printed out (such as a choice listing the original text and a regularized or corrected form) to elements such as paragraphs which almost certainly will be printed out.

For the case of those which will always be printed out, one can make their styling explicit by using tagUsage elements with a @render attribute pointing to a rendition element with the styling details (and optionally the code). One might optionally even indicate specific elements which should not be displayed (though depending on the stylesheet language, this might not strictly be necessary).

Moreover, a stylesheet might wish to depend on element-specific or global attributes (whether semantic or rendition-related) to target elements with or without these attributes or with specific values to display or not display them selectively.

Elements which occur in the header will generally not be printed out, though for some projects’ purposes, display of this information (e.g., bibliographic data) may be useful to include in the formatted output.

While other elements that occur within the running text will generally be printed out, it is important to understand that with TEI (which, as cannot be emphasized enough, is not a formatting language) this will not always be the case. A project may have editorial information that should not be printed within the running text (or at least should not appear alongside it), because encoder-added information would disrupt the flow of the text (e.g., of a narrative) or might even be considered irreverent by some viewers (as with scriptural works). It is then important to be aware of all the tags that the project might not want printed out, so that the stylesheet (possibly in conjunction with special semantic TEI markup, if not markup indicating original rendering) does not display those tags’ content.

Elements which are defined by the following macros are generally not to be displayed:

macro.limitedContent (http://www.tei-c.org/release/doc/tei-p5-doc/html/ref-macro.limitedContent.html): desc, fDescr, figDesc, fsDescr, meeting, rendition, tagUsage, witness
macro.phraseSeq.limited (http://www.tei-c.org/release/doc/tei-p5-doc/html/ref-macro.phraseSeq.limited.html): activity, age, authority, channel, classCode, constitution, creation, derivation, domain, factuality, funder, interaction, interp, langKnown, language, locale, metSym, preparedness, principal, purpose, resp, span, sponsor, valDesc
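Suppressing such elements is not quite the same as deleting them, because the running text that follows a suppressed element must survive. The Python sketch below shows one way of doing this. The HIDDEN set is a hypothetical project choice drawn from the lists above, the sample paragraph is simplified (required attributes and the TEI namespace are omitted), and the function name strip_hidden is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical project decision: a few element names taken from the lists above.
HIDDEN = {"desc", "figDesc", "interp", "span"}


def strip_hidden(elem, hidden=HIDDEN):
    """Remove hidden elements in place, keeping the text that follows each one."""
    for child in list(elem):
        if child.tag not in hidden:
            strip_hidden(child, hidden)
            continue
        siblings = list(elem)
        position = siblings.index(child)
        tail = child.tail or ""
        if position == 0:
            # No preceding sibling: the tail joins the parent's leading text.
            elem.text = (elem.text or "") + tail
        else:
            previous = siblings[position - 1]
            previous.tail = (previous.tail or "") + tail
        elem.remove(child)
    return elem


para = ET.fromstring(
    "<p>The <hi>text</hi><interp>gloss</interp> continues<span>remark</span> here.</p>"
)
VISIBLE = "".join(strip_hidden(para).itertext())
print(VISIBLE)
```

In an XSLT stylesheet the same effect is normally obtained with an empty template for the elements concerned; the sketch only makes explicit what has to be preserved when they are dropped.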

Moreover, some elements, such as those in model.noteLike (note and witDetail), might appear in the output document in some cases but in other cases, or within some projects, might be rendered only sometimes or not at all. Elements in model.global.meta, with members such as alt, altGrp, certainty, fLib, fs, fvLib, index, interp, interpGrp, join, joinGrp, link, linkGrp, respons, span, spanGrp, and timeline, as well as elements carrying the global @exclude attribute, may or may not be output when included within a document.

Still others include items that may be contained within choice: abbr, am, corr, ex, expan, orig, reg, seg, sic, unclear. A project may need to consider whether to output these elements with both choices being shown in some manner (even as a mere tooltip that is exposed when certain text is hovered over) or whether to show only one of the choices (such as that reflecting the original or some regularization, correction, expansion, or abbreviation).
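The two options just mentioned (showing one reading only, or showing one reading with the other in a tooltip) might be sketched in Python as follows. The grouping of elements into EDITED and ORIGINAL readings, the preference for the edited reading, the "tei-choice" class name, and the function name render_choice are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical project policy: show the edited reading, keep the original in a tooltip.
EDITED = ("corr", "reg", "expan")
ORIGINAL = ("sic", "orig", "abbr")


def render_choice(choice, tooltip=True):
    """Render a choice with one edited and one original reading as XHTML text."""
    shown = next(c for c in choice if c.tag in EDITED)
    original = next(c for c in choice if c.tag in ORIGINAL)
    shown_text = escape("".join(shown.itertext()))
    if not tooltip:
        return shown_text
    title = escape("".join(original.itertext()), quote=True)
    return f'<span class="tei-choice" title="{title}">{shown_text}</span>'


choice = ET.fromstring("<choice><sic>recieve</sic><corr>receive</corr></choice>")
WITH_TIP = render_choice(choice)
PLAIN = render_choice(choice, tooltip=False)
print(WITH_TIP)
print(PLAIN)
```

A project that prefers to display the original reading would simply swap the two groups. The sketch does not handle a choice lacking one of the two kinds of reading.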

Likewise with elements belonging to model.pPart.transcriptional: add, app, corr, damage, del, orig, reg, restore, sic, supplied, unclear. One may or may not wish to indicate supplied text, for example, or may choose to format damaged sections in some particular manner.

(any others????)

Since there is no way of knowing whether some of the elements mentioned above, such as note, refer to text that should be printed out or not, one must rely on other mechanisms to specify or glean this information. One way would be to use attributes such as @resp to detect whether the note was the responsibility of a markup editor of the document or was provided by an original annotator of the document. However, as this might not always be clear (especially if the markup annotator also served as the original annotator), other attributes such as @type or, where @type is not available, possibly @n or even xml:id might be used instead.
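A stylesheet decision of this kind might be sketched in Python as follows. The @type values "authorial" and "editorial" are hypothetical project conventions rather than values defined by TEI, and the function name displayed_notes is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the @type values are project conventions, not TEI-defined.
DOC = """
<p>First sentence.<note type="authorial" resp="#author">Author's own footnote.</note>
Second sentence.<note type="editorial" resp="#ed1">Encoder's remark.</note>
Third sentence.<note resp="#ed1">Untyped encoder remark.</note></p>
"""


def displayed_notes(root, shown_types=frozenset({"authorial"})):
    """Return the text of the notes whose @type marks them for display."""
    return [
        note.text for note in root.iter("note") if note.get("type") in shown_types
    ]


NOTES = displayed_notes(ET.fromstring(DOC))
print(NOTES)
```

Untyped notes are hidden by default in this sketch. A project could equally decide to show them, which is one reason such conventions need to be documented alongside the markup.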

Note that despite its being listed above, an element such as figDesc, while it might not normally be displayed immediately to a visual browser, might still nevertheless be important (or even required in some formatting languages or in use with projects needing by law to adhere to accessibility regulations) for the sake of being accessible to those with visual disabilities who might depend on speech browsers or tooltips to be able to get a sense of what a particular graphic, photograph, etc. was displaying. It is certainly good practice to follow such an encoding, both within TEI documents and in the formatted output, where available.

Behaviour within corr and similar elements

One exception to the rule that @rend and @rendition (or their absence) indicate original formatting occurs within a corr tag.

The addition of a @rend within corr (or reg) would apply to an intended output formatting, rather than to the original formatting as it normally does.

[Example markup not preserved; surviving text: "The Times" / "The Times"]

Likewise, especially if the default rendering of title were specified by a tagUsage and rendition element (here to indicate a title as being by default in italic), the absence of @rend or @rendition would still imply a change of formatting:

[Example markup not preserved; surviving text: "The Times" / "The Times"]

Items needing replication

Some elements, or elements with certain attributes (such as @copyOf or join), may need special consideration for output, since they might require formatted output to be created that (as with other cases described earlier) would not be evident from simply stripping the markup out of the document.

Text attributes

Most attributes are used with coded values, as they are not meant to represent human language or to be displayed. Text attributes are the exception to this, though it is commonly preferred for an XML language to represent such attributes as elements, so that nested subelements representing markup at the phrasal level, etc. can be added within them as needed.

Text attributes have generally been removed from TEI, and some of those that remain one might not wish to output in a formatted version anyway. If one does wish to include, for example, @reason in one’s output, one will be unable to add styling which depends on child elements for more specific formatting, since the information is expressed within an attribute (but one can style differently depending on the element’s @xml:lang, as that does apply to text attributes, as well as on any other attributes on the element). Likewise for the dictionary attributes @expand, @norm, @split, @value, and @orig, which represent the remaining text attributes????.

((((Syd prepared a list of potential text attributes to review to see if they were still text attributes–it’d be nice to be able to give such an exhaustive list here.))))

I think the element-specific details might be logically incorporated as documentation elements within XSL that could be extracted for automatic inclusion within the TEI reference pages, making clear that the formatting discussed is only that of the default behavior used in TEI-provided stylesheets (though also discussing the range of options that the stylesheet makes available through parameters). I really think giving awareness of these formatting issues in the context of considering these elements would be more helpful than waiting for people to discover them separately in the stylesheets.

Consideration of default transformation behavior

While, as mentioned earlier, there is no required mapping of TEI elements and attributes to specific output document structures (e.g., XHTML/CSS, LaTeX, etc.), the fact that TEI provides a default set of stylesheets to work with (albeit a parameterized one) and that these are presumably well-used [by the number of downloads????] indicates that there is a general set of expectations about how most TEI structures will appear when output. The effort required to create one’s own stylesheets from scratch for a vocabulary as large as TEI’s, or even to modify existing stylesheets significantly (let alone to do so each time improvements and adjustments are made to the default files), also makes an understanding of how documents will be transformed imperative for many projects. Thus, it becomes necessary to understand how TEI might commonly be transformed (or understood to be transformed), even beyond the extent to which the stylesheets themselves are documented and express (mostly in technical language) the templates used to transform TEI into a formatting language.

The default stylesheets provided by TEI serve as a good basis for discussion of how formatting can be performed, and they are documented here for the sake of those who wish to know how each structure they might use in a TEI document might be rendered by default. Neither the stylesheets nor this discussion should be taken as any kind of requirement to use these stylesheets as a base, or even at all.

Specific formatting for specific elements (and any attributes)

(to be displayed on reference pages?)

Specific formatting for specific categories of formatting (images, etc.)

(to be compiled after reference pages have their information fleshed out)