Dictionaries

The encoding of electronic dictionaries, both monolingual and multilingual, has been and remains the subject of intense research, an adequate summary of which is beyond the scope of the present work. [cite some useful pointers] Within the community of computing lexicologists, two broad paradigms have emerged: the scriptural and the prophetic. [cite some examples] For those guided by the scriptures, the electronic dictionary is only a rather more flexible version of a printed dictionary. Their concern is with the management of strings of characters, whether derived directly from publishers' typesetting tapes or scanned or retyped from printed originals. For those guided by the prophets, what counts is the underlying linguistic phenomena, of which the printed dictionary is one possible reflection among many. Their concern is with the management of linguistic concepts as actualised in electronic lexica, perhaps never intended for direct human consumption. At present, we propose Guidelines for the former school only, since it is here that most effort has already been expended. Most of the proposals summarised here derive from work carried out within the DEI. [cite RSA/FWT papers, F&V, who else?] Moreover, it seems probable that the general purpose mechanisms described in chapter will be more than adequate to the task of describing the contents of an electronic lexicon, although this has yet to be put to the test.

Form and content

Printed dictionaries are among the most typographically complex of the materials discussed in these Guidelines. Even a small dictionary will use different sizes of italic, roman and bold fonts, small capitals, variable leading and a range of special symbols and abbreviatory conventions far in excess of those likely to be encountered in the most avant-garde of concrete poems. Dictionaries are also unusual in the extent to which typographic conventions are consciously used to denote structural function: some dictionaries even include a key at the start explaining that, for example, headwords are always in bold face, cross references always employ a certain symbol, subordinate entries are always indicated by small capitals and so forth. This is not to deny that in dictionaries, as in other texts, and particularly in older printed dictionaries, the function of some features may remain ambiguous or unclear. Italics, for instance, may serve many purposes in a dictionary entry, not all of which are immediately apparent. This section will however propose tags only for agreed structural components of dictionary entries. For tags appropriate to the rendering of an entry independent of any consideration of its function (for example ambiguous font usage, page layout, pagination etc.) see section .

Basic components

At its simplest, a dictionary is composed of entries. Each entry has a single word or phrase at its head (the headword), to which are attached groups of information from the following list:

In different dictionaries, components from this list may be organised in different ways. We make no attempt here to define a universal structure for all possible dictionaries. Instead, we propose items within each of the four categories above which should, in our opinion, be distinguished by whatever tagging system is employed, together with proposed names for them. We also give some suggestions for the kinds of change which may be necessary when editing e.g. typesetting tapes or scanner output.

The following elements may appear anywhere within an entry.

form: Orthographic form of a word.

pos: Part of speech. Different dictionaries use codes taken from different lists representing a variety of linguistic theories. Dictionaries also vary in the extent of encoding used. Whatever is recorded in the dictionary should be entered in this field: we do not propose the establishment of a universal code for parts of speech in this context. The mapping between the codes used here and those proposed in section 8 however...

pron: Pronunciation. This element is subdivided as follows:
    sound: A string representing the pronunciation in whatever notation the dictionary uses.
    extent: Amount of pronunciation given (whole or affix).
    label: Usage label; the attribute `type' indicates whether this pronunciation is regionally, socially, historically or otherwise marked.
    codes: Other encoded information about the pronunciation, e.g. syllabification or stress pattern, where these are not apparent from the string given by sound. [include a reference to A&T on stress syllable/syllable distance encoding/s here?]

label: Any usage label attached to a word, for example to indicate its geographic origin, subject domain, register, usage status, syntactic coding etc. An attribute `type' may be used to indicate the type of label, with values such as "geog", "domain", "register". As with hw.type, an exhaustive list is only possible for any one dictionary and is not attempted here.

xref: A cross reference to some other entry in the dictionary. This may contain sufficient text to identify the target reference, e.g. the headword plus homonym and sense numbers, or may simply use phrases such as "preceding". In either case, a `REFID' attribute should be used in addition to provide the value of the ID attribute of the element referred to.
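A minimal sketch of how these general elements might be combined is given below. It assumes XML-style syntax with explicit end tags and lower-case names (so `REFID' appears as refid). The entry wrapper is hypothetical, since the text does not name the element that encloses an entry; the respellings, the codes notation and the identifier e2077 are likewise invented for illustration.

```xml
<!-- Sketch only. "entry" is a hypothetical wrapper element. -->
<entry>
  <form>tomato</form>
  <pos>n.</pos>
  <pron>
    <!-- respelling notation invented for this example -->
    <sound>tuh-MAH-toh</sound>
    <extent>whole</extent>
    <label type="geog">Brit.</label>
    <!-- hypothetical stress code: stress on the second syllable -->
    <codes>stress=2</codes>
  </pron>
  <pron>
    <sound>tuh-MAY-toh</sound>
    <extent>whole</extent>
    <label type="geog">U.S.</label>
  </pron>
  <label type="domain">Bot.</label>
  <!-- refid carries the ID of the target element; e2077 is invented -->
  <xref refid="e2077">see love-apple</xref>
</entry>
```

The text content of xref keeps the cross reference as printed, while refid gives a processing program an unambiguous target.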

HW: Headword of entry

Dictionary entries are conventionally presented as individual articles, each headed by some morphological form of the lexical item or items treated within them. The rest of the article may be organised in different ways, for example to demonstrate the historical development of a particular lexical item, or in terms of the semantic field it denotes, but there will always be one word, word-element or phrase regarded as the head of the entry, in terms of which the alphabetic sequence of the dictionary is organised. This we call the headword, and it has a tag hw, with two possible attributes: `id', an identification number which should be unique within the dictionary, and `key', a sort key used to determine the alphabetic sequence of the entry within the dictionary body. The following elements are unique to the content of the HW element:

hw.type: Type of entry. Entries may be categorised in a number of ways, for example as "main", "cross-reference", "supplementary" etc. Although a closed set of possible values for this element may be defined for any one dictionary, this may not be possible in general.

hw.hom: Homonym number. An artifice commonly used to distinguish orthographically identical but lexically different headword forms.
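As an illustration, a headword might be tagged along the following lines. This is a sketch under stated assumptions: XML-style syntax with lower-case names, invented values for id and key, and an assumed ordering of the elements inside hw, which the text does not prescribe.

```xml
<!-- Sketch only: the id and key values are invented. -->
<hw id="e1042" key="bank 1">
  <hw.type>main</hw.type>
  <form>bank</form>
  <!-- first of several homonyms spelled "bank" -->
  <hw.hom>1</hw.hom>
  <pos>n.</pos>
</hw>
```

A second homonym would carry its own id and key and an hw.hom of 2, so that the two entries sort together but remain distinct targets for cross references.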

VL: Variant Forms

Some dictionaries include lists of historical, morphological or other derived forms of words under the headword. Others arrange such variants hierarchically as main lemma, sublemma, sub-sub-lemma etc. Such lists should be distinguished from Related Entries (see ), which comprise lists of essentially independent lexical entries, each with its own internal structure but grouped with the headword for convenience. The VL tag is used to group variant forms etc. given within the headword entry as a single unit, often with interspersed text such as "Also", "chiefly", "Hence" etc. Individual components of such lists should all be tagged with the appropriate tags (form, pos, usage etc.). Other elements, unique to the content of the VL element, are:

vl.type: Type of list (historical, inflexional etc.)
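The following sketch shows one way a short variant list might look. It assumes XML-style syntax with lower-case names; the value given for vl.type is invented, and the label element stands in for the usage information mentioned above.

```xml
<!-- Sketch only: a variant list with interspersed text kept as printed. -->
<vl>
  <vl.type>orthographic</vl.type>
  Also <form>color</form>
  (<label type="geog">chiefly U.S.</label>)
</vl>
```

The interspersed words "Also" and the parentheses are left as plain text inside the list, so that the printed wording survives alongside the tagged components.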

MG: Meaning group

The meat of a dictionary entry is that part where individual senses are defined or translated and exemplified. This section is delimited as a block by the MG tag. Within the block, individual senses are tagged by the SENSE tag, which should carry an `ID' attribute to identify individual numbered senses in addition to the sense number (if any) present in the text. Each sense element may contain the following elements, in addition to form, pos or usage:

sn: Sense number, as given in the text. In complex dictionaries, such as the OED, this label may also be used to indicate the hierarchic position of the sense within the whole structure.

gloss: Definition text, giving the meaning or translation.

eg: An example of usage. Where possible, this should be subdivided further to include the following subelements:
    eg.date: Date of citation
    eg.auth: Author of citation
    eg.work: Work cited
    eg.detail: Other detail of citation source
    eg.text: Text of citation

Note that, in some dictionaries, examples are given in a separate list, linked to individual senses by reference. In such cases, a separate eglist should be used to delimit the block of examples, and an additional element (a sense.ref) included in each constituent eg element to supply the identifier of the relevant sense.
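A meaning group might be sketched as follows. The syntax is XML-style with lower-case names, the entry wrapper is hypothetical, and the identifiers, glosses and citations are all invented for illustration (the author and work are placeholders, not real sources). The first sense carries its example inline; the second is served by a separate eglist, purely to show both arrangements side by side. A real dictionary would normally use one or the other.

```xml
<!-- Sketch only. "entry" is a hypothetical wrapper; all content is invented. -->
<entry>
  <mg>
    <sense id="e1042.s1">
      <sn>1</sn>
      <gloss>The sloping ground beside a river.</gloss>
      <eg>
        <eg.date>1900</eg.date>
        <eg.auth>A. N. Author</eg.auth>
        <eg.work>An Invented Work</eg.work>
        <eg.detail>ch. 2</eg.detail>
        <eg.text>They rested on the bank of the stream.</eg.text>
      </eg>
    </sense>
    <sense id="e1042.s2">
      <sn>2</sn>
      <gloss>A raised ridge of earth.</gloss>
    </sense>
  </mg>
  <eglist>
    <eg>
      <!-- sense.ref supplies the identifier of the sense illustrated -->
      <sense.ref>e1042.s2</sense.ref>
      <eg.date>1900</eg.date>
      <eg.auth>A. N. Author</eg.auth>
      <eg.work>An Invented Work</eg.work>
      <eg.text>A bank of earth ran along the field.</eg.text>
    </eg>
  </eglist>
</entry>
```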

RE: Related entries

The RE tag introduces a block of degenerate headword entries grouped under the main headword for some purpose. Some dictionaries may categorise such blocks, distinguishing, for example, direct derivatives or inflected forms of the headword, compound words and phrases containing the headword. Other elements, unique to the content of the RE element, are:

re.type: Type of related entries (compounds, phrases etc.)
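One possible shape for a block of related entries is sketched below. The text does not specify how the degenerate entries nest inside RE, so the arrangement shown here (an hw followed by an mg for each related entry) is an assumption, as are the XML-style syntax, the lower-case names and the invented identifiers.

```xml
<!-- Sketch only: the nesting of hw and mg inside re is assumed. -->
<re>
  <re.type>compounds</re.type>
  <hw id="e0311.r1" key="apple pie">
    <form>apple pie</form>
    <pos>n.</pos>
  </hw>
  <mg>
    <sense id="e0311.r1.s1">
      <gloss>A pie made with apples.</gloss>
    </sense>
  </mg>
  <hw id="e0311.r2" key="apple tree">
    <form>apple tree</form>
    <pos>n.</pos>
  </hw>
  <mg>
    <sense id="e0311.r2.s1">
      <gloss>A tree that bears apples.</gloss>
    </sense>
  </mg>
</re>
```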

Etymologies

Etymologies, except in very specialist dictionaries, generally contain loosely organised lists of related words from different languages, with some indication of their relationship to the headword and possibly a translation. Occasionally an etymology is sufficiently clearly understood for there to be both an ultimate root form, or etymon, and a clear hierarchic path by which its derivatives were created. More usually, speculation is rife. [cite Abate]

We propose here a view of an etymology as a structure composed of a number of simple et.nodes, each of which may carry a unique identifier attribute `id' and comprises a form, a language, an optional gloss, an optional note and an optional cross-reference. The relationships between nodes, where these can be specified, are represented by an et.link tag, which carries two attributes: one (`rel') specifying the type of relationship (e.g. `from', `cf', `variant', `ancestor'), the other (`id') the target of the relationship.
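The node-and-link view can be sketched using the Webster's etymology of "apple" quoted among the examples in this section. Several things here are assumptions: the XML-style syntax, the etym wrapper, the element name lang for the language component (the text names the component but not its tag), the placing of each et.link inside the node it starts from, and the choice of `cf' for the "akin to" relationships. Following the text, the id attribute on et.link names the target node, so a document type definition would presumably have to declare it as a reference and not as an identifier.

```xml
<!-- Sketch only. "etym" and "lang" are hypothetical names. -->
<etym>
  <et.node id="n1">
    <lang>ME</lang>
    <form>appel</form>
    <!-- ME appel derives from OE æppel -->
    <et.link rel="from" id="n2"/>
  </et.node>
  <et.node id="n2">
    <lang>OE</lang>
    <form>æppel</form>
    <gloss>fruit, apple (also, eyeball, anything round)</gloss>
    <!-- "akin to" rendered as rel="cf" (an assumption) -->
    <et.link rel="cf" id="n3"/>
  </et.node>
  <et.node id="n3">
    <lang>OIr</lang>
    <form>aball</form>
    <gloss>apple tree</gloss>
    <et.link rel="cf" id="n4"/>
  </et.node>
  <et.node id="n4">
    <lang>Welsh</lang>
    <form>afall</form>
    <gloss>apple tree</gloss>
  </et.node>
</etym>
```

Keeping the links separate from the nodes means that a loosely organised etymology can still be recorded as a flat list of nodes, with links added only where the relationship is actually known.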

A further complication is that detailed etymologies (unlike other parts of dictionaries) are often presented as continuous prose essays from which individual nodes in the presumed hierarchic tree cannot easily be extracted without doing some violence to the text of the essay. In extreme cases, it may be necessary to present the etymology both as a structured list of et.nodes and as a separate prose text, distinguished by the tag et.text. Even in simple cases, some duplication of the text as printed will usually be necessary.

Examples

O.E. æppel, Cf. Ger apfel; O.N. epli; Ir. abhal; W. afal [Chambers Twentieth Century Dictionary, ed. A. M. Macdonald, new ed., 1972]

O.E. æppel Ger apfel O.N. epli; Ir. abhal; W. afal

ME appel < OE æppel, fruit, apple (also, eyeball, anything round); akin to OIr aball (Welsh afall), apple tree. [Webster's New World Dictionary, 3rd ed., cited by Abate]

ME appel OE æppel fruit, apple (also, eyeball, anything round); OIr aball apple tree Welsh afall apple tree.

[Something on expanding abbreviations, e.g. tilde for headword in entry, etc.?]