Dictionaries
The encoding of electronic dictionaries, both monolingual and
multi-lingual, has been and remains the subject of much intense
research effort, an adequate summary of which is beyond the scope
of the present work [cite some useful pointers]. Within the
community of computing lexicologists, two broad paradigms have
emerged: the scriptural and the prophetic [cite some examples].
For those guided by the scriptures,
the electronic dictionary is only a rather more flexible version
of a printed dictionary. Their concern is with the management of
strings of characters, whether derived directly from publishers'
typesetting tapes or scanned or retyped from printed originals. For
those guided by the prophets, what count are the underlying
linguistic phenomena, of which the printed dictionary is one
possible reflection among many. Their concern is with the
management of linguistic concepts as actualised in electronic
lexica, perhaps never intended for direct human consumption. At
present, we propose Guidelines for the former school only, since
it is here that most effort has already been expended. (Most of
the proposals summarised here derive from work carried out within
the DEI [cite RSA/FWT papers, F&V, who else?].) Moreover, it
seems probable that the general purpose mechanisms described in
chapter will be more than adequate to the task
of describing the contents of an electronic lexicon, although this
has yet to be put to the test.
Form and content
Printed dictionaries are among the most typographically complex
of the materials discussed in these guidelines. Even a small
dictionary will use different sizes of italic, roman and bold
fonts, small capitals, variable leading and a range of special
symbols and abbreviatory conventions far in excess of those likely
to be encountered in the most avant garde of concrete poems.
Dictionaries are also unusual in the extent to which typographic
conventions are consciously used to denote structural function:
some dictionaries even include a key at the start explaining that,
for example, headwords are always in bold face, cross references
always employ a certain symbol, subordinate entries are always
indicated by small capitals and so forth. This is not to deny that
in dictionaries, as with other texts, and particularly in older
printed dictionaries, the function of some features may remain
ambiguous or unclear. The use of italics may have many purposes in
a dictionary entry, not all of which are immediately apparent. This
section will however propose tags only for agreed structural
components of dictionary entries. For tags appropriate to the
rendering of an entry independent of any consideration of its
function (for example ambiguous font usage, page layout,
pagination etc.), see section .
Basic components
At its simplest, a dictionary is composed of entries.
Each entry has a single word or phrase at its head (the
headword), to which are attached groups of
information from the following list:
- forms: associated written or spoken forms of lexical items
- meanings: sets of definitions (in a monolingual dictionary) or
translations (in a multilingual one), usually including examples
of usage
- related forms: lists of related lexical items, such as compound
phrases, derived forms etc.
- etymology: descriptions or structured analysis of the presumed
history of lexical items
In different dictionaries, components from this list may be
organised in different ways. We make no attempt here to define a
universal structure for all possible dictionaries. Instead, we
propose items within each of the four categories above which
should, in our opinion, be distinguished by whatever tagging system
is employed, together with proposed names for them. We also give
some suggestions for the kinds of change which may be necessary
when editing e.g. typesetting tapes or scanner output.
The following elements may appear anywhere within an entry.
- form: Orthographic form of a word.
- pos: Part of speech. Different dictionaries use codes taken from
  different lists representing a variety of linguistic theories.
  Dictionaries also vary in the extent of encoding used. Whatever is
  recorded in the dictionary should be entered in this field: we do
  not propose the establishment of a universal code for parts of
  speech in this context. The mapping between the codes used here
  and those proposed in section 8 however...
- pron: Pronunciation. This element is subdivided as follows:
  - sound: A string representing the pronunciation in whatever
    notation the dictionary uses.
  - extent: Amount of pronunciation given (whole or affix).
  - label: Usage label: attribute `type' indicates whether this
    pronunciation is regionally, socially, historically or otherwise
    marked.
  - codes: Other encoded information about the pronunciation, e.g.
    syllabification or stress pattern, where these are not apparent
    from the string given by sound. [include a reference to A&T on
    stress syllable/syllable distance encoding/s here?]
- label: Any usage label attached to a word, for example to indicate
  its geographic origin, subject domain, register, usage status,
  syntactic coding etc. An attribute `type' may be used to indicate
  the type of label, with values such as "geog", "domain",
  "register". As with hw.type, an exhaustive list is only possible
  for any one dictionary and is not attempted here.
- xref: A cross reference to some other entry in the dictionary.
  This may contain sufficient text to identify the target reference,
  e.g. the headword plus homonym and sense numbers, or may simply use
  phrases such as "preceding". In either case, a `REFID' attribute
  should be used in addition to provide the value of the ID attribute
  of the element referred to.
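As a minimal sketch of how the general-purpose elements listed above might nest, the following Python fragment builds one entry with the standard xml.etree.ElementTree module. The container name `entry`, the lowercase attribute name `refid`, the pronunciation label type "regional" and all of the sample content are assumptions made for illustration only; the element names form, pos, pron, sound, extent, label and xref are those proposed above.

```python
import xml.etree.ElementTree as ET


def build_general_elements():
    """Build one hypothetical entry using the general-purpose elements."""
    entry = ET.Element("entry")  # container name is an assumption
    ET.SubElement(entry, "form").text = "colour"
    # The part of speech is recorded exactly as the dictionary gives it.
    ET.SubElement(entry, "pos").text = "n."

    # pron groups the pronunciation string with its extent and label.
    pron = ET.SubElement(entry, "pron")
    ET.SubElement(pron, "sound").text = "(pronunciation string)"
    ET.SubElement(pron, "extent").text = "whole"
    ET.SubElement(pron, "label", {"type": "regional"}).text = "Brit."

    # A usage label attached to the word itself, typed as geographic.
    ET.SubElement(entry, "label", {"type": "geog"}).text = "Brit."

    # The cross reference keeps its printed text and also carries the
    # identifier of the element it points to (invented here).
    xref = ET.SubElement(entry, "xref", {"refid": "e0042"})
    xref.text = "see COLOR"
    return entry


entry = build_general_elements()
print(ET.tostring(entry, encoding="unicode"))
```

The same label element appears twice with different scope: once inside pron, where it marks the pronunciation, and once at entry level, where it marks the word.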
HW: Headword of entry
Dictionary entries are conventionally presented as individual
articles, each headed by some morphological form of the lexical
item or items treated within them. The rest of the article may be
organised in different ways, for example to demonstrate the
historical development of a particular lexical item, or in terms
of the semantic field it denotes, but there will always be one
word, word-element or phrase regarded as the head of the entry, in
terms of which the alphabetic sequence of the dictionary is
organised. This we call the headword, and it has a tag
hw, with two possible attributes: `id', an identification
number which should be unique within the dictionary, and `key', a
sort key used to determine the alphabetic sequence of the entry
within the dictionary body. The following elements are unique to
the content of the HW element:
- hw.type: Type of entry. Entries may be categorised in a number of
  ways, for example as "main", "cross-reference", "supplementary"
  etc. Although a closed set of possible values for this element may
  be defined for any one dictionary, this may not be possible in
  general.
- hw.hom: Homonym number. An artifice commonly used to distinguish
  orthographically identical but lexically different headword forms.
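The roles of the `id' and `key' attributes can be sketched as follows, again in Python with xml.etree.ElementTree. The helper `make_hw`, the identifier values and the format of the sort keys (headword followed by homonym number) are hypothetical; placing the headword text in a form element inside hw is also an assumption.

```python
import xml.etree.ElementTree as ET


def make_hw(ident, key, form, hw_type="main", hom=None):
    """Build a hypothetical hw element with its id and key attributes."""
    hw = ET.Element("hw", {"id": ident, "key": key})
    ET.SubElement(hw, "form").text = form
    ET.SubElement(hw, "hw.type").text = hw_type
    if hom is not None:
        # The homonym number separates identical spellings.
        ET.SubElement(hw, "hw.hom").text = str(hom)
    return hw


# Headwords in arbitrary input order; keys are invented for illustration.
headwords = [
    make_hw("h3", "bank2", "bank", hom=2),
    make_hw("h1", "apple", "apple"),
    make_hw("h2", "bank1", "bank", hom=1),
]

# The key attribute, not the printed form, determines the sequence.
ordered = sorted(headwords, key=lambda hw: hw.get("key"))
order_ids = [hw.get("id") for hw in ordered]

# Identifiers should be unique within the dictionary.
ids_are_unique = len({hw.get("id") for hw in headwords}) == len(headwords)
```

Sorting on a separate key rather than on the printed form leaves the dictionary free to order homonyms, capitalised forms or accented forms in whatever way its editors chose.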
VL: Variant Forms
Some dictionaries include lists of historical, morphological or
other derived forms of words under the headword. Others arrange
such variants hierarchically as main lemma, sublemma, sub-sub-lemma
etc. Such lists should be distinguished from Related Entries (see
), which comprise lists of essentially independent
lexical entries, each with their own internal structure but grouped
with the headword for convenience. The VL tag is used
to group variant forms etc. given within the headword entry as a
single unit, often with interspersed text such as "Also",
"chiefly", "Hence" etc. Individual components of such lists should
all be tagged with the appropriate tags (form, pos, usage etc.).
Other elements, unique to the content of the VL
element, are:
- vl.type: Type of list (historical, inflexional etc.)
MG: Meaning group
The meat of a dictionary entry is that part where individual
senses are defined or translated and exemplified. This section is
delimited as a block by the MG tag. Within the block,
individual senses are tagged by the SENSE tag, which
should carry an `ID' attribute to identify individual numbered
senses in addition to the sense-number (if any) present in the
text. Each sense element may contain the following elements, in
addition to form, pos or usage:
- sn: Sense number, as given in the text. In complex dictionaries,
  such as the OED, this label may also be used to indicate the
  hierarchic position of the sense within the whole structure.
- gloss: Definition text, giving the meaning or translation.
- eg: An example of usage. Where possible, this should be subdivided
  further to include the following subelements:
  - eg.date: Date of citation
  - eg.auth: Author of citation
  - eg.work: Work cited
  - eg.detail: Other detail of citation source
  - eg.text: Text of citation
Note that, in some dictionaries, examples are given in a separate
list, linked to individual senses by reference. In such cases, a
separate eglist should be used to delimit the block of
examples, and an additional element (a sense.ref)
included in each constituent eg element to supply the
identifier of the relevant sense.
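The linkage between a separate example list and the senses it illustrates can be sketched in Python as below. The sample is written as XML so that the standard xml.etree.ElementTree parser can read it; the container name `entry`, the identifier values, the lowercase spellings of the tags and all glosses and citations are invented for illustration. The function `examples_by_sense` is a hypothetical helper, not part of any proposal made here.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<entry>
  <mg>
    <sense id="s1"><sn>1</sn><gloss>first invented gloss</gloss></sense>
    <sense id="s2"><sn>2</sn><gloss>second invented gloss</gloss></sense>
  </mg>
  <eglist>
    <eg><sense.ref>s2</sense.ref><eg.date>1850</eg.date>
        <eg.text>invented citation A</eg.text></eg>
    <eg><sense.ref>s1</sense.ref><eg.text>invented citation B</eg.text></eg>
    <eg><sense.ref>s2</sense.ref><eg.text>invented citation C</eg.text></eg>
  </eglist>
</entry>
"""


def examples_by_sense(entry):
    """Group citation texts under the id of the sense they refer to."""
    grouped = {sense.get("id"): [] for sense in entry.iter("sense")}
    for eg in entry.iter("eg"):
        target = eg.findtext("sense.ref")
        if target not in grouped:
            # A reference to a missing sense is an encoding error.
            raise ValueError(f"example refers to unknown sense {target!r}")
        grouped[target].append(eg.findtext("eg.text"))
    return grouped


entry = ET.fromstring(SAMPLE.strip())
linked = examples_by_sense(entry)
```

Because every sense carries an `ID' attribute, the examples may appear in any order within the list and can still be reattached to the correct sense.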
RE: Related entries
The RE tag introduces a block of degenerate headword
entries grouped under the main headword for some purpose. Some
dictionaries may categorise such blocks, distinguishing, for
example, direct derivatives or inflected forms of the headword,
compound words and phrases containing the headword.
Other elements, unique to the content of the RE
element, are:
- re.type: Type of related entries (compounds, phrases etc.)
Etymologies
Etymologies, except in very specialist dictionaries, generally
contain loosely organised lists of related words from different
languages, with some indication of their relationship to the
headword and possibly a translation. Occasionally an etymology is
sufficiently clearly understood for there to be both an ultimate
root form, or etymon, and a clear hierarchic path by which its
derivatives were created. More usually, speculation is rife
[cite Abate].
We propose here a view of an etymology as a structure composed
of a number of simple et.nodes, each of which may carry
a unique identifier attribute `id' and comprises a form, a
language, an optional gloss, an optional note and an optional
cross-reference. The relationships between nodes, where these can
be specified, are represented by an et.link tag, which
carries two attributes: one (`rel') specifying the type of
relationship (e.g. "from", "cf", "variant", "ancestor"), the other
(`id') the target of the relationship.
A further complication is that detailed etymologies (unlike other
parts of dictionaries) are often presented as continuous prose
essays from which individual nodes in the presumed hierarchic tree
cannot easily be extracted without doing some violence to the text
of the essay. In extreme cases, it may be necessary to present the
etymology both as a structured list of et.nodes and as
a separate prose text, distinguished by the tag et.text.
Even in simple cases, some duplication of the text as printed will
usually be necessary.
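The node-and-link view of an etymology described above might be modelled as in the following Python sketch, which uses the forms given in the Chambers etymology for "apple" quoted in the Examples. Several details are assumptions made for illustration: the container name `etym`, the element name `lang` for the language of a node, the placement of each et.link inside the node it starts from, and the choice of "cf" as the relationship between each cognate and the Old English form.

```python
import xml.etree.ElementTree as ET

# (language, form) pairs for the cognates in the Chambers etymology.
COGNATES = [("Ger", "apfel"), ("O.N.", "epli"), ("Ir.", "abhal"), ("W.", "afal")]


def make_node(ident, lang, form):
    """Build one et.node with its identifier, form and language."""
    node = ET.Element("et.node", {"id": ident})
    ET.SubElement(node, "form").text = form
    ET.SubElement(node, "lang").text = lang  # element name is an assumption
    return node


def build_etymology():
    """Build an etymology whose cognate nodes each link to node n1."""
    etym = ET.Element("etym")  # container name is an assumption
    etym.append(make_node("n1", "O.E.", "æppel"))
    for number, (lang, form) in enumerate(COGNATES, start=2):
        node = make_node(f"n{number}", lang, form)
        # rel gives the type of relationship, id the target node.
        ET.SubElement(node, "et.link", {"rel": "cf", "id": "n1"})
        etym.append(node)
    return etym


def dangling_links(etym):
    """Return the targets of et.link elements that match no et.node."""
    known = {node.get("id") for node in etym.iter("et.node")}
    return [link.get("id") for link in etym.iter("et.link")
            if link.get("id") not in known]


etym = build_etymology()
unresolved = dangling_links(etym)
```

A flat list of nodes with explicit links can represent a clear hierarchic derivation or a loose set of cognates equally well, which suits the uncertain state of most etymologies.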
Examples
O.E. æppel, Cf. Ger apfel; O.N. epli; Ir. abhal; W. afal
[Chambers Twentieth Century Dictionary, ed. A. M. Macdonald. New
ed., 1972]
O.E.
ME appel < OE æppel, fruit, apple (also, eyeball, anything
round); akin to OIr aball (Welsh afall), apple tree.
[Webster's New World Dictionary, 3rd ed., cited by Abate]
ME
[Something on expanding abbreviations e.g. tilde for headword
in entry etc.?]