Conference Agenda

Session
Session 3A: Long Papers
Time:
Wednesday, 14/Sept/2022:
2:30pm - 4:00pm

Session Chair: Gustavo Fernandez Riva, Universität Heidelberg
Location: ARMB: 2.98

Armstrong Building: Lecture Room 2.98. Capacity: 168

Presentations
ID: 113 / Session 3A: 1
Long Paper
Keywords: Middle Ages, lexicography, glossary, quantitative analysis, Latin

Vocabularium Bruxellense. Towards Quantitative Analysis of Medieval Lexicography

K. Nowak¹, I. Krawczyk¹, R. Alexandre²

¹Institute of Polish Language (Polish Academy of Sciences), Poland; ²Institut de recherche et d'histoire des textes, France

The Vocabularium Bruxellense is a little-known example of medieval Latin lexicography (Weijers 1989). It has survived in a single manuscript dated to the 12th century and currently held at the Royal Library of Belgium in Brussels. In this paper we present the digital edition of the dictionary and the results of a quantitative study of its structure and content based on the TEI-conformant XML annotation.

First, we briefly discuss a number of annotation-related issues. For the most part, they result from the discrepancy between medieval lexicographic practices and the modern ones accounted for in Chapter 9 of the TEI Guidelines (TEI Consortium). For example, a single paragraph of a manuscript may contain multiple dictionary entries which are etymologically or semantically related to the headword.

Medieval glossaries are also less consistent in their use of descriptive devices. For instance, dictionary definitions within the same work may vary greatly in form and content. As such, they require fine-grained annotation if the semantics of the TEI elements is not to be strained.

Second, we present the TEI Publisher-based digital edition of the Vocabularium (Reijnders et al. 2022). At the moment, it provides basic browsing and search functionalities, making the dictionary available to the general public for the first time since the Middle Ages.

Third, we demonstrate how the TEI-conformant annotation may enable a thorough quantitative analysis of the text which sheds light on its place in a long tradition of medieval lexicography. We focus on two major aspects, namely the structure and the sources of the dictionary. As for the first, we present summary statistics of the almost 8,000 entries of the Vocabularium, expressed as a number of entries per letter and per physical page. We show that half of the entries are relatively short: a number of them contain only a one-word gloss, and only 25% of entries contain 15 or more tokens.
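Entry-length statistics of the kind described above could be derived from TEI XML along the following lines. This is only a minimal sketch, not the authors' pipeline: it assumes entries are encoded as `<entry>` elements in the TEI namespace, it counts whitespace-separated tokens, and the three-entry sample as well as the helper names `entry_lengths` and `share_at_least` are invented for illustration.

```python
import statistics
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented three-entry sample; the markup of the actual edition may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<entry><form><orth>abbas</orth></form><sense><def>pater</def></sense></entry>
<entry><form><orth>abdere</orth></form><sense><def>occultare uel abscondere</def></sense></entry>
<entry><form><orth>abies</orth></form><sense><def>arbor alta et recta</def></sense></entry>
</body></text></TEI>"""


def entry_lengths(tei_xml):
    """Return the number of whitespace-separated tokens in each <entry>."""
    root = ET.fromstring(tei_xml)
    lengths = []
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        # Join all text nodes of the entry, then split on whitespace.
        text = " ".join(entry.itertext())
        lengths.append(len(text.split()))
    return lengths


def share_at_least(lengths, threshold):
    """Return the fraction of entries with at least `threshold` tokens."""
    return sum(1 for n in lengths if n >= threshold) / len(lengths)


lengths = entry_lengths(SAMPLE)
median_length = statistics.median(lengths)
long_share = share_at_least(lengths, 5)
print(lengths, median_length, long_share)
```

On a full edition, the same two helpers would yield the median entry length and the share of entries at or above any token threshold, such as the 15-token mark mentioned above.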

Based on the TEI XML annotation of nearly 1,200 quotes, we were able to make a number of points concerning the function of quotations in medieval lexicographic works, which is hardly limited to attesting specific language use. We observe that quotations are not equally distributed across the dictionary: they can be found in slightly more than 10% of the entries, whereas nearly 7,000 entries have no quotations at all. The quotes are usually relatively short, with only 5% containing 10 or more words. Our analysis shows that the most quoted author is, by a wide margin, Virgil, followed by Horace, Lucan, Juvenal, Ovid, Plautus, and Terence (19). Church Fathers and medieval authors are seldom quoted, and we have so far discovered only 86 explicit Bible quotations.

In conclusion, we argue that systematic quantitative analyses of the existing editions of medieval glossaries might provide useful insight into the development of this important part of medieval written production.
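The distribution of quotations across entries and authors could be tallied in a similar way. The sketch below is an illustration under stated assumptions rather than the method used in the study: it assumes each quotation is encoded as `<cit>` with a `<quote>` and a `<bibl>/<author>` child, which is one common TEI dictionary pattern, and both the sample data and the function name `quote_stats` are invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: one entry without quotations, two entries with them.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<entry><form><orth>arma</orth></form>
  <cit type="example"><quote>arma uirumque cano</quote>
    <bibl><author>Vergilius</author></bibl></cit>
</entry>
<entry><form><orth>abbas</orth></form><sense><def>pater</def></sense></entry>
<entry><form><orth>carpere</orth></form>
  <cit type="example"><quote>carpe diem</quote>
    <bibl><author>Horatius</author></bibl></cit>
  <cit type="example"><quote>carpent tua poma nepotes</quote>
    <bibl><author>Vergilius</author></bibl></cit>
</entry>
</body></text></TEI>"""


def quote_stats(tei_xml):
    """Return (entry count, entries with quotes, author counts, quote lengths)."""
    root = ET.fromstring(tei_xml)
    n_entries = 0
    n_with_quotes = 0
    authors = Counter()
    quote_lengths = []
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        n_entries += 1
        cits = entry.findall(".//tei:cit", TEI_NS)
        if cits:
            n_with_quotes += 1
        for cit in cits:
            # Fall back to "unknown" when no author is recorded.
            author = cit.findtext("tei:bibl/tei:author", default="unknown",
                                  namespaces=TEI_NS)
            authors[author] += 1
            quote = cit.find("tei:quote", TEI_NS)
            if quote is not None:
                quote_lengths.append(len(" ".join(quote.itertext()).split()))
    return n_entries, n_with_quotes, authors, quote_lengths


n_entries, n_with_quotes, authors, quote_lengths = quote_stats(SAMPLE)
print(n_entries, n_with_quotes, authors.most_common(), quote_lengths)
```

Dividing the number of entries with quotations by the entry count gives the share of entries that quote any source, and `authors.most_common()` ranks the quoted authors.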



ID: 162 / Session 3A: 2
Long Paper
Keywords: standardization, morphology, morphosyntax, ISO, MAF, stand-off annotation

ISO MAF reloaded: new TEI serialization for an old ISO standard

P. Banski¹, L. Romary², A. Witt¹

¹IDS Mannheim, Germany; ²INRIA, France

The ISO Technical Committee TC 37, Language and terminology, Subcommittee SC 4, Language resource management (https://www.iso.org/committee/297592.html, ISO TC37 SC4 henceforth) has been, for nearly 20 years now, the locus of much work focusing on standardization of annotated language resources. Through the subcommittee’s liaison with the TEI-C, many of the standards developed there use customizations of the TEI Guidelines for the purpose of serializing their data models. Such is the case of the feature structure standards (ISO 24610-1:2006, ISO 24610-2:2011), which together form chapter 18 of the Guidelines, as well as the standard on the transcription of the spoken language (ISO 24624:2016, reflected in ch. 8) or the Lexical Markup Framework (LMF) series, where ISO 24613-4:2021 mirrors ch. 9 of the Guidelines.

The Morphosyntactic Annotation Framework (ISO 24611:2012) was initially published with its own serialization format, interwoven with suggestions on how its fragments can be rendered in the TEI. In a recent cyclic revision process, a decision was made to divide the standard into two parts, and to replace the legacy serialization format with a customization of the TEI that makes use of the recent developments in the Guidelines – crucially, the work on the standOff element and the work on the att.linguistic attribute class. The proposed contribution reviews fragments of the revised standard and presents the TEI devices used to encode it. At the time of the conference, ISO/CD 24611-1 “Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model” will have just gone through the Committee Draft ballot by the national committees mirroring ISO TC37 SC4.

In what follows, we briefly outline the basic properties of the MAF data model and review selected examples of its serialization in the TEI.



ID: 108 / Session 3A: 3
Long Paper
Keywords: lexicography, dictionaries, semantic web

TEI Modelling of the Lexicographic Data in the DARIAH-PL Project

K. Nowak, D. Mika, W. Łukasik

Institute of Polish Language (Polish Academy of Sciences), Poland

The main goal of the “DARIAH-PL Digital Research Infrastructure for the Arts and Humanities” project is to build the Dariah.lab infrastructure, which will allow sharing of, and integrated access to, digital resources and data from various fields of the humanities and arts. Among the numerous tasks that the Institute of Polish Language, Polish Academy of Sciences coordinates, we are working towards the integration of our lexicographic data with the LLOD resources (Chiarcos et al. 2012). The essential step of this task is to convert the raw text into TEI-compliant XML (TEI Consortium).

In this paper we would like to outline the main issues involved in TEI XML modelling of these heterogeneous lexicographic data.

In the first part, we will give a brief overview of the formal and content features of the dictionaries. For the most part, they are paper-born works developed with the research community in mind and as such are rich in information and complex in structure. They cover the diachronic development of Polish (from medieval Polish and Latin to present-day Polish) and its functional variation (general language vs. dialects, proper names).

On a practical level, this meant that, first, substantial effort had to be put into optimizing the quality of the OCR output. Since, except for grobid-dictionaries (Khemakhem et al. 2018), there are no tools at the moment that would enable easy conversion of lexicographic data, the subsequent phase of structuring the dictionary text had to be carried out on a per-resource basis.

The TEI XML annotation has three main goals. First, it is a means of preserving the textuality of paper-born dictionaries, which make heavy use of formatting to convey information and employ a complex system of text-based internal cross-references. Second, TEI modelling aims at a better understanding of each resource and its explicit description. The analysis is performed by lexicographers who may, however, come from a lexicographic tradition different from the one embodied in a particular dictionary, and thus need to make their interpretation of the dictionary text explicit. In this way we may also detect and correct editorial inconsistencies, which are natural for collective works developed over many years. Third, the annotated text is meant to be the input of the alignment and linking tasks; it is therefore crucial that functionally equivalent structures be annotated in a systematic and coherent way. As we plan to provide integrated access to the dictionaries, the TEI XML representation is also where the first phase of data reconciliation takes place. It concerns not only the structural units of a typical dictionary entry, such as <sense/> or <form/>, but also the mapping between units of the analytical language the dictionaries employ, such as labels, the bibliographic reference system, etc.
