Conference Agenda

Session Overview
Session
Closing Keynote: Emmanuel Ngue Um, 'Tone as “Noiseless Data”: Insight from Niger-Congo Tone Languages'
Time: Friday, 16/Sept/2022, 2:30pm - 4:00pm

Session Chair: Martina Scholger, University of Graz
Location: ARMB: 2.98

Armstrong Building: Lecture Room 2.98. Capacity: 168


With closing remarks by Dr James Cummings, Local TEI2022 Conference Organiser

Presentations
ID: 166 / Closing Keynote: 1
Invited Keynote

Tone as “Noiseless Data”: Insight from Niger-Congo Tone Languages

E. Ngue Um

University of Yaoundé 1 & University of Bertoua, Cameroon

Text processing assumes two layers of textual data: a “noisy” layer and a “noiseless” layer. The “noisy” layer is generally considered unsuitable for analysis and is eliminated at the pre-processing stage. In current Natural Language Processing (NLP) technologies such as text generation in machine translation, the representation of tones as diacritical symbols in the orthography of Niger-Congo languages leads to these symbols being pre-processed as “noisy” data. As an illustration, none of the 15 Niger-Congo tone language modules available on Google Translate systematically and consistently delivers text data containing the linguistic information encoded through tone melody.
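
The loss described above can be illustrated with a minimal sketch. It assumes a common normalisation step (Unicode decomposition followed by removal of combining marks); the abstract does not say which pre-processing pipeline any particular system uses, and the two word forms are hypothetical examples rather than items from the author's corpus.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Decompose text to NFD and drop nonspacing combining marks (category Mn).

    This mimics a diacritic-stripping normalisation step; it is an assumed
    stand-in for "noise" removal, not a documented pipeline of any system.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical forms that differ only in tone (acute = high, grave = low).
high_tone_form = "b\u00e1"  # "bá"
low_tone_form = "b\u00e0"   # "bà"

# After stripping, the tonal contrast is gone and both forms collapse to "ba".
print(strip_combining_marks(high_tone_form))
print(strip_combining_marks(high_tone_form) == strip_combining_marks(low_tone_form))
```

Once the combining marks are discarded, the two forms become indistinguishable, which is the sense in which tone is treated as “noisy” data.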

The Text Encoding Initiative (TEI) is a framework which can be used to circumvent the “noisiness” brought about by diacritical tone symbols in the processing of text data of Niger-Congo languages.

In novel work, I propose a markup scheme for tone that encompasses:

a) The markup of tone units within an <m> (morpheme) element; this aims to capture the functional properties of tone units, just like segmental morphemes.

b) The markup of tonal characters (diacritical symbols) within a <g> (glyph) element and the representation of the pitch by hexadecimal data representing the Unicode character code for that pitch; this aims to capture tone marks as autonomous symbols, in contrast with their combining layout when represented as diacritics.

c) The markup of downstep and upstep within an <accid> (accidental) element mirroring musical accidentals such as “sharp” and “flat”; this aims to capture strictly melodic properties of tone on a separate annotation tier.
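
The three components above might be combined along the following lines. This is only a sketch under stated assumptions: the abstract names the elements <m>, <g> and <accid> but gives no attribute names or nesting, so the `type`, `subtype` and `ref` values, the wrapping <w> element, the placement of <accid>, and the tone labels are all hypothetical. Note also that <accid> is the element proposed in the abstract and is not assumed here to be part of the current TEI schema.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Assumed mapping from combining tone marks to labels (illustrative only).
TONE_LABELS = {"\u0301": "high", "\u0300": "low"}


def encode_syllable(syllable: str, downstep: bool = False) -> ET.Element:
    """Build a hypothetical TEI-like fragment separating segments from tone."""
    decomposed = unicodedata.normalize("NFD", syllable)
    segments = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    marks = [c for c in decomposed if unicodedata.category(c) == "Mn"]

    word = ET.Element("w")
    if downstep:
        # (c) Melodic property on its own element; attribute name is assumed.
        ET.SubElement(word, "accid", {"type": "downstep"})
    # Segmental material as an ordinary morpheme.
    segmental = ET.SubElement(word, "m", {"type": "segmental"})
    segmental.text = segments
    for mark in marks:
        # (a) Each tone unit is marked up as a morpheme in its own right.
        tone = ET.SubElement(
            word, "m", {"type": "tone", "subtype": TONE_LABELS.get(mark, "unknown")}
        )
        # (b) The tone mark as an autonomous glyph, identified by its
        # hexadecimal Unicode code point instead of a combining diacritic.
        ET.SubElement(tone, "g", {"ref": f"#U{ord(mark):04X}"})
    return word


fragment = encode_syllable("b\u00e1", downstep=True)  # hypothetical form "bá"
print(ET.tostring(fragment, encoding="unicode"))
```

Because the tone information sits in its own elements, a downstream tool can count, filter, or generate tone units without relying on combining characters surviving normalisation.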

The objectives of tone encoding within the TEI framework are threefold:

a) To harness quantitative research on tone in Niger-Congo languages.

b) To leverage “clean” language data of Niger-Congo languages that can be used more efficiently in machine learning tasks for tone generation in textual data.

c) To gain better insights into the orthography of tone in Niger-Congo languages.

In this paper, I will show how this novel perspective on the annotation of tone can be applied productively, using a corpus of language data from 120 Niger-Congo languages.
