Olsen: Words, Objects and Attributes

Introduction

Wrong Place? Wrong Time? Or just plain wrong?

The expressive power of the Text Encoding Initiative Guidelines poses significant implementation challenges for large database systems.

Groping my way to a possible general solution that can be implemented on large databases.

Text object model using related SQL tables to manage object hierarchies in conjunction with a full text indexing system that is also aware of object hierarchies for searching and retrieval.

This talk:

implementation overview: an approach rather than a product,
outline some implications arising from SQL-FT interactions,
indicate what might be required for a complete TEI-XML system

Caveat emptor: it all works, in one version or another, but not yet as a coherent system.

Problematic

A paradoxical problem: the expressive power of the TEI.

Ability to define almost any type of textual object in almost any hierarchical configuration with almost any set of attributes.

Expression of TEI in XML: supports wide array of XML tools for document management, navigation, display.

It is not clear that these tools will scale to handle large databases. Existing systems are expensive and not aimed at the academic environment.

Object model (in PhiloLogic): handles most problems so far encountered, and may be logically extended as a model to handle more.

SQL and Text Objects

A KISS rule: All data not actually textual may be extracted and managed in hierarchy related SQL tables, corresponding to embedded textual objects.

Textual objects may range from the individual word to databases containing many collections.

Object management is handled by one or more dedicated subsystems configured to use the full power of SQL, including arithmetic, logical, and pattern matching operations.

Marriage of full text and SQL has the advantage of using two distinct components that are robust, fast, and scalable.

PhiloLogic word --> object indexing scheme has been used extensively in many document types and language groups.

Text = Objects and Attributes

A textual object, in my understanding of SGML/XML, is an arbitrary block of data that may contain additional objects and words, or that may be a constituent part of another object.

Typically considered as a tree, though one may express more than one hierarchy of objects, e.g. "book-chapter-verse" and "pages".

So far, words are the lowest (atomic) level of object.

A word index is then a logical address of nested objects to the word object, e.g.:

Volume-->
     Book-->
        Chapter-->
           SubChapter-->
              paragraph-->
                  sentence-->
                       phrase-->
                                word

For indexing purposes, words are simply given logical addresses without reference to object type or any other data.

Implied object divisions: calculate phrases, sentences, etc. from context if explicit tagging is unavailable.
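
A minimal sketch of implied divisions, assuming sentence breaks can be inferred from terminal punctuation when no explicit tagging exists. The function name `implied_addresses` and the tokenizing rule are illustrative assumptions, not PhiloLogic's loader; the point is only that each word receives a purely numeric address with no reference to object type.

```python
import re
from typing import List, Tuple

Address = Tuple[int, ...]


def implied_addresses(paragraph: str, parent: Address) -> List[Tuple[str, Address]]:
    """Give each word the address parent + (sentence, word position).

    Sentence boundaries are implied by '.', '!' or '?' (an assumed rule).
    """
    results: List[Tuple[str, Address]] = []
    sentence = 0
    position = 0
    for token in re.findall(r"\w+|[.!?]", paragraph):
        if token in ".!?":
            # Start a new sentence only if the current one holds words.
            if position > 0:
                sentence += 1
                position = 0
            continue
        results.append((token, parent + (sentence, position)))
        position += 1
    return results


entries = implied_addresses("Desire, delight. Pleasure!", (0, 212, 0))
for word, address in entries:
    print(word, ":".join(str(n) for n in address))
```

The addresses carry nesting depth and order only; what a given level means (sentence, phrase, paragraph) is left to whatever describes the objects elsewhere.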

Text Objects in Philo

Text objects are assigned unique identifiers reflecting the object hierarchy, from document/file (not always the same) to word.

These can be used to

select objects for searching
proximity search bounding
retrieval of objects

The paragraph in Diderot's Encyclopédie, for example, containing the phrase étincelles lumineuses has the logical address: 35:78:0:51 being the 51st child object of the 78th division object, of the 35th file/document.

Descriptive data associated with this object is stored in an SQL table

35:78 is a main article, Electricité, by d'Aumont, in the class of knowledge Physique. It is associated with the page object 35:43, the 43rd page object of the 35th file (not the page number, 5:469, which is stored in a related table).

Object attributes are stored in SQL tables and can have different values at different levels.

Headword     Type  Author      Class of Knowledge                             P.S.  Vol:Obj  Obj ID
LUNE         artm  d'Alembert  Astr.                                          s.f.  9:3472   70:12:0
Lune         arts  XXX         Chimie.                                        NA    9:3473   70:12:1
Lune         arts  Venel       Chimie.                                        NA    9:3474   70:12:2
Lune         arts  XXX         Hist. nat. Chimie, Metallurgie a Mineralogie.  NA    9:3475   70:12:3
Lune cornée  arts  XXX         Chimie Metall.                                 NA    9:3476   70:12:4
Lune         arts  Jaucourt    Mythologie.                                    NA    9:3477   70:12:5
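
A sketch of how the rows above might sit in an SQL table keyed by object address, using Python's built-in `sqlite3` module. The table name `articles`, the column names, and the helper `select_objects` are assumptions for illustration; the actual PhiloLogic schema is not shown in these slides.

```python
import sqlite3

ROWS = [
    ("LUNE", "artm", "d'Alembert", "Astr.", "s.f.", "9:3472", "70:12:0"),
    ("Lune", "arts", "XXX", "Chimie.", "NA", "9:3473", "70:12:1"),
    ("Lune", "arts", "Venel", "Chimie.", "NA", "9:3474", "70:12:2"),
    ("Lune", "arts", "XXX", "Hist. nat. Chimie, Metallurgie a Mineralogie.",
     "NA", "9:3475", "70:12:3"),
    ("Lune cornée", "arts", "XXX", "Chimie Metall.", "NA", "9:3476", "70:12:4"),
    ("Lune", "arts", "Jaucourt", "Mythologie.", "NA", "9:3477", "70:12:5"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE articles (
           headword TEXT, entry_type TEXT, author TEXT,
           knowledge_class TEXT, part_of_speech TEXT,
           vol_obj TEXT, obj_id TEXT PRIMARY KEY)"""
)
conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)


def select_objects(author: str, knowledge_class: str) -> list:
    """Return object addresses whose author and class contain the given text."""
    cursor = conn.execute(
        "SELECT obj_id FROM articles "
        "WHERE author LIKE ? AND knowledge_class LIKE ? ORDER BY obj_id",
        (f"%{author}%", f"%{knowledge_class}%"),
    )
    return [row[0] for row in cursor]


print(select_objects("venel", "chim"))
```

SQLite's `LIKE` is case-insensitive for ASCII by default, which is why the lowercase criteria match "Venel" and "Chimie." here.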

Word Objects in PhiloLogic

Distinction: word index entry and word object database.

Internal raw index (from Platts' A Dictionary of Urdu, Classical Hindi, and English):

word Desire 0 212 0 1 1 0 62070 6
word delight 0 212 0 1 1 1 62078 6
word pleasure 0 212 0 1 1 2 62087 6
word taste 0 212 0 1 1 3 62097 6
word relish 0 212 0 1 1 4 62104 6

showing the object hierarchy to the word, with byte offset and page tag. This corresponds to the dictionary entry (shown in non-Unicode display option):

abhiruc (p. 6)
S abhiruc abhiruc, s.f. Desire, delight, pleasure, taste, relish. --- abhiruc rakhna (-men). To have or take delight (in).
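
A sketch of reading one raw index line, assuming the layout suggested above: the literal tag `word`, the index form, the integers of the logical address, then the byte offset and the page tag as the last two fields. The `IndexEntry` class and `parse_index_line` function are hypothetical names, not part of PhiloLogic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IndexEntry:
    form: str                 # index form of the word
    address: Tuple[int, ...]  # logical address in the object hierarchy
    byte_offset: int          # position of the word in the source file
    page: str                 # page tag, kept as text to preserve forms like "0012"


def parse_index_line(line: str) -> IndexEntry:
    fields = line.split()
    if len(fields) < 5 or fields[0] != "word":
        raise ValueError(f"not a word index line: {line!r}")
    form = fields[1]
    # Assumed layout: everything between the form and the last two fields
    # is the logical address.
    *address, byte_offset, page = fields[2:]
    return IndexEntry(form, tuple(int(n) for n in address), int(byte_offset), page)


entry = parse_index_line("word Desire 0 212 0 1 1 0 62070 6")
print(entry)
```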

The index form of each word serves as the relational operator between the object hierarchy and word object management subsystem.

word upanām 0 595 0 1 0 3 143854 0012
word upanām-vāćī-śabd 0 595 0 1 1 5 143924 0012
word a&ttod;hmanā 0 908 0 1 1 4 226849 0018
word a&ttod;kal-paććū 0 974 0 1 2 3 242980 0019
word &ttodtod;ara&htod;-se 0 1293 0 1 11 17 344921 0027

Entry forms may include any set of characters, including arbitrary SGML entities, ISCII, and Unicode representations of characters, as required.

Word object management subsystem(s) serve as a way to allow the user to

optionally search on simplified representations of words, and
optionally display simplified or Unicode representation of results

Word pattern expansion: we use a two-field database containing the index form and a simplified form of each word in the database.

kalaka     kā&ltod;aka
kalyanani  kalyā&ntod;āni
susrusin   śuśrū&stod;in

Or other representations for search and display
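
A sketch of the two-field lookup, assuming the user's pattern is a regular expression applied to the whole simplified field and the result is the list of matching index forms. The name `expand` and the in-memory list are illustrative only; two of the pairs shown above are reused as data.

```python
import re
from typing import List, Tuple

# (simplified form, index form) pairs taken from the examples above.
WORD_FORMS: List[Tuple[str, str]] = [
    ("kalyanani", "kalyā&ntod;āni"),
    ("susrusin", "śuśrū&stod;in"),
]


def expand(pattern: str) -> List[str]:
    """Return the index forms whose simplified form matches the pattern."""
    regex = re.compile(pattern)
    return [index_form for simplified, index_form in WORD_FORMS
            if regex.fullmatch(simplified)]


print(expand(".*srus.*"))
```

The index forms returned here are what would then be looked up in the word occurrence index; the user never has to type the entities or diacritics.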

This can be logically extended by adding fields containing, for example, word lemmas and other appropriate lexical information to be used for searching, possibly including language attributes or values for objects in which words are to be distinguished but that do not have attributes (notes?).

Object Hierarchies

Object hierarchies are constructed by mapping the logical address of words found in occurrence indexes, e.g.

upanām 0 595 0 1 0 3 (or 0:595:0:1:0:3)
to one or more SQL tables that have as many fields as required (including joins from parent objects) plus the address of the object, which is always a right truncation of any particular word address.
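
A small sketch of the right-truncation rule, assuming addresses are held as tuples of integers. Dropping components from the right of a word address yields the address of each object that contains the word. The function names are illustrative, not PhiloLogic's.

```python
from typing import List, Tuple

Address = Tuple[int, ...]


def parse_address(text: str) -> Address:
    """Accept either '0:595:0:1:0:3' or '0 595 0 1 0 3'."""
    return tuple(int(part) for part in text.replace(":", " ").split())


def containing_objects(word_address: Address) -> List[Address]:
    """Every right truncation of a word address, outermost object first."""
    return [word_address[:depth] for depth in range(1, len(word_address))]


def contains(object_address: Address, word_address: Address) -> bool:
    """True if the object address is a proper right truncation of the word address."""
    depth = len(object_address)
    return depth < len(word_address) and word_address[:depth] == object_address


word = parse_address("0 595 0 1 0 3")
for obj in containing_objects(word):
    print(":".join(str(n) for n in obj))
```

Each truncated address can serve as the key into whichever SQL table describes objects at that depth.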

Object hierarchies are completely independent of types or any other information.

35:78:0:51 is a paragraph in an encyclopedia while 140:6:1:3 is a chapter (Chapitre III. La famille) of Sagnac's Législation civile....

The nested hierarchy may be displayed as a table of contents, such as document/file 140

Object types are managed in the SQL handlers rather than in the word indices. This means that the object types, nesting rules, and the SGML/XML encoding used to represent this information are independent of the identified objects.

No clear technical distinction between

files and documents
metadata and "description".

as all are considered simply objects that may be nested within others.

Tying it all together: relational objects

Word occurrence indexes have two relational elements

index entry of the word
logical address of the word, including all parent objects

Add a housekeeping relationship: an object to byte offset map.

Everything else logically follows from these three things.

The four step program:

1. Object Selection

colet% ./gimme "author=venel" "class=chim" | head
15:184:1
17:80:3
17:101:1
18:114:0
18:114:2
20:52:3
20:57:2
20:93:2
20:140:1
20:317:0
.....

2. Word Expansion Vector (pattern matching and/or morphological and/or additional attributes)

dsal% echo ".*cuhri.*" | ./crapser
ćūh&rtod;ā-ćūh&rtod;ī
ćūh&rtod;ī
ćūhrī

where the search string ".*cuhri.*" was applied to a simplified search field and the result is the index form of the word.

3. Word Index Searching
Combine two vectors: objects and index entry forms. For each word occurring in the database we store the occurrence indices defining its logical location in the object hierarchy, such as volume, article, subarticle, paragraph, sentence, as well as the word itself. This makes it possible to perform complex context searches, which can be delimited by any identified structure, such as sentence, paragraph, subarticle and article.

4. Context Extraction and Rendering
This is the only time we actually touch text. Contextualization maps from objects to file byte offsets. Rendering is performed by a database-specific set of rules, such as conversion of SGML/XML tagging into printable HTML, rendering object navigation and internal cross-reference links, getting inline images, Unicode display and so on.
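
The four steps can be sketched end to end on toy data. Everything here is an assumption for illustration: the in-memory dictionaries stand in for the SQL tables, the two-field word database, and the occurrence index, the attribute values are invented, and the function names are hypothetical. This is not the implementation behind `gimme` or `crapser`.

```python
import re
from typing import Dict, List, Tuple

Address = Tuple[int, ...]
Hit = Tuple[str, Address, int]  # index form, word address, byte offset

# Hypothetical toy data: two one-object "files" and their indexes.
FILES: Dict[int, bytes] = {
    0: "étincelles lumineuses".encode("utf-8"),
    1: "des étincelles".encode("utf-8"),
}
OBJECTS: Dict[Address, Dict[str, str]] = {
    (0, 0): {"author": "venel", "class": "chimie"},
    (1, 0): {"author": "jaucourt", "class": "mythologie"},
}
WORD_FORMS: List[Tuple[str, str]] = [  # (simplified form, index form)
    ("des", "des"),
    ("etincelles", "étincelles"),
    ("lumineuses", "lumineuses"),
]
WORD_INDEX: Dict[str, List[Tuple[Address, int]]] = {
    "des": [((1, 0, 0), 0)],
    "étincelles": [((0, 0, 0), 0), ((1, 0, 1), 4)],
    "lumineuses": [((0, 0, 1), 12)],
}


def select_objects(criteria: Dict[str, str]) -> List[Address]:
    """Step 1: objects whose attributes contain every criterion as a substring."""
    return [addr for addr, attrs in OBJECTS.items()
            if all(value in attrs.get(field, "") for field, value in criteria.items())]


def expand_words(pattern: str) -> List[str]:
    """Step 2: match the pattern on simplified forms, return index forms."""
    regex = re.compile(pattern)
    return [index for simple, index in WORD_FORMS if regex.fullmatch(simple)]


def search_index(objects: List[Address], forms: List[str]) -> List[Hit]:
    """Step 3: keep occurrences whose address begins with a selected object."""
    return [(form, addr, offset)
            for form in forms
            for addr, offset in WORD_INDEX.get(form, [])
            if any(addr[:len(obj)] == obj for obj in objects)]


def extract_context(hit: Hit, width: int = 40) -> str:
    """Step 4: the only step that reads text, located by byte offset."""
    _, addr, offset = hit
    data = FILES[addr[0]]
    window = data[max(0, offset - width):offset + width]
    return window.decode("utf-8", errors="ignore")


hits = search_index(select_objects({"author": "venel", "class": "chim"}),
                    expand_words(".*etinc.*"))
for hit in hits:
    print(hit, extract_context(hit))
```

Steps 1 to 3 work only on addresses and index forms; the source text is opened once, in step 4, at the byte offset recorded for the hit.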

Extending Relational Objects

Independence of the object hierarchy from data associated with elements in the hierarchy allows for:

direct access to that object
creation of dynamically joined tables describing objects at different levels
additional data associated with any particular object

SQL tables can be generated or extracted from SGML/XML as a set of wrappers allowing multiple looks into a database.

Supports handling objects of different granularity and definition.

Chadwyck-Healey's English Poetry:

database of volumes of poetry
database of individual poems (and other objects)

with metadata/description of each which can be searched separately or in conjunction.

Consider the SGML representation of Schiller's letters as found in his complete works.

The metadata describes the digital volumes. Not a completely useful description of letters.

<letter author="Schiller, Friedrich von" gender=m n=1 id=23L1 tocdate="21. April 1772" date=17720421 date2=17720421 address1="Stoll, Elisabetha Margaretha, geb. Sommer" somhead="Band 23: Brief an Elisabetha Margaretha Stoll, 21. April 1772, (Nr. 1)" somlevel=1 comhead="Bd. 23: Briefwechsel: Brief an Stoll, Elisabetha Margaretha, geb. Sommer, 21. April 1772 (Nr. 1)">

Multiple descriptions are automatically generated by relational joins.

Practical consideration: I like to precompute joins for speed.
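
A sketch of a precomputed join using Python's `sqlite3` module. The schema (`volumes`, `letters`, `letter_descriptions`) is an assumption; the single row reuses attribute values from the Schiller letter tag above. The join is materialized once at load time so that searches read one flat table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE volumes (vol_id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE letters (
        letter_id TEXT PRIMARY KEY, vol_id TEXT,
        author TEXT, addressee TEXT, date INTEGER);

    INSERT INTO volumes VALUES ('23', 'Bd. 23: Briefwechsel');
    INSERT INTO letters VALUES (
        '23L1', '23', 'Schiller, Friedrich von',
        'Stoll, Elisabetha Margaretha, geb. Sommer', 17720421);

    -- Precompute the volume/letter join once, at load time.
    CREATE TABLE letter_descriptions AS
        SELECT l.letter_id, v.title AS volume_title,
               l.author, l.addressee, l.date
        FROM letters AS l JOIN volumes AS v ON v.vol_id = l.vol_id;
    CREATE INDEX letter_descriptions_date ON letter_descriptions (date);
    """
)

row = conn.execute(
    "SELECT letter_id, volume_title FROM letter_descriptions "
    "WHERE date BETWEEN 17720101 AND 17721231"
).fetchone()
print(row)
```

The trade-off is the usual one for denormalized tables: faster reads at search time, at the cost of rebuilding the joined table whenever the underlying descriptions change.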

No theoretical limitation to number or size of tables that can be used to describe objects.

Extractors

This is an evolving model, with small development steps, which has allowed us to work toward a more general production system.

Currently, the database load process is a single, essentially hard-coded data extractor, which identifies metadata, words, and selected objects corresponding to the ARTFL Text Encoding specification.

This builds the indices, objects, and general metadata for a wide variety of databases.

Additional descriptive object data and extensions to metadata are compiled by one or more extractors, which build SQL tables corresponding to the object hierarchy.

In theory, there are a couple of advantages:

basic configuration of any range of databases in a particular encoding scheme can be completely automated (e.g. TEI Lite),
extensions can be optionally implemented after preliminary builds leveraging extended data that may be ignored in a preliminary stage,
builds on an existing, robust, production platform.

And at least two downsides:

historical legacy in the form of older code
unknown compatibility/redundancy with new technologies like XPath and DOM, which are clearly related.

SQL-SGML/XML linker

Object Model poses the operational question: what should be encoded as SGML/XML and when?

Objects in this model can have many attributes. Alexander Street Press databases feature extensive object attributes, some well over 100 fields per object.

<sp> <speaker n="CH00166">STAMPFIELD</speaker> Hello. Mose, what are you kicking about? </sp>

CH00166 is related to an entry in a character database that has many fields, some of which are individual title specific and others which apply to characters across the entire collection of 1,000 plays (gender, occupation, nationality, sexual orientation) relationally joined to data pertaining to parent objects (acts, scenes, plays, etc.).

The architecture permits searching for all of the character dialogues by "black, female, heterosexual, teachers, in dramas published between 1980 and 2000", such as "Karen Swanson" in Rita Dove's The Siberian Village and "Rosa Steele" in Afaa Weaver's Rosa.
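
A sketch of the link between markup and table, assuming the `n` attribute of `<speaker>` is the key into a character table. The table layout and the `occupation` value are placeholders, not data from the collection; only the fragment and the identifier CH00166 come from the slide. It uses `xml.etree.ElementTree` and `sqlite3` from the Python standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET

FRAGMENT = ('<sp> <speaker n="CH00166">STAMPFIELD</speaker> '
            'Hello. Mose, what are you kicking about? </sp>')

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE characters (char_id TEXT PRIMARY KEY, name TEXT, occupation TEXT)"
)
# Placeholder attribute value; the real table is said to hold many more fields.
conn.execute(
    "INSERT INTO characters VALUES ('CH00166', 'STAMPFIELD', 'placeholder occupation')"
)


def describe_speech(fragment: str) -> dict:
    """Join a speech in the text to its character record via the speaker id."""
    speech = ET.fromstring(fragment)
    speaker = speech.find("speaker")
    char_id = speaker.get("n")
    row = conn.execute(
        "SELECT name, occupation FROM characters WHERE char_id = ?", (char_id,)
    ).fetchone()
    if row is None:
        raise KeyError(char_id)
    return {
        "char_id": char_id,
        "name": row[0],
        "occupation": row[1],
        # The spoken words follow the <speaker> element as its tail text.
        "text": (speaker.tail or "").strip(),
    }


print(describe_speech(FRAGMENT))
```

Only the short identifier lives in the markup; every descriptive field can be added, corrected, or joined to parent objects in SQL without touching the encoded text.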

Operationally, it is far easier to maintain such associations directly rather than to encode this data in SGML/XML.

For export purposes, this could easily be done.

Assuming the relational integration of SGML/XML and SQL tables from the database design and data capture stages onward simplifies the management of complex encoding projects.

Multiple Object Hierarchies

Most documents can be represented with one main object hierarchy -- book, chapter, verse -- and pagination as separate flat parallel objects.

PhiloLogic: pagination is not a real object but a secondary set of blocks used for contextualization from searches and page-turning.

Experiment: Secondary objects can be extended by relating one or more SQL tables to blocks of text. Implemented as an SQL wrapper to provide an alternative means of searching descriptive material.

Example: Text glosses and extensive indexing implemented as a large (95 field) table pointing to arbitrary blocks in text:

<part id="S2578-D002">
[some text, divs, etc]
</part>

SQL table contains indexing for flora, fauna, encounters, tribes, places, persons, etc. as well as housekeeping items like sequencing and cross language linking.

Mapping to and from main object hierarchy.

Useful because much of the indexing is too specific to be applied to the typical divisions found in these documents.
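
One way such a mapping could work is sketched below, under the assumption that both the main objects and the secondary blocks are recorded as byte ranges in the same file. The slides do not say how the mapping is actually stored, so the ranges, addresses, and function names here are hypothetical; only the part identifier comes from the example above.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open byte range [start, end)

# Hypothetical byte ranges for main-hierarchy objects and one secondary block.
MAIN_OBJECTS: Dict[str, Span] = {
    "0:1": (0, 400),
    "0:2": (400, 900),
    "0:3": (900, 1500),
}
SECONDARY_BLOCKS: Dict[str, Span] = {
    "S2578-D002": (350, 900),
}


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


def main_objects_for(block_id: str) -> List[str]:
    """Main-hierarchy objects touched by a secondary block."""
    block = SECONDARY_BLOCKS[block_id]
    return [obj for obj, span in MAIN_OBJECTS.items() if overlaps(span, block)]


def blocks_for(object_address: str) -> List[str]:
    """Secondary blocks that touch a main-hierarchy object."""
    span = MAIN_OBJECTS[object_address]
    return [bid for bid, block in SECONDARY_BLOCKS.items() if overlaps(span, block)]


print(main_objects_for("S2578-D002"))
print(blocks_for("0:3"))
```

Because the block need not align with division boundaries, one indexed "part" can span several divisions, which is the case the slide describes.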

Multiple Views and Threads

The object model provides the capability to consider a large textual database from different views, e.g.

volumes of drama
individual plays
particular acts/scenes
character dialogues

with searching for joined attributes across the entire database without regard to object depth.

SQL-driven object management also supports multiple threading of objects across the database, e.g.

volumes of letters (in complete works)
individual correspondences (to/from) in chronological order across volumes.
subject or other criteria and chronology

Sequencing data can be added dynamically AND be used as search criteria.
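
A sketch of threading with `sqlite3`: letters scattered across volumes are ordered chronologically for one correspondent, the resulting sequence number is written back, and that number then serves as a search criterion. The rows, object addresses, and column names are hypothetical; dates use the YYYYMMDD integer form seen in the Schiller example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE letters ("
    "obj_id TEXT PRIMARY KEY, correspondent TEXT, date INTEGER, seq INTEGER)"
)
# Hypothetical letters, deliberately out of chronological order.
conn.executemany(
    "INSERT INTO letters (obj_id, correspondent, date) VALUES (?, ?, ?)",
    [
        ("24:7", "A", 17800302),
        ("23:1", "A", 17720421),
        ("23:9", "B", 17750610),
        ("25:4", "A", 17851115),
    ],
)


def thread(correspondent: str) -> list:
    """Number one correspondence chronologically, regardless of volume."""
    ordered = conn.execute(
        "SELECT obj_id FROM letters WHERE correspondent = ? ORDER BY date",
        (correspondent,),
    ).fetchall()
    for seq, (obj_id,) in enumerate(ordered, start=1):
        conn.execute("UPDATE letters SET seq = ? WHERE obj_id = ?", (seq, obj_id))
    return [obj_id for (obj_id,) in ordered]


print(thread("A"))
second = conn.execute(
    "SELECT obj_id FROM letters WHERE correspondent = 'A' AND seq = 2"
).fetchone()
print(second)
```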

Extensive Attributes: Example

Multiple tables describing contents. Early Encounters in North America has 7 tables, combined in various ways (Table). A couple of examples of SQL tables associated with SGML text

Letter from Annie Elizabeth Anderson to James House Anderson, October 13, 1861.
Letter from Robert E. Lee to Thomas Jonathan Jackson, 1861.
Act 1, Scene 1 of William Brown's The Escape.
Character Hannah in The Escape.
EENA "Part" Annotation and the Annotated "part" (display not finalized).

Extensive Attributes: Implications

Object environment supports many related tables which can cover a wide variety of different types of contextual data.

Black Drama - 1850 to the Present (BLDR), for example, is currently built around the full-text of 200 plays by 60 authors (with a final projected size of some 1,200 plays). In addition to the full-text, BLDR has 8 metadata tables containing information describing:

Authors
Plays
Scenes
Characters
Theaters
Productions
Production Companies
Resources (play bills, etc)

The tables are generated by multiple joins from internal databases to incorporate data that is deemed to be useful in describing a particular feature, including elements that might be considered to be traditional metadata as well as data that is extracted from text data or associated with textual data.

Four? General categories

Content Indexing: identification of subjects discussed in the document as well as normalization of a wide variety of terms, including places, flora, fauna, themes, etc.
Contextual Description: extensive descriptions of the context of composition, distribution, and consumption of textual information, such as the social context of composition or number of casualties in battles associated with particular letters or diary entries.
Personal Information: extensive description of data related to the authors or characters of a document.
Internal Structures: to support object navigation and inter-relations between objects.

Extensive description of textual data can be integrated with any particular object by a series of (typically) precomputed joins.

Object vs word attributes

Implementation design question: should all objects be managed in SQL tables?

Many objects do not have significant attributes: notes, stage directions, etc.

For text searching we may want to be able to distinguish items in this class of objects.

Under discussion at ARTFL: a heuristic that would add a field to the word occurrence index to identify the context of words that occur in "non-attribute value" objects.

These are attributes, of course, that can be associated with a word. A stage direction is a real thing, it simply does not have any additional values typically associated with it.

"Non-attribute value" objects would appear in the standard object hierarchy and could be distinguished for display.

Currently an open question.
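
Purely as an illustration of the idea under discussion, the extra field could look like the sketch below. The `context` field, its values ("stage", "note"), and the sample occurrences are all hypothetical; nothing here reflects a decision or an existing index format.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Tuple


class Occurrence(NamedTuple):
    form: str
    address: Tuple[int, ...]
    context: Optional[str] = None  # hypothetical flag; None means ordinary text


OCCURRENCES = [
    Occurrence("exit", (0, 1, 0, 4), "stage"),  # inside a stage direction
    Occurrence("exit", (0, 1, 1, 2)),           # ordinary dialogue
]


def search(form: str, exclude: FrozenSet[str] = frozenset()) -> List[Occurrence]:
    """Find a word, optionally skipping occurrences in flagged contexts."""
    return [occ for occ in OCCURRENCES
            if occ.form == form and occ.context not in exclude]


print(len(search("exit")))
print(search("exit", exclude=frozenset({"stage"})))
```

The flagged objects keep their place in the address hierarchy; the flag only lets a search include or ignore them without a lookup in any SQL table.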

TEIification?

Interesting fact: full text systems rarely actually process text.

On load: SGML/XML is parsed to extract raw data for word occurrence indexes and SQL table generation.

In real time: processing text is only for formatting, linking to external resources.

Both functions are in highly isolated subsystems.

Retrofitting PhiloLogic to handle TEI-XML would, from my examination, require:

Replacing a hardcoded C program that parses text with an XML Parser to output raw index data and adding to this a coherent SQL table generator.
Creation of XSLT style sheets for different reporting options (since you don't get a whole document) to handle displays, plus the required hooks in a database-specific output formatter (Perl).
Various housekeeping issues.
Your mileage may vary.
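
A rough sketch of the first requirement, using Python's standard `xml.etree.ElementTree` parser to emit word forms with logical addresses. The numbering rule (one running counter per element covering both child elements and words, in document order) is an assumption, and byte offsets are omitted because ElementTree does not expose them; a production loader would need a parser that reports positions.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple

Address = Tuple[int, ...]


def index_words(xml_text: str) -> List[Tuple[str, Address]]:
    """Walk an XML document and emit (word, logical address) pairs."""
    entries: List[Tuple[str, Address]] = []

    def visit(element: ET.Element, address: Address) -> None:
        position = 0  # next free slot among this element's children and words

        def add_words(text) -> None:
            nonlocal position
            for word in re.findall(r"\w+", text or ""):
                entries.append((word, address + (position,)))
                position += 1

        add_words(element.text)
        for child in element:
            visit(child, address + (position,))
            position += 1
            # Text after a child element belongs to the parent element.
            add_words(child.tail)

    visit(ET.fromstring(xml_text), (0,))
    return entries


sample = "<div><p>Desire delight</p><p>pleasure</p></div>"
for word, address in index_words(sample):
    print("word", word, " ".join(str(n) for n in address))
```

Element names never enter the addresses, so the same walk would serve any XML vocabulary; object types and attributes would go to the SQL table generator instead.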

Current production implementation has a fixed object depth. Development implementation has flexible object depth capabilities (and a NOT operator).

Principle: All internal functions of PhiloLogic are text encoding independent and may also be character encoding independent (ISO-Latin, Unicode, character entities).

Confusing Metadata and Annotation

Comment on original paper:

[...] there is a fundamental problem in that the paper is more about annotation than metadata. Much of the community is confused about what metadata really is, and this paper will only add to that confusion. The example given [...] is clearly an instance of an annotation of the text, not metadata. In TEI terms, metadata is the description of the resource as a whole that goes in the TEI Header. The example given is an instance of what the TEI calls "the analysis and interpretation of text".

The distinction on an encoding level is perfectly valid.

In a dynamic object implementation model, it appears that the distinction becomes hopelessly blurred.

Further: it appears to me there is a working assumption that SGML/XML will represent documents as files and that these will be the basic processing unit.

Object model: no necessary relationship between files, documents, and objects. All dynamically defined.

Insistence on the distinction, which "much of the community is confused about", may be restricting the way we think about how to implement systems to handle TEI encoded databases.

Truth in Advertising

Model for an implementation rather than a piece of software.

Could certainly leverage new generation XML tools for many aspects.

Currently using XSLT at ARTFL. Need to examine XPath, etc.

PhiloLogic: ongoing development. It all works but not in a single coherent implementation. Various components performing various functions implemented in a number of different projects.

Working now to tie it all together with commercial and possibly academic collaborators.

Jelloware? Much more than a theory or design, less than a fully articulated system.

Conclusion

The large-scale implementation problems posed by the nearly limitless expressive power of the TEI are subject to a number of different partial and full solutions.

The integration of SQL and full text in an object model provides one partial, relatively low cost, scalable, and robust approach to the problem.

I fully expect that other solutions, using the array of new XML-aware tools being created, will offer guidance for our own modernization of PhiloLogic.

I am hoping that the text object model outlined here will be of benefit to others working to leverage the current and next generation of richly encoded documents being created by so many academic and commercial organizations.
